Estimated time: 2 hours
Consider a textbook page with a diagram:
PDF Pages
↓
[1] Convert to Images → PNG files at 150 DPI
↓
[2] Send to Claude Vision → Describe diagrams, charts, formulas
↓
[3] Generate Embeddings → Convert descriptions to vectors
↓
[4] Store in document_images → Searchable image database
↓
[5] Unified Search → Query both document_chunks and document_images
↓
✅ Complete Visual + Text Search
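npm install pdf2pic @anthropic-ai/sdk
pdf2pic also needs the GraphicsMagick and Ghostscript binaries installed on the system:
macOS:
brew install graphicsmagick ghostscript
Ubuntu/Debian:
sudo apt-get install graphicsmagick ghostscript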
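pdf2pic shells out to GraphicsMagick (`gm`) and Ghostscript (`gs`). If either binary is missing, the problem usually shows up only as a failure partway through ingestion. The sketch below is a small preflight check you could run before a long job. It assumes the binaries are on PATH under those names. `hasBinary` and `checkImageDeps` are hypothetical helpers and are not part of pdf2pic.

```typescript
import { execFileSync } from 'node:child_process'

// Returns true if `command` runs and exits with status 0.
function hasBinary(command: string, args: string[] = ['--version']): boolean {
  try {
    execFileSync(command, args, { stdio: 'ignore' })
    return true
  } catch {
    // Thrown for a missing binary (ENOENT) or a non-zero exit code
    return false
  }
}

// Lists the system dependencies that could not be found (empty when all are present).
function checkImageDeps(): string[] {
  const missing: string[] = []
  if (!hasBinary('gm', ['version'])) missing.push('graphicsmagick (gm)')
  if (!hasBinary('gs', ['--version'])) missing.push('ghostscript (gs)')
  return missing
}
```

If `checkImageDeps()` returns a non-empty list, install the missing packages with the brew or apt-get commands for your platform before running ingestion.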
Update scripts/ingest.ts:
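// Add the fs and path imports if ingest.ts does not already have them.
// supabase and generateEmbedding are assumed to be defined elsewhere in ingest.ts.
import * as fs from 'fs'
import * as path from 'path'
import { fromPath } from 'pdf2pic'
import Anthropic from '@anthropic-ai/sdk'
const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
})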
async function extractAndDescribeImages(
  filePath: string,
  numPages: number,
  documentId: string
): Promise<void> {
  console.log('🖼️ Extracting images from PDF...')
  // Render pages into a scratch directory, creating it if needed
  const tempDir = path.join(process.cwd(), '.temp-images')
  fs.mkdirSync(tempDir, { recursive: true })
  const converter = fromPath(filePath, {
    density: 150, // render at 150 DPI
    format: 'png',
    width: 1024,
    height: 1024,
    savePath: tempDir,
    saveFilename: 'page',
  })
  // Process at most the first 50 pages to control costs
  const maxPages = Math.min(numPages, 50)
  for (let pageNum = 1; pageNum <= maxPages; pageNum++) {
    const result = await converter(pageNum)
    if (!result.path) {
      console.warn(`⚠️ Could not render page ${pageNum}`)
      continue
    }
    const base64Image = fs.readFileSync(result.path).toString('base64')
    fs.unlinkSync(result.path) // delete the temporary PNG
    // Describe the page with Claude Vision
    const description = await describeImageWithClaude(base64Image, pageNum)
    if (description.toLowerCase().includes('no diagrams')) {
      continue // Skip text-only pages
    }
    // Embed the description and store it with its page number
    const embedding = await generateEmbedding(description)
    const { error } = await supabase.from('document_images').insert({
      document_id: documentId,
      page_number: pageNum,
      image_description: description,
      embedding: `[${embedding.join(',')}]`,
    })
    if (error) throw error // supabase-js returns errors; it does not throw them
  }
}
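The insert writes the embedding as the string `[${embedding.join(',')}]`, which is pgvector's text input format. If the embedding has the wrong number of dimensions, the insert fails with a database error that is hard to trace back to its cause. The sketch below validates the vector first. It assumes 1536 dimensions, matching the `vector(1536)` parameter of `match_document_content`. `toVectorLiteral` is a hypothetical helper name.

```typescript
// Assumed embedding size; it must match the vector(N) column definition
const EMBEDDING_DIMENSIONS = 1536

// Serializes an embedding to pgvector's text format, e.g. "[0.1,0.2,0.3]",
// after checking the length and rejecting NaN/Infinity values.
function toVectorLiteral(embedding: number[], dims = EMBEDDING_DIMENSIONS): string {
  if (embedding.length !== dims) {
    throw new Error(`Expected ${dims} dimensions, got ${embedding.length}`)
  }
  if (!embedding.every(Number.isFinite)) {
    throw new Error('Embedding contains NaN or Infinity')
  }
  return `[${embedding.join(',')}]`
}
```

Inside the page loop, the insert would then use `embedding: toVectorLiteral(embedding)`.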
async function describeImageWithClaude(
  base64Image: string,
  pageNumber: number
): Promise<string> {
  const response = await anthropic.messages.create({
    // Any vision-capable Claude model works; swap in one your account can access
    model: 'claude-3-5-sonnet-20240620',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: [{
        type: 'image',
        source: {
          type: 'base64',
          media_type: 'image/png',
          data: base64Image,
        },
      }, {
        type: 'text',
        text: `This is page ${pageNumber} from a textbook.
Describe any diagrams, charts, tables, or formulas.
If no visual elements, respond: "No diagrams on this page."
For visuals, describe:
1. Type (diagram, chart, table, formula)
2. What concept it illustrates
3. Key components and relationships
4. Labels and annotations`,
      }],
    }],
  })
  // content is a union of block types; narrow to a text block before reading .text
  const block = response.content[0]
  return block?.type === 'text' ? block.text : ''
}
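`describeImageWithClaude` reads only the first content block of the response. A slightly more defensive option is to join every text block and ignore any other block types. The sketch below uses a minimal structural type, which is an assumption that keeps it independent of the SDK's own type names. `response.content` should be assignable to it. `extractText` is a hypothetical helper.

```typescript
// Minimal shape of a Messages API content block: only `type` and `text` are used here
type ContentBlockLike = { type: string; text?: string }

// Concatenates all text blocks; returns '' when there are none.
function extractText(content: ContentBlockLike[]): string {
  return content
    .filter((block) => block.type === 'text' && typeof block.text === 'string')
    .map((block) => block.text as string)
    .join('\n')
    .trim()
}
```

With this helper, the function could end with `return extractText(response.content)`.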
Update the Supabase function:
CREATE OR REPLACE FUNCTION match_document_content(
  query_embedding vector(1536),
  match_threshold float DEFAULT 0.5,
  match_count int DEFAULT 5
)
RETURNS TABLE (
  id UUID,
  content TEXT,
  page_number INTEGER,
  similarity FLOAT,
  content_type TEXT
)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
  SELECT combined.*
  FROM (
    -- Search text chunks
    SELECT
      dc.id,
      dc.content,
      dc.page_number,
      1 - (dc.embedding <=> query_embedding) AS similarity,
      'text'::TEXT AS content_type
    FROM document_chunks dc
    WHERE 1 - (dc.embedding <=> query_embedding) > match_threshold
    UNION ALL
    -- Search image descriptions
    SELECT
      di.id,
      di.image_description AS content,
      di.page_number,
      1 - (di.embedding <=> query_embedding) AS similarity,
      'image'::TEXT AS content_type
    FROM document_images di
    WHERE 1 - (di.embedding <=> query_embedding) > match_threshold
  ) AS combined
  -- Qualified so it cannot clash with the RETURNS TABLE column "similarity"
  ORDER BY combined.similarity DESC
  LIMIT match_count;
END;
$$;
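To see what this function computes, it helps to mirror it in memory. Doing so also makes it easy to unit-test threshold and limit choices without a database. The sketch below relies on one fact: pgvector's `<=>` is cosine distance, so `1 - (a <=> b)` is cosine similarity. Everything else is illustrative. The row shapes and the `matchDocumentContent` function are hypothetical stand-ins for the two tables.

```typescript
type Row = { id: string; content: string; page_number: number; embedding: number[] }
type Match = {
  id: string
  content: string
  page_number: number
  similarity: number
  content_type: 'text' | 'image'
}

// Cosine similarity, equal to 1 - (a <=> b) in pgvector
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

function matchDocumentContent(
  query: number[],
  textRows: Row[],
  imageRows: Row[],
  matchThreshold = 0.5,
  matchCount = 5
): Match[] {
  // Score one table and apply the threshold (strict >, as in the SQL)
  const score = (rows: Row[], content_type: 'text' | 'image'): Match[] =>
    rows
      .map(({ id, content, page_number, embedding }) => ({
        id,
        content,
        page_number,
        content_type,
        similarity: cosineSimilarity(query, embedding),
      }))
      .filter((m) => m.similarity > matchThreshold)
  // UNION ALL, then a single global ORDER BY similarity DESC and LIMIT
  return [...score(textRows, 'text'), ...score(imageRows, 'image')]
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount)
}
```

The threshold applies to each table separately, but the limit applies only after the union. As a result, strong text matches can crowd every image description out of the top `match_count`. If image results should always appear, one option is to limit each branch separately.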
Handle both text and image results:
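// Assumed shape of each row returned by match_document_content
interface SearchResult {
  id: string
  content: string
  page_number: number
  similarity: number
  content_type: 'text' | 'image'
}
function buildContext(results: SearchResult[]): string {
  const contextParts = results.map((result, index) => {
    const contentType = result.content_type === 'image'
      ? '[IMAGE DESCRIPTION]'
      : '[TEXT]'
    return `${contentType} [${index + 1}] ${result.content}\n(Page ${result.page_number})`
  })
  return contextParts.join('\n\n---\n\n')
}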
Example of the resulting context:
[TEXT] [1] A closure is a function that has access...
(Page 44)
---
[IMAGE DESCRIPTION] [2] The diagram shows nested function boxes...
(Page 45)
Text-only ingestion (300-page book):
350 chunks × $0.0001 = $0.035
Text + Vision (50 pages processed):
50 images × $0.015 = $0.75 for vision, on top of the $0.035 for text
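These figures rest on two per-unit prices used in this section: about $0.0001 per embedded chunk and about $0.015 per page image sent to Claude. Prices change, so the sketch below treats both as adjustable assumptions. It also ignores the small cost of embedding the image descriptions. `estimateIngestionCost` is a hypothetical helper.

```typescript
interface Pricing {
  perChunkEmbedding: number // USD per text chunk embedded (assumed)
  perImageDescription: number // USD per page image described (assumed)
}

const DEFAULT_PRICING: Pricing = { perChunkEmbedding: 0.0001, perImageDescription: 0.015 }

// Rough ingestion cost in USD, split into text and vision parts.
function estimateIngestionCost(
  textChunks: number,
  imagePages: number,
  pricing: Pricing = DEFAULT_PRICING
): { text: number; vision: number; total: number } {
  const text = textChunks * pricing.perChunkEmbedding
  const vision = imagePages * pricing.perImageDescription
  return { text, vision, total: text + vision }
}
```

For the 300-page book, `estimateIngestionCost(350, 50)` gives about $0.035 for text plus $0.75 for vision, roughly $0.79 in total. Vision dominates, which is why selecting pages matters.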
To cut vision costs, process only the pages likely to contain visuals:
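// Only process pages likely to contain diagrams (chosen by hand here)
const diagramPages = [5, 12, 23, 45, 67, 89]
for (const pageNum of diagramPages) {
  // processPage is hypothetical: the per-page body of extractAndDescribeImages
  // (render, describe, embed, store) moved into its own function
  await processPage(pageNum)
}
// New vision cost: 6 images × $0.015 = $0.09 (vs $0.75 for 50 pages)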
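A hand-picked list does not scale to many books. One cheap alternative, offered here as an illustrative heuristic rather than part of this lesson's pipeline, is to flag pages whose extracted text mentions a figure, chart, table, or diagram. The sketch assumes `pageTexts[i]` holds the text of page `i + 1`, for example from the same PDF text extraction used for text chunks. `pickLikelyDiagramPages` and the keyword regex are hypothetical.

```typescript
// Keywords that often accompany visuals; \b avoids matching inside words like "paragraph"
const VISUAL_HINTS = /\b(fig(ure)?\.?\s*\d+|diagram|chart|graph|table\s+\d+)/i

// Returns 1-based page numbers (as pdf2pic expects) whose text hints at a visual.
function pickLikelyDiagramPages(pageTexts: string[], maxPages = 50): number[] {
  const pages: number[] = []
  pageTexts.forEach((text, index) => {
    if (pages.length < maxPages && VISUAL_HINTS.test(text)) {
      pages.push(index + 1)
    }
  })
  return pages
}
```

The result could replace the hardcoded `diagramPages` array. The trade-off is false negatives: a diagram the surrounding text never mentions will be skipped.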
User: "Explain the closure diagram on page 45"
Response:
The diagram on page 45 illustrates how closures work in JavaScript.
It shows a nested function structure where an inner function is defined
inside an outer function. The diagram demonstrates that the inner function
maintains access to variables from the outer function's scope...
(Source: Page 45 - IMAGE DESCRIPTION)
User: "How do closures work?"
Response:
A closure is a function that has access to variables in its outer scope
(JavaScript Guide, page 44).
The diagram on page 45 shows this concept visually with nested function
boxes, where the inner function box has access to variables from the
outer function box...
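The example answers cite sources as `(Source: Page 45 - IMAGE DESCRIPTION)`. If you prefer to generate that tag deterministically from the search results rather than rely on the model to write it, a minimal sketch could look like this. The `SourceResult` shape mirrors the columns returned by `match_document_content`. Both function names are hypothetical.

```typescript
type SourceResult = { page_number: number; content_type: 'text' | 'image' }

// Builds a source tag in the format used by the example answers.
function formatSource(result: SourceResult): string {
  const label = result.content_type === 'image' ? 'IMAGE DESCRIPTION' : 'TEXT'
  return `(Source: Page ${result.page_number} - ${label})`
}

// One tag per distinct page/type, in first-seen (i.e. ranked) order.
function formatSources(results: SourceResult[]): string[] {
  return Array.from(new Set(results.map(formatSource)))
}
```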
if (description.toLowerCase().includes('no diagrams')) {
  console.log(`⏭️ Skipping page ${pageNum} - no visuals`)
  continue
}
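The `includes('no diagrams')` check skips any page whose description contains that phrase anywhere. That includes a page with a table whose description happens to say "there are no diagrams". A stricter variant, sketched below, skips a page only when the reply starts with the sentinel that the `describeImageWithClaude` prompt asks for. `isTextOnlyPage` is a hypothetical helper.

```typescript
// Sentinel requested in the prompt: "No diagrams on this page."
const NO_VISUALS_SENTINEL = 'no diagrams on this page'

// True only when the reply opens with the sentinel (ignoring case, whitespace, leading quotes).
function isTextOnlyPage(description: string): boolean {
  const normalized = description.trim().toLowerCase().replace(/^["'\s]+/, '')
  return normalized.startsWith(NO_VISUALS_SENTINEL)
}
```

This design fails in the safer direction. If the model paraphrases the sentinel, the page is kept and embedded (a small extra cost) instead of a real visual being silently dropped.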
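async function describeImageWithRetry(
  base64Image: string,
  pageNumber: number,
  maxRetries = 3
): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await describeImageWithClaude(base64Image, pageNumber)
    } catch (error) {
      // Retry only on rate limits (HTTP 429), and only while attempts remain
      if (
        error instanceof Anthropic.APIError &&
        error.status === 429 &&
        attempt < maxRetries - 1
      ) {
        const delay = Math.pow(2, attempt) * 2000 // 2s, then 4s
        await new Promise((resolve) => setTimeout(resolve, delay))
        continue
      }
      throw error
    }
  }
  // Not reached in practice; satisfies TypeScript's return-path check
  throw new Error('describeImageWithRetry: retries exhausted')
}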
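`describeImageWithRetry` hardcodes both the Claude call and the 429 check. The same backoff can be factored into a generic wrapper that adds a little random jitter, so that parallel workers do not retry in lockstep. This sketch assumes that retryable errors expose a numeric `status`, as the Anthropic SDK's `APIError` does. It treats 429 (rate limited) and 529 (overloaded) as retryable. `withRetry` and its options are hypothetical names.

```typescript
type Retryable = { status?: number }

// Runs fn up to maxAttempts times, backing off exponentially on retryable statuses.
async function withRetry<T>(
  fn: () => Promise<T>,
  { maxAttempts = 3, baseDelayMs = 2000, retryOn = [429, 529] } = {}
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (error) {
      const status = (error as Retryable | null)?.status
      const canRetry =
        attempt < maxAttempts - 1 && status !== undefined && retryOn.includes(status)
      if (!canRetry) throw error
      // baseDelayMs, 2x, 4x, ... each stretched by up to 20% random jitter
      const delay = baseDelayMs * 2 ** attempt * (1 + Math.random() * 0.2)
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
}
```

Usage: `withRetry(() => describeImageWithClaude(base64Image, pageNum))`. The Anthropic client also retries 429 responses on its own, twice by default, configurable through the client's `maxRetries` option. Explicit retries like these mainly help when you want longer waits.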