Building a Production RAG System: Part 6 - Adding Vision Support for Diagrams and Charts

Series: Building a Production-Ready Textbook Q&A System with RAG
Part: 6 of 7
Read Time: 20 minutes
Level: Intermediate to Advanced

What We'll Build in This Part

PDF page-to-image conversion
Claude Vision for diagram description
Image description embeddings
Unified search across text and images
Visual Q&A capabilities
Cost-optimized image processing

Estimated time: 2 hours

The Problem: Text-Only RAG Misses Visual Content

Consider a textbook page with a diagram:

User asks: "Explain the diagram on page 45"

❌ Text-only RAG:
"I don't have access to images."

✅ Vision-enabled RAG:
"The diagram shows a nested function structure where the inner function has access to variables from the outer function's scope..."

The Vision Pipeline

PDF Pages
    ↓
[1] Convert to Images → PNG files at 150 DPI
    ↓
[2] Send to Claude Vision → Describe diagrams, charts, formulas
    ↓
[3] Generate Embeddings → Convert descriptions to vectors
    ↓
[4] Store in document_images → Searchable image database
    ↓
[5] Unified Search → Query both document_chunks and document_images
    ↓
✅ Complete Visual + Text Search

Step 1: Install Vision Dependencies

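npm install pdf2pic @anthropic-ai/sdk

System Requirements

pdf2pic relies on GraphicsMagick and Ghostscript, so both must be installed on the system.

macOS:

brew install graphicsmagick ghostscript

Ubuntu/Debian:

sudo apt-get install graphicsmagick ghostscript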

Step 2: Add Image Extraction

Update scripts/ingest.ts:

// fs and path may already be imported in scripts/ingest.ts
import fs from 'fs'
import path from 'path'
import { fromPath } from 'pdf2pic'
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
})

// `supabase` and `generateEmbedding` are assumed to exist already in scripts/ingest.ts
async function extractAndDescribeImages(
  filePath: string,
  numPages: number,
  documentId: string
): Promise<void> {
  console.log('🖼️  Extracting images from PDF...')

  // pdf2pic writes each rendered page into this directory
  const tempDir = path.join(process.cwd(), '.temp-images')
  fs.mkdirSync(tempDir, { recursive: true })

  const converter = fromPath(filePath, {
    density: 150,
    format: 'png',
    width: 1024,
    height: 1024,
    savePath: tempDir,
    saveFilename: 'page',
  })

  // Process first 50 pages to control costs
  const maxPages = Math.min(numPages, 50)

  for (let pageNum = 1; pageNum <= maxPages; pageNum++) {
    const result = await converter(pageNum)
    if (!result.path) {
      console.warn(`⚠️  Could not render page ${pageNum}, skipping`)
      continue
    }
    const imageBuffer = fs.readFileSync(result.path)
    const base64Image = imageBuffer.toString('base64')

    // Describe with Claude Vision
    const description = await describeImageWithClaude(base64Image, pageNum)

    if (description.toLowerCase().includes('no diagrams')) {
      continue // Skip text-only pages
    }

    // Generate embedding and store
    const embedding = await generateEmbedding(description)
    const { error } = await supabase.from('document_images').insert({
      document_id: documentId,
      page_number: pageNum,
      image_description: description,
      embedding: `[${embedding.join(',')}]`,
    })
    if (error) {
      throw new Error(`Failed to store page ${pageNum}: ${error.message}`)
    }
  }
}
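
The page loop leaves one PNG per processed page in .temp-images. A minimal cleanup sketch, assuming nothing else lives in that directory; `cleanupTempImages` is a hypothetical helper name, not part of pdf2pic:

```typescript
import * as fs from 'fs'
import * as os from 'os'
import * as path from 'path'

// Hypothetical helper: delete the temp directory and everything in it.
// `force: true` makes this a no-op when the directory does not exist.
function cleanupTempImages(tempDir: string): void {
  fs.rmSync(tempDir, { recursive: true, force: true })
}

// Usage sketch with a throwaway directory standing in for .temp-images
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'temp-images-'))
fs.writeFileSync(path.join(demoDir, 'page.1.png'), '')
cleanupTempImages(demoDir)
console.log(fs.existsSync(demoDir)) // false
```

Calling it from a `finally` block around the page loop would remove the files even when a Vision call fails partway through.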

Step 3: Describe Images with Claude Vision

async function describeImageWithClaude(
  base64Image: string,
  pageNumber: number
): Promise<string> {
  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20240620',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: [{
        type: 'image',
        source: {
          type: 'base64',
          media_type: 'image/png',
          data: base64Image,
        },
      }, {
        type: 'text',
        text: `This is page ${pageNumber} from a textbook.
Describe any diagrams, charts, tables, or formulas.

If no visual elements, respond: "No diagrams on this page."

For visuals, describe:
1. Type (diagram, chart, table, formula)
2. What concept it illustrates
3. Key components and relationships
4. Labels and annotations`,
      }],
    }],
  })

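  // The SDK returns a list of typed content blocks; only text blocks have `.text`
  const block = response.content[0]
  if (!block || block.type !== 'text') {
    throw new Error(`Unexpected Claude response for page ${pageNumber}`)
  }
  return block.text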
}

Step 4: Create Unified Search Function

Update the Supabase function:

CREATE OR REPLACE FUNCTION match_document_content(
    query_embedding vector(1536),
    match_threshold float DEFAULT 0.5,
    match_count int DEFAULT 5
)
RETURNS TABLE (
    id UUID,
    content TEXT,
    page_number INTEGER,
    similarity FLOAT,
    content_type TEXT
)
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN QUERY
    -- Search text chunks
    SELECT
        dc.id,
        dc.content,
        dc.page_number,
        1 - (dc.embedding <=> query_embedding) AS similarity,
        'text'::TEXT AS content_type
    FROM document_chunks dc
    WHERE 1 - (dc.embedding <=> query_embedding) > match_threshold

    UNION ALL

    -- Search image descriptions
    SELECT
        di.id,
        di.image_description AS content,
        di.page_number,
        1 - (di.embedding <=> query_embedding) AS similarity,
        'image'::TEXT AS content_type
    FROM document_images di
    WHERE 1 - (di.embedding <=> query_embedding) > match_threshold

    ORDER BY similarity DESC
    LIMIT match_count;
END;
$$;

What this does:

Searches both text chunks AND image descriptions
Returns unified results sorted by similarity
Indicates content type (text or image)
Preserves page numbers for citations
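
To make the merge semantics concrete, here is an in-memory TypeScript equivalent of the query. It is only an illustration (the real ranking happens in Postgres); `MatchRow`, `mergeMatches`, and the sample rows are made up for this sketch:

```typescript
// Row shape mirroring the RETURNS TABLE columns of match_document_content
interface MatchRow {
  id: string
  content: string
  page_number: number
  similarity: number
  content_type: 'text' | 'image'
}

// Same steps as the SQL: combine both sources (UNION ALL), keep rows above
// the shared threshold, sort by similarity, then apply a single LIMIT.
function mergeMatches(
  textRows: MatchRow[],
  imageRows: MatchRow[],
  matchThreshold = 0.5,
  matchCount = 5
): MatchRow[] {
  return [...textRows, ...imageRows]
    .filter(row => row.similarity > matchThreshold)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, matchCount)
}

const merged = mergeMatches(
  [
    { id: 't1', content: 'A closure is a function...', page_number: 44, similarity: 0.82, content_type: 'text' },
    { id: 't2', content: 'Unrelated paragraph', page_number: 10, similarity: 0.31, content_type: 'text' },
  ],
  [
    { id: 'i1', content: 'The diagram shows nested function boxes...', page_number: 45, similarity: 0.77, content_type: 'image' },
  ]
)
console.log(merged.map(r => `${r.content_type}:${r.page_number}`)) // [ 'text:44', 'image:45' ]
```

Because the limit applies to the combined set, image descriptions compete with text chunks for the same match_count slots.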

Step 5: Update Context Builder

Handle both text and image results:

function buildContext(results: SearchResult[]): string {
  const contextParts = results.map((result, index) => {
    const contentType = result.content_type === 'image'
      ? '[IMAGE DESCRIPTION]'
      : '[TEXT]'

    return `${contentType} [${index + 1}] ${result.content}
(Page ${result.page_number})`
  })

  return contextParts.join('\n\n---\n\n')
}

Example Context:

[TEXT] [1] A closure is a function that has access...
(Page 44)

---

[IMAGE DESCRIPTION] [2] The diagram shows nested function boxes...
(Page 45)

Cost Analysis for Vision

Real Costs

Text-only ingestion (300-page book):

350 chunks × $0.0001 = $0.035

Text + Vision (50 pages processed):

350 chunks × $0.0001 = $0.035
50 images × $0.015 = $0.75
Total: $0.785
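
The same arithmetic as a small estimator, useful for checking a budget before ingesting a book. The per-unit prices are the figures quoted in this article, not live pricing, and `estimateIngestionCost` is a hypothetical helper:

```typescript
// Per-unit figures taken from this article; verify against current pricing
const COST_PER_CHUNK_EMBEDDING = 0.0001
const COST_PER_IMAGE_DESCRIPTION = 0.015

interface IngestionCost {
  embeddingCost: number
  visionCost: number
  total: number
}

// Estimate ingestion cost in dollars for a number of text chunks and page images
function estimateIngestionCost(chunks: number, images: number): IngestionCost {
  const embeddingCost = chunks * COST_PER_CHUNK_EMBEDDING
  const visionCost = images * COST_PER_IMAGE_DESCRIPTION
  return { embeddingCost, visionCost, total: embeddingCost + visionCost }
}

const estimate = estimateIngestionCost(350, 50)
console.log(`$${estimate.total.toFixed(3)}`) // $0.785
```

Like the figures above, this leaves out the embedding call for each stored image description, which adds at most 50 × $0.0001 = $0.005 here.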

Optimization Strategy

Process images selectively:

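// Only process pages with likely diagrams (page numbers picked by hand here)
const diagramPages = [5, 12, 23, 45, 67, 89]

for (const pageNum of diagramPages) {
  // processPage stands for the per-page body of the loop in Step 2:
  // convert, describe, embed, store
  await processPage(pageNum)
}

// New cost: 6 images × $0.015 = $0.09 (vs $0.75)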
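
One way to pick candidate pages without listing them by hand is to scan the text already extracted for each page for caption-like phrases. This is an assumed heuristic, not something the pipeline above does; `findLikelyDiagramPages` and its input shape are hypothetical:

```typescript
// Assumed heuristic: a page probably contains a visual if its extracted
// text mentions a numbered figure, diagram, chart, or table.
const VISUAL_HINT = /\b(fig(?:ure)?\.?|diagram|chart|table)\s*\d+/i

// pageTexts[i] is the extracted text of page i + 1 (hypothetical input)
function findLikelyDiagramPages(pageTexts: string[]): number[] {
  const pages: number[] = []
  pageTexts.forEach((text, index) => {
    if (VISUAL_HINT.test(text)) pages.push(index + 1)
  })
  return pages
}

const likely = findLikelyDiagramPages([
  'Chapter 1: Functions are first-class values.',
  'As shown in Figure 2.1, the inner function keeps a reference...',
  'Table 3 compares var, let, and const.',
])
console.log(likely) // [ 2, 3 ]
```

The trade-off is recall: a diagram with no numbered caption on its page is missed, so this suits books that caption their figures consistently.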

Testing Vision Support

Test 1: Ask About a Diagram

User: "Explain the closure diagram on page 45"

Response:
The diagram on page 45 illustrates how closures work in JavaScript.
It shows a nested function structure where an inner function is defined
inside an outer function. The diagram demonstrates that the inner function
maintains access to variables from the outer function's scope...

(Source: Page 45 - IMAGE DESCRIPTION)

Test 2: Mixed Text and Visual Results

User: "How do closures work?"

Response:
A closure is a function that has access to variables in its outer scope
(JavaScript Guide, page 44).

The diagram on page 45 shows this concept visually with nested function
boxes, where the inner function box has access to variables from the
outer function box...

Edge Cases and Solutions

Edge Case 1: Page Has Only Text

Problem: For a text-only page, Claude returns the sentinel requested in the prompt: "No diagrams on this page."

Solution: Skip storing this description

if (description.toLowerCase().includes('no diagrams')) {
  console.log(`⏭️  Skipping page ${pageNum} - no visuals`)
  continue
}
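
The substring check also skips any real description that happens to contain the words "no diagrams". A stricter sketch that only matches when the whole response is the sentinel; `isTextOnlyPage` is a hypothetical helper:

```typescript
// Sentinel the prompt in Step 3 asks Claude to return for text-only pages
const NO_VISUALS_SENTINEL = 'no diagrams on this page'

// True when the description is only the sentinel, ignoring case,
// surrounding whitespace or quotes, and a trailing period.
function isTextOnlyPage(description: string): boolean {
  const normalized = description
    .trim()
    .toLowerCase()
    .replace(/^["']+|["'.]+$/g, '')
  return normalized === NO_VISUALS_SENTINEL
}

console.log(isTextOnlyPage('No diagrams on this page.')) // true
console.log(isTextOnlyPage('A flowchart with no diagrams of memory layout.')) // false
```

The cost of being strict is that a reworded refusal from the model is stored as a description, so it is worth logging what gets skipped and what does not.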

Edge Case 2: Vision API Rate Limits

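async function describeImageWithRetry(
  base64Image: string,
  pageNumber: number,
  maxRetries = 3
): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await describeImageWithClaude(base64Image, pageNumber)
    } catch (error) {
      const status = (error as { status?: number }).status
      // Retry only on HTTP 429 (rate limited), and only while attempts remain
      if (status !== 429 || attempt >= maxRetries - 1) {
        throw error
      }
      // Exponential backoff: 2s, then 4s with the default of 3 attempts
      const delay = Math.pow(2, attempt) * 2000
      await new Promise(resolve => setTimeout(resolve, delay))
    }
  }
}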
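
With maxRetries = 3 the loop sleeps at most twice, because the final failed attempt rethrows instead of waiting. A small sketch that lists the delays for a given setting; `backoffScheduleMs` is a hypothetical helper used only to show the arithmetic:

```typescript
// Delays the retry loop sleeps through before giving up: base × 2^attempt
// for every attempt except the last, which rethrows instead of waiting.
function backoffScheduleMs(maxRetries = 3, baseMs = 2000): number[] {
  const delays: number[] = []
  for (let attempt = 0; attempt < maxRetries - 1; attempt++) {
    delays.push(Math.pow(2, attempt) * baseMs)
  }
  return delays
}

console.log(backoffScheduleMs())  // [ 2000, 4000 ]
console.log(backoffScheduleMs(4)) // [ 2000, 4000, 8000 ]
```

So the worst-case added wait per page is 6 seconds at the default setting, and reaching the 8-second delay requires maxRetries = 4.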

What We Accomplished

Vision-Enabled RAG - PDF to image, description, unified search
Cost Optimization - Selective processing, smart filtering
Production Features - Error handling, retries, cleanup

Coming Up in Part 7

In Part 7: Production Features, Subscriptions, and Deployment, we'll:

Implement subscription tiers (Free/Pro/Unlimited)
Add Stripe payment integration
Create rate limiting based on usage
Build analytics dashboard
Deploy to production

Estimated time: 3-4 hours to complete Part 7

Homework Challenge

Add diagram-only search: Let users search specifically for visual content
Generate image thumbnails: Store small preview images
Implement OCR: Extract text from images using Tesseract
Create visual diff: Compare diagram versions across editions

Tags:
#Vision #ClaudeVision #ImageProcessing #MultimodalRAG #PDF