Building a Production RAG System: Part 5 - Implementing the RAG Query Pipeline

Series: Building a Production-Ready Textbook Q&A System with RAG
Part: 5 of 7 | Read Time: 25 minutes | Level: Intermediate

What We'll Build in This Part

A chat API endpoint with streaming responses
Semantic search using pgvector
Context building from retrieved chunks
Answer generation with Claude
Citations with page numbers
Error handling and validation

Estimated time: 2-3 hours

The RAG Query Flow

When a user asks a question, here's what happens:

User Question: "What is a closure in JavaScript?"
    ↓
[1] Generate Query Embedding → [0.234, -0.567, ..., 0.123]
    ↓
[2] Search Similar Chunks → Top 5 matches with similarity > 0.5
    ↓
[3] Build Context → Combine chunks with metadata
    ↓
[4] Create System Prompt → Instructions + context
    ↓
[5] Call Claude API → Generate answer with citations
    ↓
[6] Stream Response → Real-time text streaming to UI
    ↓
Answer: "A closure is a function that has access to variables..."
        (See page 45, JavaScript Guide)

Step 1: Create the Chat API Route

Create app/api/chat/route.ts:

import { NextRequest, NextResponse } from 'next/server'
import OpenAI from 'openai'
import Anthropic from '@anthropic-ai/sdk'
import { createClient } from '@/lib/supabase/server'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

export async function POST(request: NextRequest) {
  try {
    const { question, documentId } = await request.json()

    // Validate inputs
    if (!question || typeof question !== 'string') {
      return NextResponse.json(
        { error: 'Question is required' },
        { status: 400 }
      )
    }

    // Get authenticated user
    const supabase = createClient()
    const { data: { user } } = await supabase.auth.getUser()

    if (!user) {
      return NextResponse.json({ error: 'Unauthorized' }, { status: 401 })
    }

    // Continue with RAG pipeline...
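  } catch (error) {
    // Log the real error server-side; return only a generic message to the client
    console.error('Chat API error:', error)
    return NextResponse.json({ error: 'Internal error' }, { status: 500 })
  }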
}
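
The `// Continue with RAG pipeline...` placeholder starts with step [1] of the flow above: turning the question into an embedding. Here is a minimal sketch of that step. It assumes the chunks were embedded with OpenAI's `text-embedding-3-small` in the earlier parts; that model name is an assumption, and the query must use whichever model produced the stored vectors. `EmbeddingsClient` and `generateQueryEmbedding` are hypothetical names. The interface only mirrors the slice of the `openai` client that is called, so the function can be exercised without a network request.

```typescript
// Hypothetical structural type: only the part of the OpenAI client used here.
// The `openai` instance created in route.ts should satisfy it.
interface EmbeddingsClient {
  embeddings: {
    create(params: { model: string; input: string }): Promise<{
      data: { embedding: number[] }[]
    }>
  }
}

async function generateQueryEmbedding(
  client: EmbeddingsClient,
  question: string,
  model = 'text-embedding-3-small' // assumption: must match the indexing model
): Promise<number[]> {
  const response = await client.embeddings.create({
    model,
    input: question.trim(),
  })

  const embedding = response.data[0]?.embedding
  if (!embedding || embedding.length === 0) {
    throw new Error('Embedding response contained no vector')
  }
  return embedding
}
```

Inside the route this would be called as `const queryEmbedding = await generateQueryEmbedding(openai, question)`, just before the search in Step 2.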

Step 2: Search for Similar Chunks

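import type { SupabaseClient } from '@supabase/supabase-js'

// Shape of a row returned by match_document_chunks.
// Only the fields used in this part are listed.
interface SearchResult {
  id: string // assumed to be a UUID string
  content: string
  page_number: number
  document_title: string | null
  document_author: string | null
}

async function searchSimilarChunks(
  supabase: SupabaseClient,
  queryEmbedding: number[],
  documentId?: string
): Promise<SearchResult[]> {
  // Convert to pgvector's text format, e.g. [0.1,0.2,0.3]
  const embeddingString = `[${queryEmbedding.join(',')}]`

  // Call the match function
  const { data, error } = await supabase.rpc('match_document_chunks', {
    query_embedding: embeddingString,
    match_threshold: 0.5,
    match_count: 5,
    filter_document_id: documentId ?? null,
  })

  if (error) {
    throw new Error(`Failed to search documents: ${error.message}`)
  }

  return (data ?? []) as SearchResult[]
}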

Key points:

Convert embedding to pgvector string: [0.1,0.2,0.3]
A similarity threshold of 0.5 filters out weak matches (it is a cutoff on the similarity score, not a percentage of matching text)
Retrieve top 5 chunks for context
Optional document filter for single-book search
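
To make the threshold and top-5 settings concrete, here is a small in-memory sketch of what the database function is presumably doing. It assumes `match_document_chunks` (defined in an earlier part) scores chunks by cosine similarity, which is a common choice with pgvector but is not confirmed in this part. `cosineSimilarity` and `topMatches` are illustrative names only.

```typescript
// Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

interface ScoredChunk {
  id: string
  similarity: number
}

// In-memory analogue of the search: score, filter by threshold, keep the top N.
function topMatches(
  query: number[],
  chunks: { id: string; embedding: number[] }[],
  threshold = 0.5,
  count = 5
): ScoredChunk[] {
  return chunks
    .map(c => ({ id: c.id, similarity: cosineSimilarity(query, c.embedding) }))
    .filter(c => c.similarity > threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, count)
}

const matches = topMatches(
  [1, 0],
  [
    { id: 'same-direction', embedding: [2, 0] },
    { id: 'diagonal', embedding: [1, 1] },
    { id: 'orthogonal', embedding: [0, 1] },
  ]
)
console.log(matches.map(m => m.id)) // ['same-direction', 'diagonal']
```

The diagonal vector scores about 0.71 and passes; the orthogonal one scores 0 and is dropped. The right cutoff depends on the embedding model, so treat 0.5 as a starting point.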

Step 3: Build Context from Results

function buildContext(results: SearchResult[]): string {
  if (results.length === 0) {
    return 'No relevant information found in the textbooks.'
  }

  const contextParts = results.map((result, index) => {
    const source = result.document_title
      ? `${result.document_title} by ${result.document_author || 'Unknown'}`
      : 'Document'

    return `[${index + 1}] ${result.content}
(Source: ${source}, Page ${result.page_number})`
  })

  return contextParts.join('\n\n---\n\n')
}

Example Output:

[1] A closure is a function that has access to variables in its outer scope...
(Source: JavaScript Guide by John Doe, Page 45)

---

[2] Functions in JavaScript are first-class citizens...
(Source: JavaScript Guide by John Doe, Page 47)

Step 4: Create System Prompt

function createSystemPrompt(context: string): string {
  return `You are an AI tutor helping students understand their textbooks.

Your task is to answer questions based ONLY on the provided context.

Rules:
1. Use only the context - no external knowledge
2. Always cite the document title and page number: "According to the JavaScript Guide (page 45)..."
3. Be clear and educational
4. If uncertain, admit it

---

CONTEXT FROM TEXTBOOKS:

${context}

---

Remember: Cite page numbers and document titles.`
}

Why this prompt works:

Clear constraints reduce the risk of hallucination
Citation requirement ensures verifiable answers
Educational tone helps students learn
Admission of uncertainty builds trust
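
Citations only make answers verifiable if someone checks them. As a hedged sketch, the helper below pulls page numbers out of an answer and flags any that do not appear in the retrieved chunks. It assumes citations follow the `page 45` wording requested in the prompt and will miss other forms such as page ranges. `extractCitedPages` and `findUnsupportedPages` are hypothetical names, not part of the pipeline above.

```typescript
// Collect page numbers written as "page 45" or "Page 45" (assumed format).
function extractCitedPages(answer: string): number[] {
  const pages = new Set<number>()
  const pattern = /\bpage\s+(\d+)/gi
  let match: RegExpExecArray | null
  while ((match = pattern.exec(answer)) !== null) {
    pages.add(Number(match[1]))
  }
  return Array.from(pages).sort((a, b) => a - b)
}

// Return cited pages that none of the retrieved chunks came from.
function findUnsupportedPages(
  answer: string,
  retrieved: { page_number: number }[]
): number[] {
  const available = new Set(retrieved.map(r => r.page_number))
  return extractCitedPages(answer).filter(page => !available.has(page))
}

const sampleAnswer =
  'According to the JavaScript Guide (page 45), closures capture scope. See also page 99.'
const unsupported = findUnsupportedPages(sampleAnswer, [
  { page_number: 45 },
  { page_number: 47 },
])
console.log(unsupported) // [99]
```

A non-empty result is a signal worth logging or showing in the UI, since it suggests the model cited something outside the provided context.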

Step 5: Implement Streaming Response

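async function generateStreamingAnswer(
  question: string,
  systemPrompt: string
): Promise<ReadableStream<Uint8Array>> {
  const stream = await anthropic.messages.stream({
    model: 'claude-3-5-sonnet-20240620',
    max_tokens: 2048,
    messages: [{ role: 'user', content: question }],
    system: systemPrompt,
  })

  const encoder = new TextEncoder()

  return new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        // Forward only text deltas; other event types carry metadata
        for await (const chunk of stream) {
          if (chunk.type === 'content_block_delta' &&
              chunk.delta.type === 'text_delta') {
            controller.enqueue(encoder.encode(chunk.delta.text))
          }
        }
        controller.close()
      } catch (error) {
        // Surface API failures to the reader instead of leaving it hanging
        controller.error(error)
      }
    },
    cancel() {
      // Stop the upstream request if the client disconnects
      stream.abort()
    },
  })
}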

Why streaming?

Better UX: Users see responses immediately
Reduces perceived latency
Works well for long answers
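
Step 5 produces a `ReadableStream`, but the route still has to return it to the browser. Here is a minimal sketch, assuming the route streams plain UTF-8 text (which is what the client in Step 6 decodes). `textStreamResponse` is a hypothetical helper name, and the header choices are a suggestion, not a requirement.

```typescript
// Wrap a byte stream in a standard Fetch API Response.
// Route handlers in the Next.js App Router can return a plain Response.
function textStreamResponse(stream: ReadableStream<Uint8Array>): Response {
  return new Response(stream, {
    status: 200,
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      // Discourage intermediaries from caching or transforming the stream
      'Cache-Control': 'no-cache, no-transform',
    },
  })
}

// Tiny stand-in for the stream returned by generateStreamingAnswer
const demoEncoder = new TextEncoder()
const demoStream = new ReadableStream<Uint8Array>({
  start(controller) {
    controller.enqueue(demoEncoder.encode('A closure is '))
    controller.enqueue(demoEncoder.encode('a function...'))
    controller.close()
  },
})

const demoResponse = textStreamResponse(demoStream)
```

In the route, the end of the `try` block could then read `return textStreamResponse(await generateStreamingAnswer(question, systemPrompt))`.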

Step 6: Create the Chat UI

Create app/chat/page.tsx with a complete chat interface:

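'use client'

import { useState, type FormEvent } from 'react'

type Message = { role: 'user' | 'assistant'; content: string }

export default function ChatPage() {
  const [messages, setMessages] = useState<Message[]>([])
  const [input, setInput] = useState('')
  const [loading, setLoading] = useState(false)

  const handleSubmit = async (e: FormEvent) => {
    e.preventDefault()
    const question = input.trim()
    if (!question || loading) return

    // Add the user's message plus an empty assistant message to stream into
    setMessages(prev => [
      ...prev,
      { role: 'user', content: question },
      { role: 'assistant', content: '' },
    ])
    setInput('')
    setLoading(true)

    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ question }),
      })

      if (!response.ok || !response.body) {
        throw new Error(`Request failed with status ${response.status}`)
      }

      const reader = response.body.getReader()
      const decoder = new TextDecoder()
      let assistantMessage = ''

      while (true) {
        const { done, value } = await reader.read()
        if (done) break

        // stream: true keeps multi-byte characters intact across chunks
        assistantMessage += decoder.decode(value, { stream: true })
        const content = assistantMessage

        // Replace the last message with a new object instead of mutating state
        setMessages(prev => [...prev.slice(0, -1), { role: 'assistant', content }])
      }
    } catch (error) {
      console.error(error)
      setMessages(prev => [
        ...prev.slice(0, -1),
        { role: 'assistant', content: 'Something went wrong. Please try again.' },
      ])
    } finally {
      setLoading(false)
    }
  }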

  return (
    // Chat UI JSX...
  )
}

Optimizing RAG Performance

Strategy 1: Adjust Match Threshold

// searchSimilarChunks passes match_threshold: 0.5 to match_document_chunks.
// Its third argument is documentId, so change the value inside the RPC call
// (or expose it as a parameter) to experiment:

// Lower threshold = more results (may include less relevant ones)
match_threshold: 0.3,

// Higher threshold = fewer, higher-quality results
match_threshold: 0.7,

Recommendation: Start with 0.5, adjust based on quality.
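
"Adjust based on quality" is easier to act on with a few hand-labeled examples. The sketch below assumes you have similarity scores for some test questions and have marked each result relevant or not. It reports precision and recall for each candidate threshold. `LabeledResult` and `sweepThresholds` are illustrative names, and the sample data is made up for the demonstration.

```typescript
interface LabeledResult {
  similarity: number
  relevant: boolean // judged by hand
}

interface ThresholdReport {
  threshold: number
  kept: number
  precision: number // share of kept results that are relevant
  recall: number // share of relevant results that were kept
}

function sweepThresholds(
  results: LabeledResult[],
  thresholds: number[]
): ThresholdReport[] {
  const totalRelevant = results.filter(r => r.relevant).length
  return thresholds.map(threshold => {
    const kept = results.filter(r => r.similarity > threshold)
    const keptRelevant = kept.filter(r => r.relevant).length
    return {
      threshold,
      kept: kept.length,
      precision: kept.length === 0 ? 0 : keptRelevant / kept.length,
      recall: totalRelevant === 0 ? 0 : keptRelevant / totalRelevant,
    }
  })
}

// Made-up labels, purely for illustration
const labeled: LabeledResult[] = [
  { similarity: 0.82, relevant: true },
  { similarity: 0.64, relevant: true },
  { similarity: 0.55, relevant: false },
  { similarity: 0.41, relevant: true },
  { similarity: 0.33, relevant: false },
]

const report = sweepThresholds(labeled, [0.3, 0.5, 0.7])
console.table(report)
```

With this sample, raising the threshold from 0.3 to 0.7 moves precision from 0.6 to 1.0 while recall falls from 1.0 to about 0.33, which is the trade-off the two comments above describe.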

Strategy 2: Hybrid Search (Keyword + Vector)

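async function hybridSearch(
  supabase: any,
  query: string,
  queryEmbedding: number[]
) {
  // Get vector results
  const vectorResults = await searchSimilarChunks(supabase, queryEmbedding)

  // Get keyword results. 'websearch' parses free-form user input; the default
  // query type expects tsquery syntax and fails on plain multi-word questions.
  const { data: keywordResults, error } = await supabase
    .from('document_chunks')
    .select('*')
    .textSearch('content', query, { type: 'websearch' })
    .limit(5)

  if (error) throw new Error('Failed to run keyword search')

  // Combine and deduplicate by chunk id (data can be null, so default to [])
  const combined = [...vectorResults, ...(keywordResults ?? [])]
  return Array.from(new Map(combined.map(r => [r.id, r])).values())
}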
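
One detail in the deduplication step: `new Map(entries)` keeps the last value it sees for each key, so a keyword row replaces the vector row with the same `id`, along with any extra fields the match function returned (such as a similarity score, if it returns one). If you would rather keep the first occurrence, a small helper does it. This is a sketch; `mergeById` is a hypothetical name and the sample rows are invented.

```typescript
// Merge result lists, keeping the first row seen for each id.
// Pass the list you trust most first (here: the vector results).
function mergeById<T extends { id: string }>(...lists: T[][]): T[] {
  const seen = new Map<string, T>()
  for (const list of lists) {
    for (const row of list) {
      if (!seen.has(row.id)) {
        seen.set(row.id, row)
      }
    }
  }
  return Array.from(seen.values())
}

type Row = { id: string; content: string; similarity?: number }

// Invented rows for illustration
const vectorRows: Row[] = [
  { id: 'a', content: 'Closures capture outer variables', similarity: 0.81 },
  { id: 'b', content: 'Functions are first-class', similarity: 0.62 },
]
const keywordRows: Row[] = [
  { id: 'b', content: 'Functions are first-class' },
  { id: 'c', content: 'Scope and hoisting' },
]

const merged = mergeById(vectorRows, keywordRows)
console.log(merged.map(r => r.id)) // ['a', 'b', 'c']
```

Row `b` keeps its similarity of 0.62 because the vector list was passed first.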

Testing the Complete Pipeline

Test 1: Basic Question

Input: "What is a closure?"

Output:
A closure is a function that has access to variables in its outer scope,
even after the outer function has returned. According to the JavaScript
Guide (page 45), closures are created when a function is defined inside
another function...

Test 2: No Relevant Content

Input: "What is quantum computing?"

Output:
I don't have enough information in your textbooks to answer this question.
The available context doesn't cover quantum computing.

What We Accomplished

Complete RAG Query Pipeline - Embedding, search, context, generation
Production Features - Streaming, auth, validation
Optimization Strategies - Threshold tuning, hybrid search

Coming Up in Part 6

In Part 6: Adding Vision Support for Diagrams and Charts, we'll:

Extract images from PDF pages
Use Claude Vision to describe diagrams
Embed image descriptions
Create unified search across text and images
Handle visual Q&A

Estimated time: 2 hours to complete Part 6

Homework Challenge

Add conversation history: Remember previous questions in the session
Implement follow-up questions: "Can you explain that in simpler terms?"
Add document selection UI: Let users choose which textbook to query
Create suggested questions: Show example questions based on content
Add feedback buttons: Let users rate answer quality

Tags:
#RAG #Query #Claude #Streaming #Chat #VectorSearch