Building a Production RAG System: Part 1 - Understanding RAG and System Architecture

Series: Building a Production-Ready Textbook Q&A System with RAG
Part: 1 of 7 | Read Time: 15 minutes | Level: Beginner to Intermediate

The Problem: When AI Gets It Wrong

Imagine you're studying for an exam and you ask ChatGPT: "What's the capital of France in 2024?"

ChatGPT confidently responds: "Paris."

Great! Now ask: "According to my textbook on page 45, what does it say about closures in JavaScript?"

ChatGPT responds: "I don't have access to your specific textbook, but generally, closures are..."

These are the fundamental limitations of Large Language Models (LLMs):

They only know what they were trained on (knowledge cutoff)
They can't access your private documents
They sometimes "hallucinate" (make up plausible-sounding but incorrect answers)
They can't reliably cite sources or provide page numbers

For a student studying from textbooks, this is a dealbreaker.

The Solution: RAG (Retrieval-Augmented Generation)

RAG combines the best of two worlds:

Retrieval: Search your documents for relevant information
Generation: Use an LLM to generate answers based only on what was found

Think of it like this:

Traditional LLM: "Tell me what you know about X" (might hallucinate)
RAG: "Here are 5 relevant paragraphs from the textbook. Based on ONLY these paragraphs, answer the question" (grounded in facts)

How RAG Works (Simplified)

User Question: "What is a closure in JavaScript?"
       ↓
1. Convert question to vector embedding
       ↓
2. Search textbook for similar content (vector similarity)
       ↓
3. Retrieve top 5 most relevant chunks
       ↓
4. Send chunks + question to LLM
       ↓
5. LLM generates answer with citations
       ↓
Answer: "A closure is a function that has access to variables
         in its outer scope... (See page 45, paragraph 3)"
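Step 4 above ("send chunks + question to LLM") mostly comes down to how the prompt is assembled. Here is a minimal sketch of that step. The `RetrievedChunk` shape, the `buildGroundedPrompt` name, and the exact instruction wording are assumptions for illustration, not the prompt this series ends up using.

```typescript
// A retrieved piece of textbook text. Field names are assumptions for this sketch.
interface RetrievedChunk {
  content: string;
  pageNumber: number;
}

// Assemble a prompt that asks the model to answer from the excerpts only
// and to cite the page each claim comes from.
function buildGroundedPrompt(question: string, chunks: RetrievedChunk[]): string {
  const excerpts = chunks
    .map((chunk, i) => `[Excerpt ${i + 1}, page ${chunk.pageNumber}]\n${chunk.content}`)
    .join("\n\n");

  return [
    "Answer the question using ONLY the textbook excerpts below.",
    "Cite the page number for every claim.",
    "If the excerpts do not contain the answer, say so instead of guessing.",
    "",
    excerpts,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Usage: one retrieved chunk from page 45
const examplePrompt = buildGroundedPrompt("What is a closure in JavaScript?", [
  {
    content: "A closure is a function that has access to variables in its outer scope.",
    pageNumber: 45,
  },
]);
console.log(examplePrompt);
```

Labeling each excerpt with its page number is what lets the model point back to a page in its answer. The "say so instead of guessing" line is one common way to discourage hallucination when retrieval comes back empty or off-topic.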

What We're Building

By the end of this series, you'll have built a production-ready SaaS application that:

✅ Core Features

Upload PDF textbooks and automatically process them
Ask questions in natural language
Get AI-generated answers with page number citations
Search across both text and images (diagrams, charts, formulas)
Real-time streaming responses

✅ Production Features

User authentication (sign up, sign in, password reset)
Three subscription tiers: Free, Pro, Unlimited
Stripe payment integration
Rate limiting based on subscription
Usage analytics and logging
Row-level security (RLS) for data protection

✅ Technical Highlights

Semantic search using vector embeddings
pgvector for fast similarity search (an indexed search can be ~100x faster than a brute-force scan on large tables)
Vision AI to understand diagrams and charts
Cost-optimized (~1.5¢ per question, or about $1.50 per 100 questions)
Scalable to millions of document chunks

The Tech Stack (And Why)

Let me explain our technology choices:

Frontend: Next.js 14 + React 18 + TailwindCSS

Why?

Next.js App Router for modern React development
Server-side rendering (SSR) for better SEO and performance
API routes (called Route Handlers in the App Router) for backend logic (no separate server needed)
TailwindCSS for rapid UI development

Database: Supabase (PostgreSQL + pgvector)

Why?

PostgreSQL is battle-tested and reliable
pgvector extension enables vector similarity search
Built-in authentication and Row Level Security (RLS)
Real-time subscriptions (bonus feature potential)
Generous free tier for development

AI/ML Stack

Why this combination?

1. OpenAI text-embedding-ada-002 (Embeddings)

Industry-standard embedding model
1536 dimensions, optimized for cosine similarity
Cost: $0.0001 per 1000 tokens (~$0.10 per textbook)

2. Anthropic Claude 3.5 Sonnet (Answer Generation)

Superior reasoning and instruction following
200K context window (can fit lots of chunks)
Vision capabilities (can see diagrams!)
Cost: ~$3 per million input tokens (~$15 per million output tokens)

Payments: Stripe

Why?

Industry standard for subscriptions
Excellent developer experience
Handles all payment complexity (tax, invoicing, dunning)
Webhooks for real-time subscription updates

System Architecture Overview

Here's the complete architecture of what we'll build:

┌─────────────────────────────────────────────────────────────┐
│                          USER                               │
│                    (Web Browser)                            │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                   NEXT.JS APPLICATION                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Landing  │  │Dashboard │  │ Pricing  │  │   Chat   │   │
│  │   Page   │  │   Page   │  │   Page   │  │   Page   │   │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │
└─────────────────────────────────────────────────────────────┘
                            │
                ┌───────────┼───────────┐
                ▼           ▼           ▼
        ┌──────────┐  ┌──────────┐  ┌──────────┐
        │  OpenAI  │  │Anthropic │  │  Stripe  │
        │Embeddings│  │  Claude  │  │ Payments │
        └──────────┘  └──────────┘  └──────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│              SUPABASE (PostgreSQL + Auth)                   │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ TABLES:                                              │  │
│  │  • profiles (users, subscription tiers)             │  │
│  │  • documents (textbook metadata)                    │  │
│  │  • document_chunks (text + embeddings)              │  │
│  │  • document_images (image descriptions + embeddings)│  │
│  │  • usage_logs (rate limiting & analytics)           │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ pgvector EXTENSION:                                  │  │
│  │  • Vector similarity search                          │  │
│  │  • IVFFlat indexing (100x faster)                   │  │
│  │  • Cosine distance operator (<=>)                   │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

The RAG Pipeline (Detailed)

Let's break down exactly what happens when a user asks a question:

Step 1: User Authentication & Rate Limiting

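// Runs inside an async route handler; getUser and checkUsageLimit are our own helpers
// Check if the user is authenticated
const user = await getUser()
if (!user) {
  return new Response('Please sign in first.', { status: 401 })
}

// Check subscription tier and daily usage
const { queries_today, limit_reached } = await checkUsageLimit(user.id)

if (limit_reached) {
  // 429 Too Many Requests tells the client it has hit a rate limit
  return new Response('Daily limit reached. Upgrade to Pro!', { status: 429 })
}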

Why this matters: Free users get 50 questions/day, Pro gets 500, Unlimited gets... unlimited. This is how we monetize.
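The decision at the heart of a helper like `checkUsageLimit` can be sketched as a pure function. The limits below are the ones quoted above; the `Tier` type, the `evaluateUsage` name, and the camelCase result shape are assumptions for illustration. The real helper would also need to load the user's tier and today's count from the database.

```typescript
type Tier = "free" | "pro" | "unlimited";

// Daily question limits per tier, from the numbers above. null means "no cap".
const DAILY_LIMITS: Record<Tier, number | null> = {
  free: 50,
  pro: 500,
  unlimited: null,
};

interface UsageStatus {
  queriesToday: number;
  limitReached: boolean;
  remaining: number | null; // null when the tier has no cap
}

// Given a tier and the number of questions already asked today,
// decide whether one more question should be allowed.
function evaluateUsage(tier: Tier, queriesToday: number): UsageStatus {
  const limit = DAILY_LIMITS[tier];
  if (limit === null) {
    return { queriesToday, limitReached: false, remaining: null };
  }
  return {
    queriesToday,
    limitReached: queriesToday >= limit,
    remaining: Math.max(0, limit - queriesToday),
  };
}

// Usage: a free user who has already asked 50 questions today is blocked
console.log(evaluateUsage("free", 50).limitReached); // true
console.log(evaluateUsage("pro", 50).limitReached); // false
```

Keeping the decision separate from the database lookup makes the tier rules easy to unit test and easy to change when pricing changes.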

Step 2: Generate Query Embedding

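import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// Convert the user's question into a 1536-dimensional vector
const response = await openai.embeddings.create({
  model: 'text-embedding-ada-002',
  input: 'What is a closure in JavaScript?'
})

// The vector itself is nested inside the response object
const embedding = response.data[0].embedding

// Result: [0.234, -0.567, 0.891, ..., 0.123] (1536 numbers)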

What's an embedding? It's a mathematical representation of meaning. Similar concepts have similar vectors. For example:

"closure" and "function scope" → vectors point in similar directions
"closure" and "banana" → vectors point in different directions

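"Pointing in similar directions" is what cosine similarity measures: the cosine of the angle between two vectors, ranging from 1 (same direction) through 0 (unrelated) to -1 (opposite). A small sketch follows. The three-dimensional vectors are made-up toy values standing in for real 1536-dimensional embeddings.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (invented for this example, not real model output)
const closureVec = [0.9, 0.8, 0.1];
const functionScopeVec = [0.8, 0.9, 0.2];
const bananaVec = [0.1, 0.0, 0.9];

console.log(cosineSimilarity(closureVec, functionScopeVec).toFixed(2)); // "0.99"
console.log(cosineSimilarity(closureVec, bananaVec).toFixed(2)); // "0.16"
```

Because the result depends only on direction, not length, two texts of very different sizes can still score as close matches if they are about the same thing.
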
Step 3: Vector Similarity Search

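-- Find the 5 closest chunks with cosine similarity > 0.5
-- (<=> returns cosine distance, so similarity = 1 - distance;
--  query_embedding is the vector from Step 2)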
SELECT
  content,
  page_number,
  1 - (embedding <=> query_embedding) AS similarity
FROM document_chunks
WHERE 1 - (embedding <=> query_embedding) > 0.5
ORDER BY embedding <=> query_embedding
LIMIT 5;
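In the query above, `query_embedding` is a placeholder for the vector from Step 2. One way to pass it in from application code is as a bound parameter in pgvector's text format (`[0.1,0.2,...]`). The sketch below only builds the query and does not execute it. The function names are hypothetical, and the `{ text, values }` shape assumes a driver with numbered placeholders such as node-postgres. With Supabase, the same SQL might instead live in a database function.

```typescript
// Serialize an embedding into pgvector's text format, e.g. "[0.1,0.2,0.3]".
function toPgVectorLiteral(embedding: number[]): string {
  for (const value of embedding) {
    if (!isFinite(value)) {
      throw new Error("Embedding contains a non-finite value");
    }
  }
  return "[" + embedding.join(",") + "]";
}

interface ParameterizedQuery {
  text: string;
  values: (string | number)[];
}

// Build the similarity query with bound parameters instead of string
// interpolation, so the embedding and limits are never spliced into the SQL.
function buildMatchQuery(
  embedding: number[],
  minSimilarity = 0.5,
  limit = 5
): ParameterizedQuery {
  return {
    text: [
      "SELECT content, page_number,",
      "       1 - (embedding <=> $1::vector) AS similarity",
      "FROM document_chunks",
      "WHERE 1 - (embedding <=> $1::vector) > $2",
      "ORDER BY embedding <=> $1::vector",
      "LIMIT $3",
    ].join("\n"),
    values: [toPgVectorLiteral(embedding), minSimilarity, limit],
  };
}

// Usage with a toy 3-dimensional vector
const matchQuery = buildMatchQuery([0.1, -0.2, 0.3]);
console.log(matchQuery.values[0]); // "[0.1,-0.2,0.3]"
```

Ordering by the raw `<=>` distance, rather than by the computed `similarity` column, mirrors the original query and keeps the `ORDER BY` in the form a pgvector index can serve.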

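The magic: with an IVFFlat index, pgvector can search millions of vectors in milliseconds. The index works like this:

Divide all vectors into clusters when the index is built (~100 in this example; the count is configurable)
Find which clusters are closest to the query
Search only within those clusters (not the entire database!)

Result: often 100-1000x faster than brute-force search on large tables. The trade-off is that IVFFlat is approximate: a true nearest neighbor sitting in a cluster that wasn't searched can be missed.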
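To make the three steps above concrete, here is a toy in-memory version of the same idea. It is a sketch, not pgvector's implementation: the centroids are supplied by hand (pgvector computes them with k-means at index build time), and it uses squared Euclidean distance for simplicity. All function names are made up for this example.

```typescript
type Vector = number[];

function squaredDistance(a: Vector, b: Vector): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

// Indices of the `count` centroids closest to v, nearest first.
function nearestCentroids(v: Vector, centroids: Vector[], count: number): number[] {
  return centroids
    .map((centroid, index) => ({ index, dist: squaredDistance(v, centroid) }))
    .sort((x, y) => x.dist - y.dist)
    .slice(0, count)
    .map((entry) => entry.index);
}

// Step 1: assign every vector id to the list of its nearest centroid.
function buildIvfIndex(vectors: Vector[], centroids: Vector[]): number[][] {
  if (centroids.length === 0) {
    throw new Error("At least one centroid is required");
  }
  const lists: number[][] = centroids.map(() => []);
  vectors.forEach((v, id) => {
    lists[nearestCentroids(v, centroids, 1)[0]].push(id);
  });
  return lists;
}

// Steps 2 and 3: pick the `probes` closest lists, then rank only their members.
function ivfSearch(
  query: Vector,
  vectors: Vector[],
  centroids: Vector[],
  lists: number[][],
  probes: number,
  k: number
): number[] {
  const candidates: number[] = [];
  for (const listIndex of nearestCentroids(query, centroids, probes)) {
    for (const id of lists[listIndex]) {
      candidates.push(id);
    }
  }
  return candidates
    .sort((x, y) => squaredDistance(query, vectors[x]) - squaredDistance(query, vectors[y]))
    .slice(0, k);
}

// Usage: two obvious clusters, one near (0,0) and one near (10,10)
const toyCentroids = [[0, 0], [10, 10]];
const toyVectors = [[0, 1], [1, 0], [9, 10], [10, 9], [1, 1]];
const toyLists = buildIvfIndex(toyVectors, toyCentroids);
console.log(ivfSearch([9, 9.5], toyVectors, toyCentroids, toyLists, 1, 2)); // [ 2, 3 ]
```

With `probes` set to 1, the query only compares against the two vectors in the second list and skips the other three entirely. That skipping is where the speedup comes from, and also why results are approximate. pgvector exposes the same two knobs as the `lists` index option and the `ivfflat.probes` setting.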

Cost Analysis (Real Numbers)

Let's talk money. Here's what it costs to run this system:

Ingestion (One-Time Per Textbook)

Component	Cost per Textbook	Details
Text Embeddings	$0.10	1,000 chunks × $0.0001 (assumes ~1K tokens per chunk)
Image Descriptions	$0.75	50 pages × $0.015
Total	$0.85	Per 300-page textbook

Query (Per Question)

Component	Cost per Query	Details
Query Embedding	<$0.0001	1 embedding (a short question is well under 1K tokens)
Claude Response	$0.015	~5K input tokens @ $3/1M
Total	~$0.015	~1.5¢ per question
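The arithmetic behind both tables fits in a few lines, which makes it easy to rerun with your own numbers. The prices are the ones quoted in this article, so check current provider pricing before relying on them. The function names and the 20-token question length are assumptions for illustration. As in the table, the query estimate counts input tokens only.

```typescript
// Prices quoted in this article, in USD
const EMBEDDING_PRICE_PER_1K_TOKENS = 0.0001;
const CLAUDE_INPUT_PRICE_PER_1M_TOKENS = 3;
const IMAGE_DESCRIPTION_PRICE = 0.015; // per page with images

// One-time cost to ingest a textbook
function ingestionCost(chunks: number, tokensPerChunk: number, imagePages: number): number {
  const embeddingCost = ((chunks * tokensPerChunk) / 1000) * EMBEDDING_PRICE_PER_1K_TOKENS;
  const imageCost = imagePages * IMAGE_DESCRIPTION_PRICE;
  return embeddingCost + imageCost;
}

// Cost of answering one question (input tokens only, as in the table)
function queryCost(questionTokens: number, promptTokens: number): number {
  const embeddingCost = (questionTokens / 1000) * EMBEDDING_PRICE_PER_1K_TOKENS;
  const generationCost = (promptTokens / 1000000) * CLAUDE_INPUT_PRICE_PER_1M_TOKENS;
  return embeddingCost + generationCost;
}

// Usage: the scenario from the tables
console.log(ingestionCost(1000, 1000, 50).toFixed(2)); // "0.85"
console.log((queryCost(20, 5000) * 100).toFixed(2)); // "1.50" per 100 questions
```

The query embedding is a rounding error next to the generation cost, so the number of tokens sent to Claude is the main lever on per-question cost.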

What Makes This System Production-Ready?

Many RAG tutorials show you how to build a proof-of-concept. This series builds a real SaaS product:

✅ Authentication: Secure user accounts with Supabase Auth
✅ Authorization: Row-level security to protect user data
✅ Monetization: Stripe subscriptions with three tiers
✅ Rate Limiting: Prevent abuse and manage costs
✅ Error Handling: Graceful failures with user feedback
✅ Logging: Track usage for debugging and analytics
✅ Cost Optimization: Techniques to reduce AI costs by 70%
✅ Scalability: Architecture that works from 10 to 10,000 users
✅ Vision Support: Beyond just text, understand images too

Coming Up in Part 2

In the next post, we'll roll up our sleeves and start coding. We'll set up:

Next.js 14 with App Router and TypeScript
Supabase project with PostgreSQL
User authentication (sign-up, sign-in, session management)
Protected routes and middleware
Basic dashboard UI with TailwindCSS

Estimated time: 1-2 hours to complete Part 2

Tags:
#RAG #AI #NextJS #PostgreSQL #Supabase #OpenAI #Claude #BuildInPublic