Building a Production RAG System: Part 3 - Understanding Vector Embeddings and pgvector

Series: Building a Production-Ready Textbook Q&A System with RAG
Part: 3 of 7 Read Time: 15 minutes Level: Beginner to Intermediate

What We'll Learn in This Part

By the end of this tutorial, you'll understand:

Estimated time: 1 hour

The Problem: How Do Computers Understand Meaning?

Imagine you're building a search engine. A user searches for "puppy" and you have these documents:

  1. "I adopted a cute dog yesterday"
  2. "Python is a programming language"
  3. "The young canine was playful"

Traditional keyword search would return ZERO results because none contain the word "puppy".

But a human instantly knows:

How can we teach computers to understand that "puppy" ≈ "dog" ≈ "canine"?
The answer: Vector Embeddings

What Are Vector Embeddings?

A vector embedding is a list of numbers that represents the meaning of text.

Example: Simple 2D Embeddings

Imagine we could represent words in just 2 dimensions:

"puppy"    → [0.8, 0.9]   (high on "cute", high on "animal")
"dog"      → [0.7, 0.85]  (similar to puppy)
"kitten"   → [0.9, 0.6]   (high on "cute", medium on "animal")
"python"   → [0.1, 0.2]   (low on both)
"Ferrari"  → [0.3, 0.1]   (low on "cute", very low on "animal")

If we plot these on a graph:

1.0 │     puppy
    │      •
0.9 │    kitten
    │      •
0.8 │    dog
    │     •
    │
0.5 │
    │
0.3 │  Ferrari
    │    •
0.2 │ python
    │   •
0.0 └─────────────────
    0.0       1.0
            
Key insight: Words with similar meanings are close together in vector space!

Real Embeddings: 1536 Dimensions

OpenAI's text-embedding-ada-002 doesn't use 2 dimensions—it uses 1536 dimensions.

const embedding = await openai.embeddings.create({
  model: 'text-embedding-ada-002',
  input: 'What is a closure in JavaScript?'
})

console.log(embedding.data[0].embedding)
// [0.0234, -0.0567, 0.0891, ..., 0.0123]  (1536 numbers)

You can't visualize 1536 dimensions, but the same principle applies:

How Embeddings Capture Meaning

Embeddings learn relationships through training on massive text datasets. They capture:

1. Semantic Similarity

"JavaScript closure"    ≈ "function scope"
"buy groceries"         ≈ "purchase food"
"neural network"        ≈ "deep learning"

2. Analogies

"king" - "man" + "woman" ≈ "queen"
"Paris" - "France" + "Italy" ≈ "Rome"

3. Context

"bank" (financial institution) → different vector than
"bank" (river side)

The model understands context based on surrounding words!

Vector Similarity: Measuring Closeness

Given two vectors, how do we measure how "similar" they are?

Method 1: Cosine Similarity (Most Common)

Cosine similarity measures the angle between two vectors (ignoring magnitude).

cosine_similarity = (A · B) / (||A|| × ||B||)

Result ranges from -1 to 1:
  1   = identical direction (very similar)
  0   = perpendicular (unrelated)
 -1   = opposite direction (very different)
In pgvector, we use cosine distance = 1 - cosine_similarity
-- Lower distance = more similar
SELECT 1 - (embedding <=> query_embedding) AS similarity
FROM document_chunks
ORDER BY embedding <=> query_embedding
LIMIT 5;

Method 2: Euclidean Distance (L2)

Euclidean distance measures straight-line distance between points.

euclidean_distance = √((A₁-B₁)² + (A₂-B₂)² + ... + (Aₙ-Bₙ)²)

Shorter distance = more similar

Enter pgvector: Vector Search in PostgreSQL

pgvector is a PostgreSQL extension that adds vector data types and similarity search.

Why pgvector is Revolutionary

Before pgvector, to do vector search you needed:

With pgvector:

pgvector Data Types

pgvector adds a vector(n) data type:

CREATE TABLE items (
  id SERIAL PRIMARY KEY,
  embedding vector(1536)  -- 1536-dimensional vector
);
Important: The number in vector(1536) MUST match your embedding model's dimensions!

Distance Operators in pgvector

pgvector provides three distance operators:

Operator Distance Type SQL Example
<-> Euclidean (L2) embedding <-> '[0.1,0.2,0.3]'
<=> Cosine embedding <=> '[0.1,0.2,0.3]'
<#> Inner Product embedding <#> '[0.1,0.2,0.3]'
For text similarity, we use cosine distance (<=>)

The Performance Problem: Brute Force Search

Imagine you have 1 million document chunks. To find the most similar chunk, you need to:

  1. Calculate distance from query to chunk #1
  2. Calculate distance from query to chunk #2
  3. ...
  4. Calculate distance from query to chunk #1,000,000
  5. Sort and return top 5
This is SLOW 🐢
For 1M chunks × 1536 dimensions = 1.5 billion calculations per query!

The Solution: IVFFlat Indexing

IVFFlat (Inverted File with Flat compression) is pgvector's indexing strategy.

How IVFFlat Works

Think of it like organizing books in a library:

Without index (brute force):

With IVFFlat:

  1. Divide books into 100 sections (clusters) by topic
  2. When user asks for "JavaScript books":
    • Figure out which sections likely have JavaScript books (maybe 3 sections)
    • Search only those 3 sections (30,000 books instead of 1M)

Performance Gains

Vectors Brute Force IVFFlat Speedup
10,000 50ms 5ms 10x
100,000 500ms 8ms 62x
1,000,000 5000ms 15ms 333x
Trade-off: Slight accuracy loss (might miss some results) for massive speed gain.

Creating the Vector Database Schema

Let's set up our complete schema with pgvector support.

Step 1: Enable pgvector Extension

In Supabase SQL Editor:

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

Step 2: Create Documents Table

-- Store textbook metadata
CREATE TABLE IF NOT EXISTS public.documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
    title TEXT NOT NULL,
    author TEXT,
    file_path TEXT NOT NULL,
    file_size INTEGER,
    page_count INTEGER,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Enable RLS
ALTER TABLE public.documents ENABLE ROW LEVEL SECURITY;

-- Users can only see their own documents
CREATE POLICY "Users can view own documents"
    ON public.documents FOR SELECT
    USING (auth.uid() = user_id);

Step 3: Create Document Chunks Table with Vectors

-- Store text chunks with embeddings
CREATE TABLE IF NOT EXISTS public.document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL REFERENCES public.documents(id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    page_number INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,
    embedding vector(1536),  -- OpenAI ada-002 embeddings
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(document_id, chunk_index)
);

-- Create IVFFlat index for vector similarity search
CREATE INDEX IF NOT EXISTS idx_document_chunks_embedding
    ON public.document_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
Breaking down the index:

Choosing the lists parameter:

Rule of thumb: lists = rows / 1000

Examples:
- 10,000 rows   → lists = 10
- 100,000 rows  → lists = 100
- 1,000,000 rows → lists = 1000

Step 4: Create the Search Function

-- Function to search for similar chunks
CREATE OR REPLACE FUNCTION match_document_chunks(
    query_embedding vector(1536),
    match_threshold float DEFAULT 0.5,
    match_count int DEFAULT 5,
    filter_document_id uuid DEFAULT NULL
)
RETURNS TABLE (
    id UUID,
    document_id UUID,
    content TEXT,
    page_number INTEGER,
    similarity FLOAT,
    document_title TEXT,
    document_author TEXT
)
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN QUERY
    SELECT
        dc.id,
        dc.document_id,
        dc.content,
        dc.page_number,
        1 - (dc.embedding <=> query_embedding) AS similarity,
        d.title AS document_title,
        d.author AS document_author
    FROM document_chunks dc
    JOIN documents d ON dc.document_id = d.id
    WHERE
        (filter_document_id IS NULL OR dc.document_id = filter_document_id)
        AND 1 - (dc.embedding <=> query_embedding) > match_threshold
        AND d.user_id = auth.uid()
    ORDER BY dc.embedding <=> query_embedding
    LIMIT match_count;
END;
$$;

How the Search Works (Step by Step)

Let's trace what happens when a user asks: "What is a closure?"

Step 1: Generate Query Embedding

const response = await openai.embeddings.create({
  model: 'text-embedding-ada-002',
  input: 'What is a closure?'
})

const queryEmbedding = response.data[0].embedding
// [0.0234, -0.0567, ..., 0.0123] (1536 numbers)

Step 2: Convert to pgvector Format

const embeddingString = `[${queryEmbedding.join(',')}]`
// "[0.0234,-0.0567,...,0.0123]"
Critical: pgvector expects a string format, not a JSON array!

Step 3: Call the Search Function

const { data, error } = await supabase.rpc('match_document_chunks', {
  query_embedding: embeddingString,
  match_threshold: 0.5,
  match_count: 5
})

Step 4: pgvector Searches Using IVFFlat

1. Find 3 closest clusters to query vector
2. Search only those 3 clusters (say, 3000 chunks)
3. Calculate exact cosine distance for those 3000
4. Filter where similarity > 0.5
5. Sort by similarity
6. Return top 5

Result in ~10-20ms instead of 500ms+ 🚀
            

Visualizing How It All Fits Together

USER QUESTION
    "What is a closure?"
         ↓
    [OpenAI API]
         ↓
  Query Embedding
    [0.234, -0.567, ..., 0.123] (1536D)
         ↓
    [pgvector]
         ↓
  IVFFlat Index
    • Finds 3 closest clusters
    • Searches ~3000 chunks
    • Calculates cosine similarity
         ↓
  Top 5 Matches
    1. "A closure is a function..." (similarity: 0.89)
    2. "Functions in JavaScript..." (similarity: 0.82)
    3. "Lexical scope means..." (similarity: 0.76)
    4. "Here's a closure example..." (similarity: 0.71)
    5. "Inner functions have access..." (similarity: 0.68)
         ↓
    [Build Context]
         ↓
    [Send to Claude]
         ↓
   ANSWER WITH CITATIONS
            

Common Pitfalls and Solutions

Pitfall 1: Wrong Embedding Dimensions

-- ❌ Error: dimension mismatch
CREATE TABLE chunks (embedding vector(768))  -- Wrong!

-- ✅ Correct
CREATE TABLE chunks (embedding vector(1536))  -- Matches ada-002

Pitfall 2: Storing as JSON Array

// ❌ Wrong: Postgres sees this as text, not vector
await supabase.from('chunks').insert({
  embedding: [0.1, 0.2, 0.3]  // JSON array
})

// ✅ Correct: Convert to pgvector string format
await supabase.from('chunks').insert({
  embedding: `[${embedding.join(',')}]`  // String format
})

Pitfall 3: Not Creating Index

❌ Without index: searches take 500ms+
✅ With index: searches take 10-20ms

Always create the index!

What We Accomplished

Let's review what we learned:

Homework Challenge

Before moving to Part 4, try these exercises:

  1. Experiment with similarity thresholds: What happens if you set match_threshold to 0.3 vs 0.7?
  2. Test different chunk sizes: Try 500 chars vs 1500 chars - which works better?
  3. Compare distance metrics: Run the same query with cosine (<=>) and euclidean (<->)
  4. Benchmark your index: Use EXPLAIN ANALYZE to compare query times

Key Takeaways

🎯 Embeddings transform text into math - enabling semantic search

🎯 pgvector brings vector search to PostgreSQL - no separate database needed

🎯 IVFFlat indexing makes search fast - 100-1000x speedup vs brute force

🎯 Cosine distance is best for text - measures similarity regardless of magnitude

Tags:
#Embeddings #pgvector #PostgreSQL #VectorSearch #OpenAI #RAG