Building a Production RAG System: Part 3 - Understanding Vector Embeddings and pgvector

Series: Building a Production-Ready Textbook Q&A System with RAG
Part: 3 of 7 Read Time: 15 minutes Level: Beginner to Intermediate

What We'll Learn in This Part

By the end of this tutorial, you'll understand:

What vector embeddings are and how they work
How OpenAI's text-embedding-ada-002 converts text to vectors
Vector similarity and distance metrics (cosine, euclidean, dot product)
What pgvector is and why it's revolutionary
How IVFFlat indexing makes search 100x faster
The complete database schema for vector storage

Estimated time: 1 hour

The Problem: How Do Computers Understand Meaning?

Imagine you're building a search engine. A user searches for "puppy" and you have these documents:

"I adopted a cute dog yesterday"
"Python is a programming language"
"The young canine was playful"

Traditional keyword search would return ZERO results because none contain the word "puppy".

But a human instantly knows:

Document 1 is VERY relevant (dog = puppy)
Document 3 is VERY relevant (young canine = puppy)
Document 2 is NOT relevant (Python ≠ puppy)

How can we teach computers to understand that "puppy" ≈ "dog" ≈ "canine"?
The answer: Vector Embeddings

What Are Vector Embeddings?

A vector embedding is a list of numbers that represents the meaning of text.

Example: Simple 2D Embeddings

Imagine we could represent words in just 2 dimensions:

"puppy"    → [0.8, 0.9]   (high on "cute", high on "animal")
"dog"      → [0.7, 0.85]  (similar to puppy)
"kitten"   → [0.9, 0.6]   (high on "cute", medium on "animal")
"python"   → [0.1, 0.2]   (low on both)
"Ferrari"  → [0.3, 0.1]   (low on "cute", very low on "animal")

If we plot these on a graph:

1.0 │     puppy
    │      •
0.9 │    kitten
    │      •
0.8 │    dog
    │     •
    │
0.5 │
    │
0.3 │  Ferrari
    │    •
0.2 │ python
    │   •
0.0 └─────────────────
    0.0       1.0

Key insight: Words with similar meanings are close together in vector space!

Real Embeddings: 1536 Dimensions

OpenAI's text-embedding-ada-002 doesn't use 2 dimensions—it uses 1536 dimensions.

const embedding = await openai.embeddings.create({
  model: 'text-embedding-ada-002',
  input: 'What is a closure in JavaScript?'
})

console.log(embedding.data[0].embedding)
// [0.0234, -0.0567, 0.0891, ..., 0.0123]  (1536 numbers)

You can't visualize 1536 dimensions, but the same principle applies:

Similar concepts → vectors point in similar directions
Different concepts → vectors point in different directions

How Embeddings Capture Meaning

Embeddings learn relationships through training on massive text datasets. They capture:

1. Semantic Similarity

"JavaScript closure"    ≈ "function scope"
"buy groceries"         ≈ "purchase food"
"neural network"        ≈ "deep learning"

2. Analogies

"king" - "man" + "woman" ≈ "queen"
"Paris" - "France" + "Italy" ≈ "Rome"

3. Context

"bank" (financial institution) → different vector than
"bank" (river side)

The model understands context based on surrounding words!

Vector Similarity: Measuring Closeness

Given two vectors, how do we measure how "similar" they are?

Method 1: Cosine Similarity (Most Common)

Cosine similarity measures the angle between two vectors (ignoring magnitude).

cosine_similarity = (A · B) / (||A|| × ||B||)

Result ranges from -1 to 1:
  1   = identical direction (very similar)
  0   = perpendicular (unrelated)
 -1   = opposite direction (very different)

In pgvector, we use cosine distance = 1 - cosine_similarity

-- Lower distance = more similar
SELECT 1 - (embedding <=> query_embedding) AS similarity
FROM document_chunks
ORDER BY embedding <=> query_embedding
LIMIT 5;

Method 2: Euclidean Distance (L2)

Euclidean distance measures straight-line distance between points.

euclidean_distance = √((A₁-B₁)² + (A₂-B₂)² + ... + (Aₙ-Bₙ)²)

Shorter distance = more similar

Enter pgvector: Vector Search in PostgreSQL

pgvector is a PostgreSQL extension that adds vector data types and similarity search.

Why pgvector is Revolutionary

Before pgvector, to do vector search you needed:

A separate vector database (Pinecone, Weaviate, Qdrant)
Data synchronization between databases
Extra infrastructure and costs

With pgvector:

Vector search directly in PostgreSQL
Keep all data in one place (vectors + metadata + auth)
Use SQL joins, transactions, and RLS
Battle-tested PostgreSQL reliability

pgvector Data Types

pgvector adds a vector(n) data type:

CREATE TABLE items (
  id SERIAL PRIMARY KEY,
  embedding vector(1536)  -- 1536-dimensional vector
);

Important: The number in vector(1536) MUST match your embedding model's dimensions!

Distance Operators in pgvector

pgvector provides three distance operators:

Operator	Distance Type	SQL Example
`<->`	Euclidean (L2)	`embedding <-> '[0.1,0.2,0.3]'`
`<=>`	Cosine	`embedding <=> '[0.1,0.2,0.3]'`
`<#>`	Inner Product	`embedding <#> '[0.1,0.2,0.3]'`

For text similarity, we use cosine distance (<=>)

The Performance Problem: Brute Force Search

Imagine you have 1 million document chunks. To find the most similar chunk, you need to:

Calculate distance from query to chunk #1
Calculate distance from query to chunk #2
...
Calculate distance from query to chunk #1,000,000
Sort and return top 5

This is SLOW 🐢
For 1M chunks × 1536 dimensions = 1.5 billion calculations per query!

The Solution: IVFFlat Indexing

IVFFlat (Inverted File with Flat compression) is pgvector's indexing strategy.

How IVFFlat Works

Think of it like organizing books in a library:

Without index (brute force):

Search through all 1 million books one by one

With IVFFlat:

Divide books into 100 sections (clusters) by topic
When user asks for "JavaScript books":
- Figure out which sections likely have JavaScript books (maybe 3 sections)
- Search only those 3 sections (30,000 books instead of 1M)

Performance Gains

Vectors	Brute Force	IVFFlat	Speedup
10,000	50ms	5ms	10x
100,000	500ms	8ms	62x
1,000,000	5000ms	15ms	333x

Trade-off: Slight accuracy loss (might miss some results) for massive speed gain.

Creating the Vector Database Schema

Let's set up our complete schema with pgvector support.

Step 1: Enable pgvector Extension

In Supabase SQL Editor:

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

Step 2: Create Documents Table

-- Store textbook metadata
CREATE TABLE IF NOT EXISTS public.documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
    title TEXT NOT NULL,
    author TEXT,
    file_path TEXT NOT NULL,
    file_size INTEGER,
    page_count INTEGER,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Enable RLS
ALTER TABLE public.documents ENABLE ROW LEVEL SECURITY;

-- Users can only see their own documents
CREATE POLICY "Users can view own documents"
    ON public.documents FOR SELECT
    USING (auth.uid() = user_id);

Step 3: Create Document Chunks Table with Vectors

-- Store text chunks with embeddings
CREATE TABLE IF NOT EXISTS public.document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL REFERENCES public.documents(id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    page_number INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,
    embedding vector(1536),  -- OpenAI ada-002 embeddings
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(document_id, chunk_index)
);

-- Create IVFFlat index for vector similarity search
CREATE INDEX IF NOT EXISTS idx_document_chunks_embedding
    ON public.document_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

Breaking down the index:

USING ivfflat → Use IVFFlat indexing algorithm
vector_cosine_ops → Optimize for cosine distance
lists = 100 → Divide vectors into 100 clusters

Choosing the lists parameter:

Rule of thumb: lists = rows / 1000

Examples:
- 10,000 rows   → lists = 10
- 100,000 rows  → lists = 100
- 1,000,000 rows → lists = 1000

Step 4: Create the Search Function

-- Function to search for similar chunks
CREATE OR REPLACE FUNCTION match_document_chunks(
    query_embedding vector(1536),
    match_threshold float DEFAULT 0.5,
    match_count int DEFAULT 5,
    filter_document_id uuid DEFAULT NULL
)
RETURNS TABLE (
    id UUID,
    document_id UUID,
    content TEXT,
    page_number INTEGER,
    similarity FLOAT,
    document_title TEXT,
    document_author TEXT
)
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN QUERY
    SELECT
        dc.id,
        dc.document_id,
        dc.content,
        dc.page_number,
        1 - (dc.embedding <=> query_embedding) AS similarity,
        d.title AS document_title,
        d.author AS document_author
    FROM document_chunks dc
    JOIN documents d ON dc.document_id = d.id
    WHERE
        (filter_document_id IS NULL OR dc.document_id = filter_document_id)
        AND 1 - (dc.embedding <=> query_embedding) > match_threshold
        AND d.user_id = auth.uid()
    ORDER BY dc.embedding <=> query_embedding
    LIMIT match_count;
END;
$$;

How the Search Works (Step by Step)

Let's trace what happens when a user asks: "What is a closure?"

Step 1: Generate Query Embedding

const response = await openai.embeddings.create({
  model: 'text-embedding-ada-002',
  input: 'What is a closure?'
})

const queryEmbedding = response.data[0].embedding
// [0.0234, -0.0567, ..., 0.0123] (1536 numbers)

Step 2: Convert to pgvector Format

const embeddingString = `[${queryEmbedding.join(',')}]`
// "[0.0234,-0.0567,...,0.0123]"

Critical: pgvector expects a string format, not a JSON array!

Step 3: Call the Search Function

const { data, error } = await supabase.rpc('match_document_chunks', {
  query_embedding: embeddingString,
  match_threshold: 0.5,
  match_count: 5
})

Step 4: pgvector Searches Using IVFFlat

1. Find 3 closest clusters to query vector
2. Search only those 3 clusters (say, 3000 chunks)
3. Calculate exact cosine distance for those 3000
4. Filter where similarity > 0.5
5. Sort by similarity
6. Return top 5

Result in ~10-20ms instead of 500ms+ 🚀

Visualizing How It All Fits Together

USER QUESTION
    "What is a closure?"
         ↓
    [OpenAI API]
         ↓
  Query Embedding
    [0.234, -0.567, ..., 0.123] (1536D)
         ↓
    [pgvector]
         ↓
  IVFFlat Index
    • Finds 3 closest clusters
    • Searches ~3000 chunks
    • Calculates cosine similarity
         ↓
  Top 5 Matches
    1. "A closure is a function..." (similarity: 0.89)
    2. "Functions in JavaScript..." (similarity: 0.82)
    3. "Lexical scope means..." (similarity: 0.76)
    4. "Here's a closure example..." (similarity: 0.71)
    5. "Inner functions have access..." (similarity: 0.68)
         ↓
    [Build Context]
         ↓
    [Send to Claude]
         ↓
   ANSWER WITH CITATIONS

Common Pitfalls and Solutions

Pitfall 1: Wrong Embedding Dimensions

-- ❌ Error: dimension mismatch
CREATE TABLE chunks (embedding vector(768))  -- Wrong!

-- ✅ Correct
CREATE TABLE chunks (embedding vector(1536))  -- Matches ada-002

Pitfall 2: Storing as JSON Array

// ❌ Wrong: Postgres sees this as text, not vector
await supabase.from('chunks').insert({
  embedding: [0.1, 0.2, 0.3]  // JSON array
})

// ✅ Correct: Convert to pgvector string format
await supabase.from('chunks').insert({
  embedding: `[${embedding.join(',')}]`  // String format
})

Pitfall 3: Not Creating Index

❌ Without index: searches take 500ms+
✅ With index: searches take 10-20ms

Always create the index!

What We Accomplished

Let's review what we learned:

Vector Embeddings - How they represent meaning as numbers
pgvector - Revolutionary vector search in PostgreSQL
IVFFlat Indexing - 100x faster search through clustering
Database Schema - Complete setup with RLS and search function

Coming Up in Part 4

In Part 4: Building the PDF Ingestion Pipeline, we'll:

Extract text from PDF files using pdf-parse
Implement smart chunking with overlap
Generate embeddings in batches
Store vectors in the correct pgvector format
Handle edge cases (null bytes, duplicates)
Build a CLI tool for ingestion

Estimated time: 2-3 hours to complete Part 4

Homework Challenge

Before moving to Part 4, try these exercises:

Experiment with similarity thresholds: What happens if you set match_threshold to 0.3 vs 0.7?
Test different chunk sizes: Try 500 chars vs 1500 chars - which works better?
Compare distance metrics: Run the same query with cosine (<=>) and euclidean (<->)
Benchmark your index: Use EXPLAIN ANALYZE to compare query times

Key Takeaways

🎯 Embeddings transform text into math - enabling semantic search

🎯 pgvector brings vector search to PostgreSQL - no separate database needed

🎯 IVFFlat indexing makes search fast - 100-1000x speedup vs brute force

🎯 Cosine distance is best for text - measures similarity regardless of magnitude

Tags:
#Embeddings #pgvector #PostgreSQL #VectorSearch #OpenAI #RAG