By the end of this tutorial, you'll understand:
Estimated time: 1 hour
Imagine you're building a search engine. A user searches for "puppy" and you have these documents:
Traditional keyword search would return ZERO results because none contain the word "puppy".
But a human instantly knows:
A vector embedding is a list of numbers that represents the meaning of text.
Imagine we could represent words in just 2 dimensions:
"puppy" → [0.8, 0.9] (high on "cute", high on "animal")
"dog" → [0.7, 0.85] (similar to puppy)
"kitten" → [0.9, 0.6] (high on "cute", medium on "animal")
"python" → [0.1, 0.2] (low on both)
"Ferrari" → [0.3, 0.1] (low on "cute", very low on "animal")
If we plot these on a graph:
1.0 │ puppy
│ •
0.9 │ kitten
│ •
0.8 │ dog
│ •
│
0.5 │
│
0.3 │ Ferrari
│ •
0.2 │ python
│ •
0.0 └─────────────────
0.0 1.0
OpenAI's text-embedding-ada-002 doesn't use 2 dimensions—it uses 1536 dimensions.
const embedding = await openai.embeddings.create({
model: 'text-embedding-ada-002',
input: 'What is a closure in JavaScript?'
})
console.log(embedding.data[0].embedding)
// [0.0234, -0.0567, 0.0891, ..., 0.0123] (1536 numbers)
You can't visualize 1536 dimensions, but the same principle applies:
Embeddings learn relationships through training on massive text datasets. They capture:
"JavaScript closure" ≈ "function scope"
"buy groceries" ≈ "purchase food"
"neural network" ≈ "deep learning"
"king" - "man" + "woman" ≈ "queen"
"Paris" - "France" + "Italy" ≈ "Rome"
"bank" (financial institution) → different vector than
"bank" (river side)
The model understands context based on surrounding words!
Given two vectors, how do we measure how "similar" they are?
Cosine similarity measures the angle between two vectors (ignoring magnitude).
cosine_similarity = (A · B) / (||A|| × ||B||)
Result ranges from -1 to 1:
1 = identical direction (very similar)
0 = perpendicular (unrelated)
-1 = opposite direction (very different)
1 - cosine_similarity
-- Lower distance = more similar
SELECT 1 - (embedding <=> query_embedding) AS similarity
FROM document_chunks
ORDER BY embedding <=> query_embedding
LIMIT 5;
Euclidean distance measures straight-line distance between points.
euclidean_distance = √((A₁-B₁)² + (A₂-B₂)² + ... + (Aₙ-Bₙ)²)
Shorter distance = more similar
pgvector is a PostgreSQL extension that adds vector data types and similarity search.
Before pgvector, to do vector search you needed:
With pgvector:
pgvector adds a vector(n) data type:
CREATE TABLE items (
id SERIAL PRIMARY KEY,
embedding vector(1536) -- 1536-dimensional vector
);
vector(1536) MUST match your embedding model's dimensions!
pgvector provides three distance operators:
| Operator | Distance Type | SQL Example |
|---|---|---|
<-> |
Euclidean (L2) | embedding <-> '[0.1,0.2,0.3]' |
<=> |
Cosine | embedding <=> '[0.1,0.2,0.3]' |
<#> |
Inner Product | embedding <#> '[0.1,0.2,0.3]' |
<=>)
Imagine you have 1 million document chunks. To find the most similar chunk, you need to:
IVFFlat (Inverted File with Flat compression) is pgvector's indexing strategy.
Think of it like organizing books in a library:
Without index (brute force):
With IVFFlat:
| Vectors | Brute Force | IVFFlat | Speedup |
|---|---|---|---|
| 10,000 | 50ms | 5ms | 10x |
| 100,000 | 500ms | 8ms | 62x |
| 1,000,000 | 5000ms | 15ms | 333x |
Let's set up our complete schema with pgvector support.
In Supabase SQL Editor:
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Store textbook metadata
CREATE TABLE IF NOT EXISTS public.documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
title TEXT NOT NULL,
author TEXT,
file_path TEXT NOT NULL,
file_size INTEGER,
page_count INTEGER,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- Enable RLS
ALTER TABLE public.documents ENABLE ROW LEVEL SECURITY;
-- Users can only see their own documents
CREATE POLICY "Users can view own documents"
ON public.documents FOR SELECT
USING (auth.uid() = user_id);
-- Store text chunks with embeddings
CREATE TABLE IF NOT EXISTS public.document_chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID NOT NULL REFERENCES public.documents(id) ON DELETE CASCADE,
content TEXT NOT NULL,
page_number INTEGER NOT NULL,
chunk_index INTEGER NOT NULL,
embedding vector(1536), -- OpenAI ada-002 embeddings
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(document_id, chunk_index)
);
-- Create IVFFlat index for vector similarity search
CREATE INDEX IF NOT EXISTS idx_document_chunks_embedding
ON public.document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
USING ivfflat → Use IVFFlat indexing algorithmvector_cosine_ops → Optimize for cosine distancelists = 100 → Divide vectors into 100 clustersChoosing the lists parameter:
Rule of thumb: lists = rows / 1000
Examples:
- 10,000 rows → lists = 10
- 100,000 rows → lists = 100
- 1,000,000 rows → lists = 1000
-- Function to search for similar chunks
CREATE OR REPLACE FUNCTION match_document_chunks(
query_embedding vector(1536),
match_threshold float DEFAULT 0.5,
match_count int DEFAULT 5,
filter_document_id uuid DEFAULT NULL
)
RETURNS TABLE (
id UUID,
document_id UUID,
content TEXT,
page_number INTEGER,
similarity FLOAT,
document_title TEXT,
document_author TEXT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
dc.id,
dc.document_id,
dc.content,
dc.page_number,
1 - (dc.embedding <=> query_embedding) AS similarity,
d.title AS document_title,
d.author AS document_author
FROM document_chunks dc
JOIN documents d ON dc.document_id = d.id
WHERE
(filter_document_id IS NULL OR dc.document_id = filter_document_id)
AND 1 - (dc.embedding <=> query_embedding) > match_threshold
AND d.user_id = auth.uid()
ORDER BY dc.embedding <=> query_embedding
LIMIT match_count;
END;
$$;
Let's trace what happens when a user asks: "What is a closure?"
const response = await openai.embeddings.create({
model: 'text-embedding-ada-002',
input: 'What is a closure?'
})
const queryEmbedding = response.data[0].embedding
// [0.0234, -0.0567, ..., 0.0123] (1536 numbers)
const embeddingString = `[${queryEmbedding.join(',')}]`
// "[0.0234,-0.0567,...,0.0123]"
const { data, error } = await supabase.rpc('match_document_chunks', {
query_embedding: embeddingString,
match_threshold: 0.5,
match_count: 5
})
1. Find 3 closest clusters to query vector
2. Search only those 3 clusters (say, 3000 chunks)
3. Calculate exact cosine distance for those 3000
4. Filter where similarity > 0.5
5. Sort by similarity
6. Return top 5
Result in ~10-20ms instead of 500ms+ 🚀
USER QUESTION
"What is a closure?"
↓
[OpenAI API]
↓
Query Embedding
[0.234, -0.567, ..., 0.123] (1536D)
↓
[pgvector]
↓
IVFFlat Index
• Finds 3 closest clusters
• Searches ~3000 chunks
• Calculates cosine similarity
↓
Top 5 Matches
1. "A closure is a function..." (similarity: 0.89)
2. "Functions in JavaScript..." (similarity: 0.82)
3. "Lexical scope means..." (similarity: 0.76)
4. "Here's a closure example..." (similarity: 0.71)
5. "Inner functions have access..." (similarity: 0.68)
↓
[Build Context]
↓
[Send to Claude]
↓
ANSWER WITH CITATIONS
-- ❌ Error: dimension mismatch
CREATE TABLE chunks (embedding vector(768)) -- Wrong!
-- ✅ Correct
CREATE TABLE chunks (embedding vector(1536)) -- Matches ada-002
// ❌ Wrong: Postgres sees this as text, not vector
await supabase.from('chunks').insert({
embedding: [0.1, 0.2, 0.3] // JSON array
})
// ✅ Correct: Convert to pgvector string format
await supabase.from('chunks').insert({
embedding: `[${embedding.join(',')}]` // String format
})
Let's review what we learned:
Before moving to Part 4, try these exercises:
match_threshold to 0.3 vs 0.7?<=>) and euclidean (<->)EXPLAIN ANALYZE to compare query times🎯 Embeddings transform text into math - enabling semantic search
🎯 pgvector brings vector search to PostgreSQL - no separate database needed
🎯 IVFFlat indexing makes search fast - 100-1000x speedup vs brute force
🎯 Cosine distance is best for text - measures similarity regardless of magnitude