The 'Brain' of RAG: A Guide to Embeddings & Vector Databases
In our last posts, we've built RAG pipelines (see our introduction to RAG), chosen frameworks (see our RAG framework comparison), and even designed agents (see our agent framework comparison). We've used terms like "embeddings" and "vector databases" as if they're magic boxes.
This post is for you if you've ever stopped and asked, "But how does it actually work? How does a computer 'find' the right chunk of text?"
Understanding this is the single biggest step you can take in going from AI user to AI engineer. We're going to open the black box, piece by piece, and see how the "brain" of RAG really thinks.
The Core Problem: Computers can't read "meaning"
Let's start with a simple problem. A user searches your knowledge base for "king".
A traditional "keyword" search (like Ctrl+F) will find this:
- "The king sat on the throne."
- "I am king of the world!"
But it will miss this:
- "The queen ruled the land."
- "A monarch's duty is to their people."
- "His majesty entered the court."
To a computer, the strings "king" and "queen" are as different as "apple" and "banana". It has no concept of "royalty" or "meaning". This is the failure of keyword search.
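To see this concretely, here's a minimal sketch of a keyword search (the sentences and the substring check are purely illustrative):

```python
# Naive keyword search: exact substring matching only.
documents = [
    "The king sat on the throne.",
    "I am king of the world!",
    "The queen ruled the land.",
    "A monarch's duty is to their people.",
    "His majesty entered the court.",
]

query = "king"
matches = [doc for doc in documents if query in doc.lower()]
print(matches)
# ['The king sat on the throne.', 'I am king of the world!']
# The queen, monarch, and majesty sentences are never found,
# even though they are about exactly the same topic.
```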
To build a "smart" RAG, we need to solve two problems:
- The "Translation" Problem: How do we translate text "meaning" into a format (numbers) that a computer can understand?
- The "Search" Problem: Once we have millions of documents in this number format, how do we search them instantly?
1. The "Translation" (Understanding Embeddings)
This is the first piece of the puzzle. We solve the translation problem with a special type of AI model called an Embedding Model.
An embedding model is a "translator". It has been trained on billions of sentences, and its only job is to read a piece of text and convert its "meaning" into a list of numbers called a vector.
Think of it as a "coordinate" on a giant map of meaning.
- The text "king" might be translated to the vector
[0.1, 0.8, -0.2, ...] - The text "queen" might be
[0.2, 0.7, -0.1, ...] - The text "apple" might be
[-0.9, 0.1, 0.5, ...]
When these vectors are plotted, "king" and "queen" will be extremely close to each other, while "apple" will be on the other side of the map.
graph TD
A["Text: 'king'"] --> B[Embedding Model]
B --> C["Vector: [0.1, 0.8, -0.2, ...]"]
D["Text: 'queen'"] --> B
B --> E["Vector: [0.2, 0.7, -0.1, ...]"]
F["Text: 'apple'"] --> B
B --> G["Vector: [-0.9, 0.1, 0.5, ...]"]
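What does "close" actually mean here? In practice it's usually measured with cosine similarity: the dot product of two vectors divided by the product of their lengths, which is high when the vectors point in the same direction. Here's a tiny sketch using made-up 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_sim(a, b):
    # Dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" (illustrative numbers only)
king  = np.array([0.1, 0.8, -0.2])
queen = np.array([0.2, 0.7, -0.1])
apple = np.array([-0.9, 0.1, 0.5])

print(cosine_sim(king, queen))  # high: the vectors point in nearly the same direction
print(cosine_sim(king, apple))  # low: the vectors point in very different directions
```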
Making it Real: How to create an Embedding
You don't need to train this model yourself. You just load a pre-trained, open-source one through a library like sentence-transformers. For more on embedding models, see our vector databases guide.
from sentence_transformers import SentenceTransformer
# 1. Load a pre-trained "translator" model
# This model converts text into a 384-dimension vector
model = SentenceTransformer('all-MiniLM-L6-v2')
# 2. Define our sentences
sentences = [
"The king sat on the throne.",
"The queen ruled the land.",
"I ate a red apple."
]
# 3. "Encode" them into vectors
embeddings = model.encode(sentences)
print(f"Shape of our embeddings: {embeddings.shape}")
# Output: Shape of our embeddings: (3, 384)
# This means we have 3 vectors, each 384 numbers long.
# Let's see the "distance" between them
from sklearn.metrics.pairwise import cosine_similarity
# Compare "king" and "queen"
sim_king_queen = cosine_similarity([embeddings[0]], [embeddings[1]])
# Compare "king" and "apple"
sim_king_apple = cosine_similarity([embeddings[0]], [embeddings[2]])
print(f"King vs. Queen Similarity: {sim_king_queen[0][0]:.4f}")
print(f"King vs. Apple Similarity: {sim_king_apple[0][0]:.4f}")
Observation:
When you run this, you'll see:
- King vs. Queen Similarity: 0.7588 (a very high score!)
- King vs. Apple Similarity: 0.1044 (a very low score!)
We have just shown, with actual numbers, that the model "understands" that "king" and "queen" are related. This is the magic of semantic search.
Think About It: An embedding model is the "translator" that turns all your documents into a massive list of "coordinates". Now we have a new problem: if you have 10 million documents (10 million vectors), how do you find the closest one to your query?
2. The "Phone Book" Problem (Why for loops fail)
So, we have our 10 million document vectors. A user asks a question.
- We translate the user's question into a query_vector.
- We now have to find the closest document vector to our query vector.
The "naive" or "brute-force" way to do this is a simple for loop.
# The "Brute Force" way. DO NOT DO THIS.
def find_closest_vector(query_vector, all_document_vectors):
best_similarity = -1
best_document = None
# This loop is the problem
for doc_vector in all_document_vectors:
# This one calculation is fast...
similarity = cosine_similarity(query_vector, doc_vector)
if similarity > best_similarity:
best_similarity = similarity
best_document = doc_vector
return best_document
# If all_document_vectors has 10,000,000 items,
# this loop will take... hours.
This is a "Full Scan" or Exact Nearest Neighbor (ENN) search. It is 100% accurate, but it is impossibly slow. It's like finding a phone number by reading the entire phone book, line by line.
We need a "GPS".
3. The "GPS" (Understanding Vector Databases)
A Vector Database (like Chroma, Qdrant, or Pinecone) is a specialized tool built to do one thing: solve the "phone book" problem instantly. For choosing the right vector database, see our vector database comparison guide.
It doesn't use a for loop. It uses a magic trick called Approximate Nearest Neighbor (ANN) search.
The ANN algorithm is a "shortcut". Instead of checking all 10 million vectors, it builds a smart "map" of your data ahead of time. A popular algorithm is HNSW (Hierarchical Navigable Small Worlds).
Here's a simple analogy for how HNSW works:
- The "Map": When you add your 10 million vectors, the database builds a multi-layered graph. It's like creating a "Country" layer, a "State" layer, a "City" layer, and a "Street" layer.
- The "Search": When your query_vector ("find 'king'") comes in:
  - It starts at the "Country" layer (e.g., "Food" vs. "History" vs. "People"). It finds the closest "country" is "People".
  - It drops down to the "State" layer within "People" (e.g., "Politics" vs. "Art" vs. "Science"). It finds the closest "state" is "Politics".
  - It drops down to the "City" layer within "Politics" (e.g., "Elections" vs. "Royalty"). It finds "Royalty".
  - It drops to the "Street" layer and quickly scans the 50 vectors on that "street" to find the exact closest one: "queen".
Instead of 10,000,000 comparisons, it only did about 30.
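Why roughly 30 and not thousands? Each layer of the graph shrinks the search space dramatically, so the number of hops grows roughly with the logarithm of the collection size rather than the size itself. A quick back-of-the-envelope check (the exact count depends on how the index is built):

```python
import math

# A layered graph search needs on the order of log2(N) hops,
# each checking only a handful of neighbours,
# instead of N full comparisons.
print(math.log2(10_000_000))  # ≈ 23.3
```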
Full Scan (Slow):
graph TD
A[Query] --> B[1?]
B --> C[2?]
C --> D[3?]
D --> E[...]
E --> F[10,000,000?]
F --> G[Answer]
style A fill:#ffebee,stroke:#b71c1c
style G fill:#ffebee,stroke:#b71c1c
ANN Search (Fast):
graph TD
H[Query] --> I[Country Layer]
I --> J[State Layer]
J --> K[City Layer]
K --> L[Street Layer]
L --> M[Scan 50 vectors]
M --> N[Answer]
style H fill:#e8f5e9,stroke:#388e3c
style N fill:#e8f5e9,stroke:#388e3c
This is why it's called "Approximate". It's possible the perfect answer was on a different "street" in a different "city". But it's 99.9% likely to find a good enough answer in milliseconds instead of hours.
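You almost never implement HNSW yourself; the vector database (or a library) does it for you. As a rough illustration, this is what building and querying an HNSW index looks like with the hnswlib library; the parameter values are illustrative, not tuned recommendations:

```python
import hnswlib
import numpy as np

dim = 384                 # matches the all-MiniLM-L6-v2 embedding size
num_vectors = 10_000      # pretend these are our document embeddings
doc_vectors = np.random.rand(num_vectors, dim).astype("float32")

# 1. Build the "map" once, at ingestion time
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(doc_vectors, ids=np.arange(num_vectors))

# 2. Search it at query time
index.set_ef(50)          # how widely to look around; higher = more accurate but slower
query_vector = np.random.rand(1, dim).astype("float32")
labels, distances = index.knn_query(query_vector, k=5)
print(labels, distances)  # the 5 approximate nearest neighbours
```

The M, ef_construction, and ef parameters are exactly the accuracy-versus-speed dial described above: turn them up and the "approximate" answer gets closer to the exact one, at the cost of more comparisons.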
Putting it all together: The full RAG flow
Now we can see the full picture. Our two key components—the Embedding Model and the Vector Database—work together to power our RAG system.
graph TD
A[Your Doc .pdf] --> B[1. Chunking]
B --> C[Chunk 1]
B --> D[Chunk 2]
B --> E[Chunk 3...]
C --> F[2. Embedding Model<br/>The Translator]
D --> F
E --> F
F --> G[Vector 1]
F --> H[Vector 2]
F --> I[Vector 3...]
subgraph VDB["3. Vector Database e.g., Chroma"]
direction LR
J[ANN Index<br/>The Map]
G --> J
H --> J
I --> J
end
K[User Query: Find...] --> L[2. Embedding Model<br/>The Translator]
L --> M[Query Vector]
M -- "4. Search" --> J
J -- "5. Retrieve" --> N[Relevant Chunks]
N --> O[6. LLM]
K --> O
O --> P[Final Answer]
- Ingestion (One-time): We "chunk" our documents (see our chunking guide), "translate" each chunk into a vector using the Embedding Model, and store these vectors in the Vector Database, which builds its fast "map".
- Querying (Real-time): The user's query is "translated" by the same Embedding Model. The Vector Database uses its fast ANN search ("GPS") to find the closest document vectors. These vectors' corresponding text chunks are retrieved and given to the LLM. (A minimal code sketch of both phases follows below.)
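To make both phases concrete, here is a minimal end-to-end sketch using Chroma. The collection name and documents are illustrative; by default Chroma embeds the text for you with a built-in MiniLM embedding model, so ingestion and querying are guaranteed to use the same "translator":

```python
import chromadb

client = chromadb.Client()  # in-memory; use a persistent client for real projects
collection = client.create_collection(name="docs")

# Ingestion (one-time): chunk -> embed -> store in the ANN index
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "The king sat on the throne.",
        "The queen ruled the land.",
        "I ate a red apple.",
    ],
)

# Querying (real-time): embed the query -> ANN search -> retrieve chunks
results = collection.query(query_texts=["Who is the monarch?"], n_results=2)
print(results["documents"])
# These retrieved chunks are what you would pass to the LLM as context.
```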
Observation:
- The Embedding Model defines the quality of your search. A better model creates a better "map".
- The Vector Database defines the speed of your search. A better database searches the "map" faster.
You cannot have a good RAG system without both.
Challenge for You
- Use Case: You are building a RAG system for a movie database.
- The Goal: You want a search for "fast cars and explosions" to return the Fast & Furious movies, even if the movie descriptions don't use those exact words.
- Your Task:
  - Which component is responsible for "understanding" that the meaning of "fast cars and explosions" is semantically close to the meaning of "a film about street racing and heists"?
  - Which component is responsible for searching through 1 million movie descriptions in under 50 milliseconds?
Key takeaways
- Embeddings translate meaning into math: Embedding models convert text into numerical vectors that capture semantic relationships, enabling computers to understand meaning
- Vector databases solve the search problem: ANN algorithms like HNSW enable fast approximate nearest neighbor search across millions of vectors
- Both components are essential: The embedding model determines search quality, while the vector database determines search speed
- ANN trade-offs: Approximate search sacrifices perfect accuracy for massive speed improvements (milliseconds vs. hours)
- Understanding the fundamentals: Knowing how embeddings and vector databases work is crucial for building and optimizing production RAG systems
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.