Retrieval-Augmented Generation (RAG): Giving LLMs an Open Book

Param Harrison
7 min read

In our last posts, we learned how to talk to LLMs (Prompt Engineering) and what they are (Token Predictors). But they all share a fundamental problem: they are like brilliant students taking a closed-book exam.

An LLM can only answer questions based on the knowledge it memorized during its training. This leads to two major problems:

  1. Knowledge Cutoff: The model knows nothing about events that happened after its training. It can't tell you yesterday's news or stock prices.

  2. Private Data: The model has no access to your company's internal documents, your personal notes, or your new product's technical specs.

What if we could give the LLM an "open-book exam" instead? This is the key insight of RAG.

Retrieval-Augmented Generation (RAG) is a technique that retrieves relevant information from your documents first, then uses an LLM to generate an answer based only on that information.

The problem: the closed-book exam

First, let's demonstrate the problem. Imagine we have a private company memo. The LLM has never seen this text.

# Our private document
project_memo = """
Project Nova: Q3 2025 Internal Report
Prepared by: Dr. Evelyn Reed
Date: October 23, 2025

...The team successfully integrated the new chronosynclastic infundibulum,
resulting in a 40% increase in signal stability. The project's lead
engineer is named David Chen.
"""

# A question the LLM can't possibly know
query = "What was the main achievement in Project Nova, and who is the lead engineer?"

If we ask the LLM this question directly, it will fail.

# The "Closed-Book" prompt
prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": query}
]

# The LLM's (failed) response would be:
# "I do not have access to internal reports like 'Project Nova'..."
# Or it might "hallucinate" (make up) an answer.
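
If you want to actually run this closed-book test, here is a minimal sketch using the OpenAI client (the model name is only an example; any chat model will do):

# Minimal sketch: send the closed-book prompt and see what comes back
from openai import OpenAI

llm_client = OpenAI(api_key="...")

response = llm_client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=prompt
)
print(response.choices[0].message.content)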

The model correctly states it doesn't know. Now, let's build the RAG pipeline to give it an "open book."

The RAG solution: an open-book exam

The RAG pipeline has a few simple steps. We'll turn our document into a searchable knowledge base and then use it to answer our query.

graph TD
    A[Your Document] --> B(Step 1: Chunk)
    B --> C(Step 2: Embed)
    C --> D[Step 3: Store in Vector DB]
    E[User Query] --> F(Step 2: Embed)
    F --> G(Step 4: Retrieve)
    D -- Fetches relevant chunks --> G
    G --> H(Step 5: Augment)
    E -- Original query --> H
    H --> I[Step 6: Generate]
    I --> J[Final Answer]

Step 1: Chunking the document

We can't just feed a huge document to the model. We need to break it into smaller, manageable chunks. For this example, we'll just split it by newlines.

# For a real app, you'd use a more advanced chunking strategy
# (e.g., by paragraph or a fixed token size)

chunks = [line for line in project_memo.split('\n') if line.strip() != ""]

# Our document is now a list of text strings:
# [
#   "Project Nova: Q3 2025 Internal Report",
#   "Prepared by: Dr. Evelyn Reed",
#   ...
# ]
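
As the comment above notes, real applications usually chunk by paragraph or by a fixed token budget rather than line by line. Purely as an illustration, a paragraph-based version of the same idea might look like this:

# Illustrative alternative: split on blank lines so related sentences
# stay together in a single chunk
paragraph_chunks = [
    p.strip()
    for p in project_memo.split("\n\n")
    if p.strip()
]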

Step 2: Creating embeddings

Next, we convert each text chunk into a list of numbers called a vector embedding. These vectors represent the semantic meaning of the text. Chunks with similar meanings will have mathematically similar vectors.

We use a special model (not a huge LLM) to do this.

from sentence_transformers import SentenceTransformer

# Load a model specifically for creating embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert all our text chunks into numerical vectors
chunk_embeddings = embedding_model.encode(chunks)

# chunk_embeddings is now a list of vectors (arrays of numbers)
# e.g., [[0.1, 0.4, -0.2, ...], [0.8, 0.1, 0.9, ...], ...]

This is the magic of RAG. We've just enabled semantic search.

  • Keyword Search (Ctrl+F): You search for "delay." It only finds the exact word "delay."
  • Semantic Search (Vectors): You search for "delay." It finds chunks containing "revised timeline," "pushed back," or "holdup," because the meaning is similar.

graph TD
    subgraph KEYWORD["Keyword Search (Exact Match)"]
        A["Query: 'project delay'"] --> B{"Finds 'project delay'"}
        A -.-> C("Fails to find 'revised timeline'")
    end
    
    subgraph SEMANTIC["Semantic Search (Meaning Match)"]
        D["Query: 'project delay'"] --> E["Vector for 'delay'"]
        E -- "Is 'close to'" --> F["Vector for 'revised timeline'"]
        E -- "Is 'far from'" --> G["Vector for 'Dr. Evelyn Reed'"]
    end
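
To make "meaning match" concrete, here is a small sketch that scores two phrases against the query with cosine similarity (exact numbers vary by model, but "revised timeline" should land much closer to "project delay" than an unrelated name does):

from sentence_transformers import util

# Embed the query and two candidate phrases with the same model as before
vectors = embedding_model.encode([
    "project delay",      # the query
    "revised timeline",   # related in meaning, shares no keywords
    "Dr. Evelyn Reed"     # unrelated
])

# Cosine similarity: values closer to 1.0 mean closer in meaning
print(util.cos_sim(vectors[0], vectors[1]))  # query vs. "revised timeline" -> higher
print(util.cos_sim(vectors[0], vectors[2]))  # query vs. "Dr. Evelyn Reed" -> lower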

Step 3: Creating a vector store (our "open book")

Now we need a place to store our embeddings and their corresponding text chunks. This is our searchable "library." We'll use a vector database like ChromaDB.

import chromadb
chroma_client = chromadb.Client()

# Create a "collection" (like a table) to hold our docs
collection = chroma_client.get_or_create_collection(name="project_nova_docs")

# We need unique IDs for each chunk
chunk_ids = [str(i) for i in range(len(chunks))]

# Add our embeddings and the original text to the database
collection.add(
    embeddings=chunk_embeddings.tolist(),  # plain Python lists work across ChromaDB versions
    documents=chunks,
    ids=chunk_ids
)

Our knowledge base is now "indexed" and ready for questions.
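
As a quick sanity check, you can confirm that the collection holds one entry per chunk:

# Should print the same number as len(chunks)
print(collection.count())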

Steps 4–6: Retrieve, Augment, and Generate

This is the core RAG loop. We'll take our user's query, find the most relevant chunks, and then "augment" a new prompt for the LLM.

1. Retrieve

First, we embed the user's query using the same model and search the collection.

# The same query from before
query = "What was the main achievement in Project Nova, and who is the lead engineer?"

# 1. Embed the query
query_embedding = embedding_model.encode([query]).tolist()  # plain list, one vector per query

# 2. Search the collection for the top 3 most similar chunks
results = collection.query(
    query_embeddings=query_embedding,
    n_results=3
)

retrieved_chunks = results['documents'][0]
# retrieved_chunks might be:
# [
#   "...40% increase in signal stability.",
#   "...lead engineer is named David Chen.",
#   "Project Nova: Q3 2025 Internal Report"
# ]

2. Augment

Next, we create a new prompt that includes these retrieved chunks as context.

# We build a new prompt, stuffing the retrieved text into it
context = "\n".join(retrieved_chunks)

augmented_prompt = f"""
Use the following context to answer the question.
If the answer is not in the context, say 'I don't know'.

Context:
{context}

Question: {query}
"""

3. Generate

Finally, we send this new, context-rich prompt to the LLM.

# The "Open-Book" prompt
from openai import OpenAI
llm_client = OpenAI(api_key="...")

prompt = [
    {"role": "system", "content": "You are a precise and factual assistant."},
    {"role": "user", "content": augmented_prompt}
]

# Send the augmented prompt to the LLM
# (the model name is just an example; use whichever chat model you have access to)
response = llm_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=prompt
)
print(response.choices[0].message.content)

# The LLM's (successful) response will be:
# "The main achievement in Project Nova was a 40% increase in
# signal stability. The lead engineer is David Chen."

Success! The model isn't using its internal memory; it's reading the "open book" we just gave it.

Your mental model: RAG = Retriever + Generator

Think of RAG as a two-part system:

  1. The Retriever (The Librarian): Its only job is to be an expert at searching your knowledge base. It takes a query and finds the most relevant documents. This part is fast and uses vector search.

  2. The Generator (The Synthesizer): This is the LLM. Its job is to take the user's query and the documents from the Retriever and synthesize them into a single, human-readable answer.

graph TD
    A["User Query"] --> B["Retriever (Librarian)"]
    C["Vector Database"] -- "Chunks" --> B
    B -- "Relevant Chunks" --> D["Generator (LLM)"]
    A --> D
    D --> E["Final Answer"]
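
To see the two parts working as one unit, here is a minimal sketch that wraps the whole loop in a single function, reusing the embedding_model, collection, and llm_client objects from earlier (the model name is again just an example):

def rag_answer(query: str, n_results: int = 3) -> str:
    # 1. Retrieve: embed the query and fetch the most similar chunks
    query_embedding = embedding_model.encode([query]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=n_results)
    context = "\n".join(results["documents"][0])

    # 2. Augment: stuff the retrieved chunks into the prompt
    augmented_prompt = (
        "Use the following context to answer the question.\n"
        "If the answer is not in the context, say 'I don't know'.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 3. Generate: let the LLM synthesize the final answer
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system", "content": "You are a precise and factual assistant."},
            {"role": "user", "content": augmented_prompt},
        ],
    )
    return response.choices[0].message.content

# Example usage:
# print(rag_answer("Who prepared the Project Nova report?"))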

RAG is great for:

  • Answering questions over private documents (like your company's Wiki)
  • Providing up-to-date information (by adding new documents to the vector store)
  • Reducing hallucinations by "grounding" the LLM in specific facts
  • Providing citations for its answers, since you know exactly which chunks it used (see the sketch below)
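
On that last point, ChromaDB already returns the ids of the matching chunks (and can include distances if you ask), so a simple sketch of surfacing sources alongside the answer could look like this:

# Retrieve again, asking Chroma to include distances; ids are always returned
results = collection.query(
    query_embeddings=embedding_model.encode([query]).tolist(),
    n_results=3,
    include=["documents", "distances"]
)

# Pair each retrieved chunk with its id so the final answer can cite its sources
for chunk_id, text, distance in zip(
    results["ids"][0], results["documents"][0], results["distances"][0]
):
    print(f"[chunk {chunk_id}] (distance={distance:.3f}) {text}")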

Key takeaways

  • RAG solves the knowledge problem: It gives LLMs an "open-book" exam, letting them use external, up-to-date, or private information
  • Everything is vectors: RAG works by converting text (documents and queries) into numerical embeddings and finding the ones that are mathematically closest in meaning
  • The pipeline is key: A successful RAG system depends on good chunking, accurate embeddings, and a well-crafted prompt
  • Retrieval first, generation second: The core idea is to separate the problem of finding information from the problem of explaining it
  • Grounding reduces hallucinations: By forcing the LLM to base its answer on provided text, we significantly reduce its tendency to make things up

For more on building production AI systems, check out our AI Engineering Bootcamp.
