RAG Evaluation: Proving Your System Actually Works

Param Harrison
6 min read

We've spent a lot of time building complex RAG systems. We've optimized chunking, added web search, and even built self-correcting agents. Each step felt like an improvement.

But how do we know for sure?

If you change your chunking strategy, how can you prove to your boss that it resulted in a 10% increase in answer quality? Relying on a few example queries isn't enough. A change that improves one answer might make ten others worse.

To build reliable AI, we must move from anecdotal evidence ("it feels better") to quantitative measurement. RAG evaluation is the science of defining what "good" means and then systematically scoring our system against those criteria. It's the difference between being a hobbyist and an engineer.

The RAG Triad: What do we measure?

Before we test, we need to know what to measure. A "good" RAG system balances three core components, often called the RAG Triad:

  1. Context Relevance (Retrieval Quality): Did our retriever find the right information? Did it find all the info needed (Context Recall) and not include irrelevant junk (Context Precision)?

  2. Faithfulness (Generation Quality): Did the LLM's answer stick to the facts from the retrieved context? Or did it "hallucinate" and make things up?

  3. Answer Correctness (Overall Quality): Was the final answer actually correct and relevant to the user's question?

graph TD
    A[User Query] --> B(1. Retrieval)
    B -- "Did we find the right info?" --> C(Context Relevance)
    B --> D(2. Generation)
    D -- "Did the LLM stick to the context?" --> E(Faithfulness)
    D -- "Was the final answer correct?" --> F(Answer Correctness)
    
    style C fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#f9f,stroke:#333,stroke-width:2px

We need to score our system on all of these metrics to get a complete picture.
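
To make the triad concrete, here is a tiny illustration (not RAGAs code, just the shape of one test record) showing which pair of fields each metric compares. These are the same question, contexts, answer, and ground_truth fields we will assemble for RAGAs in Step 2.

# Illustration only: the four fields one evaluation record carries,
# and which pair of fields each metric in the triad compares.
record = {
    "question": "What are the Unforgivable Curses?",
    "contexts": ["The three Unforgivable Curses are the Imperius Curse, ..."],
    "answer": "The Imperius, Cruciatus, and Killing Curses.",
    "ground_truth": "The Imperius, Cruciatus, and Killing Curses.",
}

# Context Precision / Recall -> compare `contexts` with what the `question`
#                               (and `ground_truth`) actually required
# Faithfulness               -> compare `answer` with `contexts`
# Answer Correctness         -> compare `answer` with `ground_truth`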

Step 1: Building a test pipeline and test questions

To evaluate a system, we first need a system to test. We'll build a simple RAG pipeline based on a tiny knowledge base.

# 1. Our "knowledge base" (just a few facts)
documents = [
    "The first wizarding war ended when Lord Voldemort's Killing Curse rebounded...",
    "Harry Potter was left with a lightning-bolt scar...",
    "The three Unforgivable Curses are the Imperius Curse, the Cruciatus Curse, and the Killing Curse."
]

# (Code to add these docs to a ChromaDB collection)

# 2. Our simple RAG pipeline function
def simple_rag_pipeline(question):
    # Retrieve
    retrieved_docs = collection.query(query_texts=[question], n_results=2)['documents'][0]
    context = "\n".join(retrieved_docs)
    
    # Generate
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    answer = call_llm(prompt) # call_llm is our function to talk to OpenAI
    
    return {"answer": answer, "contexts": retrieved_docs}

Next, and most importantly, we need an evaluation dataset. This includes questions and, for the best results, the "perfect" answers (called ground_truth).

Crucially, we must include questions that cannot be answered by our documents. This tests if the system knows what it doesn't know.

# Our "test sheet"
eval_questions = [
    "What caused the first wizarding war to end?",
    "What are the Unforgivable Curses?",
    "Who was the Minister for Magic during the first war?" # Not in our documents!
]

ground_truths = [
    "The war ended when Voldemort's Killing Curse rebounded on him.",
    "The Imperius, Cruciatus, and Killing Curses.",
    "The provided context does not mention the Minister for Magic."
]

Step 2: Running the evaluation with RAGAs

We could grade these answers by hand, but it's slow and subjective. Instead, we'll use a framework called RAGAs.

RAGAs acts as an automated "judge." It uses powerful LLMs (like GPT-4) to read the question, the retrieved_contexts, the generated answer, and the ground_truth and then score our system on the metrics we defined.
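
By default, RAGAs picks up your OpenAI key from the environment for its judge. If you want to pin the judge model explicitly, a sketch like the one below works in recent RAGAs versions; the wrapper names assume the LangChain integration and may differ across versions.

# Optional: pin the judge LLM explicitly instead of relying on defaults.
# Assumes ragas with its LangChain integration and langchain-openai installed.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
# Later, pass `llm=judge_llm` to `evaluate(...)` so every metric uses this judge.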

First, we run our pipeline to get the outputs we want to score:

# 1. Run our pipeline on all questions
generated_data = []
for q in eval_questions:
    result = simple_rag_pipeline(q)
    generated_data.append(result)

# 2. Format the data for RAGAs
from datasets import Dataset

ragas_dataset_dict = {
    'question': eval_questions,
    'answer': [d['answer'] for d in generated_data],
    'contexts': [d['contexts'] for d in generated_data],
    'ground_truth': ground_truths
}

ragas_dataset = Dataset.from_dict(ragas_dataset_dict)

Now, we just hand this dataset to RAGAs and ask it to evaluate.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, context_recall, context_precision

# 3. Run the evaluation!
result = evaluate(
    dataset=ragas_dataset,
    metrics=[
        context_precision,  # Did we avoid retrieving irrelevant junk?
        context_recall,     # Did we retrieve all the right info?
        faithfulness,       # Did the answer stick to the context?
        answer_correctness  # Was the answer correct?
    ]
)

# 4. Display the results
print(result.to_pandas())

The output is a clean table of scores for each question, showing us exactly where our system is weak. For the "Minister for Magic" question, the context_recall score would be very low (close to 0.0): the necessary information simply isn't in our knowledge base, so the retriever couldn't surface it, and RAGAs correctly flags that gap.
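
A handy follow-up (a small sketch, not part of the pipeline above): average the per-question scores into a single scorecard for the run, which is the number you will compare across pipeline variants in Step 3. In recent RAGAs versions the DataFrame columns are named after the metrics.

# Sketch: collapse per-question scores into one scorecard for this run.
df = result.to_pandas()
metric_cols = ["context_precision", "context_recall", "faithfulness", "answer_correctness"]
print(df[metric_cols].mean().round(2))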

Step 3: Proving an optimization works (A/B testing)

This is where evaluation becomes a superpower. Let's make a change and prove if it's better.

Hypothesis: Retrieving more documents (n_results=3 instead of 2) will improve our answers.

We create an advanced_rag_pipeline that's identical, but retrieves 3 documents. We run it, evaluate it, and compare the results.
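
Here's a sketch of how that comparison can be wired up, reusing the questions, ground truths, dataset format, and metrics from the earlier steps (run_and_score is a hypothetical helper introduced just for this comparison):

# A/B comparison sketch. advanced_rag_pipeline is the k=3 variant:
# identical to simple_rag_pipeline except n_results=3.
def advanced_rag_pipeline(question):
    retrieved_docs = collection.query(query_texts=[question], n_results=3)['documents'][0]
    context = "\n".join(retrieved_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return {"answer": call_llm(prompt), "contexts": retrieved_docs}

def run_and_score(pipeline):
    # Run the pipeline on every test question, then score it with RAGAs.
    outputs = [pipeline(q) for q in eval_questions]
    ds = Dataset.from_dict({
        'question': eval_questions,
        'answer': [o['answer'] for o in outputs],
        'contexts': [o['contexts'] for o in outputs],
        'ground_truth': ground_truths,
    })
    scores = evaluate(dataset=ds, metrics=[context_precision, context_recall,
                                           faithfulness, answer_correctness])
    return scores.to_pandas()

baseline = run_and_score(simple_rag_pipeline)    # k=2
advanced = run_and_score(advanced_rag_pipeline)  # k=3
print(baseline["answer_correctness"].round(2))
print(advanced["answer_correctness"].round(2))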

Question                        | RAG (k=2) answer_correctness | RAG (k=3) answer_correctness
"What caused the war to end?"   | 0.95                         | 0.96
"What are the Curses?"          | 0.51                         | 0.92
"Who was the Minister?"         | 0.88                         | 0.88

Conclusion: Look at that! For the "Curses" question, retrieving only 2 documents wasn't enough, so the answer was incomplete (score: 0.51). Our "advanced" pipeline, retrieving 3 documents, got all the needed context and scored a 0.92.

We have just quantitatively shown that our change (k=3) is a measurable improvement.

Key takeaways

  • If you can't measure it, you can't improve it: Evaluation is the core discipline of building production AI. Stop "feeling" and start measuring.
  • The RAG triad is your guide: Focus on your Retrieval Quality (Context Precision/Recall) and your Generation Quality (Faithfulness/Correctness) to get a complete picture.
  • Frameworks automate judging: Tools like RAGAs automate the complex, expensive task of LLM-based evaluation, letting you test and iterate rapidly.
  • Evaluation is comparative: The true power of evaluation is in A/B testing—proving that a change to your system leads to a measurable improvement.

For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
