RAG Evaluation: Metrics That Actually Matter

Param Harrison

In our previous posts, we've built a complete, self-correcting RAG agent (see our RAG agent series). We've added routing, tool use, and self-correction. It feels smarter. But in engineering, "feels" is not a word we can put in a report.

This post is for you if you've ever asked:

  • "I changed my prompt. Did it actually make the answers better, or just different?"
  • "I swapped my embedding model. How do I prove it was worth the cost?"
  • "My boss wants a report on our RAG quality. Where do I even start?"

Relying on a few example queries is not engineering. Today, we'll learn how to quantitatively measure the quality of our RAG system.

The problem: "Which is better?" shouldn't be a guess

You've built a RAG pipeline. You A/B test two different prompts on the same query.

graph TD
    A[User Query] --> B(Pipeline V1: Simple Prompt)
    B --> C["Answer A: 'The report is positive.'"]
    
    A --> D(Pipeline V2: Detailed Prompt)
    D --> E["Answer B: 'The report, published in Q4, is positive, citing a 15% growth.'"]
    
    F((Which is better?)) -- ??? --> C
    F -- ??? --> E
    
    style C fill:#fff8e1,stroke:#f57f17
    style E fill:#e8f5e9,stroke:#388e3c

Answer B looks more detailed, but is it more accurate? Did it hallucinate that "15% growth"? We have no way of knowing without a "judge." We must move from anecdotal evidence to automated, quantitative measurement.

The solution: The "RAG Triad" (What to measure)

Before we pick a tool, we must define what "good" means. A RAG system has two main components that can fail: the Retriever and the Generator. We must test both.

This gives us the RAG Triad of metrics:

graph TD
    A[User Query] --> B(1. Retriever)
    B -- "Judge 1: Did we find the right info?" --> C(Context Relevance / Precision)
    B -- "Judge 2: Did we find ALL the info?" --> D(Context Recall)
    
    B --> E(2. Generator)
    E -- "Judge 3: Did the LLM stick to the context?" --> F(Faithfulness)
    E -- "Judge 4: Did the final answer actually answer the query?" --> G(Answer Relevancy)
    
    style C fill:#e3f2fd,stroke:#0d47a1
    style D fill:#e3f2fd,stroke:#0d47a1
    style F fill:#e0f2f1,stroke:#00695c
    style G fill:#e0f2f1,stroke:#00695c

  1. Context Relevance / Precision: Are the retrieved documents actually relevant? Or is it 90% junk? (Tests the Retriever)
  2. Context Recall: Did we find all the necessary documents to answer the question? (Tests the Retriever)
  3. Faithfulness: Did the LLM stick to the facts in the context? Or did it make stuff up (hallucinate)? (Tests the Generator)
  4. Answer Relevancy: Did the final answer actually answer the user's question, or did it get sidetracked? (Tests the Generator)
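
To make these four scores concrete, here is the back-of-the-envelope intuition behind each one, written as simple ratios. These are illustrative simplifications, not the exact formulas RAGAs implements (answer relevancy, for example, is computed there via embedding similarity rather than a direct count).

# Simplified intuition for the four scores (illustrative ratios, not RAGAs' exact formulas).

def context_precision(relevant_chunks: int, retrieved_chunks: int) -> float:
    # Of everything the retriever returned, how much was actually useful?
    return relevant_chunks / retrieved_chunks

def context_recall(retrieved_facts: int, required_facts: int) -> float:
    # Of the facts needed to answer (per the ground truth), how many did we retrieve?
    return retrieved_facts / required_facts

def faithfulness(supported_claims: int, total_claims: int) -> float:
    # Of the claims in the generated answer, how many are backed by the retrieved context?
    return supported_claims / total_claims

def answer_relevancy(on_topic_statements: int, total_statements: int) -> float:
    # How much of the answer actually addresses the user's question?
    return on_topic_statements / total_statements

# Example: Answer B makes 4 claims, but only 3 are supported by the retrieved report.
print(faithfulness(supported_claims=3, total_claims=4))  # 0.75 -> one likely hallucination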

The "How": Using an "LLM-as-Judge"

We could manually score 100 questions on these 4 metrics. But that's slow and expensive. The modern solution is to use an LLM-as-Judge.

We'll use a framework like RAGAs to automate this. RAGAs uses a powerful LLM (like GPT-4o) to act as our "judge."
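
To see what a "judge" call looks like mechanically, here is a hypothetical faithfulness judge. The prompt wording and the call_llm() helper are illustrative assumptions, not RAGAs internals.

# A hypothetical faithfulness judge (illustration only -- not RAGAs' internal prompt).

FAITHFULNESS_JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.

Context:
{context}

Answer:
{answer}

List every factual claim in the answer. For each claim, state whether the
context supports it. Then return a score between 0 and 1:
(supported claims) / (total claims)."""

def judge_faithfulness(context: str, answer: str) -> str:
    # call_llm() is a stand-in for whatever LLM client you already use.
    prompt = FAITHFULNESS_JUDGE_PROMPT.format(context=context, answer=answer)
    return call_llm(prompt)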

The "How":

Your code looks like a simple Python script.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

# 1. First, we create our "golden set" of test questions
# We also need the "ground_truth" - the perfect, human-written answer
eval_questions = ["What is RAG?"]
ground_truths = ["RAG is a technique to retrieve data..."]

# 2. We run our RAG pipeline (that we built in a previous post)
# to get its answers and the context it used.
results = my_rag_pipeline(eval_questions) 
# For our single test question, results looks something like:
# results = {
#   'question': "What is RAG?",
#   'answer': "RAG is a technique to...",
#   'contexts': ["Retrieval-Augmented Generation (RAG) is a..."]
# }

# 3. We format this data for RAGAs
dataset = Dataset.from_dict({
    'question': eval_questions,
    'answer': [results['answer']],
    'contexts': [results['contexts']],
    'ground_truth': ground_truths
})

# 4. We ask RAGAs to evaluate our pipeline's output
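# (Under the hood, RAGAs makes judge-LLM and embedding calls for each metric;
#  by default it expects an OpenAI API key to be set in your environment.)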
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ]
)

# 5. Get your scores
# result = {'faithfulness': 1.0, 'answer_relevancy': 0.95, ...}
print(result)

Observation: We now have a repeatable, objective way to score our agent. When we change a prompt, we just re-run this script. If faithfulness drops from 0.95 to 0.80, we know our change caused more hallucinations, and we can block the change. This is how you run CI/CD for your AI agents.
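
Here is a minimal sketch of that CI gate. The run_ragas_eval() helper and the baseline thresholds are assumptions for illustration; in practice the helper would wrap the evaluate() call from the script above and return a dict of scores.

# Minimal CI gate: fail the build if any score regresses below its baseline.
# (run_ragas_eval() and the thresholds below are illustrative assumptions.)

BASELINES = {"faithfulness": 0.90, "answer_relevancy": 0.85}

def test_rag_quality():
    scores = run_ragas_eval()  # wraps the evaluate() call from the script above
    for metric, minimum in BASELINES.items():
        assert scores[metric] >= minimum, (
            f"{metric} dropped to {scores[metric]:.2f} "
            f"(baseline {minimum}) - blocking this change"
        )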

Think About It: The "LLM-as-Judge" (like RAGAs) is powerful, but it's not free. Every evaluation call is an API call to GPT-4o. How could this be a problem? (Hint: Think about cost and speed when testing 10,000 questions).
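
As a rough feel for that cost, here is a back-of-the-envelope estimate. Every number in it (tokens per call, price per 1K tokens) is an illustrative assumption, not a real quote.

# Back-of-the-envelope cost of judging 10,000 questions (all numbers are assumptions).
questions = 10_000
metrics = 4                        # faithfulness, answer relevancy, recall, precision
judge_calls = questions * metrics  # ~40,000 judge-LLM calls
tokens_per_call = 1_500            # prompt + context + verdict (assumed)
price_per_1k_tokens = 0.005        # hypothetical judge-model price in USD

est_cost = judge_calls * tokens_per_call / 1_000 * price_per_1k_tokens
print(f"~${est_cost:,.0f} per full evaluation run")  # ~$300 under these assumptions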

Next step

Now that we can measure failure, we need to handle it. In our next post, we'll build a resilient, production-grade agent that can survive real-world chaos like API timeouts and tool failures.

Challenge for you

  1. Find Your "Golden Set": You can't evaluate without a good test set. Take a RAG agent you've built and write 5 "golden set" questions for it (a sketch of what this looks like follows this list).
  2. The Set: Include 3 questions where you know the answer is in the docs, and 2 where you know it is not.
  3. Be the Judge: Manually score your agent's answers for these 5 questions on a 1-5 scale for Faithfulness (Did it make stuff up?) and Answer Relevancy (Did it answer the right question?).
  4. Analyze: Where did it fail? This is your first evaluation dataset!
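
Here is what such a golden set can look like as data. The questions and answers are placeholders; swap in ones grounded in your own docs.

# A tiny golden set (placeholder content -- replace with questions about your own docs).
golden_set = [
    # 3 questions whose answers ARE in the docs
    {"question": "What is RAG?",
     "ground_truth": "RAG is a technique to retrieve data...", "answer_in_docs": True},
    {"question": "Which vector store does our pipeline use?",
     "ground_truth": "...", "answer_in_docs": True},
    {"question": "How are documents chunked before indexing?",
     "ground_truth": "...", "answer_in_docs": True},
    # 2 questions whose answers are NOT in the docs -- the agent should say it doesn't know
    {"question": "What is the company's 2030 revenue target?",
     "ground_truth": "Not covered by the documentation.", "answer_in_docs": False},
    {"question": "Who won the 1998 World Cup?",
     "ground_truth": "Not covered by the documentation.", "answer_in_docs": False},
]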

Key takeaways

  • Quantitative metrics replace gut feelings: The RAG Triad (context relevance, recall, faithfulness, answer relevancy) gives us objective scores
  • LLM-as-Judge automates evaluation: Tools like RAGAs use powerful LLMs to score answers at scale, replacing expensive manual evaluation
  • Test both retriever and generator: Context metrics test retrieval quality; faithfulness and answer relevancy test generation quality
  • Evaluation enables CI/CD: Automated evaluation lets you catch regressions before they reach production
  • Golden sets are essential: You need a curated set of test questions with known good answers to measure improvement

For more on RAG evaluation, see our RAG evaluation guide.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
