RAG Evaluation: Metrics That Actually Matter

Param Harrison

In our previous posts, we've built a complete, self-correcting RAG agent (see our RAG agent series). We've added routing, tool use, and self-correction. It feels smarter. But in engineering, "feels" is not a word we can put in a report.

This post is for you if you've ever asked:

  • "I changed my prompt. Did it actually make the answers better, or just different?"
  • "I swapped my embedding model. How do I prove it was worth the cost?"
  • "My boss wants a report on our RAG quality. Where do I even start?"

Relying on a few example queries is not engineering. Today, we'll learn how to quantitatively measure the quality of our RAG system.

The problem: "Which is better?" shouldn't be a guess

You've built a RAG pipeline. You A/B test two different prompts on the same query.

graph TD
    A[User Query] --> B(Pipeline V1: Simple Prompt)
    B --> C["Answer A: 'The report is positive.'"]
    
    A --> D(Pipeline V2: Detailed Prompt)
    D --> E["Answer B: 'The report, published in Q4, is positive, citing a 15% growth.'"]
    
    F((Which is better?)) -- ??? --> C
    F -- ??? --> E
    
    style C fill:#fff8e1,stroke:#f57f17
    style E fill:#e8f5e9,stroke:#388e3c

Answer B looks more detailed, but is it more accurate? Did it hallucinate that "15% growth"? We have no way of knowing without a "judge." We must move from anecdotal evidence to automated, quantitative measurement.

The solution: The "RAG Triad" (What to measure)

Before we pick a tool, we must define what "good" means. A RAG system has two main components that can fail: the Retriever and the Generator. We must test both.

This gives us the RAG Triad of metrics:

graph TD
    A[User Query] --> B(1. Retriever)
    B -- "Judge 1: Did we find the right info?" --> C(Context Relevance / Precision)
    B -- "Judge 2: Did we find ALL the info?" --> D(Context Recall)
    
    B --> E(2. Generator)
    E -- "Judge 3: Did the LLM stick to the context?" --> F(Faithfulness)
    E -- "Judge 4: Did the final answer actually answer the query?" --> G(Answer Relevancy)
    
    style C fill:#e3f2fd,stroke:#0d47a1
    style D fill:#e3f2fd,stroke:#0d47a1
    style F fill:#e0f2f1,stroke:#00695c
    style G fill:#e0f2f1,stroke:#00695c

  1. Context Relevance / Precision: Are the retrieved documents actually relevant? Or is it 90% junk? (Tests the Retriever)
  2. Context Recall: Did we find all the necessary documents to answer the question? (Tests the Retriever)
  3. Faithfulness: Did the LLM stick to the facts in the context? Or did it make stuff up (hallucinate)? (Tests the Generator)
  4. Answer Relevancy: Did the final answer actually answer the user's question, or did it get sidetracked? (Tests the Generator)
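
To make these four scores concrete, here is the back-of-the-envelope intuition behind each one, written as simple ratios. These are illustrative simplifications, not the exact formulas RAGAs implements (answer relevancy, for example, is computed there via embedding similarity rather than a direct count).

# Simplified intuition for the four scores (illustrative ratios, not RAGAs' exact formulas).

def context_precision(relevant_chunks: int, retrieved_chunks: int) -> float:
    # Of everything the retriever returned, how much was actually useful?
    return relevant_chunks / retrieved_chunks

def context_recall(retrieved_facts: int, required_facts: int) -> float:
    # Of the facts needed to answer (per the ground truth), how many did we retrieve?
    return retrieved_facts / required_facts

def faithfulness(supported_claims: int, total_claims: int) -> float:
    # Of the claims in the generated answer, how many are backed by the retrieved context?
    return supported_claims / total_claims

def answer_relevancy(on_topic_statements: int, total_statements: int) -> float:
    # How much of the answer actually addresses the user's question?
    return on_topic_statements / total_statements

# Example: Answer B makes 4 claims, but only 3 are supported by the retrieved report.
print(faithfulness(supported_claims=3, total_claims=4))  # 0.75 -> one likely hallucination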

The "How": Using an "LLM-as-Judge"

We could manually score 100 questions on these 4 metrics. But that's slow and expensive. The modern solution is to use an LLM-as-Judge.

We'll use a framework like RAGAs to automate this. RAGAs uses a powerful LLM (like GPT-4o) to act as our "judge."
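
To see what a "judge" call looks like mechanically, here is a hypothetical faithfulness judge. The prompt wording and the call_llm() helper are illustrative assumptions, not RAGAs internals.

# A hypothetical faithfulness judge (illustration only -- not RAGAs' internal prompt).

FAITHFULNESS_JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.

Context:
{context}

Answer:
{answer}

List every factual claim in the answer. For each claim, state whether the
context supports it. Then return a score between 0 and 1:
(supported claims) / (total claims)."""

def judge_faithfulness(context: str, answer: str) -> str:
    # call_llm() is a stand-in for whatever LLM client you already use.
    prompt = FAITHFULNESS_JUDGE_PROMPT.format(context=context, answer=answer)
    return call_llm(prompt)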

The "How":

Your code looks like a simple Python script.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

# 1. First, we create our "golden set" of test questions
# We also need the "ground_truth" - the perfect, human-written answer
eval_questions = ["What is RAG?"]
ground_truths = ["RAG is a technique to retrieve data..."]

# 2. We run our RAG pipeline (that we built in a previous post)
# to get its answers and the context it used.
results = my_rag_pipeline(eval_questions) 
# For our single test question, results looks something like:
# results = {
#   'question': "What is RAG?",
#   'answer': "RAG is a technique to...",
#   'contexts': ["Retrieval-Augmented Generation (RAG) is a..."]
# }

# 3. We format this data for RAGAs
dataset = Dataset.from_dict({
    'question': eval_questions,
    'answer': [results['answer']],
    'contexts': [results['contexts']],
    'ground_truth': ground_truths
})

# 4. We ask RAGAs to evaluate our pipeline's output
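# (Under the hood, RAGAs makes judge-LLM and embedding calls for each metric;
#  by default it expects an OpenAI API key to be set in your environment.)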
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ]
)

# 5. Get your scores
# result = {'faithfulness': 1.0, 'answer_relevancy': 0.95, ...}
print(result)

Observation: We now have a repeatable, objective way to score our agent. When we change a prompt, we just re-run this script. If faithfulness drops from 0.95 to 0.80, we know our change caused more hallucinations, and we can block the change. This is how you run CI/CD for your AI agents.
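
Here is a minimal sketch of that CI gate. The run_ragas_eval() helper and the baseline thresholds are assumptions for illustration; in practice the helper would wrap the evaluate() call from the script above and return a dict of scores.

# Minimal CI gate: fail the build if any score regresses below its baseline.
# (run_ragas_eval() and the thresholds below are illustrative assumptions.)

BASELINES = {"faithfulness": 0.90, "answer_relevancy": 0.85}

def test_rag_quality():
    scores = run_ragas_eval()  # wraps the evaluate() call from the script above
    for metric, minimum in BASELINES.items():
        assert scores[metric] >= minimum, (
            f"{metric} dropped to {scores[metric]:.2f} "
            f"(baseline {minimum}) - blocking this change"
        )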

Think About It: The "LLM-as-Judge" (like RAGAs) is powerful, but it's not free. Every evaluation call is an API call to GPT-4o. How could this be a problem? (Hint: Think about cost and speed when testing 10,000 questions).
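
As a rough feel for that cost, here is a back-of-the-envelope estimate. Every number in it (tokens per call, price per 1K tokens) is an illustrative assumption, not a real quote.

# Back-of-the-envelope cost of judging 10,000 questions (all numbers are assumptions).
questions = 10_000
metrics = 4                        # faithfulness, answer relevancy, recall, precision
judge_calls = questions * metrics  # ~40,000 judge-LLM calls
tokens_per_call = 1_500            # prompt + context + verdict (assumed)
price_per_1k_tokens = 0.005        # hypothetical judge-model price in USD

est_cost = judge_calls * tokens_per_call / 1_000 * price_per_1k_tokens
print(f"~${est_cost:,.0f} per full evaluation run")  # ~$300 under these assumptions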

Next step

Now that we can measure failure, we need to handle it. In our next post, we'll build a resilient, production-grade agent that can survive real-world chaos like API timeouts and tool failures.

Challenge for you

  1. Find Your "Golden Set": You can't evaluate without a good test set. Take a RAG agent you've built and write 5 "golden set" questions for it (a sketch of what this looks like follows this list).
  2. The Set: Include 3 questions where you know the answer is in the docs, and 2 where you know it is not.
  3. Be the Judge: Manually score your agent's answers for these 5 questions on a 1-5 scale for Faithfulness (Did it make stuff up?) and Answer Relevancy (Did it answer the right question?).
  4. Analyze: Where did it fail? This is your first evaluation dataset!
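
Here is what such a golden set can look like as data. The questions and answers are placeholders; swap in ones grounded in your own docs.

# A tiny golden set (placeholder content -- replace with questions about your own docs).
golden_set = [
    # 3 questions whose answers ARE in the docs
    {"question": "What is RAG?",
     "ground_truth": "RAG is a technique to retrieve data...", "answer_in_docs": True},
    {"question": "Which vector store does our pipeline use?",
     "ground_truth": "...", "answer_in_docs": True},
    {"question": "How are documents chunked before indexing?",
     "ground_truth": "...", "answer_in_docs": True},
    # 2 questions whose answers are NOT in the docs -- the agent should say it doesn't know
    {"question": "What is the company's 2030 revenue target?",
     "ground_truth": "Not covered by the documentation.", "answer_in_docs": False},
    {"question": "Who won the 1998 World Cup?",
     "ground_truth": "Not covered by the documentation.", "answer_in_docs": False},
]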

Key takeaways

  • Quantitative metrics replace gut feelings: The RAG Triad (context relevance, recall, faithfulness, answer relevancy) gives us objective scores
  • LLM-as-Judge automates evaluation: Tools like RAGAs use powerful LLMs to score answers at scale, replacing expensive manual evaluation
  • Test both retriever and generator: Context metrics test retrieval quality; faithfulness and answer relevancy test generation quality
  • Evaluation enables CI/CD: Automated evaluation lets you catch regressions before they reach production
  • Golden sets are essential: You need a curated set of test questions with known good answers to measure improvement

For more on RAG evaluation, see our RAG evaluation guide.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
