RAG Evaluation: Proving Your System Actually Works

Param Harrison
6 min read

We've spent a lot of time building complex RAG systems. We've optimized chunking, added web search, and even built self-correcting agents. Each step felt like an improvement.

But how do we know for sure?

If you change your chunking strategy, how can you prove to your boss that it resulted in a 10% increase in answer quality? Relying on a few example queries isn't enough. A change that improves one answer might make ten others worse.

To build reliable AI, we must move from anecdotal evidence ("it feels better") to quantitative measurement. RAG evaluation is the science of defining what "good" means and then systematically scoring our system against those criteria. It's the difference between being a hobbyist and an engineer.

The RAG Triad: What do we measure?

Before we test, we need to know what to measure. A "good" RAG system balances three core components, often called the RAG Triad:

  1. Context Relevance (Retrieval Quality): Did our retriever find the right information? Did it find all the info needed (Context Recall) and not include irrelevant junk (Context Precision)?

  2. Faithfulness (Generation Quality): Did the LLM's answer stick to the facts from the retrieved context? Or did it "hallucinate" and make things up?

  3. Answer Correctness (Overall Quality): Was the final answer actually correct and relevant to the user's question?

graph TD
    A[User Query] --> B(1. Retrieval)
    B -- "Did we find the right info?" --> C(Context Relevance)
    B --> D(2. Generation)
    D -- "Did the LLM stick to the context?" --> E(Faithfulness)
    D -- "Was the final answer correct?" --> F(Answer Correctness)
    
    style C fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#f9f,stroke:#333,stroke-width:2px

We need to score our system on all of these metrics to get a complete picture.
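
To make the triad concrete, here is a tiny illustration (not RAGAs code, just the shape of one test record) showing which pair of fields each metric compares. These are the same question, contexts, answer, and ground_truth fields we will assemble for RAGAs in Step 2.

# Illustration only: the four fields one evaluation record carries,
# and which pair of fields each metric in the triad compares.
record = {
    "question": "What are the Unforgivable Curses?",
    "contexts": ["The three Unforgivable Curses are the Imperius Curse, ..."],
    "answer": "The Imperius, Cruciatus, and Killing Curses.",
    "ground_truth": "The Imperius, Cruciatus, and Killing Curses.",
}

# Context Precision / Recall -> compare `contexts` with what the `question`
#                               (and `ground_truth`) actually required
# Faithfulness               -> compare `answer` with `contexts`
# Answer Correctness         -> compare `answer` with `ground_truth`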

Step 1: Building a test pipeline and test questions

To evaluate a system, we first need a system to test. We'll build a simple RAG pipeline based on a tiny knowledge base.

# 1. Our "knowledge base" (just a few facts)
documents = [
    "The first wizarding war ended when Lord Voldemort's Killing Curse rebounded...",
    "Harry Potter was left with a lightning-bolt scar...",
    "The three Unforgivable Curses are the Imperius Curse, the Cruciatus Curse, and the Killing Curse."
]

# (Code to add these docs to a ChromaDB collection)

# 2. Our simple RAG pipeline function
def simple_rag_pipeline(question):
    # Retrieve
    retrieved_docs = collection.query(query_texts=[question], n_results=2)['documents'][0]
    context = "\n".join(retrieved_docs)
    
    # Generate
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    answer = call_llm(prompt) # call_llm is our function to talk to OpenAI
    
    return {"answer": answer, "contexts": retrieved_docs}

Next, and most importantly, we need an evaluation dataset. This includes questions and, for the best results, the "perfect" answers (called ground_truth).

Crucially, we must include questions that cannot be answered by our documents. This tests if the system knows what it doesn't know.

# Our "test sheet"
eval_questions = [
    "What caused the first wizarding war to end?",
    "What are the Unforgivable Curses?",
    "Who was the Minister for Magic during the first war?" # Not in our documents!
]

ground_truths = [
    "The war ended when Voldemort's Killing Curse rebounded on him.",
    "The Imperius, Cruciatus, and Killing Curses.",
    "The provided context does not mention the Minister for Magic."
]

Step 2: Running the evaluation with RAGAs

We could grade these answers by hand, but it's slow and subjective. Instead, we'll use a framework called RAGAs.

RAGAs acts as an automated "judge." It uses powerful LLMs (like GPT-4) to read the question, the retrieved_contexts, the generated answer, and the ground_truth and then score our system on the metrics we defined.
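
By default, RAGAs picks up your OpenAI key from the environment for its judge. If you want to pin the judge model explicitly, a sketch like the one below works in recent RAGAs versions; the wrapper names assume the LangChain integration and may differ across versions.

# Optional: pin the judge LLM explicitly instead of relying on defaults.
# Assumes ragas with its LangChain integration and langchain-openai installed.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
# Later, pass `llm=judge_llm` to `evaluate(...)` so every metric uses this judge.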

First, we run our pipeline to get the outputs we want to score:

# 1. Run our pipeline on all questions
generated_data = []
for q in eval_questions:
    result = simple_rag_pipeline(q)
    generated_data.append(result)

# 2. Format the data for RAGAs
from datasets import Dataset

ragas_dataset_dict = {
    'question': eval_questions,
    'answer': [d['answer'] for d in generated_data],
    'contexts': [d['contexts'] for d in generated_data],
    'ground_truth': ground_truths
}

ragas_dataset = Dataset.from_dict(ragas_dataset_dict)

Now, we just hand this dataset to RAGAs and ask it to evaluate.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, context_recall, context_precision

# 3. Run the evaluation!
result = evaluate(
    dataset=ragas_dataset,
    metrics=[
        context_precision,  # Did we avoid retrieving irrelevant junk?
        context_recall,     # Did we retrieve all the right info?
        faithfulness,       # Did the answer stick to the context?
        answer_correctness  # Was the answer correct?
    ]
)

# 4. Display the results
print(result.to_pandas())

The output is a clean table of scores for each question, showing us exactly where our system is weak. For the "Minister for Magic" question, the context_recall score would be very low (close to 0.0): the necessary information simply isn't in our knowledge base, so the retriever couldn't surface it, and RAGAs correctly flags that gap.
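
A handy follow-up (a small sketch, not part of the pipeline above): average the per-question scores into a single scorecard for the run, which is the number you will compare across pipeline variants in Step 3. In recent RAGAs versions the DataFrame columns are named after the metrics.

# Sketch: collapse per-question scores into one scorecard for this run.
df = result.to_pandas()
metric_cols = ["context_precision", "context_recall", "faithfulness", "answer_correctness"]
print(df[metric_cols].mean().round(2))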

Step 3: Proving an optimization works (A/B testing)

This is where evaluation becomes a superpower. Let's make a change and prove if it's better.

Hypothesis: Retrieving more documents (n_results=3 instead of 2) will improve our answers.

We create an advanced_rag_pipeline that's identical, but retrieves 3 documents. We run it, evaluate it, and compare the results.
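
Here's a sketch of how that comparison can be wired up, reusing the questions, ground truths, dataset format, and metrics from the earlier steps (run_and_score is a hypothetical helper introduced just for this comparison):

# A/B comparison sketch. advanced_rag_pipeline is the k=3 variant:
# identical to simple_rag_pipeline except n_results=3.
def advanced_rag_pipeline(question):
    retrieved_docs = collection.query(query_texts=[question], n_results=3)['documents'][0]
    context = "\n".join(retrieved_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return {"answer": call_llm(prompt), "contexts": retrieved_docs}

def run_and_score(pipeline):
    # Run the pipeline on every test question, then score it with RAGAs.
    outputs = [pipeline(q) for q in eval_questions]
    ds = Dataset.from_dict({
        'question': eval_questions,
        'answer': [o['answer'] for o in outputs],
        'contexts': [o['contexts'] for o in outputs],
        'ground_truth': ground_truths,
    })
    scores = evaluate(dataset=ds, metrics=[context_precision, context_recall,
                                           faithfulness, answer_correctness])
    return scores.to_pandas()

baseline = run_and_score(simple_rag_pipeline)    # k=2
advanced = run_and_score(advanced_rag_pipeline)  # k=3
print(baseline["answer_correctness"].round(2))
print(advanced["answer_correctness"].round(2))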

Question                        | RAG (k=2) answer_correctness | RAG (k=3) answer_correctness
"What caused the war to end?"   | 0.95                         | 0.96
"What are the Curses?"          | 0.51                         | 0.92
"Who was the Minister?"         | 0.88                         | 0.88

Conclusion: Look at that! For the "Curses" question, retrieving only 2 documents wasn't enough, so the answer was incomplete (score: 0.51). Our "advanced" pipeline, retrieving 3 documents, got all the needed context and scored a 0.92.

We have just quantitatively shown that our change (k=3) is a measurable improvement.

Key takeaways

  • If you can't measure it, you can't improve it: Evaluation is the core discipline of building production AI. Stop "feeling" and start measuring.
  • The RAG triad is your guide: Focus on your Retrieval Quality (Context Precision/Recall) and your Generation Quality (Faithfulness/Correctness) to get a complete picture.
  • Frameworks automate judging: Tools like RAGAs automate the complex, expensive task of LLM-based evaluation, letting you test and iterate rapidly.
  • Evaluation is comparative: The true power of evaluation is in A/B testing—proving that a change to your system leads to a measurable improvement.

For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
