RAG Optimization: Speed, Cost, and Quality

Param Harrison
6 min read

In our last post, we built a powerful, multi-hop RAG agent. It's smart, it's complex, and it's... slow. And expensive.

Every part of our agent's "brain" (the planner, the executor, the generator) runs on one big, powerful LLM, so every single step is a costly, high-latency API call.

This post is for you if you've ever built a powerful agent and then faced the hard questions from your team:

  • "Why is this so slow?" (Latency)
  • "Why is our OpenAI bill so high?" (Cost)

Today, we'll learn how to optimize our RAG agent by balancing the eternal triangle: Speed, Quality, and Cost.

The problem: The "One-Size-Fits-All" engine

We're building a race car. We're using a massive, 800-horsepower V8 engine (like GPT-4o) to power everything—the wheels, the windshield wipers, and the radio.

graph TD
    A[Query] --> B(Router GPT-4o: $5.00/M)
    B --> C(Retrieve)
    C --> D(Grade Docs GPT-4o: $5.00/M)
    D --> E(Generate Answer GPT-4o: $5.00/M)
    E --> F[Answer]
    
    style B fill:#ffebee,stroke:#b71c1c
    style D fill:#ffebee,stroke:#b71c1c
    style E fill:#ffebee,stroke:#b71c1c

Why this is bad:

  • Cost: The Router and Grader nodes are simple classification tasks ("web" vs. "vector", "yes" vs. "no"). Using our most expensive model for these is an incredible waste of money.
  • Speed (Latency): The Router has to finish before anything else can even start. Using a slow, powerful model here adds 2-3 seconds of "dead air" for the user.

The solution: "Asymmetric" agent design

We need to stop using one engine. A production-grade agent uses an asymmetric model stack.

  • Nano Models (e.g., gpt-4o-mini, Phi-3-mini, Llama-3-8B):
    • Jobs: Simple, structured tasks.
    • Use for: Routing, Grading, Data Extraction.
    • Gives us: ⚡️ Speed and 💰 Low Cost.
  • Power Models (e.g., GPT-4o, Claude 3.5 Sonnet):
    • Jobs: Complex, creative, nuanced tasks.
    • Use for: Final Answer Generation.
    • Gives us: Quality.

Let's redesign our agent's brain:

graph TD
    A[Query] --> B(Router GPT-4o-mini: $0.15/M)
    B --> C(Retrieve)
    C --> D(Grade Docs GPT-4o-mini: $0.15/M)
    D --> E(Generate Answer GPT-4o: $5.00/M)
    E --> F[Answer]
    
    style B fill:#e8f5e9,stroke:#388e3c
    style D fill:#e8f5e9,stroke:#388e3c
    style E fill:#e3f2fd,stroke:#0d47a1

Observation: The routing and grading calls now cost $0.15/M tokens instead of $5.00/M, a roughly 97% price drop for those steps, and they run on a much faster model. How much your total bill shrinks depends on where your tokens go, but in grading-heavy pipelines the savings are dramatic. The agent also feels faster: everything before the Generate step (routing, retrieval, grading) now finishes quickly, so the one slow, expensive call is the only wait the user notices.
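To see where the savings come from, here is a back-of-the-envelope cost check. The per-query token counts below are made-up, illustrative assumptions (input tokens only); swap in numbers from your own traces. The prices are the ones from the diagrams above.

# Input-token prices (USD per 1M tokens), from the diagrams above
PRICES = {"gpt-4o": 5.00, "gpt-4o-mini": 0.15}

# Hypothetical input tokens consumed per query by each step
steps = {"route": 300, "grade": 20_000, "generate": 3_000}

def cost_per_query(model_for_step):
    return sum(steps[s] * PRICES[m] / 1_000_000 for s, m in model_for_step.items())

all_gpt4o = cost_per_query({"route": "gpt-4o", "grade": "gpt-4o", "generate": "gpt-4o"})
asymmetric = cost_per_query({"route": "gpt-4o-mini", "grade": "gpt-4o-mini", "generate": "gpt-4o"})

print(f"All GPT-4o:  ${all_gpt4o:.4f} per query")
print(f"Asymmetric:  ${asymmetric:.4f} per query")
print(f"Savings:     {100 * (1 - asymmetric / all_gpt4o):.0f}%")

With these particular assumptions the total bill drops by roughly 85%; the heavier your grading step is relative to generation, the closer the total gets to the 97% per-token price cut.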

How to optimize your RAG pipeline (3 techniques)

Let's look at three practical ways to optimize for Speed, Cost, and Quality.

1. Speed: Parallel retrieval (Async)

Problem: In our multi-hop agent, we wait for the internal search to finish, then we (maybe) do a web search. This is sequential and slow.

Solution: Do both at the same time.

graph TD
    subgraph SEQUENTIAL["Sequential (Slow)"]
        A[Query] --> B(Internal RAG)
        B --> C(Web Search)
        C --> D[Answer]
    end
    
    subgraph PARALLEL["Parallel (Fast)"]
        E[Query] --> F(Run in Parallel)
        F --> G[Internal RAG]
        F --> H[Web Search]
        G & H --> I(Combine & Rank)
        I --> J[Answer]
    end
    
    style A fill:#ffebee,stroke:#b71c1c
    style E fill:#e8f5e9,stroke:#388e3c

The "How" (Python asyncio):

Instead of calling functions one by one, we use asyncio.gather to run them concurrently.

import asyncio

async def retrieve_internal(query):
    # ... (code for internal RAG)
    return internal_docs

async def retrieve_web(query):
    # ... (code for web search)
    return web_docs

async def run_parallel_retrieval(query):
    print("--- Starting parallel retrieval ---")
    
    # This runs BOTH functions at the same time
    results = await asyncio.gather(
        retrieve_internal(query),
        retrieve_web(query)
    )
    
    # 'results' will be [ [internal_docs], [web_docs] ]
    all_docs = results[0] + results[1]
    return all_docs
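If the surrounding code is synchronous (a plain script or a node that isn't async), you drive the coroutine with asyncio.run. A tiny usage sketch, reusing the run_parallel_retrieval function above:

# Both retrievers run concurrently, so the total wait is roughly
# max(internal_latency, web_latency) instead of their sum.
all_docs = asyncio.run(run_parallel_retrieval("What is Model-V?"))

In production you may also want to pass return_exceptions=True to asyncio.gather, so one failing retriever shows up as an exception object in the results list instead of aborting the whole retrieval step.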

2. Cost: Asymmetric models

Problem: Our Router and Grader nodes are using our most expensive GPT-4o model.

Solution: Explicitly tell those nodes to use a cheaper, faster model.

The "How":

When we define our nodes, we just pass in a different client or model name.

# Our "cheap and fast" client for simple tasks
cheap_client = OpenAI(model="gpt-4o-mini")

# Our "smart and slow" client for the final answer
smart_client = OpenAI(model="gpt-4o")

def route_query(state):
    # ... (prompt) ...
    # This call is FAST and CHEAP
    response = cheap_client.chat.completions.create(...)
    return ...

def grade_documents(state):
    # ... (prompt) ...
    # This call is ALSO FAST and CHEAP
    response = cheap_client.chat.completions.create(...)
    return ...

def generate(state):
    # ... (prompt) ...
    # This is our ONLY EXPENSIVE call
    response = smart_client.chat.completions.create(...)
    return {"generation": ...}

3. Quality: Re-ranking

Problem: We want to give the LLM the "best" context.

  • If we Retrieve(k=3), we might miss a key fact. (Low Recall)
  • If we Retrieve(k=20), we flood the LLM with 19 irrelevant docs. (Low Precision)

Solution: A 2-step process. First, retrieve many docs (for high recall), then use a second, lightweight "Re-Ranker" model to find the best ones (for high precision).

graph TD
    A[Query] --> B(1. Retrieve k=20)
    B --> C[20 Documents]
    C --> D(2. Re-Ranker Find best 3)
    D --> E[Top 3 Documents]
    E --> F(3. Generate)
    F --> G[Answer]
    
    style B fill:#e3f2fd,stroke:#0d47a1
    style D fill:#e0f2f1,stroke:#00695c

Observation: This "Retrieve-then-Rank" pattern is a standard for high-quality RAG. It gives the Generator the "best of both worlds": a wide search (high recall) and focused context (high precision).
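Here is a minimal sketch of the retrieve-then-rerank step, using a cross-encoder from the sentence-transformers library as the re-ranker. The checkpoint name and the retriever callable are placeholder choices for illustration, not requirements of the pattern.

from sentence_transformers import CrossEncoder

# A small, fast cross-encoder checkpoint; any cross-encoder re-ranker works here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_then_rerank(query, retriever, k_retrieve=20, k_final=3):
    # 1. Cast a wide net for recall
    candidates = retriever(query, k=k_retrieve)

    # 2. Score every (query, document) pair with the cross-encoder
    scores = reranker.predict([(query, doc) for doc in candidates])

    # 3. Keep only the highest-scoring docs for precision
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k_final]]

The cross-encoder is far too slow to score your whole corpus, but scoring 20 candidates is cheap, which is exactly why the two-stage split works.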

Challenge for you

  1. Use Case: Your agent's generate node (using GPT-4o) is still 90% of your total cost.
  2. The Problem: Many user questions are simple, like "What is Model-V?" They don't need GPT-4o's power.
  3. Your Task: How would you implement a "Cost-Saving" step before the generate node? (Hint: Think about a new "Grader" node. What would it grade? How could it route to different generator models?)

Key takeaways

  • Asymmetric model design saves money: Use cheap, fast models for simple tasks (routing, grading) and expensive models only for complex generation
  • Parallel retrieval reduces latency: Running multiple retrieval steps concurrently cuts total wait time significantly
  • Re-ranking improves quality: Retrieve many documents for recall, then re-rank to select the best few for precision
  • Optimize the critical path: The user perceives speed based on the longest sequential path—optimize that first
  • Cost and quality are trade-offs: Simple questions don't need expensive models; complex questions do—route accordingly

For more on system optimization, see our streaming at scale guide and our concurrency guide.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
