Production RAG: Handling Edge Cases and Failures

Param Harrison
6 min read

In our last post, we learned how to measure our RAG agent's quality. We built a "golden set" and used RAGAs to score our bot.

But this post is for you if you've ever moved a "working" demo to production, only to have it crash 10 minutes later.

In the real world, things break. APIs time out. Third-party services go down. LLMs get rate-limited. If your agent is just a "happy path" script, it's not a production system. It's a liability.

Today, we'll build a Resilient Agent that can handle real-world chaos using fallbacks, retries, and graceful degradation.

The problem: The "Brittle" agent

Our "happy path" agent works great... if the world is perfect. But what happens when our "web search" tool times out?

sequenceDiagram
    participant User
    participant Agent
    participant Web_Search_Tool
    
    User->>Agent: "What's the news on Project-Z?"
    activate Agent
    Agent->>Web_Search_Tool: search("Project-Z")
    activate Web_Search_Tool
    
    note right of Web_Search_Tool: ... (API times out after 30s) ...
    
    Web_Search_Tool-->>Agent: [X ERROR 504: Gateway Timeout]
    deactivate Web_Search_Tool
    
    Agent->>Agent: [CRASH]
    Agent-->>User: {"error": "Internal Server Error"}
    deactivate Agent

Why this is bad:

  • The User gets a broken app. This is the worst possible experience.
  • The Agent is brittle. A single, common network error brought down our entire system.

A production agent must be resilient. It needs a "Plan B."

The solution: A "Graceful Fallback" graph

We can't prevent all errors, but we can handle them. We will stop thinking in "chains" and start thinking in "graphs" with conditional logic.

We will build an agent with this logic:

  1. Try Tool A (e.g., our primary, high-quality paid search API).
  2. Did it work?
    • Yes: Great, go to "Generate Answer."
    • No (Timeout/Error): Don't crash. Go to "Plan B."
  3. Try Tool B (e.g., a free, lower-quality web search tool).
  4. Did that work?
    • Yes: Great, go to "Generate Answer" (with the Plan B data).
    • No: Don't crash. Go to "Plan C."
  5. Plan C: Generate a graceful failure message.

This is called Graceful Degradation.

graph TD
    A[Start] --> B(Try Tool A: Paid Search API)
    B --> C{Success?}
    C -- "Yes" --> D[Generate Answer]
    C -- "No (e.g., Timeout)" --> E(Try Tool B: Free Web Search)
    E --> F{Success?}
    F -- "Yes" --> D
    F -- "No (e.g., Timeout)" --> G[Generate Graceful Error: Sorry, I can't search right now.]
    D --> H[End]
    G --> H
    
    style B fill:#e3f2fd,stroke:#0d47a1
    style E fill:#fff8e1,stroke:#f57f17
    style G fill:#ffebee,stroke:#b71c1c
    style D fill:#e8f5e9,stroke:#388e3c
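Stripped of any framework, the flow above is just a chain of guarded calls. Here's a minimal, framework-free sketch with mock tools (the `answer` helper and the mock tools are illustrative names; the full LangGraph version follows below):

```python
def paid_search_tool(query):
    # Mock Plan A: fails whenever the query contains "fail"
    if "fail" in query:
        raise TimeoutError("API timed out after 30 seconds")
    return ["Fact from Paid API: ..."]

def free_search_tool(query):
    # Mock Plan B: always succeeds in this sketch
    return ["Fact from Free Search: ..."]

def answer(question):
    try:
        context = paid_search_tool(question)           # Plan A
    except Exception:
        try:
            context = free_search_tool(question)       # Plan B
        except Exception:
            return "Sorry, I can't search right now."  # Plan C
    return f"Answer based on: {context[0]}"

print(answer("What is Model-V?"))   # served by Plan A
print(answer("fail this query"))    # Plan A fails, served by Plan B
```

This nested try/except version works, but it doesn't scale: add a third tool or a retry policy and the nesting gets ugly fast. That's exactly why we'll express the same logic as a graph.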

The "How": Building fallbacks with LangGraph

We can build this exact logic using LangGraph. We'll define our "state" and our "nodes," but this time, our nodes will include try/except blocks.

Brick 1: The "Memory" (GraphState)

Our "memory" needs to hold the question and a context that might be filled by either Tool A or Tool B.

from typing import TypedDict, List, Optional

class GraphState(TypedDict):
    question: str
    context: List[str]
    error_message: Optional[str]  # To store what went wrong (None = no error)
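A note on how this state gets updated: each LangGraph node returns a partial dict, and the framework merges it into the running state. Conceptually it behaves like a dict merge (this is an illustration, not LangGraph internals):

```python
# The running state before a node executes
state = {"question": "What is Model-V?", "context": [], "error_message": None}

# A node returns only the keys it wants to update...
node_update = {"context": ["Fact from Paid API: ..."], "error_message": None}

# ...and the framework folds that partial update into the state.
state = {**state, **node_update}  # "question" is preserved; "context" is filled in
```

This is why the nodes below can return just `{"context": ..., "error_message": ...}` without worrying about the question being lost.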

Brick 2: The "Nodes" (with error handling)

Now, we build our nodes. This time, they don't just "run"; they "try to run."

# Our (fictional) tools
def paid_search_tool(query: str) -> List[str]:
    # This tool is great, but it might fail
    if "fail" in query: # A mock failure
        raise TimeoutError("API timed out after 30 seconds")
    return ["Fact from Paid API: ..."]

def free_search_tool(query: str) -> List[str]:
    # Our cheap fallback: lower quality, but (in this demo) it always succeeds
    return ["Fact from Free Search: ..."]

# --- Node 1: Try Tool A ---
def try_tool_a(state):
    print("---NODE: Trying Tool A (Paid Search)---")
    try:
        context = paid_search_tool(state["question"])
        return {"context": context, "error_message": None}
    except Exception as e:
        print(f"Tool A failed: {e}")
        return {"context": [], "error_message": str(e)}

# --- Node 2: Try Tool B (The Fallback) ---
def try_tool_b(state):
    print("---NODE: Trying Tool B (Free Search)---")
    # For this demo, our simple fallback tool always succeeds
    context = free_search_tool(state["question"])
    return {"context": context, "error_message": None}

# --- Node 3: The Final "Safety Net" ---
def generate_error_message(state):
    print("---NODE: All tools failed. Gracefully failing.---")
    return {"context": [f"I'm sorry, my search tools are currently offline. The error was: {state['error_message']}"]}

Observation: Our nodes are now "smart." They catch errors and update the GraphState instead of crashing the program.
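We promised retries as well as fallbacks at the top of this post. Many timeouts are transient, so before giving up on Tool A and routing to Plan B, it's often worth retrying it with exponential backoff. A minimal, framework-free sketch (`with_retries` and `flaky_search` are hypothetical names, not a LangGraph API):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn(); on failure, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller route to Plan B
            time.sleep(base_delay * (2 ** attempt))

# A mock tool that fails twice, then succeeds (a transient timeout)
calls = {"n": 0}
def flaky_search():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("API timed out")
    return ["Fact from Paid API: ..."]

result = with_retries(flaky_search)  # succeeds on the third attempt
```

Inside `try_tool_a`, you could wrap the `paid_search_tool` call in `with_retries` so the graph only falls back to Tool B once the retries are exhausted.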

Brick 3: The "Wires" (The conditional logic)

Now, we wire it all up in our LangGraph workflow.

from langgraph.graph import StateGraph, END

# --- The "Decision" Edges ---
def check_tool_a_success(state):
    # Did the first tool work?
    if state["error_message"] is None:
        return "generate" # Yes, go straight to the answer
    else:
        return "try_tool_b" # No, trigger Plan B

def check_tool_b_success(state):
    # In a real app we'd wrap Tool B in try/except too;
    # here we simply check whether it returned any context.
    if state["context"]:
        return "generate"
    else:
        return "fail_gracefully"

# --- Build the Graph ---
workflow = StateGraph(GraphState)
workflow.add_node("try_tool_a", try_tool_a)
workflow.add_node("try_tool_b", try_tool_b)
workflow.add_node("fail_gracefully", generate_error_message)
workflow.add_node("generate", ...) # Our final LLM generator node

# --- Set the Logic Flow ---
workflow.set_entry_point("try_tool_a")

# The first critical decision
workflow.add_conditional_edges(
    "try_tool_a",
    check_tool_a_success,
    {
        "generate": "generate",
        "try_tool_b": "try_tool_b"
    }
)

# The second critical decision
workflow.add_conditional_edges(
    "try_tool_b",
    check_tool_b_success,
    {
        "generate": "generate",
        "fail_gracefully": "fail_gracefully"
    }
)

# The final paths
workflow.add_edge("generate", END)
workflow.add_edge("fail_gracefully", "generate") # We still go to 'generate' to show the user the error

app = workflow.compile()

Result: We've built a resilient agent!

  • If we send {"question": "What is Model-V?"}, it follows try_tool_a -> generate.
  • If we send {"question": "fail this query"}, it follows try_tool_a -> (Fails) -> try_tool_b -> generate.

Our bot no longer crashes. It degrades gracefully.

Challenge for you

  1. Use Case: Our current logic retries any error.
  2. The Problem: What if try_tool_a fails with a 401 Unauthorized (Bad API Key) error? Retrying with try_tool_b is a waste of time and money; the real problem is our key.
  3. Your Task: How would you modify the check_tool_a_success logic so it doesn't fall back on a 401? (Hint: the function can return more than two route names. What if it returned "fail_fast" and you added a new node for that?)
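Try it yourself before reading on. One possible shape of the answer, assuming the error string carries the HTTP status (a simplification; real code would inspect a typed exception or status code):

```python
def check_tool_a_success(state):
    msg = state["error_message"]
    if msg is None:
        return "generate"      # Tool A worked
    if "401" in msg:
        return "fail_fast"     # config problem: a fallback won't help
    return "try_tool_b"        # transient error: try Plan B
```

You'd then register a `fail_fast` node (e.g., one that alerts on-call about the bad API key) and add it as a third target in the conditional edge mapping.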

Key takeaways

  • Production systems need error handling: Happy path code will fail in production—you must handle timeouts, rate limits, and service outages
  • Fallback strategies prevent crashes: When Tool A fails, gracefully try Tool B instead of crashing
  • Graceful degradation maintains UX: Even when tools fail, provide a helpful error message instead of a generic 500 error
  • Conditional edges enable resilience: LangGraph's conditional edges let you route based on success/failure, creating self-healing agents
  • Error state is part of state: Store error messages in your GraphState so downstream nodes can make informed decisions

For more on building resilient systems, see our concurrency and resilience guide.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
