Production RAG: Handling Edge Cases and Failures
In our last post, we learned how to measure our RAG agent's quality. We built a "golden set" and used RAGAs to score our bot.
But this post is for you if you've ever moved a "working" demo to production, only to have it crash 10 minutes later.
In the real world, things break. APIs time out. Third-party services go down. LLMs get rate-limited. If your agent is just a "happy path" script, it's not a production system. It's a liability.
Today, we'll build a Resilient Agent that can handle real-world chaos using fallbacks, retries, and graceful degradation.
The problem: The "Brittle" agent
Our "happy path" agent works great... if the world is perfect. But what happens when our "web search" tool times out?
```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant Web_Search_Tool
    User->>Agent: "What's the news on Project-Z?"
    activate Agent
    Agent->>Web_Search_Tool: search("Project-Z")
    activate Web_Search_Tool
    note right of Web_Search_Tool: ... (API times out after 30s) ...
    Web_Search_Tool-->>Agent: [ERROR 504: Gateway Timeout]
    deactivate Web_Search_Tool
    Agent->>Agent: [CRASH]
    Agent-->>User: {"error": "Internal Server Error"}
    deactivate Agent
```
Why this is bad:
- The User gets a broken app. This is the worst possible experience.
- The Agent is brittle. A single, common network error brought down our entire system.
A production agent must be resilient. It needs a "Plan B."
The solution: A "Graceful Fallback" graph
We can't prevent all errors, but we can handle them. We will stop thinking in "chains" and start thinking in "graphs" with conditional logic.
We will build an agent with this logic:
- Try Tool A (e.g., our primary, high-quality paid search API).
  - Did it work?
    - Yes: great, go to "Generate Answer."
    - No (timeout/error): don't crash. Go to "Plan B."
- Try Tool B (e.g., a free, less reliable web search tool).
  - Did that work?
    - Yes: great, go to "Generate Answer" (with the Plan B data).
    - No: don't crash. Go to "Plan C."
- Plan C: generate a graceful failure message.
This is called Graceful Degradation.
```mermaid
graph TD
    A[Start] --> B(Try Tool A: Paid Search API)
    B --> C{Success?}
    C -- "Yes" --> D[Generate Answer]
    C -- "No (e.g., Timeout)" --> E(Try Tool B: Free Web Search)
    E --> F{Success?}
    F -- "Yes" --> D
    F -- "No (e.g., Timeout)" --> G[Generate Graceful Error: Sorry, I can't search right now.]
    D --> H[End]
    G --> H
    style B fill:#e3f2fd,stroke:#0d47a1
    style E fill:#fff8e1,stroke:#f57f17
    style G fill:#ffebee,stroke:#b71c1c
    style D fill:#e8f5e9,stroke:#388e3c
```
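Before reaching for a framework, notice that the control flow above is just nested error handling. Here is a minimal, framework-free sketch of the same cascade (the two tool functions are illustrative stand-ins, with Plan A hard-wired to fail so you can see the fallback fire):

```python
from typing import List

def paid_search(query: str) -> List[str]:
    # Stand-in for our primary tool; simulate the Plan A failure
    raise TimeoutError("primary API timed out")

def free_search(query: str) -> List[str]:
    # Stand-in for the cheap backup tool
    return [f"Free result for: {query}"]

def search_with_fallback(query: str) -> List[str]:
    try:
        return paid_search(query)                        # Plan A
    except Exception:
        try:
            return free_search(query)                    # Plan B
        except Exception:
            return ["Sorry, I can't search right now."]  # Plan C

print(search_with_fallback("Project-Z"))  # -> ['Free result for: Project-Z']
```

This works, but the nesting gets ugly fast as you add tools. A graph with explicit nodes and conditional edges scales much better, which is what we build next.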
The "How": Building fallbacks with LangGraph
We can build this exact logic using LangGraph. We'll define our "state" and our "nodes," but this time, our nodes will include try/except blocks.
Brick 1: The "Memory" (GraphState)
Our "memory" needs to hold the question and a context that might be filled by either Tool A or Tool B.
```python
from typing import List, Optional, TypedDict

class GraphState(TypedDict):
    question: str
    context: List[str]
    error_message: Optional[str]  # To store what went wrong (None if no error)
```
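Note that a LangGraph node doesn't have to return the whole state: it returns only the keys it changed, and the graph merges that partial update into the existing state. Roughly like this (simulated with a plain dict merge, not the real LangGraph internals):

```python
state = {"question": "What is Model-V?", "context": [], "error_message": None}

# A node returns just the keys it updated...
node_update = {"context": ["Fact from Paid API: ..."], "error_message": None}

# ...and the graph merges the update into the state, roughly like this:
state = {**state, **node_update}

assert state["question"] == "What is Model-V?"  # untouched keys survive
assert state["context"] == ["Fact from Paid API: ..."]
```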
Brick 2: The "Nodes" (with error handling)
Now, we build our nodes. This time, they don't just "run"; they "try to run."
```python
# Our (fictional) tools
def paid_search_tool(query: str) -> List[str]:
    # This tool is great, but it might fail
    if "fail" in query:  # A mock failure
        raise TimeoutError("API timed out after 30 seconds")
    return ["Fact from Paid API: ..."]

def free_search_tool(query: str) -> List[str]:
    # This is our cheap, free fallback
    return ["Fact from Free Search: ..."]

# --- Node 1: Try Tool A ---
def try_tool_a(state: GraphState):
    print("---NODE: Trying Tool A (Paid Search)---")
    try:
        context = paid_search_tool(state["question"])
        return {"context": context, "error_message": None}
    except Exception as e:
        print(f"Tool A failed: {e}")
        return {"context": [], "error_message": str(e)}

# --- Node 2: Try Tool B (The Fallback) ---
def try_tool_b(state: GraphState):
    print("---NODE: Trying Tool B (Free Search)---")
    # Our simple fallback tool is very reliable
    context = free_search_tool(state["question"])
    return {"context": context, "error_message": None}

# --- Node 3: The Final "Safety Net" ---
def generate_error_message(state: GraphState):
    print("---NODE: All tools failed. Gracefully failing.---")
    return {"context": [f"I'm sorry, my search tools are currently offline. The error was: {state['error_message']}"]}
```
Observation: Our nodes are now "smart." They catch errors and update the GraphState instead of crashing the program.
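A nice side effect: because each node is a plain function that takes and returns dicts, you can unit-test the failure path without building the graph at all. A quick check (with minimal copies of the tool and node from above inlined so the snippet runs on its own):

```python
from typing import List

def paid_search_tool(query: str) -> List[str]:
    if "fail" in query:  # A mock failure
        raise TimeoutError("API timed out after 30 seconds")
    return ["Fact from Paid API: ..."]

def try_tool_a(state):
    try:
        context = paid_search_tool(state["question"])
        return {"context": context, "error_message": None}
    except Exception as e:
        return {"context": [], "error_message": str(e)}

# Happy path: error_message stays None
ok = try_tool_a({"question": "What is Model-V?"})
assert ok["error_message"] is None

# Failure path: the node catches the exception instead of raising it
bad = try_tool_a({"question": "fail this query"})
assert bad["context"] == [] and "timed out" in bad["error_message"]
```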
Brick 3: The "Wires" (The conditional logic)
Now, we wire it all up in our LangGraph workflow.
```python
from langgraph.graph import StateGraph, END

# --- The "Decision" Edges ---
def check_tool_a_success(state: GraphState):
    # Did the first tool work?
    if state["error_message"] is None:
        return "generate"  # Yes, go straight to the answer
    else:
        return "try_tool_b"  # No, trigger Plan B

def check_tool_b_success(state: GraphState):
    # (In a real app, we'd check again, but for this demo
    # we'll assume Tool B always works or we'll fail)
    if state["context"]:
        return "generate"
    else:
        return "fail_gracefully"

# --- Build the Graph ---
workflow = StateGraph(GraphState)
workflow.add_node("try_tool_a", try_tool_a)
workflow.add_node("try_tool_b", try_tool_b)
workflow.add_node("fail_gracefully", generate_error_message)
workflow.add_node("generate", ...)  # Our final LLM generator node

# --- Set the Logic Flow ---
workflow.set_entry_point("try_tool_a")

# The first critical decision
workflow.add_conditional_edges(
    "try_tool_a",
    check_tool_a_success,
    {
        "generate": "generate",
        "try_tool_b": "try_tool_b",
    },
)

# The second critical decision
workflow.add_conditional_edges(
    "try_tool_b",
    check_tool_b_success,
    {
        "generate": "generate",
        "fail_gracefully": "fail_gracefully",
    },
)

# The final paths
workflow.add_edge("generate", END)
workflow.add_edge("fail_gracefully", "generate")  # We still go to 'generate' to show the user the error

app = workflow.compile()
```
Result: We've built a resilient agent!
- If we send `{"question": "What is Model-V?"}`, it follows `try_tool_a` -> `generate`.
- If we send `{"question": "fail this query"}`, it follows `try_tool_a` -> (fails) -> `try_tool_b` -> `generate`.
Our bot no longer crashes. It degrades gracefully.
Challenge for you
- Use Case: Our current logic falls back to Plan B on any error.
- The Problem: What if `try_tool_a` fails with a 401 Unauthorized (Bad API Key) error? Falling back to `try_tool_b` is a waste of time and money; the real problem is our key.
- Your Task: How would you modify the `check_tool_a_success` logic to not fall back on a 401? (Hint: The function can return more than two strings. What if it returned `"fail_fast"` and you added a new node for that?)
Key takeaways
- Production systems need error handling: Happy path code will fail in production—you must handle timeouts, rate limits, and service outages
- Fallback strategies prevent crashes: When Tool A fails, gracefully try Tool B instead of crashing
- Graceful degradation maintains UX: Even when tools fail, provide a helpful error message instead of a generic 500 error
- Conditional edges enable resilience: LangGraph's conditional edges let you route based on success/failure, creating self-healing agents
- Error state is part of state: Store error messages in your GraphState so downstream nodes can make informed decisions
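One more tool from the intro we haven't shown: retries. For transient failures like timeouts and rate limits, it is often worth retrying the same tool with exponential backoff before falling through to Plan B. A minimal stdlib sketch (the flaky tool and the tiny delays are illustrative; in production you might use a library like tenacity instead):

```python
import time
from typing import Callable, List

def with_retries(fn: Callable[[str], List[str]], query: str,
                 max_attempts: int = 3, base_delay: float = 0.01) -> List[str]:
    """Call fn, retrying up to max_attempts times with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn(query)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the fallback logic take over
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, ...
    raise RuntimeError("unreachable")

# A mock tool that fails twice, then succeeds
calls = {"n": 0}
def flaky_search(query: str) -> List[str]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient timeout")
    return [f"Result for {query}"]

result = with_retries(flaky_search, "Project-Z")  # succeeds on the 3rd attempt
```

You could wrap `paid_search_tool` this way inside `try_tool_a`, so only errors that survive the retries trigger the fallback to Tool B.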
For more on building resilient systems, see our concurrency and resilience guide.
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.