RAG Optimization: Speed, Cost, and Quality

Param Harrison
6 min read

In our last post, we built a powerful, multi-hop RAG agent. It's smart, it's complex, and it's... slow. And expensive.

Every part of our agent's "brain" (the planner, the executor, the generator) runs on one big, powerful LLM, so every single step is a costly, high-latency API call.

This post is for you if you've ever built a powerful agent and then faced the hard questions from your team:

  • "Why is this so slow?" (Latency)
  • "Why is our OpenAI bill so high?" (Cost)

Today, we'll learn how to optimize our RAG agent by balancing the eternal triangle: Speed, Quality, and Cost.

The problem: The "One-Size-Fits-All" engine

We're building a race car. We're using a massive, 800-horsepower V8 engine (like GPT-4o) to power everything—the wheels, the windshield wipers, and the radio.

graph TD
    A[Query] --> B(Router GPT-4o: $5.00/M)
    B --> C(Retrieve)
    C --> D(Grade Docs GPT-4o: $5.00/M)
    D --> E(Generate Answer GPT-4o: $5.00/M)
    E --> F[Answer]
    
    style B fill:#ffebee,stroke:#b71c1c
    style D fill:#ffebee,stroke:#b71c1c
    style E fill:#ffebee,stroke:#b71c1c

Why this is bad:

  • Cost: The Router and Grader nodes are simple classification tasks ("web" vs. "vector", "yes" vs. "no"). Using our most expensive model for these is an incredible waste of money.
  • Speed (Latency): The Router has to finish before anything else can even start. Using a slow, powerful model here adds 2-3 seconds of "dead air" for the user.

The solution: "Asymmetric" agent design

We need to stop using one engine. A production-grade agent uses an asymmetric model stack.

  • Nano Models (e.g., gpt-4o-mini, Phi-3-mini, Llama-3-8B):
    • Jobs: Simple, structured tasks.
    • Use for: Routing, Grading, Data Extraction.
    • Gives us: ⚡️ Speed and 💰 Low Cost.
  • Power Models (e.g., GPT-4o, Claude 3.5 Sonnet):
    • Jobs: Complex, creative, nuanced tasks.
    • Use for: Final Answer Generation.
    • Gives us: Quality.

Let's redesign our agent's brain:

graph TD
    A[Query] --> B(Router GPT-4o-mini: $0.15/M)
    B --> C(Retrieve)
    C --> D(Grade Docs GPT-4o-mini: $0.15/M)
    D --> E(Generate Answer GPT-4o: $5.00/M)
    E --> F[Answer]
    
    style B fill:#e8f5e9,stroke:#388e3c
    style D fill:#e8f5e9,stroke:#388e3c
    style E fill:#e3f2fd,stroke:#0d47a1

Observation: The routing and grading calls now cost $0.15/M tokens instead of $5.00/M, a roughly 97% price drop for those steps, and they run on a much faster model. How much your total bill shrinks depends on where your tokens go, but in grading-heavy pipelines the savings are dramatic. The agent also feels faster: everything before the Generate step (routing, retrieval, grading) now finishes quickly, so the one slow, expensive call is the only wait the user notices.
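To see where the savings come from, here is a back-of-the-envelope cost check. The per-query token counts below are made-up, illustrative assumptions (input tokens only); swap in numbers from your own traces. The prices are the ones from the diagrams above.

# Input-token prices (USD per 1M tokens), from the diagrams above
PRICES = {"gpt-4o": 5.00, "gpt-4o-mini": 0.15}

# Hypothetical input tokens consumed per query by each step
steps = {"route": 300, "grade": 20_000, "generate": 3_000}

def cost_per_query(model_for_step):
    return sum(steps[s] * PRICES[m] / 1_000_000 for s, m in model_for_step.items())

all_gpt4o = cost_per_query({"route": "gpt-4o", "grade": "gpt-4o", "generate": "gpt-4o"})
asymmetric = cost_per_query({"route": "gpt-4o-mini", "grade": "gpt-4o-mini", "generate": "gpt-4o"})

print(f"All GPT-4o:  ${all_gpt4o:.4f} per query")
print(f"Asymmetric:  ${asymmetric:.4f} per query")
print(f"Savings:     {100 * (1 - asymmetric / all_gpt4o):.0f}%")

With these particular assumptions the total bill drops by roughly 85%; the heavier your grading step is relative to generation, the closer the total gets to the 97% per-token price cut.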

How to optimize your RAG pipeline (3 techniques)

Let's look at three practical ways to optimize for Speed, Cost, and Quality.

1. Speed: Parallel retrieval (Async)

Problem: In our multi-hop agent, we wait for the internal search to finish, then we (maybe) do a web search. This is sequential and slow.

Solution: Do both at the same time.

graph TD
    subgraph SEQUENTIAL["Sequential (Slow)"]
        A[Query] --> B(Internal RAG)
        B --> C(Web Search)
        C --> D[Answer]
    end
    
    subgraph PARALLEL["Parallel (Fast)"]
        E[Query] --> F(Run in Parallel)
        F --> G[Internal RAG]
        F --> H[Web Search]
        G & H --> I(Combine & Rank)
        I --> J[Answer]
    end
    
    style A fill:#ffebee,stroke:#b71c1c
    style E fill:#e8f5e9,stroke:#388e3c

The "How" (Python asyncio):

Instead of calling functions one by one, we use asyncio.gather to run them concurrently.

import asyncio

async def retrieve_internal(query):
    # ... (code for internal RAG)
    return internal_docs

async def retrieve_web(query):
    # ... (code for web search)
    return web_docs

async def run_parallel_retrieval(query):
    print("--- Starting parallel retrieval ---")
    
    # This runs BOTH functions at the same time
    results = await asyncio.gather(
        retrieve_internal(query),
        retrieve_web(query)
    )
    
    # 'results' will be [ [internal_docs], [web_docs] ]
    all_docs = results[0] + results[1]
    return all_docs
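If the surrounding code is synchronous (a plain script or a node that isn't async), you drive the coroutine with asyncio.run. A tiny usage sketch, reusing the run_parallel_retrieval function above:

# Both retrievers run concurrently, so the total wait is roughly
# max(internal_latency, web_latency) instead of their sum.
all_docs = asyncio.run(run_parallel_retrieval("What is Model-V?"))

In production you may also want to pass return_exceptions=True to asyncio.gather, so one failing retriever shows up as an exception object in the results list instead of aborting the whole retrieval step.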

2. Cost: Asymmetric models

Problem: Our Router and Grader nodes are using our most expensive GPT-4o model.

Solution: Explicitly tell those nodes to use a cheaper, faster model.

The "How":

When we define our nodes, we just pass in a different client or model name.

# Our "cheap and fast" client for simple tasks
cheap_client = OpenAI(model="gpt-4o-mini")

# Our "smart and slow" client for the final answer
smart_client = OpenAI(model="gpt-4o")

def route_query(state):
    # ... (prompt) ...
    # This call is FAST and CHEAP
    response = cheap_client.chat.completions.create(...)
    return ...

def grade_documents(state):
    # ... (prompt) ...
    # This call is ALSO FAST and CHEAP
    response = cheap_client.chat.completions.create(...)
    return ...

def generate(state):
    # ... (prompt) ...
    # This is our ONLY EXPENSIVE call
    response = smart_client.chat.completions.create(...)
    return {"generation": ...}

3. Quality: Re-ranking

Problem: We want to give the LLM the "best" context.

  • If we Retrieve(k=3), we might miss a key fact. (Low Recall)
  • If we Retrieve(k=20), we flood the LLM with 19 irrelevant docs. (Low Precision)

Solution: A 2-step process. First, retrieve many docs (for high recall), then use a second, lightweight "Re-Ranker" model to find the best ones (for high precision).

graph TD
    A[Query] --> B(1. Retrieve k=20)
    B --> C[20 Documents]
    C --> D(2. Re-Ranker Find best 3)
    D --> E[Top 3 Documents]
    E --> F(3. Generate)
    F --> G[Answer]
    
    style B fill:#e3f2fd,stroke:#0d47a1
    style D fill:#e0f2f1,stroke:#00695c

Observation: This "Retrieve-then-Rank" pattern is a standard for high-quality RAG. It gives the Generator the "best of both worlds": a wide search (high recall) and focused context (high precision).
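Here is a minimal sketch of the retrieve-then-rerank step, using a cross-encoder from the sentence-transformers library as the re-ranker. The checkpoint name and the retriever callable are placeholder choices for illustration, not requirements of the pattern.

from sentence_transformers import CrossEncoder

# A small, fast cross-encoder checkpoint; any cross-encoder re-ranker works here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_then_rerank(query, retriever, k_retrieve=20, k_final=3):
    # 1. Cast a wide net for recall
    candidates = retriever(query, k=k_retrieve)

    # 2. Score every (query, document) pair with the cross-encoder
    scores = reranker.predict([(query, doc) for doc in candidates])

    # 3. Keep only the highest-scoring docs for precision
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k_final]]

The cross-encoder is far too slow to score your whole corpus, but scoring 20 candidates is cheap, which is exactly why the two-stage split works.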

Challenge for you

  1. Use Case: Your agent's generate node (using GPT-4o) is still 90% of your total cost.
  2. The Problem: Many user questions are simple, like "What is Model-V?" They don't need GPT-4o's power.
  3. Your Task: How would you implement a "Cost-Saving" step before the generate node? (Hint: Think about a new "Grader" node. What would it grade? How could it route to different generator models?)

Key takeaways

  • Asymmetric model design saves money: Use cheap, fast models for simple tasks (routing, grading) and expensive models only for complex generation
  • Parallel retrieval reduces latency: Running multiple retrieval steps concurrently cuts total wait time significantly
  • Re-ranking improves quality: Retrieve many documents for recall, then re-rank to select the best few for precision
  • Optimize the critical path: The user perceives speed based on the longest sequential path—optimize that first
  • Cost and quality are trade-offs: Simple questions don't need expensive models; complex questions do—route accordingly

For more on system optimization, see our streaming at scale guide and our concurrency guide.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
