Architecting a CodeRabbit-like Code-Review AI Agent: The Orchestration Brain

Param Harrison

In Part 1, we built the "Eyes and Ears" of our system. We learned how to ingest events, filter noise, build context with GraphRAG, and route intelligently with Model Cascading.

Now we face the hardest problem in AI Engineering: Reliability.

Writing a script that calls an LLM once is easy. Writing an Agent that can navigate a complex codebase, handle flaky APIs, manage concurrent file reviews, and wait for human feedback without crashing? That requires a different kind of architecture.

In Part 2, we are building the Orchestration Brain. We will move beyond simple "chains" and build Durable Agentic Workflows that can run for minutes, hours, or even days.


The Failure Case: Agents are Fragile

When you build a complex agent (like a Code Reviewer), you are orchestrating a long-running process with many points of failure.

graph TD
    A[Simple Python Script] --> B[Review File 1]
    B --> C[Review File 2]
    C --> D[Review File 3]
    D --> E{Server Crashes}
    E --> F[All Progress Lost]
    F --> G[Start Over from File 1]
    
    B --> H[API Rate Limit]
    H --> I[Script Dies]
    I --> F
    
    C --> J[Invalid JSON Response]
    J --> K[Exception Thrown]
    K --> F
    
    style F fill:#ffebee,stroke:#b71c1c
    style G fill:#ffebee,stroke:#b71c1c

The Three Killers:

  1. Time: A deep review of multiple files might take several minutes. Standard HTTP requests time out after 30-60 seconds. Your agent needs to survive longer than a single request.

  2. Flakiness: The LLM might return valid JSON 99% of the time, but that 1% failure crashes your entire loop. You need resilience built into the architecture.

  3. Async State: Sometimes the agent needs to wait—for a rate limit to reset, for a database query to complete, or for a human to approve a sensitive action. Traditional scripts can't "pause" and "resume" efficiently.

Observation: If you run this in a standard Python script with a while True: loop, a single server restart kills your agent's memory. All progress is lost. You start from scratch, wasting time and money.


The Solution: Durable Workflow Orchestration

To build a robust Agentic Workflow, we stop writing "scripts" and start writing Workflows. We use an orchestration engine (like Temporal, Durable Task Framework, or similar tools).

This separates your agent into two parts:

  1. The Workflow (The Plan): A deterministic definition of what should happen. It is durable. If the server dies, it resumes exactly where it left off.

  2. The Activities (The Actions): The actual work (calling LLMs, fetching code, posting comments). These are volatile and retriable.
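Here is a minimal sketch of that split, assuming the Temporal Python SDK (temporalio); the activity names and payloads are illustrative, not taken from any real implementation:

from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


# Activities: the volatile, retriable actions (network and LLM calls live here).
@activity.defn
async def fetch_pr_context(pr_id: str) -> dict:
    # Illustrative stub: call GitHub / GraphRAG and return the diff plus related code.
    return {"files": []}


@activity.defn
async def review_file(file_context: dict) -> dict:
    # Illustrative stub: call the LLM and return structured findings.
    return {"findings": []}


# Workflow: the deterministic plan. No I/O here, only orchestration.
@workflow.defn
class ReviewWorkflow:
    @workflow.run
    async def run(self, pr_id: str) -> list:
        context = await workflow.execute_activity(
            fetch_pr_context,
            pr_id,
            start_to_close_timeout=timedelta(minutes=2),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        results = []
        for file_context in context["files"]:
            # Each completed activity is checkpointed; a crash resumes here,
            # not from the beginning.
            results.append(
                await workflow.execute_activity(
                    review_file,
                    file_context,
                    start_to_close_timeout=timedelta(minutes=5),
                    retry_policy=RetryPolicy(maximum_attempts=3),
                )
            )
        return results

The point is the separation: the activities do the risky I/O and can be retried freely, while the workflow function only sequences them, so the orchestrator can replay it deterministically after a crash.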

Architecture: The Durable Brain

graph TD
    subgraph Orchestrator["The Orchestrator (The Brain)"]
        Workflow[Workflow Definition]
        State[(Durable State Store)]
        Workflow <-->|Checkpoints| State
    end
    
    subgraph Workers["Worker Nodes (The Body)"]
        Worker1[Worker 1]
        Worker2[Worker 2]
        Worker3[Worker 3]
        
        Act1[Activity: Fetch Context]
        Act2[Activity: LLM Call]
        Act3[Activity: Post Comment]
        
        Worker1 --> Act1
        Worker2 --> Act2
        Worker3 --> Act3
    end
    
    Workflow -->|Commands| Worker1
    Workflow -->|Commands| Worker2
    Workflow -->|Commands| Worker3
    
    Act1 -->|Results| Workflow
    Act2 -->|Results| Workflow
    Act3 -->|Results| Workflow
    
    style State fill:#e3f2fd,stroke:#0d47a1
    style Workflow fill:#fff9c4,stroke:#fbc02d

How it Works:

  1. Workflow as Code: You write your agent's logic as a workflow function. Each step is an "activity" that can be retried independently.

  2. Automatic Checkpointing: After each activity completes, the orchestrator saves the state to durable storage. If the server crashes, it replays from the last checkpoint.

  3. Distributed Execution: Workers can be on different machines. The orchestrator coordinates them, ensuring each activity runs exactly once (or retries on failure).

Observation: This is fundamentally different from a script. Your agent becomes immortal. It can survive crashes, network partitions, and API outages. The orchestrator handles all the complexity of state management and recovery.

The State Machine View

stateDiagram-v2
    [*] --> EventReceived
    EventReceived --> FilterNoise
    FilterNoise --> BuildContext
    BuildContext --> ParallelReview
    
    state ParallelReview {
        [*] --> ReviewFile1
        [*] --> ReviewFile2
        [*] --> ReviewFile3
        
        ReviewFile1 --> [*]
        ReviewFile2 --> [*]
        ReviewFile3 --> [*]
    }
    
    ParallelReview --> AggregateResults
    AggregateResults --> PostComment
    PostComment --> [*]
    
    note right of ParallelReview
        Each file review is independent
        Failures only retry that file
    end note

Think About It: Traditional scripts are like cooking without a recipe—if you get interrupted, you forget where you were. Workflows are like having a recipe with checkboxes. You can walk away, come back hours later, and pick up exactly where you left off.


Pattern 1: The "Parallel Swarm" (Concurrent Agent Execution)

The Architectural Problem:

You have a Pull Request with many changed files.

  • Option A (Serial): The agent reviews File 1, then File 2, then File 3... This takes too long. User experience suffers.

  • Option B (Naive Parallel): You fire many async calls to the LLM. You hit rate limits immediately. The system crashes or gets throttled.

The Solution:

We use a Fan-Out / Fan-In pattern. The Workflow acts as a commander, spawning multiple "Sub-Agents" (Activities) to review files in parallel, all managed by the orchestrator's concurrency controls.

The Architecture

graph TD
    Start[PR Event] --> Planner[Planner Agent]
    Planner --> Analyze[Analyze Changed Files]
    Analyze --> FanOut{Fan-Out}
    
    subgraph Swarm["The Swarm (Parallel Execution)"]
        FanOut --> A1[Review Agent: File A]
        FanOut --> A2[Review Agent: File B]
        FanOut --> A3[Review Agent: File C]
        FanOut --> A4[Review Agent: File D]
        FanOut --> A5[Review Agent: File E]
        
        A1 --> LLM1[LLM Call]
        A2 --> LLM2[LLM Call]
        A3 --> LLM3[LLM Call]
        A4 --> LLM4[LLM Call]
        A5 --> LLM5[LLM Call]
    end
    
    LLM1 --> FanIn[Fan-In Aggregator]
    LLM2 --> FanIn
    LLM3 --> FanIn
    LLM4 --> FanIn
    LLM5 --> FanIn
    
    FanIn --> Summary[Summary Agent]
    Summary --> Post[Post Review]
    
    style Swarm fill:#e8f5e9,stroke:#388e3c

How the AI Agent Works Here:

  1. Planner Agent: Analyzes the PR and creates a list of files to review. This becomes the "work queue."

  2. Fan-Out: The orchestrator spawns parallel activities—one for each file. Each runs independently in its own worker.

  3. Concurrency Control: The orchestrator respects limits. If you set "max 10 concurrent reviews," it queues the rest. You never hit rate limits.

  4. Partial Success: If File C's review fails (API timeout), the orchestrator only retries File C. Files A, B, D, and E stay completed.

  5. Fan-In: Once all reviews complete, the aggregator collects results and synthesizes a final summary.
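The orchestrator handles the queueing and per-file retries for you, but the shape of the pattern is easy to see in plain Python. A minimal asyncio sketch, where the review_file stub and the limit of 10 are illustrative assumptions:

import asyncio

MAX_CONCURRENT_REVIEWS = 10  # assumption: tune this to your LLM's rate limit


async def review_file(path: str) -> dict:
    # Illustrative stub: call the LLM for a single file here.
    return {"file": path, "findings": []}


async def review_pr(files: list) -> list:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REVIEWS)

    async def review_one(path: str) -> dict:
        async with semaphore:              # concurrency control: the rest queue up
            for attempt in range(3):       # partial success: retry only this file
                try:
                    return await review_file(path)
                except Exception:
                    await asyncio.sleep(2 ** attempt)
            return {"file": path, "findings": [], "error": "gave up"}

    # Fan-out: one task per file. Fan-in: gather collects every result.
    return await asyncio.gather(*(review_one(p) for p in files))

The semaphore is the fan-out throttle; gather is the fan-in.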

Concrete Example: Before vs. After

Without Orchestration (Naive Parallel):

graph TD
    A[Start] --> B[Fire 20 Async Calls]
    B --> C[Hit Rate Limit]
    C --> D[Some Calls Fail]
    D --> E[Unclear Which Ones]
    E --> F[Retry All 20]
    F --> C
    
    style C fill:#ffebee,stroke:#b71c1c
    style E fill:#ffebee,stroke:#b71c1c

Result: You waste API calls by retrying completed work. You hit rate limits repeatedly. Total time: unpredictable.

With Orchestration (Managed Parallel):

graph TD
    A[Start] --> B[Queue 20 Reviews]
    B --> C[Process 10 at a Time]
    C --> D[2 Fail]
    D --> E[Track: Files 5 and 12 Failed]
    E --> F[Retry Only 5 and 12]
    F --> G[All 20 Complete]
    
    style G fill:#e8f5e9,stroke:#388e3c

Result: You never hit rate limits. Failed reviews retry individually. Total time: predictable and optimal.

The Math:

| Approach | Concurrent Reviews | Rate Limit Hits | Wasted Retries | Total Time |
|---|---|---|---|---|
| Naive Parallel | Unlimited | Frequent | Many | Unpredictable |
| Orchestrated | Controlled | Never | Zero | Optimal |

Observation: The orchestrator acts like a traffic cop. It knows exactly which cars (reviews) have crossed the intersection (completed) and which need to retry. No wasted work, no rate limit chaos.

Think About It: How do you set the concurrency limit? Too low, and you're slow. Too high, and you hit rate limits. A good starting point is your API's sustainable throughput multiplied by the expected response time (Little's Law). If your LLM allows 100 requests/minute (about 1.7 per second) and each review takes 3 seconds, aim for ~5 concurrent reviews.
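Worked through with those numbers (a tiny sketch; the figures are just the example above):

# Rule of thumb: concurrency ≈ sustainable throughput × average latency
requests_per_minute = 100            # LLM rate limit from the example
avg_latency_seconds = 3              # typical time per review call

throughput_per_second = requests_per_minute / 60          # ≈ 1.67 req/s
max_concurrency = throughput_per_second * avg_latency_seconds
print(round(max_concurrency))        # ≈ 5 concurrent reviews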


Pattern 2: The "Self-Healing Loop" (Resilient Reasoning)

The Architectural Problem:

Agents often fail to output the correct format. You ask for JSON, it gives you Markdown. You ask for a security check, it gives you a style nitpick. LLMs are powerful but unreliable.

In a traditional script, you either:

  • Accept garbage output (bad user experience)
  • Throw an exception and crash (fragile system)

The Solution:

We build a Validation and Self-Correction Loop into the workflow. This is not just a retry; it's a reasoning retry with feedback.

The Architecture

stateDiagram-v2
    [*] --> GenerateOutput
    
    state "Self-Healing Loop" as Loop {
        GenerateOutput --> Validate
        
        state Validate {
            [*] --> CheckFormat
            CheckFormat --> CheckSchema
            CheckSchema --> CheckRelevance
        }
        
        Validate --> Success : All Pass
        Validate --> SelfCorrect : Any Fail
        
        state SelfCorrect {
            [*] --> AnalyzeError
            AnalyzeError --> BuildFeedback
            BuildFeedback --> RegenerateWithContext
        }
        
        RegenerateWithContext --> Validate
    }
    
    Success --> [*]
    
    note right of SelfCorrect
        Feed errors back to LLM
        "You produced invalid JSON.
        Here's what was wrong..."
    end note

How it Works:

  1. Generate: The agent produces output (e.g., a code review comment).

  2. Validate: Run checks:

    • Is it valid JSON?
    • Does it match the schema (required fields present)?
    • Is it relevant to the code change?
  3. Self-Correct: If validation fails, don't crash. Instead:

    • Capture the specific error ("Missing 'severity' field")
    • Build a feedback prompt: "Your previous output was missing the 'severity' field. Please regenerate."
    • Call the LLM again with the error context.
  4. Iterate: Repeat until valid or max attempts reached.

In a durable workflow, this loop can run indefinitely. If the API goes down during retry, the workflow pauses and resumes exactly where it was when the API comes back online.
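A minimal sketch of the loop in plain Python; call_llm stands in for whatever LLM client you use, and the schema mirrors the example below rather than any official format:

import json

SCHEMA_HINT = '{ "comment": string, "severity": string, "issues": [{"type": string, "severity": string, "line": number}] }'
REQUIRED_FIELDS = {"comment", "severity", "issues"}


def validate(raw: str):
    """Return (parsed, None) on success, or (None, error_description) on failure."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Output was not valid JSON: {exc}"
    if not isinstance(parsed, dict):
        return None, "Output was not a JSON object"
    missing = REQUIRED_FIELDS - parsed.keys()
    if missing:
        return None, f"Missing required fields: {sorted(missing)}"
    return parsed, None


def review_with_self_healing(call_llm, prompt: str, max_attempts: int = 3) -> dict:
    feedback = ""
    for _ in range(max_attempts):
        raw = call_llm(prompt + feedback)
        parsed, error = validate(raw)
        if parsed is not None:
            return parsed
        # Reasoning retry: feed the specific error back, not just "try again".
        feedback = (
            f"\n\nYour previous response:\n{raw}\n\n"
            f"Problem: {error}\nSchema: {SCHEMA_HINT}\n"
            "Please regenerate your analysis with the correct schema."
        )
    raise ValueError("Output failed validation after all attempts")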

Concrete Example: The Self-Correction Flow

Scenario: Ask the agent to find security vulnerabilities in a function.

Attempt 1 (Raw Output):

{
  "comment": "This code looks fine to me!",
  "issues": []
}

Validation Result: ❌ No severity field. Not helpful.

Feedback Prompt:

Your previous response:
{
  "comment": "This code looks fine to me!",
  "issues": []
}

Problem: You must include a top-level 'severity' field summarizing the review.
Schema: { "comment": string, "severity": string, "issues": [{"type": string, "severity": string, "line": number}] }

Please regenerate your analysis with the correct schema.

Attempt 2 (Corrected Output):

{
  "comment": "No critical security issues found.",
  "issues": [],
  "severity": "none"
}

Validation Result: ✅ Valid schema. Review approved.

The Resilience Comparison

| Approach | Invalid Output | System Response | User Impact |
|---|---|---|---|
| Naive Script | Crashes | Review never posted | Bad UX |
| Basic Retry | Retries blindly | Same error repeats | Wasted cost |
| Self-Healing Loop | Learns from error | Corrects and succeeds | Good UX |

Observation: The self-healing loop turns the LLM's unreliability into a solvable problem. By giving it feedback, you transform a 90% success rate into 99%+ after 2-3 iterations.

Think About It: How many retry attempts should you allow? Too few, and legitimate edge cases fail. Too many, and you waste tokens on lost causes. Most production systems use 3 attempts: the initial try, one correction, and one final attempt.


Pattern 3: The "Long Wait" (Human-in-the-Loop Workflows)

The Architectural Problem:

Sometimes, an agent shouldn't act alone. It finds a "Critical Security Vulnerability" in authentication code. You don't want it to auto-post that to a public PR. You want a human to verify it first.

In a traditional script, waiting for a human is impossible (you can't sleep() for days and keep the server running).

The Solution:

We use Signals and Durable State. The Workflow calculates a high severity score and enters an AwaitingApproval state. It essentially "hibernates" (saving its state to the database). It consumes zero compute while waiting.

When a human clicks "Approve" on your dashboard, a signal is sent to the workflow. It wakes up, remembers everything (the code, the analysis, the vulnerability details), and proceeds to post the comment.

The Architecture

graph TD
    Analysis[AI Analysis Complete] --> Severity{Check Severity}
    
    Severity -->|Low or Medium| AutoPost[Auto-Post Review]
    Severity -->|High or Critical| Hibernate[Hibernate Workflow]
    
    Hibernate --> WaitState[Waiting State Zero CPU]
    
    Human((Human Engineer)) -->|Reviews in Dashboard| Decision{Approve?}
    Decision -->|Yes| Signal1[Send Approve Signal]
    Decision -->|No| Signal2[Send Reject Signal]
    
    Signal1 --> WakeUp[Wake Up Workflow]
    Signal2 --> WakeUp
    
    WakeUp --> LoadState[Load Saved State]
    LoadState --> PostOrDiscard{Which Signal?}
    
    PostOrDiscard -->|Approved| Post[Post Review Comment]
    PostOrDiscard -->|Rejected| Discard[Discard Review]
    
    AutoPost --> End[End]
    Post --> End
    Discard --> End
    
    style WaitState fill:#fff9c4,stroke:#fbc02d
    style LoadState fill:#e3f2fd,stroke:#0d47a1

How it Works:

  1. Conditional Logic: After the AI completes its analysis, check severity. If critical, branch to the "human approval" path.

  2. Hibernate: The workflow enters a durable wait state. The orchestrator saves all variables (analysis results, file contents, PR metadata) to the database.

  3. Zero Cost: While hibernating, the workflow uses no CPU, no memory. It's just a record in a database.

  4. Human Action: An engineer reviews the finding in a web dashboard. They can see the full context because it's stored in the workflow state.

  5. Wake Up: The engineer clicks "Approve" or "Reject." This sends a signal to the orchestrator.

  6. Resume: The workflow wakes up with all its original state intact. It posts the comment or discards it based on the signal.
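A minimal sketch of the hibernate-and-wake step, again assuming the Temporal Python SDK; the activity and signal names here are illustrative:

from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def post_review_comment(finding: dict) -> None:
    ...  # illustrative stub: post the approved finding to the PR


@workflow.defn
class CriticalFindingWorkflow:
    def __init__(self) -> None:
        self.decision = None  # set to "approve" or "reject" by the signal

    @workflow.signal
    def submit_decision(self, decision: str) -> None:
        self.decision = decision

    @workflow.run
    async def run(self, finding: dict) -> str:
        # Hibernate: zero CPU while waiting; the orchestrator durably persists
        # `finding` and `self.decision`, for hours or days if needed.
        await workflow.wait_condition(lambda: self.decision is not None)
        if self.decision == "approve":
            await workflow.execute_activity(
                post_review_comment,
                finding,
                start_to_close_timeout=timedelta(minutes=1),
            )
            return "posted"
        return "discarded"

The dashboard's "Approve" button then delivers the signal through an ordinary Temporal client call against the workflow's handle; the workflow wakes with all of its original state intact.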

Concrete Example: The Timeline

Monday 9:00 AM:

  • PR submitted
  • AI detects "SQL Injection vulnerability in auth module"
  • Severity: Critical
  • Workflow hibernates

Monday 2:00 PM:

  • Engineer reviews finding
  • Confirms it's a real vulnerability
  • Clicks "Approve"

Monday 2:00:05 PM:

  • Workflow wakes up
  • Remembers all details from 5 hours ago
  • Posts detailed security review to PR

What happened in between: Zero compute usage. The workflow was just a row in a database.

The Cost Comparison

| Approach | Waiting Strategy | Cost While Waiting | State Preserved |
|---|---|---|---|
| Polling Script | Check every 10 seconds | High (constant CPU) | No (crashes lose it) |
| Webhook + Database | Manual state management | Low | Partial (complex code) |
| Durable Workflow | Native signals | Zero | Complete (automatic) |

Observation: Human-in-the-loop workflows transform your agent from a "fire and forget" script into a collaborative system. The agent can wait indefinitely for human input without consuming resources. This unlocks entirely new use cases.

Think About It: What else could your agent wait for? External API callbacks? Scheduled time delays? Multi-stage approvals from different teams? With durable workflows, all of these become trivial to implement.


Pattern 4: The "Saga Pattern" (Compensating Transactions)

The Architectural Problem:

Your agent performs a series of actions:

  1. Posts a "reviewing..." comment to GitHub
  2. Calls the LLM for analysis (3 minutes)
  3. Posts the final review

What happens if step 3 fails? You've already posted the "reviewing..." comment. Users see a broken experience: "Still reviewing..." forever.

The Solution:

We use the Saga Pattern—each action has a compensating action that "undoes" it if the workflow fails.

The Architecture

graph TD
    Start[Start Review] --> A1[Action 1: Post In Progress Comment]
    A1 --> A2[Action 2: Call LLM]
    A2 --> A3[Action 3: Post Final Review]
    A3 --> Success[Complete]
    
    A1 -.->|On Failure| C1[Compensate: Delete In Progress]
    A2 -.->|On Failure| C2[Compensate: Post Error Message]
    
    A2 -->|Fails| TriggerSaga[Trigger Rollback]
    TriggerSaga --> C1
    C1 --> C2
    C2 --> CleanExit[Clean Exit State]
    
    style TriggerSaga fill:#ffebee,stroke:#b71c1c
    style CleanExit fill:#fff8e1,stroke:#f57f17

How it Works:

  1. Register Compensations: For each action, register a compensating action:

    • Posted a comment? Register: "Delete that comment."
    • Created a draft? Register: "Delete the draft."
  2. Execute Forward: Run the workflow normally. If everything succeeds, ignore compensations.

  3. Rollback on Failure: If step N fails, execute compensations for steps 1 through N-1 in reverse order.

Result: Users never see broken state. The system either completes fully or cleans up after itself.
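A minimal sketch of a compensation stack; github and llm stand in for whatever clients you actually use:

class Saga:
    """Register an undo for each completed step; run them in reverse on failure."""

    def __init__(self) -> None:
        self._compensations = []

    def add(self, undo) -> None:
        self._compensations.append(undo)

    async def rollback(self) -> None:
        for undo in reversed(self._compensations):
            await undo()


async def run_review(github, llm, pr_id: str) -> None:
    saga = Saga()
    try:
        comment_id = await github.post_comment(pr_id, "🤖 Analyzing your changes...")
        saga.add(lambda: github.delete_comment(comment_id))       # compensation for step 1

        review = await llm.analyze(pr_id)                         # may fail or time out
        await github.post_comment(pr_id, review)
    except Exception:
        await saga.rollback()                                     # undo in reverse order
        await github.post_comment(
            pr_id, "⚠️ Review temporarily unavailable. We'll retry shortly."
        )
        raise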

Concrete Example: The Compensation Chain

Scenario: Review workflow fails during LLM call.

Forward Actions:

  1. ✅ Posted: "🤖 CodeReviewer is analyzing your changes..."
  2. ❌ LLM call times out
  3. ❌ Never reached

Compensation (Automatic Rollback):

  1. Delete the "analyzing..." comment
  2. Post: "⚠️ Review temporarily unavailable. We'll retry shortly."

User Experience: Instead of seeing a stale "analyzing..." message, users see a clear status update. Trust maintained.

Observation: The Saga Pattern ensures your agent leaves no trace of partial failures. This is critical for production systems where "cleanup" is as important as "doing."

Think About It: What if the compensation itself fails? (e.g., GitHub API is down, can't delete the comment.) Advanced systems implement exponential backoff for compensations and alert humans if cleanup repeatedly fails.


Putting It All Together: A Real Workflow

Let's trace a complete PR review through our orchestrated system.

Scenario: A developer pushes a PR with sensitive changes to payment processing code.

stateDiagram-v2
    [*] --> ReceiveWebhook
    ReceiveWebhook --> FilterFiles
    
    state FilterFiles {
        [*] --> CheckFileTypes
        CheckFileTypes --> DropNoise
        DropNoise --> [*]
    }
    
    FilterFiles --> BuildContext
    
    state BuildContext {
        [*] --> ParseAST
        ParseAST --> QueryGraph
        QueryGraph --> AssembleContext
        AssembleContext --> [*]
    }
    
    BuildContext --> ParallelReview
    
    state ParallelReview {
        state "Fan-Out" as FanOut
        state "Fan-In" as FanIn
        
        [*] --> FanOut
        FanOut --> ReviewFile1
        FanOut --> ReviewFile2
        FanOut --> ReviewFile3
        
        ReviewFile1 --> SelfHeal1
        ReviewFile2 --> SelfHeal2
        ReviewFile3 --> SelfHeal3
        
        SelfHeal1 --> FanIn
        SelfHeal2 --> FanIn
        SelfHeal3 --> FanIn
        
        FanIn --> [*]
    }
    
    ParallelReview --> CheckSeverity
    
    state CheckSeverity {
        [*] --> CalculateRisk
        CalculateRisk --> Critical
        Critical --> [*] : High Risk
        Critical --> [*] : Low Risk
    }
    
    CheckSeverity --> AwaitApproval : High Risk
    CheckSeverity --> PostReview : Low Risk
    
    state AwaitApproval {
        [*] --> HibernateWorkflow
        HibernateWorkflow --> WaitForSignal
        WaitForSignal --> WakeUp
        WakeUp --> [*]
    }
    
    AwaitApproval --> PostReview
    PostReview --> [*]

Timeline with State Preservation:

| Time | Event | Workflow State | CPU Usage |
|---|---|---|---|
| T+0s | Webhook received | ReceiveWebhook | Active |
| T+1s | Files filtered | FilterFiles | Active |
| T+3s | Context built | BuildContext | Active |
| T+5s | Reviews started (parallel) | ParallelReview | Active (3 workers) |
| T+12s | File 2 fails validation | SelfHeal2 (retry) | Active |
| T+18s | All reviews complete | CheckSeverity | Active |
| T+19s | Critical finding detected | AwaitApproval | Zero (hibernating) |
| T+2h | Human approves | WakeUp | Active |
| T+2h:01s | Review posted | PostReview | Active |
| T+2h:02s | Complete | [*] | Zero |

Total Active CPU Time: ~20 seconds (out of 2 hours)

Key Observations:

  1. Self-Healing: File 2's validation failure didn't crash the workflow. It self-corrected and continued.

  2. Parallel Efficiency: Three files reviewed simultaneously, reducing total time.

  3. Hibernation: During the 2-hour wait for human approval, the workflow used zero compute but retained complete state.

  4. Deterministic: If the server crashed at T+10s, the workflow would resume at ParallelReview with Files 1 and 3 already completed. Only File 2 would retry.


Challenge: Design Decisions for Your Orchestration

Challenge 1: The Timeout Strategy

Your LLM call is taking longer than expected. How long do you wait before considering it "failed"?

Options:

  1. Fixed Timeout: Fail after 30 seconds (simple but inflexible)
  2. Adaptive Timeout: Measure p95 latency, use 2x as timeout (smart but complex)
  3. No Timeout: Let it run forever (risky)
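Option 2 looks roughly like this (a sketch; the multiplier and the latency window you sample are assumptions to tune):

import statistics


def adaptive_timeout(recent_latencies_s: list, multiplier: float = 2.0) -> float:
    # p95 of recently observed latencies, scaled by a safety multiplier.
    p95 = statistics.quantiles(recent_latencies_s, n=20)[18]   # 95th percentile
    return multiplier * p95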

Your Task: For critical security reviews, would you use a longer timeout than for style checks? How do you balance thoroughness vs. responsiveness?

Challenge 2: The Retry Logic

An LLM call fails with "Rate Limit Exceeded." What's your retry strategy?

Options:

  1. Immediate Retry: Try again instantly (likely to fail again)
  2. Fixed Backoff: Wait 10 seconds, then retry (simple but not optimal)
  3. Exponential Backoff: Wait 1s, then 2s, then 4s, then 8s... (industry standard)
  4. Jittered Backoff: Exponential + random jitter to avoid thundering herd

Your Task: If many files hit rate limits simultaneously, how do you prevent all of them from retrying at the same time?
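Option 4, sketched in Python; RateLimitError stands in for whatever exception your LLM client raises. The random jitter is exactly what keeps simultaneously rate-limited files from retrying in lockstep:

import asyncio
import random


class RateLimitError(Exception):
    """Placeholder for your LLM client's rate-limit exception."""


async def call_with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimitError:
            # Exponential backoff with full jitter: each worker picks a random
            # delay in [0, base * 2^attempt], so retries never synchronize.
            await asyncio.sleep(random.uniform(0, base_delay * (2 ** attempt)))
    raise RuntimeError("Rate limited on every attempt")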

Challenge 3: The Concurrency Limit

You can run multiple file reviews in parallel, but your LLM API has rate limits.

Options:

  1. Fixed Limit: Always run exactly 10 concurrent reviews (safe but not optimal)
  2. Dynamic Limit: Monitor rate limit headers, adjust on the fly (optimal but complex)
  3. Priority Queue: Critical files get priority, style checks wait (fairness vs. urgency)

Your Task: How do you handle a mix of users? Large enterprises want fast reviews, small teams can wait longer. Do you use per-tenant concurrency quotas?


System Comparison: Script vs. Orchestrated Workflow

| Dimension | Traditional Script | Orchestrated Workflow |
|---|---|---|
| State Management | In-memory (lost on crash) | Durable (survives crashes) |
| Failure Handling | Retry all or crash | Retry only failed steps |
| Concurrency | Manual async code | Built-in parallelism |
| Long Waits | Impossible or polling | Native support (hibernation) |
| Debugging | Logs only | Full execution history |
| Cost (idle) | High (always running) | Zero (hibernates) |
| Complexity | Low (simple scripts) | Medium (new concepts) |
| Scalability | Single machine | Distributed workers |

graph TD
    subgraph Script["Traditional Script"]
        S1[Runs on One Server] --> S2[Keeps State in Memory]
        S2 --> S3[Crash = Lost Progress]
        S3 --> S4[Restart from Beginning]
        style S3 fill:#ffebee,stroke:#b71c1c
    end
    
    subgraph Workflow["Orchestrated Workflow"]
        W1[Runs on Worker Pool] --> W2[Keeps State in Database]
        W2 --> W3[Crash = Resume from Checkpoint]
        W3 --> W4[No Lost Progress]
        style W4 fill:#e8f5e9,stroke:#388e3c
    end

Key Architectural Patterns Summary

| Pattern | Problem Solved | Key Benefit | Implementation Complexity |
|---|---|---|---|
| Parallel Swarm | Slow serial processing | Concurrent execution with rate limit protection | Medium |
| Self-Healing Loop | Unreliable LLM outputs | Automatic validation and correction | Low |
| Long Wait (Signals) | Human-in-the-loop delays | Zero-cost hibernation for days | Low |
| Saga Pattern | Partial failures leave broken state | Automatic compensation and cleanup | High |

Discussion Points for Engineers

1. The Idempotency Challenge

Your workflow retries a failed activity. But what if the activity actually succeeded, and only the response was lost? You might post the same comment twice.

Questions:

  • How do you make your activities idempotent? (Same input = same output, safe to repeat)
  • Do you generate unique IDs for each review and check "Did I already post this?"
  • What about external side effects (posting to GitHub)? How do you ensure exactly-once delivery?
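One common answer to the first two questions, sketched below; store and github are assumed interfaces, and note that this alone is still not exactly-once unless the external API honors the key as well:

import hashlib


def idempotency_key(pr_id: str, file_path: str, commit_sha: str) -> str:
    # Deterministic: the same review input always produces the same key.
    return hashlib.sha256(f"{pr_id}:{file_path}:{commit_sha}".encode()).hexdigest()


async def post_once(store, github, pr_id: str, file_path: str, sha: str, body: str) -> None:
    key = idempotency_key(pr_id, file_path, sha)
    if await store.exists(key):        # already posted on a previous attempt
        return
    # A crash between the next two lines can still double-post; true
    # exactly-once delivery needs the external API to accept the key too.
    await github.post_comment(pr_id, body)
    await store.set(key, "posted")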

2. The Debugging Problem

A workflow failed 3 hours ago. How do you debug it?

Questions:

  • Do you store every LLM input/output for replay?
  • How much logging is too much? (Storage costs vs. debuggability)
  • Can you "time travel" and replay a workflow from any checkpoint?

3. The Version Migration Problem

You're running workflows that can last days. You need to deploy a new version of your code. What happens to in-flight workflows?

Questions:

  • Do you let old workflows finish with old code?
  • Do you force-migrate them to new code mid-execution?
  • How do you handle schema changes in your state store?

What's Next in Part 3: Cost Optimization & Intelligence

We've built a durable, scalable, resilient orchestration brain. But we haven't talked about cost optimization and advanced intelligence patterns.

In Part 3, we'll dive into:

graph LR
    A[Part 2: Orchestration] --> B[Part 3: Cost and Intelligence]
    B --> C[Advanced Model Routing]
    B --> D[Caching Strategies]
    B --> E[Semantic Deduplication]
    B --> F[Prompt Optimization]
    B --> G[Observability and Metrics]
    
    style B fill:#e3f2fd,stroke:#0d47a1

Key Questions We'll Answer:

  • How do you cache LLM responses without losing accuracy?
  • What if two developers submit the same PR twice? Can you reuse the review?
  • How do you detect when your prompts are degrading in quality?
  • What metrics actually matter for production AI agents?

Takeaways

The Three Pillars of Orchestration

graph TD
    A[Durable Orchestration] --> B[1. State Persistence]
    A --> C[2. Concurrency Control]
    A --> D[3. Failure Recovery]
    
    B --> E[Survive Crashes]
    C --> F[Scale Efficiently]
    D --> G[Self-Heal Errors]
    
    style A fill:#e3f2fd,stroke:#0d47a1
    style E fill:#e8f5e9,stroke:#388e3c
    style F fill:#e8f5e9,stroke:#388e3c
    style G fill:#e8f5e9,stroke:#388e3c

Key Insights

  • Scripts are fragile, workflows are immortal — A crash in a script loses everything. A crash in a workflow loses nothing. State persistence is the foundation of reliable AI agents.

  • Parallelism without orchestration is chaos — Naive parallel execution hits rate limits and retries completed work. Orchestrated parallelism is efficient and deterministic.

  • Self-healing is better than error handling — Instead of crashing on invalid output, feed the error back to the LLM. Turn failures into learning moments.

  • Human-in-the-loop is a feature, not a bug — Durable workflows can wait indefinitely for human approval, consuming zero resources. This unlocks collaborative AI agents.

  • Compensation is as important as action — Production systems must clean up after themselves. The Saga Pattern ensures no broken state is left behind.

The Reliability Matrix

| Pattern | Crash Resilience | Cost While Waiting | Partial Failure Handling |
|---|---|---|---|
| Script | ❌ None | 💰💰💰 High | ❌ Retry all |
| Basic Queue | ⚠️ Partial | 💰💰 Medium | ⚠️ Manual |
| Workflow Orchestration | ✅ Complete | 💰 Zero | ✅ Automatic |

The Winning Strategy: Start with basic queues for simple tasks. Graduate to workflow orchestration when you need durability, long-running processes, or human-in-the-loop interactions.


For more on building production AI systems at scale, check out our AI Bootcamp for Software Engineers.
