Architecting a CodeRabbit-like Code-Review AI Agent: The Orchestration Brain
In Part 1, we built the "Eyes and Ears" of our system. We learned how to ingest events, filter noise, build context with GraphRAG, and route intelligently with Model Cascading.
Now we face the hardest problem in AI Engineering: Reliability.
Writing a script that calls an LLM once is easy. Writing an Agent that can navigate a complex codebase, handle flaky APIs, manage concurrent file reviews, and wait for human feedback without crashing? That requires a different kind of architecture.
In Part 2, we are building the Orchestration Brain. We will move beyond simple "chains" and build Durable Agentic Workflows that can run for minutes, hours, or even days.
The Failure Case: Agents are Fragile
When you build a complex agent (like a Code Reviewer), you are orchestrating a long-running process with many points of failure.
graph TD
A[Simple Python Script] --> B[Review File 1]
B --> C[Review File 2]
C --> D[Review File 3]
D --> E{Server Crashes}
E --> F[All Progress Lost]
F --> G[Start Over from File 1]
B --> H[API Rate Limit]
H --> I[Script Dies]
I --> F
C --> J[Invalid JSON Response]
J --> K[Exception Thrown]
K --> F
style F fill:#ffebee,stroke:#b71c1c
style G fill:#ffebee,stroke:#b71c1c
The Three Killers:
1. Time: A deep review of multiple files might take several minutes. Standard HTTP requests time out after 30-60 seconds. Your agent needs to survive longer than a single request.
2. Flakiness: The LLM might return valid JSON 99% of the time, but that 1% failure crashes your entire loop. You need resilience built into the architecture.
3. Async State: Sometimes the agent needs to wait—for a rate limit to reset, for a database query to complete, or for a human to approve a sensitive action. Traditional scripts can't "pause" and "resume" efficiently.
Observation: If you run this in a standard Python script with a while True: loop, a single server restart kills your agent's memory. All progress is lost. You start from scratch, wasting time and money.
The Solution: Durable Workflow Orchestration
To build a robust Agentic Workflow, we stop writing "scripts" and start writing Workflows. We use an orchestration engine (like Temporal, Durable Task Framework, or similar tools).
This separates your agent into two parts:
1. The Workflow (The Plan): A deterministic definition of what should happen. It is durable. If the server dies, it resumes exactly where it left off.
2. The Activities (The Actions): The actual work (calling LLMs, fetching code, posting comments). These are volatile and retriable. (A minimal sketch of this split follows below.)
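For concreteness, here is a minimal sketch of that split using Temporal's Python SDK (one of the engines named above). The activity body is a placeholder, not a real reviewer:

```python
from datetime import timedelta
from temporalio import activity, workflow

# The Activity: volatile, retriable work (network calls, LLM calls).
@activity.defn
async def review_file(file_path: str) -> str:
    # Placeholder; in a real system this would call your LLM.
    return f"Review for {file_path}"

# The Workflow: a deterministic plan. Its state survives crashes.
@workflow.defn
class ReviewWorkflow:
    @workflow.run
    async def run(self, files: list[str]) -> list[str]:
        results: list[str] = []
        for path in files:
            # Each completed step is checkpointed; a crash resumes here, not at file 1.
            result = await workflow.execute_activity(
                review_file,
                path,
                start_to_close_timeout=timedelta(minutes=2),
            )
            results.append(result)
        return results
```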
Architecture: The Durable Brain
graph TD
subgraph Orchestrator["The Orchestrator (The Brain)"]
Workflow[Workflow Definition]
State[(Durable State Store)]
Workflow <-->|Checkpoints| State
end
subgraph Workers["Worker Nodes (The Body)"]
Worker1[Worker 1]
Worker2[Worker 2]
Worker3[Worker 3]
Act1[Activity: Fetch Context]
Act2[Activity: LLM Call]
Act3[Activity: Post Comment]
Worker1 --> Act1
Worker2 --> Act2
Worker3 --> Act3
end
Workflow -->|Commands| Worker1
Workflow -->|Commands| Worker2
Workflow -->|Commands| Worker3
Act1 -->|Results| Workflow
Act2 -->|Results| Workflow
Act3 -->|Results| Workflow
style State fill:#e3f2fd,stroke:#0d47a1
style Workflow fill:#fff9c4,stroke:#fbc02d
How it Works:
1. Workflow as Code: You write your agent's logic as a workflow function. Each step is an "activity" that can be retried independently.
2. Automatic Checkpointing: After each activity completes, the orchestrator saves the state to durable storage. If the server crashes, it replays from the last checkpoint.
3. Distributed Execution: Workers can be on different machines. The orchestrator coordinates them, ensuring each activity runs exactly once (or retries on failure).
Observation: This is fundamentally different from a script. Your agent becomes immortal. It can survive crashes, network partitions, and API outages. The orchestrator handles all the complexity of state management and recovery.
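Continuing the earlier sketch, the distributed execution piece might be wired up like this with Temporal's Python SDK (the server address and task-queue name are local-dev placeholders):

```python
import asyncio
from temporalio.client import Client
from temporalio.worker import Worker

async def main() -> None:
    # Connect to the orchestrator (address is a local-dev placeholder).
    client = await Client.connect("localhost:7233")

    # A worker node: polls the task queue and executes workflows and activities.
    worker = Worker(
        client,
        task_queue="code-review",
        workflows=[ReviewWorkflow],      # from the earlier sketch
        activities=[review_file],
    )
    async with worker:
        # Kick off a durable review; the orchestrator checkpoints every step.
        await client.execute_workflow(
            ReviewWorkflow.run,
            ["auth.py", "payments.py"],
            id="review-pr-42",
            task_queue="code-review",
        )

asyncio.run(main())
```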
The State Machine View
stateDiagram-v2
[*] --> EventReceived
EventReceived --> FilterNoise
FilterNoise --> BuildContext
BuildContext --> ParallelReview
state ParallelReview {
[*] --> ReviewFile1
[*] --> ReviewFile2
[*] --> ReviewFile3
ReviewFile1 --> [*]
ReviewFile2 --> [*]
ReviewFile3 --> [*]
}
ParallelReview --> AggregateResults
AggregateResults --> PostComment
PostComment --> [*]
note right of ParallelReview
Each file review is independent
Failures only retry that file
end note
Think About It: Traditional scripts are like cooking without a recipe—if you get interrupted, you forget where you were. Workflows are like having a recipe with checkboxes. You can walk away, come back hours later, and pick up exactly where you left off.
Pattern 1: The "Parallel Swarm" (Concurrent Agent Execution)
The Architectural Problem:
You have a Pull Request with many changed files.
- Option A (Serial): The agent reviews File 1, then File 2, then File 3... This takes too long. User experience suffers.
- Option B (Naive Parallel): You fire many async calls to the LLM. You hit rate limits immediately. The system crashes or gets throttled.
The Solution:
We use a Fan-Out / Fan-In pattern. The Workflow acts as a commander, spawning multiple "Sub-Agents" (Activities) to review files in parallel, but managed by the orchestrator's concurrency controls.
The Architecture
graph TD
Start[PR Event] --> Planner[Planner Agent]
Planner --> Analyze[Analyze Changed Files]
Analyze --> FanOut{Fan-Out}
subgraph Swarm["The Swarm (Parallel Execution)"]
FanOut --> A1[Review Agent: File A]
FanOut --> A2[Review Agent: File B]
FanOut --> A3[Review Agent: File C]
FanOut --> A4[Review Agent: File D]
FanOut --> A5[Review Agent: File E]
A1 --> LLM1[LLM Call]
A2 --> LLM2[LLM Call]
A3 --> LLM3[LLM Call]
A4 --> LLM4[LLM Call]
A5 --> LLM5[LLM Call]
end
LLM1 --> FanIn[Fan-In Aggregator]
LLM2 --> FanIn
LLM3 --> FanIn
LLM4 --> FanIn
LLM5 --> FanIn
FanIn --> Summary[Summary Agent]
Summary --> Post[Post Review]
style Swarm fill:#e8f5e9,stroke:#388e3c
How the AI Agent Works Here:
1. Planner Agent: Analyzes the PR and creates a list of files to review. This becomes the "work queue."
2. Fan-Out: The orchestrator spawns parallel activities—one for each file. Each runs independently in its own worker.
3. Concurrency Control: The orchestrator respects limits. If you set "max 10 concurrent reviews," it queues the rest. You never hit rate limits.
4. Partial Success: If File C's review fails (API timeout), the orchestrator only retries File C. Files A, B, D, and E stay completed.
5. Fan-In: Once all reviews complete, the aggregator collects results and synthesizes a final summary. (A sketch of this fan-out/fan-in step follows below.)
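A rough sketch of the fan-out/fan-in step, reusing the review_file activity from the earlier sketch and an asyncio semaphore as the concurrency cap (Temporal's Python SDK supports asyncio primitives in workflow code; the limit of 5 is illustrative):

```python
import asyncio
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

MAX_CONCURRENT_REVIEWS = 5   # illustrative cap derived from your rate limit

@workflow.defn
class SwarmReviewWorkflow:
    @workflow.run
    async def run(self, files: list[str]) -> list[str]:
        semaphore = asyncio.Semaphore(MAX_CONCURRENT_REVIEWS)

        async def review_one(path: str) -> str:
            async with semaphore:    # concurrency control: the rest wait in line
                return await workflow.execute_activity(
                    review_file,     # the activity from the earlier sketch
                    path,
                    start_to_close_timeout=timedelta(minutes=2),
                    retry_policy=RetryPolicy(maximum_attempts=3),
                )

        # Fan-out: one task per file. Fan-in: gather collects every result.
        # A failure in one file only retries that file's activity.
        return list(await asyncio.gather(*(review_one(p) for p in files)))
```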
Concrete Example: Before vs. After
Without Orchestration (Naive Parallel):
graph TD
A[Start] --> B[Fire 20 Async Calls]
B --> C[Hit Rate Limit]
C --> D[Some Calls Fail]
D --> E[Unclear Which Ones]
E --> F[Retry All 20]
F --> C
style C fill:#ffebee,stroke:#b71c1c
style E fill:#ffebee,stroke:#b71c1c
Result: You waste API calls by retrying completed work. You hit rate limits repeatedly. Total time: unpredictable.
With Orchestration (Managed Parallel):
graph TD
A[Start] --> B[Queue 20 Reviews]
B --> C[Process 10 at a Time]
C --> D[2 Fail]
D --> E[Track: Files 5 and 12 Failed]
E --> F[Retry Only 5 and 12]
F --> G[All 20 Complete]
style G fill:#e8f5e9,stroke:#388e3c
Result: You never hit rate limits. Failed reviews retry individually. Total time: predictable and optimal.
The Math:
| Approach | Concurrent Reviews | Rate Limit Hits | Wasted Retries | Total Time |
|---|---|---|---|---|
| Naive Parallel | Unlimited | Frequent | Many | Unpredictable |
| Orchestrated | Controlled | Never | Zero | Optimal |
Observation: The orchestrator acts like a traffic cop. It knows exactly which cars (reviews) have crossed the intersection (completed) and which need to retry. No wasted work, no rate limit chaos.
Think About It: How do you set the concurrency limit? Too low, and you're slow. Too high, and you hit rate limits. The answer: start with your API's rate limit divided by expected response time. If your LLM allows 100 requests/minute and each takes 3 seconds, aim for ~5 concurrent reviews.
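A quick back-of-the-envelope check of that arithmetic, using the illustrative numbers above:

```python
# Little's Law style estimate: concurrency ≈ throughput × latency
requests_per_minute = 100      # provider quota from the example above (illustrative)
seconds_per_request = 3        # average LLM latency from the example above

throughput_per_second = requests_per_minute / 60            # ≈ 1.67 requests/s
max_concurrent = throughput_per_second * seconds_per_request
print(round(max_concurrent))   # ≈ 5 concurrent reviews
```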
Pattern 2: The "Self-Healing Loop" (Resilient Reasoning)
The Architectural Problem:
Agents often fail to output the correct format. You ask for JSON, it gives you Markdown. You ask for a security check, it gives you a style nitpick. LLMs are powerful but unreliable.
In a traditional script, you either:
- Accept garbage output (bad user experience)
- Throw an exception and crash (fragile system)
The Solution:
We build a Validation and Self-Correction Loop into the workflow. This is not just a retry; it's a reasoning retry with feedback.
The Architecture
stateDiagram-v2
[*] --> GenerateOutput
state "Self-Healing Loop" as Loop {
GenerateOutput --> Validate
state Validate {
[*] --> CheckFormat
CheckFormat --> CheckSchema
CheckSchema --> CheckRelevance
}
Validate --> Success : All Pass
Validate --> SelfCorrect : Any Fail
state SelfCorrect {
[*] --> AnalyzeError
AnalyzeError --> BuildFeedback
BuildFeedback --> RegenerateWithContext
}
RegenerateWithContext --> Validate
}
Success --> [*]
note right of SelfCorrect
Feed errors back to LLM
"You produced invalid JSON.
Here's what was wrong..."
end note
How it Works:
1. Generate: The agent produces output (e.g., a code review comment).
2. Validate: Run checks:
   - Is it valid JSON?
   - Does it match the schema (required fields present)?
   - Is it relevant to the code change?
3. Self-Correct: If validation fails, don't crash. Instead:
   - Capture the specific error ("Missing 'severity' field")
   - Build a feedback prompt: "Your previous output was missing the 'severity' field. Please regenerate."
   - Call the LLM again with the error context.
4. Iterate: Repeat until valid or max attempts reached.
In a durable workflow, this loop can run indefinitely. If the API goes down during retry, the workflow pauses and resumes exactly where it was when the API comes back online.
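Here is a minimal sketch of the validate-and-correct loop in plain Python. The required fields mirror the example below, and call_llm is a hypothetical stand-in for your LLM client:

```python
import json

REQUIRED_FIELDS = {"comment", "severity", "issues"}   # illustrative schema
MAX_ATTEMPTS = 3

def validate(raw: str) -> list[str]:
    """Return a list of human-readable validation errors (empty means valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"Output is not valid JSON: {exc}"]
    missing = REQUIRED_FIELDS - data.keys()
    return [f"Missing required field(s): {sorted(missing)}"] if missing else []

def self_healing_review(call_llm, prompt: str) -> dict:
    """call_llm is any function str -> str; hypothetical stand-in for your LLM client."""
    feedback = ""
    for attempt in range(MAX_ATTEMPTS):
        raw = call_llm(prompt + feedback)
        errors = validate(raw)
        if not errors:
            return json.loads(raw)
        # Feed the specific errors back instead of retrying blindly.
        feedback = (
            "\n\nYour previous response was invalid:\n"
            + "\n".join(errors)
            + "\nPlease regenerate following the schema exactly."
        )
    raise ValueError(f"Output still invalid after {MAX_ATTEMPTS} attempts")
```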
Concrete Example: The Self-Correction Flow
Scenario: Ask the agent to find security vulnerabilities in a function.
Attempt 1 (Raw Output):
{
"comment": "This code looks fine to me!",
"issues": []
}
Validation Result: ❌ No severity field. Not helpful.
Feedback Prompt:
Your previous response:
{
"comment": "This code looks fine to me!",
"issues": []
}
Problem: You must include a top-level 'severity' field for the review.
Schema: { "comment": string, "severity": string, "issues": [{"type": string, "line": number}] }
Please regenerate your analysis with the correct schema.
Attempt 2 (Corrected Output):
{
"comment": "No critical security issues found.",
"issues": [],
"severity": "none"
}
Validation Result: ✅ Valid schema. Review approved.
The Resilience Comparison
| Approach | Invalid Output | System Response | User Impact |
|---|---|---|---|
| Naive Script | Crashes | Review never posted | Bad UX |
| Basic Retry | Retries blindly | Same error repeats | Wasted cost |
| Self-Healing Loop | Learns from error | Corrects and succeeds | Good UX |
Observation: The self-healing loop turns the LLM's unreliability into a solvable problem. By giving it feedback, you transform a 90% success rate into 99%+ after 2-3 iterations.
Think About It: How many retry attempts should you allow? Too few, and legitimate edge cases fail. Too many, and you waste tokens on lost causes. Most production systems use 3 attempts: the initial try, one correction, and one final attempt.
Pattern 3: The "Long Wait" (Human-in-the-Loop Workflows)
The Architectural Problem:
Sometimes, an agent shouldn't act alone. It finds a "Critical Security Vulnerability" in authentication code. You don't want it to auto-post that to a public PR. You want a human to verify it first.
In a traditional script, waiting for a human is impractical: you can't sleep() for days and expect the process to survive restarts, deploys, and crashes.
The Solution:
We use Signals and Durable State. The Workflow calculates a high severity score and enters an AwaitingApproval state. It essentially "hibernates" (saving its state to the database). It consumes zero compute while waiting.
When a human clicks "Approve" on your dashboard, a signal is sent to the workflow. It wakes up, remembers everything (the code, the analysis, the vulnerability details), and proceeds to post the comment.
The Architecture
graph TD
Analysis[AI Analysis Complete] --> Severity{Check Severity}
Severity -->|Low or Medium| AutoPost[Auto-Post Review]
Severity -->|High or Critical| Hibernate[Hibernate Workflow]
Hibernate --> WaitState[Waiting State Zero CPU]
Human((Human Engineer)) -->|Reviews in Dashboard| Decision{Approve?}
Decision -->|Yes| Signal1[Send Approve Signal]
Decision -->|No| Signal2[Send Reject Signal]
Signal1 --> WakeUp[Wake Up Workflow]
Signal2 --> WakeUp
WakeUp --> LoadState[Load Saved State]
LoadState --> PostOrDiscard{Which Signal?}
PostOrDiscard -->|Approved| Post[Post Review Comment]
PostOrDiscard -->|Rejected| Discard[Discard Review]
AutoPost --> End[End]
Post --> End
Discard --> End
style WaitState fill:#fff9c4,stroke:#fbc02d
style LoadState fill:#e3f2fd,stroke:#0d47a1
How it Works:
1. Conditional Logic: After the AI completes its analysis, check severity. If critical, branch to the "human approval" path.
2. Hibernate: The workflow enters a durable wait state. The orchestrator saves all variables (analysis results, file contents, PR metadata) to the database.
3. Zero Cost: While hibernating, the workflow uses no CPU, no memory. It's just a record in a database.
4. Human Action: An engineer reviews the finding in a web dashboard. They can see the full context because it's stored in the workflow state.
5. Wake Up: The engineer clicks "Approve" or "Reject." This sends a signal to the orchestrator.
6. Resume: The workflow wakes up with all its original state intact. It posts the comment or discards it based on the signal. (A sketch of this signal handling follows below.)
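A rough sketch of the hibernate-and-wake step using Temporal signals (workflow and signal names are illustrative; the dashboard wiring that sends the signal is not shown):

```python
from typing import Optional
from temporalio import workflow

@workflow.defn
class ApprovalGateWorkflow:
    def __init__(self) -> None:
        self._decision: Optional[str] = None   # "approve" or "reject"

    # The dashboard sends this signal when the engineer clicks a button.
    @workflow.signal
    def submit_decision(self, decision: str) -> None:
        self._decision = decision

    @workflow.run
    async def run(self, finding: dict) -> str:
        if finding["severity"] not in ("high", "critical"):
            return "auto-post"                 # low risk: no human needed
        # Durable wait: zero CPU until the signal arrives, state preserved.
        await workflow.wait_condition(lambda: self._decision is not None)
        return "post" if self._decision == "approve" else "discard"
```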
Concrete Example: The Timeline
Monday 9:00 AM:
- PR submitted
- AI detects "SQL Injection vulnerability in auth module"
- Severity: Critical
- Workflow hibernates
Monday 2:00 PM:
- Engineer reviews finding
- Confirms it's a real vulnerability
- Clicks "Approve"
Monday 2:00:05 PM:
- Workflow wakes up
- Remembers all details from 5 hours ago
- Posts detailed security review to PR
What happened in between: Zero compute usage. The workflow was just a row in a database.
The Cost Comparison
| Approach | Waiting Strategy | Cost While Waiting | State Preserved |
|---|---|---|---|
| Polling Script | Check every 10 seconds | High (constant CPU) | No (crashes lose it) |
| Webhook + Database | Manual state management | Low | Partial (complex code) |
| Durable Workflow | Native signals | Zero | Complete (automatic) |
Observation: Human-in-the-loop workflows transform your agent from a "fire and forget" script into a collaborative system. The agent can wait indefinitely for human input without consuming resources. This unlocks entirely new use cases.
Think About It: What else could your agent wait for? External API callbacks? Scheduled time delays? Multi-stage approvals from different teams? With durable workflows, all of these become trivial to implement.
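Scheduled delays, for example, are just durable timers in workflow code. In Temporal's Python SDK, asyncio.sleep inside a workflow is backed by a persisted timer, so a sketch like the one below (with a hypothetical reminder activity) survives worker restarts mid-wait:

```python
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class FollowUpWorkflow:
    @workflow.run
    async def run(self, pr_id: str) -> None:
        # Durable timer: persisted by the orchestrator, zero compute while waiting.
        await asyncio.sleep(24 * 60 * 60)      # nudge the author after 24 hours
        await workflow.execute_activity(
            "post_reminder_comment",           # hypothetical activity name
            pr_id,
            start_to_close_timeout=timedelta(seconds=30),
        )
```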
Pattern 4: The "Saga Pattern" (Compensating Transactions)
The Architectural Problem:
Your agent performs a series of actions:
1. Posts a "reviewing..." comment to GitHub
2. Calls the LLM for analysis (3 minutes)
3. Posts the final review
What happens if step 3 fails? You've already posted the "reviewing..." comment. Users see a broken experience: "Still reviewing..." forever.
The Solution:
We use the Saga Pattern—each action has a compensating action that "undoes" it if the workflow fails.
The Architecture
graph TD
Start[Start Review] --> A1[Action 1: Post In Progress Comment]
A1 --> A2[Action 2: Call LLM]
A2 --> A3[Action 3: Post Final Review]
A3 --> Success[Complete]
A1 -.->|On Failure| C1[Compensate: Delete In Progress]
A2 -.->|On Failure| C2[Compensate: Post Error Message]
A2 -->|Fails| TriggerSaga[Trigger Rollback]
TriggerSaga --> C1
C1 --> C2
C2 --> CleanExit[Clean Exit State]
style TriggerSaga fill:#ffebee,stroke:#b71c1c
style CleanExit fill:#fff8e1,stroke:#f57f17
How it Works:
1. Register Compensations: For each action, register a compensating action:
   - Posted a comment? Register: "Delete that comment."
   - Created a draft? Register: "Delete the draft."
2. Execute Forward: Run the workflow normally. If everything succeeds, ignore compensations.
3. Rollback on Failure: If step N fails, execute compensations for steps 1 through N-1 in reverse order. (A sketch of this pattern follows below.)
Result: Users never see broken state. The system either completes fully or cleans up after itself.
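A minimal sketch of the pattern in plain Python follows; the github and llm clients are hypothetical placeholders, and in a durable workflow each of these calls would be a retriable activity:

```python
from typing import Callable

class Saga:
    """Registers a compensating action after each successful step;
    on failure, runs them in reverse order."""

    def __init__(self) -> None:
        self._compensations: list[Callable[[], None]] = []

    def add_compensation(self, undo: Callable[[], None]) -> None:
        self._compensations.append(undo)

    def rollback(self) -> None:
        for undo in reversed(self._compensations):
            undo()

def run_review_saga(github, llm, pr_id: str) -> None:
    saga = Saga()
    try:
        # Action 1: post the in-progress comment, register its undo.
        comment_id = github.post_comment(pr_id, "🤖 Analyzing your changes...")
        saga.add_compensation(lambda: github.delete_comment(comment_id))

        # Action 2: the long LLM call (may time out and raise).
        review = llm.analyze(pr_id)

        # Action 3: post the final review.
        github.post_comment(pr_id, review)
    except Exception:
        saga.rollback()                        # deletes the "Analyzing..." comment
        github.post_comment(pr_id, "⚠️ Review temporarily unavailable. We'll retry shortly.")
        raise
```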
Concrete Example: The Compensation Chain
Scenario: Review workflow fails during LLM call.
Forward Actions:
- ✅ Posted: "🤖 CodeReviewer is analyzing your changes..."
- ❌ LLM call times out
- ❌ Never reached
Compensation (Automatic Rollback):
- Delete the "analyzing..." comment
- Post: "⚠️ Review temporarily unavailable. We'll retry shortly."
User Experience: Instead of seeing a stale "analyzing..." message, users see a clear status update. Trust maintained.
Observation: The Saga Pattern ensures your agent leaves no trace of partial failures. This is critical for production systems where "cleanup" is as important as "doing."
Think About It: What if the compensation itself fails? (e.g., GitHub API is down, can't delete the comment.) Advanced systems implement exponential backoff for compensations and alert humans if cleanup repeatedly fails.
Putting It All Together: A Real Workflow
Let's trace a complete PR review through our orchestrated system.
Scenario: A developer pushes a PR with sensitive changes to payment processing code.
stateDiagram-v2
[*] --> ReceiveWebhook
ReceiveWebhook --> FilterFiles
state FilterFiles {
[*] --> CheckFileTypes
CheckFileTypes --> DropNoise
DropNoise --> [*]
}
FilterFiles --> BuildContext
state BuildContext {
[*] --> ParseAST
ParseAST --> QueryGraph
QueryGraph --> AssembleContext
AssembleContext --> [*]
}
BuildContext --> ParallelReview
state ParallelReview {
state "Fan-Out" as FanOut
state "Fan-In" as FanIn
[*] --> FanOut
FanOut --> ReviewFile1
FanOut --> ReviewFile2
FanOut --> ReviewFile3
ReviewFile1 --> SelfHeal1
ReviewFile2 --> SelfHeal2
ReviewFile3 --> SelfHeal3
SelfHeal1 --> FanIn
SelfHeal2 --> FanIn
SelfHeal3 --> FanIn
FanIn --> [*]
}
ParallelReview --> CheckSeverity
state CheckSeverity {
[*] --> CalculateRisk
CalculateRisk --> Critical
Critical --> [*] : High Risk
Critical --> [*] : Low Risk
}
CheckSeverity --> AwaitApproval : High Risk
CheckSeverity --> PostReview : Low Risk
state AwaitApproval {
[*] --> HibernateWorkflow
HibernateWorkflow --> WaitForSignal
WaitForSignal --> WakeUp
WakeUp --> [*]
}
AwaitApproval --> PostReview
PostReview --> [*]
Timeline with State Preservation:
| Time | Event | Workflow State | CPU Usage |
|---|---|---|---|
| T+0s | Webhook received | ReceiveWebhook | Active |
| T+1s | Files filtered | FilterFiles | Active |
| T+3s | Context built | BuildContext | Active |
| T+5s | Reviews started (parallel) | ParallelReview | Active (3 workers) |
| T+12s | File 2 fails validation | SelfHeal2 (retry) | Active |
| T+18s | All reviews complete | CheckSeverity | Active |
| T+19s | Critical finding detected | AwaitApproval | Zero (hibernating) |
| T+2h | Human approves | WakeUp | Active |
| T+2h:01s | Review posted | PostReview | Active |
| T+2h:02s | Complete | [*] | Zero |
Total Active CPU Time: ~20 seconds (out of 2 hours)
Key Observations:
1. Self-Healing: File 2's validation failure didn't crash the workflow. It self-corrected and continued.
2. Parallel Efficiency: Three files reviewed simultaneously, reducing total time.
3. Hibernation: During the 2-hour wait for human approval, the workflow used zero compute but retained complete state.
4. Deterministic: If the server crashed at T+10s, the workflow would resume at ParallelReview with Files 1 and 3 already completed. Only File 2 would retry.
Challenge: Design Decisions for Your Orchestration
Challenge 1: The Timeout Strategy
Your LLM call is taking longer than expected. How long do you wait before considering it "failed"?
Options:
- Fixed Timeout: Fail after 30 seconds (simple but inflexible)
- Adaptive Timeout: Measure p95 latency, use 2x as timeout (smart but complex)
- No Timeout: Let it run forever (risky)
Your Task: For critical security reviews, would you use a longer timeout than for style checks? How do you balance thoroughness vs. responsiveness?
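As a sketch of the adaptive option, assuming you track recent activity latencies somewhere (the sample data and 2x multiplier are illustrative):

```python
import statistics

def adaptive_timeout(recent_latencies_s: list[float], multiplier: float = 2.0) -> float:
    """Timeout = multiplier x the p95 of recent latencies."""
    p95 = statistics.quantiles(recent_latencies_s, n=20)[18]   # 19 cut points; index 18 ≈ p95
    return multiplier * p95

# With this sample, p95 ≈ 37s, so the timeout lands around 75 seconds.
print(adaptive_timeout([8, 12, 15, 20, 22, 25, 30, 33, 35, 40]))
```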
Challenge 2: The Retry Logic
An LLM call fails with "Rate Limit Exceeded." What's your retry strategy?
Options:
- Immediate Retry: Try again instantly (likely to fail again)
- Fixed Backoff: Wait 10 seconds, then retry (simple but not optimal)
- Exponential Backoff: Wait 1s, then 2s, then 4s, then 8s... (industry standard)
- Jittered Backoff: Exponential + random jitter to avoid thundering herd
Your Task: If many files hit rate limits simultaneously, how do you prevent all of them from retrying at the same time?
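A tiny sketch of the jittered option (the "full jitter" variant; base and cap are illustrative, and many orchestrators let you configure retry policies declaratively instead of hand-rolling this):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: pick a random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Twenty files hitting the rate limit together will now retry at spread-out
# times instead of stampeding the API in lockstep.
print([round(backoff_delay(a), 2) for a in range(5)])
```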
Challenge 3: The Concurrency Limit
You can run multiple file reviews in parallel, but your LLM API has rate limits.
Options:
- Fixed Limit: Always run exactly 10 concurrent reviews (safe but not optimal)
- Dynamic Limit: Monitor rate limit headers, adjust on the fly (optimal but complex)
- Priority Queue: Critical files get priority, style checks wait (fairness vs. urgency)
Your Task: How do you handle a mix of users? Large enterprises want fast reviews, small teams can wait longer. Do you use per-tenant concurrency quotas?
System Comparison: Script vs. Orchestrated Workflow
| Dimension | Traditional Script | Orchestrated Workflow |
|---|---|---|
| State Management | In-memory (lost on crash) | Durable (survives crashes) |
| Failure Handling | Retry all or crash | Retry only failed steps |
| Concurrency | Manual async code | Built-in parallelism |
| Long Waits | Impossible or polling | Native support (hibernation) |
| Debugging | Logs only | Full execution history |
| Cost (idle) | High (always running) | Zero (hibernates) |
| Complexity | Low (simple scripts) | Medium (new concepts) |
| Scalability | Single machine | Distributed workers |
graph TD
subgraph Script["Traditional Script"]
S1[Runs on One Server] --> S2[Keeps State in Memory]
S2 --> S3[Crash = Lost Progress]
S3 --> S4[Restart from Beginning]
style S3 fill:#ffebee,stroke:#b71c1c
end
subgraph Workflow["Orchestrated Workflow"]
W1[Runs on Worker Pool] --> W2[Keeps State in Database]
W2 --> W3[Crash = Resume from Checkpoint]
W3 --> W4[No Lost Progress]
style W4 fill:#e8f5e9,stroke:#388e3c
end
Key Architectural Patterns Summary
| Pattern | Problem Solved | Key Benefit | Implementation Complexity |
|---|---|---|---|
| Parallel Swarm | Slow serial processing | Concurrent execution with rate limit protection | Medium |
| Self-Healing Loop | Unreliable LLM outputs | Automatic validation and correction | Low |
| Long Wait (Signals) | Human-in-the-loop delays | Zero-cost hibernation for days | Low |
| Saga Pattern | Partial failures leave broken state | Automatic compensation and cleanup | High |
Discussion Points for Engineers
1. The Idempotency Challenge
Your workflow retries a failed activity. But what if the activity actually succeeded, and only the response was lost? You might post the same comment twice.
Questions:
- How do you make your activities idempotent? (Same input = same output, safe to repeat)
- Do you generate unique IDs for each review and check "Did I already post this?"
- What about external side effects (posting to GitHub)? How do you ensure exactly-once delivery?
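One common answer to the first two questions, sketched below with hypothetical store and github clients: derive a deterministic idempotency key from the review's inputs and check it before performing the side effect.

```python
import hashlib

def idempotency_key(pr_id: str, file_path: str, review_body: str) -> str:
    # Same inputs always hash to the same key, so a retried activity can
    # detect that the work was already done even if the response was lost.
    return hashlib.sha256(f"{pr_id}:{file_path}:{review_body}".encode()).hexdigest()

def post_review_once(store, github, pr_id: str, file_path: str, review_body: str) -> None:
    key = idempotency_key(pr_id, file_path, review_body)
    if store.exists(key):            # already posted on a previous attempt
        return
    github.post_comment(pr_id, review_body)
    store.mark_done(key)             # note: without a transaction spanning both
                                     # calls, this is at-least-once, not exactly-once
```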
2. The Debugging Problem
A workflow failed 3 hours ago. How do you debug it?
Questions:
- Do you store every LLM input/output for replay?
- How much logging is too much? (Storage costs vs. debuggability)
- Can you "time travel" and replay a workflow from any checkpoint?
3. The Version Migration Problem
You're running workflows that can last days. You need to deploy a new version of your code. What happens to in-flight workflows?
Questions:
- Do you let old workflows finish with old code?
- Do you force-migrate them to new code mid-execution?
- How do you handle schema changes in your state store?
What's Next in Part 3: Cost Optimization & Intelligence
We've built a durable, scalable, resilient orchestration brain. But we haven't talked about cost optimization and advanced intelligence patterns.
In Part 3, we'll dive into:
graph LR
A[Part 2: Orchestration] --> B[Part 3: Cost and Intelligence]
B --> C[Advanced Model Routing]
B --> D[Caching Strategies]
B --> E[Semantic Deduplication]
B --> F[Prompt Optimization]
B --> G[Observability and Metrics]
style B fill:#e3f2fd,stroke:#0d47a1
Key Questions We'll Answer:
- How do you cache LLM responses without losing accuracy?
- What if two developers submit the same PR twice? Can you reuse the review?
- How do you detect when your prompts are degrading in quality?
- What metrics actually matter for production AI agents?
Takeaways
The Three Pillars of Orchestration
graph TD
A[Durable Orchestration] --> B[1. State Persistence]
A --> C[2. Concurrency Control]
A --> D[3. Failure Recovery]
B --> E[Survive Crashes]
C --> F[Scale Efficiently]
D --> G[Self-Heal Errors]
style A fill:#e3f2fd,stroke:#0d47a1
style E fill:#e8f5e9,stroke:#388e3c
style F fill:#e8f5e9,stroke:#388e3c
style G fill:#e8f5e9,stroke:#388e3c
Key Insights
1. Scripts are fragile, workflows are immortal — A crash in a script loses everything. A crash in a workflow loses nothing. State persistence is the foundation of reliable AI agents.
2. Parallelism without orchestration is chaos — Naive parallel execution hits rate limits and retries completed work. Orchestrated parallelism is efficient and deterministic.
3. Self-healing is better than error handling — Instead of crashing on invalid output, feed the error back to the LLM. Turn failures into learning moments.
4. Human-in-the-loop is a feature, not a bug — Durable workflows can wait indefinitely for human approval, consuming zero resources. This unlocks collaborative AI agents.
5. Compensation is as important as action — Production systems must clean up after themselves. The Saga Pattern ensures no broken state is left behind.
The Reliability Matrix
| Pattern | Crash Resilience | Cost While Waiting | Partial Failure Handling |
|---|---|---|---|
| Script | ❌ None | 💰💰💰 High | ❌ Retry all |
| Basic Queue | ⚠️ Partial | 💰💰 Medium | ⚠️ Manual |
| Workflow Orchestration | ✅ Complete | 💰 Zero | ✅ Automatic |
The Winning Strategy: Start with basic queues for simple tasks. Graduate to workflow orchestration when you need durability, long-running processes, or human-in-the-loop interactions.
For more on building production AI systems at scale, check out our AI Bootcamp for Software Engineers.