Architecting a CodeRabbit-like Code-Review AI Agent: The Intelligence Layer
In Part 1, we built the Ingestion Engine to handle webhook storms and filter noise.
In Part 2, we built the Orchestration Brain to manage long-running workflows reliably.
Now, we build the soul of the product: The Intelligence.
You have the code. You have the workflow. Now, how do you actually review it?
If you just send a git diff to a large language model with the prompt "Review this code," you will build a bad product.
- It will be too expensive: You are paying top-dollar to review typos.
- It will be too noisy: It will nag developers about missing docstrings in experimental code.
- It will be hallucination-prone: It will claim a variable is undefined because it can't see the import file.
In this final part, we architect the Intelligence Layer. We will build a multi-model system that mimics a Principal Engineer: it ignores the fluff, understands the broader context, and only speaks when it matters.
The Failure Case: The Naive Review
Before we dive into solutions, let's visualize what happens when you naively send every change to a large reasoning model.
graph TD
A[Developer Changes 20 Files] --> B[Send All to Large Model]
B --> C1[File 1: README typo]
B --> C2[File 2: Whitespace fix]
B --> C3[File 3: Comment update]
B --> C4[File 4-19: Similar trivial]
B --> C5[File 20: Critical auth bug]
C1 --> D[High Cost Review]
C2 --> D
C3 --> D
C4 --> D
C5 --> D
D --> E[50 Comments Generated]
E --> F[Developer Ignores All]
B --> G[Budget Explosion]
G --> H[Slow Response Time]
style F fill:#ffebee,stroke:#b71c1c
style G fill:#ffebee,stroke:#b71c1c
The Three Failures:
- Cost Disaster: You spent premium model cost on reviewing a README typo. The cost-to-value ratio is terrible.
- Signal Buried: The critical security bug in File 20 is buried under 49 nitpicks about formatting. The developer misses it.
- Developer Fatigue: After receiving 50 comments on their last PR, the developer starts ignoring all reviews. Trust erodes.
Observation: The naive approach treats all code changes equally. But a Principal Engineer knows: a typo fix and an authentication bypass are not the same. Your AI needs the same judgment.
Pattern 1: The "Model Cascade" (Cost & Speed Optimization)
The Architectural Problem:
Most code changes are trivial. Renaming a variable, fixing a typo, or updating documentation does not require frontier model reasoning. Using your most expensive model for everything destroys your unit economics.
The Solution:
We implement a Model Cascade. We use cheap, fast models to filter and categorize, and expensive models only for deep reasoning.
The Architecture
graph TD
Input[Code Change] --> Router{Router Agent}
Router -->|Trivial| FastPath[Fast Path]
Router -->|Complex| SlowPath[Slow Path]
subgraph FastPath["Fast Path Low Cost"]
F1[Static Analysis] --> F2[Rule-Based Checks]
F2 --> F3[Auto-Fix or Skip]
end
subgraph SlowPath["Slow Path High Quality"]
S1[Build Rich Context] --> S2[Large Reasoning Model]
S2 --> S3[Deep Analysis]
end
F3 --> Output[Review Comments]
S3 --> Output
style Router fill:#fff9c4,stroke:#fbc02d
How it Works:
1. The Router (The Triage Specialist): A small, fast model analyzes the diff and classifies it:
   - COSMETIC: Whitespace, formatting, comments
   - DOCS: README, documentation updates
   - LOGIC: Business logic, algorithms
   - SECURITY: Authentication, permissions, data handling
2. The Fast Path: For COSMETIC and DOCS changes:
   - Run static analysis tools (linters, formatters)
   - Apply deterministic rules
   - Auto-generate brief feedback or skip entirely
   - Cost: Near zero
3. The Slow Path: Only LOGIC and SECURITY changes trigger the expensive reasoning model:
   - Build comprehensive context (more on this in Pattern 2)
   - Use Chain of Thought reasoning
   - Generate detailed, thoughtful reviews
   - Cost: Higher, but justified
Observation: The router acts as a bouncer at a nightclub. It makes a split-second decision: "Fast lane or VIP treatment?" This single decision reduces your expensive model usage by 60-70%.
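To make the routing step concrete, here is a minimal Python sketch of a router. Everything in it is an assumption for illustration: the extension and keyword lists are placeholders to tune against your own repository history, and `classify_with_small_llm` stands in for a real call to a small, fast model.
```python
from enum import Enum


class Category(Enum):
    COSMETIC = "cosmetic"
    DOCS = "docs"
    LOGIC = "logic"
    SECURITY = "security"


# Hypothetical heuristics; tune these lists against your own labeled history.
DOC_EXTENSIONS = (".md", ".rst", ".txt")
SECURITY_HINTS = ("auth", "password", "token", "permission", "secret", "crypto")


def classify_with_small_llm(diff: str) -> Category:
    """Stand-in for a call to a small, fast model with few-shot examples.

    The sketch defaults to LOGIC so ambiguous changes still reach a real
    review instead of being silently skipped.
    """
    return Category.LOGIC


def route_change(file_path: str, diff: str) -> Category:
    """Cheap deterministic checks first; small-model fallback only when needed."""
    path = file_path.lower()
    if path.endswith(DOC_EXTENSIONS):
        return Category.DOCS
    if any(hint in path or hint in diff.lower() for hint in SECURITY_HINTS):
        return Category.SECURITY  # never let security-adjacent changes take the fast path
    changed = [line[1:].strip() for line in diff.splitlines() if line.startswith(("+", "-"))]
    if changed and all(line == "" or line.startswith("#") for line in changed):
        return Category.COSMETIC  # whitespace- or comment-only diff
    return classify_with_small_llm(diff)


if __name__ == "__main__":
    print(route_change("README.md", "- ## Setup Instructions\n+ ## Getting Started"))
    print(route_change("auth/login.py", "+    if hash(password) == user.password_hash:"))
```
The ordering is the design choice that matters: deterministic checks run first and cost nothing, and anything ambiguous falls through to the model rather than being skipped.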
Concrete Example: The Classification Decision
Let's walk through how the router classifies different changes.
Change 1: README Update
- ## Setup Instructions
+ ## Getting Started
Router Analysis:
- File type: .md
- Change type: Text only
- No code logic affected
- Classification: DOCS
- Route: Fast Path (skip or auto-comment)
Change 2: Authentication Logic
def login(username, password):
-   if password == user.password:
+   if hash(password) == user.password_hash:
        return create_session(user)
Router Analysis:
- File type: .py
- Function: login (security-sensitive)
- Change: Password handling modified
- Classification: SECURITY
- Route: Slow Path (deep reasoning required)
The Cost Breakdown
Let's compare the two approaches:
Approach 1: Brute Force (Everything to Large Model)
graph TD
A[100 Files Changed] --> B[All to Large Model]
B --> C[70 Trivial Changes]
B --> D[30 Complex Changes]
C --> E[High Cost for Low Value]
D --> F[High Cost for High Value]
E --> G[Total: High Cost]
F --> G
style E fill:#ffebee,stroke:#b71c1c
Approach 2: Model Cascade (Smart Routing)
graph TD
A[100 Files Changed] --> B{Router}
B --> C[70 Trivial Changes]
B --> D[30 Complex Changes]
C --> E[Static Analysis]
D --> F[Large Model]
E --> G[Minimal Cost]
F --> H[Justified Cost]
G --> I[Total: Much Lower Cost]
H --> I
style I fill:#e8f5e9,stroke:#388e3c
The Impact:
| Metric | Brute Force | Model Cascade | Improvement |
|---|---|---|---|
| Cost per PR | High | Low | 70% reduction |
| Response Time | Slow (all files queued) | Fast (parallel tiers) | 3x faster |
| Quality | Same for all | Focused on what matters | Better signal |
| Developer Trust | Low (too much noise) | High (thoughtful comments) | Higher adoption |
Think About It: How do you train the router? You could use a small fine-tuned model on historical data (labeled examples of "this was important" vs "this was a nitpick"). Or use a lightweight LLM with few-shot examples in the prompt. The key: keep it fast (sub-second) and cheap.
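If you take the few-shot LLM route, the router prompt itself can stay tiny. A rough sketch follows; the label set and example diffs are illustrative, not a tuned prompt.
```python
# Hypothetical few-shot examples; in practice these come from your labeled history.
FEW_SHOT_EXAMPLES = [
    ("- ## Setup Instructions\n+ ## Getting Started", "DOCS"),
    ("-    total = a+b\n+    total = a + b", "COSMETIC"),
    ("-    if password == user.password:\n+    if hash(password) == user.password_hash:", "SECURITY"),
]


def build_router_prompt(diff: str) -> str:
    """Builds a few-shot classification prompt for a small, fast model."""
    lines = ["Classify each diff as COSMETIC, DOCS, LOGIC, or SECURITY.", ""]
    for example_diff, label in FEW_SHOT_EXAMPLES:
        lines += ["Diff:", example_diff, f"Label: {label}", ""]
    lines += ["Diff:", diff, "Label:"]
    return "\n".join(lines)
```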
Pattern 2: Context Engineering with GraphRAG
The Architectural Problem:
The "Context Trap"—the most insidious problem in code review AI.
- Too little context: The AI hallucinates bugs because it can't see helper functions or imported utilities.
- Too much context: You hit the context window limit, or worse, you confuse the model with irrelevant code from unrelated files.
The Solution:
We don't just "stuff the prompt." We use Graph-Based Retrieval Augmented Generation (GraphRAG). We treat the codebase not as flat text files, but as a Dependency Graph.
The Architecture
graph TD
subgraph Input["The Change"]
Change[Developer modifies checkout.py]
end
subgraph Parser["AST Analysis"]
Change --> AST[Parse Abstract Syntax Tree]
AST --> Symbols[Extract Symbols]
Symbols --> Imports[Find Imports]
Symbols --> Calls[Find Function Calls]
end
subgraph Graph["Knowledge Graph Query"]
Imports --> Query1[Where is payment_service defined?]
Calls --> Query2[Where is calculate_tax defined?]
Query1 --> GraphDB[(Code Dependency Graph)]
Query2 --> GraphDB
GraphDB --> Snippets[Return Relevant Snippets]
end
subgraph Context["Context Assembly"]
Change --> Base[Base: Modified File]
Snippets --> Related[Related: Dependencies]
Base --> Final[Final Context Window]
Related --> Final
end
Final --> LLM[LLM Review]
style GraphDB fill:#fff9c4,stroke:#fbc02d
style Final fill:#e3f2fd,stroke:#0d47a1
How it Works:
1. Symbol Extraction: When reviewing checkout.py, we parse the AST to identify:
   - Imported modules: from payment_service import charge
   - Function calls: calculate_tax(amount)
   - Class instantiations: Auth.validate(token)
2. Graph Lookup: For each external symbol, we query our pre-built code index:
   - "Where is charge defined?"
   - "What's the signature of calculate_tax?"
   - "What does Auth.validate return?"
3. Smart Snippeting: Instead of dumping entire files, we fetch only:
   - Function signatures
   - Docstrings
   - Type annotations
   - Return types
4. Context Assembly: Build the final prompt:
   You are reviewing checkout.py.
   Modified code: [The actual diff]
   External dependencies used in this file:
   - charge(amount, card): Charges card. Returns {'success': bool, 'id': str}
   - calculate_tax(amount): Returns float. Raises ValueError if amount < 0
   - Auth.validate(token): Returns User object or None
Observation: This gives the AI "X-Ray Vision" into the codebase. It understands not just what changed, but how that change interacts with the rest of the system. This is the difference between a surface-level review and a deep architectural critique.
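As a sketch of the symbol-extraction step, Python's standard `ast` module is enough to pull out the imports and call sites a changed file depends on. The dependency-graph lookup those symbols feed into is whatever index you maintain; this sketch only shows the extraction half.
```python
import ast


def extract_symbols(source: str) -> dict[str, set[str]]:
    """Collects imported names and called function names from a Python file."""
    tree = ast.parse(source)
    imports: set[str] = set()
    calls: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                calls.add(func.id)
            elif isinstance(func, ast.Attribute):
                calls.add(func.attr)
    return {"imports": imports, "calls": calls}


if __name__ == "__main__":
    source = (
        "from payment_service import charge\n"
        "def process(order):\n"
        "    tax = calculate_tax(order.amount)\n"
        "    charge(order.amount + tax, order.card)\n"
    )
    print(extract_symbols(source))
    # {'imports': {'charge'}, 'calls': {'calculate_tax', 'charge'}}
```
Each extracted symbol then becomes a query against the dependency graph for a signature, docstring, or return-type snippet.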
Concrete Example: Preventing a False Positive
The Code Change:
# checkout.py
def process_order(order_id, user_token):
    user = Auth.validate(user_token)   # Reviewer might worry: What if this returns None?
    amount = calculate_total(order_id)
    tax = calculate_tax(amount)        # Reviewer might worry: What if amount is negative?
    charge(amount + tax, user.card)    # Reviewer might worry: What if user is None?
    return {"status": "success"}
Without GraphRAG (Naive Review):
graph LR
A[AI sees only checkout.py] --> B[No context on Auth.validate]
B --> C[Potential bug - user might be None]
C --> D[False Positive Comment]
style D fill:#ffebee,stroke:#b71c1c
AI Comment: ⚠️ "user could be None if token is invalid. This will cause an AttributeError on user.card."
Developer Response: 😤 "No, Auth.validate raises an exception for invalid tokens. Read the docs!"
With GraphRAG (Context-Aware Review):
graph LR
A[AI sees checkout.py] --> B[GraphRAG fetches Auth.validate signature]
B --> C[Auth.validate raises AuthError if invalid]
C --> D[No False Positive]
D --> E[Focuses on Real Issues]
style E fill:#e8f5e9,stroke:#388e3c
AI Comment: ✅ "Code looks good. Auth exception handling is upstream. Consider adding a try-catch for charge() failures."
Developer Response: 👍 "Good point, will add."
The Context Window Budget Strategy
With GraphRAG, we face a new problem: what if a function is called from many places?
graph TD
A[Function has many callers] --> B{Strategy}
B -->|Naive: Include All| C[Context Overflow]
B -->|Smart: Prioritize| D[Relevance Ranking]
D --> E[Same File: Priority 1]
D --> F[Same Module: Priority 2]
D --> G[Test Files: Priority 3]
D --> H[Distant Files: Priority 4]
E --> I[Include Top N]
F --> I
G --> I
H --> I
I --> J[Fit Context Window]
style C fill:#ffebee,stroke:#b71c1c
style J fill:#e8f5e9,stroke:#388e3c
The Ranking Algorithm:
| Caller Type | Priority | Reasoning |
|---|---|---|
| Same file | High | Most likely related to the change |
| Same module/package | Medium | Likely architectural dependency |
| Test files | Medium | Shows expected behavior |
| Distant modules | Low | Probably unaffected |
Result: We send the top 10 most relevant callers, not all of them. Quality over quantity.
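Expressed in code, the ranking table becomes a small scoring function. A minimal sketch, assuming POSIX-style repository paths; the priority values mirror the table and are meant to be tuned, not taken as measured.
```python
from pathlib import PurePosixPath


def caller_priority(changed_file: str, caller_file: str) -> int:
    """Lower number = more relevant, following the ranking table."""
    changed = PurePosixPath(changed_file)
    caller = PurePosixPath(caller_file)
    if caller == changed:
        return 1  # same file
    if caller.parent == changed.parent:
        return 2  # same module/package
    if caller.name.startswith("test_") or "tests" in caller.parts:
        return 3  # test files show expected behavior
    return 4      # distant modules


def select_callers(changed_file: str, callers: list[str], top_n: int = 10) -> list[str]:
    """Ranks callers by relevance and keeps only the top N for the context window."""
    ranked = sorted(callers, key=lambda caller: caller_priority(changed_file, caller))
    return ranked[:top_n]


if __name__ == "__main__":
    callers = ["checkout.py", "orders/billing.py", "tests/test_checkout.py", "analytics/report.py"]
    print(select_callers("checkout.py", callers, top_n=3))
```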
Think About It: Should you tell the AI that you're omitting some callers? Some teams add a note: "This function has 47 callers. Showing the 10 most relevant." This helps the AI understand it's working with a sample, not the complete picture.
Pattern 3: The "Signal-to-Noise" Filter (Developer Happiness)
The Architectural Problem:
If your AI leaves many comments on a PR, developers will ignore all of them. This is "Notification Fatigue." A human Principal Engineer doesn't comment on every missing space; they focus on what breaks the app or creates security risks.
The Solution:
We build a Feedback Aggregator that acts as a final filter before anything reaches the developer.
The Architecture
graph TD
LLM[LLM Generates Many Comments] --> Collector[Comment Collector]
Collector --> Dedup[Deduplication]
Dedup --> Score[Severity Scoring]
Score --> Gate{Severity Gate}
Gate -->|Critical or High| Inline[Post as Inline Comments]
Gate -->|Medium| Batch1[Batch into Summary Table]
Gate -->|Low or Nitpick| Batch2[Aggregate or Discard]
Inline --> GitHub1[GitHub: Block PR]
Batch1 --> GitHub2[GitHub: Summary Comment]
Batch2 --> GitHub2
style Inline fill:#ffebee,stroke:#b71c1c
style Batch1 fill:#fff8e1,stroke:#f57f17
style Batch2 fill:#e8f5e9,stroke:#388e3c
How it Works:
1. Deduplication: The AI often finds the same issue multiple times:
   - "Variable name is unclear" (mentioned 5 times)
   - Deduplicate to: "Variable naming could be improved in 5 locations"
2. Severity Scoring: Each comment gets a score (a small sketch of the gate follows this list):
   - Critical (5): Security vulnerability, data loss risk
   - High (4): Logic error, potential crash
   - Medium (3): Performance issue, poor error handling
   - Low (2): Code smell, minor improvement
   - Nitpick (1): Style preference, formatting
3. Smart Delivery:
   - Critical/High: Individual inline comments on specific lines (blocks merge)
   - Medium: Grouped into a summary table
   - Low/Nitpick: Aggregated into a single collapsible section
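A minimal sketch of that severity gate in Python. The `Finding` shape and the thresholds are assumptions for illustration; the severity scores are whatever your scoring pass assigns.
```python
from dataclasses import dataclass


@dataclass
class Finding:
    file: str
    line: int
    message: str
    severity: int  # 5=Critical, 4=High, 3=Medium, 2=Low, 1=Nitpick


def group_findings(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Routes each finding to inline comments, the summary table, or the collapsed section."""
    groups: dict[str, list[Finding]] = {"inline": [], "summary": [], "collapsed": []}
    for finding in findings:
        if finding.severity >= 4:        # Critical/High: inline comment, may block the merge
            groups["inline"].append(finding)
        elif finding.severity == 3:      # Medium: one row in the summary table
            groups["summary"].append(finding)
        else:                            # Low/Nitpick: collapsible section (or dropped)
            groups["collapsed"].append(finding)
    return groups
```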
Concrete Example: The Review Presentation
Scenario: AI analyzes a PR and generates these findings:
- 1 Critical: SQL injection vulnerability
- 2 High: Unhandled exceptions
- 5 Medium: Missing error messages
- 12 Low: Variable naming suggestions
- 8 Nitpick: Missing docstrings
Naive Delivery (28 separate comments):
graph TD
A[Developer Opens PR] --> B[Sees 28 Comments]
B --> C[Feels Overwhelmed]
C --> D[Ignores All]
D --> E[Merges Without Reading]
style E fill:#ffebee,stroke:#b71c1c
Smart Delivery (Filtered & Grouped):
graph TD
A[Developer Opens PR] --> B[Sees 3 Inline Critical Comments]
B --> C[Immediately Addresses Security Issue]
C --> D[Expands Summary Table]
D --> E[Reviews Medium Priority Items]
E --> F[Dismisses Nitpicks]
F --> G[Merges with Confidence]
style G fill:#e8f5e9,stroke:#388e3c
GitHub Comment Structure:
## 🚨 Critical Issues (Must Fix)
**Line 45:** SQL Injection vulnerability detected
```sql
query = f"SELECT * FROM users WHERE id = {user_id}"
```
Use parameterized queries to prevent SQL injection.
## ⚠️ Important Findings (5 items)
| File | Line | Issue | Severity |
|---|---|---|---|
| auth.py | 23 | Unhandled exception | High |
| api.py | 67 | Missing error message | Medium |
| ... 3 more rows ... |
## 💡 Code Quality Suggestions (20 items)
Low-priority improvements that can be addressed later...
The Impact:
| Delivery Style | Developer Action | PR Merge Time | Trust Level |
|---|---|---|---|
| 28 Separate Comments | Ignores all | Fast (risky) | Low |
| Smart Grouped | Addresses critical | Appropriate | High |
Observation: The aggregator transforms the AI from an annoying perfectionist into a helpful colleague. It respects the developer's attention by showing them what matters most.
Think About It: Should you let developers customize their threshold? Some teams want to see all nitpicks, others want only critical issues. Consider user preferences: "Show me: Critical + High only" vs. "Show me: Everything."
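If you expose that choice, it can be a per-team setting consumed by the delivery step. A tiny sketch; the field names and defaults are hypothetical.
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TeamReviewPrefs:
    min_inline_severity: int = 4   # Critical/High are posted as inline comments
    min_posted_severity: int = 3   # anything below this is suppressed entirely


def delivery_channel(severity: int, prefs: TeamReviewPrefs) -> Optional[str]:
    """Maps a finding's severity to a delivery channel under a team's preferences."""
    if severity >= prefs.min_inline_severity:
        return "inline"
    if severity >= prefs.min_posted_severity:
        return "summary"
    return None  # suppressed


# A team that wants to see everything, including nitpicks:
see_everything = TeamReviewPrefs(min_inline_severity=4, min_posted_severity=1)
```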
Pattern 4: The "Feedback Loop" (Continuous Improvement)
The Architectural Problem:
The AI will be wrong. It will claim a bug exists where there is none. It will suggest changes that break convention. Users will react negatively.
In a traditional system, you'd need to retrain the model. But you can't retrain a frontier model on custom feedback.
The Solution:
We capture user interaction to improve the system without retraining, using a RAG-Based Memory Pattern.
The Architecture
graph TD
subgraph Review["Review Generation"]
Generate[AI Generates Review] --> Post[Post to PR]
end
subgraph Feedback["User Feedback"]
Post --> UserReact{User Reaction}
UserReact -->|Thumbs Up| Positive[(Positive Examples DB)]
UserReact -->|Thumbs Down| Negative[(Negative Examples DB)]
UserReact -->|No Reaction| Neutral[Neutral]
end
subgraph Future["Future Reviews"]
NewReview[Next PR Review] --> CheckMemory[Query Feedback DBs]
CheckMemory --> Positive
CheckMemory --> Negative
Positive --> GoodPatterns[Reinforce Good Patterns]
Negative --> AvoidPatterns[Avoid Bad Patterns]
GoodPatterns --> EnhancedPrompt[Enhanced Prompt]
AvoidPatterns --> EnhancedPrompt
EnhancedPrompt --> BetterReview[Better Review]
end
style Negative fill:#ffebee,stroke:#b71c1c
style BetterReview fill:#e8f5e9,stroke:#388e3c
How it Works:
1. Capture Feedback: When a user marks a comment as "Not Helpful," we save:
   - The code snippet
   - The AI's comment
   - The user's feedback: "This is a false positive"
2. Build Memory: Store these examples in a vector database (sketched below):
   - positive_patterns: Comments users found helpful
   - negative_patterns: Comments users rejected
3. Query Before Review: Before generating a new review, query both databases:
   - "Does this code look similar to anything in negative_patterns?"
   - If yes: "Users previously disliked suggestions about X in similar code."
4. Inject Guardrails: Add to the prompt:
   Based on user feedback, avoid suggesting:
   - "Add error handling" for test files (users found this annoying)
   - "Use type hints" in legacy code (users rejected this)
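A minimal sketch of that memory, with a deliberately crude stand-in for the embedding model; swap in whatever embedding model and vector store you actually use, and treat the similarity threshold as an assumption to calibrate.
```python
import math


def embed(text: str) -> list[float]:
    """Placeholder for a real embedding model: a crude character-frequency vector."""
    vec = [0.0] * 128
    for ch in text:
        vec[ord(ch) % 128] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FeedbackMemory:
    """Stores rejected review comments and retrieves guardrails for similar code."""

    def __init__(self) -> None:
        self.negative: list[tuple[list[float], str]] = []  # (embedding, guardrail note)

    def record_rejection(self, code_snippet: str, guardrail: str) -> None:
        self.negative.append((embed(code_snippet), guardrail))

    def guardrails_for(self, code_snippet: str, threshold: float = 0.8) -> list[str]:
        query = embed(code_snippet)
        return [note for vec, note in self.negative if cosine(query, vec) >= threshold]


# Usage: inject any returned guardrails into the review prompt.
memory = FeedbackMemory()
memory.record_rejection(
    "with pytest.raises(AuthError):",
    "Do not suggest try/except blocks inside pytest.raises() contexts.",
)
print(memory.guardrails_for("with pytest.raises(PaymentError):"))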
Concrete Example: Learning from Mistakes
Week 1: AI makes a mistake
# test_auth.py
def test_login_failure():
    with pytest.raises(AuthError):
        login("bad", "credentials")  # AI comments: "Add try-catch here"
Developer feedback: 👎 "This is a test. pytest.raises expects an exception. This comment is wrong."
System learns: Saves to negative_patterns:
- Context: Test file using pytest.raises
- Bad suggestion: "Add try-catch"
- Reason: Misunderstands test framework
Week 2: Similar code appears
# test_payment.py
def test_invalid_card():
    with pytest.raises(PaymentError):
        charge(-100, card)  # System queries memory
Memory hit: "This looks like a test with pytest.raises. Users rejected exception handling suggestions in this context."
Enhanced prompt:
You are reviewing a test file.
Note: Do not suggest adding try-catch blocks inside pytest.raises() contexts.
Users have indicated this is not helpful.
Result: AI skips the bad comment. Developer sees only relevant feedback.
The Feedback Dashboard
For engineering teams, we build a dashboard to monitor AI quality:
graph TD
subgraph Metrics["Quality Metrics"]
A[Comments Posted 10000]
B[Thumbs Up 7200]
C[Thumbs Down 1800]
D[Ignored 1000]
end
A --> E[Helpfulness Rate 72 percent]
C --> F[Common Rejections]
F --> G[Top Rejected Patterns]
G --> H1[Type hints in old code]
G --> H2[Docstring nitpicks]
G --> H3[Import organization]
H1 --> I[Automatically Suppress]
H2 --> I
H3 --> I
style E fill:#e8f5e9,stroke:#388e3c
style I fill:#e3f2fd,stroke:#0d47a1
Observation: This creates a self-improving system. The AI gets smarter not by retraining, but by remembering what works and what doesn't. This is closer to how human engineers learn—through experience and feedback.
Think About It: Should you share this memory across all users, or keep it per-team? A startup might want aggressive linting, while a large enterprise prefers conservative reviews. Per-team memory allows customization without configuration.
Putting It All Together: The Complete Intelligence Pipeline
Let's trace a real code change through the entire intelligence layer.
Scenario: Developer modifies authentication logic in a sensitive file.
graph TD
Start[PR Submitted auth.py modified] --> Router[Pattern 1 Router]
Router --> Classify{Classify Change}
Classify -->|SECURITY| Context[Pattern 2 GraphRAG]
Context --> BuildGraph[Build Dependency Context]
BuildGraph --> GetCallers[Find 23 callers of this function]
GetCallers --> Rank[Rank by relevance]
Rank --> TopN[Select top 10 callers]
TopN --> Review[Large Reasoning Model]
Review --> Generate[Generate Review Comments]
Generate --> Filter[Pattern 3 Filter]
Filter --> Score[Score severity]
Score --> Sort{Sort by importance}
Sort -->|Critical| Inline[Post inline critical comments]
Sort -->|Medium| Summary[Add to summary table]
Sort -->|Low| Aggregate[Collapse into optional section]
Inline --> Check[Pattern 4 Check Memory]
Summary --> Check
Aggregate --> Check
Check --> Memory[(User Feedback DB)]
Memory --> Adjust[Adjust or suppress based on past]
Adjust --> Final[Post to GitHub]
style Classify fill:#fff9c4,stroke:#fbc02d
style BuildGraph fill:#e3f2fd,stroke:#0d47a1
style Score fill:#fff8e1,stroke:#f57f17
style Final fill:#e8f5e9,stroke:#388e3c
Timeline:
| Time | Stage | Action | Cost |
|---|---|---|---|
| T+0s | Router | Classifies as SECURITY | Minimal |
| T+1s | GraphRAG | Queries dependency graph | Minimal |
| T+2s | GraphRAG | Fetches 10 relevant caller snippets | Minimal |
| T+4s | LLM | Builds rich context (2K tokens) | Moderate |
| T+12s | LLM | Generates detailed review | High |
| T+13s | Filter | Scores and groups 15 comments | Free |
| T+14s | Memory | Checks against 2 negative patterns | Free |
| T+15s | Final | Posts 2 inline + 1 summary comment | Free |
Total Cost: One expensive LLM call (justified for security code) + minimal infrastructure costs.
Result: Developer sees 2 critical security issues flagged immediately, with context-aware explanations that reference the actual function signatures from other files. No false positives. High trust.
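Composed as code, the whole pipeline is short. This sketch reuses the hypothetical helpers from the earlier pattern sketches (`route_change`, `Category`, `group_findings`, `Finding`, `FeedbackMemory`) and stubs out the context builder and the model call; none of it is a definitive implementation.
```python
from typing import Any


def build_graph_context(file_path: str, diff: str) -> str:
    """Stand-in for the GraphRAG context builder from Pattern 2."""
    return f"Signatures and docstrings of symbols used by {file_path} would go here."


def run_reasoning_model(diff: str, context: str) -> list[Finding]:
    """Stand-in for the expensive reasoning-model call."""
    return []


def review_change(file_path: str, diff: str, memory: FeedbackMemory) -> dict[str, Any]:
    """End-to-end sketch: route, build context, review, filter, check memory."""
    category = route_change(file_path, diff)            # Pattern 1: Model Cascade
    if category in (Category.COSMETIC, Category.DOCS):
        return {"path": "fast", "comments": []}         # linters / auto-fix handle these
    context = build_graph_context(file_path, diff)      # Pattern 2: GraphRAG
    findings = run_reasoning_model(diff, context)       # the one justified expensive call
    grouped = group_findings(findings)                  # Pattern 3: Signal-to-Noise filter
    guardrails = memory.guardrails_for(diff)            # Pattern 4: Feedback Loop
    return {"path": "deep", "comments": grouped, "guardrails": guardrails}
```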
Challenge: Design Decisions for Your Intelligence Layer
Challenge 1: The Router Accuracy Trade-Off
Your router model has 95% accuracy. That means 5% of changes are misclassified.
Questions:
- If the router marks a critical security change as "DOCS," it bypasses deep review. How do you prevent this?
- Should you use a "confidence threshold"? (e.g., "If router is less than 80% confident, route to slow path")
- How do you measure router performance in production? False negatives are silent failures.
Challenge 2: The Context Explosion Problem
A function is called from many places. You can't fit everything in the context window.
Questions:
- Should you run multiple passes? (Review with context A, then review with context B, then merge findings)
- What if the most relevant caller is in a test file? Do tests count as "production callers"?
- Should you let the LLM see that you've omitted some callers, or pretend it has complete context?
Challenge 3: The Feedback Bias Problem
Only unhappy users leave feedback. Your negative patterns grow much faster than positive patterns.
Questions:
- How do you avoid becoming overly conservative? (Suppressing so many patterns that you stop finding real issues)
- Should you periodically "forget" old negative patterns to re-evaluate them?
- How do you weight feedback from expert developers vs. beginners?
System Comparison: Naive vs. Intelligent Architecture
| Dimension | Naive Approach | Intelligent Architecture |
|---|---|---|
| Model Usage | Large model for everything | Cascaded models (70% to fast path) |
| Context Strategy | Dump whole files | GraphRAG with relevance ranking |
| Comment Delivery | All findings, no filtering | Severity-based grouping |
| Cost per PR | High (inefficient) | Low (optimized routing) |
| Developer Trust | Low (too noisy) | High (focused, relevant) |
| False Positive Rate | High (no context) | Low (rich context + memory) |
| Improvement Strategy | Manual prompt tuning | Automated feedback learning |
graph TD
subgraph Naive["Naive System"]
N1[Every change] --> N2[Large model]
N2 --> N3[Many comments]
N3 --> N4[Low trust]
style N4 fill:#ffebee,stroke:#b71c1c
end
subgraph Intelligent["Intelligent System"]
I1[Every change] --> I2[Router]
I2 --> I3[70 percent fast path]
I2 --> I4[30 percent deep review]
I3 --> I5[Minimal comments]
I4 --> I6[Contextual comments]
I5 --> I7[High trust]
I6 --> I7
style I7 fill:#e8f5e9,stroke:#388e3c
end
Key Architectural Patterns Summary
| Pattern | Problem Solved | Key Benefit | Cost Impact |
|---|---|---|---|
| Model Cascade | Expensive models for trivial work | 70% cost reduction | High |
| GraphRAG | Missing context causes hallucinations | 3x accuracy improvement | Moderate |
| Signal-to-Noise Filter | Developer fatigue from too many comments | Higher adoption rate | Free |
| Feedback Loop | AI never improves from mistakes | Continuous quality improvement | Free |
Discussion Points for Engineers
1. The Cold Start Problem
You launch your AI code reviewer. You have no feedback data yet. How do you bootstrap the system?
Questions:
- Do you start conservative (only flag critical issues) or aggressive (show everything)?
- Should you run a "shadow mode" for a week to gather feedback before going live?
- How do you seed your negative patterns database? Use synthetic examples? Import from similar tools?
2. The Multi-Language Challenge
Your router works great for Python. But what about Go, Rust, TypeScript, Java...?
Questions:
- Do you train separate routers per language? (High cost, high quality)
- Use a universal router? (Low cost, lower quality)
- Fine-tune a single model on all languages? (Middle ground)
3. The Privacy-Cost Trade-Off
Building GraphRAG requires indexing the entire codebase. For large repos, this is expensive.
Questions:
- Do you index on every commit? (Fresh but expensive)
- Index nightly? (Stale but cheap)
- Incremental updates? (Complex but optimal)
- How do you handle private repositories? Can you use external graph builders?
The Complete Architecture: All Three Parts
We've now built the full system. Let's see how all three parts work together:
graph TD
subgraph Part1["Part 1 Ingestion"]
GH[GitHub Webhook] --> Buffer[Buffer Pattern]
Buffer --> Filter[Gatekeeper]
end
subgraph Part2["Part 2 Orchestration"]
Filter --> Workflow[Durable Workflow]
Workflow --> Parallel[Parallel Execution]
Parallel --> State[(State Persistence)]
end
subgraph Part3["Part 3 Intelligence"]
State --> Router[Model Cascade]
Router --> GraphRAG[Context Builder]
GraphRAG --> LLM[Reasoning Model]
LLM --> Aggregator[Signal Filter]
Aggregator --> Memory[(Feedback Loop)]
end
Memory --> Output[Post to PR]
style Part1 fill:#e8f5e9,stroke:#388e3c
style Part2 fill:#fff9c4,stroke:#fbc02d
style Part3 fill:#e3f2fd,stroke:#0d47a1
The Data Flow:
- Ingestion (Part 1): Webhook arrives → Buffered → Filtered → Valid code changes extracted
- Orchestration (Part 2): Workflow spawns parallel tasks → Manages state → Handles failures
- Intelligence (Part 3): Routes by complexity → Builds context → Reviews → Filters noise → Learns from feedback
The Result: A production-grade AI code review system that is:
- Reliable: Survives crashes and API failures
- Efficient: Routes intelligently to minimize cost
- Accurate: Uses GraphRAG to understand code relationships
- Trusted: Filters noise and learns from feedback
Takeaways
The Intelligence Pyramid
graph TD
A[Intelligent AI Agent] --> B[1. Smart Routing]
A --> C[2. Rich Context]
A --> D[3. Noise Reduction]
A --> E[4. Continuous Learning]
B --> F[Minimize Cost]
C --> G[Maximize Accuracy]
D --> H[Maximize Trust]
E --> I[Maximize Improvement]
style A fill:#e3f2fd,stroke:#0d47a1
style F fill:#e8f5e9,stroke:#388e3c
style G fill:#e8f5e9,stroke:#388e3c
style H fill:#e8f5e9,stroke:#388e3c
style I fill:#e8f5e9,stroke:#388e3c
Key Insights
- Not all code is equal: A typo fix and a security vulnerability deserve different treatment. Model cascading recognizes this, routing trivial changes to fast paths and complex logic to deep reasoning.
- Context is the secret weapon: GraphRAG transforms the AI from a text analyzer into a system architect. By understanding dependencies, it catches bugs that flat-file analysis would miss.
- Less is more for developers: Flooding a PR with comments destroys trust. Intelligent filtering and grouping ensures developers see what matters, building long-term adoption.
- AI systems must learn: Unlike traditional software, AI makes mistakes. The feedback loop turns these mistakes into lessons, creating a system that improves automatically without retraining.
- Architecture beats prompts: A great prompt with bad architecture produces a bad product. A good architecture with an average prompt produces a great product. The system design matters more than the individual components.
The Cost-Quality-Trust Triangle
| Optimization | Implementation | Impact |
|---|---|---|
| Cost | Model Cascade | 70% cost reduction, no quality loss |
| Quality | GraphRAG | 3x accuracy improvement, fewer hallucinations |
| Trust | Signal Filtering | Higher adoption, developers actually read reviews |
| Evolution | Feedback Loop | Continuous improvement without retraining |
The Journey Complete
We've built a production-grade AI code review system from the ground up:
Part 1: The Foundation
- Event ingestion and buffering
- Noise filtering and routing
- Cost optimization through early filtering
Part 2: The Reliability
- Durable workflows that survive failures
- Parallel execution with concurrency control
- Human-in-the-loop for critical decisions
Part 3: The Intelligence
- Model cascading for cost efficiency
- GraphRAG for deep code understanding
- Signal filtering for developer happiness
- Feedback loops for continuous improvement
The Result: A system that doesn't just "work"—it scales, adapts, and earns developer trust.
This is the difference between a demo and a product. A demo shows what's possible. A product solves what's practical.
What's Next: Beyond Code Review
The patterns we've covered aren't limited to code review. They apply to any production AI agent:
- Customer Support Bots: Use model cascading to handle simple questions cheaply, escalate complex ones
- Content Moderation: Use GraphRAG to understand context across posts and threads
- Data Analysis Agents: Use signal filtering to surface only actionable insights
- Compliance Checkers: Use feedback loops to learn company-specific rules
The architecture is the same. The domain changes. The principles endure.
For more on building production AI systems at scale, check out our AI Bootcamp for Software Engineers.