Architecting a CodeRabbit-like Code-Review AI Agent: The Intelligence Layer

Param Harrison
23 min read


In Part 1, we built the Ingestion Engine to handle webhook storms and filter noise.

In Part 2, we built the Orchestration Brain to manage long-running workflows reliably.

Now, we build the soul of the product: The Intelligence.

You have the code. You have the workflow. Now, how do you actually review it?

If you just send a git diff to a large language model with the prompt "Review this code," you will build a bad product.

  1. It will be too expensive: You are paying top-dollar to review typos.
  2. It will be too noisy: It will nag developers about missing docstrings in experimental code.
  3. It will be hallucination-prone: It will claim a variable is undefined because it can't see the import file.

In this final part, we architect the Intelligence Layer. We will build a multi-model system that mimics a Principal Engineer: it ignores the fluff, understands the broader context, and only speaks when it matters.


The Failure Case: The Naive Review

Before we dive into solutions, let's visualize what happens when you naively send every change to a large reasoning model.

graph TD
    A[Developer Changes 20 Files] --> B[Send All to Large Model]
    
    B --> C1[File 1: README typo]
    B --> C2[File 2: Whitespace fix]
    B --> C3[File 3: Comment update]
    B --> C4[File 4-19: Similar trivial]
    B --> C5[File 20: Critical auth bug]
    
    C1 --> D[High Cost Review]
    C2 --> D
    C3 --> D
    C4 --> D
    C5 --> D
    
    D --> E[50 Comments Generated]
    E --> F[Developer Ignores All]
    
    B --> G[Budget Explosion]
    G --> H[Slow Response Time]
    
    style F fill:#ffebee,stroke:#b71c1c
    style G fill:#ffebee,stroke:#b71c1c

The Three Failures:

  1. Cost Disaster: You paid premium-model prices to review a README typo. The cost-to-value ratio is terrible.

  2. Signal Buried: The critical security bug in File 20 is buried under 49 nitpicks about formatting. The developer misses it.

  3. Developer Fatigue: After receiving 50 comments on their last PR, the developer starts ignoring all reviews. Trust erodes.

Observation: The naive approach treats all code changes equally. But a Principal Engineer knows: a typo fix and an authentication bypass are not the same. Your AI needs the same judgment.


Pattern 1: The "Model Cascade" (Cost & Speed Optimization)

The Architectural Problem:

Most code changes are trivial. Renaming a variable, fixing a typo, or updating documentation does not require frontier model reasoning. Using your most expensive model for everything destroys your unit economics.

The Solution:

We implement a Model Cascade. We use cheap, fast models to filter and categorize, and expensive models only for deep reasoning.

The Architecture

graph TD
    Input[Code Change] --> Router{Router Agent}
    
    Router -->|Trivial| FastPath[Fast Path]
    Router -->|Complex| SlowPath[Slow Path]
    
    subgraph FastPath["Fast Path Low Cost"]
        F1[Static Analysis] --> F2[Rule-Based Checks]
        F2 --> F3[Auto-Fix or Skip]
    end
    
    subgraph SlowPath["Slow Path High Quality"]
        S1[Build Rich Context] --> S2[Large Reasoning Model]
        S2 --> S3[Deep Analysis]
    end
    
    F3 --> Output[Review Comments]
    S3 --> Output
    
    style Router fill:#fff9c4,stroke:#fbc02d

How it Works:

  1. The Router (The Triage Specialist): A small, fast model analyzes the diff and classifies it:

    • COSMETIC: Whitespace, formatting, comments
    • DOCS: README, documentation updates
    • LOGIC: Business logic, algorithms
    • SECURITY: Authentication, permissions, data handling
  2. The Fast Path: For COSMETIC and DOCS changes:

    • Run static analysis tools (linters, formatters)
    • Apply deterministic rules
    • Auto-generate brief feedback or skip entirely
    • Cost: Near zero
  3. The Slow Path: Only LOGIC and SECURITY changes trigger the expensive reasoning model:

    • Build comprehensive context (more on this in Pattern 2)
    • Use Chain of Thought reasoning
    • Generate detailed, thoughtful reviews
    • Cost: Higher, but justified

Observation: The router acts as a bouncer at a nightclub. It makes a split-second decision: "Fast lane or VIP treatment?" This single decision reduces your expensive model usage by 60-70%.
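To make the dispatch concrete, here is a minimal sketch of the cascade in code. The helper names (`classify_change`, `run_linters`, `review_with_large_model`) and the keyword heuristic are illustrative placeholders, not a real library or the production router:

```python
# A minimal sketch of the cascade dispatcher (illustrative, not production code).
from dataclasses import dataclass

FAST_CATEGORIES = {"COSMETIC", "DOCS"}

@dataclass
class FileChange:
    path: str
    diff: str

def classify_change(change: FileChange) -> str:
    # Placeholder heuristic: a real system calls the small router model here.
    if change.path.endswith((".md", ".rst", ".txt")):
        return "DOCS"
    if any(kw in change.diff for kw in ("password", "token", "auth")):
        return "SECURITY"
    return "LOGIC"

def run_linters(change: FileChange) -> list[str]:
    return []  # deterministic tools (linters, formatters) would run here

def review_with_large_model(change: FileChange, context: str) -> list[str]:
    return [f"deep review of {change.path} with {len(context)} chars of context"]

def review(change: FileChange) -> list[str]:
    category = classify_change(change)                # cheap triage decision
    if category in FAST_CATEGORIES:
        return run_linters(change)                    # Fast Path: near-zero cost
    context = "..."                                   # Pattern 2 (GraphRAG) assembles this
    return review_with_large_model(change, context)   # Slow Path: justified cost
```

In production, `classify_change` would call the small router model and `run_linters` would invoke your existing static-analysis tools; the control flow, not the stubs, is the point here.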

Concrete Example: The Classification Decision

Let's walk through how the router classifies different changes.

Change 1: README Update

- ## Setup Instructions
+ ## Getting Started

Router Analysis:

  • File type: .md
  • Change type: Text only
  • No code logic affected
  • Classification: DOCS
  • Route: Fast Path (skip or auto-comment)

Change 2: Authentication Logic

  def login(username, password):
-     if password == user.password:
+     if hash(password) == user.password_hash:
          return create_session(user)

Router Analysis:

  • File type: .py
  • Function: login (security-sensitive)
  • Change: Password handling modified
  • Classification: SECURITY
  • Route: Slow Path (deep reasoning required)

The Cost Breakdown

Let's compare the two approaches:

Approach 1: Brute Force (Everything to Large Model)

graph TD
    A[100 Files Changed] --> B[All to Large Model]
    
    B --> C[70 Trivial Changes]
    B --> D[30 Complex Changes]
    
    C --> E[High Cost for Low Value]
    D --> F[High Cost for High Value]
    
    E --> G[Total: High Cost]
    F --> G
    
    style E fill:#ffebee,stroke:#b71c1c

Approach 2: Model Cascade (Smart Routing)

graph TD
    A[100 Files Changed] --> B{Router}
    
    B --> C[70 Trivial Changes]
    B --> D[30 Complex Changes]
    
    C --> E[Static Analysis]
    D --> F[Large Model]
    
    E --> G[Minimal Cost]
    F --> H[Justified Cost]
    
    G --> I[Total: Much Lower Cost]
    H --> I
    
    style I fill:#e8f5e9,stroke:#388e3c

The Impact:

| Metric | Brute Force | Model Cascade | Improvement |
| --- | --- | --- | --- |
| Cost per PR | High | Low | 70% reduction |
| Response Time | Slow (all files queued) | Fast (parallel tiers) | 3x faster |
| Quality | Same for all | Focused on what matters | Better signal |
| Developer Trust | Low (too much noise) | High (thoughtful comments) | Higher adoption |

Think About It: How do you train the router? You could use a small fine-tuned model on historical data (labeled examples of "this was important" vs "this was a nitpick"). Or use a lightweight LLM with few-shot examples in the prompt. The key: keep it fast (sub-second) and cheap.
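If you go the lightweight-LLM route, the prompt can be as simple as the sketch below. The labels and examples mirror this article; `call_small_model` is a hypothetical callable wrapping whatever cheap, fast model you use:

```python
# Sketch of a few-shot router prompt for a small, fast model (illustrative).
ROUTER_PROMPT = """Classify the code change into exactly one label:
COSMETIC, DOCS, LOGIC, or SECURITY. Respond with the label only.

Example (README.md):
- ## Setup Instructions
+ ## Getting Started
label: DOCS

Example (auth.py):
- if password == user.password:
+ if hash(password) == user.password_hash:
label: SECURITY

Change:
{diff}
label:"""

VALID_LABELS = {"COSMETIC", "DOCS", "LOGIC", "SECURITY"}

def route(diff: str, call_small_model) -> str:
    label = call_small_model(ROUTER_PROMPT.format(diff=diff)).strip().upper()
    # Fail safe: anything the router cannot classify goes to the slow path.
    return label if label in VALID_LABELS else "LOGIC"
```

The final fallback line doubles as a safety net: anything the router cannot confidently label gets the expensive review rather than slipping through.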


Pattern 2: Context Engineering with GraphRAG

The Architectural Problem:

The "Context Trap"—the most insidious problem in code review AI.

  • Too little context: The AI hallucinates bugs because it can't see helper functions or imported utilities.
  • Too much context: You hit the context window limit, or worse, you confuse the model with irrelevant code from unrelated files.

The Solution:

We don't just "stuff the prompt." We use Graph-Based Retrieval Augmented Generation (GraphRAG). We treat the codebase not as flat text files, but as a Dependency Graph.

The Architecture

graph TD
    subgraph Input["The Change"]
        Change[Developer modifies checkout.py]
    end
    
    subgraph Parser["AST Analysis"]
        Change --> AST[Parse Abstract Syntax Tree]
        AST --> Symbols[Extract Symbols]
        Symbols --> Imports[Find Imports]
        Symbols --> Calls[Find Function Calls]
    end
    
    subgraph Graph["Knowledge Graph Query"]
        Imports --> Query1[Where is payment_service defined?]
        Calls --> Query2[Where is calculate_tax defined?]
        
        Query1 --> GraphDB[(Code Dependency Graph)]
        Query2 --> GraphDB
        
        GraphDB --> Snippets[Return Relevant Snippets]
    end
    
    subgraph Context["Context Assembly"]
        Change --> Base[Base: Modified File]
        Snippets --> Related[Related: Dependencies]
        
        Base --> Final[Final Context Window]
        Related --> Final
    end
    
    Final --> LLM[LLM Review]
    
    style GraphDB fill:#fff9c4,stroke:#fbc02d
    style Final fill:#e3f2fd,stroke:#0d47a1

How it Works:

  1. Symbol Extraction: When reviewing checkout.py, we parse the AST to identify:

    • Imported modules: from payment_service import charge
    • Function calls: calculate_tax(amount)
    • Class instantiations: Auth.validate(token)
  2. Graph Lookup: For each external symbol, we query our pre-built code index:

    • "Where is charge defined?"
    • "What's the signature of calculate_tax?"
    • "What does Auth.validate return?"
  3. Smart Snippeting: Instead of dumping entire files, we fetch only:

    • Function signatures
    • Docstrings
    • Type annotations
    • Return types
  4. Context Assembly: Build the final prompt:

    You are reviewing checkout.py.
    
    Modified code:
    [The actual diff]
    
    External dependencies used in this file:
    - charge(amount, card): Charges card. Returns {'success': bool, 'id': str}
    - calculate_tax(amount): Returns float. Raises ValueError if amount < 0
    - Auth.validate(token): Returns User object or None
    

Observation: This gives the AI "X-Ray Vision" into the codebase. It understands not just what changed, but how that change interacts with the rest of the system. This is the difference between a surface-level review and a deep architectural critique.
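Here is a minimal sketch of the symbol-extraction and context-assembly steps, using Python's built-in `ast` module. The `symbol_index` dict stands in for the pre-built code dependency graph, which a real system would query instead:

```python
# Sketch: extract external symbols from a changed file and assemble the prompt.
import ast

def extract_external_symbols(source: str) -> set[str]:
    """Collect imported names and called function names from a changed file."""
    tree = ast.parse(source)
    symbols: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            symbols.update(alias.name for alias in node.names)
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                symbols.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                symbols.add(node.func.attr)
    return symbols

def build_context(diff: str, source: str, symbol_index: dict[str, str]) -> str:
    """Assemble the review prompt: the diff plus signature-level snippets only."""
    snippets = [
        f"- {name}: {symbol_index[name]}"
        for name in sorted(extract_external_symbols(source))
        if name in symbol_index  # only symbols the code index actually knows
    ]
    return (
        "Modified code:\n" + diff
        + "\n\nExternal dependencies used in this file:\n" + "\n".join(snippets)
    )
```

Calling `build_context(diff, source, index)` with signature-level strings in the index produces a prompt shaped like the example above.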

Concrete Example: Preventing a False Positive

The Code Change:

# checkout.py
def process_order(order_id, user_token):
    user = Auth.validate(user_token)  # Reviewer might worry: What if this returns None?
    
    amount = calculate_total(order_id)
    tax = calculate_tax(amount)  # Reviewer might worry: What if amount is negative?
    
    charge(amount + tax, user.card)  # Reviewer might worry: What if user is None?
    return {"status": "success"}

Without GraphRAG (Naive Review):

graph LR
    A[AI sees only checkout.py] --> B[No context on Auth.validate]
    B --> C[Potential bug - user might be None]
    C --> D[False Positive Comment]
    
    style D fill:#ffebee,stroke:#b71c1c

AI Comment: ⚠️ "user could be None if token is invalid. This will cause an AttributeError on user.card."

Developer Response: 😤 "No, Auth.validate raises an exception for invalid tokens. Read the docs!"

With GraphRAG (Context-Aware Review):

graph LR
    A[AI sees checkout.py] --> B[GraphRAG fetches Auth.validate signature]
    B --> C[Auth.validate raises AuthError if invalid]
    C --> D[No False Positive]
    D --> E[Focuses on Real Issues]
    
    style E fill:#e8f5e9,stroke:#388e3c

AI Comment: ✅ "Code looks good. Auth exception handling is upstream. Consider adding a try-catch for charge() failures."

Developer Response: 👍 "Good point, will add."

The Context Window Budget Strategy

With GraphRAG, we face a new problem: what if a function is called from many places?

graph TD
    A[Function has many callers] --> B{Strategy}
    
    B -->|Naive: Include All| C[Context Overflow]
    B -->|Smart: Prioritize| D[Relevance Ranking]
    
    D --> E[Same File: Priority 1]
    D --> F[Same Module: Priority 2]
    D --> G[Test Files: Priority 3]
    D --> H[Distant Files: Priority 4]
    
    E --> I[Include Top N]
    F --> I
    G --> I
    H --> I
    
    I --> J[Fit Context Window]
    
    style C fill:#ffebee,stroke:#b71c1c
    style J fill:#e8f5e9,stroke:#388e3c

The Ranking Algorithm:

| Caller Type | Priority | Reasoning |
| --- | --- | --- |
| Same file | High | Most likely related to the change |
| Same module/package | Medium | Likely architectural dependency |
| Test files | Medium | Shows expected behavior |
| Distant modules | Low | Probably unaffected |

Result: We send the top 10 most relevant callers, not all of them. Quality over quantity.
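A minimal sketch of that ranking step, assuming each caller is identified by its file path; the priority values mirror the table above and are purely illustrative:

```python
# Sketch: rank callers by relevance and keep only the top N for the context window.
from pathlib import PurePosixPath

def caller_priority(changed_file: str, caller_file: str) -> int:
    """Lower number = higher priority when filling the context window."""
    if caller_file == changed_file:
        return 1  # same file: most likely related to the change
    if PurePosixPath(caller_file).parent == PurePosixPath(changed_file).parent:
        return 2  # same module/package: likely architectural dependency
    if "test" in PurePosixPath(caller_file).name:
        return 3  # test files: show expected behavior
    return 4      # distant modules: probably unaffected

def top_callers(changed_file: str, callers: list[str], limit: int = 10) -> list[str]:
    return sorted(callers, key=lambda c: caller_priority(changed_file, c))[:limit]
```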

Think About It: Should you tell the AI that you're omitting some callers? Some teams add a note: "This function has 47 callers. Showing the 10 most relevant." This helps the AI understand it's working with a sample, not the complete picture.


Pattern 3: The "Signal-to-Noise" Filter (Developer Happiness)

The Architectural Problem:

If your AI leaves many comments on a PR, developers will ignore all of them. This is "Notification Fatigue." A human Principal Engineer doesn't comment on every missing space; they focus on what breaks the app or creates security risks.

The Solution:

We build a Feedback Aggregator that acts as a final filter before anything reaches the developer.

The Architecture

graph TD
    LLM[LLM Generates Many Comments] --> Collector[Comment Collector]
    
    Collector --> Dedup[Deduplication]
    Dedup --> Score[Severity Scoring]
    
    Score --> Gate{Severity Gate}
    
    Gate -->|Critical or High| Inline[Post as Inline Comments]
    Gate -->|Medium| Batch1[Batch into Summary Table]
    Gate -->|Low or Nitpick| Batch2[Aggregate or Discard]
    
    Inline --> GitHub1[GitHub: Block PR]
    Batch1 --> GitHub2[GitHub: Summary Comment]
    Batch2 --> GitHub2
    
    style Inline fill:#ffebee,stroke:#b71c1c
    style Batch1 fill:#fff8e1,stroke:#f57f17
    style Batch2 fill:#e8f5e9,stroke:#388e3c

How it Works:

  1. Deduplication: The AI often finds the same issue multiple times:

    • "Variable name is unclear" (mentioned 5 times)
    • Deduplicate to: "Variable naming could be improved in 5 locations"
  2. Severity Scoring: Each comment gets a score:

    • Critical (5): Security vulnerability, data loss risk
    • High (4): Logic error, potential crash
    • Medium (3): Performance issue, poor error handling
    • Low (2): Code smell, minor improvement
    • Nitpick (1): Style preference, formatting
  3. Smart Delivery (see the grouping sketch after this list):

    • Critical/High: Individual inline comments on specific lines (blocks merge)
    • Medium: Grouped into a summary table
    • Low/Nitpick: Aggregated into a single collapsible section
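Here is a minimal sketch of that grouping logic. The severity scale follows the scoring above; the dataclass shape and the deduplication key are illustrative assumptions:

```python
# Sketch of the aggregator's severity gate: dedupe, then bucket by severity.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Comment:
    file: str
    line: int
    message: str
    severity: int  # 5=Critical, 4=High, 3=Medium, 2=Low, 1=Nitpick

def deliver(comments: list[Comment]) -> dict[str, list[Comment]]:
    # Deduplicate by (file, line, message) before grouping.
    unique = list({(c.file, c.line, c.message): c for c in comments}.values())
    buckets: dict[str, list[Comment]] = defaultdict(list)
    for c in unique:
        if c.severity >= 4:
            buckets["inline"].append(c)     # Critical/High: inline, blocks merge
        elif c.severity == 3:
            buckets["summary"].append(c)    # Medium: grouped summary table
        else:
            buckets["collapsed"].append(c)  # Low/Nitpick: collapsible section
    return buckets
```

The `inline` bucket maps to blocking review comments, `summary` to the table in the PR summary, and `collapsed` to the optional section.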

Concrete Example: The Review Presentation

Scenario: AI analyzes a PR and generates these findings:

  • 1 Critical: SQL injection vulnerability
  • 2 High: Unhandled exceptions
  • 5 Medium: Missing error messages
  • 12 Low: Variable naming suggestions
  • 8 Nitpick: Missing docstrings

Naive Delivery (28 separate comments):

graph TD
    A[Developer Opens PR] --> B[Sees 28 Comments]
    B --> C[Feels Overwhelmed]
    C --> D[Ignores All]
    D --> E[Merges Without Reading]
    
    style E fill:#ffebee,stroke:#b71c1c

Smart Delivery (Filtered & Grouped):

graph TD
    A[Developer Opens PR] --> B[Sees 3 Inline Critical Comments]
    B --> C[Immediately Addresses Security Issue]
    C --> D[Expands Summary Table]
    D --> E[Reviews Medium Priority Items]
    E --> F[Dismisses Nitpicks]
    F --> G[Merges with Confidence]
    
    style G fill:#e8f5e9,stroke:#388e3c

GitHub Comment Structure:

## 🚨 Critical Issues (Must Fix)

**Line 45:** SQL Injection vulnerability detected

```python
query = f"SELECT * FROM users WHERE id = {user_id}"
```

Use parameterized queries to prevent SQL injection.

## ⚠️ Important Findings (5 items)

| File | Line | Issue | Severity |
| --- | --- | --- | --- |
| auth.py | 23 | Unhandled exception | High |
| api.py | 67 | Missing error message | Medium |

*... 3 more rows ...*

## 💡 Code Quality Suggestions (20 items)

Low-priority improvements that can be addressed later...

The Impact:

| Delivery Style | Developer Action | PR Merge Time | Trust Level |
| --- | --- | --- | --- |
| 28 Separate Comments | Ignores all | Fast (risky) | Low |
| Smart Grouped | Addresses critical | Appropriate | High |

Observation: The aggregator transforms the AI from an annoying perfectionist into a helpful colleague. It respects the developer's attention by showing them what matters most.

Think About It: Should you let developers customize their threshold? Some teams want to see all nitpicks, others want only critical issues. Consider user preferences: "Show me: Critical + High only" vs. "Show me: Everything."


Pattern 4: The "Feedback Loop" (Continuous Improvement)

The Architectural Problem:

The AI will be wrong. It will claim a bug exists where there is none. It will suggest changes that break convention. Users will react negatively.

In a traditional system, you'd need to retrain the model. But you can't retrain a frontier model on custom feedback.

The Solution:

We capture user interaction to improve the system without retraining, using a RAG-Based Memory Pattern.

The Architecture

graph TD
    subgraph Review["Review Generation"]
        Generate[AI Generates Review] --> Post[Post to PR]
    end
    
    subgraph Feedback["User Feedback"]
        Post --> UserReact{User Reaction}
        UserReact -->|Thumbs Up| Positive[(Positive Examples DB)]
        UserReact -->|Thumbs Down| Negative[(Negative Examples DB)]
        UserReact -->|No Reaction| Neutral[Neutral]
    end
    
    subgraph Future["Future Reviews"]
        NewReview[Next PR Review] --> CheckMemory[Query Feedback DBs]
        CheckMemory --> Positive
        CheckMemory --> Negative
        
        Positive --> GoodPatterns[Reinforce Good Patterns]
        Negative --> AvoidPatterns[Avoid Bad Patterns]
        
        GoodPatterns --> EnhancedPrompt[Enhanced Prompt]
        AvoidPatterns --> EnhancedPrompt
        
        EnhancedPrompt --> BetterReview[Better Review]
    end
    
    style Negative fill:#ffebee,stroke:#b71c1c
    style BetterReview fill:#e8f5e9,stroke:#388e3c

How it Works:

  1. Capture Feedback: When a user marks a comment as "Not Helpful," we save:

    • The code snippet
    • The AI's comment
    • The user's feedback: "This is false positive"
  2. Build Memory: Store these examples in a vector database:

    • positive_patterns: Comments users found helpful
    • negative_patterns: Comments users rejected
  3. Query Before Review: Before generating a new review, query both databases:

    • "Does this code look similar to anything in negative_patterns?"
    • If yes: "Users previously disliked suggestions about X in similar code."
  4. Inject Guardrails: Add to the prompt:

    Based on user feedback, avoid suggesting:
    - "Add error handling" for test files (users found this annoying)
    - "Use type hints" in legacy code (users rejected this)
    

Concrete Example: Learning from Mistakes

Week 1: AI makes a mistake

# test_auth.py
def test_login_failure():
    with pytest.raises(AuthError):
        login("bad", "credentials")  # AI comments: "Add try-catch here"

Developer feedback: 👎 "This is a test. pytest.raises expects an exception. This comment is wrong."

System learns: Saves to negative_patterns:

  • Context: Test file using pytest.raises
  • Bad suggestion: "Add try-catch"
  • Reason: Misunderstands test framework

Week 2: Similar code appears

# test_payment.py
def test_invalid_card():
    with pytest.raises(PaymentError):
        charge(-100, card)  # System queries memory

Memory hit: "This looks like a test with pytest.raises. Users rejected exception handling suggestions in this context."

Enhanced prompt:

You are reviewing a test file.
Note: Do not suggest adding try-catch blocks inside pytest.raises() contexts.
Users have indicated this is not helpful.

Result: AI skips the bad comment. Developer sees only relevant feedback.
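A minimal sketch of this memory loop, assuming a generic `embed` function (any embedding model) and an in-memory store; a real system would back this with a vector database, and the similarity threshold is illustrative:

```python
# Sketch of the feedback memory: store rejected suggestions, query before reviewing.
import math

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class FeedbackMemory:
    def __init__(self, embed):
        self.embed = embed  # callable: str -> list[float] (any embedding model)
        self.negative: list[tuple[list[float], str]] = []

    def record_rejection(self, code_snippet: str, guidance: str) -> None:
        """Remember: users rejected this kind of suggestion in this kind of code."""
        self.negative.append((self.embed(code_snippet), guidance))

    def guardrails_for(self, code_snippet: str, threshold: float = 0.85) -> list[str]:
        """Return guidance lines to inject into the prompt for similar code."""
        query = self.embed(code_snippet)
        return [g for vec, g in self.negative if _cosine(query, vec) >= threshold]
```

Before each review, the guidance lines returned by `guardrails_for(diff)` are appended to the prompt as the "Based on user feedback, avoid suggesting..." section.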

The Feedback Dashboard

For engineering teams, we build a dashboard to monitor AI quality:

graph TD
    subgraph Metrics["Quality Metrics"]
        A[Comments Posted 10000]
        B[Thumbs Up 7200]
        C[Thumbs Down 1800]
        D[Ignored 1000]
    end
    
    A --> E[Helpfulness Rate 72 percent]
    C --> F[Common Rejections]
    
    F --> G[Top Rejected Patterns]
    G --> H1[Type hints in old code]
    G --> H2[Docstring nitpicks]
    G --> H3[Import organization]
    
    H1 --> I[Automatically Suppress]
    H2 --> I
    H3 --> I
    
    style E fill:#e8f5e9,stroke:#388e3c
    style I fill:#e3f2fd,stroke:#0d47a1

Observation: This creates a self-improving system. The AI gets smarter not by retraining, but by remembering what works and what doesn't. This is closer to how human engineers learn—through experience and feedback.

Think About It: Should you share this memory across all users, or keep it per-team? A startup might want aggressive linting, while a large enterprise prefers conservative reviews. Per-team memory allows customization without configuration.


Putting It All Together: The Complete Intelligence Pipeline

Let's trace a real code change through the entire intelligence layer.

Scenario: Developer modifies authentication logic in a sensitive file.

graph TD
    Start[PR Submitted auth.py modified] --> Router[Pattern 1 Router]
    
    Router --> Classify{Classify Change}
    Classify -->|SECURITY| Context[Pattern 2 GraphRAG]
    
    Context --> BuildGraph[Build Dependency Context]
    BuildGraph --> GetCallers[Find 23 callers of this function]
    GetCallers --> Rank[Rank by relevance]
    Rank --> TopN[Select top 10 callers]
    
    TopN --> Review[Large Reasoning Model]
    Review --> Generate[Generate Review Comments]
    
    Generate --> Filter[Pattern 3 Filter]
    Filter --> Score[Score severity]
    Score --> Sort{Sort by importance}
    
    Sort -->|Critical| Inline[Post inline critical comments]
    Sort -->|Medium| Summary[Add to summary table]
    Sort -->|Low| Aggregate[Collapse into optional section]
    
    Inline --> Check[Pattern 4 Check Memory]
    Summary --> Check
    Aggregate --> Check
    
    Check --> Memory[(User Feedback DB)]
    Memory --> Adjust[Adjust or suppress based on past]
    
    Adjust --> Final[Post to GitHub]
    
    style Classify fill:#fff9c4,stroke:#fbc02d
    style BuildGraph fill:#e3f2fd,stroke:#0d47a1
    style Score fill:#fff8e1,stroke:#f57f17
    style Final fill:#e8f5e9,stroke:#388e3c

Timeline:

| Time | Stage | Action | Cost |
| --- | --- | --- | --- |
| T+0s | Router | Classifies as SECURITY | Minimal |
| T+1s | GraphRAG | Queries dependency graph | Minimal |
| T+2s | GraphRAG | Fetches 10 relevant caller snippets | Minimal |
| T+4s | LLM | Builds rich context (2K tokens) | Moderate |
| T+12s | LLM | Generates detailed review | High |
| T+13s | Filter | Scores and groups 15 comments | Free |
| T+14s | Memory | Checks against 2 negative patterns | Free |
| T+15s | Final | Posts 2 inline + 1 summary comment | Free |

Total Cost: One expensive LLM call (justified for security code) + minimal infrastructure costs.

Result: Developer sees 2 critical security issues flagged immediately, with context-aware explanations that reference the actual function signatures from other files. No false positives. High trust.


Challenge: Design Decisions for Your Intelligence Layer

Challenge 1: The Router Accuracy Trade-Off

Your router model has 95% accuracy. That means 5% of changes are misclassified.

Questions:

  • If the router marks a critical security change as "DOCS," it bypasses deep review. How do you prevent this?
  • Should you use a "confidence threshold"? (e.g., "If the router is less than 80% confident, route to the slow path"; see the sketch after this list)
  • How do you measure router performance in production? False negatives are silent failures.
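One common answer to the first two questions, sketched below under illustrative thresholds, is to treat low confidence itself as a routing signal and to make SECURITY labels sticky:

```python
# Sketch of a fail-safe routing policy: doubt always buys the deep review.
def choose_path(label: str, confidence: float, threshold: float = 0.8) -> str:
    if label == "SECURITY" or confidence < threshold:
        return "slow"   # when in doubt, pay for the deep review
    return "fast" if label in {"COSMETIC", "DOCS"} else "slow"
```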

Challenge 2: The Context Explosion Problem

A function is called from many places. You can't fit everything in the context window.

Questions:

  • Should you run multiple passes? (Review with context A, then review with context B, then merge findings)
  • What if the most relevant caller is in a test file? Do tests count as "production callers"?
  • Should you let the LLM see that you've omitted some callers, or pretend it has complete context?

Challenge 3: The Feedback Bias Problem

Only unhappy users leave feedback. Your negative patterns grow much faster than positive patterns.

Questions:

  • How do you avoid becoming overly conservative? (Suppressing so many patterns that you stop finding real issues)
  • Should you periodically "forget" old negative patterns to re-evaluate them?
  • How do you weight feedback from expert developers vs. beginners?

System Comparison: Naive vs. Intelligent Architecture

| Dimension | Naive Approach | Intelligent Architecture |
| --- | --- | --- |
| Model Usage | Large model for everything | Cascaded models (70% to fast path) |
| Context Strategy | Dump whole files | GraphRAG with relevance ranking |
| Comment Delivery | All findings, no filtering | Severity-based grouping |
| Cost per PR | High (inefficient) | Low (optimized routing) |
| Developer Trust | Low (too noisy) | High (focused, relevant) |
| False Positive Rate | High (no context) | Low (rich context + memory) |
| Improvement Strategy | Manual prompt tuning | Automated feedback learning |

graph TD
    subgraph Naive["Naive System"]
        N1[Every change] --> N2[Large model]
        N2 --> N3[Many comments]
        N3 --> N4[Low trust]
        style N4 fill:#ffebee,stroke:#b71c1c
    end
    
    subgraph Intelligent["Intelligent System"]
        I1[Every change] --> I2[Router]
        I2 --> I3[70 percent fast path]
        I2 --> I4[30 percent deep review]
        I3 --> I5[Minimal comments]
        I4 --> I6[Contextual comments]
        I5 --> I7[High trust]
        I6 --> I7
        style I7 fill:#e8f5e9,stroke:#388e3c
    end

Key Architectural Patterns Summary

| Pattern | Problem Solved | Key Benefit | Cost Impact |
| --- | --- | --- | --- |
| Model Cascade | Expensive models for trivial work | 70% cost reduction | High |
| GraphRAG | Missing context causes hallucinations | 3x accuracy improvement | Moderate |
| Signal-to-Noise Filter | Developer fatigue from too many comments | Higher adoption rate | Free |
| Feedback Loop | AI never improves from mistakes | Continuous quality improvement | Free |

Discussion Points for Engineers

1. The Cold Start Problem

You launch your AI code reviewer. You have no feedback data yet. How do you bootstrap the system?

Questions:

  • Do you start conservative (only flag critical issues) or aggressive (show everything)?
  • Should you run a "shadow mode" for a week to gather feedback before going live?
  • How do you seed your negative patterns database? Use synthetic examples? Import from similar tools?

2. The Multi-Language Challenge

Your router works great for Python. But what about Go, Rust, TypeScript, Java...?

Questions:

  • Do you train separate routers per language? (High cost, high quality)
  • Use a universal router? (Low cost, lower quality)
  • Fine-tune a single model on all languages? (Middle ground)

3. The Privacy-Cost Trade-Off

Building GraphRAG requires indexing the entire codebase. For large repos, this is expensive.

Questions:

  • Do you index on every commit? (Fresh but expensive)
  • Index nightly? (Stale but cheap)
  • Incremental updates? (Complex but optimal)
  • How do you handle private repositories? Can you use external graph builders?

The Complete Architecture: All Three Parts

We've now built the full system. Let's see how all three parts work together:

graph TD
    subgraph Part1["Part 1 Ingestion"]
        GH[GitHub Webhook] --> Buffer[Buffer Pattern]
        Buffer --> Filter[Gatekeeper]
    end
    
    subgraph Part2["Part 2 Orchestration"]
        Filter --> Workflow[Durable Workflow]
        Workflow --> Parallel[Parallel Execution]
        Parallel --> State[(State Persistence)]
    end
    
    subgraph Part3["Part 3 Intelligence"]
        State --> Router[Model Cascade]
        Router --> GraphRAG[Context Builder]
        GraphRAG --> LLM[Reasoning Model]
        LLM --> Aggregator[Signal Filter]
        Aggregator --> Memory[(Feedback Loop)]
    end
    
    Memory --> Output[Post to PR]
    
    style Part1 fill:#e8f5e9,stroke:#388e3c
    style Part2 fill:#fff9c4,stroke:#fbc02d
    style Part3 fill:#e3f2fd,stroke:#0d47a1

The Data Flow:

  1. Ingestion (Part 1): Webhook arrives → Buffered → Filtered → Valid code changes extracted
  2. Orchestration (Part 2): Workflow spawns parallel tasks → Manages state → Handles failures
  3. Intelligence (Part 3): Routes by complexity → Builds context → Reviews → Filters noise → Learns from feedback

The Result: A production-grade AI code review system that is:

  • Reliable: Survives crashes and API failures
  • Efficient: Routes intelligently to minimize cost
  • Accurate: Uses GraphRAG to understand code relationships
  • Trusted: Filters noise and learns from feedback

Takeaways

The Intelligence Pyramid

graph TD
    A[Intelligent AI Agent] --> B[1. Smart Routing]
    A --> C[2. Rich Context]
    A --> D[3. Noise Reduction]
    A --> E[4. Continuous Learning]
    
    B --> F[Minimize Cost]
    C --> G[Maximize Accuracy]
    D --> H[Maximize Trust]
    E --> I[Maximize Improvement]
    
    style A fill:#e3f2fd,stroke:#0d47a1
    style F fill:#e8f5e9,stroke:#388e3c
    style G fill:#e8f5e9,stroke:#388e3c
    style H fill:#e8f5e9,stroke:#388e3c
    style I fill:#e8f5e9,stroke:#388e3c

Key Insights

  • Not all code is equal — A typo fix and a security vulnerability deserve different treatment. Model cascading recognizes this, routing trivial changes to fast paths and complex logic to deep reasoning.

  • Context is the secret weapon — GraphRAG transforms the AI from a text analyzer into a system architect. By understanding dependencies, it catches bugs that flat-file analysis would miss.

  • Less is more for developers — Flooding a PR with comments destroys trust. Intelligent filtering and grouping ensures developers see what matters, building long-term adoption.

  • AI systems must learn — Unlike traditional software, AI makes mistakes. The feedback loop turns these mistakes into lessons, creating a system that improves automatically without retraining.

  • Architecture beats prompts — A great prompt with bad architecture produces a bad product. A good architecture with an average prompt produces a great product. The system design matters more than the individual components.

The Cost-Quality-Trust Triangle

| Optimization | Implementation | Impact |
| --- | --- | --- |
| Cost | Model Cascade | 70% cost reduction, no quality loss |
| Quality | GraphRAG | 3x accuracy improvement, fewer hallucinations |
| Trust | Signal Filtering | Higher adoption, developers actually read reviews |
| Evolution | Feedback Loop | Continuous improvement without retraining |

The Journey Complete

We've built a production-grade AI code review system from the ground up:

Part 1: The Foundation

  • Event ingestion and buffering
  • Noise filtering and routing
  • Cost optimization through early filtering

Part 2: The Reliability

  • Durable workflows that survive failures
  • Parallel execution with concurrency control
  • Human-in-the-loop for critical decisions

Part 3: The Intelligence

  • Model cascading for cost efficiency
  • GraphRAG for deep code understanding
  • Signal filtering for developer happiness
  • Feedback loops for continuous improvement

The Result: A system that doesn't just "work"—it scales, adapts, and earns developer trust.

This is the difference between a demo and a product. A demo shows what's possible. A product solves what's practical.


What's Next: Beyond Code Review

The patterns we've covered aren't limited to code review. They apply to any production AI agent:

  • Customer Support Bots: Use model cascading to handle simple questions cheaply, escalate complex ones
  • Content Moderation: Use GraphRAG to understand context across posts and threads
  • Data Analysis Agents: Use signal filtering to surface only actionable insights
  • Compliance Checkers: Use feedback loops to learn company-specific rules

The architecture is the same. The domain changes. The principles endure.


For more on building production AI systems at scale, check out our AI Bootcamp for Software Engineers.
