Building Penny: A Private, Deterministic Financial Agent

Param Harrison

We've all had the same thought: "I wish I could upload my bank statements, credit card bills, and investment portfolio to ChatGPT and just ask it to fix my finances."

But you don't. Why? Trust.

There are two massive barriers to using AI for personal finance:

  1. Privacy: You do not want your transaction history, account balances, and spending patterns living on someone else's server, accessible to employees, vulnerable to breaches, or used for training.

  2. Accuracy (Hallucination): LLMs are brilliant at language but terrible at math. If you ask a standard LLM to "sum up my Uber rides," it will likely miss rows, hallucinate numbers, or just guess. In finance, a wrong number isn't a typo—it's a disaster.

This post is for engineers who want to build a Local Financial Agent. We aren't building a chatbot; we are building a secure, local system that runs on your laptop, keeps your data private, and—most importantly—never guesses the numbers.

Today, we explore the Hybrid Agent Architecture required to make this work.


The Failure Case: The Naive Approach

Before we dive into solutions, let's see what happens when you naively build a financial AI with standard patterns.

graph TD
    User[User Question] --> CloudLLM[Cloud LLM API]
    
    CloudLLM --> Upload[Upload Bank CSV]
    Upload --> CloudStorage[Cloud Storage]
    
    CloudLLM --> Read[Read CSV as Text]
    Read --> Calculate[Calculate in Tokens]
    
    Calculate --> Hallucinate[LLM Guesses Numbers]
    Hallucinate --> WrongAnswer[Incorrect Financial Data]
    
    CloudStorage --> Privacy[Privacy Violation]
    
    style WrongAnswer fill:#ffebee,stroke:#b71c1c
    style Privacy fill:#ffebee,stroke:#b71c1c

The Three Failures:

  1. Privacy Breach: Your financial data is uploaded to cloud servers. Even with encryption, it's now in someone else's control. A data breach, a subpoena, or a policy change could expose everything.

  2. Mathematical Errors: LLMs process text token-by-token. When asked to sum 500 transactions, they approximate, skip rows, or hallucinate. A $1,240.50 expense becomes "$1,200 or so" or worse, "$2,500."

  3. No Determinism: Run the same query twice, get different answers. Financial analysis requires reproducibility. You can't make decisions based on non-deterministic outputs.

Observation: The naive approach treats financial data like any other text. But financial data is structured, mathematical, and sensitive. It requires a fundamentally different architecture.


The Core Problem: The "Left Brain" vs. the "Right Brain"

To build a financial assistant, we need two very different types of intelligence.

  1. The Analyst (Left Brain): Needs to answer: "Where did I spend most of my money?"

    • This requires Fact. It needs to query structured data (CSVs, databases), perform aggregations (SUM, AVG, GROUP BY), and be mathematically perfect.
  2. The Advisor (Right Brain): Needs to answer: "How can I save tax this year?"

    • This requires Knowledge. It needs to read unstructured documents (PDFs of tax codes, investment strategies, financial planning guides) and synthesize advice.

A standard RAG (Retrieval-Augmented Generation) system treats everything as text. It fails at the "Analyst" job because vector search cannot "sum" a column. Vector search finds similar text—it doesn't execute SQL.

The Solution: We need a Router-Based Hybrid Agent that intelligently routes questions to the right engine.


Pattern 1: The Hybrid Architecture (Bicameral Mind)

We don't build one bot. We build a system with a central Router that directs traffic to two distinct engines.

The Architecture

graph TD
    User[User Question] --> UI[Local UI]
    
    UI --> Router{The Router Local LLM}
    
    subgraph Left["The Left Brain Deterministic"]
        Router -->|Math / Trends / Aggregations| SQLEngine[Structured Data Engine]
        SQLEngine --> CodeGen[Code Generator]
        CodeGen --> Sandbox[Python Sandbox]
        Sandbox --> Data[(Local CSVs / SQLite)]
        Data --> Result[Exact Numeric Result]
    end
    
    subgraph Right["The Right Brain Semantic"]
        Router -->|Advice / Strategy / Knowledge| VectorEngine[RAG Engine]
        VectorEngine --> Embed[Embedding Model]
        Embed --> VectorDB[(Local Vector DB)]
        VectorDB --> Context[Relevant Documents]
        Context --> Synthesis[LLM Synthesis]
    end
    
    Result --> Synthesizer[Response Synthesizer]
    Synthesis --> Synthesizer
    Synthesizer --> UI
    
    style SQLEngine fill:#e3f2fd,stroke:#0d47a1
    style VectorEngine fill:#e8f5e9,stroke:#388e3c
    style Router fill:#fff9c4,stroke:#fbc02d

How it Works:

  1. Single Entry Point: User asks a question through the local UI.

  2. Intelligent Routing: A small, fast local LLM analyzes the question and routes it:

    • Math/aggregation questions → Structured Data Engine
    • Advice/knowledge questions → RAG Engine
  3. Parallel Processing: Both engines can run simultaneously if needed, but typically only one is invoked per question.

  4. Response Synthesis: The final response combines results from whichever engine was used, formatted naturally.

Observation: This architecture separates concerns cleanly. The Router is the "brain stem" that decides which specialized system handles the request. Each engine is optimized for its domain.


Pattern 2: The Router (Intent Classification)

The Architectural Problem:

How do you reliably distinguish between "What's my total spending?" (needs math) and "What's a good investment strategy?" (needs knowledge)?

You can't use keyword matching—users phrase things differently. You need semantic understanding.

The Architecture

graph TD
    Question[User Question] --> Analyze[Question Analysis]
    
    Analyze --> Features[Extract Features]
    
    Features --> Keywords[Keywords]
    Features --> Structure[Question Structure]
    Features --> Intent[Intent Patterns]
    
    Keywords --> Classifier{Intent Classifier}
    Structure --> Classifier
    Intent --> Classifier
    
    Classifier -->|Math Indicators| Math[Structured Engine]
    Classifier -->|Knowledge Indicators| Knowledge[RAG Engine]
    Classifier -->|Ambiguous| Clarify[Ask for Clarification]
    
    Math --> Indicators1[SUM, COUNT, AVERAGE, TREND, COMPARE]
    Knowledge --> Indicators2[SHOULD, RECOMMEND, STRATEGY, ADVICE, TAX]
    
    style Classifier fill:#fff9c4,stroke:#fbc02d

How it Works:

  1. Feature Extraction: Analyze the question for:

    • Math Keywords: "total," "sum," "average," "compare," "trend," "spent"
    • Knowledge Keywords: "should," "recommend," "strategy," "best," "tax," "invest"
    • Question Structure: Questions starting with "How much" or "What's my" often need math
  2. Confidence Scoring: Each engine gets a relevance score:

    scores = {
        "structured": calculate_math_score(question),  # 0-100
        "rag": calculate_knowledge_score(question),     # 0-100
    }
    
  3. Threshold-Based Routing: If one score is >70 and the other is <30, route confidently. If both are close, ask for clarification.
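
A minimal sketch of this routing logic, assuming crude keyword scorers (the keyword lists and the hits-to-score scaling are illustrative; a production router would use a small local LLM or a trained classifier, with only the >70/<30 thresholds taken from above):

# Illustrative keyword-based scorer; swap in an LLM-based classifier for real use.
MATH_KEYWORDS = {"how much", "total", "sum", "average", "compare", "trend", "spend", "spent"}
KNOWLEDGE_KEYWORDS = {"should", "recommend", "strategy", "best", "tax", "invest"}

def keyword_score(question: str, keywords: set) -> int:
    """Score 0-100 based on how many indicator keywords appear in the question."""
    q = question.lower()
    hits = sum(1 for kw in keywords if kw in q)
    return min(100, hits * 40)

def route(question: str) -> str:
    """Return 'structured', 'rag', or 'clarify' using the confidence thresholds above."""
    math_score = keyword_score(question, MATH_KEYWORDS)
    knowledge_score = keyword_score(question, KNOWLEDGE_KEYWORDS)
    if math_score > 70 and knowledge_score < 30:
        return "structured"
    if knowledge_score > 70 and math_score < 30:
        return "rag"
    return "clarify"  # ambiguous: ask whether the user wants numbers or advice

print(route("How much did I spend on food last month?"))  # structured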

Concrete Example: Routing Decisions

Question 1: "How much did I spend on food last month?"

Analysis:

keywords = ["how much", "spend", "food", "last month"]
math_score = 95  # Strong math indicators
knowledge_score = 5  # No knowledge indicators
decision = "structured"

Question 2: "Should I open a Roth IRA?"

Analysis:

keywords = ["should", "open", "Roth IRA"]
math_score = 10  # No math needed
knowledge_score = 90  # Strong advice/knowledge indicators
decision = "rag"

Question 3: "What's my spending trend over the past year?"

Analysis:

keywords = ["spending", "trend", "year"]
math_score = 85  # Trend analysis requires aggregation
knowledge_score = 15  # Minimal knowledge needed
decision = "structured"

Question 4: "Is a Roth IRA better than a 401k for tax savings?"

Analysis:

keywords = ["better", "tax savings", "Roth IRA", "401k"]
math_score = 20  # Comparison might need some numbers
knowledge_score = 80  # Strong knowledge/advice indicators
decision = "rag"  # Knowledge wins (tax strategy is knowledge)

Observation: The router doesn't need to be perfect—it just needs to be right 90%+ of the time. For edge cases, the system can ask "Are you asking about numbers or advice?" and route based on the clarification.

Think About It: Should the router be a separate model, or can you use the same LLM with a system prompt? A dedicated small model (Llama 3 8B) is faster and cheaper for routing, but a single model with good prompting is simpler. Most production systems use a dedicated router for latency and cost reasons.


Pattern 3: The Structured Data Engine (Deterministic Math)

The Architectural Problem:

This is the most critical part of the system. We do not let the LLM do math.

If we let an LLM read a CSV and calculate the sum token-by-token, it will fail. LLMs are probabilistic text generators—they approximate, they hallucinate, they make mistakes.

The Solution: We use the LLM as a Translator, not a Calculator.

The Architecture

sequenceDiagram
    participant User
    participant Router
    participant Translator
    participant Validator
    participant Sandbox
    participant Data
    
    User->>Router: "Total Uber spend in 2024?"
    Router->>Translator: Route to Structured Engine
    
    Note over Translator: LLM translates natural language<br/>to code. Does NOT calculate.
    
    Translator->>Translator: Generate: df[df['desc'].str.contains('Uber') & (df['date'].str.startswith('2024'))]['amt'].sum()
    
    Translator->>Validator: Validate code safety
    Validator->>Validator: Check for dangerous operations
    Validator->>Sandbox: Code is safe
    
    Sandbox->>Data: Execute query
    Data-->>Sandbox: Return rows
    
    Note over Sandbox: CPU calculates sum.<br/>Zero hallucinations.
    
    Sandbox-->>Translator: Result: 450.00
    Translator->>User: "You spent $450.00 on Uber in 2024."

How it Works:

  1. Translation: The LLM translates the user's natural language question into deterministic code (SQL or Pandas).

  2. Validation: Before execution, validate the code:

    • No file system access outside data directory
    • No network calls
    • No system commands
    • Only read operations (no data modification)
  3. Sandboxed Execution: Run the code in an isolated Python environment with restricted permissions.

  4. Result Extraction: The code returns exact numbers. The LLM never touches the math—it just formats the result into natural language.
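
A hedged sketch of the translation step, assuming the ollama Python client with a local llama3 model; the prompt, the schema string, and the convention of assigning the answer to a variable named result are illustrative choices, not a fixed API:

import ollama

SCHEMA = "transactions.csv columns: date, description, amount, category"

def generate_pandas_code(question: str) -> str:
    """Ask the local model to translate a question into pandas code. The model never does the math itself."""
    prompt = (
        f"You write pandas code against this data: {SCHEMA}\n"
        f"Question: {question}\n"
        "Return only Python code that assigns the final answer to a variable named result."
    )
    response = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]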

Concrete Example: The Translation Process

User Question: "What's my average monthly spending on restaurants this year?"

Step 1: LLM Translation

# LLM generates this code:
query = """
import pandas as pd
df = pd.read_csv('transactions.csv')
df['date'] = pd.to_datetime(df['date'])
df_2024 = df[df['date'].dt.year == 2024]
df_restaurants = df_2024[df_2024['description'].str.contains('restaurant|dining|food', case=False, na=False)]
df_restaurants['month'] = df_restaurants['date'].dt.month
monthly_totals = df_restaurants.groupby('month')['amount'].sum()
average = monthly_totals.mean()
result = {'average_monthly_spending': average}
"""

Step 2: Code Validation

# Security checks
forbidden_patterns = [
    'import os',
    'import subprocess',
    'open(',
    '__import__',
    'eval(',
    'exec(',
]

if any(pattern in query for pattern in forbidden_patterns):
    raise SecurityError("Dangerous operation detected")

Step 3: Sandboxed Execution

# Execute in restricted environment
restricted_globals = {
    'pd': pandas,
    '__builtins__': safe_builtins,  # No file I/O, no network
}

local_vars = {}
exec(query, restricted_globals, local_vars)  # exec() returns None; read the result from the local namespace
result = local_vars['result']
# result: {'average_monthly_spending': 234.50}

Step 4: Natural Language Response

# LLM formats the result
response = f"You spent an average of ${result['average_monthly_spending']:.2f} per month on restaurants in 2024."
# "You spent an average of $234.50 per month on restaurants in 2024."

The Accuracy Comparison:

| Approach | Method | Accuracy | Example Result |
|---|---|---|---|
| Naive LLM | Token-by-token calculation | ~70% | "$200 or so" |
| Code Execution | Deterministic computation | 100% | "$234.50" |

Observation: Code execution is deterministic. 2 + 2 in Python is always 4, regardless of model temperature or randomness. We use the LLM for its strength (understanding intent) and the CPU for its strength (calculation). This is the "LLM as Translator, Code as Calculator" pattern.

Think About It: What if the LLM generates incorrect code? The code might be syntactically valid but logically wrong. You need validation beyond security—check that the query makes sense (e.g., "Did it actually filter by the right criteria?"). Some systems use a two-step process: generate code, then have the LLM review its own code before execution.


Pattern 4: The RAG Engine (Private Knowledge)

The Architectural Problem:

For questions like "Where should I invest to save tax?", we cannot query a CSV. We need external knowledge from documents like tax codes, investment guides, and financial planning resources.

But we can't use cloud-based RAG—that would violate privacy. We need a local RAG system.

The Architecture

graph TD
    subgraph Ingestion["Document Ingestion"]
        PDF[Tax Code PDFs] --> Parser[PDF Parser]
        DOC[Investment Guides] --> Parser
        Parser --> Chunks[Text Chunks]
        Chunks --> Embed[Local Embedding Model]
        Embed --> VectorDB[(Local Vector DB)]
    end
    
    subgraph Query["Query Processing"]
        Question[User Question] --> EmbedQuery[Embed Question]
        EmbedQuery --> Search[Vector Search]
        Search --> VectorDB
        VectorDB --> Context[Top K Relevant Chunks]
        Context --> LLM[Local LLM]
        LLM --> Answer[Natural Language Answer]
    end
    
    style VectorDB fill:#e8f5e9,stroke:#388e3c
    style LLM fill:#e3f2fd,stroke:#0d47a1

How it Works:

  1. Local Ingestion: You drop PDFs into a local folder. The system:

    • Parses PDFs into text chunks
    • Generates embeddings using a local model (no API calls)
    • Stores embeddings in a local vector database (ChromaDB, Qdrant, or SQLite with vector extension)
  2. Local Retrieval: When you ask a question:

    • Embed the question using the same local model
    • Search the vector database for similar chunks
    • Retrieve the top K most relevant paragraphs
  3. Local Synthesis: The local LLM reads the retrieved context and generates an answer using only that context.
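
A minimal sketch of this pipeline, assuming sentence-transformers for local embeddings and ChromaDB as the local vector store; the model name, collection name, and chunk IDs are illustrative:

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")         # local embedding model, no API calls
client = chromadb.PersistentClient(path="./vector_store")  # embeddings stored on local disk
collection = client.get_or_create_collection("financial_docs")

def ingest(chunks: list) -> None:
    """Embed text chunks (already parsed from local PDFs) and store them locally."""
    embeddings = embedder.encode(chunks).tolist()
    ids = [f"chunk-{i}" for i in range(len(chunks))]
    collection.add(ids=ids, documents=chunks, embeddings=embeddings)

def retrieve(question: str, k: int = 3) -> list:
    """Embed the question with the same model and return the top-k chunks."""
    query_embedding = embedder.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=k)
    return results["documents"][0]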

The Privacy Guarantee:

graph LR
    A[Your Documents] --> B[Local Processing]
    B --> C[Local Vector DB]
    C --> D[Local LLM]
    D --> E[Answer]
    
    F[Cloud Services] -.Never Touches Data.-> A
    F -.Never Touches Data.-> B
    F -.Never Touches Data.-> C
    F -.Never Touches Data.-> D
    
    style F fill:#ffebee,stroke:#b71c1c
    style B fill:#e8f5e9,stroke:#388e3c
    style C fill:#e8f5e9,stroke:#388e3c
    style D fill:#e8f5e9,stroke:#388e3c

Observation: Because this runs locally via tools like Ollama, you are essentially "chatting with your private documents" without ever uploading them to the cloud. The entire pipeline—embedding, storage, retrieval, and synthesis—happens on your machine.

Concrete Example: Tax Strategy Query

User Question: "Can I contribute to both a 401k and a Roth IRA in the same year?"

Step 1: Vector Search

# Embed the question
question_embedding = local_embedding_model.embed("Can I contribute to both a 401k and a Roth IRA in the same year?")

# Search vector database
results = vector_db.similarity_search(question_embedding, k=3)

# Retrieved chunks:
# Chunk 1: "401k and IRA contributions are independent. You can contribute up to $23,000 to a 401k and $7,000 to a Roth IRA in 2024..."
# Chunk 2: "Roth IRA contributions have income limits. If your MAGI exceeds $161,000 (single) or $240,000 (married filing jointly), you cannot contribute..."
# Chunk 3: "Traditional 401k contributions reduce your taxable income, while Roth IRA contributions are made with after-tax dollars..."

Step 2: Context Assembly

context = "\n\n".join([chunk.text for chunk in results])

prompt = f"""
You are a financial advisor. Answer the user's question using ONLY the provided context.

Context:
{context}

Question: Can I contribute to both a 401k and a Roth IRA in the same year?

Answer:
"""

Step 3: Local LLM Response

"Yes, you can contribute to both a 401k and a Roth IRA in the same year. 
These are independent contribution limits. In 2024, you can contribute up to 
$22,500 to a 401k and $6,500 to a Roth IRA. However, note that Roth IRA 
contributions have income limits—if your MAGI exceeds $153,000 (single) or 
$228,000 (married), you cannot contribute to a Roth IRA."

The Local Stack Components:

| Component | Technology | Why |
|---|---|---|
| LLM | Ollama (Llama 3, Mistral) | Runs entirely locally, no API keys |
| Embeddings | Local model (sentence-transformers) | No cloud calls for embeddings |
| Vector DB | ChromaDB, Qdrant, or SQLite | Stores vectors on local disk |
| Data | Local CSV/PDF files | Never leaves your machine |
| UI | Streamlit, Chainlit, or Gradio | Runs on localhost |

Think About It: How do you keep the knowledge base up to date? Tax codes change yearly. Investment strategies evolve. You need a mechanism to re-ingest documents when they're updated. Some systems watch the document folder for changes and automatically re-index.
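
A hedged sketch of that folder-watching approach, assuming the watchdog package; reindex() here is a stand-in for your own re-chunk/re-embed/upsert step:

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

def reindex(path: str) -> None:
    # Placeholder: re-chunk, re-embed, and upsert this document into the local vector DB.
    print(f"re-indexing {path}")

class ReindexHandler(FileSystemEventHandler):
    """Re-index a document whenever it is added or changed in the local documents folder."""

    def on_created(self, event):
        if event.src_path.endswith(".pdf"):
            reindex(event.src_path)

    def on_modified(self, event):
        if event.src_path.endswith(".pdf"):
            reindex(event.src_path)

observer = Observer()
observer.schedule(ReindexHandler(), path="documents/", recursive=True)
observer.start()  # runs in a background thread; call observer.stop() on shutdown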


Pattern 5: The Local-First Stack

The Architectural Problem:

Every component must run locally. No cloud APIs, no data uploads, no external services. This is non-negotiable for financial privacy.

The Architecture

graph TD
    subgraph Local["Your Machine"]
        UI[Local UI localhost:8000]
        
        subgraph Brain["AI Brain"]
            Router[Router LLM Ollama]
            Structured[Structured Engine]
            RAG[RAG Engine]
        end
        
        subgraph Storage["Local Storage"]
            CSVs[(Transaction CSVs)]
            PDFs[(Financial Documents)]
            VectorDB[(Vector Database)]
            SQLite[(SQLite DB)]
        end
        
        subgraph Execution["Execution Environment"]
            Sandbox[Python Sandbox]
            Embed[Embedding Model]
        end
    end
    
    UI --> Brain
    Brain --> Storage
    Brain --> Execution
    
    Cloud[Cloud Services] -.No Connection.-> Local
    
    style Local fill:#e8f5e9,stroke:#388e3c
    style Cloud fill:#ffebee,stroke:#b71c1c

The Complete Stack:

  1. LLM Runtime: Ollama (or similar) running Llama 3, Mistral, or other open models

    • No API keys required
    • No internet connection needed after initial model download
    • Runs entirely on CPU/GPU locally
  2. Vector Database: ChromaDB or Qdrant running locally

    • Stores embeddings on disk
    • No cloud sync, no external services
  3. Data Storage: Local file system

    • CSVs in a data/ folder
    • PDFs in a documents/ folder
    • SQLite for structured queries
  4. Execution Sandbox: Isolated Python environment

    • Restricted file system access
    • No network permissions
    • Only reads from designated data directories
  5. UI: Streamlit, Chainlit, or Gradio

    • Runs on localhost:8000
    • Never sends data to external servers
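
As a small sketch of how the pieces talk to each other, assuming the ollama Python client and a pulled llama3 model (both names are examples, not requirements):

import ollama

def ask_local_llm(prompt: str) -> str:
    """Send a prompt to the Ollama server on localhost; nothing leaves the machine."""
    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]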

The Privacy Guarantee:

graph LR
    A[Your Financial Data] --> B[Local Processing Only]
    B --> C[No Cloud Uploads]
    B --> D[No API Calls with Data]
    B --> E[No External Storage]
    
    C --> F[Privacy Guaranteed]
    D --> F
    E --> F
    
    style F fill:#e8f5e9,stroke:#388e3c

Observation: This "Air-Gapped by Design" architecture is the only responsible way to build financial AI tools today. Even with encryption and privacy policies, cloud-based financial AI is a non-starter for privacy-conscious users. Local-first isn't a feature—it's a requirement.

Think About It: What about model updates? If you're running local models, how do you get improvements? Some systems support "model versioning" where you can download newer models, but the data never leaves. Others use a hybrid approach: download model updates, but all inference stays local.


Pattern 6: Code Generation Safety

The Architectural Problem:

You're letting an LLM generate Python code that executes on your machine. This is a security risk. Malicious code could:

  • Delete files
  • Access network resources
  • Exfiltrate data
  • Install malware

You need sandboxing and validation.

The Architecture

graph TD
    LLM[LLM Generates Code] --> Parse[Parse AST]
    
    Parse --> Security[Security Validator]
    Security --> Check1[No File I/O Outside Data Dir]
    Security --> Check2[No Network Calls]
    Security --> Check3[No System Commands]
    Security --> Check4[No Dangerous Imports]
    
    Check1 -->|Pass| Logic[Logic Validator]
    Check2 -->|Pass| Logic
    Check3 -->|Pass| Logic
    Check4 -->|Pass| Logic
    
    Logic --> Check5[Query Makes Sense]
    Logic --> Check6[No Infinite Loops]
    Logic --> Check7[Reasonable Timeout]
    
    Check5 -->|Pass| Sandbox[Execute in Sandbox]
    Check6 -->|Pass| Sandbox
    Check7 -->|Pass| Sandbox
    
    Security -->|Fail| Reject[Reject Code]
    Logic -->|Fail| Reject
    
    Sandbox --> Result[Return Result]
    Reject --> Error[Return Error to User]
    
    style Security fill:#ffebee,stroke:#b71c1c
    style Sandbox fill:#e8f5e9,stroke:#388e3c

How it Works:

  1. AST Parsing: Parse the generated code into an Abstract Syntax Tree. This lets you analyze the code structure without executing it.

  2. Security Validation:

    forbidden_imports = ['os', 'subprocess', 'socket', 'urllib', 'requests']
    forbidden_functions = ['open', 'eval', 'exec', '__import__']
    
    # Check AST for dangerous patterns
    if has_dangerous_import(ast, forbidden_imports):
        raise SecurityError("Dangerous import detected")
    
  3. Logic Validation:

    • Check that the query makes sense (e.g., "Did it actually filter by the requested criteria?")
    • Check for infinite loops (simple heuristic: max iteration count)
    • Set reasonable timeouts (5 seconds max for query execution)
  4. Sandboxed Execution:

    # Restricted execution environment
    safe_globals = {
        'pd': pandas,
        'np': numpy,
        '__builtins__': {
            'len': len,
            'sum': sum,
            'max': max,
            'min': min,
            # Only safe built-ins
        }
    }
    
    # No file I/O, no network, no system calls
    local_vars = {}
    exec(code, safe_globals, local_vars)  # exec() returns None; read results from local_vars
    result = local_vars.get('result')
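
A more complete sketch of the AST checks from step 2, using only the standard library ast module (the forbidden lists mirror the ones above; a production validator would also inspect attribute calls such as os.system and the file paths being read):

import ast

FORBIDDEN_MODULES = {"os", "subprocess", "socket", "urllib", "requests"}
FORBIDDEN_CALLS = {"open", "eval", "exec", "__import__"}

def validate(source: str) -> list:
    """Walk the AST of generated code and collect violations; an empty list means approved."""
    violations = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_MODULES:
                    violations.append(f"forbidden import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in FORBIDDEN_MODULES:
                violations.append(f"forbidden import: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                violations.append(f"forbidden call: {node.func.id}()")
    return violations

print(validate("import os\nos.system('rm -rf /')"))  # ['forbidden import: os']
print(validate("result = df['amount'].sum()"))       # []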
    

Concrete Example: Security Validation

Generated Code (Malicious):

import os
os.system('rm -rf /')  # Dangerous!

Security Check:

# AST analysis detects:
# - Import of 'os' module (forbidden)
# - Call to os.system() (forbidden)
# Result: REJECTED

Generated Code (Safe):

import pandas as pd
df = pd.read_csv('data/transactions.csv')
result = df['amount'].sum()

Security Check:

# AST analysis detects:
# - Only pandas import (allowed)
# - File read from 'data/' directory (allowed)
# - Mathematical operation (allowed)
# Result: APPROVED

The Safety Layers:

| Layer | Protection | Example |
|---|---|---|
| AST Validation | Prevents dangerous code from executing | Blocks import os |
| Sandboxed Environment | Restricts what code can do | No file I/O outside data dir |
| Timeout | Prevents infinite loops | Max 5 seconds execution |
| Result Validation | Checks output makes sense | Ensures numeric results are reasonable |

Observation: Code generation safety is a multi-layered defense. No single check is perfect, but together they create a robust security model. The key is defense in depth: validate before execution, restrict during execution, and validate after execution.

Think About It: What if the LLM generates code that's syntactically valid but logically wrong? For example, it filters by the wrong date range. You need semantic validation—check that the generated query actually answers the user's question. Some systems use a two-step process: generate code, execute it, then have the LLM review the result for reasonableness.
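
A hedged sketch of that review step, again assuming the ollama client and a local llama3 model; treating a leading YES/NO as the verdict is an illustrative convention:

import ollama

def review_result(question: str, generated_code: str, result) -> bool:
    """Ask the local model whether the executed result plausibly answers the question."""
    review_prompt = (
        "You are reviewing a financial query.\n"
        f"Question: {question}\n"
        f"Code that was executed:\n{generated_code}\n"
        f"Result: {result}\n"
        "Does the code match the question's filters (dates, categories, merchants), "
        "and is the result plausible? Answer YES or NO, then give one sentence of reasoning."
    )
    response = ollama.chat(model="llama3", messages=[{"role": "user", "content": review_prompt}])
    return response["message"]["content"].strip().upper().startswith("YES")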


Putting It All Together: A Real Query Flow

Let's trace a complete query through the entire system.

Scenario: User asks "What's my biggest expense category this year?"

sequenceDiagram
    participant U as User
    participant UI as Local UI
    participant R as Router
    participant S as Structured Engine
    participant V as Validator
    participant SB as Sandbox
    participant D as Data
    participant Syn as Synthesizer
    
    U->>UI: "What's my biggest expense category this year?"
    UI->>R: Route question
    
    Note over R: Analyze keywords: "biggest", "expense", "category"
    Note over R: Math score: 95, Knowledge score: 5
    
    R->>S: Route to Structured Engine
    
    S->>S: Generate code: df.groupby('category')['amount'].sum().idxmax()
    
    S->>V: Validate code safety
    V->>V: Check AST, no dangerous operations
    V->>SB: Code approved
    
    SB->>D: Execute query
    D-->>SB: Result: {'category': 'Restaurants', 'total': 3240.50}
    
    SB-->>S: Return result
    S->>Syn: Format result
    
    Syn->>UI: "Your biggest expense category this year is Restaurants, totaling $3,240.50."
    UI->>U: Display answer

Timeline:

| Time | Component | Action | Result |
|---|---|---|---|
| T+0ms | User | Asks question | "What's my biggest expense category?" |
| T+50ms | Router | Analyzes intent | Routes to Structured Engine |
| T+200ms | Structured Engine | Generates code | df.groupby('category')['amount'].sum().idxmax() |
| T+250ms | Validator | Validates code | Approved (safe) |
| T+300ms | Sandbox | Executes code | Reads CSV, calculates |
| T+500ms | Sandbox | Returns result | {'category': 'Restaurants', 'total': 3240.50} |
| T+600ms | Synthesizer | Formats response | Natural language output |
| T+700ms | User | Sees answer | "Your biggest expense category is Restaurants..." |

Key Observations:

  1. Privacy: No data left the machine. All processing was local.

  2. Accuracy: The result is exact ($3,240.50), not an approximation.

  3. Speed: Total latency ~700ms, acceptable for local processing.

  4. Safety: Code was validated before execution, preventing security risks.
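
The whole flow fits in a short orchestration sketch; route(), generate_pandas_code(), validate(), and retrieve() are the illustrative helpers sketched in earlier sections, while run_in_sandbox() and synthesize() are hypothetical stand-ins for the sandbox and the response formatting:

def answer(question: str) -> str:
    """End-to-end flow: route, then either generate-and-execute code or run local RAG."""
    engine = route(question)

    if engine == "structured":
        code = generate_pandas_code(question)      # LLM as translator, not calculator
        violations = validate(code)
        if violations:
            return f"Query rejected for safety reasons: {violations}"
        result = run_in_sandbox(code)              # deterministic math on the CPU
        return synthesize(question, result)

    if engine == "rag":
        context = retrieve(question, k=3)          # local vector search only
        return synthesize(question, context)

    return "Are you asking about your numbers, or looking for advice?"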


Challenge: Design Decisions for Your System

Challenge 1: The Model Size Trade-Off

You want to run everything locally, but large models (70B parameters) give better results than small models (8B parameters).

Options:

  1. Small Model (8B): Fast, runs on CPU, but less accurate routing
  2. Medium Model (13B): Better accuracy, needs GPU, slower
  3. Hybrid: Small model for routing, large model for synthesis

Your Task: How do you balance accuracy, latency, and hardware requirements? Do you require users to have a GPU, or optimize for CPU-only?

Challenge 2: The Data Format Problem

Users have bank statements in different formats:

  • CSV from one bank
  • PDF from another
  • Excel from a third
  • JSON API exports

Questions:

  • Do you build parsers for each format?
  • Or require users to convert to a standard format?
  • How do you handle missing fields or inconsistent schemas?

Challenge 3: The Update Problem

Tax codes change yearly. Investment strategies evolve. Your knowledge base needs updates.

Options:

  1. Manual Updates: User downloads new PDFs, system re-indexes
  2. Automatic Checks: System checks for updates (but requires internet)
  3. Hybrid: Check for updates, but user approves before downloading

Your Task: How do you balance convenience (automatic updates) with privacy (no automatic downloads)?


System Comparison: Cloud vs. Local

| Dimension | Cloud-Based | Local-First |
|---|---|---|
| Privacy | Data on external servers | Data never leaves machine |
| Accuracy | LLM math (approximate) | Code execution (exact) |
| Latency | Network dependent | Local (faster) |
| Cost | API fees per query | One-time hardware cost |
| Offline | Requires internet | Works offline |
| Setup | Simple (API keys) | Complex (model downloads) |
| Updates | Automatic | Manual |
| Security | Trust cloud provider | Full control |

graph TD
    subgraph Cloud["Cloud-Based System"]
        C1[Fast Setup] --> C2[API Costs]
        C2 --> C3[Privacy Concerns]
        C3 --> C4[Network Dependency]
        style C3 fill:#ffebee,stroke:#b71c1c
    end
    
    subgraph Local["Local-First System"]
        L1[Complex Setup] --> L2[No Ongoing Costs]
        L2 --> L3[Privacy Guaranteed]
        L3 --> L4[Works Offline]
        style L3 fill:#e8f5e9,stroke:#388e3c
    end

Key Architectural Patterns Summary

| Pattern | Problem Solved | Key Benefit | Complexity |
|---|---|---|---|
| Hybrid Architecture | Math vs. knowledge questions | Right tool for right job | Medium |
| Router | Intent classification | Efficient routing | Low |
| Code Execution | LLM math errors | Deterministic accuracy | High |
| Local RAG | Privacy in knowledge retrieval | No cloud uploads | Medium |
| Sandboxing | Code generation security | Safe execution | High |
| Local-First Stack | Privacy guarantees | Complete data control | Medium |

Discussion Points for Engineers

1. The Model Accuracy vs. Privacy Trade-Off

Local models (Llama 3, Mistral) are good but not as capable as cloud models (GPT-4, Claude).

Questions:

  • Do you accept lower accuracy for privacy?
  • Or use a hybrid: local for sensitive data, cloud for general knowledge?
  • How do you explain accuracy differences to users?

2. The Data Schema Problem

Every bank exports data differently. One CSV has "amount," another has "transaction_amount," a third has "amt."

Questions:

  • Do you build a universal schema mapper?
  • Or require users to normalize their data first?
  • How do you handle schema drift when banks change formats?

3. The Verification Challenge

How do users verify the system is working correctly? If it says "You spent $1,240," how do they know that's right?

Questions:

  • Do you show the generated code so users can audit it?
  • Or provide a "verify" mode that shows intermediate steps?
  • How do you build trust in a black-box system?

Takeaways

The Three Pillars of Trust

graph TD
    A[Trustworthy Financial AI] --> B[1. Privacy]
    A --> C[2. Accuracy]
    A --> D[3. Transparency]
    
    B --> E[Local-First Architecture]
    C --> F[Code Execution Pattern]
    D --> G[Auditable Queries]
    
    style A fill:#e3f2fd,stroke:#0d47a1
    style E fill:#e8f5e9,stroke:#388e3c
    style F fill:#e8f5e9,stroke:#388e3c
    style G fill:#e8f5e9,stroke:#388e3c

Key Insights

  • Privacy is non-negotiable — Financial data is too sensitive for cloud processing. Local-first isn't optional—it's the only responsible architecture.

  • LLMs are translators, not calculators — Use LLMs to understand intent and generate code. Use CPUs to execute that code. This pattern guarantees mathematical accuracy.

  • Hybrid architecture is mandatory — Financial questions split into "math" and "knowledge." You need different engines for each. A router intelligently directs traffic.

  • Code generation requires defense in depth — AST validation, sandboxing, timeouts, and result validation create a robust security model. Never trust generated code blindly.

  • Local-first enables offline operation — Once models are downloaded, the system works completely offline. This is critical for users in areas with unreliable internet or strict data sovereignty requirements.

The Implementation Roadmap

| Phase | Focus | Why |
|---|---|---|
| Phase 1 | Local LLM setup | Prove the local stack works |
| Phase 2 | Structured data engine | Validate code execution pattern |
| Phase 3 | Router implementation | Enable hybrid routing |
| Phase 4 | Local RAG engine | Add knowledge capabilities |
| Phase 5 | Security hardening | Production-ready sandboxing |

What's Next: Beyond Personal Finance

The patterns in this post extend beyond financial agents:

  • Medical Records: Private, local analysis of health data with deterministic calculations
  • Legal Documents: Local RAG for contract analysis without uploading sensitive documents
  • Research Data: Academic researchers analyzing proprietary datasets locally
  • Business Intelligence: Companies analyzing internal financials without cloud exposure

The architecture is the same. The domain changes. The privacy and accuracy requirements endure.

The Result: You've built a system that doesn't just answer questions—it does so with mathematical precision and complete privacy. It's not a chatbot. It's a trusted financial advisor that lives on your machine, understands your data, and never guesses the numbers.

This is what responsible financial AI looks like.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
