Designing Memory Systems for AI Agentic Applications

Param Harrison

In our previous posts, we built agents that can "think" (reasoning) and "act" (tools). But they still have a fatal flaw: Amnesia.

If you tell your agent, "My name is Param, and I prefer Python over JavaScript," and then chat for 20 minutes about code architecture, it will eventually ask, "What's your name again?" If you teach it a new coding pattern today, it will forget it by tomorrow.

Memory is essential. Without it, agents act blindly, unable to learn from mistakes, adapt to user preferences, or maintain context across sessions.

This post is for engineers who want to move beyond simple "context windows" and build agents with Persistent, Adaptive Memory. We will explore four architectural levels of memory, from basic conversation buffers to sophisticated episodic learning systems.


The Failure Case: The "Goldfish" LLM

Before we dive into solutions, let's see what happens when you naively build an agent without proper memory architecture.

graph TD
    subgraph Naive["Naive Approach - No Memory"]
        T1[Turn 1: User says name] --> T2[Turn 2: Agent remembers]
        T2 --> T3[Turn 10: Still remembers]
        T3 --> T4[Turn 50: Context full]
        T4 --> T5[Turn 51: Forgets everything]
        T5 --> T6[Agent asks name again]
        
        style T5 fill:#ffebee,stroke:#b71c1c
        style T6 fill:#ffebee,stroke:#b71c1c
    end
    
    subgraph Cost["Cost Explosion"]
        T1 --> C1[Prompt: 100 tokens]
        T2 --> C2[Prompt: 200 tokens]
        T3 --> C3[Prompt: 1000 tokens]
        T4 --> C4[Prompt: 8000 tokens]
        T4 --> C5[Cost: 10x higher]
        
        style C5 fill:#ffebee,stroke:#b71c1c
    end

The Three Problems:

  1. Context Window Overflow: As conversations grow, you hit the token limit. Older messages get dropped, and the agent forgets critical information.

  2. Cost Explosion: Every request re-sends the entire history. By turn 50, a single prompt carries roughly 50x the tokens of turn 1, and the total cost of the conversation grows quadratically with its length.

  3. No Learning: The agent makes the same mistakes repeatedly. It doesn't remember that calling a broken API failed yesterday, so it tries again today.

Observation: LLMs are stateless functions: output = f(input). They don't "remember" previous calls. To fake memory, we stuff the previous conversation into the prompt every time. This works for short conversations but breaks at scale.
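
To make the problem concrete, here is a minimal sketch of the naive pattern, assuming a placeholder call_llm client (not a specific SDK): the full history is re-sent on every call, which is exactly what drives the overflow and cost curves above.

history = []

def naive_chat(user_message, call_llm):
    # Every call re-sends the entire conversation, so the prompt
    # (and the cost) grows with every turn until the context window overflows.
    history.append({"role": "user", "content": user_message})
    reply = call_llm(messages=history)  # call_llm is a placeholder for your LLM client
    history.append({"role": "assistant", "content": reply})
    return reply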


Pattern 1: Short-Term Memory (The Conversation Buffer)

The Architectural Problem:

You need to keep track of the current conversation—what was just said, what the user is responding to, the immediate context. But you can't keep everything forever, or you'll hit context limits and cost ceilings.

The Solution:

We implement a Sliding Window Buffer with intelligent summarization.

The Architecture

graph TD
    subgraph Buffer["Conversation Buffer Management"]
        New[New Message] --> Window[Sliding Window]
        Window --> Recent[Keep Last N Messages]
        Recent --> Check{Window Full?}
        
        Check -->|No| Add[Add to Buffer]
        Check -->|Yes| Summarize[Summarize Old Messages]
        
        Summarize --> Summary[System Summary]
        Summary --> Add
        Add --> Final[Final Buffer]
    end
    
    subgraph Context["Context Assembly"]
        Final --> System[System Prompt]
        Summary --> System
        System --> LLM[LLM Call]
    end
    
    style Window fill:#fff9c4,stroke:#fbc02d
    style Summary fill:#e3f2fd,stroke:#0d47a1

How it Works:

  1. Sliding Window: Keep the last N messages (e.g., the last 20 turns) in full detail.

  2. Summarization: When messages fall out of the window, compress them into a "System Summary" that retains key facts without verbatim text.

  3. Context Assembly: Build the prompt with: System Summary (compressed history) + Recent Messages (full detail) + Current Query.

Concrete Example: The Buffer in Action

Scenario: User is debugging a Python script with the agent over 30 turns.

Turn 1-10:

User: "I'm getting an error in my script"
Agent: "What's the error message?"
User: "AttributeError: 'NoneType' object has no attribute 'split'"
Agent: "This means you're calling .split() on None. Where in your code?"
...
[10 turns of debugging]

Turn 11-20:

[More debugging turns]
Buffer: [Turns 1-20 in full detail]

Turn 21:

Buffer is full (20 turns max)
Action: Summarize turns 1-10
Summary: "User debugging Python AttributeError. Issue is in line 45 where 
         variable 'text' can be None. We've tried adding None checks."
Buffer: [Summary of 1-10] + [Turns 11-21 in full detail]

Turn 30:

Buffer is full again
Action: Summarize turns 11-20
Summary: "Continued debugging. Added try-except block. Error persists. 
         Suspecting issue with data loading function."
Buffer: [Summary of 1-10] + [Summary of 11-20] + [Turns 21-30 in full detail]
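
A minimal sketch of this buffer, assuming a summarize callable backed by an LLM call (the 20-turn window and the method names are illustrative, not a specific framework's API):

class ConversationBuffer:
    def __init__(self, max_turns=20, summarize=None):
        self.max_turns = max_turns
        self.summarize = summarize   # callable: list of messages -> summary string (e.g., an LLM call)
        self.summaries = []          # compressed summaries of older turns, oldest first
        self.messages = []           # recent turns kept in full detail

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_turns:
            # Window is full: compress the oldest half of the turns into a summary
            half = self.max_turns // 2
            old, self.messages = self.messages[:half], self.messages[half:]
            self.summaries.append(self.summarize(old))

    def build_context(self, system_prompt):
        # Context assembly: system prompt + compressed history + recent turns in full detail
        summary_block = "\n".join(self.summaries)
        system = f"{system_prompt}\n\nSUMMARY OF EARLIER CONVERSATION:\n{summary_block}"
        return [{"role": "system", "content": system}] + self.messages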

The Impact:

Approach | Context Size | Cost per Turn | Information Loss
--- | --- | --- | ---
Naive (keep all) | Grows linearly | Grows linearly | None (until overflow)
Sliding Window | Fixed size | Fixed cost | Minimal (summarized)

Observation: The sliding window maintains immediate context (the last 20 turns) while preserving the gist of older conversation through summarization. This allows the agent to make coherent, context-aware decisions without carrying the baggage of the entire history.

Think About It: How do you decide when to summarize? Some systems summarize after a fixed number of turns (every 10 turns). Others use token-based thresholds (summarize when the buffer exceeds 4,000 tokens). Which is better? Token-based is more precise, but it requires counting tokens on every turn.


Pattern 2: Semantic Memory (The Knowledge Library)

The Architectural Problem:

What if the user asks, "What are the safety rules for this machine?" or "What's our company's coding style guide?" This is semantic knowledge—facts, concepts, and rules that persist across sessions.

A sliding window eventually discards this information. We need External Storage from which knowledge can be retrieved on demand.

The Solution:

We build a Vector Database for semantic retrieval.

The Architecture

graph TD
    subgraph Ingestion["Knowledge Ingestion"]
        Docs[Documents] --> Parse[Parse & Chunk]
        Parse --> Embed[Generate Embeddings]
        Embed --> VectorDB[(Vector Database)]
    end
    
    subgraph Retrieval["Query-Time Retrieval"]
        Query[User Question] --> EmbedQuery[Embed Question]
        EmbedQuery --> Search[Similarity Search]
        Search --> VectorDB
        VectorDB --> Relevant[Top K Relevant Chunks]
        Relevant --> Inject[Inject into Context]
        Inject --> LLM[LLM with Knowledge]
    end
    
    style VectorDB fill:#e8f5e9,stroke:#388e3c
    style Inject fill:#e3f2fd,stroke:#0d47a1

How it Works:

  1. Knowledge Ingestion: Load documents (style guides, safety rules, domain knowledge) into a vector database. Each document is chunked and embedded (a minimal ingestion sketch follows this list).

  2. Retrieval Mechanism: When a query comes in, embed it and perform similarity search. Retrieve only the top K most relevant chunks.

  3. Context Injection: Inject the retrieved knowledge into the system prompt, giving the agent access to relevant information without storing it in the conversation buffer.
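
The retrieval side is walked through step by step below. For the ingestion side described in step 1, here is a hedged sketch, assuming a generic embedding_model with an embed() method and a vector_db with an add() method (placeholder interfaces, not a specific library):

def ingest_documents(docs, embedding_model, vector_db, chunk_size=500):
    # Split each document into fixed-size chunks, embed each chunk,
    # and store the vector together with the text so retrieval can inject it later.
    for doc in docs:
        chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
        for chunk in chunks:
            vector = embedding_model.embed(chunk)
            vector_db.add(vector=vector, payload={"text": chunk})  # assumed interface; adapt to your store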

Concrete Example: The Coding Style Guide

Scenario: User asks "What's our naming convention for functions?"

Step 1: Query Processing

query = "What's our naming convention for functions?"
query_embedding = embedding_model.embed(query)

Step 2: Semantic Search

results = vector_db.similarity_search(query_embedding, k=3)

# Retrieved chunks:
# Chunk 1: "Function names should be snake_case. Use verbs: get_user(), 
#          calculate_total(), validate_input()."
# Chunk 2: "Private functions start with underscore: _internal_helper()."
# Chunk 3: "Async functions end with _async: fetch_data_async()."

Step 3: Context Assembly

system_prompt = f"""
You are a coding assistant.

RELEVANT KNOWLEDGE:
{retrieved_chunks}

Answer the user's question using this knowledge.
"""

Step 4: Agent Response

"According to our style guide, function names should be snake_case using verbs. 
For example: get_user(), calculate_total(). Private functions start with an 
underscore, and async functions end with _async."

The Knowledge Types:

Knowledge Type | Example | Storage | Retrieval
--- | --- | --- | ---
Style Guides | "Use snake_case for functions" | Vector DB | Semantic search
Domain Rules | "Never call this API without auth" | Vector DB | Semantic search
User Preferences | "User prefers Python over JS" | Separate DB | Direct lookup
Project Context | "This project uses FastAPI" | Vector DB | Semantic search

Observation: Semantic memory transforms the agent from a generic assistant into a domain expert. By retrieving relevant knowledge on-demand, the agent can answer questions about facts, rules, and concepts that were never in the conversation history.

Think About It: How do you handle conflicting knowledge? If the style guide says "use snake_case" but the user just wrote a function in camelCase, does the agent correct them or adapt? Most systems prioritize the user's recent actions over stored knowledge but flag the inconsistency.


Pattern 3: Episodic Memory (Learning from Experience)

The Architectural Problem:

The agent makes a mistake: it calls a broken API, the user corrects it, and the agent learns. But tomorrow, it makes the same mistake again. Why? Because it doesn't remember the episode.

Episodic Memory stores specific events—what happened, what action was taken, and what the result was. This is how agents learn from experience.

The Solution:

We implement a Reflection Loop that stores and retrieves past episodes.

The Architecture

graph TD
    subgraph Episode["Episode Capture"]
        Action[Agent Takes Action] --> Result[Result Observed]
        Result --> Feedback[User Feedback]
        Feedback --> Store[Store Episode]
        
        Store --> EpisodeDB[(Episode Database)]
    end
    
    subgraph Retrieval["Episode Retrieval"]
        NewTask[New Task] --> Search[Search Similar Episodes]
        Search --> EpisodeDB
        EpisodeDB --> Similar[Similar Past Episodes]
        
        Similar --> Analyze{Analyze Outcomes}
        Analyze -->|Success| Reuse[Reuse Successful Strategy]
        Analyze -->|Failure| Avoid[Avoid Failed Strategy]
        
        Reuse --> Execute[Execute Action]
        Avoid --> Execute
    end
    
    Execute --> Store
    
    style EpisodeDB fill:#e8f5e9,stroke:#388e3c
    style Analyze fill:#fff9c4,stroke:#fbc02d

How it Works:

  1. Episode Capture: After each action, store:

    • State: What was the situation?
    • Action: What did the agent do?
    • Outcome: What happened?
    • Feedback: User reaction (positive/negative)
  2. Episode Retrieval: Before taking a new action, search for similar past episodes:

    • "Have I done this before?"
    • "Did it work?"
    • "What went wrong?"
  3. Strategy Adjustment: Use past episodes to inform current decisions:

    • If similar action succeeded → reuse the strategy
    • If similar action failed → try a different approach

Concrete Example: Learning from API Failures

Scenario: Agent tries to call a weather API that's been down for maintenance.

Episode 1: The Failure

# Episode stored:
{
    "state": "User asked for weather in NYC",
    "action": "Called get_weather_api('NYC')",
    "outcome": "API returned 503 Service Unavailable",
    "feedback": "User said 'That didn't work'",
    "timestamp": "2025-11-20 10:00:00"
}

Episode 2: The Learning

# Next day, user asks for weather again
# Agent searches episodic memory:
similar_episodes = search_episodes("weather API call")

# Finds Episode 1
# Decision: "Last time I called this API, it failed. Let me try a different approach."
# Action: Use backup weather service instead

Episode 3: The Success

# Episode stored:
{
    "state": "User asked for weather in NYC",
    "action": "Called backup_weather_service('NYC')",
    "outcome": "Successfully retrieved weather data",
    "feedback": "User said 'Thanks!'",
    "timestamp": "2025-11-21 14:00:00"
}
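
A minimal sketch of the retrieve-and-adjust step, reusing the search_episodes helper from the example above (the tool names and the failure heuristic are illustrative):

def choose_weather_tool(search_episodes):
    # Retrieve past episodes similar to the action we are about to take
    past = search_episodes("weather API call")
    # Treat any episode whose outcome mentions an error as a failure signal
    failed_before = any(
        "503" in ep["outcome"] or "unavailable" in ep["outcome"].lower()
        for ep in past
    )
    if failed_before:
        return "backup_weather_service"   # avoid the strategy that failed last time
    return "get_weather_api"              # no bad history, so use the primary tool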

The Reflection Loop:

sequenceDiagram
    participant User
    participant Agent
    participant Memory
    
    User->>Agent: "Get weather for NYC"
    Agent->>Memory: Search similar episodes
    Memory-->>Agent: Found: API failed yesterday
    
    Note over Agent: Adjust strategy - use backup
    
    Agent->>Agent: Call backup service
    Agent->>User: Weather data
    
    User->>Agent: "Thanks!"
    
    Agent->>Memory: Store successful episode
    Note over Memory: Learn: backup service works

Observation: Episodic memory transforms the agent from a reactive system into a learning system. It avoids repeating mistakes and doubles down on successful strategies. This is the foundation of continuous improvement in AI agents.

Think About It: How do you determine if two episodes are "similar"? Do you use semantic similarity on the state description? Or exact matching on the action? Most systems use a hybrid: exact match on action type, semantic similarity on state. This catches "same action, similar situation" without being too strict.
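
One way to express that hybrid rule, assuming episodes also store a normalized action_type field (the episodes above record only the full action string) and a placeholder embed() function:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_similar_episode(new_state, new_action_type, episode, embed, threshold=0.8):
    # Exact match on the action type, semantic similarity on the state description
    if episode["action_type"] != new_action_type:
        return False
    return cosine_similarity(embed(new_state), embed(episode["state"])) >= threshold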


Pattern 4: Planning Memory (Multi-Tasking & Long-Term Goals)

The Architectural Problem:

An agent is building an app and answering emails at the same time. It needs to keep those memories separate. It also needs to remember future steps—what it planned to do, what it's already done, and what's next.

The Solution:

We implement Threading for task isolation and Scratchpads for planning.

The Architecture

graph TD
    subgraph Threading["Task Threading"]
        Task1[Task: Build App] --> Thread1[Thread ID: app_001]
        Task2[Task: Answer Emails] --> Thread2[Thread ID: email_002]
        
        Thread1 --> Memory1[(Thread 1 Memory)]
        Thread2 --> Memory2[(Thread 2 Memory)]
        
        Memory1 -.Isolated.-> Memory2
    end
    
    subgraph Planning["Planning Scratchpad"]
        Goal[Long-Term Goal] --> Plan[Break into Steps]
        Plan --> Progress[Track Progress]
        Progress --> Obstacles[Log Obstacles]
        Obstacles --> Adjust[Adjust Plan]
        
        Plan --> Scratchpad[(Planning Memory)]
    end
    
    style Threading fill:#e3f2fd,stroke:#0d47a1
    style Planning fill:#fff9c4,stroke:#fbc02d

How it Works:

  1. Threading: Assign a unique thread_id to every workflow. Store separate memory states for each thread. This prevents context bleeding between tasks.

  2. Planning Scratchpad: Maintain a dedicated memory block for plans:

    • Subgoals: "1. Research, 2. Draft, 3. Review"
    • Progress: "Step 1 complete, working on Step 2"
    • Obstacles: "API timed out, need to retry"
  3. Context Switching: When switching tasks, save current thread state and load the target thread state.

Concrete Example: Multi-Tasking Agent

Scenario: Agent is working on two projects simultaneously.

Thread 1: Building a Dashboard

thread_1_state = {
    "thread_id": "dashboard_001",
    "goal": "Build analytics dashboard",
    "plan": [
        "1. Design schema (complete)",
        "2. Build API endpoints (in progress)",
        "3. Create frontend (pending)"
    ],
    "context": "Working on /api/metrics endpoint. Need to aggregate user data.",
    "obstacles": ["Database query is slow, need to optimize"]
}

Thread 2: Answering Support Emails

thread_2_state = {
    "thread_id": "support_002",
    "goal": "Respond to customer inquiries",
    "plan": [
        "1. Categorize emails (complete)",
        "2. Draft responses (in progress)",
        "3. Send replies (pending)"
    ],
    "context": "Drafting response to Bob about login issue. Need to check if it's a known bug.",
    "obstacles": []
}

Context Switch:

sequenceDiagram
    participant User
    participant Agent
    participant Memory
    
    User->>Agent: "Switch to email task"
    Agent->>Memory: Save Thread 1 state
    Agent->>Memory: Load Thread 2 state
    Memory-->>Agent: "You were drafting response to Bob"
    Agent->>User: "Back to Bob's email. Ready to continue?"
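
A minimal sketch of that save/load cycle, assuming thread states are plain dicts like the ones above and persistence is an in-memory store (swap in a database for anything real):

class ThreadMemoryStore:
    def __init__(self):
        self._threads = {}   # thread_id -> saved state dict
        self.active = None   # state of the thread currently being worked on

    def save(self, state):
        self._threads[state["thread_id"]] = state

    def switch_to(self, thread_id):
        # Save the current thread's state, then load (or initialize) the target thread's state
        if self.active is not None:
            self.save(self.active)
        self.active = self._threads.get(
            thread_id,
            {"thread_id": thread_id, "plan": [], "context": "", "obstacles": []},
        )
        return self.active

store = ThreadMemoryStore()
store.save(thread_2_state)                # email thread saved earlier
store.active = thread_1_state             # currently building the dashboard
current = store.switch_to("support_002")  # saves dashboard_001, restores Bob's email context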

The Planning Scratchpad Structure:

planning_memory = {
    "goal": "Build user authentication system",
    "subgoals": [
        {"id": 1, "task": "Design database schema", "status": "complete"},
        {"id": 2, "task": "Implement login endpoint", "status": "in_progress"},
        {"id": 3, "task": "Add password reset", "status": "pending"},
        {"id": 4, "task": "Write tests", "status": "pending"}
    ],
    "current_focus": "Implementing JWT token generation",
    "blockers": ["Need to decide on token expiration time"],
    "notes": "User requested 7-day expiration, but security best practice is 1 hour"
}

Observation: Planning memory transforms the agent from a reactive chatbot into a proactive project manager. It can handle long-term strategy, track progress across multiple tasks, and maintain context when switching between work streams.

Think About It: How do you handle plan changes? If the user says "Actually, skip step 3 and go straight to step 4," how does the agent update the planning memory? Most systems support plan mutations: mark steps as skipped, add new steps, reorder existing steps.
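
A sketch of those mutations against the planning_memory structure above (the helper names are illustrative):

def skip_step(planning_memory, step_id):
    # Mark a step as skipped without losing the record that it existed
    for step in planning_memory["subgoals"]:
        if step["id"] == step_id:
            step["status"] = "skipped"

def add_step(planning_memory, task, after_id=None):
    # Append or insert a new step; a fresh id avoids renumbering existing steps
    new_id = max(s["id"] for s in planning_memory["subgoals"]) + 1
    step = {"id": new_id, "task": task, "status": "pending"}
    if after_id is None:
        planning_memory["subgoals"].append(step)
    else:
        idx = next(i for i, s in enumerate(planning_memory["subgoals"]) if s["id"] == after_id)
        planning_memory["subgoals"].insert(idx + 1, step)

# "Actually, skip step 3 and go straight to step 4"
skip_step(planning_memory, 3)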


Putting It All Together: The Complete Memory Stack

Let's see how all four memory types work together in a real scenario.

Scenario: User is building a web app with the agent over multiple sessions.

graph TD
    subgraph Session1["Session 1: Planning"]
        S1[User: Build a todo app] --> P1[Planning Memory: Create plan]
        P1 --> S2[Agent: Break into steps]
        S2 --> SM1[Semantic Memory: Load best practices]
        SM1 --> P2[Planning Memory: Store plan]
    end
    
    subgraph Session2["Session 2: Development"]
        S3[User: Start coding] --> ST1[Short-Term: Current conversation]
        ST1 --> P3[Planning Memory: Load plan]
        P3 --> E1[Episodic Memory: Check past mistakes]
        E1 --> S4[Agent: Avoid known pitfalls]
        S4 --> ST2[Short-Term: Track progress]
    end
    
    subgraph Session3["Session 3: Debugging"]
        S5[User: App is broken] --> ST3[Short-Term: Debug session]
        ST3 --> E2[Episodic Memory: Search similar bugs]
        E2 --> SM2[Semantic Memory: Load debugging guides]
        SM2 --> S6[Agent: Apply learned solutions]
        S6 --> E3[Episodic Memory: Store new solution]
    end
    
    style P2 fill:#fff9c4,stroke:#fbc02d
    style E3 fill:#e8f5e9,stroke:#388e3c
    style SM2 fill:#e3f2fd,stroke:#0d47a1

The Memory Interaction:

Memory Type | What It Provides | Example
--- | --- | ---
Short-Term | Immediate context | "We're debugging the login function"
Semantic | Domain knowledge | "JWT tokens should expire in 1 hour"
Episodic | Past experiences | "Last time we had this error, it was a CORS issue"
Planning | Future goals | "Step 2 of 5: Building API endpoints"

Observation: The four memory types complement each other. Short-term provides immediate context, semantic provides domain knowledge, episodic provides learned strategies, and planning provides long-term direction. Together, they create a comprehensive memory system.
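
A hedged sketch of how the final prompt might be assembled from the four sources, reusing the illustrative interfaces from the earlier patterns (the buffer, vector_db, episode_store, and planning_memory structures are all assumptions, not a specific framework):

def assemble_context(query, buffer, vector_db, episode_store, planning_memory, embed):
    # Semantic memory: domain knowledge relevant to this query
    knowledge_chunks = vector_db.similarity_search(embed(query), k=3)
    # Episodic memory: what happened the last time we did something similar
    past_episodes = episode_store.search(query, k=3)

    knowledge = "\n".join(knowledge_chunks)
    lessons = "\n".join(f"- {ep['action']} -> {ep['outcome']}" for ep in past_episodes)

    system_prompt = (
        "You are a coding assistant.\n\n"
        # Planning memory: long-term goal and current focus
        f"CURRENT PLAN: {planning_memory['current_focus']} (goal: {planning_memory['goal']})\n\n"
        f"RELEVANT KNOWLEDGE:\n{knowledge}\n\n"
        f"LESSONS FROM PAST EPISODES:\n{lessons}"
    )
    # Short-term memory: compressed summaries + recent turns, then the new query
    return buffer.build_context(system_prompt) + [{"role": "user", "content": query}]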


Challenge: Design Decisions for Your Memory System

Challenge 1: The Memory Conflict Problem

Your semantic memory says "Use snake_case for functions," but the user just wrote a function in camelCase. Your episodic memory says "User prefers camelCase based on past code reviews."

Questions:

  • Which memory takes precedence?
  • Do you correct the user or adapt to their preference?
  • How do you resolve conflicts between memory types?

Challenge 2: The Episodic Memory Explosion

After 1000 interactions, you have 1000 episodes. Searching through all of them is slow.

Options:

  1. Time-Based Pruning: Delete episodes older than 30 days
  2. Relevance-Based Pruning: Keep only episodes with high success/failure signals
  3. Clustering: Group similar episodes, keep only representative examples

Your Task: Which pruning strategy balances performance with learning? How do you ensure you don't delete critical lessons?

Challenge 3: The Planning Drift Problem

Your agent created a plan 2 weeks ago. The project requirements have changed, but the plan hasn't been updated.

Questions:

  • How do you detect when a plan is stale?
  • Should the agent proactively suggest plan updates?
  • How do you handle plan changes without losing progress tracking?

System Comparison: Memory Architectures

Dimension | No Memory | Short-Term Only | Full Stack
--- | --- | --- | ---
Context Retention | None | Last N turns | Persistent across sessions
Learning | None | None | Learns from experience
Domain Knowledge | None | None | Retrieves on demand
Multi-Tasking | Impossible | Confused | Isolated threads
Cost | Low | Medium | Higher (but justified)
Complexity | Low | Low | High

graph TD
    subgraph None["No Memory"]
        N1[Each turn independent] --> N2[No learning]
        N2 --> N3[Repeats mistakes]
        style N3 fill:#ffebee,stroke:#b71c1c
    end
    
    subgraph Full["Full Memory Stack"]
        F1[Short-term context] --> F2[Semantic knowledge]
        F2 --> F3[Episodic learning]
        F3 --> F4[Planning capability]
        F4 --> F5[Adaptive agent]
        style F5 fill:#e8f5e9,stroke:#388e3c
    end

Key Architectural Patterns Summary

Pattern | Problem Solved | Key Benefit | Complexity
--- | --- | --- | ---
Short-Term Buffer | Context window overflow | Maintains immediate context | Low
Semantic Memory | Domain knowledge access | On-demand knowledge retrieval | Medium
Episodic Memory | Learning from experience | Avoids repeating mistakes | High
Planning Memory | Long-term goal tracking | Multi-tasking and strategy | Medium

Discussion Points for Engineers

1. The Memory Update Strategy

When does memory get updated? After every interaction? On a schedule? On explicit user feedback?

Questions:

  • Do you update episodic memory after every action, or only on failures/successes?
  • How do you handle conflicting feedback? (User says "good" but the action actually failed)
  • Should memory updates be synchronous (blocking) or asynchronous (background)?

2. The Privacy vs. Learning Trade-Off

Episodic memory stores user interactions. This is sensitive data.

Questions:

  • Do you store full conversations or just summaries?
  • How do you handle GDPR "right to be forgotten" requests?
  • Can users opt out of episodic memory while keeping other memory types?

3. The Memory Versioning Problem

Your semantic memory has a style guide. The style guide gets updated. How do you handle versioning?

Questions:

  • Do you re-embed all documents when they change?
  • How do you handle queries that might match old vs. new versions?
  • Should the agent know which version of knowledge it's using?

Takeaways

The Four Layers of Memory

graph TD
    A[Complete Memory System] --> B[Layer 1: Short-Term]
    A --> C[Layer 2: Semantic]
    A --> D[Layer 3: Episodic]
    A --> E[Layer 4: Planning]
    
    B --> F[Immediate Context]
    C --> G[Domain Knowledge]
    D --> H[Learned Experience]
    E --> I[Future Strategy]
    
    style A fill:#e3f2fd,stroke:#0d47a1
    style F fill:#e8f5e9,stroke:#388e3c
    style G fill:#e8f5e9,stroke:#388e3c
    style H fill:#e8f5e9,stroke:#388e3c
    style I fill:#e8f5e9,stroke:#388e3c

Key Insights

  • Memory is not optional — Agents without memory are goldfish. They can't learn, adapt, or maintain context. Memory transforms agents from stateless functions into persistent, learning systems.

  • Different memories for different needs — Short-term for immediate context, semantic for domain knowledge, episodic for learning, planning for strategy. Each serves a distinct purpose.

  • Episodic memory enables learning — The ability to remember past experiences and avoid repeating mistakes is what separates intelligent agents from simple chatbots.

  • Planning memory enables agency — Agents that can track long-term goals and switch between tasks feel more like collaborators than tools.

  • Memory architecture is a trade-off — More memory means more complexity, higher costs, and privacy concerns. But for production agents, it's non-negotiable.

The Implementation Roadmap

Phase | Focus | Why
--- | --- | ---
Phase 1 | Short-term buffer | Basic context management
Phase 2 | Semantic memory | Domain knowledge access
Phase 3 | Episodic memory | Learning from experience
Phase 4 | Planning memory | Multi-tasking capability

What's Next: Beyond Basic Memory

The patterns in this post are the foundation, but production systems go further:

  • Memory Compression: Advanced summarization techniques to reduce storage
  • Memory Indexing: Fast retrieval from massive episodic databases
  • Memory Forgetting: Intelligent pruning to prevent information overload
  • Cross-Agent Memory: Shared memory pools for multi-agent systems

The architecture is the same. The scale changes. The principles endure.

The Result: You've built a system that doesn't just answer questions—it remembers, learns, and adapts. It's not a chatbot. It's a persistent AI collaborator that gets smarter over time.

This is what memory-enabled agents look like in production.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
