Designing Memory Systems for AI Agentic Applications
In our previous posts, we built agents that can "think" (reasoning) and "act" (tools). But they still have a fatal flaw: Amnesia.
If you tell your agent, "My name is Param, and I prefer Python over JavaScript," and then chat for 20 minutes about code architecture, it will eventually ask, "What's your name again?" If you teach it a new coding pattern today, it will forget it by tomorrow.
Memory is essential. Without it, agents act blindly, unable to learn from mistakes, adapt to user preferences, or maintain context across sessions.
This post is for engineers who want to move beyond simple "context windows" and build agents with Persistent, Adaptive Memory. We will explore four architectural levels of memory, from basic conversation buffers to sophisticated episodic learning systems.
The Failure Case: The "Goldfish" LLM
Before we dive into solutions, let's see what happens when you naively build an agent without proper memory architecture.
graph TD
subgraph Naive["Naive Approach - No Memory"]
T1[Turn 1: User says name] --> T2[Turn 2: Agent remembers]
T2 --> T3[Turn 10: Still remembers]
T3 --> T4[Turn 50: Context full]
T4 --> T5[Turn 51: Forgets everything]
T5 --> T6[Agent asks name again]
style T5 fill:#ffebee,stroke:#b71c1c
style T6 fill:#ffebee,stroke:#b71c1c
end
subgraph Cost["Cost Explosion"]
T1 --> C1[Prompt: 100 tokens]
T2 --> C2[Prompt: 200 tokens]
T3 --> C3[Prompt: 1000 tokens]
T4 --> C4[Prompt: 8000 tokens]
T4 --> C5[Cost: 10x higher]
style C5 fill:#ffebee,stroke:#b71c1c
end
The Three Problems:
- Context Window Overflow: As conversations grow, you hit the token limit. Older messages get dropped, and the agent forgets critical information.
- Cost Explosion: Every message includes the entire history. By turn 50, each prompt carries all 50 previous turns, so that single call costs roughly 50x more than the first one did.
- No Learning: The agent makes the same mistakes repeatedly. It doesn't remember that calling a broken API failed yesterday, so it tries again today.
Observation: LLMs are stateless functions: output = f(input). They don't "remember" previous calls. To fake memory, we stuff the previous conversation into the prompt every time. This works for short conversations but breaks at scale.
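To make that concrete, here is a minimal sketch of the naive approach. The call_llm parameter is a placeholder for whatever chat-completion client you use; the point is that the model only ever sees what we resend.
# Naive "memory": replay the full history on every single call.
history = []  # grows without bound

def chat(user_message, call_llm):
    # call_llm is a placeholder for your chat-completion client.
    history.append({"role": "user", "content": user_message})
    # The model sees only what we send: the entire history, every time.
    reply = call_llm(messages=history)
    history.append({"role": "assistant", "content": reply})
    return reply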
Pattern 1: Short-Term Memory (The Conversation Buffer)
The Architectural Problem:
You need to keep track of the current conversation—what was just said, what the user is responding to, the immediate context. But you can't keep everything forever, or you'll hit context limits and cost ceilings.
The Solution:
We implement a Sliding Window Buffer with intelligent summarization.
The Architecture
graph TD
subgraph Buffer["Conversation Buffer Management"]
New[New Message] --> Window[Sliding Window]
Window --> Recent[Keep Last N Messages]
Recent --> Check{Window Full?}
Check -->|No| Add[Add to Buffer]
Check -->|Yes| Summarize[Summarize Old Messages]
Summarize --> Summary[System Summary]
Summary --> Add
Add --> Final[Final Buffer]
end
subgraph Context["Context Assembly"]
Final --> System[System Prompt]
Summary --> System
System --> LLM[LLM Call]
end
style Window fill:#fff9c4,stroke:#fbc02d
style Summary fill:#e3f2fd,stroke:#0d47a1
How it Works:
- Sliding Window: Keep the last K messages (e.g., the last 20 turns) in full detail.
- Summarization: When messages fall out of the window, compress them into a "System Summary" that retains key facts without the verbatim text.
- Context Assembly: Build the prompt with: System Summary (compressed history) + Recent Messages (full detail) + Current Query.
Concrete Example: The Buffer in Action
Scenario: User is debugging a Python script with the agent over 30 turns.
Turn 1-10:
User: "I'm getting an error in my script"
Agent: "What's the error message?"
User: "AttributeError: 'NoneType' object has no attribute 'split'"
Agent: "This means you're calling .split() on None. Where in your code?"
...
[10 turns of debugging]
Turn 11-20:
[More debugging turns]
Buffer: [Turns 1-20 in full detail]
Turn 21:
Buffer is full (20 turns max)
Action: Summarize turns 1-10
Summary: "User debugging Python AttributeError. Issue is in line 45 where
variable 'text' can be None. We've tried adding None checks."
Buffer: [Summary of 1-10] + [Turns 11-21 in full detail]
Turn 30:
Buffer is full again
Action: Summarize turns 11-20
Summary: "Continued debugging. Added try-except block. Error persists.
Suspecting issue with data loading function."
Buffer: [Summary of 1-10] + [Summary of 11-20] + [Turns 21-30 in full detail]
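A minimal sketch of this buffer logic, assuming a summarize(messages) helper that compresses old turns with an LLM call; the helper and the window sizes are illustrative, not prescriptive:
MAX_TURNS = 20        # keep the last 20 turns verbatim
SUMMARIZE_BATCH = 10  # compress the oldest 10 turns at a time

class ConversationBuffer:
    def __init__(self, summarize):
        self.summarize = summarize  # assumed LLM-backed summarizer: list[dict] -> str
        self.summaries = []         # compressed history, oldest first
        self.messages = []          # recent turns in full detail

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > MAX_TURNS:
            # Fold the oldest turns into a summary and drop them from the window.
            oldest = self.messages[:SUMMARIZE_BATCH]
            self.summaries.append(self.summarize(oldest))
            self.messages = self.messages[SUMMARIZE_BATCH:]

    def build_context(self, system_prompt):
        # System Summary (compressed history) + Recent Messages (full detail)
        summary_block = "\n".join(self.summaries)
        system = f"{system_prompt}\n\nSUMMARY OF EARLIER CONVERSATION:\n{summary_block}"
        return [{"role": "system", "content": system}] + self.messages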
The Impact:
| Approach | Context Size | Cost per Turn | Information Loss |
|---|---|---|---|
| Naive (keep all) | Grows linearly | Grows linearly | None (until overflow) |
| Sliding Window | Fixed size | Fixed cost | Minimal (summarized) |
Observation: The sliding window maintains immediate context (the last 20 turns) while preserving the gist of older conversation through summarization. This allows the agent to make coherent, context-aware decisions without carrying the baggage of the entire history.
Think About It: How do you decide what to summarize? Some systems summarize after a fixed number of turns (every 10 turns). Others use token-based thresholds (summarize when the buffer exceeds 4,000 tokens). Which is better? Token-based is more precise, but it requires counting tokens on every update.
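For the token-based variant, the check is short once you have a tokenizer; count_tokens here is a stand-in for whatever tokenizer matches your model:
TOKEN_BUDGET = 4000  # summarize once the buffer exceeds this

def needs_summarization(messages, count_tokens):
    # count_tokens is a placeholder for your model's tokenizer.
    total = sum(count_tokens(m["content"]) for m in messages)
    return total > TOKEN_BUDGET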
Pattern 2: Semantic Memory (The Knowledge Library)
The Architectural Problem:
What if the user asks, "What are the safety rules for this machine?" or "What's our company's coding style guide?" This is semantic knowledge—facts, concepts, and rules that persist across sessions.
The sliding window eventually discards this information. We need External Storage that can be retrieved on demand.
The Solution:
We build a Vector Database for semantic retrieval.
The Architecture
graph TD
subgraph Ingestion["Knowledge Ingestion"]
Docs[Documents] --> Parse[Parse & Chunk]
Parse --> Embed[Generate Embeddings]
Embed --> VectorDB[(Vector Database)]
end
subgraph Retrieval["Query-Time Retrieval"]
Query[User Question] --> EmbedQuery[Embed Question]
EmbedQuery --> Search[Similarity Search]
Search --> VectorDB
VectorDB --> Relevant[Top K Relevant Chunks]
Relevant --> Inject[Inject into Context]
Inject --> LLM[LLM with Knowledge]
end
style VectorDB fill:#e8f5e9,stroke:#388e3c
style Inject fill:#e3f2fd,stroke:#0d47a1
How it Works:
- Knowledge Ingestion: Load documents (style guides, safety rules, domain knowledge) into a vector database. Each document is chunked and embedded.
- Retrieval Mechanism: When a query comes in, embed it and perform a similarity search. Retrieve only the top K most relevant chunks.
- Context Injection: Inject the retrieved knowledge into the system prompt, giving the agent access to relevant information without storing it in the conversation buffer. (A minimal sketch of this flow follows.)
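Here is that sketch: an in-memory store with a placeholder embed function standing in for a real embedding model and vector database, and fixed-size chunking purely for brevity:
import numpy as np

class SemanticMemory:
    def __init__(self, embed):
        self.embed = embed            # placeholder: text -> np.ndarray
        self.chunks, self.vectors = [], []

    def ingest(self, document, chunk_size=500):
        # Naive fixed-size chunking; production systems chunk on document structure.
        for i in range(0, len(document), chunk_size):
            chunk = document[i:i + chunk_size]
            self.chunks.append(chunk)
            self.vectors.append(self.embed(chunk))

    def search(self, query, k=3):
        # Cosine similarity against every stored chunk, return the top K.
        q = self.embed(query)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        return [self.chunks[i] for i in top]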
Concrete Example: The Coding Style Guide
Scenario: User asks "What's our naming convention for functions?"
Step 1: Query Processing
query = "What's our naming convention for functions?"
query_embedding = embedding_model.embed(query)
Step 2: Semantic Search
retrieved_chunks = vector_db.similarity_search(query_embedding, k=3)
# Retrieved chunks:
# Chunk 1: "Function names should be snake_case. Use verbs: get_user(),
# calculate_total(), validate_input()."
# Chunk 2: "Private functions start with underscore: _internal_helper()."
# Chunk 3: "Async functions end with _async: fetch_data_async()."
Step 3: Context Assembly
system_prompt = f"""
You are a coding assistant.
RELEVANT KNOWLEDGE:
{retrieved_chunks}
Answer the user's question using this knowledge.
"""
Step 4: Agent Response
"According to our style guide, function names should be snake_case using verbs.
For example: get_user(), calculate_total(). Private functions start with an
underscore, and async functions end with _async."
The Knowledge Types:
| Knowledge Type | Example | Storage | Retrieval |
|---|---|---|---|
| Style Guides | "Use snake_case for functions" | Vector DB | Semantic search |
| Domain Rules | "Never call this API without auth" | Vector DB | Semantic search |
| User Preferences | "User prefers Python over JS" | Separate DB | Direct lookup |
| Project Context | "This project uses FastAPI" | Vector DB | Semantic search |
Observation: Semantic memory transforms the agent from a generic assistant into a domain expert. By retrieving relevant knowledge on-demand, the agent can answer questions about facts, rules, and concepts that were never in the conversation history.
Think About It: How do you handle conflicting knowledge? If the style guide says "use snake_case" but the user just wrote a function in camelCase, does the agent correct them or adapt? Most systems prioritize the user's recent actions over stored knowledge but flag the inconsistency.
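One way to encode that priority order, with the inconsistency surfaced rather than silently resolved; this is a sketch of the policy described above, not a standard API:
def resolve_style_conflict(stored_rule, observed_user_style):
    # Prefer the user's recent behavior, but flag the mismatch with stored knowledge.
    if observed_user_style and observed_user_style != stored_rule:
        return {
            "style_to_use": observed_user_style,
            "note_to_user": (
                f"Heads up: the style guide says '{stored_rule}', but you've "
                f"been using '{observed_user_style}'. Following your lead."
            ),
        }
    return {"style_to_use": stored_rule, "note_to_user": None}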
Pattern 3: Episodic Memory (Learning from Experience)
The Architectural Problem:
The agent makes a mistake: it calls a broken API, the user corrects it, and the agent adjusts for the rest of the session. But tomorrow, it makes the same mistake again. Why? Because it doesn't remember the episode.
Episodic Memory stores specific events—what happened, what action was taken, and what the result was. This is how agents learn from experience.
The Solution:
We implement a Reflection Loop that stores and retrieves past episodes.
The Architecture
graph TD
subgraph Episode["Episode Capture"]
Action[Agent Takes Action] --> Result[Result Observed]
Result --> Feedback[User Feedback]
Feedback --> Store[Store Episode]
Store --> EpisodeDB[(Episode Database)]
end
subgraph Retrieval["Episode Retrieval"]
NewTask[New Task] --> Search[Search Similar Episodes]
Search --> EpisodeDB
EpisodeDB --> Similar[Similar Past Episodes]
Similar --> Analyze{Analyze Outcomes}
Analyze -->|Success| Reuse[Reuse Successful Strategy]
Analyze -->|Failure| Avoid[Avoid Failed Strategy]
Reuse --> Execute[Execute Action]
Avoid --> Execute
end
Execute --> Store
style EpisodeDB fill:#e8f5e9,stroke:#388e3c
style Analyze fill:#fff9c4,stroke:#fbc02d
How it Works:
- Episode Capture: After each action, store:
  - State: What was the situation?
  - Action: What did the agent do?
  - Outcome: What happened?
  - Feedback: User reaction (positive/negative)
- Episode Retrieval: Before taking a new action, search for similar past episodes:
  - "Have I done this before?"
  - "Did it work?"
  - "What went wrong?"
- Strategy Adjustment: Use past episodes to inform current decisions:
  - If a similar action succeeded → reuse the strategy
  - If a similar action failed → try a different approach
Concrete Example: Learning from API Failures
Scenario: Agent tries to call a weather API that's been down for maintenance.
Episode 1: The Failure
# Episode stored:
{
"state": "User asked for weather in NYC",
"action": "Called get_weather_api('NYC')",
"outcome": "API returned 503 Service Unavailable",
"feedback": "User said 'That didn't work'",
"timestamp": "2025-11-20 10:00:00"
}
Episode 2: The Learning
# Next day, user asks for weather again
# Agent searches episodic memory:
similar_episodes = search_episodes("weather API call")
# Finds Episode 1
# Decision: "Last time I called this API, it failed. Let me try a different approach."
# Action: Use backup weather service instead
Episode 3: The Success
# Episode stored:
{
"state": "User asked for weather in NYC",
"action": "Called backup_weather_service('NYC')",
"outcome": "Successfully retrieved weather data",
"feedback": "User said 'Thanks!'",
"timestamp": "2025-11-21 14:00:00"
}
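Tying these episodes together, a minimal sketch of an episode store; the embed function is an assumed text-embedding helper, and similarity search is brute-force cosine for clarity:
import numpy as np
from datetime import datetime, timezone

class EpisodeStore:
    def __init__(self, embed):
        self.embed = embed    # assumed: text -> np.ndarray
        self.episodes = []

    def record(self, state, action, outcome, feedback):
        self.episodes.append({
            "state": state, "action": action,
            "outcome": outcome, "feedback": feedback,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "vector": self.embed(f"{state} {action}"),
        })

    def similar(self, task_description, k=3):
        # Return the k past episodes most similar to the current task.
        q = self.embed(task_description)
        def score(ep):
            v = ep["vector"]
            return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        return sorted(self.episodes, key=score, reverse=True)[:k]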
The Reflection Loop:
sequenceDiagram
participant User
participant Agent
participant Memory
User->>Agent: "Get weather for NYC"
Agent->>Memory: Search similar episodes
Memory-->>Agent: Found: API failed yesterday
Note over Agent: Adjust strategy - use backup
Agent->>Agent: Call backup service
Agent->>User: Weather data
User->>Agent: "Thanks!"
Agent->>Memory: Store successful episode
Note over Memory: Learn: backup service works
Observation: Episodic memory transforms the agent from a reactive system into a learning system. It avoids repeating mistakes and doubles down on successful strategies. This is the foundation of continuous improvement in AI agents.
Think About It: How do you determine if two episodes are "similar"? Do you use semantic similarity on the state description? Or exact matching on the action? Most systems use a hybrid: exact match on action type, semantic similarity on state. This catches "same action, similar situation" without being too strict.
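A sketch of that hybrid check, assuming each episode also records an explicit action_type field (an addition to the episode schema shown earlier) and that embed is the same assumed embedding helper:
import numpy as np

def is_similar_episode(new_state, new_action_type, episode, embed, threshold=0.8):
    # Exact match on the action type (e.g. "call_weather_api")...
    if episode.get("action_type") != new_action_type:
        return False
    # ...then semantic similarity on the state description.
    a, b = embed(new_state), embed(episode["state"])
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold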
Pattern 4: Planning Memory (Multi-Tasking & Long-Term Goals)
The Architectural Problem:
An agent is building an app and answering emails at the same time. It needs to keep those memories separate. It also needs to remember future steps—what it planned to do, what it's already done, and what's next.
The Solution:
We implement Threading for task isolation and Scratchpads for planning.
The Architecture
graph TD
subgraph Threading["Task Threading"]
Task1[Task: Build App] --> Thread1[Thread ID: app_001]
Task2[Task: Answer Emails] --> Thread2[Thread ID: email_002]
Thread1 --> Memory1[(Thread 1 Memory)]
Thread2 --> Memory2[(Thread 2 Memory)]
Memory1 -.Isolated.-> Memory2
end
subgraph Planning["Planning Scratchpad"]
Goal[Long-Term Goal] --> Plan[Break into Steps]
Plan --> Progress[Track Progress]
Progress --> Obstacles[Log Obstacles]
Obstacles --> Adjust[Adjust Plan]
Plan --> Scratchpad[(Planning Memory)]
end
style Threading fill:#e3f2fd,stroke:#0d47a1
style Planning fill:#fff9c4,stroke:#fbc02d
How it Works:
- Threading: Assign a unique thread_id to every workflow. Store separate memory states for each thread. This prevents context bleeding between tasks.
- Planning Scratchpad: Maintain a dedicated memory block for plans:
  - Subgoals: "1. Research, 2. Draft, 3. Review"
  - Progress: "Step 1 complete, working on Step 2"
  - Obstacles: "API timed out, need to retry"
- Context Switching: When switching tasks, save the current thread state and load the target thread state.
Concrete Example: Multi-Tasking Agent
Scenario: Agent is working on two projects simultaneously.
Thread 1: Building a Dashboard
thread_1_state = {
"thread_id": "dashboard_001",
"goal": "Build analytics dashboard",
"plan": [
"1. Design schema (complete)",
"2. Build API endpoints (in progress)",
"3. Create frontend (pending)"
],
"context": "Working on /api/metrics endpoint. Need to aggregate user data.",
"obstacles": ["Database query is slow, need to optimize"]
}
Thread 2: Answering Support Emails
thread_2_state = {
"thread_id": "support_002",
"goal": "Respond to customer inquiries",
"plan": [
"1. Categorize emails (complete)",
"2. Draft responses (in progress)",
"3. Send replies (pending)"
],
"context": "Drafting response to Bob about login issue. Need to check if it's a known bug.",
"obstacles": []
}
Context Switch:
sequenceDiagram
participant User
participant Agent
participant Memory
User->>Agent: "Switch to email task"
Agent->>Memory: Save Thread 1 state
Agent->>Memory: Load Thread 2 state
Memory-->>Agent: "You were drafting response to Bob"
Agent->>User: "Back to Bob's email. Ready to continue?"
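A minimal sketch of that save/load switch; the in-memory dict stands in for whatever per-thread persistence you actually use:
class ThreadManager:
    def __init__(self):
        self.threads = {}       # thread_id -> saved state (stand-in for persistent storage)
        self.active_id = None
        self.active_state = None

    def switch_to(self, thread_id):
        # Save the current thread's state before loading the target thread.
        if self.active_id is not None:
            self.threads[self.active_id] = self.active_state
        self.active_state = self.threads.get(thread_id, {"thread_id": thread_id})
        self.active_id = thread_id
        return self.active_state

# Usage:
# manager = ThreadManager()
# manager.switch_to("dashboard_001")   # work on the dashboard
# manager.switch_to("support_002")     # dashboard state saved, email state loaded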
The Planning Scratchpad Structure:
planning_memory = {
"goal": "Build user authentication system",
"subgoals": [
{"id": 1, "task": "Design database schema", "status": "complete"},
{"id": 2, "task": "Implement login endpoint", "status": "in_progress"},
{"id": 3, "task": "Add password reset", "status": "pending"},
{"id": 4, "task": "Write tests", "status": "pending"}
],
"current_focus": "Implementing JWT token generation",
"blockers": ["Need to decide on token expiration time"],
"notes": "User requested 7-day expiration, but security best practice is 1 hour"
}
Observation: Planning memory transforms the agent from a reactive chatbot into a proactive project manager. It can handle long-term strategy, track progress across multiple tasks, and maintain context when switching between work streams.
Think About It: How do you handle plan changes? If the user says "Actually, skip step 3 and go straight to step 4," how does the agent update the planning memory? Most systems support plan mutations: mark steps as skipped, add new steps, reorder existing steps.
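A sketch of one such mutation against the planning_memory structure above; the helper name and the audit note are illustrative:
def skip_step(planning_memory, step_id, reason):
    # Mark a subgoal as skipped without losing the overall progress tracking.
    for step in planning_memory["subgoals"]:
        if step["id"] == step_id:
            step["status"] = "skipped"
            planning_memory["notes"] += f"\nStep {step_id} skipped: {reason}"
            return planning_memory
    raise ValueError(f"No subgoal with id {step_id}")

# skip_step(planning_memory, 3, "User wants tests before password reset")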
Putting It All Together: The Complete Memory Stack
Let's see how all four memory types work together in a real scenario.
Scenario: User is building a web app with the agent over multiple sessions.
graph TD
subgraph Session1["Session 1: Planning"]
S1[User: Build a todo app] --> P1[Planning Memory: Create plan]
P1 --> S2[Agent: Break into steps]
S2 --> SM1[Semantic Memory: Load best practices]
SM1 --> P2[Planning Memory: Store plan]
end
subgraph Session2["Session 2: Development"]
S3[User: Start coding] --> ST1[Short-Term: Current conversation]
ST1 --> P3[Planning Memory: Load plan]
P3 --> E1[Episodic Memory: Check past mistakes]
E1 --> S4[Agent: Avoid known pitfalls]
S4 --> ST2[Short-Term: Track progress]
end
subgraph Session3["Session 3: Debugging"]
S5[User: App is broken] --> ST3[Short-Term: Debug session]
ST3 --> E2[Episodic Memory: Search similar bugs]
E2 --> SM2[Semantic Memory: Load debugging guides]
SM2 --> S6[Agent: Apply learned solutions]
S6 --> E3[Episodic Memory: Store new solution]
end
style P2 fill:#fff9c4,stroke:#fbc02d
style E3 fill:#e8f5e9,stroke:#388e3c
style SM2 fill:#e3f2fd,stroke:#0d47a1
The Memory Interaction:
| Memory Type | What It Provides | Example |
|---|---|---|
| Short-Term | Immediate context | "We're debugging the login function" |
| Semantic | Domain knowledge | "JWT tokens should expire in 1 hour" |
| Episodic | Past experiences | "Last time we had this error, it was a CORS issue" |
| Planning | Future goals | "Step 2 of 5: Building API endpoints" |
Observation: The four memory types complement each other. Short-term provides immediate context, semantic provides domain knowledge, episodic provides learned strategies, and planning provides long-term direction. Together, they create a comprehensive memory system.
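As a sketch of how the four layers might meet in a single prompt, the buffer, semantic store, episode store, and planning memory below are the illustrative objects from the earlier sections; in a real system they would be whatever implementations you chose:
def assemble_prompt(buffer, semantic_memory, episode_store, planning_memory, user_query):
    # Semantic: domain knowledge relevant to the current question.
    knowledge = "\n".join(semantic_memory.search(user_query, k=3))
    # Episodic: lessons from similar past situations.
    lessons = "\n".join(
        f"- {ep['action']} -> {ep['outcome']}"
        for ep in episode_store.similar(user_query, k=2)
    )
    # Planning: where we are in the long-term plan.
    blockers = ", ".join(planning_memory["blockers"]) or "none"
    system = (
        "You are a coding assistant.\n"
        f"RELEVANT KNOWLEDGE:\n{knowledge}\n\n"
        f"PAST EXPERIENCE:\n{lessons}\n\n"
        f"CURRENT FOCUS: {planning_memory['current_focus']} (blockers: {blockers})"
    )
    # Short-term: the recent conversation, with older turns summarized.
    return buffer.build_context(system) + [{"role": "user", "content": user_query}]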
Challenge: Design Decisions for Your Memory System
Challenge 1: The Memory Conflict Problem
Your semantic memory says "Use snake_case for functions," but the user just wrote a function in camelCase. Your episodic memory says "User prefers camelCase based on past code reviews."
Questions:
- Which memory takes precedence?
- Do you correct the user or adapt to their preference?
- How do you resolve conflicts between memory types?
Challenge 2: The Episodic Memory Explosion
After 1000 interactions, you have 1000 episodes. Searching through all of them is slow.
Options:
- Time-Based Pruning: Delete episodes older than 30 days
- Relevance-Based Pruning: Keep only episodes with high success/failure signals
- Clustering: Group similar episodes, keep only representative examples
Your Task: Which pruning strategy balances performance with learning? How do you ensure you don't delete critical lessons?
Challenge 3: The Planning Drift Problem
Your agent created a plan 2 weeks ago. The project requirements have changed, but the plan hasn't been updated.
Questions:
- How do you detect when a plan is stale?
- Should the agent proactively suggest plan updates?
- How do you handle plan changes without losing progress tracking?
System Comparison: Memory Architectures
| Dimension | No Memory | Short-Term Only | Full Stack |
|---|---|---|---|
| Context Retention | None | Last N turns | Persistent across sessions |
| Learning | None | None | Learns from experience |
| Domain Knowledge | None | None | Retrieves on demand |
| Multi-Tasking | Impossible | Confused | Isolated threads |
| Cost | Low | Medium | Higher (but justified) |
| Complexity | Low | Low | High |
graph TD
subgraph None["No Memory"]
N1[Each turn independent] --> N2[No learning]
N2 --> N3[Repeats mistakes]
style N3 fill:#ffebee,stroke:#b71c1c
end
subgraph Full["Full Memory Stack"]
F1[Short-term context] --> F2[Semantic knowledge]
F2 --> F3[Episodic learning]
F3 --> F4[Planning capability]
F4 --> F5[Adaptive agent]
style F5 fill:#e8f5e9,stroke:#388e3c
end
Key Architectural Patterns Summary
| Pattern | Problem Solved | Key Benefit | Complexity |
|---|---|---|---|
| Short-Term Buffer | Context window overflow | Maintains immediate context | Low |
| Semantic Memory | Domain knowledge access | On-demand knowledge retrieval | Medium |
| Episodic Memory | Learning from experience | Avoids repeating mistakes | High |
| Planning Memory | Long-term goal tracking | Multi-tasking and strategy | Medium |
Discussion Points for Engineers
1. The Memory Update Strategy
When does memory get updated? After every interaction? On a schedule? On explicit user feedback?
Questions:
- Do you update episodic memory after every action, or only on failures/successes?
- How do you handle conflicting feedback? (User says "good" but the action actually failed)
- Should memory updates be synchronous (blocking) or asynchronous (background)?
2. The Privacy vs. Learning Trade-Off
Episodic memory stores user interactions. This is sensitive data.
Questions:
- Do you store full conversations or just summaries?
- How do you handle GDPR "right to be forgotten" requests?
- Can users opt out of episodic memory while keeping other memory types?
3. The Memory Versioning Problem
Your semantic memory has a style guide. The style guide gets updated. How do you handle versioning?
Questions:
- Do you re-embed all documents when they change?
- How do you handle queries that might match old vs. new versions?
- Should the agent know which version of knowledge it's using?
Takeaways
The Four Layers of Memory
graph TD
A[Complete Memory System] --> B[Layer 1: Short-Term]
A --> C[Layer 2: Semantic]
A --> D[Layer 3: Episodic]
A --> E[Layer 4: Planning]
B --> F[Immediate Context]
C --> G[Domain Knowledge]
D --> H[Learned Experience]
E --> I[Future Strategy]
style A fill:#e3f2fd,stroke:#0d47a1
style F fill:#e8f5e9,stroke:#388e3c
style G fill:#e8f5e9,stroke:#388e3c
style H fill:#e8f5e9,stroke:#388e3c
style I fill:#e8f5e9,stroke:#388e3c
Key Insights
- Memory is not optional — Agents without memory are goldfish. They can't learn, adapt, or maintain context. Memory transforms agents from stateless functions into persistent, learning systems.
- Different memories for different needs — Short-term for immediate context, semantic for domain knowledge, episodic for learning, planning for strategy. Each serves a distinct purpose.
- Episodic memory enables learning — The ability to remember past experiences and avoid repeating mistakes is what separates intelligent agents from simple chatbots.
- Planning memory enables agency — Agents that can track long-term goals and switch between tasks feel more like collaborators than tools.
- Memory architecture is a trade-off — More memory means more complexity, higher costs, and privacy concerns. But for production agents, it's non-negotiable.
The Implementation Roadmap
| Phase | Focus | Why |
|---|---|---|
| Phase 1 | Short-term buffer | Basic context management |
| Phase 2 | Semantic memory | Domain knowledge access |
| Phase 3 | Episodic memory | Learning from experience |
| Phase 4 | Planning memory | Multi-tasking capability |
What's Next: Beyond Basic Memory
The patterns in this post are the foundation, but production systems go further:
- Memory Compression: Advanced summarization techniques to reduce storage
- Memory Indexing: Fast retrieval from massive episodic databases
- Memory Forgetting: Intelligent pruning to prevent information overload
- Cross-Agent Memory: Shared memory pools for multi-agent systems
The architecture is the same. The scale changes. The principles endure.
The Result: You've built a system that doesn't just answer questions—it remembers, learns, and adapts. It's not a chatbot. It's a persistent AI collaborator that gets smarter over time.
This is what memory-enabled agents look like in production.
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.