Voice Conversation Memory: Why Your Bot Forgets Who You Are
In a text chatbot, "memory" is easy. If the user scrolls up, they see the history. If the bot forgets something, the user can just re-read the previous messages.
In Voice AI, the rules change completely.
- Context is Invisible: The user cannot "scroll up." If the bot forgets that my name is Bob, the illusion of intelligence shatters immediately.
- Context is Latency: Every token you send to the LLM adds milliseconds to the response time. Sending a 10-minute transcript (approx. 1,500 words) to GPT-4o doesn't just cost money; it adds a 1-2 second processing delay.
In voice, Latency is the enemy.
This post explores how to manage conversation memory so your bot stays smart enough to remember you, but light enough to respond instantly.
The problem: The "Context Bloat" curve
Imagine a 10-minute customer support call.
- Minute 1: History is short. Latency is 300ms. Snappy.
- Minute 5: History is 2,000 tokens. Latency creeps to 800ms.
- Minute 10: History is 4,000 tokens. Latency spikes to 1.5s. The user starts interrupting the bot because it feels slow.
We cannot simply append every User and Assistant message to the list forever. We need a strategy to prune the history while keeping the meaning.
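The curve is roughly linear: total response time is a fixed pipeline overhead (STT, TTS, network) plus a per-token prompt-processing cost. Here is a minimal back-of-envelope sketch of that model; both constants are illustrative assumptions you would measure for your own stack:

```python
# Back-of-envelope model of the "Context Bloat" curve.
# Both constants are illustrative assumptions, not measurements.
FIXED_OVERHEAD_MS = 250      # STT + TTS + network; independent of history
PREFILL_MS_PER_TOKEN = 0.3   # prompt-processing cost per token

def estimated_latency_ms(history_tokens: int) -> float:
    return FIXED_OVERHEAD_MS + history_tokens * PREFILL_MS_PER_TOKEN

for minute, tokens in [(1, 300), (5, 2_000), (10, 4_000)]:
    print(f"minute {minute:>2}: ~{tokens:>5,} tokens -> "
          f"~{estimated_latency_ms(tokens):,.0f} ms")
# minute  1: ~  300 tokens -> ~340 ms
# minute  5: ~2,000 tokens -> ~850 ms
# minute 10: ~4,000 tokens -> ~1,450 ms
```

The fixed overhead never goes away, so the only lever you control is the token count.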
Strategy 1: The sliding window (Short-term memory)
For a fluid voice conversation, the bot usually only needs the last 3-4 turns to understand immediate context (e.g., "Yes, that works" or "No, the other one").
We implement a Sliding Window manager that keeps the System Prompt fixed (the "Personality") but strictly trims the middle of the conversation.
graph LR
subgraph RAW["Raw Conversation History"]
A[System] --> B[Turn 1]
B --> C[Turn 2]
C --> D[Turn 3]
D --> E[Turn 4]
E --> F[Turn 5]
end
subgraph WINDOW["Sliding Window: Context sent to LLM"]
A2[System] --> D2[Turn 3]
D2 --> E2[Turn 4]
E2 --> F2[Turn 5]
end
style B fill:#ffebee,stroke:#b71c1c
style C fill:#ffebee,stroke:#b71c1c
The Implementation:
In LiveKit Agents, the context is often managed automatically, but in production you want explicit control.
# A simple manual pruner. `chat_ctx.messages` is assumed to be a plain
# list with the system prompt at index 0; adjust to your SDK's API.
def prune_context(chat_ctx):
    # Always keep the System Prompt (index 0) -- the "Personality"
    system_prompt = chat_ctx.messages[0]
    # Get the rest of the history
    history = chat_ctx.messages[1:]
    # Keep only the last 6 messages (3 user/assistant turns)
    if len(history) > 6:
        history = history[-6:]
    return [system_prompt] + history
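A quick sanity check of the pruner, using SimpleNamespace as a stand-in for the SDK's chat context object (the message shape here is an assumption for illustration):

```python
from types import SimpleNamespace

# Simulate 1 system prompt + 10 user/assistant turn pairs
messages = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    messages.append({"role": "user", "content": f"user turn {i}"})
    messages.append({"role": "assistant", "content": f"assistant turn {i}"})

pruned = prune_context(SimpleNamespace(messages=messages))
print(len(messages), "->", len(pruned))  # 21 -> 7 (system + last 6)
print(pruned[1]["content"])              # "user turn 7"
```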
Pros: Zero latency overhead. Extremely cheap.
Cons: The "Goldfish Effect." If I said "My name is Bob" at minute 1, and the window slides past it, the bot forgets my name at minute 3.
Strategy 2: The "Sidecar" summarizer (Long-term persistence)
To solve the Goldfish Effect without bloating the main context, we use a Background Process.
While the main agent is chatting, a second, smaller LLM (the "Sidecar") runs in the background. It watches the conversation and updates a "Summary" section in the System Prompt.
graph TD
A[Voice Conversation Stream] --> B(Main Agent Loop)
A --> C(Background Sidecar Worker)
C --> D[Extract Facts: User is Bob, Wants Pizza]
D --> E[Update System Prompt]
E --> B
style C fill:#fff9c4,stroke:#fbc02d
style E fill:#e3f2fd,stroke:#0d47a1
The Implementation:
We use an async task so we don't block the audio stream.
async def background_summarizer(full_history, agent):
    """
    Runs periodically to compress history into facts.
    """
    # Use a cheap, fast model (like gpt-4o-mini) for summarization.
    # `cheap_llm.generate` is a placeholder for whatever client you use.
    summary = await cheap_llm.generate(
        f"Extract key facts from this conversation history: {full_history}"
    )
    # Inject these facts into the 'hidden' context of the main agent
    new_system_prompt = f"""
You are a helpful assistant.
CORE MEMORY (DO NOT FORGET):
{summary}
"""
    # Update the running agent's prompt live.
    # `update_system_prompt` stands in for your framework's equivalent.
    agent.update_system_prompt(new_system_prompt)
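Scheduling is what keeps the audio path unblocked. Here is a minimal sketch of the periodic loop with asyncio; the 10-second interval and the `get_history` callable are assumptions to be wired into whatever your framework exposes:

```python
import asyncio
import logging

SUMMARIZE_INTERVAL_S = 10  # assumption: tune for your call length

async def run_sidecar(agent, get_history, stop_event: asyncio.Event):
    """Periodically re-summarize without ever blocking the audio loop."""
    while not stop_event.is_set():
        await asyncio.sleep(SUMMARIZE_INTERVAL_S)
        try:
            await background_summarizer(get_history(), agent)
        except Exception:
            # A failed summary must never take down the live call;
            # keep the previous CORE MEMORY and try again next tick.
            logging.exception("sidecar summarizer failed")

# At session start, run it alongside the main agent loop:
# sidecar_task = asyncio.create_task(run_sidecar(agent, get_history, stop))
```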
Pros: Retains long-term context (names, preferences) without growing token count.
Cons: There is a delay. The summary might update 10 seconds after the user says the fact.
Strategy 3: Structured state extraction (The "Pro" move)
Summaries are fuzzy. "User wants pizza" is text.
For robust applications (like ordering food), we don't want text summaries; we want Structured Data.
Instead of summarizing, we give the agent a tool like `update_order` or `save_profile`. The agent "offloads" memory to a structured object.
- User: "I want a pepperoni pizza."
- Agent (Thought): User provided data. I will call `update_order(item="pepperoni pizza")`.
- System: Updates `order_state = {"items": ["pepperoni pizza"]}`.
- System: Injects `Current Order: 1x Pepperoni Pizza` into the System Prompt.
This keeps the prompt tiny but the memory perfect.
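Here is a minimal sketch of the pattern, assuming a Pydantic model for the order and a plain function exposed to the LLM as a tool (how you register tools depends on your agent framework):

```python
from pydantic import BaseModel, Field

class OrderState(BaseModel):
    items: list[str] = Field(default_factory=list)

order_state = OrderState()

def update_order(item: str) -> str:
    """Tool: add an item to the customer's order."""
    order_state.items.append(item)
    return f"Added {item}. Order is now: {order_state.items}"

def render_system_prompt() -> str:
    """Re-inject the structured state as a tiny, always-current block."""
    order = ", ".join(order_state.items) or "empty"
    return (
        "You are a pizza-ordering assistant.\n"
        f"CURRENT ORDER: {order}\n"
        "Call update_order(item=...) whenever the user orders something."
    )
```

Because the order lives outside the transcript, the prompt stays the same size whether the call lasts two minutes or twenty.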
Engineering trade-off matrix
| Strategy | Latency Impact | Recall Quality | Token Cost | Best For |
|---|---|---|---|---|
| Full History | High (Bad) | Perfect | High | Short demos (< 2 mins) |
| Sliding Window | Low (Good) | Low (Forgets) | Low | Casual chat / Small talk |
| Async Summary | Low (Good) | Medium (Fuzzy) | Medium | Support bots / General Q&A |
| Structured State | Low (Good) | High (Precise) | Low | Transactional Bots (Ordering, Booking) |
Challenge for you
Scenario: You are building a Medical Intake Voice Bot.
- Requirement: The call might last 20 minutes. You must capture every symptom mentioned, even if it was said at minute 1. You cannot lose data.
- Constraint: You cannot simply keep 20 minutes of text in the prompt (latency will be too high).
Your Task:
- Why would Sliding Window fail here?
- Why might Async Summarization be risky (think about "hallucinating" a symptom)?
- Design a Structured State solution. What would your Pydantic schema look like for `PatientData`? How would you prompt the agent to save symptoms as they are spoken?
Key takeaways
- Context is latency in voice: Every token adds milliseconds to response time, making memory management critical for sub-500ms latency
- Sliding windows trade recall for speed: Keeping only recent turns enables fast responses but causes the "Goldfish Effect" where early context is lost
- Async summarization preserves facts: Background processes can compress long conversations into key facts without blocking the main audio stream
- Structured state is the most reliable: Using tools to extract and store data in structured formats (Pydantic models) provides precise memory without token bloat
- Different strategies for different use cases: Casual chat needs speed (sliding window), support needs facts (summarization), transactional needs precision (structured state)
- Memory tools enable precise extraction: Giving agents tools like `update_order` or `save_profile` lets them offload memory to structured objects
- System prompts can hold compressed context: Injecting structured state summaries into system prompts keeps context small but accurate
For more on voice AI systems, see our voice AI fundamentals guide and our streaming guide.
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.