Building a Multi-Agent Voice Roundtable: Production Architecture for Group AI Conversations
We have mastered the 1:1 voice agent. You speak, it answers. Simple. Clean. But real-world problem solving rarely happens in isolation.
Imagine entering a voice room to discuss a business idea. You aren't talking to a generic assistant; you are talking to a Product Manager (who cares about users), a CFO (who cares about profitability), and an Engineer (who cares about feasibility). They debate each other. They challenge your assumptions. They ask follow-up questions.
This isn't science fiction—it's a production engineering challenge.
Building this is not as simple as spinning up three chatbots. If you do that naively, they will all talk at once, interrupt each other mid-sentence, and create cacophony instead of conversation.
We need to engineer a Conversation Orchestrator—a conductor that manages who speaks, when they speak, and how agents interact with each other.
The Failure Case: The Chaos Room
Before we dive into solutions, let's see what happens when you naively run multiple voice agents in parallel.
graph TD
User[User Speaks] --> Split[Send to All Agents]
Split --> A1[Engineer Agent]
Split --> A2[PM Agent]
Split --> A3[CFO Agent]
A1 --> R1[Response 1]
A2 --> R2[Response 2]
A3 --> R3[Response 3]
R1 --> Collision[All Talk Simultaneously]
R2 --> Collision
R3 --> Collision
Collision --> Chaos[Unintelligible Audio]
style Collision fill:#ffebee,stroke:#b71c1c
style Chaos fill:#ffebee,stroke:#b71c1c
The Three Problems:
- Audio Collision: All three agents generate responses simultaneously. The user hears overlapping voices and understands nothing.
- No Context Awareness: Agent A doesn't know Agent B just spoke. They repeat each other or contradict without acknowledging the contradiction.
- No Turn-Taking: There's no protocol for who gets to speak next. It's like a meeting where everyone talks at once—chaos.
Observation: Human conversations work because we have implicit turn-taking protocols. We wait for pauses. We signal we want to speak. We acknowledge others. Your multi-agent system needs the same social protocols, but engineered explicitly.
The Solution: The Conductor Pattern
We cannot have independent agents listening and replying. We need a central Conductor that manages the flow of conversation.
The Conductor listens to all participants, maintains conversation context, and decides who speaks next.
The Architecture
graph TD
subgraph Room["Voice Room"]
Human[Human Participant]
Audio[Shared Audio Stream]
end
subgraph Conductor["The Conductor Layer"]
STT[Speech-to-Text] --> History[Conversation History]
History --> Router{Turn Router}
Router --> Selector[Agent Selector]
Selector --> Decision[Routing Decision]
end
subgraph Agents["Agent Personas"]
Decision -->|Select| A1[Engineer Persona]
Decision -->|Select| A2[PM Persona]
Decision -->|Select| A3[CFO Persona]
A1 --> Response[Generate Response]
A2 --> Response
A3 --> Response
end
subgraph Output["Output Layer"]
Response --> TTS[Text-to-Speech]
TTS -->|Distinct Voice| Audio
end
Audio --> Human
style Router fill:#fff9c4,stroke:#fbc02d
style History fill:#e3f2fd,stroke:#0d47a1
How it Works:
- Single Entry Point: All audio goes through one transcription pipeline. There's only one "listener," not three competing ones.
- Centralized History: One conversation history shared by all agents. Everyone knows what everyone else said.
- Routing Logic: After each user utterance, the Conductor analyzes context and routes to exactly one agent.
- Voice Identity: Each agent uses a different TTS voice. Users can distinguish who's speaking without visual cues.
- Sequential Execution: Only one agent speaks at a time. No overlaps, no collisions.
Observation: The Conductor pattern transforms a chaotic multi-agent system into an orchestrated conversation. It's the difference between a shouting match and a moderated debate.
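To make this concrete, here is a minimal sketch of a conductor loop. The transcribe, route, generate, and speak helpers are hypothetical stand-ins for your STT, routing, LLM, and TTS layers, and PERSONAS refers to the persona table defined under Pattern 2 below.

# Minimal conductor loop (sketch): transcribe, route, generate, and speak
# are hypothetical stand-ins for the STT, routing, LLM, and TTS layers.
history = []  # the single shared conversation history

def conductor_turn(audio_chunk):
    # 1. Single entry point: one transcription pipeline for the whole room
    transcript = transcribe(audio_chunk)
    history.append({"role": "user", "content": transcript})

    # 2. Routing logic: pick exactly one agent for this turn
    agent_id = route(transcript, history)  # e.g. "cfo"

    # 3. Sequential execution: only the selected agent generates a response
    reply = generate(agent_id, history)
    history.append({"role": "assistant", "name": agent_id, "content": reply})

    # 4. Voice identity: synthesize with that agent's dedicated TTS voice
    speak(reply, voice=PERSONAS[agent_id]["voice_id"])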
Pattern 1: The Turn-Taking Protocol
The Architectural Problem:
How do you decide who speaks next? You can't just round-robin (Engineer → PM → CFO → repeat) because that's robotic. Real conversations are dynamic—whoever has the most relevant perspective speaks.
The Architecture
stateDiagram-v2
[*] --> WaitingForUser
WaitingForUser --> UserSpeaking: User starts
UserSpeaking --> Transcribing: Silence detected
Transcribing --> AnalyzeContext: Transcript ready
state AnalyzeContext {
[*] --> CheckTopic
CheckTopic --> CalculateRelevance: For each agent
CalculateRelevance --> RankAgents: Relevance scores
RankAgents --> SelectWinner: Pick highest score
}
SelectWinner --> AgentGenerating
state AgentGenerating {
[*] --> BuildPrompt
BuildPrompt --> CallLLM
CallLLM --> StreamResponse
}
AgentGenerating --> AgentSpeaking
AgentSpeaking --> CheckContinuation: Agent finishes
state CheckContinuation {
[*] --> ShouldAnotherAgentRespond
ShouldAnotherAgentRespond --> YesMultiTurn: Strong disagreement
ShouldAnotherAgentRespond --> NoPassToUser: Consensus reached
}
YesMultiTurn --> AgentGenerating
NoPassToUser --> WaitingForUser
How it Works:
- Context Analysis: After the user speaks, the Conductor examines:
  - Keywords in the transcript ("cost" → CFO, "users" → PM, "build" → Engineer)
  - Conversation history (who spoke last? avoid repetition)
  - Unanswered questions (did someone ask the CFO a direct question?)
- Relevance Scoring: Each agent gets a score (0-100) based on how relevant they are to the current topic (a scoring sketch follows this list).
- Winner Selection: The highest score speaks. Ties are broken by round-robin or by "who spoke least recently."
- Continuation Check: After an agent speaks, the Conductor checks: "Should another agent immediately respond?" This allows for natural back-and-forth between agents.
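The calculate_relevance helper used in the example below isn't defined anywhere in this post; a minimal keyword-overlap version might look like the following. A production scorer would likely weight trigger terms and fold in history signals, which is why the illustrative scores in the example aren't a pure keyword fraction.

# Sketch of a keyword-overlap relevance scorer (0-100). Hypothetical helper;
# real systems often add embeddings or an LLM-based scorer on top.
def calculate_relevance(keywords, trigger_keywords, recent_speaker_penalty=0):
    if not keywords:
        return 0
    triggers = {t.lower() for t in trigger_keywords}
    hits = sum(1 for kw in keywords if kw.lower() in triggers)
    score = int(100 * hits / len(keywords))
    # Optionally discourage whoever just spoke from dominating consecutive turns
    return max(0, score - recent_speaker_penalty)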
Concrete Example: The Routing Decision
User says: "How much will it cost to build this feature?"
Conductor Analysis:
# Keyword analysis
keywords = extract_keywords("How much will it cost to build this feature?")
# keywords = ["cost", "build", "feature"]
# Calculate relevance scores
relevance = {
"engineer": calculate_relevance(keywords, ["build", "tech", "code"]), # Score: 60
"pm": calculate_relevance(keywords, ["feature", "user", "product"]), # Score: 40
"cfo": calculate_relevance(keywords, ["cost", "budget", "price"]), # Score: 95
}
# Winner: CFO (score 95)
selected_agent = "cfo"
CFO Response: "Before we talk numbers, what's the expected ROI? How many users will this actually bring in?"
Continuation Check:
# After CFO speaks, check if PM should respond
continuation_prompt = """
The CFO just asked about ROI and user acquisition.
Should the PM respond immediately, or wait for the user?
Respond: "PM" or "USER"
"""
# Result: "PM" (because the question is directly about users)
next_speaker = "pm"
PM Response: "Based on our user research, this feature is highly requested. I'd estimate it could increase retention by 15%."
User Experience:
User: "How much will it cost to build this feature?"
CFO (voice 1, cautious tone): "Before we talk numbers, what's the expected ROI?"
PM (voice 2, optimistic tone): "This feature is highly requested—15% retention boost!"
[Pause for user]
Observation: The turn-taking protocol creates a natural flow. The CFO's question prompts an immediate PM response, mimicking how real teams work. The Conductor orchestrates this without the user needing to explicitly call on each agent.
Think About It: Should you allow agents to interrupt each other? In real meetings, interruptions signal importance or urgency. But in voice AI, they risk audio collisions. Most production systems use "immediate continuation" (agent B speaks right after agent A) instead of true interruption.
Pattern 2: Voice Identity and Personality Consistency
The Architectural Problem:
If all three agents sound the same, users get confused. "Wait, who's talking? Is this the engineer or the PM?"
You need voice identity—distinct audio signatures that map to distinct personalities.
The Architecture
graph TD
subgraph Persona["Agent Persona Definition"]
ID[Agent ID] --> Voice[TTS Voice ID]
ID --> Personality[System Prompt]
ID --> Triggers[Keyword Triggers]
ID --> Style[Speaking Style]
end
subgraph Mapping["Voice-to-Personality Mapping"]
Voice --> V1[Deeper Masculine Voice]
Voice --> V2[Higher Feminine Voice]
Voice --> V3[Neutral Professional Voice]
Personality --> P1[Skeptical, Technical, Terse]
Personality --> P2[Optimistic, User-Focused, Verbose]
Personality --> P3[Cautious, Financial, Data-Driven]
end
subgraph Consistency["Consistency Layer"]
V1 -.Always.-> P1
V2 -.Always.-> P2
V3 -.Always.-> P3
end
style Consistency fill:#e8f5e9,stroke:#388e3c
How it Works:
- Distinct Voices: Each agent uses a different TTS voice. Modern TTS systems offer many voice options with different tones, genders, and accents.
- Consistent Personality: The voice and personality must match. A skeptical engineer shouldn't sound cheerful.
- Speaking Style: Beyond the words, agents differ in how they speak:
  - Engineer: Short sentences. Technical jargon. Direct.
  - PM: Longer, enthusiastic sentences. Marketing language.
  - CFO: Measured pace. Numbers and metrics. Questions about costs.
- Memory Consistency: Each agent remembers their previous statements. If the Engineer said "I hate this idea" earlier, they shouldn't suddenly love it.
Concrete Example: Persona Definition
PERSONAS = {
"engineer": {
"name": "Alex",
"system_prompt": """
You are Alex, a Senior Engineer with 10 years of experience.
PERSONALITY:
- Skeptical of new trends (especially blockchain, AI hype)
- Care deeply about performance, reliability, technical debt
- Prefer proven technologies over cutting-edge ones
- Speak in short, direct sentences
COMMUNICATION STYLE:
- No fluff. Get to the point.
- Use technical terms without over-explaining
- Ask about edge cases and failure scenarios
- If something sounds technically infeasible, say so bluntly
CONSTRAINTS:
- Keep responses under 3 sentences unless giving technical details
- Always consider implementation complexity
""",
"voice_id": "onyx", # Deeper, authoritative voice
"trigger_keywords": ["build", "implement", "code", "tech", "stack", "latency", "performance"],
"typical_length": "short"
},
"pm": {
"name": "Jordan",
"system_prompt": """
You are Jordan, a Product Manager who came from a design background.
PERSONALITY:
- Optimistic about new ideas
- Obsessed with user experience and customer delight
- Comfortable with ambiguity and iteration
- Use design thinking and frameworks
COMMUNICATION STYLE:
- Enthusiastic! Use exclamation points (but not excessively)
- Reference "user research," "customer feedback," "personas"
- Paint the vision before diving into details
- Always bring conversation back to user value
CONSTRAINTS:
- Acknowledge technical constraints but push for creative solutions
- Responses can be longer (4-5 sentences) when painting a vision
""",
"voice_id": "shimmer", # Brighter, energetic voice
"trigger_keywords": ["user", "customer", "experience", "feature", "product", "market", "growth"],
"typical_length": "medium"
},
"cfo": {
"name": "Morgan",
"system_prompt": """
You are Morgan, a CFO who scaled two startups to exit.
PERSONALITY:
- Fiscally conservative but understands strategic investment
- Data-driven decision making
- Skeptical of "nice to have" features
- Demand ROI calculations and payback periods
COMMUNICATION STYLE:
- Always ask about costs before agreeing to anything
- Use numbers and metrics when available
- Frame decisions in financial terms
- Ask probing questions about revenue impact
CONSTRAINTS:
- Never greenlight spending without understanding the return
- Keep responses concise (2-3 sentences) unless analyzing financials
""",
"voice_id": "echo", # Calm, measured voice
"trigger_keywords": ["cost", "price", "budget", "revenue", "profit", "ROI", "money"],
"typical_length": "short"
}
}
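To bind each persona to its audio identity, the voice_id above is passed straight to the TTS call. Below is a sketch using OpenAI's speech endpoint; any TTS provider with selectable voices works the same way, and the exact response handling should be treated as an assumption rather than a drop-in snippet.

from openai import OpenAI

client = OpenAI()

def synthesize(agent_id: str, text: str) -> bytes:
    # Render an agent's reply in that agent's dedicated voice
    persona = PERSONAS[agent_id]
    response = client.audio.speech.create(
        model="tts-1",
        voice=persona["voice_id"],  # "onyx", "shimmer", or "echo"
        input=text,
    )
    return response.read()  # audio bytes to play into the shared room stream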
The Consistency Table:
| Agent | Voice Character | Typical Phrase | Response Length | Tone |
|---|---|---|---|---|
| Engineer | Deep, technical | "That won't scale" | 1-3 sentences | Skeptical |
| PM | Bright, enthusiastic | "Users will love this!" | 3-5 sentences | Optimistic |
| CFO | Calm, measured | "What's the ROI?" | 2-3 sentences | Cautious |
Observation: Consistency isn't just about the voice—it's about the entire persona. If your CFO suddenly sounds excited about spending money, users notice the character break. Personality consistency is as important as voice consistency.
Think About It: How many distinct voices can users reliably track? Research suggests 3-4 is the sweet spot. Beyond that, listeners get confused about who's who. If you need more agents, consider grouping them (e.g., two engineers become "The Engineering Team").
Pattern 3: Agent-to-Agent Communication
The Architectural Problem:
The basic pattern is User → Agent → User. But real group conversations have Agent A → Agent B → User. Agents debate each other, build on each other's ideas, or disagree.
Without this, your "roundtable" feels like three separate 1:1 conversations, not a group dynamic.
The Architecture
graph TD
User[User Speaks] --> Conductor[Conductor Analyzes]
Conductor --> SelectFirst[Select First Responder]
SelectFirst --> A1[Agent A Speaks]
A1 --> Check{Continuation Check}
Check -->|No Disagreement| WaitUser[Return to User]
Check -->|Strong Disagreement| AnalyzeDisagreement[Analyze What A Said]
AnalyzeDisagreement --> SelectSecond[Select Counter-Agent]
SelectSecond --> A2[Agent B Speaks]
A2 --> Check2{Another Response?}
Check2 -->|Yes Max 2 Exchanges| A3[Agent C Speaks]
Check2 -->|No Or Limit Reached| WaitUser
A3 --> WaitUser
style Check fill:#fff9c4,stroke:#fbc02d
style AnalyzeDisagreement fill:#e3f2fd,stroke:#0d47a1
How it Works:
- Continuation Detection: After Agent A speaks, the Conductor analyzes the following (a sketch of this check follows the list):
  - Did Agent A make a controversial statement?
  - Did Agent A ask a question another agent should answer?
  - Is there likely disagreement from other perspectives?
- Counter-Agent Selection: If continuation is needed, select the agent most likely to have a different view.
- Turn Limit: Prevent infinite agent-to-agent loops. Maximum 2-3 agent exchanges before returning to the user.
- Context Awareness: Agent B's response must acknowledge Agent A's point:
  - "I hear what Alex is saying about performance, but..."
  - "Morgan raises a good point about costs. However..."
Concrete Example: The Debate Flow
User: "I think we should rebuild our entire backend in Rust."
Conductor: Routes to Engineer first (technical decision)
Engineer (Agent A): "Rust would give us better performance, but rewriting working code is risky. What's broken with the current system?"
Continuation Check:
# Analyze Engineer's response
prompt = """
The Engineer just questioned whether a rewrite is necessary.
This is a strategic and financial decision. Should the CFO weigh in?
Respond: "cfo" or "user"
"""
# Result: "cfo"
CFO (Agent B): "A full rewrite is a 6-month project. That's millions in dev costs with zero new revenue. Can we optimize the current system instead?"
Continuation Check:
# Both Engineer and CFO are skeptical. Should PM defend the idea?
prompt = """
Both the Engineer and CFO are skeptical about the rewrite.
Should the PM provide a counterpoint about strategic value?
Respond: "pm" or "user"
"""
# Result: "user" (let the user respond to the pushback)
User Experience:
User: "I think we should rebuild our backend in Rust."
Engineer (skeptical tone): "Rust would give us better performance, but
rewriting working code is risky. What's broken?"
CFO (measured tone): "A full rewrite is 6 months—that's millions in costs
with zero new revenue."
[Pause for user]
User: "Fair points. Maybe we can start with just the critical paths?"
PM (optimistic tone): "Now that's a smart approach! We could prove the value
incrementally..."
Observation: Agent-to-agent communication creates a richer conversation. The user gets multiple perspectives without explicitly asking each agent. The system feels less like "three chatbots" and more like "one smart team."
The Continuation Rules
| Scenario | Continuation Decision | Next Speaker |
|---|---|---|
| Agent makes controversial claim | High probability | Agent with opposite view |
| Agent asks specific question | High probability | Agent with relevant expertise |
| Agent provides data/analysis | Low probability | Return to user |
| 2+ agents already spoke | Force stop | User (prevent loops) |
| User explicitly named an agent | No continuation | User |
Think About It: How do you prevent "argument loops" where agents keep contradicting each other? Set a hard limit (max 3 agent turns before returning to user) and track if the conversation is converging or diverging. If they're just repeating the same disagreement, cut it short and ask the user to make a decision.
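One way to detect those loops, sketched with the standard library's difflib (a production system might compare embeddings instead): if an agent's newest reply is too similar to their previous one in the same exchange, the Conductor cuts the debate and hands the floor back to the user.

from difflib import SequenceMatcher

SIMILARITY_CUTOFF = 0.8  # "basically the same point again"
MAX_AGENT_TURNS = 3      # hard cap regardless of content

def is_argument_loop(history, agent_id):
    # Compare the agent's last two messages; near-duplicates signal a loop
    own = [m["content"] for m in history if m.get("name") == agent_id][-2:]
    if len(own) < 2:
        return False
    return SequenceMatcher(None, own[0], own[1]).ratio() >= SIMILARITY_CUTOFF

def should_cut_exchange(history, agent_id, agent_turns_this_round):
    return (agent_turns_this_round >= MAX_AGENT_TURNS
            or is_argument_loop(history, agent_id))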
Pattern 4: State Management Across Agents
The Architectural Problem:
Each agent needs to remember:
- What they personally said earlier
- What other agents said
- What the user's preferences are
- Decisions that have been made
But with three agents sharing one conversation, how do you manage state without conflicts?
The Architecture
graph TD
subgraph SharedState["Shared State"]
History[(Full Conversation History)]
Decisions[(Agreed Upon Decisions)]
UserContext[(User Preferences and Context)]
end
subgraph AgentState["Per-Agent State"]
E1[Engineer Memory]
E2[PM Memory]
E3[CFO Memory]
end
History --> E1
History --> E2
History --> E3
E1 -->|Filter| E1View[Engineer's View of History]
E2 -->|Filter| E2View[PM's View of History]
E3 -->|Filter| E3View[CFO's View of History]
Decisions --> E1View
Decisions --> E2View
Decisions --> E3View
UserContext --> E1View
UserContext --> E2View
UserContext --> E3View
style SharedState fill:#e3f2fd,stroke:#0d47a1
style AgentState fill:#e8f5e9,stroke:#388e3c
How it Works:
- Global History: One complete conversation log shared by all agents. Format:
  history = [
      {"role": "user", "content": "Let's build a crypto app"},
      {"role": "assistant", "name": "engineer", "content": "Blockchain is slow..."},
      {"role": "assistant", "name": "pm", "content": "But users love it!"},
  ]
- Persona-Tagged Messages: Each agent message includes a `name` field. This lets agents reference each other: "As Morgan mentioned earlier..."
- Decision Tracking: When consensus is reached, mark it explicitly:
  decisions = [
      {"topic": "budget", "decision": "Max $50k", "agreed_by": ["cfo", "pm"]},
  ]
- Context Windows: Each agent gets the following (a shared-state sketch follows this list):
  - Full history (last N turns)
  - Decisions made
  - User's stated goals
  - Their own personality prompt
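A minimal shape for that shared state, as a sketch (the field names and the agent_view method are illustrative, not a prescribed schema; PERSONAS is the dict from Pattern 2):

from dataclasses import dataclass, field

@dataclass
class SharedState:
    history: list = field(default_factory=list)        # full, persona-tagged log
    decisions: list = field(default_factory=list)       # agreed-upon decisions
    user_context: dict = field(default_factory=dict)    # budget, goals, preferences

    def agent_view(self, agent_id: str, last_n: int = 20) -> list:
        # One agent's context: persona prompt + decisions + recent shared history
        system = PERSONAS[agent_id]["system_prompt"]
        if self.decisions:
            system += "\nDECISIONS SO FAR:\n" + "\n".join(
                f"- {d['topic']}: {d['decision']}" for d in self.decisions)
        if self.user_context:
            system += f"\nUSER CONTEXT: {self.user_context}"
        return [{"role": "system", "content": system}] + self.history[-last_n:]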
Concrete Example: Memory Consistency
Turn 1:
User: "My budget is $20k"
CFO: "Noted. We'll keep costs under $20k."
Turn 5:
Engineer: "This feature needs a $30k cloud infrastructure"
CFO: "That's over our $20k budget. Can we reduce scope?"
How This Works:
The CFO's second response is generated with this context:
messages = [
{"role": "system", "content": PERSONAS["cfo"]["system_prompt"]},
# ... previous conversation history ...
{"role": "user", "content": "My budget is $20k"}, # Turn 1
{"role": "assistant", "name": "cfo", "content": "Noted. We'll keep costs under $20k."},
# ... more history ...
{"role": "assistant", "name": "engineer", "content": "This feature needs $30k infrastructure"},
# CFO generates response with full context
]
The CFO "remembers" the $20k constraint because it's in the conversation history.
The Memory Challenge:
graph TD
A[Long Conversation] --> B{Context Window Full?}
B -->|No| C[Add to History]
B -->|Yes| D[Summarization Strategy]
D --> E[Keep Recent Messages]
D --> F[Keep Important Decisions]
D --> G[Summarize Middle Turns]
E --> H[Compressed History]
F --> H
G --> H
H --> C
style D fill:#fff9c4,stroke:#fbc02d
Observation: State management in multi-agent systems is like managing a group chat—everyone needs to see the same messages, but you can't keep infinite history. Production systems use summarization: keep recent messages verbatim, summarize older ones, always preserve key decisions.
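A sketch of that compression step, assuming the SharedState sketch from earlier and a hypothetical summarize helper (an LLM call that condenses the dropped turns into a few sentences):

MAX_VERBATIM_TURNS = 20

def compress_history(state):
    # Keep recent turns verbatim, summarize older ones, always preserve decisions
    if len(state.history) <= MAX_VERBATIM_TURNS:
        return state.history
    older = state.history[:-MAX_VERBATIM_TURNS]
    recent = state.history[-MAX_VERBATIM_TURNS:]
    summary = summarize(older)  # hypothetical LLM call: "Earlier, the team agreed..."
    summary_msg = {"role": "system",
                   "content": f"Summary of earlier discussion: {summary}"}
    # Hard constraints ("budget is $20k") should also live in state.decisions /
    # state.user_context, which agent_view() re-injects on every turn.
    return [summary_msg] + recent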
Think About It: Should agents have "private" knowledge that other agents don't know? In some scenarios (e.g., competitive simulation), yes. But for collaborative roundtables, shared state prevents contradictions. If the PM promises a feature and the Engineer doesn't know about it, chaos ensues.
Pattern 5: Latency Management in Multi-Agent Systems
The Architectural Problem:
With one agent, you generate one response. With three agents, you potentially:
- Generate three responses (to find the best one)
- Run the Conductor router (extra LLM call)
- Handle agent-to-agent continuations (more LLM calls)
If done naively, latency explodes from 800ms to 3+ seconds. That's too slow for natural conversation.
The Architecture
graph TD
User[User Stops Speaking] --> VAD[VAD Detection 50ms]
VAD --> STT[Streaming STT 200ms]
STT --> Router[Conductor Router 300ms]
Router --> Selected[Selected Agent Only]
Selected --> LLM[LLM Generation 400ms]
LLM --> TTS[Streaming TTS 100ms]
TTS --> Audio[User Hears Response]
subgraph Optimization["Latency Optimization"]
Router -.Parallel.-> Prefetch[Prefetch Agent Context]
Prefetch -.Ready.-> LLM
end
VAD --> Total["Total: ~1050ms"]
style Total fill:#e8f5e9,stroke:#388e3c
style Optimization fill:#fff9c4,stroke:#fbc02d
The Optimization Strategies:
- Router First: Don't generate responses from all three agents and pick one. Route first, then generate only the selected agent's response.
- Fast Router: Use a small, fast model (GPT-4o-mini, Claude Haiku) for routing. This decision should take <300ms.
- Streaming Everything: STT streams partial transcripts. LLM streams tokens. TTS streams audio chunks. Never wait for "complete" outputs.
- Parallel Prefetch: While the router is deciding, prefetch shared context (conversation history, user profile) so it's ready when the agent generates (sketched after this list).
- Continuation Limit: Hard cap on agent-to-agent exchanges (max 2-3) to prevent latency stacking.
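A sketch of the route-first flow with the prefetch running alongside the router (asyncio; route_fast, prefetch_context, generate_streaming, and tts_stream are hypothetical async helpers):

import asyncio

async def handle_turn(transcript, state):
    # Kick off the fast router and the context prefetch concurrently
    route_task = asyncio.create_task(route_fast(transcript, state.history))
    prefetch_task = asyncio.create_task(prefetch_context(state))
    agent_id, context = await asyncio.gather(route_task, prefetch_task)

    # Generate only the selected agent's response and stream it out
    async for sentence in generate_streaming(agent_id, context):
        # Start TTS on the first complete sentence; never wait for the full reply
        await tts_stream(sentence, voice=PERSONAS[agent_id]["voice_id"])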
The Latency Comparison
Naive Approach (Generate All, Pick One):
User stops speaking
→ Generate Engineer response (800ms)
→ Generate PM response (800ms) [issued in parallel, but you still wait for the slowest of the three]
→ Generate CFO response (800ms) [parallel]
→ Pick best one (100ms)
→ Synthesize audio (200ms)
Total: ~1900ms in practice. Nothing can stream because the winner isn't known until every response is complete, tail latency on the slowest parallel call stretches the 800ms budget, and you pay for three generations to use one.
Optimized Approach (Route First):
User stops speaking
→ Conductor routes (300ms)
→ Generate selected agent only (800ms)
→ Synthesize audio (streaming, first chunk at +100ms)
Total: ~1200ms, user hears first words at ~1100ms
The Math:
| Approach | Routing | Generation | Total | Perceived Latency |
|---|---|---|---|---|
| Naive (generate all) | N/A | 800ms × 3 parallel | 1900ms | 1900ms (slow) |
| Optimized (route first) | 300ms | 800ms × 1 | 1100ms | 700ms (good) |
| Highly Optimized (streaming) | 250ms | 600ms (first token at 300ms) | 1050ms | 500ms (excellent) |
Observation: The multi-agent system is inherently more complex than single-agent, but smart routing keeps latency acceptable. The key is routing before generation, not after.
Think About It: Should you ever generate multiple responses speculatively? If you have strong confidence about which agent will speak (e.g., user asked "What's this going to cost?" → 95% CFO), you could start generating CFO's response in parallel with the routing decision. Risky, but can save 300ms.
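A sketch of that speculative path, reusing the hypothetical async helpers above plus a generate_async helper (a non-streaming async LLM call): start the likely agent's generation alongside the routing call, keep it if the router agrees, and cancel it (eating the wasted tokens) if it doesn't.

async def handle_turn_speculative(transcript, state, likely_agent):
    # Router and speculative generation start together
    route_task = asyncio.create_task(route_fast(transcript, state.history))
    spec_task = asyncio.create_task(
        generate_async(likely_agent, state.agent_view(likely_agent)))

    agent_id = await route_task
    if agent_id == likely_agent:
        reply = await spec_task  # router agreed: the ~300ms routing cost is hidden
    else:
        spec_task.cancel()       # wrong guess: pay for the discarded generation
        reply = await generate_async(agent_id, state.agent_view(agent_id))

    await tts_stream(reply, voice=PERSONAS[agent_id]["voice_id"])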
Putting It All Together: A Real Roundtable Session
Let's trace a complete conversation through the system.
Scenario: User pitching a new product feature to the AI roundtable.
sequenceDiagram
participant U as User
participant C as Conductor
participant E as Engineer
participant P as PM
participant F as CFO
U->>C: "I want to add real-time video calls to our app"
Note over C: Analyze keywords: "add", "real-time", "video"
Note over C: Route to Engineer (technical complexity)
C->>E: Generate response with context
E->>U: "Real-time video is complex. We'd need WebRTC, TURN servers..."
Note over C: Check continuation
Note over C: Decision: CFO should comment on infrastructure costs
C->>F: Generate response with Engineer's context
F->>U: "TURN servers aren't cheap. That's $5k/month minimum..."
Note over C: Check continuation
Note over C: Decision: Return to user (2 agents spoke)
U->>C: "What if we only enable it for premium users?"
Note over C: Analyze keywords: "premium users", "if"
Note over C: Route to PM (pricing strategy)
C->>P: Generate response
P->>U: "I love it! Premium feature scarcity. We could charge $20/month..."
Note over C: Check continuation
Note over C: Decision: CFO should validate pricing
C->>F: Generate response with PM's context
F->>U: "At $20/month, we need 250 premium users to cover costs..."
Note over C: Check continuation
Note over C: All perspectives covered, return to user
Full Conversation Flow:
User: "I want to add real-time video calls to our app"
Engineer (technical tone): "Real-time video is complex. We'd need WebRTC
infrastructure, TURN servers for NAT traversal, and bandwidth optimization.
That's at least 3 months of work."
CFO (cautious tone): "TURN servers aren't cheap. That's $5k per month minimum
for infrastructure. What's the business case?"
[Pause - waiting for user]
User: "What if we only enable it for premium users?"
PM (enthusiastic tone): "Oh, I love this! Premium feature scarcity.
We could position it as our flagship tier—video calls with your team,
exclusive to Pro users at $20/month."
CFO (measured tone): "At $20/month premium pricing, we need 250 paying users
just to cover the infrastructure costs. Can we realistically get there?"
[Pause - waiting for user]
User: "Our current user base is 5,000. If we convert 10%, that's 500 premium users."
PM: "That's above our cost threshold! And 10% conversion on a premium feature
is reasonable..."
Engineer: "Alright, I'm convinced it's worth exploring. Let me spec out a Phase 1
implementation."
[Conversation continues...]
Key Observations:
- Natural Flow: The Conductor created a natural debate sequence: Engineer raises technical concerns → CFO adds financial concerns → User pivots → PM finds opportunity → CFO validates economics.
- Agent-to-Agent: Two pairs of back-to-back agent responses (Engineer→CFO, PM→CFO) without explicit user prompting. This feels like a real meeting.
- Distinct Voices: Each agent's personality is clear from their word choice and focus area.
- Shared Context: The CFO's second response directly referenced the PM's $20/month suggestion, showing proper state management.
Challenge: Design Decisions for Your Roundtable
Challenge 1: The Interruption Problem
Your PM is giving a long, enthusiastic pitch. 20 seconds in, the user tries to interrupt. What happens?
Options:
- Immediate Stop: Cancel PM's speech instantly, switch to listening mode
- Finish Thought: Let PM complete the current sentence, then stop
- Ignore: PM keeps talking (bad UX, but technically simpler)
Your Task: How do you detect user interruption when the AI is speaking? VAD needs to distinguish between the AI's audio and the user's audio. This requires echo cancellation.
Challenge 2: The Dominant Agent Problem
You notice the Engineer speaks 60% of the time, while the CFO only speaks 15%. The conversation feels unbalanced.
Options:
- Forced Balance: Track speak time per agent, artificially boost underrepresented agents
- Dynamic Routing: Adjust routing weights based on recent speak distribution
- Accept Imbalance: If the conversation is technical, Engineer should dominate
Your Task: Where's the line between natural conversation flow and forced balance? Should you make it configurable per conversation type?
Challenge 3: The Context Window Limit
After 30 minutes of conversation, your context window is full. You can't fit the entire history anymore.
Options:
- Sliding Window: Keep only the last 20 exchanges, drop older ones
- Intelligent Summarization: Summarize the first 50% of the conversation
- Decision Tracking: Keep decisions and recent context, drop mid-conversation details
Your Task: How do you summarize without losing critical information? If the user said "My budget is $20k" in minute 5, but it's now minute 35, that constraint must not be forgotten.
System Comparison: Single Agent vs. Multi-Agent
| Dimension | Single Agent | Multi-Agent Roundtable |
|---|---|---|
| Complexity | Low | High |
| Latency | 800ms | 1200ms |
| Conversation Depth | Moderate | High |
| Perspective Variety | Single viewpoint | Multiple viewpoints |
| State Management | Simple | Complex (shared state) |
| Voice Identity | One voice | Multiple voices |
| Turn-Taking | Implicit (user speaks, AI speaks) | Explicit (orchestrated) |
| Production Cost | Lower (fewer LLM calls) | Higher (routing + generation) |
| User Experience | Good for simple queries | Excellent for complex discussions |
graph TD
subgraph Single["Single Agent System"]
S1[User Question] --> S2[One Perspective]
S2 --> S3[One Answer]
style S3 fill:#fff8e1,stroke:#f57f17
end
subgraph Multi["Multi-Agent Roundtable"]
M1[User Question] --> M2[Multiple Perspectives]
M2 --> M3[Debate and Discussion]
M3 --> M4[Nuanced Answer]
style M4 fill:#e8f5e9,stroke:#388e3c
end
Key Architectural Patterns Summary
| Pattern | Problem Solved | Key Benefit | Complexity |
|---|---|---|---|
| Conductor | Audio collisions | Orderly turn-taking | Medium |
| Turn-Taking Protocol | Rigid conversations | Natural flow | Medium |
| Voice Identity | Agent confusion | Clear speaker distinction | Low |
| Agent-to-Agent | Isolated responses | Rich group dynamics | High |
| Shared State | Inconsistencies | Memory coherence | High |
| Latency Management | Slow responses | Acceptable real-time performance | Medium |
Discussion Points for Engineers
1. The Personality Drift Problem
After many conversations, you notice agents' personalities are drifting. The skeptical Engineer is becoming agreeable. The cautious CFO is suggesting risky investments.
Questions:
- Is this drift from fine-tuning your models on conversation data?
- Do you need "personality anchoring" prompts that remind agents of their core traits?
- Should you periodically reset agent personalities to baseline?
2. The Disagreement Loop Problem
Your Engineer and PM get stuck in a loop:
Engineer: "This is too complex"
PM: "But users need it"
Engineer: "It's still too complex"
PM: "But users really need it"
...
Questions:
- How do you detect when agents are repeating themselves?
- Should the Conductor forcibly break loops by routing to a tie-breaker (CFO)?
- Should the system explicitly say "We're going in circles—let's move on"?
3. The Scale Challenge
You want to expand from 3 agents to 5 (add Designer and Data Scientist).
Questions:
- Does the Conductor pattern scale linearly? Or does routing complexity increase exponentially?
- With 5 voices, can users still track who's who?
- Should you introduce "sub-teams" (Engineer + Designer = "Tech Team")?
Takeaways
The Three Layers of Multi-Agent Voice
graph TD
A[Multi-Agent Voice System] --> B[Layer 1: Orchestration]
A --> C[Layer 2: Identity]
A --> D[Layer 3: Interaction]
B --> E[Conductor Pattern]
B --> F[Turn-Taking Protocol]
C --> G[Voice Consistency]
C --> H[Personality Definition]
D --> I[Agent-to-Agent Communication]
D --> J[Shared State Management]
style A fill:#e3f2fd,stroke:#0d47a1
style B fill:#fff9c4,stroke:#fbc02d
style C fill:#e8f5e9,stroke:#388e3c
style D fill:#ffe0b2,stroke:#ef6c00
Key Insights
- Orchestration is mandatory — Multiple agents without a conductor is chaos. The Conductor pattern transforms audio collisions into orderly conversations.
- Identity creates immersion — Distinct voices mapped to consistent personalities make the roundtable feel real. Users quickly learn "Oh, that's the CFO being cautious again."
- Agent-to-agent is the magic — The breakthrough moment is when agents respond to each other without user prompting. Suddenly it feels like a real team meeting.
- State management is critical — With shared state, agents build on each other's points. Without it, they contradict each other and users lose trust.
- Latency compounds — Every extra agent interaction adds latency. Route first, generate once, stream everywhere.
The Implementation Roadmap
| Phase | Focus | Why |
|---|---|---|
| Phase 1 | Single agent voice | Prove the voice pipeline works |
| Phase 2 | Add 2nd agent + basic routing | Validate turn-taking logic |
| Phase 3 | Add 3rd agent + voice identity | Complete the roundtable experience |
| Phase 4 | Agent-to-agent communication | Enable natural debates |
| Phase 5 | Advanced state management | Handle long conversations |
What's Next: Beyond Roundtables
The patterns in this post extend beyond brainstorming sessions:
- Educational Simulations: History teacher debates with historical figures who have competing perspectives
- Therapy/Counseling: Different therapeutic approaches (CBT, psychodynamic, humanistic) in one session
- Role-Play Training: Sales training with customer, manager, and product expert personas
- Creative Writing: Brainstorm with a plotter, a character developer, and an editor
The architecture is the same. The personas change. The orchestration patterns endure.
The Result: You've built a system that doesn't just answer questions—it facilitates nuanced, multi-perspective discussions. It's not a chatbot. It's a roundtable of AI advisors working together to help users think through complex problems.
This is what multi-agent AI looks like in production.
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.