Streaming at Scale: SSE, WebSockets & Designing Real-Time AI APIs

Param Harrison
5 min read

The Challenge

Your chat app's users are typing faster than your LLM can reply.

  • Requests pile up
  • Some clients disconnect mid-generation
  • Metrics show spikes in open connections and token waste

Discussion: You're streaming tokens — why does latency still feel high?

1. Streaming is a system design problem — not a transport choice

Most engineers treat streaming as "pick SSE or WebSocket."

In reality, it's a system-wide coordination model.

flowchart LR
    A[User Input] --> B[Gateway]
    B --> C[Coordinator]
    C --> D[Model Runner]
    D --> E[Streamer]
    E --> F[Client UI]
    C --> G[Metrics & Backpressure]

Every component influences perceived streaming speed.

Streaming = how data flows through your architecture, not just the protocol.

2. The latency stack

Layer                 Typical Delay (ms)    Design Lever
Model first token     300–2000              Prompt trimming, model choice
Token serialization   10–50 per token       Buffered vs. unbuffered I/O
Gateway routing       20–100                Keep-alive, chunk flush
Client render         30–100                Frame batching

Challenge: How can you deliver the first visible token < 250ms, even if total generation = 5s?

(Hint: predictive buffering and async render.)
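
The hint works on the server side too: flush the very first token unbuffered, then switch to batched flushing for the rest. A minimal sketch, assuming a hypothetical generate_tokens() async generator over the model's output:

async def generate_tokens(prompt: str):       # stand-in for the real model stream
    for t in ("Hello", ",", " world"):
        yield t

async def eager_first_token_stream(prompt: str, chunk_size: int = 8):
    """Yield the first token immediately, then batch the rest into chunks."""
    first = True
    chunk = []
    async for token in generate_tokens(prompt):
        if first:
            yield token                       # ship the first token unbuffered
            first = False
            continue
        chunk.append(token)
        if len(chunk) >= chunk_size:          # batch later tokens to cut framing overhead
            yield "".join(chunk)
            chunk = []
    if chunk:
        yield "".join(chunk)                  # flush whatever is left at the end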

3. Protocol trade-offs

Server-Sent Events (SSE)

Unidirectional HTTP stream.

sequenceDiagram
    participant Client
    participant Server
    
    Client->>Server: POST /chat
    loop tokens
        Server-->>Client: event: message\n data: token
    end
    Server-->>Client: event: done

Pros: Simple, cache-friendly

Cons: One-way, limited reconnection control
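
A minimal SSE endpoint sketch, assuming a FastAPI backend; the route and the generate_tokens() generator are illustrative, not a prescribed API:

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):       # stand-in for the real model stream
    for t in ("Hello", ",", " world"):
        yield t

@app.post("/chat")
async def chat(request: Request):
    body = await request.json()

    async def event_stream():
        async for token in generate_tokens(body["prompt"]):
            # One SSE event per token; each event is terminated by a blank line.
            yield f"event: message\ndata: {token}\n\n"
        yield "event: done\ndata: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")

On the wire this matches the diagram above: one long-lived HTTP response with Content-Type: text/event-stream, a data frame per token, and an explicit done event so the client knows the stream ended cleanly.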

WebSockets

Bidirectional persistent connection.

sequenceDiagram
    participant Client
    participant Server
    
    Client->>Server: connect(ws)
    Client->>Server: prompt
    loop tokens
        Server-->>Client: token
    end
    Client->>Server: stop / feedback

Pros: Great for agents, interactive tools

Cons: Stateful; needs connection management and sticky load balancing
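
The WebSocket equivalent, again as a FastAPI sketch. The stop/feedback handling is simplified to a single receive; the route name and generate_tokens() are assumptions:

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def generate_tokens(prompt: str):       # stand-in for the real model stream
    for t in ("Hello", ",", " world"):
        yield t

@app.websocket("/ws/chat")
async def ws_chat(ws: WebSocket):
    await ws.accept()
    try:
        prompt = await ws.receive_text()      # client sends the prompt first
        async for token in generate_tokens(prompt):
            await ws.send_text(token)
        await ws.send_text("[DONE]")
    except WebSocketDisconnect:
        # Client dropped mid-generation: stop producing tokens for this socket.
        pass
    # A production handler would also run a concurrent receive loop so
    # "stop" / feedback messages can interrupt generation mid-stream.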

Hybrid model (real systems)

graph TD
    A[Frontend] -->|prompt| B[HTTP Gateway]
    B -->|subscribe| C[Message Broker]
    C --> D[Streamer Service]
    D -->|SSE→UI| E[User]
    A -->|control via WS| B

  • Control plane (WebSocket): cancel, feedback
  • Data plane (SSE): token flow

This hybrid pattern shows up in production chat UIs like those from OpenAI and Anthropic (think ChatGPT- and Claude-style interfaces).
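
One way to wire the two planes together: the WebSocket control channel sets a cancel flag keyed by stream ID, and the SSE data plane checks it between tokens. A sketch with an in-memory registry; a real deployment would keep this state in a shared broker such as Redis, and stream_id is an illustrative parameter:

import asyncio

cancel_flags: dict[str, asyncio.Event] = {}   # stream_id -> cancel signal

async def generate_tokens(prompt: str):       # stand-in for the real model stream
    for t in ("Hello", ",", " world"):
        yield t

async def data_plane(stream_id: str, prompt: str):
    """SSE generator: stops as soon as the control plane flags a cancel."""
    cancel = cancel_flags.setdefault(stream_id, asyncio.Event())
    async for token in generate_tokens(prompt):
        if cancel.is_set():
            yield "event: cancelled\ndata: stopped by user\n\n"
            return
        yield f"data: {token}\n\n"
    yield "event: done\ndata: [DONE]\n\n"

def control_plane(stream_id: str, message: str):
    """Called from the WebSocket handler when the client sends 'stop'."""
    if message == "stop":
        cancel_flags.setdefault(stream_id, asyncio.Event()).set()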

4. Designing for backpressure and buffer control

When the model produces tokens faster than clients can consume them:

graph LR
    M[Model Stream] -->|tokens| Q[Bounded Buffer]
    Q -->|flush chunks| C[Client]
    Q -->|pressure signal| M

Patterns:

  • Dynamic chunk size (flush every N tokens or T ms)
  • Drop policy for late clients
  • Stream heartbeat: event:ping
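
A sketch of the bounded-buffer pattern: a producer that blocks when the queue is full (that blocking is the pressure signal back to the model) and a consumer that flushes every N tokens or T milliseconds, whichever comes first. The numbers are illustrative, not tuned values:

import asyncio
import time

MAX_BUFFER = 256        # producer blocks here when the client falls behind
FLUSH_TOKENS = 8        # flush after N tokens...
FLUSH_SECONDS = 0.05    # ...or after T seconds, whichever comes first
DONE = object()         # end-of-stream sentinel

async def generate_tokens(prompt: str):       # stand-in for the real model stream
    for t in ("Hello", ",", " world"):
        yield t

async def produce(queue: asyncio.Queue, prompt: str):
    async for token in generate_tokens(prompt):
        await queue.put(token)                # awaits when the bounded buffer is full
    await queue.put(DONE)

async def flush_chunks(queue: asyncio.Queue):
    chunk, last_flush = [], time.monotonic()
    while True:
        try:
            item = await asyncio.wait_for(queue.get(), timeout=FLUSH_SECONDS)
        except asyncio.TimeoutError:
            item = None                       # nothing new; a time-based flush may be due
        if item is DONE:
            if chunk:                         # flush whatever is left before closing
                yield f"data: {''.join(chunk)}\n\n"
            break
        if item is not None:
            chunk.append(item)
        flush_due = len(chunk) >= FLUSH_TOKENS or (time.monotonic() - last_flush) >= FLUSH_SECONDS
        if chunk and flush_due:
            yield f"data: {''.join(chunk)}\n\n"
            chunk, last_flush = [], time.monotonic()

Wire it up with asyncio.Queue(maxsize=MAX_BUFFER): run produce() as a background task while the HTTP handler iterates flush_chunks().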

Challenge: How would you prevent OOM if one user leaves a tab open and never reads from the stream?

5. Resilience & reconnection strategy

Streaming is fragile — clients disconnect often.

Reconnection Checklist:

  • Client sends Last-Event-ID
  • Server resumes from token index or summary
  • Partial state persisted (ephemeral store / Redis)

sequenceDiagram
    participant C as Client
    participant S as Server
    
    C->>S: connect (Last-Event-ID=250)
    S-->>C: resume token 251…

Think of it like "TCP for tokens."
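
A resumable-SSE sketch with FastAPI: each event carries an id, and on reconnect the client's Last-Event-ID header tells the server where to pick up. The in-memory token_log stands in for an ephemeral store like Redis, and the route shape is an assumption:

from fastapi import FastAPI, Header
from fastapi.responses import StreamingResponse

app = FastAPI()
token_log: dict[str, list[str]] = {}          # stream_id -> tokens emitted so far

@app.get("/chat/{stream_id}/stream")
async def resume_stream(stream_id: str,
                        last_event_id: str | None = Header(default=None)):
    start = int(last_event_id) + 1 if last_event_id else 0

    async def event_stream():
        # Replay the tokens the client missed, then keep streaming live ones.
        for i, token in enumerate(token_log.get(stream_id, [])[start:], start=start):
            yield f"id: {i}\nevent: message\ndata: {token}\n\n"
        # ...continue appending new model tokens to token_log and yielding them here...

    return StreamingResponse(event_stream(), media_type="text/event-stream")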

6. Observability in streaming systems

You need continuous metrics, not post-hoc logs.

flowchart TD
    A[Streamer] --> B[Metrics]
    B --> C[Dashboard]
    B --> D[Alerting]

Metrics to collect:

  • First token latency
  • Tokens/sec throughput
  • Stream duration distribution
  • Error rate (4xx, 5xx, disconnects)

Visualize these in real time — latency histograms tell you where UX pain hides.
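
A per-stream instrumentation sketch using prometheus_client; the metric names and histogram buckets are illustrative choices, not a prescribed schema:

import asyncio
import time
from prometheus_client import Counter, Histogram

FIRST_TOKEN_LATENCY = Histogram("stream_first_token_seconds",
                                "Time to first token",
                                buckets=(0.1, 0.25, 0.5, 1, 2, 5))
STREAM_DURATION = Histogram("stream_duration_seconds", "Total stream duration")
TOKENS_SENT = Counter("stream_tokens_total", "Tokens sent to clients")
DISCONNECTS = Counter("stream_disconnects_total", "Streams torn down early")

async def generate_tokens(prompt: str):       # stand-in for the real model stream
    for t in ("Hello", ",", " world"):
        yield t

async def instrumented_stream(prompt: str):
    start, saw_first = time.monotonic(), False
    try:
        async for token in generate_tokens(prompt):
            if not saw_first:
                FIRST_TOKEN_LATENCY.observe(time.monotonic() - start)
                saw_first = True
            TOKENS_SENT.inc()
            yield f"data: {token}\n\n"
    except (asyncio.CancelledError, GeneratorExit):
        DISCONNECTS.inc()                     # stream torn down early, e.g. client disconnect
        raise
    finally:
        STREAM_DURATION.observe(time.monotonic() - start)

Tokens/sec falls out of the rate of stream_tokens_total; the first-token histogram is the one that tracks where UX pain hides.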

7. Challenge example: AI pair coder streaming

Scenario: A live pair coder streams code suggestions as you type.

Goals:

  • Show first token < 200ms
  • Allow "stop generation" feedback
  • Resume after disconnection

graph TD
    A[IDE Plugin] -->|HTTP prompt| B[Gateway]
    B --> C[Model Runner]
    C --> D[Streamer]
    D -->|SSE→IDE| A
    A -->|WS feedback: stop| B
    D --> E[Telemetry]

Design questions:

  • What's the maximum number of open connections per instance?
  • How do you cancel a token stream mid-flight without leaking GPU cycles?
  • Can you back off the stream rate when client CPU usage spikes?
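
On the cancellation question: with SSE, one common approach is to check Starlette's request.is_disconnected() between tokens and tear the model task down when the client is gone. A sketch; the model_runner hook is hypothetical:

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):       # stand-in for the real model stream
    for t in ("Hello", ",", " world"):
        yield t

@app.post("/pair-coder")
async def pair_coder(request: Request):
    body = await request.json()

    async def event_stream():
        async for token in generate_tokens(body["prompt"]):
            if await request.is_disconnected():
                # Hypothetical hook: tell the model runner to free the slot
                # so GPU cycles are not spent on a stream nobody is reading.
                # await model_runner.cancel(body["stream_id"])
                break
            yield f"data: {token}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")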

8. Optimizing perceived latency

Perceived latency ≠ actual latency.

Users judge responsiveness by first visible token and smoothness.

Design tricks:

  • Emit "typing animation" placeholders before tokens arrive
  • Send short prefix predictions fast, then stream full response
  • Adapt chunk size to the user's network speed (see the sketch below)
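
A tiny sketch of that last trick: scale the flush size with a measured round-trip time. The RTT source and the thresholds are illustrative assumptions:

def chunk_size_for(rtt_ms: float) -> int:
    """Bigger chunks for slow links (fewer frames), smaller chunks for fast ones."""
    if rtt_ms < 50:
        return 2        # near-instant link: stream almost token by token
    if rtt_ms < 200:
        return 8
    return 32           # high-latency link: batch aggressively to stay smooth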

That's why ChatGPT- and Claude-style apps feel fast even when the underlying model is slow.

9. Architectural checklist for streaming readiness

  • Tokenized buffering and flush control
  • Heartbeat and graceful close
  • Reconnection support (Last-Event-ID)
  • Per-stream metrics (first token, duration, tokens/sec)
  • Hybrid data/control channels
  • Cancellation and backpressure design

If any of these are missing → expect timeouts and frustrated users.

Discussion prompts for engineers

  • Would you prioritize throughput or perceived latency for a consumer AI chat app?
  • Where would you place buffer boundaries in a multi-model chain?
  • How could you simulate network jitter and measure UX degradation?
  • If you had to choose one metric to optimize, would it be first-token latency or completion latency? Why?

Takeaway

  • Real-time AI systems aren't about protocols; they're about flow discipline.
  • Streaming forces you to engineer for asynchrony, faults, and human perception.

You're no longer serving responses — you're orchestrating continuous conversations between humans and models.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
