Streaming at Scale: SSE, WebSockets & Designing Real-Time AI APIs
The Challenge
Your chat app's users start typing faster than your LLM replies.
- Requests pile up
- Some clients disconnect mid-generation
- Metrics show spikes in open connections and token waste
Discussion: You're streaming tokens — why does latency still feel high?
1. Streaming is a system design problem — not a transport choice
Most engineers treat streaming as "pick SSE or WebSocket."
In reality, it's a system-wide coordination model.
```mermaid
flowchart LR
    A[User Input] --> B[Gateway]
    B --> C[Coordinator]
    C --> D[Model Runner]
    D --> E[Streamer]
    E --> F[Client UI]
    C --> G[Metrics & Backpressure]
```
Every component influences perceived streaming speed.
Streaming = how data flows through your architecture, not just the protocol.
2. The latency stack
| Layer | Typical Delay (ms) | Design Lever |
|---|---|---|
| Model first token | 300–2000 | Prompt trimming, model choice |
| Token serialization | 10–50 per token | Buffered vs unbuffered I/O |
| Gateway routing | 20–100 | Keep-alive, chunk flush |
| Client render | 30–100 | Frame batching |
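Adding up even the best case in that table gives roughly 300 + 10 + 20 + 30 ≈ 360 ms before a user sees anything, so no single layer can be tuned into a sub-250 ms first token on its own.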
Challenge: How can you deliver the first visible token in under 250 ms, even when total generation takes 5 s?
(Hint: predictive buffering and async render.)
3. Protocol trade-offs
Server-Sent Events (SSE)
Unidirectional HTTP stream.
```mermaid
sequenceDiagram
    participant Client
    participant Server
    Client->>Server: POST /chat
    loop tokens
        Server-->>Client: event: message\n data: token
    end
    Server-->>Client: event: done
```
Pros: Simple, plain HTTP, friendly to existing proxies and CDNs
Cons: One-way, limited reconnection control
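A minimal sketch of the SSE flow above, assuming a FastAPI gateway; `fake_model_stream` is a hypothetical stand-in for the model runner:

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def fake_model_stream(prompt: str):
    # Hypothetical stand-in for the model runner: yields tokens with a delay.
    for token in f"Echoing: {prompt}".split():
        await asyncio.sleep(0.05)
        yield token


@app.post("/chat")
async def chat(prompt: str):
    async def event_stream():
        # SSE framing: each event is "event: ...\ndata: ...\n\n".
        async for token in fake_model_stream(prompt):
            yield f"event: message\ndata: {token}\n\n"
        yield "event: done\ndata: \n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

Note that the browser's native EventSource API only issues GET requests, so a POST-based stream like this is usually consumed with fetch plus a stream reader.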
WebSockets
Bidirectional persistent connection.
```mermaid
sequenceDiagram
    participant Client
    participant Server
    Client->>Server: connect(ws)
    Client->>Server: prompt
    loop tokens
        Server-->>Client: token
    end
    Client->>Server: stop / feedback
```
Pros: Great for agents, interactive tools
Cons: Stateful, requires connection management + load balancing
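A minimal sketch of the WebSocket flow, again assuming FastAPI; the "stop" handling is what the bidirectional channel buys you:

```python
import asyncio

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


async def fake_model_stream(prompt: str):
    # Hypothetical stand-in for the model runner.
    for token in f"Echoing: {prompt}".split():
        await asyncio.sleep(0.05)
        yield token


@app.websocket("/ws/chat")
async def ws_chat(websocket: WebSocket):
    await websocket.accept()
    prompt = await websocket.receive_text()
    stop = asyncio.Event()

    async def control_listener():
        # The same socket carries control messages ("stop", feedback) mid-stream.
        try:
            while not stop.is_set():
                if await websocket.receive_text() == "stop":
                    stop.set()
        except WebSocketDisconnect:
            stop.set()

    listener = asyncio.create_task(control_listener())
    try:
        async for token in fake_model_stream(prompt):
            if stop.is_set():
                break
            await websocket.send_text(token)
    except (WebSocketDisconnect, RuntimeError):
        pass  # client went away mid-stream
    finally:
        listener.cancel()
```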
Hybrid model (real systems)
```mermaid
graph TD
    A[Frontend] -->|prompt| B[HTTP Gateway]
    B -->|subscribe| C[Message Broker]
    C --> D[Streamer Service]
    D -->|SSE→UI| E[User]
    A -->|control via WS| B
```
- Control plane (WebSocket): cancel, feedback
- Data plane (SSE): token flow
This hybrid pattern is used in production chat UIs like those from OpenAI and Anthropic.
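A minimal sketch of the split, assuming FastAPI and an in-process flag registry (a real deployment would route the cancel signal through the message broker shown above); names like `cancel_flags` and `fake_model_stream` are illustrative:

```python
import asyncio

from fastapi import FastAPI, WebSocket
from fastapi.responses import StreamingResponse

app = FastAPI()

# In-process registry of per-stream cancel flags. A real deployment keeps this
# in the broker/Redis so control and data planes can run on different instances.
cancel_flags: dict[str, asyncio.Event] = {}


async def fake_model_stream(prompt: str):
    # Hypothetical stand-in for the model runner.
    for token in f"Echoing: {prompt}".split():
        await asyncio.sleep(0.05)
        yield token


@app.get("/stream/{stream_id}")
async def data_plane(stream_id: str, prompt: str):
    cancel_flags[stream_id] = asyncio.Event()

    async def events():
        try:
            async for token in fake_model_stream(prompt):
                if cancel_flags[stream_id].is_set():
                    yield "event: cancelled\ndata: \n\n"
                    return
                yield f"data: {token}\n\n"
            yield "event: done\ndata: \n\n"
        finally:
            cancel_flags.pop(stream_id, None)

    return StreamingResponse(events(), media_type="text/event-stream")


@app.websocket("/control/{stream_id}")
async def control_plane(websocket: WebSocket, stream_id: str):
    await websocket.accept()
    if await websocket.receive_text() == "cancel" and stream_id in cancel_flags:
        cancel_flags[stream_id].set()
    await websocket.close()
```

Returning from the data-plane generator stops pulling tokens from the model runner, which is the part that keeps a cancelled stream from wasting further GPU cycles.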
4. Designing for backpressure and buffer control
When the model produces faster than clients consume:
```mermaid
graph LR
    M[Model Stream] -->|tokens| Q[Bounded Buffer]
    Q -->|flush chunks| C[Client]
    Q -->|pressure signal| M
```
Patterns (a buffering sketch follows this list):
- Dynamic chunk size (flush every N tokens or every T ms, whichever comes first)
- Drop policy for clients that fall too far behind
- Stream heartbeat: an "event: ping" frame every few seconds to keep the connection alive
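A minimal sketch of the bounded-buffer pattern, assuming an asyncio pipeline; the constants and function names are illustrative:

```python
import asyncio
import time

MAX_BUFFER = 256       # bounded queue: the producer blocks when the client lags
FLUSH_TOKENS = 8       # flush after N tokens...
FLUSH_INTERVAL = 0.25  # ...or T seconds, checked whenever a token arrives
HEARTBEAT = 15.0       # emit a ping if the model goes quiet


async def producer(model_stream, queue: asyncio.Queue):
    # queue.put() blocks once the buffer is full; that is the backpressure
    # signal flowing back toward the model side.
    async for token in model_stream:
        await queue.put(token)
    await queue.put(None)  # sentinel: end of stream


async def sse_chunks(queue: asyncio.Queue):
    chunk, last_flush = [], time.monotonic()
    while True:
        try:
            token = await asyncio.wait_for(queue.get(), timeout=HEARTBEAT)
        except asyncio.TimeoutError:
            yield "event: ping\ndata: \n\n"  # keep proxies and clients alive
            continue
        if token is None:
            break
        chunk.append(token)
        now = time.monotonic()
        if len(chunk) >= FLUSH_TOKENS or now - last_flush >= FLUSH_INTERVAL:
            yield f"data: {''.join(chunk)}\n\n"
            chunk, last_flush = [], now
    if chunk:
        yield f"data: {''.join(chunk)}\n\n"
    yield "event: done\ndata: \n\n"
```

Wire it up with `queue = asyncio.Queue(maxsize=MAX_BUFFER)`, run the producer as a task, and hand `sse_chunks(queue)` to the streaming response. A drop policy for hopelessly late clients would swap the blocking put for `put_nowait` plus an eviction rule (e.g. discard the oldest buffered chunk).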
Challenge: How would you prevent OOM if one user leaves a tab open and never reads from the stream?
5. Resilience & reconnection strategy
Streaming is fragile — clients disconnect often.
Reconnection Checklist:
- Client sends Last-Event-ID on reconnect
- Server resumes from the saved token index (or a summary)
- Partial state persisted (ephemeral store / Redis)
```mermaid
sequenceDiagram
    participant C as Client
    participant S as Server
    C->>S: connect (Last-Event-ID=250)
    S-->>C: resume token 251…
```
Think of it like "TCP for tokens."
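A minimal sketch of that resume path, assuming FastAPI and an in-memory token log as a stand-in for the Redis-backed store mentioned in the checklist:

```python
import asyncio
from typing import Optional

from fastapi import FastAPI, Header
from fastapi.responses import StreamingResponse

app = FastAPI()

# Hypothetical per-stream token log; a real system persists this in Redis
# (or another ephemeral store) with a short TTL so any instance can resume.
token_log: dict[str, list[str]] = {
    "demo": "resumable streams survive flaky networks just fine".split()
}


@app.get("/resume/{stream_id}")
async def resume(stream_id: str,
                 last_event_id: Optional[str] = Header(default=None)):
    # Browsers' EventSource re-sends Last-Event-ID automatically on reconnect,
    # so the server can pick up at index N+1 instead of regenerating everything.
    start = int(last_event_id) + 1 if last_event_id else 0

    async def events():
        for index, token in enumerate(token_log.get(stream_id, [])):
            if index < start:
                continue  # already delivered before the disconnect
            await asyncio.sleep(0.05)
            yield f"id: {index}\ndata: {token}\n\n"
        yield "event: done\ndata: \n\n"

    return StreamingResponse(events(), media_type="text/event-stream")
```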
6. Observability in streaming systems
You need continuous metrics, not post-hoc logs.
```mermaid
flowchart TD
    A[Streamer] --> B[Metrics]
    B --> C[Dashboard]
    B --> D[Alerting]
```
Metrics to collect:
- First token latency
- Tokens/sec throughput
- Stream duration distribution
- Error rate (4xx, 5xx, disconnects)
Visualize these in real time — latency histograms tell you where UX pain hides.
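A minimal sketch of per-stream measurement for the first two metrics, assuming the streamer exposes tokens as an async iterator; StreamMetrics is an illustrative name, not a library class:

```python
import time
from typing import AsyncIterator, Optional


class StreamMetrics:
    """Wraps a token stream and records first-token latency and throughput."""

    def __init__(self) -> None:
        self.first_token_ms: Optional[float] = None
        self.tokens = 0
        self.duration_s = 0.0

    async def observe(self, token_stream: AsyncIterator[str]) -> AsyncIterator[str]:
        start = time.monotonic()
        async for token in token_stream:
            now = time.monotonic()
            if self.first_token_ms is None:
                self.first_token_ms = (now - start) * 1000
            self.tokens += 1
            self.duration_s = now - start
            yield token
        # At stream end, ship these to your metrics backend: histograms for
        # first_token_ms and duration_s, a gauge or counter for tokens/sec.

    @property
    def tokens_per_sec(self) -> float:
        return self.tokens / self.duration_s if self.duration_s else 0.0
```

Usage: wrap the model iterator (`async for tok in metrics.observe(model_stream): ...`) and export the recorded fields once per stream.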
7. Challenge example: AI pair coder streaming
Scenario: A live pair coder streams code suggestions as you type.
Goals:
- Show first token < 200ms
- Allow "stop generation" feedback
- Resume after disconnection
```mermaid
graph TD
    A[IDE Plugin] -->|HTTP prompt| B[Gateway]
    B --> C[Model Runner]
    C --> D[Streamer]
    D -->|SSE→IDE| A
    A -->|WS feedback: stop| B
    D --> E[Telemetry]
```
Design questions:
- What's your maximum open connections per instance?
- How do you cancel a token stream mid-flight without leaking GPU cycles?
- Can you back off the stream rate when client CPU usage spikes?
8. Optimizing perceived latency
Perceived latency ≠ actual latency.
Users judge responsiveness by first visible token and smoothness.
Design tricks:
- Emit "typing animation" placeholders before tokens arrive
- Send short prefix predictions fast, then stream full response
- Adapt chunk size to user network speed
That's why ChatGPT- and Claude-style UIs feel fast even when the underlying model is slow.
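A minimal sketch of the first trick above (a placeholder flushed before any model token exists), framed as SSE events; the event names are illustrative:

```python
async def perceived_fast_events(model_stream):
    # Flush something renderable immediately so the UI can start its typing
    # animation at ~0 ms, long before the model's first real token arrives.
    yield "event: status\ndata: thinking\n\n"
    async for token in model_stream:
        yield f"data: {token}\n\n"
    yield "event: done\ndata: \n\n"
```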
9. Architectural checklist for streaming readiness
- Tokenized buffering and flush control
- Heartbeat and graceful close
- Reconnection support (Last-Event-ID)
- Per-stream metrics (first token, duration, tokens/sec)
- Hybrid data/control channels
- Cancellation and backpressure design
If any of these are missing → expect timeouts and frustrated users.
Discussion prompts for engineers
- Would you prioritize throughput or perceived latency for a consumer AI chat app?
- Where would you place buffer boundaries in a multi-model chain?
- How could you simulate network jitter and measure UX degradation?
- If you had to choose one metric to optimize, would it be first-token latency or completion latency? Why?
Takeaway
- Real-time AI systems aren't about protocols; they're about flow discipline.
- Streaming forces you to engineer for asynchrony, faults, and human perception.
You're no longer serving responses — you're orchestrating continuous conversations between humans and models.
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.