
Why Streaming Breaks and How to Fix It

Sysadmins have a saying: "It's always DNS." In LLM land, when streaming doesn't work, it's always buffering.

The model logs show a token going out every 30ms. The monitoring says streaming is enabled. But users see a blank screen for six seconds, then all the text appears at once. Like a dam breaking.

The problem is never the model. It's the plumbing.

Water Through Pipes

Think of streaming like water through pipes. Your LLM is the source, trickling out tokens one by one. But between source and tap, there are holding tanks. Nginx. CloudFlare. The browser itself. Each one can collect your trickle into a reservoir, then dump it all at once.

Nginx is the usual suspect. It ships with proxy_buffering on by default. The name sounds innocent. What it actually does: collect the upstream response into its own buffers and only pass data along once a buffer fills or the response ends. Great for serving a 50KB JSON payload. Catastrophic for a 30-second stream of token-sized chunks.

# This single line destroys streaming
proxy_buffering on;  # Default. Silent killer.

# The fix
proxy_buffering off;

But even with proxy_buffering off, Nginx can't forward anything until your upstream sends its response headers. If your application waits on the model's first token before it even starts the response, users wait too. The first byte is still stuck behind that delay.
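
One mitigation is to put a byte on the wire before the model has produced anything. Here's a minimal sketch of the idea as an async generator you can wrap around your real event source; it leans on the fact that SSE treats any line starting with ':' as a comment and ignores it:

async def with_early_flush(events):
    # Push headers and a first byte through the proxy right away.
    # SSE comment lines (starting with ':') are ignored by clients.
    yield ": connected\n\n"
    # Then hand over to the real event source: any async iterable of
    # SSE-framed strings.
    async for event in events:
        yield event

Wrap your token generator in it and the blank-screen window shrinks from "waiting on the first token" to a single round trip.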

For SSE specifically:

proxy_http_version 1.1;
proxy_set_header Connection '';
chunked_transfer_encoding off;
proxy_buffering off;
proxy_cache off;

That Connection '' header is counterintuitive. Nginx sends Connection: close to the upstream by default; clearing the header (together with proxy_http_version 1.1) lets the upstream connection stay open for the life of the stream, without Nginx trying to be smart about when it ends.

Headers Matter

The headers your response sends matter just as much. SSE requires Content-Type: text/event-stream. Miss it, and EventSource rejects the response outright, while every proxy in between feels free to treat it as an ordinary document to buffer.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/stream")
async def stream():
    return StreamingResponse(
        generate_tokens(),  # async generator yielding SSE-framed strings
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache, no-store",
            "X-Accel-Buffering": "no",  # Nginx escape hatch
        },
    )

That X-Accel-Buffering header is the backdoor. It tells Nginx to skip buffering even when the global config says otherwise. Useful when you can't change nginx.conf but own the application.
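
What generate_tokens actually yields matters too: SSE frames each event as one or more data: lines terminated by a blank line. A hedged sketch, assuming your model client exposes an async iterator of text chunks (model_stream here is a stand-in, not any particular library's API):

async def generate_tokens():
    async for chunk in model_stream():  # stand-in for your model client's iterator
        # One SSE event: a "data:" line plus a blank line.
        # (Assumes chunks contain no newlines; split across data: lines if they do.)
        yield f"data: {chunk}\n\n"
    # A common end-of-stream sentinel, if your client looks for one.
    yield "data: [DONE]\n\n"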

The Timeout Trap

HTTP was designed for request-response cycles measured in milliseconds. LLM streams run for seconds, sometimes minutes. Default timeouts of 60s sound generous until your 70B model needs 90 seconds for a complex query. Connection drops. Client retries. User sees nothing.

proxy_read_timeout 300s;  # Five minutes of headroom for long generations.

I've seen teams set read timeouts to 5 minutes and call it paranoid. It's not paranoid. It's production.
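
The same trap exists inside your own clients. Most HTTP libraries default to read timeouts tuned for request-response, not for a stream that goes quiet while the model works on a hard span. A sketch with Python's httpx, pointing at the local endpoint used later in this post:

import httpx

# Keep the connect timeout tight so a dead host fails fast, but give the
# stream itself the same five minutes the proxy gets.
timeout = httpx.Timeout(10.0, read=300.0)

with httpx.Client(timeout=timeout) as client:
    with client.stream("GET", "http://localhost:8000/stream") as response:
        for line in response.iter_lines():
            print(line)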

Client-Side Mistakes

On the client side, the mistake is subtle:

// Looks right. Buffers everything.
const response = await fetch(url);
const text = await response.text();  // resolves only after the last byte arrives

// Actually streams
const streaming = await fetch(url);
const reader = streaming.body.getReader();
const decoder = new TextDecoder();
while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = decoder.decode(value, { stream: true });
    // chunks arrive as they're generated, not all at once
}

The text() method is doing exactly what it promises: returning the complete text. That means waiting for the complete text. Reading the body manually is more code (EventSource will do it for you if you stick to SSE), but nothing that waits on the full response will ever feel like a stream.

Debugging Sequence

When streaming "doesn't work," here's my debugging sequence:

First, bypass everything:

curl -N http://localhost:8000/stream

The -N flag disables curl's output buffering. If tokens trickle here, your model is fine. The problem is downstream.

Next, add back one layer at a time. Hit the same endpoint through your reverse proxy. Then through your load balancer. Then through your CDN. The layer where streaming stops is the layer with the problem.
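
If you'd rather not eyeball curl output at each hop, a short script that prints the gap between chunks makes the guilty layer obvious. This is a sketch using httpx; run it against each URL in turn and watch for the layer where the gaps vanish and everything lands in one late burst:

import sys
import time
import httpx

url = sys.argv[1]  # the app, then the proxy, then the load balancer, then the CDN

with httpx.Client(timeout=httpx.Timeout(10.0, read=300.0)) as client:
    with client.stream("GET", url) as response:
        last = time.monotonic()
        for chunk in response.iter_bytes():
            now = time.monotonic()
            print(f"+{now - last:6.3f}s  {len(chunk):5d} bytes")
            last = now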

The CDN Problem

CloudFlare deserves special mention. CDNs are designed to cache complete responses. Caching means buffering. Your streaming endpoints need explicit bypass rules. In CloudFlare's dashboard, find Page Rules, set Cache Level to Bypass. Or use the Cache-Control: no-store header and hope your CDN respects it.
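
Whether the bypass actually took is easy to verify from the response headers: CloudFlare reports what it did in a cf-cache-status header, where BYPASS or DYNAMIC means it isn't serving from cache and HIT means your "stream" came out of one. A quick check, with the URL as a placeholder for your public endpoint:

import httpx

# Open the stream just long enough to read headers, then bail.
with httpx.Client() as client:
    with client.stream("GET", "https://example.com/stream") as response:
        print(response.headers.get("cf-cache-status"))  # want BYPASS or DYNAMIC
        print(response.headers.get("cache-control"))    # should still say no-store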

The Three-Week Optimization

I watched a team spend three weeks optimizing model performance. They got token generation down from 40ms to 25ms. Beautiful numbers in the logs. Their streaming worked perfectly on localhost.

In production, a DevOps engineer had deployed Nginx with default settings six months earlier. Nobody had touched it since. Nobody knew proxy_buffering existed.

One config change. Problem solved. Three weeks of model optimization that users never felt.

Streaming LLM responses is straightforward in concept. In practice, every infrastructure layer defaults to batching. You have to explicitly disable buffering at each one.

The model streams. The question is whether anything between model and user will let those tokens through.