The Streaming Bug That Costs You 3 Seconds
The code is correct. Streaming is enabled. The API returns a stream. But users wait 3 seconds before seeing anything.
This is the most common invisible performance bug in LLM applications. The stream is being buffered somewhere between the model and the user.
Where Streaming Breaks
Model → [API Gateway] → [Load Balancer] → [Reverse Proxy] → [Your Backend] → [CDN] → Browser
Any layer can buffer. Most do by default.
Nginx: The Usual Suspect
Nginx buffers responses by default. This is great for normal HTTP responses. It's terrible for streaming.
# Default behavior: buffers everything
location /api/chat {
    proxy_pass http://backend;
    # Implicitly: proxy_buffering on;
}

# What you need for streaming
location /api/chat {
    proxy_pass http://backend;
    proxy_buffering off;              # forward bytes as soon as the upstream sends them
    proxy_cache off;                  # never serve this route from cache
    proxy_http_version 1.1;           # required for keepalive to the upstream
    proxy_set_header Connection '';   # clear the default "Connection: close" so the connection stays open
    chunked_transfer_encoding off;
}
The proxy_buffering off directive is the essential one. Without it, Nginx collects the response in its own buffers and forwards it in large chunks, or all at once, instead of relaying each token as it arrives.
AWS ALB/ELB
AWS load balancers have their own buffering behavior:
# ALB default: buffers responses
# You need to configure your target group
# For streaming, ensure:
# 1. Target group protocol is HTTP/1.1 or HTTP/2
# 2. Idle timeout is sufficient (default 60s may be too short for long generations)
# 3. Deregistration delay won't kill active streams
Check both the load balancer and the target group settings. The idle timeout is a load balancer attribute, 60 seconds by default, and it counts time with no data on the connection: a slow first token or a stall mid-generation will drop the stream even when the overall request is healthy.
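If you manage the load balancer with boto3, you can inspect and raise the idle timeout in a few lines. A minimal sketch; the ARN is a placeholder for your ALB's, and 300 seconds is just an example ceiling for long generations:

import boto3

elbv2 = boto3.client("elbv2")
alb_arn = "arn:aws:elasticloadbalancing:..."  # placeholder: your ALB's ARN

# Read the current idle timeout
attrs = elbv2.describe_load_balancer_attributes(LoadBalancerArn=alb_arn)
current = {a["Key"]: a["Value"] for a in attrs["Attributes"]}
print("idle timeout:", current.get("idle_timeout.timeout_seconds"), "seconds")

# Raise it so slow generations aren't cut off (300s is an arbitrary example)
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=alb_arn,
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "300"}],
)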
Cloudflare and CDNs
CDNs cache and buffer by default. For streaming endpoints:
# Cloudflare: Disable caching for streaming routes
# Page Rule: /api/chat/* → Cache Level: Bypass
# Or via headers in your response:
Cache-Control: no-cache, no-store, must-revalidate
X-Accel-Buffering: no
The X-Accel-Buffering: no header tells Nginx (and other proxies and hosting platforms that honor it) not to buffer this particular response, so you can leave buffering on globally and disable it only for streaming routes.
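A quick sanity check is to look at the headers the client actually receives, since any layer can strip or override them. A sketch using requests; the URL and payload are placeholders, and CF-Cache-Status only shows up if Cloudflare sits in front of your endpoint:

import requests

# Placeholder endpoint and payload; substitute your own
resp = requests.post(
    "https://example.com/api/chat",
    json={"prompt": "hello"},
    stream=True,  # don't buffer the body on the client side
)

# These should come back exactly as your app set them
print("Cache-Control:    ", resp.headers.get("Cache-Control"))
print("X-Accel-Buffering:", resp.headers.get("X-Accel-Buffering"))

# Added by Cloudflare; BYPASS or DYNAMIC means the response wasn't served from cache
print("CF-Cache-Status:  ", resp.headers.get("CF-Cache-Status"))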
Your Application Server
Even your application code might buffer:
# Flask: streams by default with a Response generator (see the sketch after the FastAPI example)
# But if you're using a WSGI server...
# Gunicorn with sync workers: streams, but each active stream ties up a worker
# Gunicorn with async workers (gevent/eventlet): usually fine
# uWSGI behind Nginx: uwsgi_buffering off (the uwsgi-protocol counterpart of proxy_buffering)
# FastAPI/Starlette: Streams correctly with StreamingResponse
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str

@app.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        # model is your streaming LLM client (OpenAI SDK, vLLM, etc.)
        async for token in model.stream(request.prompt):
            yield f"data: {token}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        },
    )
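If you're on Flask instead, the equivalent looks roughly like this. A sketch that assumes a synchronous model.stream() client; keep in mind that under Gunicorn each active stream occupies a worker for its full duration, so async workers help:

from flask import Flask, Response, request, stream_with_context

app = Flask(__name__)

@app.post("/chat")
def chat():
    prompt = request.json["prompt"]

    def generate():
        # model.stream() is assumed to be your (synchronous) LLM client
        for token in model.stream(prompt):
            yield f"data: {token}\n\n"

    return Response(
        stream_with_context(generate()),
        mimetype="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        },
    )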
How to Debug
Add timestamps at each layer:
import json
import time

async def generate_with_timing(prompt: str):
    # Timestamp the moment the stream opens, then every token after it
    yield f"data: {json.dumps({'debug': 'stream_start', 't': time.time()})}\n\n"
    async for token in model.stream(prompt):
        yield f"data: {json.dumps({'token': token, 't': time.time()})}\n\n"
On the client, log when each chunk arrives:
const response = await fetch('/api/chat', { method: 'POST', body: ... });
const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const text = decoder.decode(value, { stream: true });
  console.log(`Received at ${Date.now()}:`, text);
}
If server timestamps show tokens arriving 50ms apart but client sees them arriving in batches, something between them is buffering.
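To narrow down which hop is batching, run the same measurement against the backend directly and against the public URL, then compare the gaps between chunks. A sketch using requests; both URLs and the payload are placeholders for your own setup:

import time
import requests

def chunk_gaps(url: str, payload: dict) -> None:
    """Print the arrival gap for each chunk of a streaming response."""
    last = time.time()
    with requests.post(url, json=payload, stream=True) as resp:
        for chunk in resp.iter_content(chunk_size=None):  # yield chunks as they arrive
            now = time.time()
            print(f"{url}: +{(now - last) * 1000:.0f} ms, {len(chunk)} bytes")
            last = now

payload = {"prompt": "hello"}
chunk_gaps("http://localhost:8000/chat", payload)    # straight to the backend
chunk_gaps("https://example.com/api/chat", payload)  # through the full stack

# If the direct run shows steady ~50 ms gaps and the public run shows a few
# multi-second bursts, the buffering is in between: proxy, load balancer, or CDN.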
The Checklist
When streaming feels slow, check in order:
- CDN/Edge: Bypass caching for streaming routes
- Load balancer: Increase idle timeout, check buffering settings
- Reverse proxy (Nginx): proxy_buffering off
- Application server: Use async workers, check framework settings
- Application code: Return proper streaming response with correct headers
- Client code: Process chunks as they arrive, don't wait for complete response
Each layer adds latency when misconfigured. A properly configured stack delivers tokens within milliseconds of generation. A misconfigured one batches them into multi-second chunks.
The stream is only as fast as its slowest buffer.