Balancing Fast Responses and Fair Queuing
A 10,000-token request takes 20 seconds. Behind it, a hundred 50-token requests wait. Is that fair? What even is fair in LLM serving?
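One common answer in practice is to weigh requests by token cost rather than arrival order. Below is a minimal sketch of that idea, assuming a virtual-finish-time queue keyed on estimated tokens; the class, names, and cost model are illustrative, not taken from any particular serving framework.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    virtual_finish: float                    # finish time in fair-share "virtual time"
    request_id: str = field(compare=False)
    estimated_tokens: int = field(compare=False)

class TokenFairQueue:
    """Order requests by virtual finish time so one 10,000-token request
    cannot indefinitely delay a stream of 50-token requests."""

    def __init__(self):
        self._heap = []
        self._virtual_time = 0.0             # advances as tokens are served

    def submit(self, request_id, estimated_tokens):
        # A request "finishes" at the current virtual time plus its own token
        # cost, so short requests sort ahead of a long one that arrived earlier.
        finish = self._virtual_time + estimated_tokens
        heapq.heappush(self._heap, Request(finish, request_id, estimated_tokens))

    def next_request(self):
        request = heapq.heappop(self._heap)
        self._virtual_time = max(self._virtual_time, request.virtual_finish)
        return request

queue = TokenFairQueue()
queue.submit("long", estimated_tokens=10_000)
for i in range(3):
    queue.submit(f"short-{i}", estimated_tokens=50)
print([queue.next_request().request_id for _ in range(4)])
# the three short requests drain before the long one
```

A real scheduler would add aging or per-tenant shares on top of this, otherwise a steady stream of short requests can starve the long one indefinitely.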
A queue of 100 requests sounds like a uniform unit of work. But one 50k-token request consumes more resources than 99 short ones combined. Batch by tokens, not requests.
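A minimal sketch of what that means in practice, assuming a greedy scheduler with a per-batch token budget; the budget value and helper name are made up for illustration.

```python
def build_batch(pending, max_batch_tokens=8192):
    """Fill a batch up to a token budget instead of a request count.

    `pending` is a list of (request_id, token_count) pairs; the 8192 budget
    is illustrative, not a recommendation.
    """
    batch, used = [], 0
    for request_id, tokens in pending:
        if batch and used + tokens > max_batch_tokens:
            break                              # budget spent, ship what we have
        batch.append(request_id)
        used += tokens
    return batch, used

# One huge request fills a batch on its own; short requests pack together.
print(build_batch([("big", 50_000)]))                  # (['big'], 50000)
print(build_batch([("a", 50), ("b", 50), ("c", 50)]))  # (['a', 'b', 'c'], 150)
```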
Traffic spikes 10x. Do you queue requests until OOM, drop them randomly, or gracefully degrade? The answer shapes your system's behavior under pressure.
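One answer that tends to age well is explicit admission control: bound the queue and shed load early with a retriable error instead of queuing until memory runs out. A sketch, assuming asyncio, with limits that are placeholders rather than recommendations:

```python
import asyncio

class AdmissionController:
    """Bound in-flight work explicitly and fail fast when the bound is hit,
    rather than letting an unbounded queue grow until the process dies."""

    def __init__(self, max_concurrent=8, max_queued=64):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_in_flight = max_concurrent + max_queued
        self._in_flight = 0

    async def handle(self, work):
        if self._in_flight >= self._max_in_flight:
            # Graceful degradation: a fast, retriable rejection (an HTTP 429,
            # say) instead of accepting work we cannot serve in time.
            raise RuntimeError("overloaded, retry later")
        self._in_flight += 1
        try:
            async with self._slots:            # wait for a free execution slot
                return await work()
        finally:
            self._in_flight -= 1
```

Dropping randomly or evicting by priority are variations on the same theme: the limit is chosen deliberately instead of being discovered by the OOM killer.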
You can have 1000 tokens per second with 3-second latency, or 200 tokens per second with 200ms latency. You cannot have both. Here's how to choose.
vLLM doesn't use a faster model. It uses memory smarter. PagedAttention treats KV cache like virtual memory, and the results are dramatic.
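A toy sketch of the idea only, not vLLM's actual data structures: the KV cache is carved into fixed-size blocks, and each sequence keeps a table mapping its logical blocks to physical ones, allocated on demand like pages of virtual memory. Block and pool sizes below are arbitrary.

```python
class ToyPagedKVCache:
    """Illustration only: block-granular KV cache bookkeeping in the spirit
    of PagedAttention. No tensors, just the allocation logic."""

    def __init__(self, num_blocks=1024, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}          # seq_id -> list of physical block ids
        self.lengths = {}               # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a new block only when
        the previous one is full instead of pre-reserving the max length."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = ToyPagedKVCache()
for _ in range(40):
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]))  # 3 blocks cover 40 tokens at block_size=16
```

The dramatic part is what this replaces: reserving a contiguous, maximum-length KV region per request, most of which sits empty.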
Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fills those gaps. The difference is 3-5x throughput.
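The control flow, sketched without a real model; request sizes and the batch limit are made up. The point is only that slots are refilled at every decode step, whereas naive static batching returns a batch's results when its slowest member finishes.

```python
from collections import deque

def continuous_batching(pending, max_batch=8):
    """Toy scheduler loop: evict finished sequences and admit waiting ones at
    every decode step, so no slot idles while the slowest request finishes."""
    waiting = deque(pending)
    running = {}                        # request id -> tokens still to generate
    finish_step = {}
    step = 0
    while waiting or running:
        # Refill free slots every step, not only when the whole batch is done.
        while waiting and len(running) < max_batch:
            req_id, tokens = waiting.popleft()
            running[req_id] = tokens
        step += 1
        # One decode step produces one token for every running sequence.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                del running[req_id]
                finish_step[req_id] = step
    return finish_step

requests = [("long", 500)] + [(f"short-{i}", 20) for i in range(20)]
done = continuous_batching(requests)
print(done["short-0"], done["long"])    # 20 500: short requests return in 20
                                        # steps instead of riding along for 500
```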
Single-user latency was 200ms. At 100 concurrent users, it's 3 seconds. The model didn't slow down. Your serving architecture did.
5% of requests fail. You retry 3 times. That's not 5% overhead. It's 15%. And under pressure, it gets much worse.
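The arithmetic, spelled out under the assumption that retried attempts also fail, which is exactly what an incident looks like:

```python
def retry_overhead(failure_rate, max_retries):
    """Extra traffic generated when each failing request is retried
    `max_retries` times and the retries fail too."""
    return failure_rate * max_retries

print(f"{retry_overhead(0.05, 3):.0%}")   # 15% extra load at a 5% failure rate
print(f"{retry_overhead(0.30, 3):.0%}")   # 90% extra load once the backend is struggling
```

And that extra load lands on the very backend that is already failing, which is how retry storms start.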
FlashAttention claims 2-4x speedup. CUDA graphs claim 10x. What actually helps in production, and what's just good marketing?
That benchmark showing 10,000 tokens/second? It probably used batch size 64 and measured mean latency. Here's how to benchmark for reality.
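A sketch of a more honest harness: open-loop Poisson arrivals at the rate you actually expect, and tail percentiles instead of the mean. Here `send_request` is a placeholder for your real client call, and the rate and duration are made up.

```python
import asyncio, random, statistics, time

async def send_request(prompt):
    """Placeholder: swap in a real call to your serving endpoint."""
    await asyncio.sleep(0.2)                    # stand-in for network + inference

async def benchmark(rate_per_s=5.0, duration_s=60, seed=0):
    """Open-loop load: requests keep arriving on a Poisson process no matter
    how slow the server gets, unlike a fixed-batch-size closed loop."""
    rng = random.Random(seed)
    latencies, tasks = [], []

    async def timed_call():
        start = time.perf_counter()
        await send_request("hello")
        latencies.append(time.perf_counter() - start)

    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        tasks.append(asyncio.create_task(timed_call()))
        await asyncio.sleep(rng.expovariate(rate_per_s))   # Poisson arrivals
    await asyncio.gather(*tasks)

    q = statistics.quantiles(latencies, n=100)             # 1st..99th percentile
    print(f"p50={q[49]:.3f}s  p95={q[94]:.3f}s  p99={q[98]:.3f}s")

asyncio.run(benchmark())
```

Report time-to-first-token and total latency separately, and report both as percentiles; a mean over mixed request sizes hides exactly the tail users feel.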