How Speculative Decoding Works
A small model proposes tokens; a large model verifies them in parallel. When the predictions match, you get a 2-3x speedup. When they don't, you're no worse off.
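To make that concrete, here is a minimal sketch of the greedy draft-and-verify loop. The names `draft_next` and `target_verify` are assumed interfaces for illustration, not any particular serving framework's API: the first is the small model's greedy next-token call, the second is the large model's greedy pick at each proposed position, computed in one parallel forward pass. This is the exact-match variant; production systems use rejection sampling over the two models' probabilities so that sampled outputs still follow the target model's distribution.

```python
# Minimal sketch of greedy speculative decoding (assumed interfaces,
# not a specific library's API).
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],                      # small model: greedy next token
    target_verify: Callable[[List[int], List[int]], List[int]],  # large model: greedy pick at each proposed position
    max_new_tokens: int = 128,
    k: int = 4,                                                  # tokens proposed per step
) -> List[int]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1. Draft: the small model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2. Verify: the large model scores all k positions in one parallel
        #    forward pass; verified[i] is its greedy pick given tokens + proposal[:i].
        verified = target_verify(tokens, proposal)

        # 3. Accept the longest prefix where draft and target agree.
        accepted = 0
        for d, t in zip(proposal, verified):
            if d != t:
                break
            accepted += 1

        # Matched tokens are free extra progress; on the first mismatch we keep
        # the target model's own token, so the output is exactly what the
        # target model would have produced decoding on its own.
        step = proposal[:accepted]
        if accepted < k:
            step.append(verified[accepted])
        tokens.extend(step)
        generated += len(step)

    return tokens[: len(prompt) + max_new_tokens]
```

The key property: every accepted token is one the large model would have emitted anyway, and even a full mismatch still yields the large model's own token for that step, which is why a bad draft leaves you no worse off than ordinary decoding.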