How Speculative Decoding Works
A small model proposes tokens; a large model verifies them in parallel. When the predictions match, you get a 2-3x speedup. When they don't, you're no worse off.
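To make that concrete, here is a minimal sketch of the greedy draft-and-verify loop. The names `draft_next` and `target_verify` are assumed interfaces for illustration, not any particular serving framework's API: the first is the small model's greedy next-token call, the second is the large model's greedy pick at each proposed position, computed in one parallel forward pass. This is the exact-match variant; production systems use rejection sampling over the two models' probabilities so that sampled outputs still follow the target model's distribution.

```python
# Minimal sketch of greedy speculative decoding (assumed interfaces,
# not a specific library's API).
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],                      # small model: greedy next token
    target_verify: Callable[[List[int], List[int]], List[int]],  # large model: greedy pick at each proposed position
    max_new_tokens: int = 128,
    k: int = 4,                                                  # tokens proposed per step
) -> List[int]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1. Draft: the small model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2. Verify: the large model scores all k positions in one parallel
        #    forward pass; verified[i] is its greedy pick given tokens + proposal[:i].
        verified = target_verify(tokens, proposal)

        # 3. Accept the longest prefix where draft and target agree.
        accepted = 0
        for d, t in zip(proposal, verified):
            if d != t:
                break
            accepted += 1

        # Matched tokens are free extra progress; on the first mismatch we keep
        # the target model's own token, so the output is exactly what the
        # target model would have produced decoding on its own.
        step = proposal[:accepted]
        if accepted < k:
            step.append(verified[accepted])
        tokens.extend(step)
        generated += len(step)

    return tokens[: len(prompt) + max_new_tokens]
```

The key property: every accepted token is one the large model would have emitted anyway, and even a full mismatch still yields the large model's own token for that step, which is why a bad draft leaves you no worse off than ordinary decoding.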