
Four Metrics That Actually Matter for LLM Inference

Database engineers learned this lesson decades ago: mean latency is a lie. One slow query in a hundred is enough to make users think the whole system is broken. LLM inference is learning the same lesson, the hard way.

I've watched teams optimize mean latency for months, then wonder why users still complain. The dashboard showed 180ms. Users experienced 3 seconds. Both numbers were accurate. They were just measuring different things.

What Users Actually Feel

When you send a prompt to an LLM, the response doesn't arrive as a single packet. It streams. And that stream has a rhythm that a single "latency" number can't capture.

Time to First Token (TTFT) is the silence before the model speaks. The seconds between hitting send and seeing that first character appear. It's what users perceive as "thinking time."

TTFT is like an AWS Lambda cold start: the first token pays a setup cost that the rest of the stream doesn't. Under the hood, this is prefill: the model processes your entire prompt through attention, building the KV cache it needs for generation. Long prompts mean longer TTFT. There's no escaping the physics.

For interactive use cases, 500ms TTFT is the upper bound of acceptable. Beyond that, users start wondering if the request went through.
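
If you want the number users actually feel, measuring TTFT client-side is enough to start. A minimal sketch, assuming an OpenAI-compatible streaming endpoint (the model name and prompt are placeholders):

import time
from openai import OpenAI
client = OpenAI()  # any OpenAI-compatible server; assumes credentials are configured
start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain TTFT in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # time from sending the request until the first content token arrives
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break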

Inter-Token Latency (ITL) is the gap between tokens once streaming starts. Consistent ITL feels smooth. Variable ITL feels choppy, like a video buffering.

avg_itl = (time_last_token - time_first_token) / (total_tokens - 1)
# Target: 20-50ms for natural reading speed
# Humans read at ~250 words/minute anyway

ITL above 100ms is perceptible. Users notice the text... arriving... in... chunks.
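
The average hides that choppiness, so it's worth keeping the individual gaps as well. A small sketch, assuming token_times is a list of per-token arrival timestamps you recorded while streaming:

# Per-token gaps reveal the choppiness that avg_itl smooths over
gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
avg_itl = sum(gaps) / len(gaps)
worst_gap = max(gaps)  # one 400ms stall ruins an otherwise smooth 30ms stream
print(f"avg ITL: {avg_itl * 1000:.0f} ms, worst gap: {worst_gap * 1000:.0f} ms")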

End-to-End Latency (E2EL) is the full duration from request to complete response. For streaming, it's TTFT plus all the inter-token gaps. For batch APIs that don't stream, it's the only number you get.

The relationship matters: E2EL ≈ TTFT + ITL × (token_count − 1). A 10-second E2EL with a 200ms TTFT feels responsive. A 3-second E2EL with a 2.5-second TTFT feels broken. The slower response wins on experience.
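
Plugging numbers into that relationship makes the contrast concrete (illustrative figures, not benchmarks):

# Longer total, better experience: the silence before the first token is what users feel
ttft_a, itl_a, tokens_a = 0.2, 0.049, 201   # 200ms TTFT, ~49ms ITL
e2el_a = ttft_a + itl_a * (tokens_a - 1)    # ~10.0s, feels responsive
ttft_b, itl_b, tokens_b = 2.5, 0.025, 21    # 2.5s of silence before anything appears
e2el_b = ttft_b + itl_b * (tokens_b - 1)    # ~3.0s, feels broken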

The Throughput Trap

Throughput measures tokens per second across all concurrent requests. It's the capacity of your system.

throughput = total_tokens_generated / time_window

The trap: throughput and latency are not friends. Higher batch sizes increase throughput but also increase latency. You can have 1000 tokens/second with 3-second latency, or 200 tokens/second with 200ms latency.
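
A toy batching model shows the shape of the trade-off (the step times are assumed, not measured):

# Each decode step emits one token per request in the batch;
# bigger batches make each step slower but produce more tokens per step
for batch_size, step_time in [(1, 0.02), (16, 0.08), (128, 0.4)]:
    throughput = batch_size / step_time  # tokens/sec across all requests
    itl = step_time                      # every request waits a full step per token
    print(f"batch={batch_size}: {throughput:.0f} tok/s total, {itl * 1000:.0f} ms ITL each")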

Teams celebrating raw throughput numbers without checking latency are about to learn why their users are churning.

P99 Tells the Truth

Here's a production example:

Percentile    Latency
P50           180ms
P95           420ms
P99           2,400ms
Mean          250ms

Mean says 250ms. Looks great on a dashboard. But 1 in 100 users waits 10x longer than average. That's the user who tweets about your slow AI. That's the enterprise customer whose IT department blocks your app.

P99 is like the traffic-adjusted ETA on Google Maps: the average drive time doesn't help if you're the one stuck in the 1-in-100 traffic jam. You need to know the worst case, not the typical case.

Always track P99. Preferably P99.9 if you have the volume. Mean is a vanity metric.
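
Computing these from raw samples is trivial; the discipline is storing the raw samples instead of a pre-averaged number. A sketch, assuming latencies_ms holds one latency sample per request:

import numpy as np
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"mean={np.mean(latencies_ms):.0f}ms p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")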

Goodput: The Metric That Matters

Here's one most teams miss entirely.

Goodput is to throughput what revenue is to GMV (gross merchandise value). It's the tokens that actually count.

goodput = successful_useful_tokens / time_window

A system generating 1000 tok/s with 20% failed or retried requests has 800 tok/s goodput. A system generating 500 tok/s with 99% success rate has 495 tok/s goodput. Which would you rather run in production?

Failed requests, timeouts, malformed outputs, retries: they all eat into goodput. If you're not tracking it, you don't know how much value you're actually delivering.
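
Tracking it doesn't take much, assuming each request record in the window carries a success flag and its token count (the record shape here is illustrative):

# Goodput counts only tokens from requests that actually succeeded
useful_tokens = sum(tokens for succeeded, tokens in window_requests if succeeded)
all_tokens = sum(tokens for _, tokens in window_requests)
goodput = useful_tokens / window_seconds
throughput = all_tokens / window_seconds
print(f"throughput: {throughput:.0f} tok/s, goodput: {goodput:.0f} tok/s")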

Where to Start

If you're instrumenting from scratch, start with three numbers (a minimal sketch follows the list):

  1. P99 TTFT. This is user experience.
  2. P99 E2EL. This is SLA compliance.
  3. Goodput. This is actual value delivered.
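
A single per-request record is enough to derive all three. A minimal sketch, with illustrative field names rather than any particular metrics library:

import numpy as np
from dataclasses import dataclass

@dataclass
class RequestRecord:
    ttft_s: float     # time to first token
    e2el_s: float     # end-to-end latency
    tokens: int       # tokens generated
    succeeded: bool   # no timeout, error, or retry

def report(records, window_seconds):
    return {
        "p99_ttft_s": np.percentile([r.ttft_s for r in records], 99),
        "p99_e2el_s": np.percentile([r.e2el_s for r in records], 99),
        "goodput_tok_s": sum(r.tokens for r in records if r.succeeded) / window_seconds,
    }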

Ignore mean latency. Ignore raw throughput. Those metrics feel good on dashboards but hide the truth about your system.

The metrics you choose to track shape the system you build. Choose the ones that tell you what users actually experience.