What P99 Latency Tells You That P50 Hides
Database engineers learned this decades ago: mean latency is a lie. One slow query in a hundred is enough to make users think the whole system is broken.
LLM inference is learning the same lesson.
The Math of Bad Experiences
Your P50 latency is 200ms. Your P99 is 8 seconds. Sounds like a rare edge case, right?
A user makes 10 requests per session. Assuming each request independently has a 1% chance of landing in the P99 tail, the probability they hit at least one outlier is:
P(at least one outlier) = 1 - (0.99)^10 ≈ 9.6%
Nearly one in ten sessions includes an 8-second wait. That's not an edge case. That's a retention problem.
At 50 requests per session (a reasonable chat conversation):
P(at least one outlier) = 1 - (0.99)^50 ≈ 39.5%
Four in ten users have a terrible experience. Your P50 dashboard is green.
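The arithmetic is easy to sanity-check; here is a minimal sketch, assuming the same 1% per-request outlier rate and independent requests:
def p_any_outlier(requests_per_session: int, outlier_rate: float = 0.01) -> float:
    # Chance a session contains at least one request slower than your P99
    return 1 - (1 - outlier_rate) ** requests_per_session

for n in (1, 10, 50, 200):
    print(f"{n:>3} requests/session -> {p_any_outlier(n):.1%} of sessions hit the tail")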
Where Tail Latency Comes From
LLM tail latency has specific causes:
Long output generation: P50 might be 100 tokens. P99 might be 2,000 tokens. Generation time scales linearly with output length.
# P50 output: 100 tokens @ 50 tok/s = 2s
# P99 output: 2000 tokens @ 50 tok/s = 40s
Cold starts: First request to a scaled-down instance pays model loading time. Subsequent requests are fast.
Queue depth: Under load, some requests wait. The unlucky ones wait longest.
Long input processing: A request with 50,000 tokens takes longer to prefill than one with 500 tokens.
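These causes compound. A back-of-the-envelope model makes the worst case concrete; the throughput and cold-start numbers below are assumptions for illustration, not measurements:
PREFILL_TOK_PER_S = 5_000  # assumed prompt-processing throughput
DECODE_TOK_PER_S = 50      # assumed generation throughput
COLD_START_S = 20.0        # assumed model load time on a cold instance

def request_latency(input_tokens: int, output_tokens: int,
                    queue_wait_s: float = 0.0, cold_start: bool = False) -> float:
    latency = queue_wait_s + (COLD_START_S if cold_start else 0.0)
    latency += input_tokens / PREFILL_TOK_PER_S   # prefill
    latency += output_tokens / DECODE_TOK_PER_S   # decode
    return latency

print(request_latency(500, 100))                         # typical request: ~2.1s
print(request_latency(50_000, 2_000, queue_wait_s=5.0))  # unlucky request: ~55s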
Measuring the Right Percentiles
P50 (median): What half your requests achieve. Useless for understanding user experience.
P95: What 95% of requests achieve. Useful for capacity planning.
P99: What 99% of requests achieve. What your unluckiest users experience.
P99.9: What 999 in 1000 requests achieve. What your most engaged users experience (they make more requests, so they're more likely to hit outliers).
import numpy as np

def latency_report(latencies: list[float]) -> dict:
    return {
        "p50": np.percentile(latencies, 50),
        "p95": np.percentile(latencies, 95),
        "p99": np.percentile(latencies, 99),
        "p999": np.percentile(latencies, 99.9),
        "max": max(latencies),
        "count": len(latencies),
    }
The Hidden P99 Problem
Most dashboards default to mean or P50. You have to explicitly ask for tail latencies.
Worse, aggregation hides problems. A 5-minute P99 of 3 seconds might contain a 1-minute window where P99 was 15 seconds. The spike gets averaged away.
-- Aggregated view hides spikes
SELECT
    time_bucket('5 minutes', timestamp) AS bucket,
    percentile_cont(0.99) WITHIN GROUP (ORDER BY latency) AS p99
FROM requests
GROUP BY bucket;

-- Per-minute view reveals them
SELECT
    time_bucket('1 minute', timestamp) AS bucket,
    percentile_cont(0.99) WITHIN GROUP (ORDER BY latency) AS p99
FROM requests
GROUP BY bucket;
Always look at multiple time resolutions.
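The same multi-resolution check in Python, if your request log lives in pandas; the synthetic data below just stands in for real logs:
import numpy as np
import pandas as pd

# Synthetic request log: one request per second for ten minutes,
# with two ~15s outliers in a single bad minute.
idx = pd.date_range("2024-01-01", periods=600, freq="s")
latency = np.random.default_rng(0).lognormal(-1.5, 0.5, size=600)
latency[300:302] += 15.0
df = pd.DataFrame({"latency": latency}, index=idx)

coarse = df["latency"].resample("5min").quantile(0.99)  # spike averaged away
fine = df["latency"].resample("1min").quantile(0.99)    # spike clearly visible
print("worst 5-minute P99:", round(coarse.max(), 1))
print("worst 1-minute P99:", round(fine.max(), 1))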
Taming Tail Latency
Output limits: If P99 outputs are 20x longer than P50, set a max_tokens limit. Better to truncate than to make users wait.
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    max_tokens=500,  # cap output length so the worst case is bounded
)
Timeout and retry: If a request exceeds your latency budget, cancel and retry. Sometimes you get a faster path.
import asyncio

async def generate_with_timeout(prompt: str, timeout: float = 5.0):
    try:
        return await asyncio.wait_for(client.generate(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        # Retry once; a second timeout propagates to the caller
        return await asyncio.wait_for(client.generate(prompt), timeout=timeout)
Request hedging: Send the same request to two backends and use whichever responds first. Expensive (it can double per-request cost), but it cuts most of the single-backend tail.
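A minimal asyncio sketch of hedging; the backends list and their .generate method are assumptions mirroring the client used above, and a production version would also handle a first response that is an error:
import asyncio

async def hedged_generate(prompt: str, backends):
    # Send the same request to every backend; keep whichever finishes first.
    tasks = [asyncio.create_task(b.generate(prompt)) for b in backends]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # stop paying for the slower copies
    return next(iter(done)).result()
A common refinement is to send the second copy only after a short delay (around your P95), so only requests that are already slow pay the extra cost.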
Load shedding: When queue depth exceeds a threshold, reject new requests immediately rather than letting them queue and contribute to tail latency.
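A sketch of that check, assuming the same illustrative client as the timeout example; the threshold and how the rejection maps to a 503 are up to your serving framework:
MAX_IN_FLIGHT = 32  # assumed threshold; tune against your latency budget
_in_flight = 0      # in-flight count as a proxy for queue depth

class Overloaded(Exception):
    """Reject instead of queueing; callers should map this to a 503."""

async def generate_or_shed(prompt: str):
    global _in_flight
    if _in_flight >= MAX_IN_FLIGHT:
        raise Overloaded("queue full, shedding request")
    _in_flight += 1
    try:
        return await client.generate(prompt)  # same illustrative client as above
    finally:
        _in_flight -= 1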
SLAs and Tail Latency
If your SLA says "200ms P50 latency," you haven't promised anything useful. Users don't experience P50.
Better SLA structure:
- P50 < 200ms (normal operation)
- P95 < 500ms (acceptable variation)
- P99 < 2s (worst acceptable experience)
- P99.9 < 5s (extreme edge case)
Each tier should have consequences for breach. P50 violations are bugs. P99 violations are incidents.
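A small checker against that tier structure; the limits are copied from the example SLA above, in seconds:
import numpy as np

SLA_LIMITS_S = {"p50": 0.2, "p95": 0.5, "p99": 2.0, "p99.9": 5.0}  # from the tiers above

def sla_breaches(latencies: list[float]) -> dict[str, bool]:
    observed = {
        "p50": np.percentile(latencies, 50),
        "p95": np.percentile(latencies, 95),
        "p99": np.percentile(latencies, 99),
        "p99.9": np.percentile(latencies, 99.9),
    }
    return {tier: observed[tier] > limit for tier, limit in SLA_LIMITS_S.items()}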
The Dashboard That Actually Helps
Stop looking at average latency. Start looking at:
- P99 over time (is tail getting worse?)
- P99 by request type (which features have tail problems?)
- P99/P50 ratio (how variable is your system?)
- Requests above threshold (how many users are affected?)
def tail_health_score(latencies: list[float], threshold: float) -> dict:
    p50 = np.percentile(latencies, 50)
    p99 = np.percentile(latencies, 99)
    above_threshold = sum(1 for l in latencies if l > threshold)
    return {
        "variability_ratio": p99 / p50,  # lower is better
        "pct_above_threshold": above_threshold / len(latencies) * 100,
        "tail_health": "good" if p99 / p50 < 5 else "degraded",
    }
A P99/P50 ratio above 5x means your tail is degrading; above 10x, your system is effectively unpredictable. Users will notice.
Your fastest users don't remember how fast you were. Your slowest users remember how slow you were. Optimize accordingly.