Calculating End-to-End Latency Correctly
A user reports your AI chat takes 8 seconds to respond. Your monitoring shows 2-second latency. You both see the same request, but you're looking at different parts of it.
End-to-end latency isn't one number. It's an equation with terms that hide from different observers.
The Equation
E2EL = TTFT + (ITL × tokens) + network + client rendering
The first two terms live on your servers. The last two live between your servers and the user's eyeballs. Most monitoring systems only see the first two.
TTFT (Time to First Token): The time the model spends processing the prompt. This is prefill: attention computed across the entire input sequence, which scales as O(n²) in prompt length, so a 2,000-token prompt takes roughly 4x longer to prefill than a 1,000-token one.
Generation time (ITL × tokens): ITL is inter-token latency. Each output token is generated sequentially, and this phase is memory-bandwidth-bound on the GPU. Typically 20-50ms per token for a 70B model on an H100.
Network latency: Round-trip between your server and the user. Depends on geography, CDN configuration, and whether your infrastructure accidentally routes through a continent the user doesn't live on.
Client rendering: The browser parsing SSE events and painting text. Usually negligible, unless you're doing something exotic with the DOM.
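Putting the four terms together as a quick estimate can tell you which one dominates before you touch a profiler. A minimal sketch; the numbers here are purely illustrative assumptions, not measurements from any particular system:

# Back-of-the-envelope E2EL estimate. All numbers are illustrative assumptions.
def estimate_e2el_ms(ttft_ms, itl_ms, output_tokens, network_ms, render_ms):
    """E2EL = TTFT + (ITL x tokens) + network + client rendering."""
    return ttft_ms + itl_ms * output_tokens + network_ms + render_ms

# Example: 800ms prefill, 30ms/token for 200 tokens, 120ms network, 10ms render
print(estimate_e2el_ms(800, 30, 200, 120, 10))  # 6930ms, dominated by generation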
Where Teams Get Fooled
A fast TTFT with slow generation feels responsive. Users see text appearing quickly and forgive the wait.
A slow TTFT with fast generation feels broken. Users stare at a blank screen, wonder if it crashed, maybe refresh. By the time tokens start streaming, they've already formed an opinion.
Same total latency. Different experience.
# Scenario A: 500ms TTFT, 50 tok/s generation, 200 tokens
# E2EL = 500ms + 4000ms = 4.5s
# User sees: Quick response, then steady stream
# Scenario B: 3000ms TTFT, 200 tok/s generation, 200 tokens
# E2EL = 3000ms + 1000ms = 4s
# User sees: Long wait, then text explosion
# B is technically faster. A feels better.
Optimizing for total latency without considering how it's distributed over the response is like optimizing for average income without noticing that one billionaire skews the number.
The Network Term Nobody Measures
Here's what catches teams off guard: network latency can exceed model latency.
A user in Singapore hitting your us-east-1 endpoint adds 200-300ms per round-trip. With streaming, you pay that once on the initial connection. But if your client polls instead of holding a proper SSE stream open, every "check for new tokens" pays that penalty again.
I've seen systems where network accounted for 60% of perceived latency. The team spent months optimizing the model. They could have moved to a Singapore region in an afternoon.
# Quick check: is network your problem?
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
-o /dev/null -s https://your-api.com/health
If TTFB is 500ms for a health check that returns 10 bytes, your model isn't the bottleneck.
Measuring the Right Terms
Instrument each term separately:
import time

class LatencyBreakdown:
    """Server-side timestamps for each phase, taken with time.perf_counter()."""

    def __init__(self):
        self.request_received = None   # set when the request arrives
        self.first_token_ready = None  # set when the first output token is produced
        self.last_token_ready = None   # set when the final output token is produced
        self.response_sent = None      # set when the last byte leaves the server

    def ttft(self):
        return self.first_token_ready - self.request_received

    def generation_time(self):
        return self.last_token_ready - self.first_token_ready

    def server_e2el(self):
        return self.last_token_ready - self.request_received

# Client-side latency requires client instrumentation.
# You need both sides to see the full picture.
The gap between server_e2el and what users report is everything that happens outside your servers. If that gap is large, you have a network or client problem, not a model problem.
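On the client side, a rough way to capture what the user actually experiences is to timestamp the request, the first streamed byte, and the last streamed byte. A minimal sketch using requests with a streamed response, assuming the endpoint streams its output (SSE or chunked); the URL and payload are placeholders for whatever your API expects, not a real client library:

import time
import requests  # assumes the 'requests' package is installed

def client_latency(url, payload):
    """Measure client-perceived TTFT and total latency for a streaming endpoint."""
    start = time.perf_counter()
    first_byte_at = None
    with requests.post(url, json=payload, stream=True, timeout=60) as resp:
        for chunk in resp.iter_content(chunk_size=None):
            if chunk and first_byte_at is None:
                first_byte_at = time.perf_counter()  # client-side "TTFT", network included
    end = time.perf_counter()
    return {"client_ttft_s": first_byte_at - start, "client_e2el_s": end - start}

# Compare client_e2el against server_e2el: the difference is network plus client overhead.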
The Optimization Sequence
Different terms require different fixes:
| Term | Cause | Fix |
|---|---|---|
| High TTFT | Long prompt, cold cache | Prefix caching, shorter prompts |
| High ITL | Model size, memory bandwidth | Smaller model, better GPU, quantization |
| High network | Geography, routing | Multi-region, edge deployment |
| High rendering | Client implementation | Proper SSE handling, DOM batching |
Optimizing the wrong term wastes months. Measure first.
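Once each term is measured, deciding where to spend effort is a comparison, not a debate. A small sketch, assuming you already have per-term numbers in milliseconds; the dict and its keys are illustrative:

# Given measured per-term latencies (ms), report which term dominates.
def dominant_term(terms):
    total = sum(terms.values())
    name, value = max(terms.items(), key=lambda kv: kv[1])
    return name, value / total

terms = {"ttft": 600, "generation": 3200, "network": 450, "client": 30}
name, share = dominant_term(terms)
print(f"{name} accounts for {share:.0%} of E2EL")  # generation accounts for ~75%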
When the Math Doesn't Add Up
Sometimes measured E2EL comes out longer than the sum of the terms you instrumented. That usually means something is blocking that shouldn't be.
Common culprits:
Synchronous logging: Writing logs to disk between tokens, so each token waits for I/O. (A non-blocking alternative is sketched after this list.)
GC pauses: Python's garbage collector running mid-generation. Adds 50-200ms spikes.
Connection pooling: Waiting for a database connection this request doesn't even need, but that your middleware acquires anyway.
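For the logging case, Python's standard library can take the disk write off the token-generating path by handing records to a background thread. A minimal sketch using logging.handlers.QueueHandler and QueueListener; the logger name and file name are arbitrary:

import logging
import logging.handlers
import queue

# Hand log records to a background thread so generation never waits on disk I/O.
log_queue = queue.Queue(-1)
queue_handler = logging.handlers.QueueHandler(log_queue)
file_handler = logging.FileHandler("inference.log")  # arbitrary file name
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logger = logging.getLogger("inference")
logger.addHandler(queue_handler)
logger.setLevel(logging.INFO)

logger.info("token emitted")  # enqueues only; the listener thread writes to disk
# Call listener.stop() on shutdown to flush pending records.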
Profile with traces, not just metrics. Metrics tell you the total. Traces tell you where it went.
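One way to get per-phase traces is to wrap prefill and decode in OpenTelemetry spans. A minimal sketch, assuming the OpenTelemetry SDK and an exporter are configured elsewhere; run_prefill and run_decode are hypothetical stand-ins for your model calls:

from opentelemetry import trace  # assumes opentelemetry-api/sdk are installed and configured

tracer = trace.get_tracer("chat.latency")

def handle_request(prompt):
    with tracer.start_as_current_span("e2el.server"):
        with tracer.start_as_current_span("prefill"):
            first_token = run_prefill(prompt)   # hypothetical model call
        with tracer.start_as_current_span("decode"):
            tokens = run_decode(first_token)    # hypothetical model call
    return tokens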
The equation looks simple: E2EL = TTFT + generation + network + client. Knowing which term dominates your specific system is the difference between optimizing and guessing.