Calculating End-to-End Latency Correctly

A user reports your AI chat takes 8 seconds to respond. Your monitoring shows 2-second latency. You both see the same request, but you're looking at different parts of it.

End-to-end latency isn't one number. It's an equation with terms that hide from different observers.

The Equation

E2EL = TTFT + (ITL × tokens) + network + client rendering

The first two terms live on your servers. The last two live between your servers and the user's eyeballs. Most monitoring systems only see the first two.

TTFT (Time to First Token): The model processing the prompt. This is prefill: attention computed across the entire input sequence. O(n²) in prompt length. A 2,000 token prompt takes roughly 4x longer to prefill than a 1,000 token prompt.

Generation time (ITL × tokens): ITL is inter-token latency. This is decode: each output token generated sequentially, memory-bound on the GPU. Typically 20-50ms per token for a 70B model on an H100.

Network latency: Round-trip between your server and the user. Depends on geography, CDN configuration, and whether your infrastructure accidentally routes through a continent the user doesn't live on.

Client rendering: The browser parsing SSE events and painting text. Usually negligible, unless you're doing something exotic with the DOM.
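
Spelled out as code, the equation is just a sum. A minimal sketch, where every number in the example call is an illustrative assumption:

def e2el_seconds(ttft, itl, output_tokens, network, rendering):
    # E2EL = TTFT + (ITL × tokens) + network + client rendering, all in seconds
    return ttft + itl * output_tokens + network + rendering

# Illustrative numbers only: 800ms prefill, 30ms per token for 300 tokens,
# 150ms of network overhead, 20ms of client rendering.
print(e2el_seconds(0.8, 0.030, 300, 0.150, 0.020))  # 9.97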

Where Teams Get Fooled

A fast TTFT with slow generation feels responsive. Users see text appearing quickly and forgive the wait.

A slow TTFT with fast generation feels broken. Users stare at a blank screen, wonder if it crashed, maybe refresh. By the time tokens start streaming, they've already formed an opinion.

Same total latency. Different experience.

# Scenario A: 500ms TTFT, 50 tok/s generation, 200 tokens
# E2EL = 500ms + 4000ms = 4.5s
# User sees: Quick response, then steady stream

# Scenario B: 3000ms TTFT, 200 tok/s generation, 200 tokens
# E2EL = 3000ms + 1000ms = 4s
# User sees: Long wait, then text explosion

# B is technically faster. A feels better.

Optimizing for total latency without considering how that latency is distributed between the wait and the stream is like optimizing for average income while ignoring that one billionaire skews the number.

The Network Term Nobody Measures

Here's what catches teams off guard: network latency can exceed model latency.

A user in Singapore hitting your us-east-1 endpoint adds 200-300ms per round-trip. With proper SSE streaming, you pay that once, on the initial connection. But if your client is polling instead, every "check for new tokens" pays that penalty again.
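
To put a rough number on the polling penalty, here's a back-of-the-envelope sketch; every value is an assumption for illustration, not a measurement:

# All numbers assumed for illustration.
rtt = 0.25            # Singapore <-> us-east-1 round-trip, in seconds
poll_interval = 0.5   # client asks "any new tokens?" every 500ms
generation = 10.0     # seconds the model spends streaming output

extra_requests = generation / poll_interval     # ~20 round-trips instead of 1
# Each chunk can sit on the server for up to poll_interval before the next
# poll picks it up, then spends part of an rtt in flight back to the user.
worst_case_lag_per_chunk = poll_interval + rtt  # ~0.75s behind the stream
print(extra_requests, worst_case_lag_per_chunk)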

I've seen systems where network accounted for 60% of perceived latency. The team spent months optimizing the model. They could have moved to a Singapore region in an afternoon.

# Quick check: is network your problem?
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
     -o /dev/null -s https://your-api.com/health

If TTFB is 500ms for a health check that returns 10 bytes, your model isn't the bottleneck.

Measuring the Right Terms

Instrument each term separately:

import time

class LatencyBreakdown:
    def __init__(self):
        self.request_received = None
        self.first_token_ready = None
        self.last_token_ready = None
        self.response_sent = None

    def mark_request_received(self):
        # Call as soon as the request lands on the server
        self.request_received = time.perf_counter()

    def mark_token(self):
        # Call on every generated token; the first call pins TTFT
        now = time.perf_counter()
        if self.first_token_ready is None:
            self.first_token_ready = now
        self.last_token_ready = now

    def mark_response_sent(self):
        # Call after the last byte has been written to the client
        self.response_sent = time.perf_counter()

    def ttft(self):
        return self.first_token_ready - self.request_received

    def generation_time(self):
        return self.last_token_ready - self.first_token_ready

    def server_e2el(self):
        return self.last_token_ready - self.request_received

    # Client-side latency requires client instrumentation
    # You need both sides to see the full picture

The gap between server_e2el and what users report is everything that happens outside your servers. If that gap is large, you have a network or client problem, not a model problem.
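
To see the other side of that gap, here's a minimal client-side sketch. It assumes the API streams its response over HTTP and uses the requests library; the URL and payload at the bottom are placeholders, not a real endpoint.

import time

import requests  # assumption: a streaming HTTP API reachable from the client

def measure_client_latency(url, payload):
    sent = time.perf_counter()
    first_byte = None

    with requests.post(url, json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=None):
            if first_byte is None and chunk:
                first_byte = time.perf_counter()  # client-observed TTFT
    done = time.perf_counter()

    return {
        "client_ttft": first_byte - sent,  # includes network plus server TTFT
        "client_e2el": done - sent,        # closest to what the user experiences
    }

# Placeholder endpoint and prompt, purely illustrative.
print(measure_client_latency("https://your-api.com/v1/chat", {"prompt": "hi"}))

Subtract the server-side ttft and generation_time from these client numbers, and what's left is the network and rendering you otherwise never see.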

The Optimization Sequence

Different terms require different fixes:

Term           | Cause                        | Fix
High TTFT      | Long prompt, cold cache      | Prefix caching, shorter prompts
High ITL       | Model size, memory bandwidth | Smaller model, better GPU, quantization
High network   | Geography, routing           | Multi-region, edge deployment
High rendering | Client implementation        | Proper SSE handling, DOM batching

Optimizing the wrong term wastes months. Measure first.
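
If you want that triage as code, a trivial illustrative sketch: it simply picks the largest measured term, and the fix strings paraphrase the table above.

def dominant_term(ttft, generation, network, rendering):
    # All inputs in seconds; returns the largest term and its likely fix
    terms = {
        "ttft": ttft,
        "generation": generation,
        "network": network,
        "rendering": rendering,
    }
    fixes = {
        "ttft": "prefix caching, shorter prompts",
        "generation": "smaller model, better GPU, quantization",
        "network": "multi-region, edge deployment",
        "rendering": "proper SSE handling, DOM batching",
    }
    worst = max(terms, key=terms.get)
    return worst, fixes[worst]

print(dominant_term(0.5, 4.0, 0.3, 0.05))  # ('generation', 'smaller model, ...')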

When the Math Doesn't Add Up

Sometimes E2EL is longer than the sum of its parts. This usually means something is blocking that shouldn't be.

Common culprits:

Synchronous logging: Writing logs to disk between tokens. Each token waits for I/O.

GC pauses: Python's garbage collector running mid-generation. Adds 50-200ms spikes.

Connection pool contention: Waiting for a database connection this request doesn't even need, because your middleware acquires one anyway.

Profile with traces, not just metrics. Metrics tell you the total. Traces tell you where it went.
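
If you don't have tracing wired up yet, even recording inter-token gaps on the server will surface these stalls. A minimal sketch, assuming you can wrap whatever iterator yields your tokens; the 200ms threshold is an arbitrary illustration:

import time

def timed_stream(token_iter, stall_threshold=0.2):
    # Yield tokens unchanged while recording the gap before each one
    gaps = []
    prev = time.perf_counter()
    for token in token_iter:
        now = time.perf_counter()
        gaps.append(now - prev)
        prev = now
        yield token
    stalls = [g for g in gaps if g > stall_threshold]
    if stalls:
        # Spikes here usually mean synchronous logging, GC pauses,
        # or middleware blocking between tokens
        print(f"{len(stalls)} stalls over {stall_threshold}s, worst {max(stalls):.3f}s")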

The equation looks simple: E2EL = TTFT + generation + network + client. Knowing which term dominates your specific system is the difference between optimizing and guessing.