The Latency You're Not Measuring
Your model inference takes 200ms. Your users experience 800ms. The difference isn't magic. It's everything you're not measuring.
The Full Picture
User clicks send
↓ 50ms - Browser JavaScript processing
↓ 80ms - DNS + TLS + TCP handshake (first request)
↓ 40ms - Network transit to your edge
↓ 20ms - Load balancer routing
↓ 30ms - Your API processing before LLM call
↓ 200ms - Actual model inference ← This is what you measure
↓ 30ms - Your API processing after LLM call
↓ 40ms - Network transit back to user
↓ 50ms - Browser rendering
↓
User sees response: 540ms total
But wait, you're using an external API, so the inference leg actually looks like this:
↓ 60ms - Network to API provider
↓ 200ms - Model inference
↓ 60ms - Network back from API provider
Actual total: 660ms
That 200ms model latency is less than a third of what users experience.
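If a provider call is in the picture, time it separately from your own processing so you can tell their latency from yours. A minimal sketch, assuming a hypothetical async client with a generate coroutine; from your side the number bundles the provider's inference with both network legs, unless the provider returns its own timing metadata you can subtract.

import time

async def timed_provider_call(client, prompt: str):
    # Wrap the external call so provider latency is recorded on its own.
    start = time.perf_counter()
    result = await client.generate(prompt)  # hypothetical provider client
    provider_seconds = time.perf_counter() - start
    # This includes network to/from the provider plus their inference time;
    # you can't split those client-side, but you can track the sum per request.
    return result, provider_seconds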
What to Measure
import time
from dataclasses import dataclass

@dataclass
class LatencyBreakdown:
    client_to_server: float
    preprocessing: float
    model_inference: float
    postprocessing: float
    server_to_client: float
    total: float

# preprocess, postprocess, and model are stand-ins for your own pipeline
async def measure_full_latency(request) -> LatencyBreakdown:
    # Client timestamp comes in a header (wall-clock seconds, so clock skew
    # between client and server shows up in client_to_server)
    client_send_time = float(request.headers.get('X-Client-Timestamp', 0))
    server_receive_time = time.time()

    # Preprocessing
    preprocess_start = time.time()
    processed_input = preprocess(request.body)
    preprocess_end = time.time()

    # Model call
    model_start = time.time()
    result = await model.generate(processed_input)
    model_end = time.time()

    # Postprocessing
    postprocess_start = time.time()
    response = postprocess(result)
    postprocess_end = time.time()

    server_send_time = time.time()

    return LatencyBreakdown(
        client_to_server=server_receive_time - client_send_time if client_send_time else 0,
        preprocessing=preprocess_end - preprocess_start,
        model_inference=model_end - model_start,
        postprocessing=postprocess_end - postprocess_start,
        server_to_client=0,  # Measured on the client
        total=server_send_time - server_receive_time,  # Server-side time only; add the network legs from client measurements
    )
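To make the breakdown useful, record it per request somewhere you can compute percentiles later. A minimal sketch that just emits a structured log line; swap in whatever metrics client you already run (the logger name is arbitrary):

import logging

logger = logging.getLogger("latency")

def record_breakdown(b: LatencyBreakdown) -> None:
    # One record per request; aggregate p50/p95/p99 downstream.
    logger.info(
        "latency_breakdown "
        f"client_to_server_ms={b.client_to_server * 1000:.1f} "
        f"preprocessing_ms={b.preprocessing * 1000:.1f} "
        f"model_inference_ms={b.model_inference * 1000:.1f} "
        f"postprocessing_ms={b.postprocessing * 1000:.1f} "
        f"server_total_ms={b.total * 1000:.1f}"
    )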
Geographic Reality
Your server is in us-east-1. Your user is in Tokyo.
Speed of light in fiber: ~200,000 km/s
Tokyo to Virginia: ~11,000 km
Minimum round-trip: 110ms (physics)
Actual round-trip: 150-200ms (routing, hops)
No amount of model optimization fixes geography. A user 10,000 km away pays 100ms+ of round-trip network latency before your code even runs.
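The arithmetic behind that floor is short enough to sanity-check in code, using the numbers above:

def min_round_trip_ms(distance_km: float, fiber_speed_km_s: float = 200_000) -> float:
    # Physical lower bound: the distance there and back at the speed of light in fiber.
    return 2 * distance_km / fiber_speed_km_s * 1000

print(min_round_trip_ms(11_000))  # Tokyo <-> Virginia: 110.0 ms, before any routing overhead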
Solutions:
- Edge deployment (run inference closer to users)
- Regional API endpoints (sketched after this list)
- Caching where possible
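At the application level, the regional-endpoint option is just a lookup from user region to the nearest deployment. The map below is hypothetical, and in practice this usually lives in latency-based DNS or an anycast load balancer rather than application code; it's here to show the shape of the idea.

# Hypothetical region -> endpoint map; real deployments usually handle this
# with latency-based DNS or an anycast load balancer instead.
REGIONAL_ENDPOINTS = {
    "ap-northeast": "https://api-tokyo.example.com",
    "eu-west": "https://api-dublin.example.com",
    "us-east": "https://api-virginia.example.com",
}

def pick_endpoint(user_region: str) -> str:
    # Fall back to the primary region when there's no closer deployment.
    return REGIONAL_ENDPOINTS.get(user_region, REGIONAL_ENDPOINTS["us-east"])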
The Preprocessing Tax
Common preprocessing steps that add latency:
# Tokenization: 5-20ms for long prompts
tokens = tokenizer.encode(prompt)
# Embedding lookup: 10-50ms if hitting a database
user_context = await db.get_user_context(user_id)
# Template rendering: 1-5ms
full_prompt = template.render(context=user_context, query=prompt)
# Moderation check: 50-200ms if using another API
safety_result = await moderation_api.check(prompt)
A "simple" preprocessing pipeline can add 100-300ms before the model even sees the request.
The Postprocessing Tax
# JSON parsing: 1-5ms
parsed = json.loads(model_response)
# Validation: 1-10ms
validated = schema.validate(parsed)
# Database writes: 20-100ms
await db.save_conversation(user_id, prompt, response)
# Analytics: 5-50ms (hopefully async)
analytics.track("completion", {...})
If these happen synchronously before sending the response, they add directly to user-perceived latency.
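The database write and the analytics call rarely need to block the response. One way to get them off the critical path, sketched with asyncio around the calls above; fire-and-forget tasks need their own failure logging, since exceptions no longer surface in the request handler:

import asyncio

def _log_failure(task: asyncio.Task) -> None:
    exc = task.exception()
    if exc is not None:
        print(f"background task failed: {exc!r}")

def fire_and_forget(coro) -> asyncio.Task:
    # Return the task so the caller can hold a reference (the event loop keeps
    # only a weak one), and log failures instead of silently dropping them.
    task = asyncio.create_task(coro)
    task.add_done_callback(_log_failure)
    return task

# Inside the request handler, after building `response`:
# fire_and_forget(db.save_conversation(user_id, prompt, response))
# fire_and_forget(asyncio.to_thread(analytics.track, "completion", {...}))
# return response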
Measuring from the Client
The only latency that matters is what users experience:
// Client-side measurement
const start = performance.now();
const response = await fetch('/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Client-Timestamp': (Date.now() / 1000).toString(),
  },
  body: JSON.stringify({ prompt })
});

// For streaming responses, fetch resolves once headers arrive, so this is time to first byte
const ttfb = performance.now() - start;
console.log(`Time to first byte: ${ttfb}ms`);

// Read the stream and note when the first chunk lands
const reader = response.body.getReader();
let firstChunk = true;
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  if (firstChunk) {
    console.log(`Time to first token: ${performance.now() - start}ms`);
    firstChunk = false;
  }
}
console.log(`Total time: ${performance.now() - start}ms`);
The Optimization Priority
When you see 800ms end-to-end latency and 200ms model latency:
- First, understand where the other 600ms goes
- Then optimize the largest contributor
- Model optimization matters, but it's often not the bottleneck
A 50% improvement in model latency (200ms → 100ms) saves 100ms. A 50% improvement in network latency (400ms → 200ms) saves 200ms.
Measure everything. Optimize what matters.