The Latency You're Not Measuring
Your model inference takes 200ms. Your users experience 800ms. The difference isn't magic. It's everything you're not measuring.
The Full Picture
User clicks send
↓ 50ms - Browser JavaScript processing
↓ 80ms - DNS + TLS + TCP handshake (first request)
↓ 40ms - Network transit to your edge
↓ 20ms - Load balancer routing
↓ 30ms - Your API processing before LLM call
↓ 200ms - Actual model inference ← This is what you measure
↓ 30ms - Your API processing after LLM call
↓ 40ms - Network transit back to user
↓ 50ms - Browser rendering
↓
User sees response: 540ms total
But wait, you're using an external API, so the inference leg actually looks like this:
↓ 60ms - Network to API provider
↓ 200ms - Model inference
↓ 60ms - Network back from API provider
Actual total: 660ms
That 200ms model latency is less than a third of what users experience.
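If a provider call is in the picture, time it separately from your own processing so you can tell their latency from yours. A minimal sketch, assuming a hypothetical async client with a generate coroutine; from your side the number bundles the provider's inference with both network legs, unless the provider returns its own timing metadata you can subtract.

import time

async def timed_provider_call(client, prompt: str):
    # Wrap the external call so provider latency is recorded on its own.
    start = time.perf_counter()
    result = await client.generate(prompt)  # hypothetical provider client
    provider_seconds = time.perf_counter() - start
    # This includes network to/from the provider plus their inference time;
    # you can't split those client-side, but you can track the sum per request.
    return result, provider_seconds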
What to Measure
import time
from dataclasses import dataclass

@dataclass
class LatencyBreakdown:
    client_to_server: float
    preprocessing: float
    model_inference: float
    postprocessing: float
    server_to_client: float
    total: float

# preprocess, postprocess, and model are stand-ins for your own pipeline
async def measure_full_latency(request) -> LatencyBreakdown:
    # Client timestamp comes in a header (wall-clock seconds, so clock skew
    # between client and server shows up in client_to_server)
    client_send_time = float(request.headers.get('X-Client-Timestamp', 0))
    server_receive_time = time.time()

    # Preprocessing
    preprocess_start = time.time()
    processed_input = preprocess(request.body)
    preprocess_end = time.time()

    # Model call
    model_start = time.time()
    result = await model.generate(processed_input)
    model_end = time.time()

    # Postprocessing
    postprocess_start = time.time()
    response = postprocess(result)
    postprocess_end = time.time()

    server_send_time = time.time()

    return LatencyBreakdown(
        client_to_server=server_receive_time - client_send_time if client_send_time else 0,
        preprocessing=preprocess_end - preprocess_start,
        model_inference=model_end - model_start,
        postprocessing=postprocess_end - postprocess_start,
        server_to_client=0,  # Measured on the client
        total=server_send_time - server_receive_time,  # Server-side time only; add the network legs from client measurements
    )
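To make the breakdown useful, record it per request somewhere you can compute percentiles later. A minimal sketch that just emits a structured log line; swap in whatever metrics client you already run (the logger name is arbitrary):

import logging

logger = logging.getLogger("latency")

def record_breakdown(b: LatencyBreakdown) -> None:
    # One record per request; aggregate p50/p95/p99 downstream.
    logger.info(
        "latency_breakdown "
        f"client_to_server_ms={b.client_to_server * 1000:.1f} "
        f"preprocessing_ms={b.preprocessing * 1000:.1f} "
        f"model_inference_ms={b.model_inference * 1000:.1f} "
        f"postprocessing_ms={b.postprocessing * 1000:.1f} "
        f"server_total_ms={b.total * 1000:.1f}"
    )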
Geographic Reality
Your server is in us-east-1. Your user is in Tokyo.
Speed of light in fiber: ~200,000 km/s
Tokyo to Virginia: ~11,000 km
Minimum round-trip: 110ms (physics)
Actual round-trip: 150-200ms (routing, hops)
No amount of model optimization fixes geography. A user 10,000 km away pays 100ms+ of round-trip network latency before your code even runs.
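The arithmetic behind that floor is short enough to sanity-check in code, using the numbers above:

def min_round_trip_ms(distance_km: float, fiber_speed_km_s: float = 200_000) -> float:
    # Physical lower bound: the distance there and back at the speed of light in fiber.
    return 2 * distance_km / fiber_speed_km_s * 1000

print(min_round_trip_ms(11_000))  # Tokyo <-> Virginia: 110.0 ms, before any routing overhead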
Solutions:
- Edge deployment (run inference closer to users)
- Regional API endpoints (sketched after this list)
- Caching where possible
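At the application level, the regional-endpoint option is just a lookup from user region to the nearest deployment. The map below is hypothetical, and in practice this usually lives in latency-based DNS or an anycast load balancer rather than application code; it's here to show the shape of the idea.

# Hypothetical region -> endpoint map; real deployments usually handle this
# with latency-based DNS or an anycast load balancer instead.
REGIONAL_ENDPOINTS = {
    "ap-northeast": "https://api-tokyo.example.com",
    "eu-west": "https://api-dublin.example.com",
    "us-east": "https://api-virginia.example.com",
}

def pick_endpoint(user_region: str) -> str:
    # Fall back to the primary region when there's no closer deployment.
    return REGIONAL_ENDPOINTS.get(user_region, REGIONAL_ENDPOINTS["us-east"])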
The Preprocessing Tax
Common preprocessing steps that add latency:
# Tokenization: 5-20ms for long prompts
tokens = tokenizer.encode(prompt)
# Embedding lookup: 10-50ms if hitting a database
user_context = await db.get_user_context(user_id)
# Template rendering: 1-5ms
full_prompt = template.render(context=user_context, query=prompt)
# Moderation check: 50-200ms if using another API
safety_result = await moderation_api.check(prompt)
A "simple" preprocessing pipeline can add 100-300ms before the model even sees the request.
The Postprocessing Tax
# JSON parsing: 1-5ms
parsed = json.loads(model_response)
# Validation: 1-10ms
validated = schema.validate(parsed)
# Database writes: 20-100ms
await db.save_conversation(user_id, prompt, response)
# Analytics: 5-50ms (hopefully async)
analytics.track("completion", {...})
If these happen synchronously before sending the response, they add directly to user-perceived latency.
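The database write and the analytics call rarely need to block the response. One way to get them off the critical path, sketched with asyncio around the calls above; fire-and-forget tasks need their own failure logging, since exceptions no longer surface in the request handler:

import asyncio

def _log_failure(task: asyncio.Task) -> None:
    exc = task.exception()
    if exc is not None:
        print(f"background task failed: {exc!r}")

def fire_and_forget(coro) -> asyncio.Task:
    # Return the task so the caller can hold a reference (the event loop keeps
    # only a weak one), and log failures instead of silently dropping them.
    task = asyncio.create_task(coro)
    task.add_done_callback(_log_failure)
    return task

# Inside the request handler, after building `response`:
# fire_and_forget(db.save_conversation(user_id, prompt, response))
# fire_and_forget(asyncio.to_thread(analytics.track, "completion", {...}))
# return response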
Measuring from the Client
The only latency that matters is what users experience:
// Client-side measurement
const start = performance.now();
const response = await fetch('/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Client-Timestamp': (Date.now() / 1000).toString(),
  },
  body: JSON.stringify({ prompt })
});

// For streaming responses, fetch resolves once headers arrive, so this is time to first byte
const ttfb = performance.now() - start;
console.log(`Time to first byte: ${ttfb}ms`);

// Read the stream and note when the first chunk lands
const reader = response.body.getReader();
let firstChunk = true;
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  if (firstChunk) {
    console.log(`Time to first token: ${performance.now() - start}ms`);
    firstChunk = false;
  }
}
console.log(`Total time: ${performance.now() - start}ms`);
The Optimization Priority
When you see 800ms end-to-end latency and 200ms model latency:
- First, understand where the other 600ms goes
- Then optimize the largest contributor
- Model optimization matters, but it's often not the bottleneck
A 50% improvement in model latency (200ms → 100ms) saves 100ms. A 50% improvement in network latency (400ms → 200ms) saves 200ms.
Measure everything. Optimize what matters.