
The Latency You're Not Measuring

Your model inference takes 200ms. Your users experience 800ms. The difference isn't magic. It's everything you're not measuring.

The Full Picture

User clicks send
    ↓ 50ms  - Browser JavaScript processing
    ↓ 80ms  - DNS + TLS + TCP handshake (first request)
    ↓ 40ms  - Network transit to your edge
    ↓ 20ms  - Load balancer routing
    ↓ 30ms  - Your API processing before LLM call
    ↓ 200ms - Actual model inference ← This is what you measure
    ↓ 30ms  - Your API processing after LLM call
    ↓ 40ms  - Network transit back to user
    ↓ 50ms  - Browser rendering
    ↓
User sees response: 540ms total

But wait: if you're calling an external API, that 200ms inference line actually expands to:
    ↓ 60ms  - Network to API provider
    ↓ 200ms - Model inference
    ↓ 60ms  - Network back from API provider

Actual total: 660ms

That 200ms model latency is less than a third of what users experience.

What to Measure

import time
from dataclasses import dataclass

@dataclass
class LatencyBreakdown:
    client_to_server: float
    preprocessing: float
    model_inference: float
    postprocessing: float
    server_to_client: float
    total: float

async def measure_full_latency(request) -> LatencyBreakdown:
    # Client timestamp arrives in a header as epoch seconds; client and server
    # clocks are rarely in sync, so treat this leg as approximate
    client_send_time = float(request.headers.get('X-Client-Timestamp', 0))
    server_receive_time = time.time()

    # Preprocessing
    preprocess_start = time.time()
    processed_input = preprocess(request.body)
    preprocess_end = time.time()

    # Model call
    model_start = time.time()
    result = await model.generate(processed_input)
    model_end = time.time()

    # Postprocessing
    postprocess_start = time.time()
    response = postprocess(result)
    postprocess_end = time.time()

    server_send_time = time.time()

    # A real handler would also return `response`; this sketch returns only the timings
    return LatencyBreakdown(
        client_to_server=server_receive_time - client_send_time if client_send_time else 0,
        preprocessing=preprocess_end - preprocess_start,
        model_inference=model_end - model_start,
        postprocessing=postprocess_end - postprocess_start,
        server_to_client=0,  # Measured on the client
        total=server_send_time - server_receive_time  # Server-side time only; network legs come from the client
    )
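
Once you have the breakdown, emit it as structured data per request so you can aggregate percentiles later. A minimal sketch using the function above and the standard library (the logger name and wrapper are illustrative):

import dataclasses
import json
import logging

logger = logging.getLogger("latency")

async def handle_chat(request):
    # Hypothetical wrapper: run the measured pipeline, then emit one structured
    # log line per request so p50/p95/p99 can be computed downstream
    breakdown = await measure_full_latency(request)
    logger.info("latency_breakdown %s", json.dumps(dataclasses.asdict(breakdown)))
    return breakdown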

Geographic Reality

Your server is in us-east-1. Your user is in Tokyo.

Speed of light in fiber: ~200,000 km/s
Tokyo to Virginia: ~11,000 km
Minimum round-trip: 110ms (physics)
Actual round-trip: 150-200ms (routing, hops)

No amount of model optimization fixes geography. A user 10,000 km away pays 100ms+ of round-trip network latency before your code even runs.
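
The arithmetic is worth running for your own regions; a few lines make the floor concrete (distances are rough great-circle figures):

FIBER_SPEED_KM_PER_S = 200_000  # light in fiber travels at roughly 2/3 of c

def min_rtt_ms(distance_km: float) -> float:
    # Physics floor for a round trip; real routes add hops and routing overhead
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000

print(min_rtt_ms(11_000))  # Tokyo ↔ Virginia: ~110 ms
print(min_rtt_ms(10_000))  # 10,000 km away: ~100 ms round trip, minimum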

Solutions:

  • Edge deployment (run inference closer to users)
  • Regional API endpoints
  • Caching where possible (see the sketch below)
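
Caching is the only one of these you can ship without touching infrastructure, and it only helps when identical prompts actually recur (canned queries, retries). A minimal exact-match sketch, reusing the hypothetical `model` object from earlier; the TTL and keying are illustrative:

import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL_S = 300  # illustrative; tune to how quickly answers go stale

async def cached_generate(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_S:
        return hit[1]  # cache hit: skips the model call and the network round trip
    result = await model.generate(prompt)  # `model` as in the earlier sketch
    _cache[key] = (time.time(), result)
    return result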

The Preprocessing Tax

Common preprocessing steps that add latency:

# Tokenization: 5-20ms for long prompts
tokens = tokenizer.encode(prompt)

# Embedding lookup: 10-50ms if hitting a database
user_context = await db.get_user_context(user_id)

# Template rendering: 1-5ms
full_prompt = template.render(context=user_context, query=prompt)

# Moderation check: 50-200ms if using another API
safety_result = await moderation_api.check(prompt)

A "simple" preprocessing pipeline can add 100-300ms before the model even sees the request.

The Postprocessing Tax

# JSON parsing: 1-5ms
parsed = json.loads(model_response)

# Validation: 1-10ms
validated = schema.validate(parsed)

# Database writes: 20-100ms
await db.save_conversation(user_id, prompt, response)

# Analytics: 5-50ms (hopefully async)
analytics.track("completion", {...})

If these happen synchronously before sending the response, they add directly to user-perceived latency.
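
The usual fix is to return the response first and let the non-essential work finish in the background. A minimal sketch with plain asyncio; `schema`, `db`, and `analytics` are the hypothetical objects from the snippet above, and most web frameworks offer a background-task facility that does the same thing:

import asyncio
import json

_background: set = set()

def fire_and_forget(coro) -> None:
    # Schedule work off the critical path; keep a reference so the task
    # isn't garbage-collected before it finishes
    task = asyncio.create_task(coro)
    _background.add(task)
    task.add_done_callback(_background.discard)

async def finish_request(user_id, prompt, model_response):
    # Critical path: only what the user needs before seeing the answer
    parsed = json.loads(model_response)
    validated = schema.validate(parsed)

    # Off the critical path: persistence and analytics
    fire_and_forget(db.save_conversation(user_id, prompt, model_response))
    fire_and_forget(asyncio.to_thread(analytics.track, "completion", {"user_id": user_id}))

    return validated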

Measuring from the Client

The only latency that matters is what users experience:

// Client-side measurement
const start = performance.now();

const response = await fetch('/api/chat', {
  method: 'POST',
  headers: {
    'X-Client-Timestamp': (Date.now() / 1000).toString(),
  },
  body: JSON.stringify({ prompt })
});

// `fetch` resolves once response headers arrive, so this is roughly time to first byte
const ttfb = performance.now() - start;
console.log(`Time to first byte: ${ttfb}ms`);

// Read the stream
const reader = response.body.getReader();
let firstChunk = true;

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  if (firstChunk) {
    console.log(`Time to first token: ${performance.now() - start}ms`);
    firstChunk = false;
  }
}

console.log(`Total time: ${performance.now() - start}ms`);
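
Those console.log numbers are only useful once they reach your backend. A minimal sketch of a collection endpoint the client could POST (or navigator.sendBeacon) its timings to; FastAPI and the field names are assumptions, not anything this stack prescribes:

from fastapi import FastAPI  # assumed framework, purely for illustration
from pydantic import BaseModel

app = FastAPI()

class ClientTiming(BaseModel):
    ttfb_ms: float
    first_token_ms: float
    total_ms: float

@app.post("/metrics/client-latency")
async def collect_client_latency(timing: ClientTiming):
    # Client-reported numbers are the ground truth for user-perceived latency;
    # store them next to the server-side breakdown for the same request
    print("client timing:", timing.ttfb_ms, timing.first_token_ms, timing.total_ms)
    return {"ok": True}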

The Optimization Priority

When you see 800ms end-to-end latency and 200ms model latency:

  1. First, understand where the other 600ms goes
  2. Then optimize the largest contributor
  3. Model optimization matters, but it's often not the bottleneck

A 50% improvement in model latency (200ms → 100ms) saves 100ms. A 50% improvement in network latency (400ms → 200ms) saves 200ms.

Measure everything. Optimize what matters.