
Why First Token Latency Determines User Experience

Your throughput is 50 tokens per second. Your user sees a loading spinner for 2.5 seconds before anything appears. They think your app is broken.

This is the TTFT problem. Time to First Token is what users perceive as "thinking time." Everything else is just watching text appear.

What Happens During TTFT

When a request arrives, two things must happen before the first token can be generated:

  1. Prefill: The model processes the entire input prompt through attention, computing key-value pairs for every token
  2. KV cache allocation: Memory is allocated to store those key-value pairs for the decode phase

Only after prefill completes can the first output token be generated. A 4,000 token prompt means the model must run attention over all 4,000 tokens, across every layer, before anything comes out.

import time

def measure_ttft(prompt: str, model) -> tuple[float, str]:
    """Return (seconds until the first token, the first token itself)."""
    start = time.time()
    stream = model.generate_stream(prompt)  # any streaming generate API works here

    first_token = next(stream)  # blocks until prefill completes
    ttft = time.time() - start

    return ttft, first_token
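
Step 2 above has its own cost worth knowing: the KV cache for the prompt has to fit in GPU memory before decode can start. A back-of-envelope sketch (assumes fp16 KV values and full multi-head attention; grouped-query attention shrinks this considerably, and the 80-layer/64-head config is just a 70B-class example):

def kv_cache_bytes(n_tokens: int, n_layers: int, n_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # K and V tensors per layer, each of shape (n_tokens, n_heads, head_dim)
    return 2 * n_layers * n_tokens * n_heads * head_dim * dtype_bytes

# A 4,000 token prompt on a 70B-class config (80 layers, 64 heads, head_dim 128):
print(kv_cache_bytes(4_000, 80, 64, 128) / 1e9)  # ≈ 10.5 GB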

Why Long Prompts Kill TTFT

Prefill is compute-bound, not memory-bound. The attention computation scales quadratically with sequence length; optimized kernels like FlashAttention cut the memory traffic and make long prompts practical, but the compute itself is still quadratic.
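
For intuition, here is a rough FLOP count for the prefill attention alone. It is a sketch, not a benchmark: it ignores the weight FLOPs (projections and MLP), which grow only linearly with length and usually dominate until prompts get very long, which is why the table below looks closer to linear than quadratic. The 8,192/80-layer config is just a 70B-class example.

def attention_prefill_flops(n_tokens: int, d_model: int, n_layers: int) -> float:
    # QK^T and softmax(QK^T)V each cost roughly 2 * n^2 * d_model FLOPs per layer,
    # summed across heads
    return n_layers * 2 * (2 * n_tokens**2 * d_model)

print(attention_prefill_flops(4_000, 8_192, 80) / 1e12)  # ≈ 42 TFLOPs just for attention scores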

Prompt Length      Typical TTFT (70B model)
500 tokens         200ms
2,000 tokens       600ms
8,000 tokens       1.8s
32,000 tokens      6s+

That 32,000 token context window you're using? It comes with a 6-second thinking delay before anything happens.
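
If you want this table for your own stack, a crude sweep with the measure_ttft helper from earlier is enough. The "word " repetition is a stand-in for real prompts, so the token counts are only approximate, and model is whatever handle your serving client exposes.

for n in (500, 2_000, 8_000, 32_000):
    prompt = "word " * n          # roughly n tokens of filler
    ttft, _ = measure_ttft(prompt, model)
    print(f"{n:>6} tokens -> {ttft * 1000:.0f} ms")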

The TTFT vs Throughput Tradeoff

Here's the uncomfortable truth: optimizing for throughput often hurts TTFT.

Larger batch sizes increase throughput (more tokens per second across all requests) but increase latency for individual requests. The request that would have started immediately now waits for a batch to fill.

Batch size 1:  Request arrives → Prefill starts → TTFT: 400ms
Batch size 8:  Request arrives → Wait for batch → Prefill starts → TTFT: 1200ms

Throughput went up. User experience went down.
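
A rough queueing model makes the tradeoff concrete. This sketch assumes static batching and evenly spaced arrivals; real servers use continuous batching, which softens the wait, and a bigger batch also makes the prefill itself slower.

def expected_ttft_s(batch_size: int, arrival_rate_rps: float, prefill_s: float) -> float:
    # average wait for the batch to fill, plus the prefill itself
    queue_wait = (batch_size - 1) / (2 * arrival_rate_rps)
    return queue_wait + prefill_s

print(expected_ttft_s(1, 5, 0.4))   # 0.40s: no waiting
print(expected_ttft_s(8, 5, 0.4))   # 1.10s: most of the TTFT is queueing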

Optimizing TTFT

Shorter prompts: Every token in your prompt adds to prefill time. That 2,000 token system prompt? Per the numbers above, it adds roughly half a second of TTFT to every request. Trim ruthlessly.

Prefix caching: If your system prompt is the same across requests, cache its KV pairs. The next request skips prefill for those tokens.

# Without caching: 2000 token system prompt + 500 token user message
# TTFT = prefill(2500 tokens) = ~700ms

# With caching: Only user message needs prefill
# TTFT = prefill(500 tokens) = ~200ms
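
Serving stacks like vLLM expose this directly. A minimal sketch using vLLM's automatic prefix caching flag (the model name is a placeholder; any repeated prefix, not just the system prompt, gets its KV blocks reused):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=128)

SYSTEM_PROMPT = "You are a support assistant for Acme. ..."  # shared ~2,000 token prefix

# The first request pays full prefill; later requests reuse the cached KV
# blocks for SYSTEM_PROMPT and only prefill the user message.
for user_msg in ["Where is my order?", "How do refunds work?"]:
    out = llm.generate([SYSTEM_PROMPT + "\n\n" + user_msg], params)
    print(out[0].outputs[0].text)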

Smaller models for classification: If the first step is routing or classification, use a fast small model. 8B model TTFT: 50ms. 70B model TTFT: 400ms.
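
A sketch of that routing pattern with an OpenAI-compatible async client. The client object, model names, and CANNED_FAQ_ANSWER are placeholders for your own setup.

async def handle(user_msg: str) -> str:
    # Fast small model decides the route; a tiny max_tokens keeps it cheap.
    route = await client.chat.completions.create(
        model="small-8b-instruct",
        messages=[{"role": "user", "content": f"Reply with exactly 'faq' or 'complex': {user_msg}"}],
        max_tokens=5,
    )
    if route.choices[0].message.content.strip().lower() == "faq":
        return CANNED_FAQ_ANSWER            # no 70B prefill, near-zero TTFT
    # Only complex requests pay the large model's TTFT.
    answer = await client.chat.completions.create(
        model="large-70b-instruct",
        messages=[{"role": "user", "content": user_msg}],
    )
    return answer.choices[0].message.content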

Speculative decoding: Use a small draft model to propose tokens, then verify them with the large model. When predictions match, you skip decode steps. Strictly speaking this speeds up the decode phase rather than TTFT itself (the first token still waits on prefill), but it shortens the total time to a complete response.
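
For intuition, here is a toy version of the accept/reject logic under greedy decoding. draft_model and target_model are hypothetical objects with a next_token method; real implementations verify all k draft tokens in one batched target forward pass, which is where the speedup comes from.

def speculative_step(prefix: list[int], k: int = 4) -> list[int]:
    # 1. The cheap draft model proposes k tokens.
    draft = []
    for _ in range(k):
        draft.append(draft_model.next_token(prefix + draft))

    # 2. Accept draft tokens while they match the target model's greedy choice;
    #    on the first mismatch, keep the target's token and stop.
    accepted = []
    for tok in draft:
        expected = target_model.next_token(prefix + accepted)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted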

TTFT in Streaming vs Non-Streaming

Streaming doesn't change TTFT. It changes perception of the decode phase.

Non-streaming: [TTFT: 500ms] → [Nothing] → [Full response at 3s]
Streaming:     [TTFT: 500ms] → [Token stream for 2.5s]

Users perceive both as having a 500ms "thinking" delay. But streaming feels faster because they see progress.

The takeaway: streaming helps the decode phase feel faster, but TTFT is the fixed cost you always pay.

Measuring TTFT Correctly

Don't measure from when your backend receives the request. Measure from the user's perspective:

# Wrong: Measures only model time
ttft = model.generate_ttft(prompt)

# Better: Measures from the API call (requires a streaming request)
start = time.time()
response = await client.chat.completions.create(..., stream=True)
first_chunk = await anext(response)
ttft = time.time() - start

# Best: Measures from user action (requires client instrumentation)
# button_click_time → first_token_render_time

Network latency, load balancer queuing, and token parsing all add to what the user experiences. A 200ms model TTFT can become 800ms end-to-end.

When TTFT Matters Most

Not all use cases are equally sensitive:

Chat: TTFT is critical. Users expect immediate feedback. Above 1 second feels broken.

Code completion: TTFT is critical. The cursor is blinking. Above 200ms breaks flow.

Batch processing: TTFT doesn't matter. Total completion time matters.

Summarization: TTFT matters less. Users expect long documents to take time.

Match your optimization to your use case. Don't sacrifice throughput for TTFT on batch jobs. Don't sacrifice TTFT for throughput on chat.

The first token is a promise that the rest are coming. Make that promise quickly.