Prefill vs Decode: The Two Phases That Shape Latency
Your prompt has 2,000 tokens. Your output has 200 tokens. Which phase dominates latency?
The answer isn't obvious from the token counts alone. LLM inference has two fundamentally different phases with different bottlenecks, and optimizing the wrong one wastes engineering effort.
The Two Phases
Prefill (also called prompt processing): The model processes your entire input prompt in parallel, computing attention across all input tokens. This builds the KV cache that decode will use.
Decode (also called generation): The model generates output tokens one at a time, each token attending to all previous tokens via the KV cache.
Request lifecycle:
[Input prompt] → [PREFILL] → [First token] → [DECODE] → [DECODE] → ... → [Done]
                     ↓              ↓                ↓
               Compute-bound   TTFT happens     Memory-bound
Different Bottlenecks
Prefill is compute-bound. You're doing massive matrix multiplications across thousands of tokens in parallel. The GPU's compute units are the bottleneck.
Decode is memory-bound. You're generating one token at a time, and each step requires streaming all of the model weights (plus the growing KV cache) from memory. Memory bandwidth is the bottleneck.
This explains a counterintuitive fact: doubling your batch size during decode barely increases latency (you're already memory-bound), but doubling prompt length during prefill roughly doubles prefill time (you're compute-bound).
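A rough, roofline-style estimate makes the difference concrete. The hardware and model numbers below are assumptions for illustration (a 7B fp16 model on a GPU with a few hundred TFLOPS and roughly 2 TB/s of memory bandwidth), and the sketch ignores attention FLOPs, KV-cache reads, and batching; the point is the shape of the two formulas, not the exact milliseconds.

# Back-of-envelope model of single-request prefill vs decode time.
# All hardware and model numbers are illustrative assumptions.
MODEL_PARAMS = 7e9        # 7B-parameter model
BYTES_PER_PARAM = 2       # fp16 weights
PEAK_FLOPS = 300e12       # assumed usable compute, FLOP/s
MEM_BANDWIDTH = 2e12      # assumed HBM bandwidth, bytes/s

def prefill_seconds(prompt_tokens: int) -> float:
    # Compute-bound: roughly 2 FLOPs per parameter per prompt token.
    return (2 * MODEL_PARAMS * prompt_tokens) / PEAK_FLOPS

def decode_seconds(output_tokens: int) -> float:
    # Memory-bound: every generated token re-reads all the weights.
    return (MODEL_PARAMS * BYTES_PER_PARAM * output_tokens) / MEM_BANDWIDTH

print(f"prefill 2,000 tokens: ~{prefill_seconds(2000) * 1000:.0f} ms")
print(f"decode 200 tokens:    ~{decode_seconds(200) * 1000:.0f} ms")

Doubling prompt_tokens doubles the prefill estimate, while decode cost is fixed per token by how many bytes have to move, which is why the two phases respond to different optimizations.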
Measuring Each Phase
import time


def measure_phases(client, prompt: str) -> dict:
    """Split a streamed completion into prefill (TTFT) and decode time.

    Note: TTFT also includes network and queueing latency, and streamed
    chunks are not exactly one token each, so treat these as estimates.
    """
    start = time.time()
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    first_token_time = None
    token_count = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time()  # first content marks the end of prefill
            token_count += 1
    end = time.time()

    if first_token_time is None:
        # Nothing was streamed back; there is nothing to split into phases.
        return {"prefill_ms": 0.0, "decode_ms": 0.0, "tokens_generated": 0, "ms_per_token": 0.0}

    prefill_time = first_token_time - start   # time to first token (TTFT)
    decode_time = end - first_token_time      # time spent generating everything else
    return {
        "prefill_ms": prefill_time * 1000,
        "decode_ms": decode_time * 1000,
        "tokens_generated": token_count,
        "ms_per_token": (decode_time * 1000) / token_count if token_count > 0 else 0,
    }
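A minimal way to run it, assuming the official OpenAI Python SDK (the model name and prompt are just examples):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = measure_phases(client, "Summarize the history of the transistor in one paragraph.")
print(f"TTFT (~prefill): {result['prefill_ms']:.0f} ms")
print(f"decode: {result['decode_ms']:.0f} ms, "
      f"{result['ms_per_token']:.1f} ms per chunk over {result['tokens_generated']} chunks")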
Optimization Strategies Differ
For prefill optimization:
- Reduce prompt length (fewer tokens = less compute)
- Use prefix caching (skip redundant prefill for repeated content; see the sketch after this list)
- FlashAttention (more efficient attention computation)
- Quantization helps less (still compute-bound)
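Prefix caching only pays off if the repeated content forms a byte-identical prefix. Here is a minimal sketch of prompt structuring that keeps the static parts first and unchanged across requests; the file name and message layout are placeholders, and whether the cache is actually reused depends on your serving stack.

# Static content: keep it identical across requests so a prefix-caching
# server can reuse the KV cache it built the first time it saw it.
SYSTEM_PROMPT = "You are a contract-review assistant."
DOCUMENT = open("contract.txt").read()  # hypothetical file; large and unchanging

def build_messages(question: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Document:\n{DOCUMENT}"},
        # Only this small, final message varies between requests.
        {"role": "user", "content": question},
    ]

Anything that perturbs the prefix, such as a timestamp in the system prompt or reordered few-shot examples, forces a full prefill again, so per-request content belongs at the end.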
For decode optimization:
- Increase memory bandwidth (faster GPU, HBM3 vs HBM2)
- Reduce model size via quantization (less memory to read; sketched after this list)
- Speculative decoding (generate multiple tokens per forward pass)
- Larger batch sizes (amortize memory reads across requests)
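Because decode time tracks how many bytes of weights each token has to stream from memory, loading a quantized model is one of the most direct levers. A sketch assuming the Hugging Face transformers + bitsandbytes stack; the model name is just an example, and the actual speedup depends on the dequantization kernels, so measure ms-per-token before and after.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model

# 4-bit weights are ~4x fewer bytes to read per decode step than fp16,
# which is exactly what a memory-bound phase cares about.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Explain HBM in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))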
When Each Phase Dominates
Short prompts, long outputs: Decode dominates. A 100-token prompt generating 1,000 tokens spends most time in decode.
Long prompts, short outputs: Prefill dominates. A 10,000-token document with a 50-token summary spends most time in prefill.
# Prefill-dominated workload
summarize_document(document_50k_tokens) # 95% prefill, 5% decode
# Decode-dominated workload
write_essay(topic_100_tokens) # 10% prefill, 90% decode
# Balanced workload
chat_with_context(history_2k_tokens) # ~50/50
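Once you've measured your own prefill throughput and decode rate (for example with measure_phases above), a small model tells you which regime a workload is in before you optimize anything. The default rates below are placeholders, not benchmarks.

def phase_split(prompt_tokens: int, output_tokens: int,
                prefill_tokens_per_s: float = 5_000.0,  # placeholder, measure your own
                decode_tokens_per_s: float = 50.0) -> dict:
    # Convert token counts into estimated time per phase, then into shares.
    prefill_s = prompt_tokens / prefill_tokens_per_s
    decode_s = output_tokens / decode_tokens_per_s
    total = prefill_s + decode_s
    return {"prefill_pct": 100 * prefill_s / total,
            "decode_pct": 100 * decode_s / total}

print(phase_split(50_000, 50))   # summarization-shaped: prefill-heavy
print(phase_split(100, 1_000))   # essay-shaped: decode-heavy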
The Batching Tradeoff
This is where it gets interesting for serving systems.
During prefill, batching requests together increases latency for each individual request: every extra prompt adds real compute work. During decode, batching barely affects per-request latency: the GPU is mostly waiting on memory reads anyway, and the expensive weight reads are shared across the whole batch.
Smart serving systems like vLLM exploit this: they batch decode steps aggressively but are more careful about prefill batching.
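As an illustration of the decode side of that tradeoff, vLLM's offline API makes the batching win easy to see: hand it many prompts at once and the engine schedules their decode steps together. A sketch, with the model, prompt set, and sampling settings as arbitrary examples:

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
params = SamplingParams(max_tokens=200, temperature=0.7)

prompts = [f"Write a one-line product description for gadget #{i}." for i in range(32)]

# Decode steps for all 32 requests are batched, so the cost of streaming
# the weights from memory is shared instead of paid 32 times over.
outputs = llm.generate(prompts, params)
for out in outputs[:3]:
    print(out.outputs[0].text.strip()[:80])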
Practical Implications
If your users complain about slow "thinking time" (long pause before response starts), optimize prefill:
- Shorten system prompts
- Enable prefix caching
- Consider a faster model for the prefill phase
If your users complain about slow "typing speed" (text appears slowly), optimize decode:
- Use a quantized model
- Ensure you're not CPU-bound on token processing
- Check your streaming implementation
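On that last point, the most common mistake is accumulating the whole response before displaying anything, which makes decode speed look like one long pause. A minimal streaming loop that surfaces text as it arrives, assuming the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain memory bandwidth in two sentences."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        # Flush each chunk immediately instead of buffering the full reply.
        print(delta, end="", flush=True)
print()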
The phase that's slow determines the optimization that matters.