Prefill vs Decode: The Two Phases That Shape Latency
Your prompt has 2,000 tokens. Your output has 200 tokens. Which phase dominates latency?
The answer isn't obvious from the token counts alone. LLM inference has two fundamentally different phases with different bottlenecks, and optimizing the wrong one wastes engineering effort.
The Two Phases
Prefill (also called prompt processing): The model processes your entire input prompt in parallel, computing attention across all input tokens. This builds the KV cache that decode will use.
Decode (also called generation): The model generates output tokens one at a time, each token attending to all previous tokens via the KV cache.
Request lifecycle:
[Input prompt] → [PREFILL] → [First token] → [DECODE] → [DECODE] → ... → [Done]
                     ↓              ↓                ↓
               Compute-bound   TTFT happens     Memory-bound
Different Bottlenecks
Prefill is compute-bound. You're doing massive matrix multiplications across thousands of tokens in parallel. The GPU's compute units are the bottleneck.
Decode is memory-bound. You're generating one token at a time, and each step requires streaming all of the model weights (plus the growing KV cache) from memory. Memory bandwidth is the bottleneck.
This explains a counterintuitive fact: doubling your batch size during decode barely increases latency (you're already memory-bound), but doubling prompt length during prefill roughly doubles prefill time (you're compute-bound).
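A rough, roofline-style estimate makes the difference concrete. The hardware and model numbers below are assumptions for illustration (a 7B fp16 model on a GPU with a few hundred TFLOPS and roughly 2 TB/s of memory bandwidth), and the sketch ignores attention FLOPs, KV-cache reads, and batching; the point is the shape of the two formulas, not the exact milliseconds.

# Back-of-envelope model of single-request prefill vs decode time.
# All hardware and model numbers are illustrative assumptions.
MODEL_PARAMS = 7e9        # 7B-parameter model
BYTES_PER_PARAM = 2       # fp16 weights
PEAK_FLOPS = 300e12       # assumed usable compute, FLOP/s
MEM_BANDWIDTH = 2e12      # assumed HBM bandwidth, bytes/s

def prefill_seconds(prompt_tokens: int) -> float:
    # Compute-bound: roughly 2 FLOPs per parameter per prompt token.
    return (2 * MODEL_PARAMS * prompt_tokens) / PEAK_FLOPS

def decode_seconds(output_tokens: int) -> float:
    # Memory-bound: every generated token re-reads all the weights.
    return (MODEL_PARAMS * BYTES_PER_PARAM * output_tokens) / MEM_BANDWIDTH

print(f"prefill 2,000 tokens: ~{prefill_seconds(2000) * 1000:.0f} ms")
print(f"decode 200 tokens:    ~{decode_seconds(200) * 1000:.0f} ms")

Doubling prompt_tokens doubles the prefill estimate, while decode cost is fixed per token by how many bytes have to move, which is why the two phases respond to different optimizations.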
Measuring Each Phase
import time


def measure_phases(client, prompt: str) -> dict:
    """Split a streamed completion into prefill (TTFT) and decode time.

    Note: TTFT also includes network and queueing latency, and streamed
    chunks are not exactly one token each, so treat these as estimates.
    """
    start = time.time()
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    first_token_time = None
    token_count = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time()  # first content marks the end of prefill
            token_count += 1
    end = time.time()

    if first_token_time is None:
        # Nothing was streamed back; there is nothing to split into phases.
        return {"prefill_ms": 0.0, "decode_ms": 0.0, "tokens_generated": 0, "ms_per_token": 0.0}

    prefill_time = first_token_time - start   # time to first token (TTFT)
    decode_time = end - first_token_time      # time spent generating everything else
    return {
        "prefill_ms": prefill_time * 1000,
        "decode_ms": decode_time * 1000,
        "tokens_generated": token_count,
        "ms_per_token": (decode_time * 1000) / token_count if token_count > 0 else 0,
    }
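A minimal way to run it, assuming the official OpenAI Python SDK (the model name and prompt are just examples):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = measure_phases(client, "Summarize the history of the transistor in one paragraph.")
print(f"TTFT (~prefill): {result['prefill_ms']:.0f} ms")
print(f"decode: {result['decode_ms']:.0f} ms, "
      f"{result['ms_per_token']:.1f} ms per chunk over {result['tokens_generated']} chunks")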
Optimization Strategies Differ
For prefill optimization:
- Reduce prompt length (fewer tokens = less compute)
- Use prefix caching (skip redundant prefill for repeated content; see the sketch after this list)
- FlashAttention (more efficient attention computation)
- Quantization helps less (still compute-bound)
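Prefix caching only pays off if the repeated content forms a byte-identical prefix. Here is a minimal sketch of prompt structuring that keeps the static parts first and unchanged across requests; the file name and message layout are placeholders, and whether the cache is actually reused depends on your serving stack.

# Static content: keep it identical across requests so a prefix-caching
# server can reuse the KV cache it built the first time it saw it.
SYSTEM_PROMPT = "You are a contract-review assistant."
DOCUMENT = open("contract.txt").read()  # hypothetical file; large and unchanging

def build_messages(question: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Document:\n{DOCUMENT}"},
        # Only this small, final message varies between requests.
        {"role": "user", "content": question},
    ]

Anything that perturbs the prefix, such as a timestamp in the system prompt or reordered few-shot examples, forces a full prefill again, so per-request content belongs at the end.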
For decode optimization:
- Increase memory bandwidth (faster GPU, HBM3 vs HBM2)
- Reduce model size via quantization (less memory to read; sketched after this list)
- Speculative decoding (generate multiple tokens per forward pass)
- Larger batch sizes (amortize memory reads across requests)
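Because decode time tracks how many bytes of weights each token has to stream from memory, loading a quantized model is one of the most direct levers. A sketch assuming the Hugging Face transformers + bitsandbytes stack; the model name is just an example, and the actual speedup depends on the dequantization kernels, so measure ms-per-token before and after.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model

# 4-bit weights are ~4x fewer bytes to read per decode step than fp16,
# which is exactly what a memory-bound phase cares about.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Explain HBM in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))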
When Each Phase Dominates
Short prompts, long outputs: Decode dominates. A 100-token prompt generating 1,000 tokens spends most time in decode.
Long prompts, short outputs: Prefill dominates. A 10,000-token document with a 50-token summary spends most time in prefill.
# Prefill-dominated workload
summarize_document(document_50k_tokens) # 95% prefill, 5% decode
# Decode-dominated workload
write_essay(topic_100_tokens) # 10% prefill, 90% decode
# Balanced workload
chat_with_context(history_2k_tokens) # ~50/50
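Once you've measured your own prefill throughput and decode rate (for example with measure_phases above), a small model tells you which regime a workload is in before you optimize anything. The default rates below are placeholders, not benchmarks.

def phase_split(prompt_tokens: int, output_tokens: int,
                prefill_tokens_per_s: float = 5_000.0,  # placeholder, measure your own
                decode_tokens_per_s: float = 50.0) -> dict:
    # Convert token counts into estimated time per phase, then into shares.
    prefill_s = prompt_tokens / prefill_tokens_per_s
    decode_s = output_tokens / decode_tokens_per_s
    total = prefill_s + decode_s
    return {"prefill_pct": 100 * prefill_s / total,
            "decode_pct": 100 * decode_s / total}

print(phase_split(50_000, 50))   # summarization-shaped: prefill-heavy
print(phase_split(100, 1_000))   # essay-shaped: decode-heavy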
The Batching Tradeoff
This is where it gets interesting for serving systems.
During prefill, batching requests together increases latency for each individual request: every extra prompt adds real compute work. During decode, batching barely affects per-request latency: the GPU is mostly waiting on memory reads anyway, and the expensive weight reads are shared across the whole batch.
Smart serving systems like vLLM exploit this: they batch decode steps aggressively but are more careful about prefill batching.
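As an illustration of the decode side of that tradeoff, vLLM's offline API makes the batching win easy to see: hand it many prompts at once and the engine schedules their decode steps together. A sketch, with the model, prompt set, and sampling settings as arbitrary examples:

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
params = SamplingParams(max_tokens=200, temperature=0.7)

prompts = [f"Write a one-line product description for gadget #{i}." for i in range(32)]

# Decode steps for all 32 requests are batched, so the cost of streaming
# the weights from memory is shared instead of paid 32 times over.
outputs = llm.generate(prompts, params)
for out in outputs[:3]:
    print(out.outputs[0].text.strip()[:80])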
Practical Implications
If your users complain about slow "thinking time" (long pause before response starts), optimize prefill:
- Shorten system prompts
- Enable prefix caching
- Consider a faster model for the prefill phase
If your users complain about slow "typing speed" (text appears slowly), optimize decode:
- Use a quantized model
- Ensure you're not CPU-bound on token processing
- Check your streaming implementation
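On that last point, the most common mistake is accumulating the whole response before displaying anything, which makes decode speed look like one long pause. A minimal streaming loop that surfaces text as it arrives, assuming the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain memory bandwidth in two sentences."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        # Flush each chunk immediately instead of buffering the full reply.
        print(delta, end="", flush=True)
print()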
The phase that's slow determines the optimization that matters.