Separating Real Speedups from Benchmarketing
Every optimization claims a 2-10x speedup. If you stacked them all, inference would be faster than the speed of light. Something doesn't add up.
The claims aren't lies. They're measured under conditions that may not match yours.
FlashAttention: Real, With Caveats
FlashAttention genuinely helps. It reduces memory usage and speeds up attention computation by avoiding materialization of the full attention matrix.
Real speedup: 1.5-3x for long sequences, minimal for short sequences.
The caveat: The speedup is most pronounced during prefill with long contexts. For short prompts or during decode, the improvement is modest.
# Where FlashAttention shines
prompt_length = 8192 # Long context
# FlashAttention: 400ms prefill
# Standard attention: 1200ms prefill
# Speedup: 3x
# Where it matters less
prompt_length = 256 # Short prompt
# FlashAttention: 25ms prefill
# Standard attention: 35ms prefill
# Speedup: 1.4x
If your workload is short prompts, don't expect miracles.
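On recent PyTorch you typically get this benefit without the standalone flash-attn package: torch.nn.functional.scaled_dot_product_attention dispatches to a FlashAttention-style fused kernel when the GPU and dtype support it. A minimal sketch, with illustrative shapes rather than the benchmark configuration above:
import torch
import torch.nn.functional as F
# Minimal sketch: fused attention via PyTorch SDPA, which selects a
# FlashAttention-style kernel on supported GPUs (fp16/bf16, causal masks).
# Shapes here are illustrative assumptions, not measured benchmarks.
batch, heads, seq_len, head_dim = 1, 32, 8192, 128
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
The fused kernel never materializes the seq_len × seq_len score matrix, which is where both the memory savings and the long-context prefill speedup come from.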
CUDA Graphs: Real, But Narrow
CUDA graphs capture a sequence of GPU operations and replay them without CPU overhead. The "10x speedup" claims come from eliminating kernel launch overhead.
Real speedup: Significant for small batches with many kernel launches. Minimal for large batches where kernel execution dominates.
# Where CUDA graphs help
batch_size = 1
tokens_per_request = 50
# Without graphs: 40ms (30ms kernel launches, 10ms compute)
# With graphs: 15ms (5ms graph replay, 10ms compute)
# Speedup: 2.7x
# Where they don't help much
batch_size = 32
tokens_per_request = 50
# Without graphs: 180ms (30ms kernel launches, 150ms compute)
# With graphs: 155ms (5ms graph replay, 150ms compute)
# Speedup: 1.16x
CUDA graphs work best for latency-sensitive single-request inference. For throughput-oriented batch processing, the benefit is marginal.
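In PyTorch, capture and replay look roughly like the sketch below. The fixed-shape model and the static input/output buffers are placeholders; real serving engines also manage warmup, memory pools, and shape bucketing for you.
import torch
# Sketch: capture one fixed-shape forward pass into a CUDA graph, then replay it.
# 'model' and 'next_input' are illustrative placeholders, not a real engine API.
static_input = torch.randn(1, 4096, device="cuda")
graph = torch.cuda.CUDAGraph()
# Warm up on a side stream so one-time initializations happen outside capture.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)
# Capture records the kernel sequence once, so it can be launched as a unit later.
with torch.cuda.graph(graph):
    static_output = model(static_input)
# Replay: copy new data into the static buffer, then launch the whole graph at once.
static_input.copy_(next_input)
graph.replay()  # static_output now holds the new result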
Quantization: Real, With Quality Tradeoffs
INT8 quantization reduces memory bandwidth requirements by 2x compared to FP16. This directly speeds up memory-bound decode.
Real speedup: 1.5-2x for decode phase on memory-bound workloads.
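A quick back-of-the-envelope calculation shows why: decode is typically bound by how fast the GPU can stream the weights, so halving the bytes per parameter roughly halves the time per token. The model size and bandwidth below are illustrative assumptions, not measurements.
# Rough decode ceiling set by weight-streaming bandwidth (illustrative numbers).
params = 7e9                  # assumed 7B-parameter model
hbm_bandwidth = 1.0e12        # assumed ~1 TB/s of GPU memory bandwidth
bytes_fp16 = params * 2       # 14 GB of weights at FP16
bytes_int8 = params * 1       # 7 GB of weights at INT8
# Each decoded token reads (roughly) every weight once.
ceiling_fp16 = hbm_bandwidth / bytes_fp16   # ~71 tokens/s upper bound
ceiling_int8 = hbm_bandwidth / bytes_int8   # ~143 tokens/s upper bound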
The tradeoff: Quality degradation varies by task. Some tasks tolerate INT8 well. Others don't.
# Measure before deploying. compare_quality and compare_speed are placeholders
# for your own task-specific eval metric and latency benchmark.
def evaluate_quantization_impact(model_fp16, model_int8, eval_set):
    # Generate with both models on the same prompts from your workload.
    results_fp16 = [model_fp16.generate(p) for p in eval_set]
    results_int8 = [model_int8.generate(p) for p in eval_set]
    quality_diff = compare_quality(results_fp16, results_int8)  # percentage points
    speed_diff = compare_speed(model_fp16, model_int8)          # percent faster
    return {
        "quality_regression": quality_diff,  # e.g., -2% on your eval
        "speed_improvement": speed_diff,     # e.g., +80%
        # Accept at most 5 points of quality loss for at least 50% more speed.
        "worth_it": quality_diff > -5 and speed_diff > 50,
    }
Continuous Batching: Real and Underrated
Continuous batching (iteration-level scheduling) is genuinely one of the biggest practical improvements. Instead of waiting for all requests in a batch to complete, new requests can join mid-generation.
Real speedup: 2-5x throughput improvement under load.
Static batching:
[Req A: 100 tokens][Req B: 500 tokens][Req C: 50 tokens]
↓
All wait for B to finish (500 iterations)
↓
Average wait: 500 iterations
Continuous batching:
[A finishes at 100] → [D joins]
[C finishes at 50] → [E joins]
[B continues...]
↓
A and C don't wait for B
This is why vLLM and similar systems significantly outperform naive implementations.
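The scheduling idea itself fits in a few lines. engine.step() below is a hypothetical stand-in for one decode iteration over the active batch, not a real vLLM API:
from collections import deque
def continuous_batching_loop(engine, incoming, max_batch_size=32):
    # Sketch of iteration-level scheduling. 'engine.step' is a hypothetical API
    # that runs one decode iteration and returns whichever requests finished.
    waiting = deque(incoming)
    running = []
    while running or waiting:
        # Admit new requests the moment a slot frees up,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        finished = engine.step(running)
        running = [req for req in running if req not in finished]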
Speculative Decoding: Real, Situationally
Speculative decoding uses a small "draft" model to propose the next few tokens, then verifies them with the large target model in a single forward pass. When the proposals match what the target would have produced, you emit multiple tokens per target forward pass.
Real speedup: 1.5-2.5x when the draft model predicts the target well. 1x or worse when it doesn't.
# Works well when:
# - Draft model is similar to target (same family, smaller size)
# - Output is predictable (code, structured data)
# - You can tolerate the draft model overhead
# Works poorly when:
# - Output is creative/unpredictable
# - Draft model is too different from target
# - Single-request latency is already low
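A stripped-down greedy version of the draft-and-verify loop looks like the sketch below. The models are assumed to be Hugging Face-style causal LMs that return .logits; a real implementation also samples and appends the target's own token at the first mismatch, which this sketch omits.
import torch
@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    # Sketch only: greedy drafting and verification, HF-style models assumed.
    seq_len = input_ids.shape[1]
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        next_tok = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    proposed = draft_ids[:, seq_len:]                      # [batch, k]
    # 2. Target model scores all k proposals in ONE forward pass.
    target_logits = target(draft_ids).logits
    target_pred = target_logits[:, seq_len - 1:-1, :].argmax(-1)
    # 3. Accept the longest prefix where draft and target agree.
    agree = (proposed == target_pred).long().cumprod(dim=-1)
    n_accept = int(agree.sum(dim=-1).min())
    return torch.cat([input_ids, proposed[:, :n_accept]], dim=-1)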
What Actually Moves the Needle
In practice, for most production workloads:
| Optimization | Typical Real Impact | When It Helps |
|---|---|---|
| Continuous batching | 2-5x throughput | High concurrency |
| FlashAttention | 1.5-3x prefill | Long prompts |
| INT8 quantization | 1.5-2x decode | Memory-bound |
| Prefix caching | 2-10x for repeated prefixes | Shared system prompts |
| Speculative decoding | 1.5-2x | Predictable outputs |
| CUDA graphs | 1.2-2x | Low batch sizes |
The biggest gains often come from:
- Not doing unnecessary work (caching, shorter prompts)
- Better scheduling (continuous batching)
- Right-sizing the model (smaller model that's good enough)
The "10x speedup" claims are usually comparing against an unoptimized baseline. Once you've picked the low-hanging fruit, incremental gains get smaller.