Separating Real Speedups from Benchmarketing
Every optimization claims a 2-10x speedup. If you stacked them all, inference would be faster than the speed of light. Something doesn't add up.
The claims aren't lies. They're measured under conditions that may not match yours.
FlashAttention: Real, With Caveats
FlashAttention genuinely helps. It reduces memory usage and speeds up attention computation by avoiding materialization of the full attention matrix.
Real speedup: 1.5-3x for long sequences, minimal for short sequences.
The caveat: The speedup is most pronounced during prefill with long contexts. For short prompts or during decode, the improvement is modest.
# Where FlashAttention shines
prompt_length = 8192 # Long context
# FlashAttention: 400ms prefill
# Standard attention: 1200ms prefill
# Speedup: 3x
# Where it matters less
prompt_length = 256 # Short prompt
# FlashAttention: 25ms prefill
# Standard attention: 35ms prefill
# Speedup: 1.4x
If your workload is short prompts, don't expect miracles.
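On recent PyTorch you typically get this benefit without the standalone flash-attn package: torch.nn.functional.scaled_dot_product_attention dispatches to a FlashAttention-style fused kernel when the GPU and dtype support it. A minimal sketch, with illustrative shapes rather than the benchmark configuration above:
import torch
import torch.nn.functional as F
# Minimal sketch: fused attention via PyTorch SDPA, which selects a
# FlashAttention-style kernel on supported GPUs (fp16/bf16, causal masks).
# Shapes here are illustrative assumptions, not measured benchmarks.
batch, heads, seq_len, head_dim = 1, 32, 8192, 128
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
The fused kernel never materializes the seq_len × seq_len score matrix, which is where both the memory savings and the long-context prefill speedup come from.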
CUDA Graphs: Real, But Narrow
CUDA graphs capture a sequence of GPU operations and replay them without CPU overhead. The "10x speedup" claims come from eliminating kernel launch overhead.
Real speedup: Significant for small batches with many kernel launches. Minimal for large batches where kernel execution dominates.
# Where CUDA graphs help
batch_size = 1
tokens_per_request = 50
# Without graphs: 40ms (30ms kernel launches, 10ms compute)
# With graphs: 15ms (5ms graph replay, 10ms compute)
# Speedup: 2.7x
# Where they don't help much
batch_size = 32
tokens_per_request = 50
# Without graphs: 180ms (30ms kernel launches, 150ms compute)
# With graphs: 155ms (5ms graph replay, 150ms compute)
# Speedup: 1.16x
CUDA graphs work best for latency-sensitive single-request inference. For throughput-oriented batch processing, the benefit is marginal.
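In PyTorch, capture and replay look roughly like the sketch below. The fixed-shape model and the static input/output buffers are placeholders; real serving engines also manage warmup, memory pools, and shape bucketing for you.
import torch
# Sketch: capture one fixed-shape forward pass into a CUDA graph, then replay it.
# 'model' and 'next_input' are illustrative placeholders, not a real engine API.
static_input = torch.randn(1, 4096, device="cuda")
graph = torch.cuda.CUDAGraph()
# Warm up on a side stream so one-time initializations happen outside capture.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)
# Capture records the kernel sequence once, so it can be launched as a unit later.
with torch.cuda.graph(graph):
    static_output = model(static_input)
# Replay: copy new data into the static buffer, then launch the whole graph at once.
static_input.copy_(next_input)
graph.replay()  # static_output now holds the new result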
Quantization: Real, With Quality Tradeoffs
INT8 quantization reduces memory bandwidth requirements by 2x compared to FP16. This directly speeds up memory-bound decode.
Real speedup: 1.5-2x for decode phase on memory-bound workloads.
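A quick back-of-the-envelope calculation shows why: decode is typically bound by how fast the GPU can stream the weights, so halving the bytes per parameter roughly halves the time per token. The model size and bandwidth below are illustrative assumptions, not measurements.
# Rough decode ceiling set by weight-streaming bandwidth (illustrative numbers).
params = 7e9                  # assumed 7B-parameter model
hbm_bandwidth = 1.0e12        # assumed ~1 TB/s of GPU memory bandwidth
bytes_fp16 = params * 2       # 14 GB of weights at FP16
bytes_int8 = params * 1       # 7 GB of weights at INT8
# Each decoded token reads (roughly) every weight once.
ceiling_fp16 = hbm_bandwidth / bytes_fp16   # ~71 tokens/s upper bound
ceiling_int8 = hbm_bandwidth / bytes_int8   # ~143 tokens/s upper bound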
The tradeoff: Quality degradation varies by task. Some tasks tolerate INT8 well. Others don't.
# Measure before deploying. compare_quality and compare_speed are placeholders
# for your own task-specific eval metric and latency benchmark.
def evaluate_quantization_impact(model_fp16, model_int8, eval_set):
    # Generate with both models on the same prompts from your workload.
    results_fp16 = [model_fp16.generate(p) for p in eval_set]
    results_int8 = [model_int8.generate(p) for p in eval_set]
    quality_diff = compare_quality(results_fp16, results_int8)  # percentage points
    speed_diff = compare_speed(model_fp16, model_int8)          # percent faster
    return {
        "quality_regression": quality_diff,  # e.g., -2% on your eval
        "speed_improvement": speed_diff,     # e.g., +80%
        # Accept at most 5 points of quality loss for at least 50% more speed.
        "worth_it": quality_diff > -5 and speed_diff > 50,
    }
Continuous Batching: Real and Underrated
Continuous batching (iteration-level scheduling) is genuinely one of the biggest practical improvements. Instead of waiting for all requests in a batch to complete, new requests can join mid-generation.
Real speedup: 2-5x throughput improvement under load.
Static batching:
[Req A: 100 tokens][Req B: 500 tokens][Req C: 50 tokens]
↓
All wait for B to finish (500 iterations)
↓
Average wait: 500 iterations
Continuous batching:
[A finishes at 100] → [D joins]
[C finishes at 50] → [E joins]
[B continues...]
↓
A and C don't wait for B
This is why vLLM and similar systems significantly outperform naive implementations.
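The scheduling idea itself fits in a few lines. engine.step() below is a hypothetical stand-in for one decode iteration over the active batch, not a real vLLM API:
from collections import deque
def continuous_batching_loop(engine, incoming, max_batch_size=32):
    # Sketch of iteration-level scheduling. 'engine.step' is a hypothetical API
    # that runs one decode iteration and returns whichever requests finished.
    waiting = deque(incoming)
    running = []
    while running or waiting:
        # Admit new requests the moment a slot frees up,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        finished = engine.step(running)
        running = [req for req in running if req not in finished]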
Speculative Decoding: Real, Situationally
Speculative decoding uses a small "draft" model to propose the next few tokens, then verifies them with the large target model in a single forward pass. When the proposals match what the target would have produced, you emit multiple tokens per target forward pass.
Real speedup: 1.5-2.5x when the draft model predicts the target well. 1x or worse when it doesn't.
# Works well when:
# - Draft model is similar to target (same family, smaller size)
# - Output is predictable (code, structured data)
# - You can tolerate the draft model overhead
# Works poorly when:
# - Output is creative/unpredictable
# - Draft model is too different from target
# - Single-request latency is already low
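A stripped-down greedy version of the draft-and-verify loop looks like the sketch below. The models are assumed to be Hugging Face-style causal LMs that return .logits; a real implementation also samples and appends the target's own token at the first mismatch, which this sketch omits.
import torch
@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    # Sketch only: greedy drafting and verification, HF-style models assumed.
    seq_len = input_ids.shape[1]
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        next_tok = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    proposed = draft_ids[:, seq_len:]                      # [batch, k]
    # 2. Target model scores all k proposals in ONE forward pass.
    target_logits = target(draft_ids).logits
    target_pred = target_logits[:, seq_len - 1:-1, :].argmax(-1)
    # 3. Accept the longest prefix where draft and target agree.
    agree = (proposed == target_pred).long().cumprod(dim=-1)
    n_accept = int(agree.sum(dim=-1).min())
    return torch.cat([input_ids, proposed[:, :n_accept]], dim=-1)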
What Actually Moves the Needle
In practice, for most production workloads:
| Optimization | Typical Real Impact | When It Helps |
|---|---|---|
| Continuous batching | 2-5x throughput | High concurrency |
| FlashAttention | 1.5-3x prefill | Long prompts |
| INT8 quantization | 1.5-2x decode | Memory-bound |
| Prefix caching | 2-10x for repeated prefixes | Shared system prompts |
| Speculative decoding | 1.5-2x | Predictable outputs |
| CUDA graphs | 1.2-2x | Low batch sizes |
The biggest gains often come from:
- Not doing unnecessary work (caching, shorter prompts)
- Better scheduling (continuous batching)
- Right-sizing the model (smaller model that's good enough)
The "10x speedup" claims are usually comparing against an unoptimized baseline. Once you've picked the low-hanging fruit, incremental gains get smaller.