7 posts tagged with "performance"

Where Speculative Decoding Actually Helps
Speculative decoding shines when outputs are predictable. Code completion, structured generation, and templates see 2x+ gains. Creative writing doesn't.
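A minimal sketch of the idea, assuming hypothetical `draft_model` and `target_model` callables (real systems use rejection sampling over full distributions; this greedy variant just shows why predictable text wins):

```python
def speculate_step(prefix, draft_model, target_model, k=4):
    """One speculative decoding step: draft k tokens cheaply,
    then verify them with a single target-model pass.
    draft_model and target_model are hypothetical stand-ins."""
    # 1. Draft k candidate tokens with the small, fast model.
    draft = []
    for _ in range(k):
        draft.append(draft_model(prefix + draft))

    # 2. One target-model pass scores every draft position at once.
    verified = target_model(prefix, draft)  # target's token per position

    # 3. Keep the longest agreeing prefix; the first mismatch ends the step.
    accepted = []
    for d, t in zip(draft, verified):
        if d != t:
            accepted.append(t)  # fall back to the target's token
            break
        accepted.append(d)
    return accepted  # up to k tokens for one expensive forward pass
```

On boilerplate-heavy code, most of the k drafts survive verification and you bank several tokens per target call; on creative prose the first draft misses and you've paid for the draft model on top of the target.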
Four GPUs don't give you 4x throughput. Communication overhead, load imbalance, and synchronization eat into gains. Know the scaling curve before you buy.
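The shape of the curve falls out of simple arithmetic. A toy strong-scaling model with illustrative numbers, not measurements: the compute slice shrinks with GPU count, the communication slice doesn't.

```python
def speedup(n_gpus, comm_fraction=0.15):
    # Amdahl-style estimate: compute parallelizes across GPUs,
    # the all-reduce style communication cost stays fixed per step.
    compute = (1 - comm_fraction) / n_gpus
    return 1 / (compute + comm_fraction)

for n in (1, 2, 4, 8):
    print(f"{n} GPUs -> {speedup(n):.2f}x")
# 1 -> 1.00x, 2 -> 1.74x, 4 -> 2.76x, 8 -> 3.90x:
# with just 15% communication overhead, 4 GPUs buy you under 3x.
```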
Every CUDA kernel launch has overhead. Fusing three operations into one can be 3x faster. Here's where fusion helps and how to get it.
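A sketch of one way to get fusion, assuming PyTorch 2.x on a CUDA device: `torch.compile` can fuse a chain of elementwise ops into a single generated kernel. The speedup you actually see depends on your shapes and hardware.

```python
import torch

def scale_shift_gelu(x, w, b):
    # Eager mode launches three kernels: mul, add, gelu.
    # Each one round-trips the full tensor through GPU memory.
    return torch.nn.functional.gelu(x * w + b)

fused = torch.compile(scale_shift_gelu)  # one fused kernel after warmup

x, w, b = (torch.randn(4096, 4096, device="cuda") for _ in range(3))
out = fused(x, w, b)  # first call compiles; later calls pay one launch
```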
Batch size 1 leaves most of the GPU idle. Batch size 64 kills latency. Somewhere in between is your sweet spot. Here's how to find it.
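Finding the knee is mostly a sweep. A sketch, where `run_batch` is a hypothetical hook that runs one forward pass at a given batch size:

```python
import time

def sweep(run_batch, sizes=(1, 2, 4, 8, 16, 32, 64)):
    for bs in sizes:
        start = time.perf_counter()
        run_batch(bs)                        # your model's forward pass
        elapsed = time.perf_counter() - start
        print(f"bs={bs:3d}  batch latency={elapsed * 1000:7.1f} ms  "
              f"throughput={bs / elapsed:8.1f} req/s")
```

Pick the point where throughput gains flatten but latency still clears your SLO.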
FlashAttention claims a 2-4x speedup. CUDA graphs claim 10x. What actually helps in production, and what's just good marketing?
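The only reliable answer is to time the claim on your own shapes. A minimal GPU timing harness, assuming PyTorch on a CUDA device; warmup and synchronization are exactly the steps most quick benchmarks skip:

```python
import time
import torch

def time_op(fn, *args, warmup=10, iters=100):
    for _ in range(warmup):        # trigger compilation, caches, clock ramp-up
        fn(*args)
    torch.cuda.synchronize()       # flush queued async work before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()       # wait until the GPU actually finishes
    return (time.perf_counter() - start) / iters

x = torch.randn(4096, 4096, device="cuda")
print(f"matmul: {time_op(torch.matmul, x, x) * 1e6:.0f} us/iter")
```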
That benchmark showing 10,000 tokens/second? It probably used batch size 64 and measured mean latency. Here's how to benchmark for reality.
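The fix is in the setup and the reporting: drive the server the way production does, one request at a time, and report percentiles instead of the mean. A sketch with a hypothetical `serve_request` stand-in:

```python
import statistics
import time

def benchmark(serve_request, n_requests=1000):
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        serve_request()            # ONE request, not a pre-packed batch of 64
        latencies_ms.append((time.perf_counter() - start) * 1000)
    q = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    print(f"p50={q[49]:.0f} ms  p95={q[94]:.0f} ms  p99={q[98]:.0f} ms  "
          f"mean={statistics.mean(latencies_ms):.0f} ms")
```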
Median latency is 200ms. One in a hundred requests takes 8 seconds. Your dashboard shows green. Your users are churning.
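Here's that scenario in three lines, with synthetic numbers matching the ones above: a 1-in-100 slow request leaves the median untouched and torches the p99.

```python
import statistics

latencies_ms = [200] * 99 + [8000]   # one request in a hundred hits 8 s

print(statistics.median(latencies_ms))                        # 200.0 -- dashboard green
print(round(statistics.mean(latencies_ms)))                   # 278 -- still looks fine
print(round(statistics.quantiles(latencies_ms, n=100)[98]))   # 7922 -- the churn
```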