Separating Real Speedups from Benchmarketing
FlashAttention claims 2-4x speedup. CUDA graphs claim 10x. What actually helps in production, and what's just good marketing?
2 posts tagged with "benchmarks"
That benchmark showing 10,000 tokens/second? It probably used batch size 64 and measured mean latency. Here's how to benchmark for reality.
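To make the point concrete, here is a minimal sketch of the two adjustments that excerpt hints at: dividing an aggregate throughput number by the batch size, and reporting tail percentiles next to the mean. The function names and the latency samples are hypothetical, made up for illustration rather than taken from any real benchmark, and the per-request figure assumes throughput is split evenly across the batch.

```python
import math
import statistics


def per_request_throughput(aggregate_tokens_per_s, batch_size):
    """Tokens/second seen by one request, assuming an even split across the batch."""
    return aggregate_tokens_per_s / batch_size


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Report the mean alongside median and tail latency."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }


# The headline number from the excerpt, viewed per request.
single_stream = per_request_throughput(10_000, 64)

# Hypothetical latencies (not measurements): 95 fast requests, 5 slow ones.
latencies_ms = [20.0] * 95 + [400.0] * 5
summary = summarize(latencies_ms)

print(f"per-request throughput: {single_stream:.2f} tokens/s")
print(f"mean={summary['mean']:.1f} ms  p50={summary['p50']:.1f} ms  p99={summary['p99']:.1f} ms")
```

With these made-up samples the mean sits between the typical request and the slow ones, so it describes neither; the p50 and p99 together show both what most users see and how bad the tail gets.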