Starting Cheap and Escalating When Needed
Try the small model first. If it fails or isn't confident, try the large one. Cascade routing gets 80% savings on 80% of requests.
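The cascade idea above can be sketched in a few lines. The model calls and the confidence threshold here are hypothetical placeholders, not a real provider API:

```python
# Minimal cascade-routing sketch. call_small / call_large and the
# 0.8 threshold are illustrative assumptions, not a real API.

def call_small(prompt):
    # Placeholder: a cheap model returning (answer, self-reported confidence).
    return "draft answer", 0.65

def call_large(prompt):
    # Placeholder: the expensive fallback model.
    return "careful answer"

def cascade(prompt, threshold=0.8):
    answer, confidence = call_small(prompt)
    if confidence >= threshold:
        return answer          # cheap path: most requests stop here
    return call_large(prompt)  # escalate only when the small model is unsure
```

In practice the hard part is the confidence signal: logprobs, a verifier model, or task-specific checks all work, with different cost/accuracy trade-offs.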
38 posts tagged with "optimization"
Send classification to Haiku, reasoning to Opus. Routing requests to the right model saves money without sacrificing quality.
Every CUDA kernel launch has overhead. Fusing three operations into one can be 3x faster. Here's where fusion helps and how to get it.
Flash Attention doesn't make attention faster. It makes attention fit in memory. The speedup is a side effect of better memory access.
3% degradation on summarization? Maybe fine. 3% on code generation? Could break your users. Here's how to set thresholds.
Everyone quantizes model weights. Few quantize the KV cache. But the cache is often the bigger memory consumer.
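A back-of-envelope calculation shows why the cache dominates. The shape parameters below (32 layers, 32 heads, head dim 128, i.e. a 7B-class model) are illustrative assumptions:

```python
# KV-cache size estimate for a hypothetical 7B-class model.
# Layer/head counts are illustrative, not taken from any specific model.

def kv_cache_bytes(batch, seq_len, layers=32, heads=32, head_dim=128,
                   bytes_per_elem=2):  # 2 bytes per element = FP16
    # Keys and values: 2 tensors per layer, each [batch, heads, seq_len, head_dim].
    return 2 * layers * batch * heads * seq_len * head_dim * bytes_per_elem

fp16 = kv_cache_bytes(batch=8, seq_len=4096)                     # 16 GiB
int8 = kv_cache_bytes(batch=8, seq_len=4096, bytes_per_elem=1)   # 8 GiB
print(fp16 / 2**30, "GiB FP16 ->", int8 / 2**30, "GiB INT8")
```

At batch 8 and 4K context the FP16 cache alone is around 16 GiB, larger than the INT8-quantized weights of the model producing it, which is why cache quantization often frees more memory than weight quantization.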
Quantization saves memory. But does it improve cost per token? The ROI depends on whether you're memory-bound or compute-bound.
FP16 to INT8 is usually safe. INT8 to INT4 requires careful testing. Here's how to choose.
INT8 gives 2x memory savings. But quality loss varies by layer and task. Here's how to quantize safely.
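The core of symmetric INT8 quantization fits in a few lines. This is a per-tensor absmax sketch for illustration; production toolchains typically use per-channel scales and calibration data:

```python
# Symmetric (absmax) INT8 quantization sketch in NumPy.
# Per-tensor scaling for clarity; real pipelines are usually per-channel.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.9], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # reconstruction error is at most half a step (s/2)
```

Storage drops from 4 (or 2) bytes per weight to 1, and the worst-case rounding error per element is half the scale, which is why outlier values, by inflating the scale, hurt every other weight in the tensor.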
Raw PyTorch is 3-5x slower than optimized serving. Here's the gap and how to close it.