Deciding Which Model Handles Each Request
Send classification to Haiku, reasoning to Opus. Routing requests to the right model saves money without sacrificing quality.
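A minimal sketch of that routing layer, assuming the Anthropic Python SDK; the model IDs are illustrative, and the keyword heuristic stands in for whatever classifier you actually use to decide the request type.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative model IDs; pin the exact versions you deploy.
CHEAP_MODEL = "claude-3-haiku-20240307"
STRONG_MODEL = "claude-3-opus-20240229"

def route(prompt: str) -> str:
    # Toy heuristic: short, label-style requests go to the cheap model,
    # open-ended reasoning goes to the strong one.
    is_classification = len(prompt) < 500 and "classify" in prompt.lower()
    model = CHEAP_MODEL if is_classification else STRONG_MODEL
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

In practice the routing decision itself is often a cheap model call or a small trained classifier rather than a keyword check; the point is that the decision happens before the expensive model is ever invoked.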
Models change. Prompts change. How do you update without breaking clients? Immutable versions and controlled rollout.
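A hedged sketch of what immutable versions can look like in code: published versions are frozen records pinning an exact model ID and prompt template, and rollout is just moving an alias. All names and version IDs here are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)          # frozen: a published version never changes
class PromptVersion:
    version: str                 # e.g. "summarize-v3"
    model: str                   # pinned model ID, never a moving alias
    template: str

REGISTRY = {
    "summarize-v2": PromptVersion("summarize-v2", "claude-3-haiku-20240307",
                                  "Summarize in 3 bullets:\n{text}"),
    "summarize-v3": PromptVersion("summarize-v3", "claude-3-5-sonnet-20240620",
                                  "Summarize in 3 bullets, plain language:\n{text}"),
}
ALIASES = {"summarize-latest": "summarize-v3"}   # rollout = moving this pointer

def resolve(requested: str) -> PromptVersion:
    # Clients that pin "summarize-v2" keep getting exactly that behavior.
    return REGISTRY[ALIASES.get(requested, requested)]
```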
Model changes are high-risk deployments. Route 1% of traffic to the new model, compare its outputs against the old one, then expand gradually. Here's the playbook.
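A sketch of the 1% step under those assumptions: the new model serves the canary slice while a shadow call to the old model logs whether the outputs agree. `call_model` is a hypothetical helper, and exact-match comparison is a placeholder for whatever diffing you actually run.

```python
import json
import logging
import random

CANARY_RATE = 0.01                    # 1% of traffic exercises the new model
OLD, NEW = "model-v1", "model-v2"     # hypothetical model identifiers

def handle(request_id: str, prompt: str, call_model) -> str:
    if random.random() < CANARY_RATE:
        new_out = call_model(NEW, prompt)
        old_out = call_model(OLD, prompt)   # shadow call, used only for comparison
        logging.info(json.dumps({
            "request_id": request_id,
            "canary": True,
            "match": new_out.strip() == old_out.strip(),
        }))
        return new_out                      # the canary slice gets the new model
    return call_model(OLD, prompt)
```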
The gap between 'works on my laptop' and 'survives production' is filled with timeouts, retries, fallbacks, and rate limits. Here's the checklist.
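A compact sketch of the retry-and-fallback part of that checklist; `call_primary` and `call_fallback` are hypothetical callables wrapping your primary and backup model clients, and the backoff numbers are illustrative.

```python
import random
import time

class ModelUnavailable(Exception):
    """Raised by a model client on 429s, 5xx, or connection failures."""

def call_with_guardrails(call_primary, call_fallback, prompt: str,
                         timeout_s: float = 30.0, max_retries: int = 3) -> str:
    # Retry with exponential backoff plus jitter (so retries don't stampede
    # a rate-limited endpoint), then fall back to a secondary model instead
    # of failing the request outright.
    for attempt in range(max_retries):
        try:
            return call_primary(prompt, timeout=timeout_s)
        except (TimeoutError, ModelUnavailable):
            time.sleep((2 ** attempt) + random.random())
    return call_fallback(prompt, timeout=timeout_s)
```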
Standard attention needs O(n²) memory. Memory-efficient variants need O(n). Same output, 10x less peak memory.
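A minimal PyTorch illustration of the O(n) idea: process queries in fixed-size blocks so the full (n, n) score matrix is never materialized; peak score memory is block × n instead of n × n, and the output matches full attention up to numerics.

```python
import torch

def chunked_attention(q, k, v, block=1024):
    # q, k, v: (seq_len, head_dim). Each iteration only holds a
    # (block, seq_len) slice of the score matrix.
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for start in range(0, q.shape[0], block):
        q_blk = q[start:start + block]                    # (block, dim)
        scores = (q_blk @ k.transpose(0, 1)) * scale      # (block, seq) peak
        out[start:start + block] = torch.softmax(scores, dim=-1) @ v
    return out
```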
Memory grows slowly over hours, then OOM. Here's how to find where the bytes are going before they crash your server.
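For the Python-heap side of that hunt, `tracemalloc` snapshots diffed over time point straight at the allocating lines. A minimal sketch, not a full methodology:

```python
import tracemalloc

tracemalloc.start(25)                    # keep 25 stack frames per allocation
baseline = tracemalloc.take_snapshot()

# ... serve traffic for an hour or two, then diff against the baseline ...

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:
    print(stat)                          # top lines ranked by memory growth
# For GPU-side growth, log torch.cuda.memory_allocated() over time instead.
```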
Every CUDA kernel launch has overhead. Fusing three operations into one can be 3x faster. Here's where fusion helps and how to get it.
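A hedged example of getting fusion without writing CUDA: `torch.compile` can fuse a chain of elementwise ops into fewer kernels, though whether it actually does depends on the backend and the shapes involved.

```python
import torch
import torch.nn.functional as F

def bias_gelu_dropout(x, bias):
    # Run eagerly, this is several separate kernel launches, each reading
    # and writing the full tensor in GPU memory.
    return F.dropout(F.gelu(x + bias), p=0.1, training=True)

fused = torch.compile(bias_gelu_dropout)  # gives the compiler a chance to fuse

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
y = fused(x, bias)
```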
Flash Attention doesn't cut the FLOPs. It makes attention fit in fast on-chip memory. The speedup is a side effect of better memory access.
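In PyTorch you typically get this behavior through `scaled_dot_product_attention`, which dispatches to a fused, Flash-Attention-style kernel when the shapes and dtypes allow it. A minimal sketch:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 16, 8192, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# With a fused kernel, the (seq, seq) attention matrix is processed in
# on-chip tiles instead of being written out to GPU main memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```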
Eval suites catch problems benchmarks miss. Here's how to build testing that prevents quantization regressions from reaching users.
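One hedged shape for such a check: a regression gate that fails when the quantized model drops more than a tolerance below the reference on a curated eval set. The exact-match metric and the 1-point threshold are illustrative, not recommendations.

```python
def check_quantization_regression(reference_fn, quantized_fn, eval_cases,
                                  tolerance=0.01):
    """eval_cases: list of (prompt, expected) pairs; each *_fn maps prompt -> answer.

    Fails if the quantized model's exact-match accuracy drops more than
    `tolerance` below the reference model's.
    """
    def accuracy(fn):
        return sum(fn(p).strip() == exp.strip() for p, exp in eval_cases) / len(eval_cases)

    ref_acc, q_acc = accuracy(reference_fn), accuracy(quantized_fn)
    assert q_acc >= ref_acc - tolerance, (
        f"quantization regression: accuracy {ref_acc:.3f} -> {q_acc:.3f}")
    return ref_acc, q_acc
```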
H100's FP8 gives near-FP16 quality at near-INT8 speed. It's becoming the new default. Here's when and how to use it.
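A minimal sketch of running a matmul in FP8, assuming NVIDIA's Transformer Engine PyTorch API and an H100-class GPU; the recipe settings are illustrative defaults, not tuned values.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID: E4M3 for forward activations/weights, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # the matmul itself runs in FP8 on Hopper-class hardware
```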