Deciding Which Model Handles Each Request
Send classification to Haiku, reasoning to Opus. Routing requests to the right model saves money without sacrificing quality.
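A minimal sketch of that routing layer, assuming the Anthropic Python SDK; the model IDs are illustrative, and the keyword heuristic stands in for whatever classifier you actually use to decide the request type.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative model IDs; pin the exact versions you deploy.
CHEAP_MODEL = "claude-3-haiku-20240307"
STRONG_MODEL = "claude-3-opus-20240229"

def route(prompt: str) -> str:
    # Toy heuristic: short, label-style requests go to the cheap model,
    # open-ended reasoning goes to the strong one.
    is_classification = len(prompt) < 500 and "classify" in prompt.lower()
    model = CHEAP_MODEL if is_classification else STRONG_MODEL
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

In practice the routing decision itself is often a cheap model call or a small trained classifier rather than a keyword check; the point is that the decision happens before the expensive model is ever invoked.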
Models change. Prompts change. How do you update without breaking clients? Immutable versions and controlled rollout.
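A hedged sketch of what immutable versions can look like in code: published versions are frozen records pinning an exact model ID and prompt template, and rollout is just moving an alias. All names and version IDs here are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)          # frozen: a published version never changes
class PromptVersion:
    version: str                 # e.g. "summarize-v3"
    model: str                   # pinned model ID, never a moving alias
    template: str

REGISTRY = {
    "summarize-v2": PromptVersion("summarize-v2", "claude-3-haiku-20240307",
                                  "Summarize in 3 bullets:\n{text}"),
    "summarize-v3": PromptVersion("summarize-v3", "claude-3-5-sonnet-20240620",
                                  "Summarize in 3 bullets, plain language:\n{text}"),
}
ALIASES = {"summarize-latest": "summarize-v3"}   # rollout = moving this pointer

def resolve(requested: str) -> PromptVersion:
    # Clients that pin "summarize-v2" keep getting exactly that behavior.
    return REGISTRY[ALIASES.get(requested, requested)]
```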
Model changes are high-risk deployments. Route 1% of traffic to the new model, compare its outputs against the old one, then expand gradually. Here's the playbook.
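A sketch of the 1% step under those assumptions: the new model serves the canary slice while a shadow call to the old model logs whether the outputs agree. `call_model` is a hypothetical helper, and exact-match comparison is a placeholder for whatever diffing you actually run.

```python
import json
import logging
import random

CANARY_RATE = 0.01                    # 1% of traffic exercises the new model
OLD, NEW = "model-v1", "model-v2"     # hypothetical model identifiers

def handle(request_id: str, prompt: str, call_model) -> str:
    if random.random() < CANARY_RATE:
        new_out = call_model(NEW, prompt)
        old_out = call_model(OLD, prompt)   # shadow call, used only for comparison
        logging.info(json.dumps({
            "request_id": request_id,
            "canary": True,
            "match": new_out.strip() == old_out.strip(),
        }))
        return new_out                      # the canary slice gets the new model
    return call_model(OLD, prompt)
```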
The gap between 'works on my laptop' and 'survives production' is filled with timeouts, retries, fallbacks, and rate limits. Here's the checklist.
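A compact sketch of the retry-and-fallback part of that checklist; `call_primary` and `call_fallback` are hypothetical callables wrapping your primary and backup model clients, and the backoff numbers are illustrative.

```python
import random
import time

class ModelUnavailable(Exception):
    """Raised by a model client on 429s, 5xx, or connection failures."""

def call_with_guardrails(call_primary, call_fallback, prompt: str,
                         timeout_s: float = 30.0, max_retries: int = 3) -> str:
    # Retry with exponential backoff plus jitter (so retries don't stampede
    # a rate-limited endpoint), then fall back to a secondary model instead
    # of failing the request outright.
    for attempt in range(max_retries):
        try:
            return call_primary(prompt, timeout=timeout_s)
        except (TimeoutError, ModelUnavailable):
            time.sleep((2 ** attempt) + random.random())
    return call_fallback(prompt, timeout=timeout_s)
```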
Standard attention needs O(n²) memory. Memory-efficient variants need O(n). Same output, 10x less peak memory.
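A minimal PyTorch illustration of the O(n) idea: process queries in fixed-size blocks so the full (n, n) score matrix is never materialized; peak score memory is block × n instead of n × n, and the output matches full attention up to numerics.

```python
import torch

def chunked_attention(q, k, v, block=1024):
    # q, k, v: (seq_len, head_dim). Each iteration only holds a
    # (block, seq_len) slice of the score matrix.
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for start in range(0, q.shape[0], block):
        q_blk = q[start:start + block]                    # (block, dim)
        scores = (q_blk @ k.transpose(0, 1)) * scale      # (block, seq) peak
        out[start:start + block] = torch.softmax(scores, dim=-1) @ v
    return out
```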
Memory grows slowly over hours, then OOM. Here's how to find where the bytes are going before they crash your server.
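For the Python-heap side of that hunt, `tracemalloc` snapshots diffed over time point straight at the allocating lines. A minimal sketch, not a full methodology:

```python
import tracemalloc

tracemalloc.start(25)                    # keep 25 stack frames per allocation
baseline = tracemalloc.take_snapshot()

# ... serve traffic for an hour or two, then diff against the baseline ...

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:
    print(stat)                          # top lines ranked by memory growth
# For GPU-side growth, log torch.cuda.memory_allocated() over time instead.
```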
Every CUDA kernel launch has overhead. Fusing three operations into one can be 3x faster. Here's where fusion helps and how to get it.
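A hedged example of getting fusion without writing CUDA: `torch.compile` can fuse a chain of elementwise ops into fewer kernels, though whether it actually does depends on the backend and the shapes involved.

```python
import torch
import torch.nn.functional as F

def bias_gelu_dropout(x, bias):
    # Run eagerly, this is several separate kernel launches, each reading
    # and writing the full tensor in GPU memory.
    return F.dropout(F.gelu(x + bias), p=0.1, training=True)

fused = torch.compile(bias_gelu_dropout)  # gives the compiler a chance to fuse

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
y = fused(x, bias)
```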
Flash Attention doesn't cut the FLOPs. It makes attention fit in fast on-chip memory. The speedup is a side effect of better memory access.
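In PyTorch you typically get this behavior through `scaled_dot_product_attention`, which dispatches to a fused, Flash-Attention-style kernel when the shapes and dtypes allow it. A minimal sketch:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 16, 8192, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# With a fused kernel, the (seq, seq) attention matrix is processed in
# on-chip tiles instead of being written out to GPU main memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```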
Eval suites catch problems benchmarks miss. Here's how to build testing that prevents quantization regressions from reaching users.
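One hedged shape for such a check: a regression gate that fails when the quantized model drops more than a tolerance below the reference on a curated eval set. The exact-match metric and the 1-point threshold are illustrative, not recommendations.

```python
def check_quantization_regression(reference_fn, quantized_fn, eval_cases,
                                  tolerance=0.01):
    """eval_cases: list of (prompt, expected) pairs; each *_fn maps prompt -> answer.

    Fails if the quantized model's exact-match accuracy drops more than
    `tolerance` below the reference model's.
    """
    def accuracy(fn):
        return sum(fn(p).strip() == exp.strip() for p, exp in eval_cases) / len(eval_cases)

    ref_acc, q_acc = accuracy(reference_fn), accuracy(quantized_fn)
    assert q_acc >= ref_acc - tolerance, (
        f"quantization regression: accuracy {ref_acc:.3f} -> {q_acc:.3f}")
    return ref_acc, q_acc
```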
H100's FP8 gives near-FP16 quality at near-INT8 speed. It's becoming the new default. Here's when and how to use it.
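A minimal sketch of running a matmul in FP8, assuming NVIDIA's Transformer Engine PyTorch API and an H100-class GPU; the recipe settings are illustrative defaults, not tuned values.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID: E4M3 for forward activations/weights, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # the matmul itself runs in FP8 on Hopper-class hardware
```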