Blog

Deep dives into LLM inference optimization. Practical insights for developers and founders building with AI.

Safe Rollouts for LLM Changes

Model changes are high-risk deployments. Route 1% of traffic to the new model, compare outputs, then expand gradually. Here's the playbook.
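The core of that first step is a deterministic traffic split. A minimal sketch, assuming a hash-based router and placeholder model names (`model-new`, `model-stable`) rather than anything from the post:

```python
import hashlib

CANARY_PERCENT = 1  # start at 1%; widen only after output comparisons look clean

def route(request_id: str) -> str:
    """Deterministically bucket a request so retries hit the same model."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "model-new" if bucket < CANARY_PERCENT else "model-stable"

# Roughly 1% of requests land on the canary.
targets = [route(f"req-{i}") for i in range(100_000)]
print(targets.count("model-new") / len(targets))
```

Hashing the request ID instead of drawing a random number keeps routing stable across retries, which is what makes output comparisons between the two models meaningful.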

Attention That Fits in Memory

Standard attention needs O(n²) memory. Memory-efficient variants need O(n). Same output, 10x less peak memory.
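As a rough illustration of where the O(n²) goes, here is a query-chunked attention sketch in PyTorch: it never materializes the full n×n score matrix, only a chunk×n slice at a time. The O(n) variants go further and also chunk over keys with a running softmax; this is just the simplest form of the idea, not the post's code:

```python
import torch

def chunked_attention(q, k, v, chunk_size=256):
    """Same output as standard attention, but peak extra memory is
    O(chunk_size * n) instead of O(n * n)."""
    scale = q.shape[-1] ** -0.5
    outputs = []
    for start in range(0, q.shape[0], chunk_size):
        q_chunk = q[start:start + chunk_size]               # (chunk, d)
        scores = (q_chunk @ k.T) * scale                    # (chunk, n), never (n, n)
        outputs.append(torch.softmax(scores, dim=-1) @ v)   # (chunk, d)
    return torch.cat(outputs)

n, d = 4096, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
reference = torch.softmax((q @ k.T) * d**-0.5, dim=-1) @ v
assert torch.allclose(chunked_attention(q, k, v), reference, atol=1e-4)
```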

What Flash Attention Actually Does

Flash Attention doesn't make attention faster by doing less math. It makes attention fit in memory. The speedup is a side effect of better memory access.
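One way to see the memory point (illustrative only; assumes a CUDA GPU and PyTorch 2.x): compare the peak memory of a naive implementation that materializes the full score matrix against the fused kernel behind scaled_dot_product_attention.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5  # full (n, n) matrix
    return torch.softmax(scores, dim=-1) @ v

def peak_mib(fn, *args):
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

# batch 1, 8 heads, 8k tokens, head dim 64
q, k, v = (torch.randn(1, 8, 8192, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))
print("naive:", peak_mib(naive_attention, q, k, v), "MiB")
print("fused:", peak_mib(F.scaled_dot_product_attention, q, k, v), "MiB")
```

The exact numbers depend on hardware and sequence length; the gap comes from the naive version writing an 8192×8192 score matrix per head to GPU memory.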

Testing Quality After Quantization

Eval suites catch problems benchmarks miss. Here's how to build testing that prevents quantization regressions from reaching users.
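A hedged sketch of what such a gate can look like; score() and the two generate callables are placeholders for your own grader and serving stack, not the post's framework:

```python
from statistics import mean

def regression_gate(prompts, score, generate_baseline, generate_quantized, max_drop=0.02):
    """Fail the deploy if the quantized model's eval score drops too far below baseline."""
    base = mean(score(p, generate_baseline(p)) for p in prompts)
    quant = mean(score(p, generate_quantized(p)) for p in prompts)
    assert base - quant <= max_drop, f"quantized model regressed by {base - quant:.3f}"
    return base, quant

# Toy usage with keyword-match scoring; real suites use task-specific graders.
prompts = ["What is the capital of France?"]
score = lambda p, out: float("Paris" in out)
print(regression_gate(prompts, score, lambda p: "Paris.", lambda p: "It is Paris."))
```

The threshold matters more than the harness: pick a max_drop you would actually block a release over, or the gate becomes decoration.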

When to Use FP8 for Inference

H100's FP8 gives near-FP16 quality at near-INT8 speed. It's becoming the new default. Here's when and how to use it.
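For a sense of what FP8 does to the weights themselves, here is a per-tensor E4M3 round trip in PyTorch (requires PyTorch 2.1+ for the float8 dtypes). Serving stacks manage the scaling for you at runtime; this sketch only shows the precision trade:

```python
import torch

def to_fp8_e4m3(x: torch.Tensor):
    """Scale into E4M3's representable range, cast, and return (fp8 tensor, scale)."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = 448.0 / amax                        # 448 is E4M3's largest finite value
    return (x * scale).to(torch.float8_e4m3fn), scale

w = torch.randn(4096, 4096)
w_fp8, scale = to_fp8_e4m3(w)
w_restored = w_fp8.to(torch.float32) / scale
rel_err = ((w - w_restored).abs().mean() / w.abs().mean()).item()
print(f"mean relative error after FP8 round trip: {rel_err:.4f}")
```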