The Cache That Makes LLMs Possible
Without the KV cache, generating 100 tokens would take 5,050 forward passes instead of 100. Here's how it works.
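A quick sanity check on that number: without a cache, decoding step t has to re-process all t positions seen so far, so 100 generated tokens means 1 + 2 + ... + 100 = 5,050 position computations. A minimal counting sketch (no model involved, just the arithmetic):

```python
# Count token positions processed when generating new_tokens outputs,
# with and without a KV cache. Pure counting -- no model is run here.

def positions_processed(new_tokens: int, use_kv_cache: bool) -> int:
    total = 0
    for t in range(1, new_tokens + 1):
        # Without a cache, step t re-encodes the whole prefix of length t.
        # With a cache, step t only computes the single new position.
        total += 1 if use_kv_cache else t
    return total

print(positions_processed(100, use_kv_cache=False))  # 5050
print(positions_processed(100, use_kv_cache=True))   # 100
```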
Deep dives into LLM inference optimization. Practical insights for developers and founders building with AI.
OOM at 32K context when your GPU 'should' handle it? Here's what's actually happening in GPU memory during long conversations.
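Part of the answer is KV-cache growth. A rough sizing sketch is below; the model shape (80 layers, 8 KV heads, head dim 128, fp16) is an assumed 70B-class configuration for illustration, not a figure from the article:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Keys and values are both cached, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 70B-class shape: 80 layers, 8 KV heads (GQA), head dim 128, fp16.
per_seq = kv_cache_bytes(80, 8, 128, 32_768)
print(f"{per_seq / 2**30:.1f} GiB per 32K-token sequence")  # ~10.0 GiB
```

Multiply that by every concurrent sequence, then add weights, activations, and allocator fragmentation, and the headroom you thought you had is gone.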
12 things to check before your LLM goes to production. Most teams skip at least half. That's how incidents happen.
Groq, Cerebras, and other custom silicon promise 10x speed. Here's how to evaluate them without getting burned.
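One low-effort starting point: most of these vendors expose an OpenAI-compatible endpoint, so you can measure time-to-first-token and streaming throughput yourself before trusting any marketing number. A rough probe sketch; the base URL, model name, and key below are placeholders:

```python
# Rough latency probe for any OpenAI-compatible endpoint.
# Endpoint URL, model name, and API key are placeholders -- use your own.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.example-vendor.com/v1", api_key="YOUR_KEY")

def probe(model: str, prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=256,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1
    total = time.perf_counter() - start
    ttft = (first_token_at or start) - start
    # Chunks approximate tokens; good enough for comparing vendors.
    return ttft, n_chunks / max(total - ttft, 1e-9)

ttft, tps = probe("example-model", "Summarize the KV cache in two sentences.")
print(f"TTFT {ttft * 1000:.0f} ms, ~{tps:.0f} tokens/s")
```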
nvidia-smi says 90% utilization. Actual compute is 30%. Here's what GPU utilization really means and what to measure instead.
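The utilization figure in nvidia-smi only says a kernel was resident on the GPU during the sampling window, not how much of its math throughput was used. A better single number is model FLOPs utilization (MFU). A back-of-the-envelope sketch; the parameter count, throughput, and peak-FLOPs values are assumptions for illustration:

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    # Decode-time approximation: ~2 FLOPs per parameter per generated token.
    achieved = 2 * params * tokens_per_sec
    return achieved / peak_flops

# Assumed example: a 70B-parameter model generating 1,000 tok/s on a GPU
# with roughly 1e15 FLOP/s of dense BF16 peak (H100-class).
print(f"MFU ~= {mfu(70e9, 1_000, 1e15):.1%}")  # ~14%
```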
Spot instances are 50-70% cheaper. But they can disappear. Here's how to use them without breaking production.
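The short version is that you plan for the interruption instead of hoping it never comes. A minimal AWS-flavored sketch, assuming the standard EC2 spot interruption notice exposed through the instance metadata service; other clouds surface a similar signal differently:

```python
# Poll the instance metadata service for a spot interruption notice and
# drain before the ~2-minute reclaim window ends.
import time
import requests

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    return requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "120"},
        timeout=2,
    ).text

def interruption_pending() -> bool:
    # Returns 404 until AWS schedules a reclaim, then a small JSON payload.
    resp = requests.get(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
        timeout=2,
    )
    return resp.status_code == 200

def drain():
    # Placeholder: stop accepting requests, finish in-flight generations,
    # checkpoint state, deregister from the load balancer.
    print("draining...")

while True:
    if interruption_pending():
        drain()
        break
    time.sleep(5)
```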
H100 spot at $0.15/1M tokens. A100 on-demand at $0.40/1M. API at $1.00/1M. Here's the full comparison.
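To turn those per-token prices into a monthly bill, here is the arithmetic at an assumed volume (the 500M tokens/month figure is an illustration, not from the article):

```python
# Price points from the comparison above; volume is an assumed example.
PRICES_PER_1M = {"H100 spot": 0.15, "A100 on-demand": 0.40, "API": 1.00}
monthly_tokens = 500e6  # assumed 500M tokens/month

for name, price in PRICES_PER_1M.items():
    print(f"{name:>15}: ${price * monthly_tokens / 1e6:,.0f}/month")
# H100 spot $75, A100 on-demand $200, API $500 at that volume
```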
Egress $3K, logging $2K, on-call eng time $8K: the costs nobody budgeted for add up to $13K on top of the GPU bill.
GPU cost is just the beginning. Egress, logging, on-call—add 40% to your compute estimate for the real number.
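As a worked example of that rule of thumb, with an assumed $20K/month GPU bill:

```python
def true_cost(gpu_compute: float, overhead: float = 0.40) -> float:
    # Egress, logging, on-call time, etc. modeled as a flat 40% markup
    # on the raw GPU spend (the rule of thumb above).
    return gpu_compute * (1 + overhead)

print(f"${true_cost(20_000):,.0f}")  # assumed $20K/month GPU bill -> $28,000
```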
The same model costs different amounts on different providers. Smart routing between them can cut your bill by 30%.
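A minimal sketch of what smart routing can look like: keep a per-provider price table for each model and send traffic to the cheapest provider that is currently healthy. Provider names and prices here are hypothetical placeholders:

```python
# Hypothetical per-1M-token prices for the same model on different providers.
PRICE_TABLE = {
    "llama-3.1-70b": {"provider-a": 0.59, "provider-b": 0.72, "provider-c": 0.90},
}
UNHEALTHY: set[str] = set()  # populated by health checks / error-rate monitors

def pick_provider(model: str) -> str:
    candidates = {
        p: price for p, price in PRICE_TABLE[model].items() if p not in UNHEALTHY
    }
    # Cheapest healthy provider wins; fall back to anything if all are flagged.
    pool = candidates or PRICE_TABLE[model]
    return min(pool, key=pool.get)

print(pick_provider("llama-3.1-70b"))  # provider-a
```

In practice you would also weigh latency, rate limits, and output quality before routing purely on price.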