The Cache That Makes LLMs Possible
Without the KV cache, generating 100 tokens would take 5,050 forward passes instead of 100. Here's how it works.
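A quick sanity check on that number: without a cache, decoding step t has to re-process all t positions seen so far, so 100 generated tokens means 1 + 2 + ... + 100 = 5,050 position computations. A minimal counting sketch (no model involved, just the arithmetic):

```python
# Count token positions processed when generating new_tokens outputs,
# with and without a KV cache. Pure counting -- no model is run here.

def positions_processed(new_tokens: int, use_kv_cache: bool) -> int:
    total = 0
    for t in range(1, new_tokens + 1):
        # Without a cache, step t re-encodes the whole prefix of length t.
        # With a cache, step t only computes the single new position.
        total += 1 if use_kv_cache else t
    return total

print(positions_processed(100, use_kv_cache=False))  # 5050
print(positions_processed(100, use_kv_cache=True))   # 100
```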
Deep dives into LLM inference optimization. Practical insights for developers and founders building with AI.
OOM at 32K context when your GPU 'should' handle it? Here's what's actually happening in GPU memory during long conversations.
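Part of the answer is KV-cache growth. A rough sizing sketch is below; the model shape (80 layers, 8 KV heads, head dim 128, fp16) is an assumed 70B-class configuration for illustration, not a figure from the article:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Keys and values are both cached, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 70B-class shape: 80 layers, 8 KV heads (GQA), head dim 128, fp16.
per_seq = kv_cache_bytes(80, 8, 128, 32_768)
print(f"{per_seq / 2**30:.1f} GiB per 32K-token sequence")  # ~10.0 GiB
```

Multiply that by every concurrent sequence, then add weights, activations, and allocator fragmentation, and the headroom you thought you had is gone.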
12 things to check before your LLM goes to production. Most teams skip at least half. That's how incidents happen.
Groq, Cerebras, and other custom silicon promise 10x speed. Here's how to evaluate them without getting burned.
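One low-effort starting point: most of these vendors expose an OpenAI-compatible endpoint, so you can measure time-to-first-token and streaming throughput yourself before trusting any marketing number. A rough probe sketch; the base URL, model name, and key below are placeholders:

```python
# Rough latency probe for any OpenAI-compatible endpoint.
# Endpoint URL, model name, and API key are placeholders -- use your own.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.example-vendor.com/v1", api_key="YOUR_KEY")

def probe(model: str, prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=256,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1
    total = time.perf_counter() - start
    ttft = (first_token_at or start) - start
    # Chunks approximate tokens; good enough for comparing vendors.
    return ttft, n_chunks / max(total - ttft, 1e-9)

ttft, tps = probe("example-model", "Summarize the KV cache in two sentences.")
print(f"TTFT {ttft * 1000:.0f} ms, ~{tps:.0f} tokens/s")
```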
nvidia-smi says 90% utilization. Actual compute is 30%. Here's what GPU utilization really means and what to measure instead.
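The utilization figure in nvidia-smi only says a kernel was resident on the GPU during the sampling window, not how much of its math throughput was used. A better single number is model FLOPs utilization (MFU). A back-of-the-envelope sketch; the parameter count, throughput, and peak-FLOPs values are assumptions for illustration:

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    # Decode-time approximation: ~2 FLOPs per parameter per generated token.
    achieved = 2 * params * tokens_per_sec
    return achieved / peak_flops

# Assumed example: a 70B-parameter model generating 1,000 tok/s on a GPU
# with roughly 1e15 FLOP/s of dense BF16 peak (H100-class).
print(f"MFU ~= {mfu(70e9, 1_000, 1e15):.1%}")  # ~14%
```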
Spot instances are 50-70% cheaper. But they can disappear. Here's how to use them without breaking production.
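The short version is that you plan for the interruption instead of hoping it never comes. A minimal AWS-flavored sketch, assuming the standard EC2 spot interruption notice exposed through the instance metadata service; other clouds surface a similar signal differently:

```python
# Poll the instance metadata service for a spot interruption notice and
# drain before the ~2-minute reclaim window ends.
import time
import requests

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    return requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "120"},
        timeout=2,
    ).text

def interruption_pending() -> bool:
    # Returns 404 until AWS schedules a reclaim, then a small JSON payload.
    resp = requests.get(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
        timeout=2,
    )
    return resp.status_code == 200

def drain():
    # Placeholder: stop accepting requests, finish in-flight generations,
    # checkpoint state, deregister from the load balancer.
    print("draining...")

while True:
    if interruption_pending():
        drain()
        break
    time.sleep(5)
```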
H100 spot at $0.15/1M tokens. A100 on-demand at $0.40/1M. API at $1.00/1M. Here's the full comparison.
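To turn those per-token prices into a monthly bill, here is the arithmetic at an assumed volume (the 500M tokens/month figure is an illustration, not from the article):

```python
# Price points from the comparison above; volume is an assumed example.
PRICES_PER_1M = {"H100 spot": 0.15, "A100 on-demand": 0.40, "API": 1.00}
monthly_tokens = 500e6  # assumed 500M tokens/month

for name, price in PRICES_PER_1M.items():
    print(f"{name:>15}: ${price * monthly_tokens / 1e6:,.0f}/month")
# H100 spot $75, A100 on-demand $200, API $500 at that volume
```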
Egress $3K, logging $2K, on-call eng time $8K: the costs nobody budgeted for add up to $13K on top of the GPU bill.
GPU cost is just the beginning. Egress, logging, on-call—add 40% to your compute estimate for the real number.
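As a worked example of that rule of thumb, with an assumed $20K/month GPU bill:

```python
def true_cost(gpu_compute: float, overhead: float = 0.40) -> float:
    # Egress, logging, on-call time, etc. modeled as a flat 40% markup
    # on the raw GPU spend (the rule of thumb above).
    return gpu_compute * (1 + overhead)

print(f"${true_cost(20_000):,.0f}")  # assumed $20K/month GPU bill -> $28,000
```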
The same model costs different amounts on different providers. Smart routing between them can cut your bill by 30%.
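A minimal sketch of what smart routing can look like: keep a per-provider price table for each model and send traffic to the cheapest provider that is currently healthy. Provider names and prices here are hypothetical placeholders:

```python
# Hypothetical per-1M-token prices for the same model on different providers.
PRICE_TABLE = {
    "llama-3.1-70b": {"provider-a": 0.59, "provider-b": 0.72, "provider-c": 0.90},
}
UNHEALTHY: set[str] = set()  # populated by health checks / error-rate monitors

def pick_provider(model: str) -> str:
    candidates = {
        p: price for p, price in PRICE_TABLE[model].items() if p not in UNHEALTHY
    }
    # Cheapest healthy provider wins; fall back to anything if all are flagged.
    pool = candidates or PRICE_TABLE[model]
    return min(pool, key=pool.get)

print(pick_provider("llama-3.1-70b"))  # provider-a
```

In practice you would also weigh latency, rate limits, and output quality before routing purely on price.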