A Year of LLM Inference: Lessons Learned
Looking back at what we learned deploying LLM inference in production. What worked, what didn't, and what we'd do differently.
8B models handle classification well. 70B models handle summarization. Code-specialized models beat generalists at code. Match the model to the task.
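As a minimal sketch of matching the model to the task (the task labels and model names here are hypothetical, not from the article), this can start as a simple routing table:

```python
# Hypothetical task-to-model routing table; model names are illustrative.
ROUTES = {
    "classification": "llama-8b",    # small models handle classification well
    "summarization": "llama-70b",    # summarization benefits from a larger model
    "code": "code-specialist",       # code-tuned models beat generalists at code
}

def pick_model(task: str) -> str:
    # Unknown task types fall back to the most capable general model.
    return ROUTES.get(task, "llama-70b")
```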
Your primary API will fail. Same model at a different provider. Smaller model as backup. Cached responses for emergencies. Have a plan before you need it.
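A sketch of that fallback chain, assuming each tier is a callable that raises on failure (the tier functions and cache here are hypothetical stand-ins):

```python
def call_with_fallback(prompt, tiers, cache=None):
    """Try each tier in order: primary API, same model at another provider,
    smaller backup model. Fall back to a cached response as a last resort."""
    for call in tiers:
        try:
            return call(prompt)
        except Exception:
            continue  # this tier failed; move down the chain
    if cache and prompt in cache:
        return cache[prompt]  # emergency: serve a possibly stale answer
    raise RuntimeError("all fallback tiers exhausted")
```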
Not all optimizations are equal. Prefix caching saves 40%. Quantization saves 50%. Smart routing saves 60%. Know which levers move the needle for your workload.
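One subtlety worth making explicit: stacked optimizations compound on the remaining cost rather than adding up. A small illustration, applying the quoted percentages to a notional baseline:

```python
def combined_savings(fractions):
    """Total savings when optimizations stack multiplicatively on remaining cost."""
    remaining = 1.0
    for f in fractions:
        remaining *= 1.0 - f
    return 1.0 - remaining

# Prefix caching (40%) stacked with quantization (50%):
# remaining cost = 0.6 * 0.5 = 0.30, i.e. 70% total savings, not 90%.
```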
Prompt injection, model extraction, data leakage. LLM serving has unique attack vectors. Understanding them is the first step to defending against them.
One GPU can serve many customers without sharing data. Isolation at the request level, not the hardware level. The economics work when you get it right.
Speculative decoding shines when outputs are predictable. Code completion, structured generation, and templates see 2x+ gains. Creative writing doesn't.
A small model proposes tokens, a large model verifies in parallel. When predictions match, you get 2-3x speedup. When they don't, you're no worse off.
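A toy sketch of one draft-and-verify round, with the two models stood in for by plain functions mapping a context to the next token (in a real system the target model checks all k proposals in one parallel forward pass, which is the source of the speedup):

```python
def speculative_step(draft, target, context, k=4):
    """Draft k tokens with the small model, keep the prefix the target agrees with.

    At the first mismatch the target's own token is emitted instead, so the
    output is always exactly what the target model alone would have produced."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in proposal:
        want = target(ctx)  # in practice: one batched verification pass
        if tok == want:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(want)  # first mismatch: take the target's token, stop
            break
    return accepted
```

When draft and target agree, each round yields up to k tokens for roughly one target-model pass; when they diverge early, you still emit one correct token per round, which is why the worst case is no worse than plain decoding.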
Transfer cost vs recompute cost. If moving data off GPU costs less than recomputing it, offload. If not, keep it. The math is straightforward.
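The comparison can be written down directly; the bandwidth and throughput figures below are illustrative, not measured:

```python
def should_offload(size_bytes, link_bytes_per_s, recompute_flops, gpu_flops_per_s):
    """Offload when moving the data off-GPU costs less time than recomputing it."""
    transfer_time = size_bytes / link_bytes_per_s
    recompute_time = recompute_flops / gpu_flops_per_s
    return transfer_time < recompute_time

# 1 GB over a ~32 GB/s PCIe link takes ~31 ms; if recomputing the same data
# needs 10 TFLOPs on a 100 TFLOP/s GPU (~100 ms), offloading wins.
```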
KV cache is 40% of memory for long contexts. Compression techniques trade compute for memory without significant quality loss. Know when to use them.
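To see why the KV cache dominates at long context, the size formula is simple. A sketch, using a hypothetical 7B-class configuration (the layer/head dimensions are assumptions, not from the article):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Factor of 2 covers both keys and values; dtype_bytes=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# e.g. 32 layers, 32 KV heads, head_dim 128, a 4096-token context, batch 1, fp16:
# 2 * 32 * 32 * 128 * 4096 * 1 * 2 bytes = 2 GiB for a single sequence.
```

The cache grows linearly with sequence length and batch size, which is why compression (quantized KV, fewer KV heads, eviction) becomes attractive exactly when contexts get long.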