A Year of LLM Inference: Lessons Learned
Looking back at what we learned deploying LLM inference in production. What worked, what didn't, and what we'd do differently.
8B models handle classification well. 70B models handle summarization. Code-specialized models beat generalists at code. Match the model to the task.
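As a minimal sketch of matching the model to the task (the task labels and model names here are hypothetical, not from the article), this can start as a simple routing table:

```python
# Hypothetical task-to-model routing table; model names are illustrative.
ROUTES = {
    "classification": "llama-8b",    # small models handle classification well
    "summarization": "llama-70b",    # summarization benefits from a larger model
    "code": "code-specialist",       # code-tuned models beat generalists at code
}

def pick_model(task: str) -> str:
    # Unknown task types fall back to the most capable general model.
    return ROUTES.get(task, "llama-70b")
```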
Your primary API will fail. Same model at a different provider. Smaller model as backup. Cached responses for emergencies. Have a plan before you need it.
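A sketch of that fallback chain, assuming each tier is a callable that raises on failure (the tier functions and cache here are hypothetical stand-ins):

```python
def call_with_fallback(prompt, tiers, cache=None):
    """Try each tier in order: primary API, same model at another provider,
    smaller backup model. Fall back to a cached response as a last resort."""
    for call in tiers:
        try:
            return call(prompt)
        except Exception:
            continue  # this tier failed; move down the chain
    if cache and prompt in cache:
        return cache[prompt]  # emergency: serve a possibly stale answer
    raise RuntimeError("all fallback tiers exhausted")
```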
Not all optimizations are equal. Prefix caching saves 40%. Quantization saves 50%. Smart routing saves 60%. Know which levers move the needle for your workload.
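One subtlety worth making explicit: stacked optimizations compound on the remaining cost rather than adding up. A small illustration, applying the quoted percentages to a notional baseline:

```python
def combined_savings(fractions):
    """Total savings when optimizations stack multiplicatively on remaining cost."""
    remaining = 1.0
    for f in fractions:
        remaining *= 1.0 - f
    return 1.0 - remaining

# Prefix caching (40%) stacked with quantization (50%):
# remaining cost = 0.6 * 0.5 = 0.30, i.e. 70% total savings, not 90%.
```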
Prompt injection, model extraction, data leakage. LLM serving has unique attack vectors. Understanding them is the first step to defending against them.
One GPU can serve many customers without sharing data. Isolation at the request level, not the hardware level. The economics work when you get it right.
Speculative decoding shines when outputs are predictable. Code completion, structured generation, and templates see 2x+ gains. Creative writing doesn't.
A small model proposes tokens, a large model verifies in parallel. When predictions match, you get 2-3x speedup. When they don't, you're no worse off.
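A toy sketch of one draft-and-verify round, with the two models stood in for by plain functions mapping a context to the next token (in a real system the target model checks all k proposals in one parallel forward pass, which is the source of the speedup):

```python
def speculative_step(draft, target, context, k=4):
    """Draft k tokens with the small model, keep the prefix the target agrees with.

    At the first mismatch the target's own token is emitted instead, so the
    output is always exactly what the target model alone would have produced."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in proposal:
        want = target(ctx)  # in practice: one batched verification pass
        if tok == want:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(want)  # first mismatch: take the target's token, stop
            break
    return accepted
```

When draft and target agree, each round yields up to k tokens for roughly one target-model pass; when they diverge early, you still emit one correct token per round, which is why the worst case is no worse than plain decoding.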
Transfer cost vs recompute cost. If moving data off GPU costs less than recomputing it, offload. If not, keep it. The math is straightforward.
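The comparison can be written down directly; the bandwidth and throughput figures below are illustrative, not measured:

```python
def should_offload(size_bytes, link_bytes_per_s, recompute_flops, gpu_flops_per_s):
    """Offload when moving the data off-GPU costs less time than recomputing it."""
    transfer_time = size_bytes / link_bytes_per_s
    recompute_time = recompute_flops / gpu_flops_per_s
    return transfer_time < recompute_time

# 1 GB over a ~32 GB/s PCIe link takes ~31 ms; if recomputing the same data
# needs 10 TFLOPs on a 100 TFLOP/s GPU (~100 ms), offloading wins.
```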
KV cache is 40% of memory for long contexts. Compression techniques trade compute for memory without significant quality loss. Know when to use them.
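To see why the KV cache dominates at long context, the size formula is simple. A sketch, using a hypothetical 7B-class configuration (the layer/head dimensions are assumptions, not from the article):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Factor of 2 covers both keys and values; dtype_bytes=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# e.g. 32 layers, 32 KV heads, head_dim 128, a 4096-token context, batch 1, fp16:
# 2 * 32 * 32 * 128 * 4096 * 1 * 2 bytes = 2 GiB for a single sequence.
```

The cache grows linearly with sequence length and batch size, which is why compression (quantized KV, fewer KV heads, eviction) becomes attractive exactly when contexts get long.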