A Year of LLM Inference: Lessons Learned
Looking back at what we learned deploying LLM inference in production. What worked, what didn't, and what we'd do differently.
6 posts tagged with "production"
Fine-tuning a model is the easy part. Running it in production with checkpoints, evals, rollback, and serving is the hard part. Here's the full picture.
Evals at Anthropic, OpenAI, and Google aren't afterthoughts. They're gating functions that block releases. Every prompt change triggers the full suite.
The gap between 'works on my laptop' and 'survives production' is filled with timeouts, retries, fallbacks, and rate limits. Here's the checklist.
Serving an LLM with raw PyTorch is 3-5x slower than an optimized inference stack. Here's where the gap comes from and how to close it.
12 things to check before your LLM goes to production. Most teams skip at least half. That's how incidents happen.