Understanding What Your Model Attends To
Attention visualization reveals which tokens influence outputs. Debug why the model ignored critical context or fixated on irrelevant tokens.
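A minimal sketch of the quantity being visualized: scaled dot-product attention weights, computed here in pure NumPy on toy tensors. In practice you would read these weights from the model's attention outputs rather than recompute them; each row shows how much one query token attends to every key token.

```python
import numpy as np

def attention_weights(q, k):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d))."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Toy example: 3 tokens with 4-dim query/key vectors.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
w = attention_weights(q, k)
print(np.round(w, 3))  # rows sum to 1; near-zero entries mark ignored tokens
```

A token your prompt depends on showing a near-zero column across all heads is the "ignored critical context" failure mode the visualization surfaces.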
Deep dives into LLM inference optimization. Practical insights for developers and founders building with AI.
Models trained on 4K context can work at 32K with position interpolation. Quality degrades, but predictably. Know the tradeoffs before extending.
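The core trick can be sketched in a few lines: rather than feeding the model positions it never saw in training, interpolation rescales the 32K position indices down into the trained 0..4K range before the rotary (RoPE) angles are computed. The shapes below are illustrative toy values, not any specific model's config.

```python
import numpy as np

def rope_angles(positions, dim=8, base=10000.0):
    """RoPE rotation angles for each position and frequency pair."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, freqs)

trained_len, target_len = 4096, 32768
scale = trained_len / target_len          # 0.125: squeeze 32K into 4K

positions = np.arange(target_len)
interpolated = positions * scale          # every index now lands in [0, 4096)

angles = rope_angles(interpolated)
```

The predictable quality degradation comes from the same source as the trick: adjacent tokens are now only 0.125 positions apart, so fine-grained positional distinctions get compressed.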
Generic benchmarks don't predict production quality. Domain-specific evals, regression tests, and A/B testing reveal whether your fine-tuning actually worked.
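A domain-specific eval can start very small: a golden set of answers from your domain, a scoring function, and a regression gate that blocks a deploy when the fine-tuned model scores below the base model. This is a toy sketch with hypothetical data and exact-match scoring; real evals usually need fuzzier matching.

```python
def exact_match_score(predictions, golden):
    """Fraction of predictions that exactly match the golden answers."""
    assert len(predictions) == len(golden)
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, golden))
    return hits / len(golden)

# Hypothetical domain golden set and model outputs.
golden = ["42", "refund approved", "escalate"]
base_preds = ["42", "refund denied", "escalate"]
tuned_preds = ["42", "refund approved", "escalate"]

base_score = exact_match_score(base_preds, golden)    # 2/3
tuned_score = exact_match_score(tuned_preds, golden)  # 3/3

# Regression gate: fail the pipeline if fine-tuning made things worse.
assert tuned_score >= base_score
```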
1,000 high-quality examples often outperform 100,000 noisy ones. Data quality dominates quantity for fine-tuning. Curation is the work.
S-LoRA enables switching adapters in ~10ms without reloading the base model. One deployment serves hundreds of customizations.
Merge adapters for single-tenant deployments. Keep them separate for multi-tenant. The serving architecture depends on how many customizations you're running.
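The merge itself is just folding the low-rank update into the base weight. A NumPy sketch with toy shapes (illustrative rank, scaling, and dimensions) shows why the two serving modes produce identical outputs, so the choice is purely architectural:

```python
import numpy as np

d, r, alpha = 16, 4, 8          # hidden size, LoRA rank, scaling (toy values)
rng = np.random.default_rng(1)

W = rng.normal(size=(d, d))            # frozen base weight
A = rng.normal(size=(r, d))            # LoRA down-projection
B = rng.normal(size=(d, r)) * 0.01     # LoRA up-projection (zero-init at training start)

# Single-tenant: fold the adapter into the base weight once, serve W_merged.
W_merged = W + (alpha / r) * (B @ A)

# Multi-tenant: keep W frozen, apply each tenant's delta at request time.
x = rng.normal(size=(d,))
y_separate = W @ x + (alpha / r) * (B @ (A @ x))
y_merged = W_merged @ x
```

Merging removes the per-request adapter matmuls but bakes one customization into the weights; keeping adapters separate costs a little latency and buys hot-swapping.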
Prompting has high per-call cost but zero upfront investment. Fine-tuning has low per-call cost but significant upfront investment. The crossover point matters.
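The crossover point is simple break-even arithmetic: divide the upfront fine-tuning investment by the per-call savings. The dollar figures below are hypothetical, purely to show the shape of the calculation.

```python
def breakeven_calls(upfront_cost, prompt_cost_per_call, tuned_cost_per_call):
    """Number of calls at which fine-tuning's upfront cost pays for itself."""
    savings_per_call = prompt_cost_per_call - tuned_cost_per_call
    if savings_per_call <= 0:
        return float("inf")   # fine-tuning never pays off on cost alone
    return upfront_cost / savings_per_call

# Hypothetical: $2,000 of tuning + eval work; a long few-shot prompt costs
# $0.01/call, while the tuned model's short prompt costs $0.002/call.
n = breakeven_calls(2_000, 0.01, 0.002)
print(f"break-even at {n:,.0f} calls")   # ~250,000 calls
```

Below that call volume, prompting wins; above it, fine-tuning does, before accounting for quality differences.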
LoRA tutorials make it look easy. Production LoRA requires learning rate adjustments, layer selection, rank tuning, and careful validation. Here's what actually works.
Fine-tuning a model is the easy part. Running it in production with checkpoints, evals, rollback, and serving is the hard part. Here's the full picture.
Full fine-tuning updates billions of parameters. LoRA updates millions. The 0.1% of parameters can capture 80% of the adaptation. Know when that's enough.
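The parameter gap is easy to see with back-of-envelope arithmetic. The shapes below are illustrative (roughly 7B-scale, rank 8, adapting the four attention projections per layer), not any specific model's config:

```python
d_model, n_layers, r = 4096, 32, 8   # hidden size, layers, LoRA rank (illustrative)
adapted_matrices = 4                 # e.g. q/k/v/o projections per layer

total_params = 7e9                   # whole model, all updated by full fine-tuning
# Each adapted matrix gets A (r x d) plus B (d x r): 2 * d * r trainable params.
lora_params = n_layers * adapted_matrices * 2 * d_model * r

print(f"LoRA params: {lora_params/1e6:.1f}M")                # 8.4M
print(f"fraction of model: {lora_params/total_params:.2%}")  # ~0.12%
```

Millions of trainable parameters instead of billions, which is where the "0.1% of parameters" figure comes from.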