How Much Quality Loss Is Acceptable
3% degradation on summarization? Maybe fine. 3% on code generation? Could break your users. Here's how to set thresholds.
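One way to make that concrete is a per-task regression gate in your eval pipeline. The sketch below is illustrative only: the task names, scores, and tolerances are assumed placeholders, not recommendations.

```python
# Hypothetical per-task regression gate: the quantized model must stay within a
# task-specific tolerance of the FP16 baseline. All numbers are placeholders.
BASELINE  = {"summarization_rougeL": 0.412, "humaneval_pass@1": 0.480}
CANDIDATE = {"summarization_rougeL": 0.401, "humaneval_pass@1": 0.441}
MAX_DROP  = {"summarization_rougeL": 0.03,   # 3% relative drop: maybe fine
             "humaneval_pass@1":     0.01}   # code generation: much stricter

def gate(baseline, candidate, tolerances) -> bool:
    ok = True
    for task, base in baseline.items():
        drop = (base - candidate[task]) / base
        passed = drop <= tolerances[task]
        print(f"{task}: {drop:.1%} drop (limit {tolerances[task]:.0%}) "
              f"-> {'PASS' if passed else 'FAIL'}")
        ok = ok and passed
    return ok

if not gate(BASELINE, CANDIDATE, MAX_DROP):
    raise SystemExit("quantized model rejected: quality regression too large")
```

The exact numbers matter less than the structure: each task owns its own tolerance instead of one global "acceptable loss" figure.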
AWQ and GPTQ both quantize weights to INT4. AWQ quantizes faster; GPTQ sometimes preserves quality better. When does each win?
Everyone quantizes model weights. Few quantize the KV cache. But the cache is often the bigger memory consumer.
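A quick back-of-envelope shows why. The model shape below is an assumption (a Llama-2-70B-style config: 80 layers, grouped-query attention with 8 KV heads, head_dim 128); adjust it for your model.

```python
# KV-cache size per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Shape is an assumed Llama-2-70B-style GQA config.
layers, kv_heads, head_dim = 80, 8, 128

def kv_bytes_per_token(bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

ctx = 4096
for fmt, b in [("FP16", 2), ("INT8", 1)]:
    per_seq_gib = kv_bytes_per_token(b) * ctx / 2**30
    print(f"{fmt} cache: {kv_bytes_per_token(b)/1024:.0f} KiB/token, "
          f"{per_seq_gib:.2f} GiB per {ctx}-token sequence")
```

At FP16 that is more than a gigabyte of cache per 4k-token sequence; an 8-bit cache halves it.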
Quantization saves memory. But does it improve cost per token? The ROI depends on whether you're memory-bound or compute-bound.
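A rough roofline check makes the distinction concrete. The hardware peaks below are assumed A100-80GB-class figures, and the decode model is deliberately simplified (weights streamed once per step, ~2 FLOPs per weight per token); the model size and batch are illustrative too.

```python
# Rough decode-phase roofline: memory-bound or compute-bound?
# Hardware peaks are assumed A100-80GB-class figures; adjust for your GPU.
PEAK_FLOPS = 312e12           # FP16 tensor-core peak, FLOP/s
PEAK_BW    = 2.0e12           # HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW  # FLOPs/byte needed to saturate compute (~156)

def decode_intensity(params: float, batch: int, bytes_per_weight: float) -> float:
    flops = 2 * params * batch               # ~2 FLOPs per weight per token
    bytes_moved = params * bytes_per_weight  # weights streamed once per step
    return flops / bytes_moved

for fmt, bpw in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    ai = decode_intensity(params=70e9, batch=8, bytes_per_weight=bpw)
    bound = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"{fmt}: {ai:.0f} FLOPs/byte vs ridge {RIDGE:.0f} -> {bound}")
```

While you are memory-bound, halving the bytes you stream roughly halves the time per decode step, so cost per token drops. Once you are compute-bound, smaller weights mostly buy memory headroom rather than speed.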
FP16 to INT8 is usually safe. Pushing down to INT4 requires careful testing. Here's how to choose.
INT8 halves weight memory compared to FP16. But quality loss varies by layer and task. Here's how to quantize safely.
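That "varies by layer" part is cheap to probe before you run full evals. Below is a hypothetical sensitivity check: round-trip each weight matrix through symmetric per-channel INT8 and compare the reconstruction error. The toy tensors stand in for real model.named_parameters() entries, and end-task evals still have the final say.

```python
import torch

# Hypothetical per-layer sensitivity probe: quantize each weight matrix to
# INT8 (symmetric, per output channel), dequantize, and measure relative error.
def int8_roundtrip_error(w: torch.Tensor) -> float:
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = (w / scale).round().clamp_(-127, 127)
    return ((w - q * scale).norm() / w.norm()).item()

# Toy stand-ins; in practice iterate over model.named_parameters().
clean = torch.randn(4096, 4096)
outliers = torch.randn(4096, 11008)
outliers[:, :8] *= 50.0          # a few outlier channels inflate the scale

for name, w in [("attn.q_proj (well-behaved)", clean),
                ("mlp.down_proj (outlier channels)", outliers)]:
    print(f"{name}: relative INT8 error {int8_roundtrip_error(w):.4f}")
```

Layers that show outlier channels are the usual candidates for keeping in higher precision or handling with an outlier-aware scheme.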
Raw PyTorch is 3-5x slower than optimized serving. Here's the gap and how to close it.
Paged allocation, quantization, prefix caching: which techniques actually deliver 4x more concurrent requests, and which are hype?
GPU memory is precious. CPU memory is cheap. Moving the right data at the right time can 2x your concurrent requests.
Where does memory go in a 70B model deployment? How do you know if KV cache is your bottleneck? Here's the diagnostic playbook.
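As a starting point, the arithmetic below sketches the two big line items, weights and KV cache, under assumed serving parameters (FP16 everything, 32 concurrent 4k-token sequences, Llama-2-70B-style shapes). Every number is an assumption chosen to make the math concrete.

```python
# Hypothetical memory budget for a 70B deployment; all figures are assumptions.
GIB = 2**30
params, bytes_per_weight = 70e9, 2            # FP16 weights
layers, kv_heads, head_dim = 80, 8, 128       # assumed GQA shape
batch, ctx = 32, 4096                         # concurrent seqs x context length

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # FP16 K and V
weights_gib = params * bytes_per_weight / GIB
kv_gib = batch * ctx * kv_bytes_per_token / GIB

print(f"weights:  {weights_gib:6.1f} GiB")
print(f"KV cache: {kv_gib:6.1f} GiB ({batch} seqs x {ctx} tokens)")
print("KV cache is the bottleneck" if kv_gib > weights_gib
      else "weights are the bottleneck")
```

Re-run it with your real batch size and context length: push the batch to 128, or quantize the weights to INT4, and the answer flips.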