6 posts tagged with "quantization"

Testing Quality After Quantization
Eval suites catch problems benchmarks miss. Here's how to build testing that prevents quantization regressions from reaching users.
AWQ and GPTQ both quantize to INT4. AWQ is faster to quantize; GPTQ sometimes delivers better quality. When does each win?
Everyone quantizes model weights. Few quantize the KV cache. But the cache is often the bigger memory consumer.
Quantization saves memory. But does it lower cost per token? The ROI depends on whether you're memory-bound or compute-bound.
FP16 to INT8 is usually safe. INT8 to INT4 requires careful testing. Here's how to choose.
INT8 gives 2x memory savings over FP16. But quality loss varies by layer and task. Here's how to quantize safely.