Reducing KV Cache Size Without Quality Loss
The KV cache can account for 40% of GPU memory at long context lengths. Compression techniques trade compute for memory without significant quality loss. Know when to use them.
6 posts tagged with "kv-cache"
Everyone quantizes model weights. Few quantize the KV cache. But the cache is often the bigger memory consumer.
Where does memory go in a 70B model deployment? How do you know if KV cache is your bottleneck? Here's the diagnostic playbook.
Without the KV cache, generating 100 tokens would mean recomputing attention over 5,050 token positions instead of 100. Here's how it works.
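The 5,050 figure above is just the triangular sum: without a cache, step *i* has to re-run attention over all *i* tokens so far. A minimal sketch of that arithmetic (the function name is illustrative, not from the post):

```python
# Quadratic cost of generating without a KV cache: step i reprocesses
# all i tokens, so total positions processed = 1 + 2 + ... + n.
def positions_processed(n_tokens: int) -> int:
    return sum(range(1, n_tokens + 1))  # equals n * (n + 1) // 2

print(positions_processed(100))  # 5050 positions without a cache
print(100)                       # 100 with a cache: only the new token each step
```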
OOM at 32K context when your GPU 'should' handle it? Here's what's actually happening in GPU memory during long conversations.
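A back-of-envelope estimate shows why 32K contexts blow past expectations. This sketch assumes a Llama-2-70B-like shape (80 layers, 8 grouped-query KV heads, head dim 128, fp16); the numbers are illustrative assumptions, not figures from the post:

```python
# Rough KV cache size: K and V tensors (the leading 2x), one pair per
# layer, per KV head, per token position, at dtype width in bytes.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed 70B-class shape with grouped-query attention, 32K context, fp16.
gib = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                     seq_len=32_768, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # 10.0 GiB for a single 32K sequence
```

That is per sequence: a batch of a few concurrent 32K conversations can claim tens of gigabytes on top of the weights, which is where the surprise OOM comes from.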
A 2,000 token system prompt processed 10 million times a month. Without caching, you're paying to process the same tokens over and over.