17 posts tagged with "memory"

The Formula for Offloading Decisions

Transfer cost vs recompute cost. If moving data off the GPU costs less than recomputing it, offload. If not, keep it. The math is straightforward.
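A minimal sketch of the comparison; the PCIe bandwidth and GPU throughput defaults below are illustrative assumptions, not constants, so measure your own hardware:

```python
def should_offload(tensor_bytes: int, recompute_flops: float,
                   pcie_bytes_per_s: float = 32e9,
                   gpu_flops_per_s: float = 100e12) -> bool:
    """Offload if the round trip over PCIe is cheaper than recomputing.

    Defaults are illustrative: ~32 GB/s effective PCIe 4.0 x16,
    ~100 TFLOP/s sustained FP16 throughput.
    """
    transfer_s = 2 * tensor_bytes / pcie_bytes_per_s   # copy out + copy back
    recompute_s = recompute_flops / gpu_flops_per_s
    return transfer_s < recompute_s

# 1 GiB of activations vs 10 GFLOPs to recompute them: recompute wins
print(should_offload(2**30, 1e10))   # False
```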
KV cache is 40% of memory for long contexts. Compression techniques trade compute for memory without significant quality loss. Know when to use them.
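For scale, a minimal size calculator, assuming a hypothetical 7B-class config (32 layers, 32 KV heads, head dim 128, FP16); swap in your model's numbers:

```python
def kv_cache_bytes(seq_len: int, batch: int, n_layers: int = 32,
                   n_kv_heads: int = 32, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # 2 tensors (K and V) per layer, FP16 by default
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

print(kv_cache_bytes(32_768, 8) / 2**30)   # 128.0 GiB at 32k context, batch 8
```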
Optimizing for compute when you're memory bound wastes effort. Optimizing for memory when you're compute bound wastes opportunity. Profile first, then optimize.
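One way to tell which regime a kernel is in is a roofline-style check; the peak FLOP and bandwidth figures below are placeholders for your hardware:

```python
def regime(flops: float, bytes_moved: float,
           peak_flops: float = 100e12, peak_bw: float = 2e12) -> str:
    """Compare a kernel's arithmetic intensity (FLOPs per byte) to the
    machine balance point (peak FLOPs / peak bandwidth)."""
    return "compute bound" if flops / bytes_moved > peak_flops / peak_bw \
        else "memory bound"

# A decode-step GEMV reads each FP16 weight once for one multiply-add:
n = 4096
print(regime(flops=2 * n * n, bytes_moved=2 * n * n))   # memory bound
```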
Four GPUs don't give you 4x the KV cache memory. Communication overhead, activation memory, and synchronization eat into the gains. Plan accordingly.
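A back-of-envelope model of why, under tensor parallelism. Every number here is an assumption (80 GiB cards, a 140 GiB FP16 model, 8 GiB of per-GPU activation and communication buffers), so substitute your own:

```python
def kv_capacity_gib(n_gpus: int, gib_per_gpu: float = 80,
                    weights_gib: float = 140,
                    overhead_gib_per_gpu: float = 8) -> float:
    """Weights shard across GPUs, but activation workspace and
    communication buffers are paid on every GPU."""
    return n_gpus * gib_per_gpu - weights_gib - n_gpus * overhead_gib_per_gpu

for n in (2, 4, 8):
    print(f"{n} GPUs: {n * 80} GiB raw, {kv_capacity_gib(n):.0f} GiB usable for KV")
```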
Standard attention needs O(n²) memory in sequence length. Memory-efficient variants need O(n). Same output, 10x less peak memory.
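To see what O(n²) means in bytes, consider just the materialized score matrix (hypothetical 32-head model, FP16); memory-efficient variants never materialize it, so their footprint stays at the O(n) inputs and outputs:

```python
def score_matrix_gib(seq_len: int, n_heads: int = 32,
                     bytes_per_elem: int = 2) -> float:
    # One (seq_len x seq_len) score matrix per head, per layer
    return n_heads * seq_len**2 * bytes_per_elem / 2**30

for n in (4_096, 32_768):
    print(f"n={n}: {score_matrix_gib(n):.0f} GiB per layer")   # 1 GiB vs 64 GiB
```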
Memory grows slowly over hours, then OOM. Here's how to find where the bytes are going before they crash your server.
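For host-side growth, Python's standard tracemalloc can diff two snapshots and attribute the delta to allocation sites (GPU-side accounting needs your framework's own tools instead):

```python
import tracemalloc

tracemalloc.start(25)                  # record up to 25 frames per allocation
baseline = tracemalloc.take_snapshot()

# ... serve traffic for a while ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.compare_to(baseline, "traceback")[:10]:
    print(stat)                        # top growth sites with their stacks
```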
Flash Attention doesn't make attention faster. It makes attention fit in memory. The speedup is a side effect of better memory access.
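A toy NumPy rendering of the core idea: with online softmax, keys and values stream through in tiles, so the full n×n score matrix never exists. This shows the numerics only; real kernels also tile queries and fuse everything in on-chip SRAM.

```python
import numpy as np

def streaming_attention(q, K, V, tile=128):
    """softmax(q K^T / sqrt(d)) V, computed one K/V tile at a time."""
    d = q.shape[1]
    m = np.full(q.shape[0], -np.inf)       # running row max
    l = np.zeros(q.shape[0])               # running softmax denominator
    out = np.zeros((q.shape[0], V.shape[1]))
    for i in range(0, K.shape[0], tile):
        s = q @ K[i:i+tile].T / np.sqrt(d)     # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)              # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ V[i:i+tile]
        m = m_new
    return out / l[:, None]

# Agrees with the naive version that materializes all the scores:
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))
K = rng.standard_normal((1024, 64))
V = rng.standard_normal((1024, 64))
s = q @ K.T / 8.0
w = np.exp(s - s.max(1, keepdims=True))
assert np.allclose(streaming_attention(q, K, V), (w / w.sum(1, keepdims=True)) @ V)
```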
Everyone quantizes model weights. Few quantize the KV cache. But the cache is often the bigger memory consumer.
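A minimal sketch of symmetric INT8 quantization applied to cache tensors, with per-channel scales along the last axis; this is illustrative, not any particular library's scheme:

```python
import numpy as np

def quantize_kv(x: np.ndarray):
    """Quantize a K or V tensor to INT8 with per-channel scales."""
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)     # int8 payload + small scale tensor

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale    # dequantize on attention read
```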
INT8 gives 2x memory savings over FP16. But quality loss varies by layer and task. Here's how to quantize safely.
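One safety check is to measure reconstruction error per layer and keep FP16 wherever it spikes. A sketch, reusing the hypothetical quantize_kv/dequantize_kv helpers above; the threshold is an assumption to tune against your eval set:

```python
import numpy as np

def safe_to_quantize(x: np.ndarray, threshold: float = 0.02) -> bool:
    """Accept INT8 for a layer only if relative reconstruction error is small."""
    q, scale = quantize_kv(x)
    err = np.abs(dequantize_kv(q, scale) - x).mean()
    return err / (np.abs(x).mean() + 1e-8) < threshold
```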
Paged allocation, quantization, prefix caching—which techniques give 4x more concurrent requests and which are hype?
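As a taste of the first technique, a toy page-table allocator in the spirit of paged KV caching: sequences claim fixed-size pages on demand instead of one contiguous max-length buffer, which is where much of the concurrency headroom comes from. This is a hypothetical interface, not vLLM's API:

```python
class PagedKVAllocator:
    """Fixed-size KV pages handed out on demand; no per-request max-length buffer."""

    def __init__(self, n_blocks: int, block_tokens: int = 16):
        self.free = list(range(n_blocks))   # physical block ids
        self.block_tokens = block_tokens
        self.tables = {}                    # seq_id -> list of block ids
        self.lengths = {}                   # seq_id -> tokens written

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, claiming a new page if needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_tokens == 0:      # current page full (or first token)
            if not self.free:
                raise MemoryError("KV pool exhausted; preempt or swap a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```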