17 posts tagged with "memory"

The Formula for Offloading Decisions

Transfer cost vs recompute cost. If moving data off the GPU costs less than recomputing it, offload. If not, keep it. The math is straightforward.
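A minimal sketch of the comparison; the PCIe bandwidth and GPU throughput defaults below are illustrative assumptions, not constants, so measure your own hardware:

```python
def should_offload(tensor_bytes: int, recompute_flops: float,
                   pcie_bytes_per_s: float = 32e9,
                   gpu_flops_per_s: float = 100e12) -> bool:
    """Offload if the round trip over PCIe is cheaper than recomputing.

    Defaults are illustrative: ~32 GB/s effective PCIe 4.0 x16,
    ~100 TFLOP/s sustained FP16 throughput.
    """
    transfer_s = 2 * tensor_bytes / pcie_bytes_per_s   # copy out + copy back
    recompute_s = recompute_flops / gpu_flops_per_s
    return transfer_s < recompute_s

# 1 GiB of activations vs 10 GFLOPs to recompute them: recompute wins
print(should_offload(2**30, 1e10))   # False
```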
KV cache is 40% of memory for long contexts. Compression techniques trade compute for memory without significant quality loss. Know when to use them.
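For scale, a minimal size calculator, assuming a hypothetical 7B-class config (32 layers, 32 KV heads, head dim 128, FP16); swap in your model's numbers:

```python
def kv_cache_bytes(seq_len: int, batch: int, n_layers: int = 32,
                   n_kv_heads: int = 32, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # 2 tensors (K and V) per layer, FP16 by default
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

print(kv_cache_bytes(32_768, 8) / 2**30)   # 128.0 GiB at 32k context, batch 8
```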
Optimizing for compute when you're memory bound wastes effort. Optimizing for memory when you're compute bound wastes opportunity. Profile first, then optimize.
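One way to tell which regime a kernel is in is a roofline-style check; the peak FLOP and bandwidth figures below are placeholders for your hardware:

```python
def regime(flops: float, bytes_moved: float,
           peak_flops: float = 100e12, peak_bw: float = 2e12) -> str:
    """Compare a kernel's arithmetic intensity (FLOPs per byte) to the
    machine balance point (peak FLOPs / peak bandwidth)."""
    return "compute bound" if flops / bytes_moved > peak_flops / peak_bw \
        else "memory bound"

# A decode-step GEMV reads each FP16 weight once for one multiply-add:
n = 4096
print(regime(flops=2 * n * n, bytes_moved=2 * n * n))   # memory bound
```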
Four GPUs don't give you 4x the KV cache memory. Communication overhead, activation memory, and synchronization eat into the gains. Plan accordingly.
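A back-of-envelope model of why, under tensor parallelism. Every number here is an assumption (80 GiB cards, a 140 GiB FP16 model, 8 GiB of per-GPU activation and communication buffers), so substitute your own:

```python
def kv_capacity_gib(n_gpus: int, gib_per_gpu: float = 80,
                    weights_gib: float = 140,
                    overhead_gib_per_gpu: float = 8) -> float:
    """Weights shard across GPUs, but activation workspace and
    communication buffers are paid on every GPU."""
    return n_gpus * gib_per_gpu - weights_gib - n_gpus * overhead_gib_per_gpu

for n in (2, 4, 8):
    print(f"{n} GPUs: {n * 80} GiB raw, {kv_capacity_gib(n):.0f} GiB usable for KV")
```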
Standard attention needs O(n²) memory in sequence length. Memory-efficient variants need O(n). Same output, 10x less peak memory.
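To see what O(n²) means in bytes, consider just the materialized score matrix (hypothetical 32-head model, FP16); memory-efficient variants never materialize it, so their footprint stays at the O(n) inputs and outputs:

```python
def score_matrix_gib(seq_len: int, n_heads: int = 32,
                     bytes_per_elem: int = 2) -> float:
    # One (seq_len x seq_len) score matrix per head, per layer
    return n_heads * seq_len**2 * bytes_per_elem / 2**30

for n in (4_096, 32_768):
    print(f"n={n}: {score_matrix_gib(n):.0f} GiB per layer")   # 1 GiB vs 64 GiB
```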
Memory grows slowly over hours, then OOM. Here's how to find where the bytes are going before they crash your server.
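For host-side growth, Python's standard tracemalloc can diff two snapshots and attribute the delta to allocation sites (GPU-side accounting needs your framework's own tools instead):

```python
import tracemalloc

tracemalloc.start(25)                  # record up to 25 frames per allocation
baseline = tracemalloc.take_snapshot()

# ... serve traffic for a while ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.compare_to(baseline, "traceback")[:10]:
    print(stat)                        # top growth sites with their stacks
```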
Flash Attention doesn't make attention faster. It makes attention fit in memory. The speedup is a side effect of better memory access.
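A toy NumPy rendering of the core idea: with online softmax, keys and values stream through in tiles, so the full n×n score matrix never exists. This shows the numerics only; real kernels also tile queries and fuse everything in on-chip SRAM.

```python
import numpy as np

def streaming_attention(q, K, V, tile=128):
    """softmax(q K^T / sqrt(d)) V, computed one K/V tile at a time."""
    d = q.shape[1]
    m = np.full(q.shape[0], -np.inf)       # running row max
    l = np.zeros(q.shape[0])               # running softmax denominator
    out = np.zeros((q.shape[0], V.shape[1]))
    for i in range(0, K.shape[0], tile):
        s = q @ K[i:i+tile].T / np.sqrt(d)     # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)              # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ V[i:i+tile]
        m = m_new
    return out / l[:, None]

# Agrees with the naive version that materializes all the scores:
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))
K = rng.standard_normal((1024, 64))
V = rng.standard_normal((1024, 64))
s = q @ K.T / 8.0
w = np.exp(s - s.max(1, keepdims=True))
assert np.allclose(streaming_attention(q, K, V), (w / w.sum(1, keepdims=True)) @ V)
```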
Everyone quantizes model weights. Few quantize the KV cache. But the cache is often the bigger memory consumer.
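A minimal sketch of symmetric INT8 quantization applied to cache tensors, with per-channel scales along the last axis; this is illustrative, not any particular library's scheme:

```python
import numpy as np

def quantize_kv(x: np.ndarray):
    """Quantize a K or V tensor to INT8 with per-channel scales."""
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)     # int8 payload + small scale tensor

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale    # dequantize on attention read
```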
INT8 gives 2x memory savings over FP16. But quality loss varies by layer and task. Here's how to quantize safely.
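One safety check is to measure reconstruction error per layer and keep FP16 wherever it spikes. A sketch, reusing the hypothetical quantize_kv/dequantize_kv helpers above; the threshold is an assumption to tune against your eval set:

```python
import numpy as np

def safe_to_quantize(x: np.ndarray, threshold: float = 0.02) -> bool:
    """Accept INT8 for a layer only if relative reconstruction error is small."""
    q, scale = quantize_kv(x)
    err = np.abs(dequantize_kv(q, scale) - x).mean()
    return err / (np.abs(x).mean() + 1e-8) < threshold
```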
Paged allocation, quantization, prefix caching—which techniques give 4x more concurrent requests and which are hype?
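As a taste of the first technique, a toy page-table allocator in the spirit of paged KV caching: sequences claim fixed-size pages on demand instead of one contiguous max-length buffer, which is where much of the concurrency headroom comes from. This is a hypothetical interface, not vLLM's API:

```python
class PagedKVAllocator:
    """Fixed-size KV pages handed out on demand; no per-request max-length buffer."""

    def __init__(self, n_blocks: int, block_tokens: int = 16):
        self.free = list(range(n_blocks))   # physical block ids
        self.block_tokens = block_tokens
        self.tables = {}                    # seq_id -> list of block ids
        self.lengths = {}                   # seq_id -> tokens written

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, claiming a new page if needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_tokens == 0:      # current page full (or first token)
            if not self.free:
                raise MemoryError("KV pool exhausted; preempt or swap a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```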