The GPU Memory Techniques That Actually Scale
Paged allocation, quantization, prefix caching—which techniques give 4x more concurrent requests and which are hype?
38 posts tagged with "optimization"
GPU memory is precious. CPU memory is cheap. Moving the right data at the right time can 2x your concurrent requests.
Without the KV cache, generating 100 tokens would take 5,050 forward passes instead of 100. Here's how it works.
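That 5,050 figure is just the 100th triangular number: without a cache, producing token t reprocesses all t tokens so far. A quick sketch of the arithmetic (illustrative, counting each token position processed, as the teaser does):

```python
# Without a KV cache, generating token t recomputes attention over all
# t tokens so far; with the cache, each step is one incremental pass.
def passes_without_cache(n_tokens: int) -> int:
    return sum(t for t in range(1, n_tokens + 1))  # 1 + 2 + ... + n

def passes_with_cache(n_tokens: int) -> int:
    return n_tokens  # one pass per new token

print(passes_without_cache(100))  # 5050
print(passes_with_cache(100))     # 100
```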
nvidia-smi says 90% utilization. Actual compute is 30%. Here's what GPU utilization really means and what to measure instead.
vLLM serves 10x more requests than naive PyTorch. PagedAttention, continuous batching, and memory management make the difference.
Batch size 1 wastes GPU. Batch size 64 kills latency. Somewhere in between is your sweet spot. Here's how to find it.
100 requests sounds like 100 requests. But one 50k-token request consumes more resources than 99 short ones combined. Batch by tokens, not requests.
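A minimal sketch of what batching by tokens means (hypothetical scheduler, not any particular server's API): cap each batch by a token budget instead of a request count, so one huge request fills a batch by itself while short requests pack together.

```python
# Hypothetical token-budget batcher: requests are (id, token_count) pairs.
# A request larger than the budget still gets its own batch rather than
# starving, and short requests pack until the budget is reached.
def batch_by_tokens(requests, max_batch_tokens=16_384):
    batches, current, used = [], [], 0
    for req_id, n_tokens in requests:
        if current and used + n_tokens > max_batch_tokens:
            batches.append(current)  # budget exceeded: start a new batch
            current, used = [], 0
        current.append(req_id)
        used += n_tokens
    if current:
        batches.append(current)
    return batches

# One 50k-token request plus 99 short ones: the long request is isolated,
# the short ones pack densely.
reqs = [("long", 50_000)] + [(f"short-{i}", 200) for i in range(99)]
print(batch_by_tokens(reqs))
```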
You can have 1000 tokens per second with 3-second latency, or 200 tokens per second with 200ms latency. You cannot have both. Here's how to choose.
vLLM doesn't use a faster model. It uses memory smarter. PagedAttention treats KV cache like virtual memory, and the results are dramatic.
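The virtual-memory analogy can be sketched in a few lines (a toy allocator inspired by the PagedAttention idea, not vLLM's actual code): each sequence holds a block table of fixed-size physical blocks and grabs a new block only when the current one fills, instead of reserving its maximum length up front.

```python
# Toy paged KV-cache allocator: sequences map logical token positions to
# (physical block, offset) pairs via a per-sequence block table.
BLOCK_SIZE = 16  # tokens per physical block (illustrative value)

class PagedKVAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}               # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:        # current block full: allocate one
            table.append(self.free.pop())
        return table[-1], pos % BLOCK_SIZE  # physical location of the token

alloc = PagedKVAllocator(num_blocks=8)
for pos in range(20):                    # 20 tokens -> only 2 blocks used
    alloc.append_token("seq-a", pos)
print(alloc.block_tables["seq-a"])
```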
Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fills those gaps. The difference is 3-5x throughput.
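A toy simulation makes the gap concrete (illustrative only, not vLLM's scheduler): static batching runs each batch until its slowest request finishes, while continuous batching admits a waiting request the moment a slot frees up.

```python
# Requests are decode lengths in steps; batch_size is the number of slots.
def static_steps(lengths, batch_size):
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # batch waits for slowest
    return steps

def continuous_steps(lengths, batch_size):
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))       # refill freed slots at once
        steps += 1
        active = [t - 1 for t in active]
        active = [t for t in active if t > 0]  # finished requests leave
    return steps

# Two long requests mixed with short ones: static batching pays the long
# tail twice; continuous batching overlaps the two long requests.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_steps(lengths, 4), continuous_steps(lengths, 4))
```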