When to Move Data Off the GPU
GPU memory is precious. CPU memory is cheap. Moving the right data at the right time can 2x your concurrent requests.
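The mechanics are simple enough to sketch in a few lines of PyTorch (the function names here are illustrative, not from any serving framework): copy idle state into pinned CPU memory, then copy it back when the request needs the GPU again.

```python
import torch

def offload_to_cpu(gpu_tensor: torch.Tensor) -> torch.Tensor:
    """Copy a tensor into pinned CPU memory so it can come back to the GPU quickly."""
    cpu_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                          device="cpu", pin_memory=True)
    cpu_buf.copy_(gpu_tensor, non_blocking=True)
    # Note: synchronize before reading cpu_buf on the host if the copy was async.
    return cpu_buf

def restore_to_gpu(cpu_buf: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Bring the offloaded tensor back when the request needs the GPU again."""
    return cpu_buf.to(device, non_blocking=True)
```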
17 posts tagged with "memory"
Where does memory go in a 70B model deployment? How do you know if KV cache is your bottleneck? Here's the diagnostic playbook.
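Before any profiling, the arithmetic alone is telling. A back-of-the-envelope sketch, assuming FP16 weights and a two-GPU setup (the hardware and overhead figures are assumptions, not measurements):

```python
# Rough FP16 budget for a 70B model on 2x 80 GB GPUs.
gpu_total_gb   = 2 * 80
weights_gb     = 70e9 * 2 / 1e9        # 2 bytes per FP16 parameter, ~140 GB
overhead_gb    = 10                    # activations, CUDA context, fragmentation (estimate)
kv_headroom_gb = gpu_total_gb - weights_gb - overhead_gb
print(f"headroom left for KV cache: {kv_headroom_gb:.0f} GB")  # ~10 GB
```

When the weights eat almost everything, the KV cache is what runs out first, and that is exactly the bottleneck the playbook checks for.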
Without the KV cache, generating 100 tokens means recomputing attention over 5,050 token positions instead of 100. Here's how it works.
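Where 5,050 comes from: without the cache, step t has to re-encode all t tokens seen so far, so 100 generated tokens cost 1 + 2 + ... + 100 = 100 × 101 / 2 = 5,050 token positions, while the cache brings that back down to 100. A toy count:

```python
def positions_processed(n_tokens: int, use_kv_cache: bool) -> int:
    """How many token positions get pushed through the model for n generated tokens."""
    if use_kv_cache:
        return n_tokens                      # each step handles only the new token
    return sum(range(1, n_tokens + 1))       # step t re-encodes all t tokens so far

print(positions_processed(100, use_kv_cache=False))  # 5050
print(positions_processed(100, use_kv_cache=True))   # 100
```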
OOM at 32K context when your GPU 'should' handle it? Here's what's actually happening in GPU memory during long conversations.
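The usual culprit is that the KV cache grows linearly with both context length and the number of concurrent conversations. A quick estimator, with the layer count, KV heads, and head dimension as placeholder assumptions for a 70B-class model:

```python
def kv_cache_gb(seq_len, batch_size, layers=80, kv_heads=8,
                head_dim=128, bytes_per_elem=2):
    """KV cache bytes per token = 2 (K and V) x layers x kv_heads x head_dim x elem size."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size / 1e9

# Four concurrent 32K-token conversations on a 70B-class model (assumed shape):
print(f"{kv_cache_gb(32_768, batch_size=4):.1f} GB")  # ~42.9 GB of cache alone
```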
vLLM serves 10x more requests than naive PyTorch. PagedAttention, continuous batching, and memory management make the difference.
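Continuous batching is the easiest of the three to sketch: the batch is re-formed every decoding step, so a finished request's slot goes to a waiting one immediately. A simplified scheduler step (names are illustrative, not vLLM's internals):

```python
def schedule_step(running, waiting, has_free_kv_blocks):
    """One iteration of a continuous-batching scheduler (illustrative sketch)."""
    # Finished requests leave the batch immediately instead of idling until
    # the slowest request in the batch is done.
    running = [req for req in running if not req.finished]
    # Waiting requests join mid-flight as long as KV-cache blocks are available.
    while waiting and has_free_kv_blocks(waiting[0]):
        running.append(waiting.pop(0))
    return running, waiting
```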
100 requests sounds like 100 units of work. But one 50k-token request consumes more resources than 99 short ones combined. Batch by tokens, not requests.
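A scheduler that admits work against a token budget treats that 50k-token request as the heavyweight it is. A minimal sketch, with the budget number chosen purely for illustration:

```python
def admit_by_tokens(queue, max_batch_tokens=16_384):
    """Fill a batch up to a token budget instead of a request count."""
    batch, used = [], 0
    for req in queue:
        # Always admit at least one request so a long one can't starve.
        if batch and used + req.num_tokens > max_batch_tokens:
            break
        batch.append(req)
        used += req.num_tokens
    return batch

# One 50k-token request uses the whole budget by itself, while a hundred
# 100-token requests (10k tokens total) fit in a single batch with room to spare.
```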
vLLM doesn't use a faster model. It uses memory smarter. PagedAttention treats KV cache like virtual memory, and the results are dramatic.
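The virtual-memory analogy is literal: the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to whatever physical blocks are free, so memory is allocated on demand instead of reserved for the maximum length. A toy version of the bookkeeping (names are illustrative):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size is also 16)

class BlockTable:
    """Maps a sequence's logical blocks to physical cache blocks, like a page table."""
    def __init__(self, free_blocks: list):
        self.free_blocks = free_blocks   # physical block pool shared by all sequences
        self.physical = []               # this sequence's logical-to-physical mapping

    def append_token(self, position: int) -> int:
        # A new physical block is claimed only when the previous one fills up,
        # so memory grows with actual tokens, not with a worst-case reservation.
        if position % BLOCK_SIZE == 0:
            self.physical.append(self.free_blocks.pop())
        return self.physical[position // BLOCK_SIZE]

pool = list(range(1024))        # 1024 physical blocks available
seq = BlockTable(pool)
for pos in range(40):           # 40 tokens generated so far
    seq.append_token(pos)
print(len(seq.physical))        # 3 blocks, not a max-length reservation
```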