When to Move Data Off the GPU
GPU memory is precious. CPU memory is cheap. Moving the right data at the right time can 2x your concurrent requests.
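The mechanics are simple enough to sketch in a few lines of PyTorch (the function names here are illustrative, not from any serving framework): copy idle state into pinned CPU memory, then copy it back when the request needs the GPU again.

```python
import torch

def offload_to_cpu(gpu_tensor: torch.Tensor) -> torch.Tensor:
    """Copy a tensor into pinned CPU memory so it can come back to the GPU quickly."""
    cpu_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                          device="cpu", pin_memory=True)
    cpu_buf.copy_(gpu_tensor, non_blocking=True)
    # Note: synchronize before reading cpu_buf on the host if the copy was async.
    return cpu_buf

def restore_to_gpu(cpu_buf: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Bring the offloaded tensor back when the request needs the GPU again."""
    return cpu_buf.to(device, non_blocking=True)
```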
17 posts tagged with "memory"
Where does memory go in a 70B model deployment? How do you know if KV cache is your bottleneck? Here's the diagnostic playbook.
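Before any profiling, the arithmetic alone is telling. A back-of-the-envelope sketch, assuming FP16 weights and a two-GPU setup (the hardware and overhead figures are assumptions, not measurements):

```python
# Rough FP16 budget for a 70B model on 2x 80 GB GPUs.
gpu_total_gb   = 2 * 80
weights_gb     = 70e9 * 2 / 1e9        # 2 bytes per FP16 parameter, ~140 GB
overhead_gb    = 10                    # activations, CUDA context, fragmentation (estimate)
kv_headroom_gb = gpu_total_gb - weights_gb - overhead_gb
print(f"headroom left for KV cache: {kv_headroom_gb:.0f} GB")  # ~10 GB
```

When the weights eat almost everything, the KV cache is what runs out first, and that is exactly the bottleneck the playbook checks for.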
Without the KV cache, generating 100 tokens means recomputing attention over 5,050 token positions instead of 100. Here's how it works.
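Where 5,050 comes from: without the cache, step t has to re-encode all t tokens seen so far, so 100 generated tokens cost 1 + 2 + ... + 100 = 100 × 101 / 2 = 5,050 token positions, while the cache brings that back down to 100. A toy count:

```python
def positions_processed(n_tokens: int, use_kv_cache: bool) -> int:
    """How many token positions get pushed through the model for n generated tokens."""
    if use_kv_cache:
        return n_tokens                      # each step handles only the new token
    return sum(range(1, n_tokens + 1))       # step t re-encodes all t tokens so far

print(positions_processed(100, use_kv_cache=False))  # 5050
print(positions_processed(100, use_kv_cache=True))   # 100
```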
OOM at 32K context when your GPU 'should' handle it? Here's what's actually happening in GPU memory during long conversations.
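The usual culprit is that the KV cache grows linearly with both context length and the number of concurrent conversations. A quick estimator, with the layer count, KV heads, and head dimension as placeholder assumptions for a 70B-class model:

```python
def kv_cache_gb(seq_len, batch_size, layers=80, kv_heads=8,
                head_dim=128, bytes_per_elem=2):
    """KV cache bytes per token = 2 (K and V) x layers x kv_heads x head_dim x elem size."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size / 1e9

# Four concurrent 32K-token conversations on a 70B-class model (assumed shape):
print(f"{kv_cache_gb(32_768, batch_size=4):.1f} GB")  # ~42.9 GB of cache alone
```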
vLLM serves 10x more requests than naive PyTorch. PagedAttention, continuous batching, and memory management make the difference.
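Continuous batching is the easiest of the three to sketch: the batch is re-formed every decoding step, so a finished request's slot goes to a waiting one immediately. A simplified scheduler step (names are illustrative, not vLLM's internals):

```python
def schedule_step(running, waiting, has_free_kv_blocks):
    """One iteration of a continuous-batching scheduler (illustrative sketch)."""
    # Finished requests leave the batch immediately instead of idling until
    # the slowest request in the batch is done.
    running = [req for req in running if not req.finished]
    # Waiting requests join mid-flight as long as KV-cache blocks are available.
    while waiting and has_free_kv_blocks(waiting[0]):
        running.append(waiting.pop(0))
    return running, waiting
```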
100 requests sounds like 100 units of work. But one 50k-token request consumes more resources than 99 short ones combined. Batch by tokens, not requests.
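A scheduler that admits work against a token budget treats that 50k-token request as the heavyweight it is. A minimal sketch, with the budget number chosen purely for illustration:

```python
def admit_by_tokens(queue, max_batch_tokens=16_384):
    """Fill a batch up to a token budget instead of a request count."""
    batch, used = [], 0
    for req in queue:
        # Always admit at least one request so a long one can't starve.
        if batch and used + req.num_tokens > max_batch_tokens:
            break
        batch.append(req)
        used += req.num_tokens
    return batch

# One 50k-token request uses the whole budget by itself, while a hundred
# 100-token requests (10k tokens total) fit in a single batch with room to spare.
```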
vLLM doesn't use a faster model. It uses memory smarter. PagedAttention treats KV cache like virtual memory, and the results are dramatic.
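The virtual-memory analogy is literal: the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to whatever physical blocks are free, so memory is allocated on demand instead of reserved for the maximum length. A toy version of the bookkeeping (names are illustrative):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size is also 16)

class BlockTable:
    """Maps a sequence's logical blocks to physical cache blocks, like a page table."""
    def __init__(self, free_blocks: list):
        self.free_blocks = free_blocks   # physical block pool shared by all sequences
        self.physical = []               # this sequence's logical-to-physical mapping

    def append_token(self, position: int) -> int:
        # A new physical block is claimed only when the previous one fills up,
        # so memory grows with actual tokens, not with a worst-case reservation.
        if position % BLOCK_SIZE == 0:
            self.physical.append(self.free_blocks.pop())
        return self.physical[position // BLOCK_SIZE]

pool = list(range(1024))        # 1024 physical blocks available
seq = BlockTable(pool)
for pos in range(40):           # 40 tokens generated so far
    seq.append_token(pos)
print(len(seq.physical))        # 3 blocks, not a max-length reservation
```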