The GPU Memory Techniques That Actually Scale
Paged allocation, quantization, prefix caching—which techniques give 4x more concurrent requests and which are hype?
38 posts tagged with "optimization"
GPU memory is precious. CPU memory is cheap. Moving the right data at the right time can 2x your concurrent requests.
Without the KV cache, generating 100 tokens would take 5,050 forward passes instead of 100. Here's how it works.
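That 5,050 figure is just the 100th triangular number: without a cache, producing token t reprocesses all t tokens so far. A quick sketch of the arithmetic (illustrative, counting each token position processed, as the teaser does):

```python
# Without a KV cache, generating token t recomputes attention over all
# t tokens so far; with the cache, each step is one incremental pass.
def passes_without_cache(n_tokens: int) -> int:
    return sum(t for t in range(1, n_tokens + 1))  # 1 + 2 + ... + n

def passes_with_cache(n_tokens: int) -> int:
    return n_tokens  # one pass per new token

print(passes_without_cache(100))  # 5050
print(passes_with_cache(100))     # 100
```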
nvidia-smi says 90% utilization. Actual compute is 30%. Here's what GPU utilization really means and what to measure instead.
vLLM serves 10x more requests than naive PyTorch. PagedAttention, continuous batching, and memory management make the difference.
Batch size 1 wastes GPU. Batch size 64 kills latency. Somewhere in between is your sweet spot. Here's how to find it.
100 requests sounds like 100 requests. But one 50k-token request consumes more resources than 99 short ones combined. Batch by tokens, not requests.
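A minimal sketch of what batching by tokens means (hypothetical scheduler, not any particular server's API): cap each batch by a token budget instead of a request count, so one huge request fills a batch by itself while short requests pack together.

```python
# Hypothetical token-budget batcher: requests are (id, token_count) pairs.
# A request larger than the budget still gets its own batch rather than
# starving, and short requests pack until the budget is reached.
def batch_by_tokens(requests, max_batch_tokens=16_384):
    batches, current, used = [], [], 0
    for req_id, n_tokens in requests:
        if current and used + n_tokens > max_batch_tokens:
            batches.append(current)  # budget exceeded: start a new batch
            current, used = [], 0
        current.append(req_id)
        used += n_tokens
    if current:
        batches.append(current)
    return batches

# One 50k-token request plus 99 short ones: the long request is isolated,
# the short ones pack densely.
reqs = [("long", 50_000)] + [(f"short-{i}", 200) for i in range(99)]
print(batch_by_tokens(reqs))
```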
You can have 1000 tokens per second with 3-second latency, or 200 tokens per second with 200ms latency. You cannot have both. Here's how to choose.
vLLM doesn't use a faster model. It uses memory smarter. PagedAttention treats KV cache like virtual memory, and the results are dramatic.
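The virtual-memory analogy can be sketched in a few lines (a toy allocator inspired by the PagedAttention idea, not vLLM's actual code): each sequence holds a block table of fixed-size physical blocks and grabs a new block only when the current one fills, instead of reserving its maximum length up front.

```python
# Toy paged KV-cache allocator: sequences map logical token positions to
# (physical block, offset) pairs via a per-sequence block table.
BLOCK_SIZE = 16  # tokens per physical block (illustrative value)

class PagedKVAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}               # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:        # current block full: allocate one
            table.append(self.free.pop())
        return table[-1], pos % BLOCK_SIZE  # physical location of the token

alloc = PagedKVAllocator(num_blocks=8)
for pos in range(20):                    # 20 tokens -> only 2 blocks used
    alloc.append_token("seq-a", pos)
print(alloc.block_tables["seq-a"])
```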
Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fills those gaps. The difference is 3-5x throughput.
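A toy simulation makes the gap concrete (illustrative only, not vLLM's scheduler): static batching runs each batch until its slowest request finishes, while continuous batching admits a waiting request the moment a slot frees up.

```python
# Requests are decode lengths in steps; batch_size is the number of slots.
def static_steps(lengths, batch_size):
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # batch waits for slowest
    return steps

def continuous_steps(lengths, batch_size):
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))       # refill freed slots at once
        steps += 1
        active = [t - 1 for t in active]
        active = [t for t in active if t > 0]  # finished requests leave
    return steps

# Two long requests mixed with short ones: static batching pays the long
# tail twice; continuous batching overlaps the two long requests.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_steps(lengths, 4), continuous_steps(lengths, 4))
```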