Understanding What Makes vLLM Fast
Libraries are organized by a cataloging system. Without it, finding a book means searching every shelf. With it, you go directly to the right location.
vLLM applies the same kind of organization to GPU memory. Instead of reserving one large contiguous block per request, much of which goes unused, it manages the KV cache in small pages and puts every available byte to work.
The Memory Problem vLLM Solves
class NaiveMemoryAllocation:
    """How most frameworks handle KV cache"""

    def allocate_for_request(self, max_length: int):
        # Pre-allocate for worst case
        # Request says max_tokens=2048? Allocate all 2048.
        # Reality: average output is 150 tokens
        # Waste: 2048 - 150 = 1898 tokens worth of memory
        # With 100 concurrent requests:
        # 100 * 1898 * kv_size_per_token = massive waste
        pass
class PagedAllocation:
    """How vLLM handles KV cache"""

    def allocate_for_request(self):
        # Allocate pages as needed
        # Start with 1 page (e.g., 16 tokens)
        # Add pages as generation continues
        # Reality: only allocate what's used
        # No waste from over-provisioning
        pass
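To put rough numbers on that waste, here is a back-of-the-envelope sketch. The per-token KV size assumes a Llama-style 70B model (80 layers, 8 grouped KV heads, head dimension 128, fp16 cache); treat the figures as illustrative, not measured.

def kv_cache_waste_sketch():
    """Back-of-the-envelope waste estimate under the assumptions above."""
    # 2 (K and V) * 80 layers * 8 KV heads * 128 head_dim * 2 bytes (fp16)
    kv_bytes_per_token = 2 * 80 * 8 * 128 * 2   # 320 KiB per token

    max_tokens = 2048         # what the naive allocator reserves per request
    avg_output_tokens = 150   # what requests actually use on average
    concurrent_requests = 100

    wasted_tokens = max_tokens - avg_output_tokens               # 1898
    wasted = concurrent_requests * wasted_tokens * kv_bytes_per_token
    print(f"Naive allocation wastes ~{wasted / 2**30:.0f} GiB")  # ~58 GiB

    # Paged allocation overshoots by at most one partially filled
    # 16-token page per request.
    overshoot = concurrent_requests * 16 * kv_bytes_per_token
    print(f"Paged worst-case overshoot ~{overshoot / 2**30:.2f} GiB")  # ~0.49 GiB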
PagedAttention Explained
class PagedAttention:
    """Virtual memory, but for attention"""

    def __init__(self, page_size: int = 16, num_pages: int = 0):
        self.page_size = page_size
        self.page_table = {}  # request_id -> list of physical pages
        # Physical pages available on the GPU, sized from total KV cache memory
        self.free_pages = list(range(num_pages))

    def allocate_page(self, request_id: str) -> int:
        """Allocate a new page for a request"""
        if not self.free_pages:
            # No free pages, need to handle (eviction, OOM, etc.)
            raise MemoryError("No free pages")
        physical_page = self.free_pages.pop()
        if request_id not in self.page_table:
            self.page_table[request_id] = []
        self.page_table[request_id].append(physical_page)
        return physical_page

    def free_request(self, request_id: str):
        """Return pages when request completes"""
        pages = self.page_table.pop(request_id, [])
        self.free_pages.extend(pages)
        # Pages immediately available for the next request
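A quick toy run of the sketch above shows the bookkeeping (the page count and request name are made up for illustration):

# Toy usage of the PagedAttention sketch above
pool = PagedAttention(page_size=16, num_pages=8)

# A request that generates 40 tokens needs ceil(40 / 16) = 3 pages
for _ in range(3):
    pool.allocate_page("req-1")
print(pool.page_table["req-1"])  # e.g. [7, 6, 5]
print(len(pool.free_pages))      # 5 pages still free for other requests

pool.free_request("req-1")
print(len(pool.free_pages))      # back to 8, immediately reusable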
Continuous Batching: The Other Half
class ContinuousBatching:
    """Don't wait for the whole batch to complete"""

    def __init__(self, max_batch: int = 32):
        self.max_batch = max_batch
        self.active_requests = []
        self.pending_requests = []

    def iteration_step(self):
        # Traditional batching:
        #   Wait for ALL requests to finish → then start a new batch
        #   Problem: short requests wait for long ones
        # Continuous batching:
        #   Each iteration, check for completed requests and
        #   immediately add new requests to fill the freed slots
        completed = [r for r in self.active_requests if r.is_done]
        for request in completed:
            self.active_requests.remove(request)
            self._return_memory(request)

        # Fill empty slots
        while len(self.active_requests) < self.max_batch:
            if self.pending_requests:
                new_request = self.pending_requests.pop(0)
                self.active_requests.append(new_request)
            else:
                break

        # Run one decode step for all active requests
        self._decode_step(self.active_requests)

    def _return_memory(self, request):
        """Hand the request's KV cache pages back to the block manager."""

    def _decode_step(self, requests):
        """Run one forward pass that produces the next token for each request."""
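To see the effect, here is a toy driver for the sketch above. The Request and ToyBatcher classes are stand-ins made up for illustration; in a real engine the decode step runs the model and the memory hook talks to the block manager.

# Toy driver for the ContinuousBatching sketch (illustrative names only)
class Request:
    """Pretend request that finishes after a fixed number of decode steps."""
    def __init__(self, name: str, steps_needed: int):
        self.name = name
        self.steps_needed = steps_needed
        self.steps_done = 0

    @property
    def is_done(self) -> bool:
        return self.steps_done >= self.steps_needed


class ToyBatcher(ContinuousBatching):
    def _decode_step(self, requests):
        for r in requests:  # stand-in for one model forward pass
            r.steps_done += 1


batcher = ToyBatcher(max_batch=2)
batcher.pending_requests = [Request("short", 2), Request("long", 10), Request("next", 3)]

for step in range(1, 6):
    batcher.iteration_step()
    print(step, [r.name for r in batcher.active_requests])
# "short" finishes after 2 steps; "next" is admitted on the very next
# iteration instead of waiting 10 steps for "long" to finish.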
Memory Efficiency in Numbers
def memory_comparison():
    """Real numbers from a 70B model deployment"""
    # Without PagedAttention (naive allocation)
    naive = {
        "max_concurrent_requests": 8,
        "memory_per_request_gb": 8.0,  # Allocated for max_tokens
        "actual_usage_gb": 1.2,        # Average actual use
        "waste_percentage": 85,
    }

    # With PagedAttention
    paged = {
        "max_concurrent_requests": 50,
        "memory_per_request_gb": 1.3,    # Only what's needed + overhead
        "fragmentation_overhead": 0.05,  # 5% for page management
        "improvement": "6x more concurrent requests",
    }
    return naive, paged
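The two headline figures fall straight out of those numbers:

# Deriving the headline figures from the numbers above
waste = 1 - 1.2 / 8.0  # 0.85 -> the 85% waste figure
gain = 50 / 8          # 6.25 -> "6x more concurrent requests"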
The vLLM Architecture
class VLLMArchitecture:
    components = {
        "Scheduler": {
            "role": "Decides which requests to process",
            "strategy": "Maximize throughput while meeting latency targets",
        },
        "PagedAttention": {
            "role": "Manages KV cache memory",
            "key_insight": "Non-contiguous allocation",
        },
        "Worker": {
            "role": "Runs model inference",
            "optimization": "Batched operations across requests",
        },
        "BlockManager": {
            "role": "Allocates and frees memory pages",
            "pattern": "Copy-on-write for prompt caching",
        },
    }

    request_flow = """
    1. Request arrives → Scheduler queues it
    2. Scheduler picks requests for the next batch
    3. BlockManager allocates KV cache pages
    4. Worker runs prefill (input processing)
    5. Worker runs decode iterations (token generation)
    6. On completion → BlockManager frees pages
    7. Scheduler admits new requests
    """
Configuration That Matters
def key_vllm_parameters() -> dict:
    return {
        "--max-num-seqs": {
            "default": 256,
            "effect": "Max concurrent requests",
            "tune": "Higher = more throughput, more memory",
        },
        "--max-num-batched-tokens": {
            "default": 2048,
            "effect": "Tokens processed per iteration",
            "tune": "Balance batch efficiency vs latency",
        },
        "--gpu-memory-utilization": {
            "default": 0.9,
            "effect": "How much GPU memory to use",
            "tune": "Lower for safety margin, higher for throughput",
        },
        "--block-size": {
            "default": 16,
            "effect": "Tokens per page",
            "tune": "Rarely needs changing",
        },
    }
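For reference, here is how those knobs are typically passed through vLLM's Python API. The keyword names mirror the CLI flags; defaults and accepted arguments vary by vLLM version, and the model name below is just an example.

# Passing the same knobs through vLLM's Python API (check your version's docs)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model, swap in your own
    max_num_seqs=256,             # cap on concurrent sequences
    max_num_batched_tokens=2048,  # tokens processed per engine iteration
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may claim
    block_size=16,                # tokens per KV cache page
)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)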
Why This Matters
Without these optimizations, serving a 70B model might handle 8 concurrent requests. With vLLM, the same hardware handles 50+.
That's not a small improvement—it's the difference between "needs a cluster" and "runs on two GPUs."
The techniques aren't magic. They're careful memory management applied to a specific problem. vLLM's contribution was making them work together reliably in production.