
How vLLM Serves 10x More Requests

In 1962, the Atlas computer introduced virtual memory. Programs could use more memory than physically existed. The trick: pages that moved between fast core memory and slower drum storage, invisible to the program.

PagedAttention applies the same insight to LLM serving. The KV cache doesn't need one contiguous allocation. It can be paged.

The Contiguous Allocation Problem

Traditional LLM serving pre-allocates KV cache per request:

# Traditional approach
import torch

def allocate_kv_cache(
    max_seq_length: int,
    batch_size: int,
    num_layers: int,
    num_heads: int,
    head_dim: int,
) -> torch.Tensor:
    # Must allocate for the worst case: even if most requests use
    # ~500 tokens, reserve space for the full max_seq_length (e.g. 2048)
    return torch.zeros(
        batch_size,
        max_seq_length,  # Wasted whenever the actual length is shorter
        num_layers,
        num_heads,
        head_dim,
    )

If you support 2048-token sequences but the average request uses 500, roughly 75% of the allocated memory sits idle. That waste means fewer concurrent requests.

The PagedAttention Insight

Instead of one contiguous block per request, allocate small fixed-size blocks (pages) on demand:

# PagedAttention approach (conceptual)
BLOCK_SIZE = 16  # tokens per block

class KVBlock:
    """Fixed-size slab of KV-cache storage (stands in for a GPU buffer)."""
    def __init__(self, block_size: int):
        self.block_size = block_size

class PagedKVCache:
    def __init__(self, num_blocks: int):
        # Pool of blocks, handed out on demand
        self.block_pool = [KVBlock(BLOCK_SIZE) for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))
        # Block table: request_id -> ordered list of block ids
        self.block_tables: dict[int, list[int]] = {}

    def allocate_for_request(self, request_id: int, num_tokens: int) -> None:
        # Round up to whole blocks
        num_blocks_needed = (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
        if num_blocks_needed > len(self.free_blocks):
            raise MemoryError("KV cache pool exhausted")
        allocated = [self.free_blocks.pop() for _ in range(num_blocks_needed)]
        self.block_tables[request_id] = allocated

    def extend_request(self, request_id: int) -> None:
        # Request generated past its last block? Allocate one more.
        if not self.free_blocks:
            raise MemoryError("KV cache pool exhausted")
        self.block_tables[request_id].append(self.free_blocks.pop())

    def free_request(self, request_id: int) -> None:
        # Request done? Return its blocks to the pool.
        self.free_blocks.extend(self.block_tables.pop(request_id))

Requests only use what they need. No pre-allocation waste.
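As a quick illustration, here's how the sketch above would be used for a single request (block counts are illustrative):

# Illustrative usage of the sketch above
cache = PagedKVCache(num_blocks=1024)
cache.allocate_for_request(request_id=1, num_tokens=40)  # 3 blocks (48 slots), not 2048
cache.extend_request(request_id=1)   # decoding ran past 48 tokens: add a 4th block
cache.free_request(request_id=1)     # request finished: all 4 blocks return to the pool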

Memory Efficiency Gains

The impact is substantial:

Traditional allocation:
- Max sequence: 2048 tokens
- KV cache per request: ~2.7 GB (Llama-70B)
- Concurrent requests in 80GB: ~25

PagedAttention:
- Average sequence: 500 tokens
- KV cache per request: ~660 MB (on average)
- Concurrent requests in 80GB: ~100

4x more concurrent requests from memory management alone.
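The exact figures depend on model configuration, data type, and how much memory the weights consume, but the arithmetic is easy to reproduce. A rough back-of-the-envelope sketch (the model dimensions and memory budget below are assumptions for illustration, so the absolute counts differ from the figures above; the ~4x ratio falls out of 2048 vs. 500 tokens):

# Back-of-the-envelope KV-cache sizing (assumed model config, fp16 = 2 bytes/element)
num_layers, num_kv_heads, head_dim, dtype_bytes = 80, 64, 128, 2

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
per_request_reserved = kv_bytes_per_token * 2048  # traditional: worst-case reservation
per_request_actual = kv_bytes_per_token * 500     # paged: pay only for tokens produced

cache_budget = 64 * 1024**3  # assumed memory left for KV cache after weights
print(cache_budget // per_request_reserved)  # concurrent requests, traditional
print(cache_budget // per_request_actual)    # concurrent requests, paged (~4x more)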

The Attention Kernel Challenge

There's a catch. Standard attention kernels expect contiguous KV tensors. PagedAttention requires a modified kernel that:

  1. Follows the block table to gather K, V from non-contiguous blocks
  2. Computes attention across these gathered tensors
  3. Remains efficient despite the indirection

vLLM provides custom CUDA kernels for this. The overhead is small (~5-10%) compared to the memory savings.

# Simplified paged attention (PyTorch reference; vLLM's real kernel is custom CUDA)
import torch
from torch import Tensor

def paged_attention(
    query: Tensor,        # [batch, heads, head_dim]
    key_cache: Tensor,    # [num_blocks, block_size, heads, head_dim]
    value_cache: Tensor,  # [num_blocks, block_size, heads, head_dim]
    block_tables: Tensor, # [batch, max_blocks_per_seq]
    seq_lengths: Tensor,  # [batch]
) -> Tensor:
    block_size = key_cache.shape[1]
    scale = query.shape[-1] ** -0.5
    outputs = []
    for i in range(query.shape[0]):
        seq_len = int(seq_lengths[i])
        # Follow the block table to gather K, V from non-contiguous blocks
        blocks = block_tables[i, : (seq_len + block_size - 1) // block_size]
        k = key_cache[blocks].reshape(-1, *key_cache.shape[2:])[:seq_len]
        v = value_cache[blocks].reshape(-1, *value_cache.shape[2:])[:seq_len]
        # Standard scaled dot-product attention over the gathered tensors
        scores = torch.einsum("hd,shd->hs", query[i], k) * scale
        outputs.append(torch.einsum("hs,shd->hd", torch.softmax(scores, -1), v))
    return torch.stack(outputs)  # [batch, heads, head_dim]
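A toy call shows the expected shapes (the sizes below are arbitrary):

# Toy shapes: 2 sequences, 4 heads, head_dim 8, block_size 16, 32 blocks in the pool
q = torch.randn(2, 4, 8)
k_cache = torch.randn(32, 16, 4, 8)
v_cache = torch.randn(32, 16, 4, 8)
tables = torch.tensor([[0, 1, 2, 3], [4, 5, 0, 0]])  # block ids per sequence
lengths = torch.tensor([50, 20])                      # 50 tokens -> 4 blocks, 20 -> 2
out = paged_attention(q, k_cache, v_cache, tables, lengths)
print(out.shape)  # torch.Size([2, 4, 8])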

Beyond Memory: Copy-on-Write

PagedAttention enables another optimization: copy-on-write for prefix sharing.

If 10 requests share the same system prompt, they can share the same KV cache blocks for that prefix:

# Without sharing
# Request 1: [system_prompt KV] [user_1 KV]  - 1024 + 512 = 1536 tokens
# Request 2: [system_prompt KV] [user_2 KV]  - 1024 + 512 = 1536 tokens
# Total: 3072 tokens worth of KV cache

# With sharing (copy-on-write)
# Shared:    [system_prompt KV]              - 1024 tokens
# Request 1: → shared, then [user_1 KV]      -  512 tokens
# Request 2: → shared, then [user_2 KV]      -  512 tokens
# Total: 2048 tokens worth of KV cache

# ~33% savings with two requests; with the 10 requests above, the
# 1024-token prefix is stored once instead of 10 times (~60% savings)

This is why prefix caching in vLLM is so effective. The memory savings compound.
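Mechanically, copy-on-write just needs per-block reference counts: forked requests point at the same blocks, and a block is copied only when a request is about to write into one that others still reference. A minimal sketch, reusing the block-table idea from earlier (the class and method names here are illustrative, not vLLM's actual block manager API):

# Minimal copy-on-write sketch (illustrative names, not vLLM's block manager API)
class CowBlockManager:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = [0] * num_blocks
        self.block_tables: dict[int, list[int]] = {}

    def allocate(self, request_id: int, num_blocks: int) -> None:
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        for block_id in blocks:
            self.ref_counts[block_id] = 1
        self.block_tables[request_id] = blocks

    def fork(self, parent_id: int, child_id: int) -> None:
        # Child shares the parent's blocks (e.g. a common system prompt); nothing is copied
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        for block_id in self.block_tables[child_id]:
            self.ref_counts[block_id] += 1

    def ensure_writable(self, request_id: int, slot: int) -> int:
        # Called before appending new KV entries into one of the request's blocks
        block_id = self.block_tables[request_id][slot]
        if self.ref_counts[block_id] > 1:
            new_block = self.free_blocks.pop()  # shared: copy before writing
            self.ref_counts[new_block] = 1
            self.ref_counts[block_id] -= 1
            # (the K/V contents of block_id would be copied into new_block on the GPU)
            self.block_tables[request_id][slot] = new_block
            return new_block
        return block_id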

When PagedAttention Matters Most

The benefits scale with:

  • High concurrency: More requests = more memory pressure = more savings
  • Variable output lengths: Pre-allocation waste is highest when lengths vary
  • Shared prefixes: Copy-on-write multiplies the savings
  • Long contexts: More tokens = bigger KV cache = bigger waste from pre-allocation

For single-request, short-context inference, PagedAttention's overhead might not be worth it. For production serving at scale, it's transformative.

The Bigger Picture

PagedAttention is part of a broader trend: applying systems engineering principles to ML infrastructure.

Virtual memory solved a similar problem in the 1960s. Garbage collection solved memory management for applications. Now we're solving it for neural network inference.

The lesson: sometimes the biggest wins come not from better models, but from better systems around them.