Understanding What Makes vLLM Fast
Libraries are organized by a cataloging system. Without it, finding a book means searching every shelf. With it, you go directly to the right location.
vLLM applies the same kind of organization to GPU memory. Instead of reserving one large contiguous block per request, much of which goes unused, it manages the KV cache in small pages and puts every available byte to work.
The Memory Problem vLLM Solves
class NaiveMemoryAllocation:
    """How most frameworks handle KV cache"""

    def allocate_for_request(self, max_length: int):
        # Pre-allocate for worst case
        # Request says max_tokens=2048? Allocate all 2048.
        # Reality: average output is 150 tokens
        # Waste: 2048 - 150 = 1898 tokens worth of memory
        # With 100 concurrent requests:
        # 100 * 1898 * kv_size_per_token = massive waste
        pass
class PagedAllocation:
    """How vLLM handles KV cache"""

    def allocate_for_request(self):
        # Allocate pages as needed
        # Start with 1 page (e.g., 16 tokens)
        # Add pages as generation continues
        # Reality: only allocate what's used
        # No waste from over-provisioning
        pass
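To put rough numbers on that waste, here is a back-of-the-envelope sketch. The per-token KV size assumes a Llama-style 70B model (80 layers, 8 grouped KV heads, head dimension 128, fp16 cache); treat the figures as illustrative, not measured.

def kv_cache_waste_sketch():
    """Back-of-the-envelope waste estimate under the assumptions above."""
    # 2 (K and V) * 80 layers * 8 KV heads * 128 head_dim * 2 bytes (fp16)
    kv_bytes_per_token = 2 * 80 * 8 * 128 * 2   # 320 KiB per token

    max_tokens = 2048         # what the naive allocator reserves per request
    avg_output_tokens = 150   # what requests actually use on average
    concurrent_requests = 100

    wasted_tokens = max_tokens - avg_output_tokens               # 1898
    wasted = concurrent_requests * wasted_tokens * kv_bytes_per_token
    print(f"Naive allocation wastes ~{wasted / 2**30:.0f} GiB")  # ~58 GiB

    # Paged allocation overshoots by at most one partially filled
    # 16-token page per request.
    overshoot = concurrent_requests * 16 * kv_bytes_per_token
    print(f"Paged worst-case overshoot ~{overshoot / 2**30:.2f} GiB")  # ~0.49 GiB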
PagedAttention Explained
class PagedAttention:
    """Virtual memory, but for attention"""

    def __init__(self, page_size: int = 16, num_pages: int = 0):
        self.page_size = page_size
        self.page_table = {}  # request_id -> list of physical pages
        # Physical pages available on the GPU, sized from total KV cache memory
        self.free_pages = list(range(num_pages))

    def allocate_page(self, request_id: str) -> int:
        """Allocate a new page for a request"""
        if not self.free_pages:
            # No free pages, need to handle (eviction, OOM, etc.)
            raise MemoryError("No free pages")
        physical_page = self.free_pages.pop()
        if request_id not in self.page_table:
            self.page_table[request_id] = []
        self.page_table[request_id].append(physical_page)
        return physical_page

    def free_request(self, request_id: str):
        """Return pages when request completes"""
        pages = self.page_table.pop(request_id, [])
        self.free_pages.extend(pages)
        # Pages immediately available for the next request
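A quick toy run of the sketch above shows the bookkeeping (the page count and request name are made up for illustration):

# Toy usage of the PagedAttention sketch above
pool = PagedAttention(page_size=16, num_pages=8)

# A request that generates 40 tokens needs ceil(40 / 16) = 3 pages
for _ in range(3):
    pool.allocate_page("req-1")
print(pool.page_table["req-1"])  # e.g. [7, 6, 5]
print(len(pool.free_pages))      # 5 pages still free for other requests

pool.free_request("req-1")
print(len(pool.free_pages))      # back to 8, immediately reusable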
Continuous Batching: The Other Half
class ContinuousBatching:
    """Don't wait for the whole batch to complete"""

    def __init__(self, max_batch: int = 32):
        self.max_batch = max_batch
        self.active_requests = []
        self.pending_requests = []

    def iteration_step(self):
        # Traditional batching:
        #   Wait for ALL requests to finish → then start a new batch
        #   Problem: short requests wait for long ones
        # Continuous batching:
        #   Each iteration, check for completed requests and
        #   immediately add new requests to fill the freed slots
        completed = [r for r in self.active_requests if r.is_done]
        for request in completed:
            self.active_requests.remove(request)
            self._return_memory(request)

        # Fill empty slots
        while len(self.active_requests) < self.max_batch:
            if self.pending_requests:
                new_request = self.pending_requests.pop(0)
                self.active_requests.append(new_request)
            else:
                break

        # Run one decode step for all active requests
        self._decode_step(self.active_requests)

    def _return_memory(self, request):
        """Hand the request's KV cache pages back to the block manager."""

    def _decode_step(self, requests):
        """Run one forward pass that produces the next token for each request."""
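To see the effect, here is a toy driver for the sketch above. The Request and ToyBatcher classes are stand-ins made up for illustration; in a real engine the decode step runs the model and the memory hook talks to the block manager.

# Toy driver for the ContinuousBatching sketch (illustrative names only)
class Request:
    """Pretend request that finishes after a fixed number of decode steps."""
    def __init__(self, name: str, steps_needed: int):
        self.name = name
        self.steps_needed = steps_needed
        self.steps_done = 0

    @property
    def is_done(self) -> bool:
        return self.steps_done >= self.steps_needed


class ToyBatcher(ContinuousBatching):
    def _decode_step(self, requests):
        for r in requests:  # stand-in for one model forward pass
            r.steps_done += 1


batcher = ToyBatcher(max_batch=2)
batcher.pending_requests = [Request("short", 2), Request("long", 10), Request("next", 3)]

for step in range(1, 6):
    batcher.iteration_step()
    print(step, [r.name for r in batcher.active_requests])
# "short" finishes after 2 steps; "next" is admitted on the very next
# iteration instead of waiting 10 steps for "long" to finish.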
Memory Efficiency in Numbers
def memory_comparison():
    """Real numbers from a 70B model deployment"""
    # Without PagedAttention (naive allocation)
    naive = {
        "max_concurrent_requests": 8,
        "memory_per_request_gb": 8.0,  # Allocated for max_tokens
        "actual_usage_gb": 1.2,        # Average actual use
        "waste_percentage": 85,
    }

    # With PagedAttention
    paged = {
        "max_concurrent_requests": 50,
        "memory_per_request_gb": 1.3,    # Only what's needed + overhead
        "fragmentation_overhead": 0.05,  # 5% for page management
        "improvement": "6x more concurrent requests",
    }
    return naive, paged
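The two headline figures fall straight out of those numbers:

# Deriving the headline figures from the numbers above
waste = 1 - 1.2 / 8.0  # 0.85 -> the 85% waste figure
gain = 50 / 8          # 6.25 -> "6x more concurrent requests"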
The vLLM Architecture
class VLLMArchitecture:
    components = {
        "Scheduler": {
            "role": "Decides which requests to process",
            "strategy": "Maximize throughput while meeting latency targets",
        },
        "PagedAttention": {
            "role": "Manages KV cache memory",
            "key_insight": "Non-contiguous allocation",
        },
        "Worker": {
            "role": "Runs model inference",
            "optimization": "Batched operations across requests",
        },
        "BlockManager": {
            "role": "Allocates and frees memory pages",
            "pattern": "Copy-on-write for prompt caching",
        },
    }

    request_flow = """
    1. Request arrives → Scheduler queues it
    2. Scheduler picks requests for the next batch
    3. BlockManager allocates KV cache pages
    4. Worker runs prefill (input processing)
    5. Worker runs decode iterations (token generation)
    6. On completion → BlockManager frees pages
    7. Scheduler admits new requests
    """
Configuration That Matters
def key_vllm_parameters() -> dict:
    return {
        "--max-num-seqs": {
            "default": 256,
            "effect": "Max concurrent requests",
            "tune": "Higher = more throughput, more memory",
        },
        "--max-num-batched-tokens": {
            "default": 2048,
            "effect": "Tokens processed per iteration",
            "tune": "Balance batch efficiency vs latency",
        },
        "--gpu-memory-utilization": {
            "default": 0.9,
            "effect": "How much GPU memory to use",
            "tune": "Lower for safety margin, higher for throughput",
        },
        "--block-size": {
            "default": 16,
            "effect": "Tokens per page",
            "tune": "Rarely needs changing",
        },
    }
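For reference, here is how those knobs are typically passed through vLLM's Python API. The keyword names mirror the CLI flags; defaults and accepted arguments vary by vLLM version, and the model name below is just an example.

# Passing the same knobs through vLLM's Python API (check your version's docs)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model, swap in your own
    max_num_seqs=256,             # cap on concurrent sequences
    max_num_batched_tokens=2048,  # tokens processed per engine iteration
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may claim
    block_size=16,                # tokens per KV cache page
)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)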
Why This Matters
Without these optimizations, serving a 70B model might handle 8 concurrent requests. With vLLM, the same hardware handles 50+.
That's not a small improvement—it's the difference between "needs a cluster" and "runs on two GPUs."
The techniques aren't magic. They're careful memory management applied to a specific problem. vLLM's contribution was making them work together reliably in production.