Moving Beyond Simple Request Batching
Static batching is the assembly line where every car waits for the slowest worker to finish before moving to the next station. Continuous batching is Toyota's just-in-time manufacturing. Same resources, dramatically different output.
The Static Batching Problem
Traditional batching collects N requests, processes them together, returns all results:
```python
# Static batching
def process_batch(requests: list[Request]) -> list[Response]:
    # All requests start together.
    # All requests must finish before any can return.
    # If one request generates 500 tokens and another generates 50,
    # the 50-token request waits for the 500-token request.
    batch_outputs = model.generate(
        prompts=[r.prompt for r in requests],
        max_tokens=max(r.max_tokens for r in requests),
    )
    return batch_outputs
```
The waste: a request generating 50 tokens sits idle for the 450 iterations it takes for the longest request to finish.
Continuous Batching: The Fix
Continuous batching operates at the iteration level, not the request level:
```python
# Continuous batching (conceptual)
import asyncio


class ContinuousBatcher:
    def __init__(self, model, max_batch_size: int):
        self.model = model
        self.max_batch_size = max_batch_size
        self.active_requests: dict[str, RequestState] = {}
        self.pending_queue: asyncio.Queue[Request] = asyncio.Queue()

    async def step(self):
        # Remove completed requests and hand their results back to callers
        completed = [
            req_id for req_id, state in self.active_requests.items()
            if state.is_done()
        ]
        for req_id in completed:
            self.active_requests[req_id].return_result()
            del self.active_requests[req_id]

        # Admit new requests from the queue while there is spare capacity
        while (len(self.active_requests) < self.max_batch_size
               and not self.pending_queue.empty()):
            new_req = self.pending_queue.get_nowait()
            self.active_requests[new_req.id] = RequestState(new_req)

        # Run one decode step for all active requests
        if self.active_requests:
            self.model.decode_step(list(self.active_requests.values()))
```
Requests enter when there's capacity. Requests leave when they're done. The batch composition changes every iteration.
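A minimal sketch of how this might be driven (assuming the conceptual `Request` and `RequestState` objects above, with results handed back to callers via `return_result()`): request handlers enqueue work at any time, while a single loop advances one decode iteration per pass.

```python
# Hypothetical driver loop for the conceptual ContinuousBatcher above.
async def serve_forever(batcher: ContinuousBatcher) -> None:
    while True:
        await batcher.step()    # retire finished requests, admit new ones, decode once
        await asyncio.sleep(0)  # yield so handlers can enqueue new requests


async def handle_request(batcher: ContinuousBatcher, request: Request) -> None:
    # Called from, e.g., an HTTP handler; the result flows back through
    # RequestState.return_result() once the request finishes decoding.
    await batcher.pending_queue.put(request)
```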
The Throughput Difference
Consider 100 requests with output lengths uniformly distributed between 50 and 500 tokens:
Static batching (batch size 10):
- 10 batches
- Each batch runs until its longest request finishes (close to 500 tokens on average)
- Total: ~5000 decode iterations
- Effective tokens generated: ~27,500 (avg 275 × 100)
- Tokens per iteration: 5.5
Continuous batching (max batch size 10):
- Requests flow through continuously
- Short requests complete fast, slots refilled
- ~2,750 iterations (27,500 total tokens with the batch kept nearly full)
- Tokens per iteration: 10
That's roughly 2x throughput improvement from scheduling alone.
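Those numbers are back-of-the-envelope. A toy simulation that counts decode iterations only (ignoring prefill, memory limits, and scheduling overhead; all parameters are illustrative) reproduces the gap:

```python
# Toy simulation of decode iterations only (ignores prefill, memory,
# and scheduling overhead). Parameters mirror the example above.
import random

random.seed(0)
lengths = [random.randint(50, 500) for _ in range(100)]  # output tokens per request
batch_size = 10

# Static batching: each batch of 10 runs as long as its longest request.
static_iters = sum(
    max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
)

# Continuous batching: every iteration decodes one token for up to 10 active
# requests; a finished request's slot is refilled on the next iteration.
remaining = sorted(lengths)
active: list[int] = []
continuous_iters = 0
while remaining or active:
    while remaining and len(active) < batch_size:
        active.append(remaining.pop())
    continuous_iters += 1
    active = [tokens - 1 for tokens in active if tokens > 1]

total_tokens = sum(lengths)
print(f"static:     {static_iters} iterations, "
      f"{total_tokens / static_iters:.1f} tokens/iteration")
print(f"continuous: {continuous_iters} iterations, "
      f"{total_tokens / continuous_iters:.1f} tokens/iteration")
```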
Memory Management Complexity
The catch: continuous batching requires dynamic memory allocation for KV cache.
Static batching can pre-allocate a fixed block. Continuous batching needs to:
- Allocate KV cache for new requests
- Deallocate when requests complete
- Handle fragmentation
This is where PagedAttention comes in. It manages the KV cache like virtual memory, allocating it in small fixed-size pages rather than one contiguous region per request.
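A minimal sketch of the bookkeeping involved (a hypothetical `PagedKVCache`, not vLLM's actual implementation): the cache is carved into fixed-size pages, each request keeps a small page table, and a finished request's pages go straight back to the free list.

```python
# Minimal sketch of page-based KV-cache bookkeeping (hypothetical class,
# not vLLM's actual implementation). Each page holds page_size tokens' worth
# of keys/values; requests grow page by page and free everything on completion.
class PagedKVCache:
    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))        # physical page indices
        self.page_tables: dict[str, list[int]] = {}     # request id -> its pages

    def append_token(self, req_id: str, num_tokens_so_far: int) -> int:
        """Return the physical page for the next token, allocating if needed."""
        table = self.page_tables.setdefault(req_id, [])
        if num_tokens_so_far % self.page_size == 0:     # current page is full (or none yet)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted; preempt or queue the request")
            table.append(self.free_pages.pop())
        return table[-1]

    def release(self, req_id: str) -> None:
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.page_tables.pop(req_id, []))
```

Because every page is the same size and pages need not be contiguous, freeing a finished request never leaves a hole that only a similarly sized request can reuse, which is exactly the fragmentation problem a contiguous allocator would face.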
Prefill vs Decode Scheduling
Another subtlety: prefill (processing the input prompt) is compute-bound. Decode (generating one token at a time) is memory-bandwidth-bound.
Naive continuous batching might schedule a large prefill alongside decode operations, starving the decode batch of compute.
Smart schedulers separate these:
```python
class SmartScheduler:
    def schedule(self):
        # Option 1: dedicated prefill phase -- drain pending prefills first
        # (a few at a time) before returning to pure decode iterations
        if self.policy == "prefill_first" and self.pending_prefills:
            return self.run_prefills(max_batch=4)

        # Option 2: chunked prefill -- mix small prefill chunks into the
        # decode batch whenever it has spare room
        if (self.policy == "chunked_prefill" and self.pending_prefills
                and len(self.decode_batch) < self.chunk_threshold):
            return self.run_mixed_batch()

        # Default: decode only
        return self.run_decode_step()
```
vLLM, for example, lets you configure how aggressively to interleave prefill with decode.
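For example, a rough configuration sketch using vLLM's offline `LLM` API (argument values and the model name are illustrative, and argument names are version-dependent; check the docs for your release):

```python
# Rough example of tuning prefill/decode interleaving in vLLM
# (argument names as of recent releases -- verify against your version's docs).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,   # split long prompts into chunks mixed with decode
    max_num_batched_tokens=2048,   # per-iteration token budget shared by prefill and decode
    max_num_seqs=256,              # cap on concurrently active requests
)
```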
When Static Batching Still Wins
Continuous batching has overhead. For some workloads, static batching is simpler and sufficient:
- Uniform output lengths: If all requests generate similar token counts, the waste is minimal
- Batch inference pipelines: Processing 10,000 documents offline? Batch away
- Low concurrency: With 2-3 concurrent requests, the complexity isn't worth it
The Production Checklist
If you're evaluating a serving framework for continuous batching:
- Does it support iteration-level scheduling? (vLLM and TensorRT-LLM do; a plain HuggingFace transformers generate() call doesn't)
- How does it handle prefill scheduling? (Some starve decode, some don't)
- What's the memory management strategy? (PagedAttention? Contiguous?)
- Can it preempt long requests? (Matters for latency SLAs)
- What's the max batch size before degradation? (Depends on model and hardware)
The serving framework choice matters more than most model optimizations. A well-tuned continuous batcher on basic hardware often beats a poorly scheduled system on premium GPUs.