Moving Beyond Simple Request Batching
Static batching is the assembly line where every car waits for the slowest worker to finish before moving to the next station. Continuous batching is Toyota's just-in-time manufacturing. Same resources, dramatically different output.
The Static Batching Problem
Traditional batching collects N requests, processes them together, returns all results:
```python
# Static batching
def process_batch(requests: list[Request]) -> list[Response]:
    # All requests start together.
    # All requests must finish before any can return.
    # If one request generates 500 tokens and another generates 50,
    # the 50-token request waits for the 500-token request.
    batch_outputs = model.generate(
        prompts=[r.prompt for r in requests],
        max_tokens=max(r.max_tokens for r in requests),
    )
    return batch_outputs
```
The waste: a request generating 50 tokens sits idle for the 450 iterations it takes for the longest request to finish.
Continuous Batching: The Fix
Continuous batching operates at the iteration level, not the request level:
```python
# Continuous batching (conceptual)
import asyncio


class ContinuousBatcher:
    def __init__(self, model, max_batch_size: int):
        self.model = model
        self.max_batch_size = max_batch_size
        self.active_requests: dict[str, RequestState] = {}
        self.pending_queue: asyncio.Queue[Request] = asyncio.Queue()

    async def step(self):
        # Remove completed requests and hand their results back to callers
        completed = [
            req_id for req_id, state in self.active_requests.items()
            if state.is_done()
        ]
        for req_id in completed:
            self.active_requests[req_id].return_result()
            del self.active_requests[req_id]

        # Admit new requests from the queue while there is spare capacity
        while (len(self.active_requests) < self.max_batch_size
               and not self.pending_queue.empty()):
            new_req = self.pending_queue.get_nowait()
            self.active_requests[new_req.id] = RequestState(new_req)

        # Run one decode step for all active requests
        if self.active_requests:
            self.model.decode_step(list(self.active_requests.values()))
```
Requests enter when there's capacity. Requests leave when they're done. The batch composition changes every iteration.
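A minimal sketch of how this might be driven (assuming the conceptual `Request` and `RequestState` objects above, with results handed back to callers via `return_result()`): request handlers enqueue work at any time, while a single loop advances one decode iteration per pass.

```python
# Hypothetical driver loop for the conceptual ContinuousBatcher above.
async def serve_forever(batcher: ContinuousBatcher) -> None:
    while True:
        await batcher.step()    # retire finished requests, admit new ones, decode once
        await asyncio.sleep(0)  # yield so handlers can enqueue new requests


async def handle_request(batcher: ContinuousBatcher, request: Request) -> None:
    # Called from, e.g., an HTTP handler; the result flows back through
    # RequestState.return_result() once the request finishes decoding.
    await batcher.pending_queue.put(request)
```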
The Throughput Difference
Consider 100 requests with output lengths uniformly distributed between 50 and 500 tokens:
Static batching (batch size 10):
- 10 batches
- Each batch runs until its longest request finishes (close to 500 tokens on average)
- Total: ~5000 decode iterations
- Effective tokens generated: ~27,500 (avg 275 × 100)
- Tokens per iteration: 5.5
Continuous batching (max batch size 10):
- Requests flow through continuously
- Short requests complete fast, slots refilled
- ~2,750 iterations (27,500 total tokens with the batch kept nearly full)
- Tokens per iteration: 10
That's roughly 2x throughput improvement from scheduling alone.
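Those numbers are back-of-the-envelope. A toy simulation that counts decode iterations only (ignoring prefill, memory limits, and scheduling overhead; all parameters are illustrative) reproduces the gap:

```python
# Toy simulation of decode iterations only (ignores prefill, memory,
# and scheduling overhead). Parameters mirror the example above.
import random

random.seed(0)
lengths = [random.randint(50, 500) for _ in range(100)]  # output tokens per request
batch_size = 10

# Static batching: each batch of 10 runs as long as its longest request.
static_iters = sum(
    max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
)

# Continuous batching: every iteration decodes one token for up to 10 active
# requests; a finished request's slot is refilled on the next iteration.
remaining = sorted(lengths)
active: list[int] = []
continuous_iters = 0
while remaining or active:
    while remaining and len(active) < batch_size:
        active.append(remaining.pop())
    continuous_iters += 1
    active = [tokens - 1 for tokens in active if tokens > 1]

total_tokens = sum(lengths)
print(f"static:     {static_iters} iterations, "
      f"{total_tokens / static_iters:.1f} tokens/iteration")
print(f"continuous: {continuous_iters} iterations, "
      f"{total_tokens / continuous_iters:.1f} tokens/iteration")
```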
Memory Management Complexity
The catch: continuous batching requires dynamic memory allocation for KV cache.
Static batching can pre-allocate a fixed block. Continuous batching needs to:
- Allocate KV cache for new requests
- Deallocate when requests complete
- Handle fragmentation
This is where PagedAttention comes in. It manages the KV cache like virtual memory, allocating it in small fixed-size pages rather than one contiguous region per request.
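A minimal sketch of the bookkeeping involved (a hypothetical `PagedKVCache`, not vLLM's actual implementation): the cache is carved into fixed-size pages, each request keeps a small page table, and a finished request's pages go straight back to the free list.

```python
# Minimal sketch of page-based KV-cache bookkeeping (hypothetical class,
# not vLLM's actual implementation). Each page holds page_size tokens' worth
# of keys/values; requests grow page by page and free everything on completion.
class PagedKVCache:
    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))        # physical page indices
        self.page_tables: dict[str, list[int]] = {}     # request id -> its pages

    def append_token(self, req_id: str, num_tokens_so_far: int) -> int:
        """Return the physical page for the next token, allocating if needed."""
        table = self.page_tables.setdefault(req_id, [])
        if num_tokens_so_far % self.page_size == 0:     # current page is full (or none yet)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted; preempt or queue the request")
            table.append(self.free_pages.pop())
        return table[-1]

    def release(self, req_id: str) -> None:
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.page_tables.pop(req_id, []))
```

Because every page is the same size and pages need not be contiguous, freeing a finished request never leaves a hole that only a similarly sized request can reuse, which is exactly the fragmentation problem a contiguous allocator would face.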
Prefill vs Decode Scheduling
Another subtlety: prefill (processing the input prompt) is compute-bound. Decode (generating one token at a time) is memory-bandwidth-bound.
Naive continuous batching might schedule a large prefill alongside decode operations, starving the decode batch of compute.
Smart schedulers separate these:
```python
class SmartScheduler:
    def schedule(self):
        # Option 1: dedicated prefill phase -- drain pending prefills first
        # (a few at a time) before returning to pure decode iterations
        if self.policy == "prefill_first" and self.pending_prefills:
            return self.run_prefills(max_batch=4)

        # Option 2: chunked prefill -- mix small prefill chunks into the
        # decode batch whenever it has spare room
        if (self.policy == "chunked_prefill" and self.pending_prefills
                and len(self.decode_batch) < self.chunk_threshold):
            return self.run_mixed_batch()

        # Default: decode only
        return self.run_decode_step()
```
vLLM, for example, lets you configure how aggressively to interleave prefill with decode.
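For example, a rough configuration sketch using vLLM's offline `LLM` API (argument values and the model name are illustrative, and argument names are version-dependent; check the docs for your release):

```python
# Rough example of tuning prefill/decode interleaving in vLLM
# (argument names as of recent releases -- verify against your version's docs).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,   # split long prompts into chunks mixed with decode
    max_num_batched_tokens=2048,   # per-iteration token budget shared by prefill and decode
    max_num_seqs=256,              # cap on concurrently active requests
)
```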
When Static Batching Still Wins
Continuous batching has overhead. For some workloads, static batching is simpler and sufficient:
- Uniform output lengths: If all requests generate similar token counts, the waste is minimal
- Batch inference pipelines: Processing 10,000 documents offline? Batch away
- Low concurrency: With 2-3 concurrent requests, the complexity isn't worth it
The Production Checklist
If you're evaluating a serving framework for continuous batching:
- Does it support iteration-level scheduling? (vLLM and TensorRT-LLM do; a plain HuggingFace transformers generate() call doesn't)
- How does it handle prefill scheduling? (Some starve decode, some don't)
- What's the memory management strategy? (PagedAttention? Contiguous?)
- Can it preempt long requests? (Matters for latency SLAs)
- What's the max batch size before degradation? (Depends on model and hardware)
The serving framework choice matters more than most model optimizations. A well-tuned continuous batcher on basic hardware often beats a poorly scheduled system on premium GPUs.