The Tradeoff Every Inference System Makes

Every highway has this tradeoff. Pack more cars on the road, and each car moves slower. Reduce traffic, and each car moves faster, but total throughput drops.

LLM inference is the same. You can optimize for throughput or latency. You cannot optimize for both.

The Fundamental Tension

Batch size 1:
- Tokens per second: 50
- Latency per request: 200ms
- GPU utilization: 15%

Batch size 32:
- Tokens per second: 800
- Latency per request: 1,200ms
- GPU utilization: 80%

Larger batches amortize fixed costs (kernel launches, memory transfers) across more requests. But each request waits longer to start and competes for resources.
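
A quick back-of-the-envelope on the illustrative numbers above makes the tension concrete: aggregate throughput rises 16x, but each request waits 6x longer and gets half the per-request token rate.

# Illustrative numbers from the lists above (not measurements)
batch_1 = {"tokens_per_s": 50, "latency_ms": 200}
batch_32 = {"tokens_per_s": 800, "latency_ms": 1200}

throughput_gain = batch_32["tokens_per_s"] / batch_1["tokens_per_s"]  # 16.0x more total work
latency_cost = batch_32["latency_ms"] / batch_1["latency_ms"]         # 6.0x longer per request
per_request_rate = batch_32["tokens_per_s"] / 32                      # 25 tok/s each, vs 50 at batch 1

print(throughput_gain, latency_cost, per_request_rate)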

Why Batching Helps Throughput

The GPU has massive parallelism. A single inference request uses a fraction of it:

# Single request
# Model weights: loaded once
# KV cache: small
# Compute: mostly idle cores

# Batched requests
# Model weights: loaded once (amortized across batch)
# KV cache: larger but parallel
# Compute: more cores utilized

Loading model weights is a fixed cost whether you process 1 or 32 requests. Batching spreads that cost.
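
One way to see the amortization is a toy cost model: each decode step pays a fixed overhead (weight loads, kernel launches) plus a small marginal cost per request in the batch. The numbers below are made up for illustration, not measurements of any particular GPU or model.

# Toy cost model (made-up numbers): step_time = fixed overhead + per-request work
FIXED_MS = 18.0        # weight loads, kernel launches: paid once per step
PER_REQUEST_MS = 0.5   # marginal compute per request in the batch

def step_time_ms(batch_size: int) -> float:
    return FIXED_MS + PER_REQUEST_MS * batch_size

for b in (1, 8, 32):
    t = step_time_ms(b)
    throughput = b * 1000 / t  # tokens/s, assuming one token per request per step
    print(f"batch={b:2d}  step={t:.1f}ms  throughput={throughput:.0f} tok/s")

Throughput climbs steeply at first and then flattens as the per-request work starts to dominate, which is exactly the shape of the frontier sketched below.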

Why Batching Hurts Latency

Several mechanisms:

Queuing: Requests wait to fill a batch. If batch size is 32 and you have 5 requests, they wait for 27 more (or a timeout).

Longest-request bottleneck: In static batching, all requests wait for the longest to finish. A 2000-token response holds up 50-token responses.

Memory pressure: Larger batches consume more KV cache, potentially triggering swapping or OOM.

Compute contention: Even with parallelism, resources are finite. 32 requests share bandwidth that 1 request had exclusively.
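
The longest-request bottleneck is easy to quantify. Under static batching every request is held until the longest one finishes, so short responses inherit the long response's decode time. A sketch with made-up lengths and a hypothetical decode speed (it ignores the per-step slowdown from batching itself):

# Static batching: the batch completes when its longest sequence does
output_lengths = [50, 80, 120, 2000]   # tokens per request (made up)
ms_per_token = 20                      # hypothetical decode speed

batch_finish_ms = max(output_lengths) * ms_per_token  # 40,000 ms for everyone
for n in output_lengths:
    alone_ms = n * ms_per_token
    print(f"{n:5d} tokens: {alone_ms:6d}ms alone vs {batch_finish_ms}ms in the batch")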

The Pareto Frontier

For any system, there's a curve of achievable (throughput, latency) pairs:

Latency (ms)
    |
3000|  * <- Batch 64, max throughput
    |
2000|     * <- Batch 32
    |
1000|        * <- Batch 16
    |
 500|           * <- Batch 8
    |
 200|              * <- Batch 1, min latency
    +------------------------
       200  400  600  800  1000
                Throughput (tok/s)

You can operate anywhere on this curve. You cannot get below it: lower latency at the same throughput is not on offer. The shape of the curve depends on your hardware and model.

Choosing Your Point

Different applications need different tradeoffs:

Chat applications: Users waiting for a response. Latency matters more than throughput. Stay toward the low-batch end.

config = {
    "max_batch_size": 8,
    "max_waiting_time_ms": 50,  # Don't wait long to fill batch
    "latency_sla_ms": 500,
}

Batch processing: Processing 100,000 documents overnight. Throughput matters, latency doesn't.

config = {
    "max_batch_size": 64,
    "max_waiting_time_ms": 5000,  # Wait to fill batches
    # No latency SLA
}

Mixed workloads: Separate queues with different configurations.

configs = {
    "interactive": {
        "max_batch_size": 8,
        "priority": "high",
        "latency_sla_ms": 500,
    },
    "batch": {
        "max_batch_size": 64,
        "priority": "low",
        "latency_sla_ms": None,
    }
}
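
A minimal sketch of how that separation might look in code: one queue per workload class, with a dispatcher that always drains the interactive queue first. The names here (route, queues, pick_queue) are illustrative, not any framework's API.

import asyncio

# One queue per workload class, matching the configs above (illustrative)
queues = {
    "interactive": asyncio.Queue(),
    "batch": asyncio.Queue(),
}

async def route(request, workload: str) -> None:
    # Callers tag each request with its workload class
    await queues[workload].put(request)

def pick_queue() -> tuple[str, asyncio.Queue]:
    # High-priority traffic is served first; offline work runs when it is idle
    if not queues["interactive"].empty():
        return "interactive", queues["interactive"]
    return "batch", queues["batch"]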

The Timeout Dance

A common pattern: wait up to T milliseconds for a batch to fill, then process whatever you have.

import asyncio
import time
from typing import Any

Request = Any  # stand-in for whatever object your server enqueues

async def batch_with_timeout(
    queue: asyncio.Queue,
    max_batch_size: int,
    timeout_ms: float,
) -> list[Request]:
    batch: list[Request] = []
    # Use the monotonic clock for deadlines; wall-clock time can jump
    deadline = time.monotonic() + timeout_ms / 1000

    while len(batch) < max_batch_size and time.monotonic() < deadline:
        try:
            # Only wait as long as the remaining time budget allows
            request = await asyncio.wait_for(
                queue.get(),
                timeout=deadline - time.monotonic(),
            )
            batch.append(request)
        except asyncio.TimeoutError:
            break

    return batch

Short timeout = lower latency, smaller batches. Long timeout = higher throughput, larger batches.
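
In a serving loop, the helper above might be used like this; run_inference is a stand-in for whatever actually executes the batch on the GPU, and the batch size and timeout are just example values.

async def serve(queue: asyncio.Queue, run_inference) -> None:
    # Collect a batch (bounded by size and timeout), run it, repeat
    while True:
        batch = await batch_with_timeout(queue, max_batch_size=8, timeout_ms=50)
        if batch:
            await run_inference(batch)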

Dynamic Batch Sizing

Advanced systems adjust batch size based on load:

from statistics import mean

def adaptive_batch_size(
    current_batch_size: int,
    current_queue_depth: int,
    recent_latencies: list[float],
    target_latency_ms: float,
    max_batch_size: int,
) -> int:
    # No recent data: keep the current setting
    if not recent_latencies:
        return current_batch_size

    avg_latency = mean(recent_latencies)

    if avg_latency > target_latency_ms * 1.2:
        # Latency too high, reduce batch size
        return max(1, current_batch_size - 4)
    elif avg_latency < target_latency_ms * 0.5 and current_queue_depth > 10:
        # Headroom available and queue building, increase batch
        return min(max_batch_size, current_batch_size + 4)
    else:
        return current_batch_size

When load is low, process immediately with small batches. When load is high, batch more aggressively to keep up.
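
Wired into a server, the adjustment might run on a timer. A sketch under assumptions: scheduler_config is one of the config dicts from earlier, queue and latency_window are whatever the server already tracks, and the five-second interval and cap of 64 are illustrative.

async def tune_batch_size(scheduler_config: dict, queue: asyncio.Queue,
                          latency_window: list[float]) -> None:
    # Hypothetical control loop: re-evaluate the batch size every few seconds
    while True:
        scheduler_config["max_batch_size"] = adaptive_batch_size(
            current_batch_size=scheduler_config["max_batch_size"],
            current_queue_depth=queue.qsize(),
            recent_latencies=latency_window,
            target_latency_ms=scheduler_config["latency_sla_ms"],
            max_batch_size=64,
        )
        await asyncio.sleep(5)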

The Uncomfortable Truth

You can't escape this tradeoff. You can only:

  1. Move along the curve (configuration)
  2. Shift the curve itself (better hardware, model optimization)
  3. Serve different workloads on different curves (separation)

Anyone promising high throughput AND low latency is either defining those terms differently than you are or measuring something else.

Know your constraint. Optimize for it. Accept the tradeoff.