Why Token Count Matters More Than Request Count
Shipping companies don't charge by the number of packages. They charge by weight and volume. The container has limits. A single piano takes more space than 100 letters.
LLM serving works the same way. Request count is meaningless. Token count is what matters.
The Memory Math
GPU memory for a batch depends on total tokens, not request count:
def batch_memory_requirement(
    requests: list[Request],
    num_layers: int,
    num_heads: int,   # for GQA models, use the KV-head count here
    head_dim: int,
) -> int:
    """KV-cache bytes needed for a batch: driven by total tokens, not request count."""
    total_tokens = sum(r.input_tokens + r.max_output_tokens for r in requests)
    kv_cache_per_token = (
        2             # K and V
        * num_layers
        * num_heads
        * head_dim
        * 2           # FP16 bytes per element
    )
    return total_tokens * kv_cache_per_token
# Example: Llama-70B
# kv_cache_per_token ≈ 1.3 MB
# Batch A: 10 requests × 500 tokens each = 5,000 tokens = 6.5 GB
# Batch B: 1 request × 50,000 tokens = 50,000 tokens = 65 GB
# Same "10 requests" vs "1 request", 10x memory difference
Request Count Batching Breaks Down
Traditional batching:
# Naive: batch by count
def make_batches(requests: list[Request], batch_size: int = 8):
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]
What happens when one request in your batch of 8 has a 100k-token context:
Batch: [500 tok, 600 tok, 450 tok, 100,000 tok, 300 tok, 550 tok, 400 tok, 480 tok]
Memory needed: 103,280 tokens × 1.3 MB = 134 GB
H100 memory: 80 GB
Result: OOM crash
Seven innocent requests killed by one monster.
Token Budget Batching
def make_token_batches(
    requests: list[Request],
    token_budget: int = 50_000,
) -> list[list[Request]]:
    batches = []
    current_batch = []
    current_tokens = 0

    for request in requests:
        request_tokens = request.input_tokens + request.max_output_tokens

        if request_tokens > token_budget:
            # Request too large for any batch, process alone
            batches.append([request])
            continue

        if current_tokens + request_tokens > token_budget:
            # Would exceed budget, start new batch
            batches.append(current_batch)
            current_batch = [request]
            current_tokens = request_tokens
        else:
            current_batch.append(request)
            current_tokens += request_tokens

    if current_batch:
        batches.append(current_batch)

    return batches
Now batches respect memory limits regardless of request count.
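Running the pathological batch from above through this function shows the split (a quick illustrative check, reusing the hypothetical Request record and treating each size as total context for simplicity):

sizes = [500, 600, 450, 100_000, 300, 550, 400, 480]
requests = [Request(input_tokens=s, max_output_tokens=0) for s in sizes]

for batch in make_token_batches(requests, token_budget=50_000):
    print([r.total_tokens for r in batch])
# [100000]                              <- the monster runs alone
# [500, 600, 450, 300, 550, 400, 480]   <- the seven small requests pack together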
Scheduling Implications
Token-aware scheduling changes queue priorities:
class TokenAwareScheduler:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.current_tokens = 0
        self.active_requests = []
        self.pending_queue = []

    def can_admit(self, request: Request) -> bool:
        return self.current_tokens + request.total_tokens <= self.max_tokens

    def admit(self, request: Request) -> bool:
        if self.can_admit(request):
            self.active_requests.append(request)
            self.current_tokens += request.total_tokens
            return True
        else:
            self.pending_queue.append(request)
            return False

    def complete(self, request: Request):
        self.active_requests.remove(request)
        self.current_tokens -= request.total_tokens
        # Try to admit pending requests
        self._admit_from_queue()

    def _admit_from_queue(self):
        # Admit smaller requests first (fit more in available space)
        self.pending_queue.sort(key=lambda r: r.total_tokens)
        still_pending = []
        for request in self.pending_queue:
            if self.can_admit(request):
                self.admit(request)
            else:
                still_pending.append(request)
        self.pending_queue = still_pending
Smaller requests fill gaps left by completing requests. Efficient packing.
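A small walk-through, assuming the hypothetical Request record from earlier and a 10,000-token limit:

scheduler = TokenAwareScheduler(max_tokens=10_000)

big = Request(input_tokens=7_000, max_output_tokens=1_000)    # 8,000 tokens
small_a = Request(input_tokens=1_500, max_output_tokens=500)  # 2,000 tokens
small_b = Request(input_tokens=2_500, max_output_tokens=500)  # 3,000 tokens

scheduler.admit(big)      # True: 8,000 / 10,000 tokens in flight
scheduler.admit(small_a)  # True: 10,000 / 10,000
scheduler.admit(small_b)  # False: would exceed the limit, so it waits in the queue

scheduler.complete(small_a)  # frees 2,000 tokens; small_b (3,000) still doesn't fit
scheduler.complete(big)      # frees 8,000 tokens; small_b is admitted from the queue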
Estimation Challenges
You need to know token counts before processing. Input tokens are known. Output tokens aren't.
def estimate_output_tokens(request: Request) -> int:
    # Option 1: Use the max_tokens parameter
    if request.max_tokens:
        return request.max_tokens

    # Option 2: Estimate based on task type
    estimates = {
        "classification": 10,
        "extraction": 100,
        "summarization": 200,
        "generation": 500,
        "chat": 300,
    }
    return estimates.get(request.task_type, 300)

    # Option 3: Learn from historical data
    # return model.predict_output_length(request.input)
Over-estimate slightly. Running out of memory mid-request is worse than leaving some capacity unused.
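One way to build in that margin is a safety factor on top of the point estimate. The 1.2× factor below is an arbitrary illustration; tune it against your own output-length distribution.

def budgeted_output_tokens(request: Request, safety_factor: float = 1.2) -> int:
    # Reserve a bit more than the estimate so an unexpectedly long completion
    # doesn't push the batch past its memory budget.
    padded = int(estimate_output_tokens(request) * safety_factor)
    if request.max_tokens:
        # Never reserve more than the hard cap the client asked for.
        return min(padded, request.max_tokens)
    return padded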
The Concurrency Limit That Actually Works
"Max 100 concurrent requests" is a poor limit. "Max 500,000 tokens in flight" is better.
import asyncio

class TokenConcurrencyLimit:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.current_tokens = 0
        self.lock = asyncio.Lock()

    async def acquire(self, tokens: int):
        # Poll until the requested tokens fit under the in-flight limit
        while True:
            async with self.lock:
                if self.current_tokens + tokens <= self.max_tokens:
                    self.current_tokens += tokens
                    return
            await asyncio.sleep(0.01)

    async def release(self, tokens: int):
        async with self.lock:
            self.current_tokens -= tokens
This naturally allows more small requests or fewer large ones, matching actual resource consumption.
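In an async handler, the limit wraps each request: reserve the estimated tokens up front, return them when the request finishes. handle_request below is a placeholder for the actual model call, not part of the original.

limit = TokenConcurrencyLimit(max_tokens=500_000)

async def serve(request: Request) -> str:
    tokens = request.total_tokens  # or a padded estimate, as above
    await limit.acquire(tokens)
    try:
        return await handle_request(request)  # placeholder: forward to the model server
    finally:
        await limit.release(tokens)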
Rate Limiting by Tokens
Same principle applies to rate limits:
# Bad: 100 requests per minute
# Allows: 100 × 100k tokens = 10M tokens
# Good: 1,000,000 tokens per minute
# Allows: Variable requests depending on size
# Fair to small and large requests alike
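Enforcing a token-per-minute limit can reuse the classic token-bucket pattern, just denominated in LLM tokens rather than requests. A minimal, single-threaded sketch:

import time

class TokenRateLimiter:
    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.rate = tokens_per_minute / 60.0       # budget regained per second
        self.available = float(tokens_per_minute)
        self.last_refill = time.monotonic()

    def try_consume(self, tokens: int) -> bool:
        # Refill continuously based on elapsed time, capped at one minute's budget
        now = time.monotonic()
        self.available = min(self.capacity, self.available + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if tokens <= self.available:
            self.available -= tokens
            return True
        return False  # caller rejects the request or retries later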
OpenAI and Anthropic both rate limit and price by tokens for this reason. Your internal systems should think the same way.
One number to remember: tokens, not requests.