What Changes When 100 Users Hit Your LLM
You tested with one user. Latency was great. Then you launched.
100 concurrent users later, that 200ms latency became 3 seconds. The model didn't get slower. The infrastructure around it buckled.
The Memory Wall
Think of GPU memory like a popular restaurant with limited seating. One diner (request) gets immediate service. 100 diners at once? Most are waiting outside.
Each concurrent request needs KV cache memory:
# KV cache memory per request (rough formula)
def kv_cache_memory(
    batch_size: int,
    seq_length: int,
    num_layers: int,
    num_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
) -> int:
    # 2 for the K and V tensors
    return 2 * batch_size * seq_length * num_layers * num_heads * head_dim * dtype_bytes
# Example: Llama-70B with a 2048-token context
memory_per_request = kv_cache_memory(
    batch_size=1,
    seq_length=2048,
    num_layers=80,
    num_heads=64,
    head_dim=128,
)
# ≈ 5.4 GB per request at the full 2048-token context
# 100 concurrent users ≈ 540 GB of KV cache
# An H100 has 80 GB, before you even count model weights
# You can fit ~15 full-context requests, max (roughly 30 if contexts average half that)
The rest? Queued.
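To turn the per-request figure into a capacity estimate, subtract the weight footprint and some overhead from total GPU memory, then divide what's left by the per-request KV cache. A rough sketch; the 35 GB weight figure (roughly an INT4-quantized 70B model) and the 10% overhead are assumptions, not measurements:

def max_concurrent_requests(
    gpu_memory_gb: float,
    model_weights_gb: float,
    kv_per_request_gb: float,
    overhead_fraction: float = 0.10,  # assumed headroom for activations and fragmentation
) -> int:
    # Whatever is left after weights and overhead is the KV cache budget
    kv_budget = gpu_memory_gb * (1 - overhead_fraction) - model_weights_gb
    return max(0, int(kv_budget // kv_per_request_gb))

# Hypothetical: 80 GB H100, ~35 GB of INT4 weights, 5.4 GB of KV cache per request
print(max_concurrent_requests(80, 35, 5.4))  # -> 6 full-context requests under these assumptions

Shorter average contexts and a smaller or quantized KV cache push that number up; the point is that the memory budget, not the model's speed, sets the ceiling.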
Queuing Theory Strikes Back
When arrival rate approaches service capacity, latency explodes. This isn't LLM-specific. It's the same math that governs supermarket checkout lines and highway traffic.
Utilization 50%: Average wait = 1x service time
Utilization 80%: Average wait = 4x service time
Utilization 95%: Average wait = 19x service time
Utilization 99%: Average wait = 99x service time
At 95% GPU utilization, your 200ms inference takes 4 seconds on average because of queuing. The GPU is fast. The queue is slow.
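These multipliers are the textbook M/M/1 result: expected wait is utilization / (1 - utilization) times the service time. A minimal sketch of that relationship; a real serving stack is not a single-server Poisson queue, so treat it as intuition rather than a capacity model:

def expected_wait(service_time_s: float, utilization: float) -> float:
    # M/M/1 mean time in queue: rho / (1 - rho) service times
    assert 0 <= utilization < 1, "at 100% utilization the queue grows without bound"
    return utilization / (1 - utilization) * service_time_s

for rho in (0.5, 0.8, 0.95, 0.99):
    total = 0.2 + expected_wait(0.2, rho)  # 200ms service time plus queueing delay
    print(f"{rho:.0%} utilization -> total latency ≈ {total:.1f}s")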
The Batching Paradox
Batching multiple requests together increases throughput. But it also increases latency for individual requests.
# Without batching: 100 requests, processed sequentially
# Each takes 200ms
# Total time: 20 seconds
# Average latency: 10 seconds (wait + process)
# With batching (batch size 10): 100 requests, 10 batches
# Each batch takes ~400ms (not 10x longer, because the GPU processes the batch in parallel)
# Total time: 4 seconds
# Average latency: 2 seconds
# Throughput went up 5x
# Average latency went down 5x
# But individual request latency went from 200ms to 400ms
Batching trades single-request latency for system throughput. At high concurrency, this tradeoff is worth it.
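The arithmetic above as a quick back-of-the-envelope calculation; the 400ms per-batch figure is an assumption, and real batch latency depends on the model, context lengths, and GPU:

def sequential_stats(n_requests: int, per_request_s: float):
    # Requests processed one at a time; each waits for everything ahead of it
    completion_times = [(i + 1) * per_request_s for i in range(n_requests)]
    return max(completion_times), sum(completion_times) / n_requests

def batched_stats(n_requests: int, batch_size: int, per_batch_s: float):
    # Requests processed batch_size at a time; a whole batch finishes together
    n_batches = n_requests // batch_size
    completion_times = [(b + 1) * per_batch_s for b in range(n_batches) for _ in range(batch_size)]
    return max(completion_times), sum(completion_times) / n_requests

print(sequential_stats(100, 0.2))   # ≈ (20.0 s total, 10.1 s average latency)
print(batched_stats(100, 10, 0.4))  # ≈ (4.0 s total, 2.2 s average latency)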
The Long Request Problem
Imagine a restaurant where some diners order a 20-course tasting menu while others just want coffee. The coffee drinkers wait hours.
Same thing happens with LLMs. A request generating 2,000 tokens monopolizes resources while short requests pile up.
# Request distribution (output lengths in tokens)
short_requests = [50] * 90    # 90% of traffic
long_requests = [2000] * 10   # 10% of traffic
# Without preemption:
# Short requests wait behind long ones
# P99 latency dominated by long request time
# With iteration-level preemption (continuous batching):
# Short requests slip in between iterations of long requests
# Everyone makes progress
This is why continuous batching matters so much at scale.
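A toy illustration of the idea, assuming a scheduler that re-forms the batch on every decode iteration. This is a sketch of the mechanism behind continuous batching, not how any particular serving engine implements it:

from collections import deque

def continuous_batching_step(running: list, waiting: deque, max_batch: int) -> list:
    # Retire requests that have produced all of their tokens
    running = [r for r in running if r["generated"] < r["target_tokens"]]
    # Admit waiting requests into freed slots, so short requests no longer
    # wait for a 2,000-token request to finish end to end
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())
    # One decode iteration: every running request emits one token
    for r in running:
        r["generated"] += 1
    return running

# 90 short and 10 long requests, mirroring the distribution above
waiting = deque([{"target_tokens": 50, "generated": 0} for _ in range(90)] +
                [{"target_tokens": 2000, "generated": 0} for _ in range(10)])
running = []
while waiting or running:
    running = continuous_batching_step(running, waiting, max_batch=32)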
Practical Limits
For a single H100 serving Llama-70B:
| Concurrent Users | Expected P50 Latency | Notes |
|---|---|---|
| 1-5 | 200-300ms | Comfortable |
| 10-20 | 400-800ms | Batching helps |
| 30-50 | 1-2s | Near memory limit |
| 50-100 | 2-5s | Queuing dominates |
| 100+ | Degraded | Need more GPUs |
These numbers shift with model size, quantization, and context length. But the shape of the curve is universal.
Scaling Strategies
Vertical (bigger GPU): Limited. H100 is roughly the ceiling for now.
Horizontal (more GPUs): Works, but load balancing matters. Token-aware routing beats round-robin (a sketch follows the routing example below).
Model optimization: Quantization, shorter contexts, smaller models for simpler tasks.
Request routing: Route complex queries to capable models, simple queries to fast models.
# Simple complexity-based router
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    estimated_complexity: float  # 0-1 score, assumed to be computed upstream

def route_request(request: Request) -> str:
    if request.estimated_complexity < 0.3:
        return "llama-8b"        # Fast, cheap
    elif request.estimated_complexity < 0.7:
        return "llama-70b-int8"  # Balanced
    else:
        return "llama-70b-fp16"  # Maximum quality
The Monitoring You Need
At high concurrency, these metrics matter:
metrics = {
    "queue_depth": "How many requests are waiting?",
    "queue_time_p99": "How long do the unlucky ones wait?",
    "batch_size_avg": "Are we batching efficiently?",
    "memory_utilization": "How close to OOM?",
    "requests_rejected": "Are we shedding load?",
}
If queue depth is climbing, latency is about to spike. If memory utilization is above 90%, you're one long request away from trouble.
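A sketch of what "watched closely" can look like in practice: a periodic check that fires before the queue and memory cross into the danger zone. The thresholds and metric names here are illustrative assumptions, not a standard:

def should_alert(queue_depth: int, prev_queue_depth: int,
                 memory_utilization: float, queue_time_p99_s: float) -> list[str]:
    alerts = []
    if queue_depth > prev_queue_depth and queue_depth > 10:
        alerts.append("queue depth climbing: latency spike incoming")
    if memory_utilization > 0.90:
        alerts.append("KV cache memory above 90%: one long request from OOM")
    if queue_time_p99_s > 2.0:
        alerts.append("p99 queue time above 2s: consider shedding load")
    return alerts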
The difference between a system that handles 100 users and one that collapses? Usually just these metrics, watched closely.