Managing Load Without Dropping Requests
A system without backpressure is a boiler without a pressure relief valve. It works fine until it doesn't, and then it explodes.
Traffic spikes happen. Marketing campaigns, viral moments, bot attacks. Your LLM serving system needs a plan.
The Failure Modes
No backpressure (accept everything):
- Traffic spike arrives
- Queue grows unboundedly
- Memory exhausted
- System crashes
- All requests fail, including the ones that were in progress
Naive rejection (drop immediately):
- Traffic spike arrives
- Every request beyond capacity rejected
- Users see errors
- No degradation, just failure
Smart backpressure (sketched below):
- Traffic spike arrives
- Queue accepts requests up to limit
- Excess requests get fast "overloaded" response
- Users in queue experience delay but get results
- System stays healthy
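Here is a minimal sketch of that third path, assuming an asyncio-based server; `handle`, the response shapes, and the queue depth are placeholders, not a particular framework's API.

```python
import asyncio

MAX_QUEUE_DEPTH = 100  # illustrative bound; tune to your memory budget
queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_DEPTH)

async def submit(request) -> dict:
    try:
        # put_nowait fails immediately when the queue is full, giving the
        # caller a fast "overloaded" answer instead of an unbounded backlog.
        queue.put_nowait(request)
    except asyncio.QueueFull:
        return {"status": 503, "error": "service_overloaded", "retry_after": 30}
    return {"status": 202, "detail": "queued"}

async def worker():
    while True:
        request = await queue.get()
        try:
            await handle(request)  # assumed model-serving coroutine
        finally:
            queue.task_done()
```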
Token-Aware Admission Control
The trick with LLMs: request cost varies wildly. One request might need 500 tokens of KV cache, another might need 50,000.
```python
from enum import Enum


class AdmissionResult(Enum):
    PROCESS_NOW = "process_now"
    QUEUED = "queued"
    REJECTED = "rejected"


class TokenAwareAdmissionControl:
    def __init__(self, max_tokens_in_flight: int, max_queue_tokens: int):
        self.max_in_flight = max_tokens_in_flight
        self.max_queue = max_queue_tokens
        self.current_in_flight = 0
        self.current_queued = 0
        # Remember each request's reservation so we release exactly what we took.
        self._reserved: dict[int, int] = {}

    def admit(self, request: Request) -> AdmissionResult:
        estimated_tokens = estimate_total_tokens(request)

        # Check if we can process immediately
        if self.current_in_flight + estimated_tokens <= self.max_in_flight:
            self.current_in_flight += estimated_tokens
            self._reserved[id(request)] = estimated_tokens
            return AdmissionResult.PROCESS_NOW

        # Check if we can queue (a dequeue path would later move this
        # reservation from current_queued to current_in_flight)
        if self.current_queued + estimated_tokens <= self.max_queue:
            self.current_queued += estimated_tokens
            self._reserved[id(request)] = estimated_tokens
            return AdmissionResult.QUEUED

        # Reject with backpressure signal
        return AdmissionResult.REJECTED

    def request_complete(self, request: Request, actual_tokens: int):
        # Release the original reservation, not actual_tokens, so the
        # counter never drifts when the estimate was off.
        self.current_in_flight -= self._reserved.pop(id(request), actual_tokens)
```
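`estimate_total_tokens` carries a lot of weight here. One plausible version, assuming the request exposes its prompt and a `max_tokens` cap and that `count_tokens` wraps your tokenizer:

```python
def estimate_total_tokens(request: Request) -> int:
    # Upper bound: prompt tokens are known at admission time; add the most
    # the client is allowed to generate on top.
    prompt_tokens = count_tokens(request.prompt)  # count_tokens: assumed tokenizer wrapper
    return prompt_tokens + request.max_tokens
```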
Count tokens, not requests. A single 100k-context request can consume more resources than 100 small ones.
Graceful Degradation Strategies
When overloaded, you have options beyond rejection:
Reduce quality:
```python
def degraded_request(request: Request) -> Request:
    # system_load() is assumed to return current utilization in [0, 1].
    if system_load() > 0.9:
        return request.with_modifications(
            max_tokens=min(request.max_tokens, 200),  # Shorter responses
            # Route to the smaller model when the big one was requested
            model="llama-8b" if request.model == "llama-70b" else request.model,
        )
    return request
```
Shorten context:
```python
def truncate_context(request: Request, max_input: int) -> Request:
    if len(request.prompt) > max_input:
        # Keep system prompt + recent context
        return request.with_truncated_prompt(max_input)
    return request
```
Serve from cache:
```python
from typing import Optional

async def maybe_cached_response(request: Request) -> Optional[Response]:
    if system_load() > 0.95:
        cached = cache.get(request.cache_key())
        if cached and cached.age_seconds < 300:
            return cached.response
    return None
```
Users get a response, even if not ideal. Better than errors.
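These strategies compose. One way to chain them by load level, as a sketch (the thresholds and the `run_inference` call are illustrative, not prescriptive):

```python
async def handle_with_degradation(request: Request) -> Response:
    load = system_load()

    # Hardest-pressed first: a slightly stale cached answer beats a 503.
    if load > 0.95:
        cached = await maybe_cached_response(request)
        if cached is not None:
            return cached

    # Under heavy load, shrink the work per request before running it.
    if load > 0.9:
        request = truncate_context(degraded_request(request), max_input=4000)

    return await run_inference(request)  # run_inference: assumed normal serving path
```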
Queue Design Matters
Not all queues are equal:
FIFO (First In, First Out): Fair but doesn't optimize for system efficiency. Long requests block short ones.
Priority queue: VIP users or urgent requests first. Requires prioritization logic.
Shortest job first: Process short requests before long ones. Better average latency but can starve long requests.
Token-budget batching: Pack requests into batches up to a fixed token budget, so each batch makes efficient use of the hardware.
```python
import time
from heapq import heappush, heappop
from itertools import count


class SmartQueue:
    def __init__(self):
        self.high_priority = []
        self.normal = []
        self.batch_priority = []
        # Monotonic tiebreaker so the heap never compares Request objects directly.
        self._seq = count()

    def add(self, request: Request):
        if request.priority == "high":
            heappush(self.high_priority, (time.time(), next(self._seq), request))
        elif request.estimated_tokens < 500:
            heappush(self.batch_priority, (request.estimated_tokens, next(self._seq), request))
        else:
            heappush(self.normal, (time.time(), next(self._seq), request))

    def get_next_batch(self, max_tokens: int) -> list[Request]:
        batch = []
        tokens = 0

        # High priority first (the last pop may push the batch slightly past budget)
        while self.high_priority and tokens < max_tokens:
            *_, req = heappop(self.high_priority)
            batch.append(req)
            tokens += req.estimated_tokens

        # Fill with small requests (good for batching)
        while self.batch_priority and tokens < max_tokens:
            *_, req = heappop(self.batch_priority)
            batch.append(req)
            tokens += req.estimated_tokens

        # Finally drain normal requests so long jobs are not starved forever
        while self.normal and tokens < max_tokens:
            *_, req = heappop(self.normal)
            batch.append(req)
            tokens += req.estimated_tokens

        return batch
```
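For context, a hedged sketch of how this queue might feed a serving loop; `run_batch` and the 8,192-token budget are assumptions, not prescriptions:

```python
import asyncio

smart_queue = SmartQueue()

async def scheduler_loop():
    while True:
        batch = smart_queue.get_next_batch(max_tokens=8192)
        if not batch:
            await asyncio.sleep(0.005)  # nothing waiting; avoid a busy spin
            continue
        await run_batch(batch)  # run_batch: assumed call into the inference engine
```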
The Signals to Watch
Early warning signs that backpressure is needed:
```python
alerts = {
    "queue_depth > 100": "Requests accumulating",
    "queue_time_p99 > 5s": "Users waiting too long",
    "memory_util > 85%": "Approaching OOM",
    "rejection_rate > 1%": "Already shedding load",
    "error_rate_spike": "Something breaking",
}
```
React before the system collapses. Automated scaling or load shedding triggered by these metrics is what keeps the system healthy.
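As one illustration (the thresholds mirror the table above; the `metrics` and `admission` objects are assumed, not from a specific library), a shedding check can be wired straight off those signals:

```python
def should_shed_load() -> bool:
    # Shed as soon as any early-warning signal crosses its threshold,
    # rather than waiting for the system to fall over.
    return (
        metrics.queue_depth() > 100
        or metrics.queue_time_p99_seconds() > 5
        or metrics.memory_utilization() > 0.85
    )

def on_new_request(request: Request) -> AdmissionResult:
    if should_shed_load():
        return AdmissionResult.REJECTED  # becomes the 503 response below
    return admission.admit(request)  # the token-aware controller from earlier
```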
The Response to Rejection
When you reject a request, tell the client what to do:
```python
def rejection_response(reason: str) -> Response:
    return Response(
        status=503,
        headers={
            "Retry-After": "30",  # Hint: try again in 30s
            "X-Queue-Position": "overflow",
        },
        body={
            "error": "service_overloaded",
            "message": f"System at capacity ({reason}). Please retry.",
            "retry_after_seconds": 30,
        },
    )
```
HTTP 503 with Retry-After is the standard. Good clients will back off. Bad clients will keep hammering. Rate limiting handles the latter.
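For completeness, a well-behaved client looks roughly like this; a minimal sketch using `requests`, with an illustrative retry policy:

```python
import time
import requests

def post_with_backoff(url: str, payload: dict, max_attempts: int = 5):
    delay = 1.0
    for _ in range(max_attempts):
        resp = requests.post(url, json=payload, timeout=30)
        if resp.status_code != 503:
            return resp
        # Honor the server's hint when present, otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay = min(delay * 2, 60.0)
    return resp  # still overloaded after max_attempts; surface the last 503
```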
A system that fails gracefully under 10x load is more valuable than one that handles 2x load perfectly and crashes at 3x.