Managing Load Without Dropping Requests
A system without backpressure is a boiler without a pressure relief valve. It works fine until it doesn't, and then it explodes.
Traffic spikes happen. Marketing campaigns, viral moments, bot attacks. Your LLM serving system needs a plan.
The Failure Modes
No backpressure (accept everything):
- Traffic spike arrives
- Queue grows unboundedly
- Memory exhausted
- System crashes
- All requests fail, including the ones that were in progress
Naive rejection (drop immediately):
- Traffic spike arrives
- Every request beyond capacity rejected
- Users see errors
- No degradation, just failure
Smart backpressure (sketched below):
- Traffic spike arrives
- Queue accepts requests up to limit
- Excess requests get fast "overloaded" response
- Users in queue experience delay but get results
- System stays healthy
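Here is a minimal sketch of that third path, assuming an asyncio-based server; `handle`, the response shapes, and the queue depth are placeholders, not a particular framework's API.

```python
import asyncio

MAX_QUEUE_DEPTH = 100  # illustrative bound; tune to your memory budget
queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_DEPTH)

async def submit(request) -> dict:
    try:
        # put_nowait fails immediately when the queue is full, giving the
        # caller a fast "overloaded" answer instead of an unbounded backlog.
        queue.put_nowait(request)
    except asyncio.QueueFull:
        return {"status": 503, "error": "service_overloaded", "retry_after": 30}
    return {"status": 202, "detail": "queued"}

async def worker():
    while True:
        request = await queue.get()
        try:
            await handle(request)  # assumed model-serving coroutine
        finally:
            queue.task_done()
```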
Token-Aware Admission Control
The trick with LLMs: request cost varies wildly. One request might need 500 tokens of KV cache, another might need 50,000.
```python
from enum import Enum


class AdmissionResult(Enum):
    PROCESS_NOW = "process_now"
    QUEUED = "queued"
    REJECTED = "rejected"


class TokenAwareAdmissionControl:
    def __init__(self, max_tokens_in_flight: int, max_queue_tokens: int):
        self.max_in_flight = max_tokens_in_flight
        self.max_queue = max_queue_tokens
        self.current_in_flight = 0
        self.current_queued = 0
        # Remember each request's reservation so we release exactly what we took.
        self._reserved: dict[int, int] = {}

    def admit(self, request: Request) -> AdmissionResult:
        estimated_tokens = estimate_total_tokens(request)

        # Check if we can process immediately
        if self.current_in_flight + estimated_tokens <= self.max_in_flight:
            self.current_in_flight += estimated_tokens
            self._reserved[id(request)] = estimated_tokens
            return AdmissionResult.PROCESS_NOW

        # Check if we can queue (a dequeue path would later move this
        # reservation from current_queued to current_in_flight)
        if self.current_queued + estimated_tokens <= self.max_queue:
            self.current_queued += estimated_tokens
            self._reserved[id(request)] = estimated_tokens
            return AdmissionResult.QUEUED

        # Reject with backpressure signal
        return AdmissionResult.REJECTED

    def request_complete(self, request: Request, actual_tokens: int):
        # Release the original reservation, not actual_tokens, so the
        # counter never drifts when the estimate was off.
        self.current_in_flight -= self._reserved.pop(id(request), actual_tokens)
```
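`estimate_total_tokens` carries a lot of weight here. One plausible version, assuming the request exposes its prompt and a `max_tokens` cap and that `count_tokens` wraps your tokenizer:

```python
def estimate_total_tokens(request: Request) -> int:
    # Upper bound: prompt tokens are known at admission time; add the most
    # the client is allowed to generate on top.
    prompt_tokens = count_tokens(request.prompt)  # count_tokens: assumed tokenizer wrapper
    return prompt_tokens + request.max_tokens
```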
Count tokens, not requests. A single 100k-context request can consume more resources than 100 small ones.
Graceful Degradation Strategies
When overloaded, you have options beyond rejection:
Reduce quality:
```python
def degraded_request(request: Request) -> Request:
    # system_load() is assumed to return current utilization in [0, 1].
    if system_load() > 0.9:
        return request.with_modifications(
            max_tokens=min(request.max_tokens, 200),  # Shorter responses
            # Route to the smaller model when the big one was requested
            model="llama-8b" if request.model == "llama-70b" else request.model,
        )
    return request
```
Shorten context:
```python
def truncate_context(request: Request, max_input: int) -> Request:
    if len(request.prompt) > max_input:
        # Keep system prompt + recent context
        return request.with_truncated_prompt(max_input)
    return request
```
Serve from cache:
```python
from typing import Optional

async def maybe_cached_response(request: Request) -> Optional[Response]:
    if system_load() > 0.95:
        cached = cache.get(request.cache_key())
        if cached and cached.age_seconds < 300:
            return cached.response
    return None
```
Users get a response, even if not ideal. Better than errors.
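These strategies compose. One way to chain them by load level, as a sketch (the thresholds and the `run_inference` call are illustrative, not prescriptive):

```python
async def handle_with_degradation(request: Request) -> Response:
    load = system_load()

    # Hardest-pressed first: a slightly stale cached answer beats a 503.
    if load > 0.95:
        cached = await maybe_cached_response(request)
        if cached is not None:
            return cached

    # Under heavy load, shrink the work per request before running it.
    if load > 0.9:
        request = truncate_context(degraded_request(request), max_input=4000)

    return await run_inference(request)  # run_inference: assumed normal serving path
```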
Queue Design Matters
Not all queues are equal:
FIFO (First In, First Out): Fair but doesn't optimize for system efficiency. Long requests block short ones.
Priority queue: VIP users or urgent requests first. Requires prioritization logic.
Shortest job first: Process short requests before long ones. Better average latency but can starve long requests.
Token-budget batching: Pack requests into batches up to a fixed token budget, so each batch makes efficient use of the hardware.
```python
import time
from heapq import heappush, heappop
from itertools import count


class SmartQueue:
    def __init__(self):
        self.high_priority = []
        self.normal = []
        self.batch_priority = []
        # Monotonic tiebreaker so the heap never compares Request objects directly.
        self._seq = count()

    def add(self, request: Request):
        if request.priority == "high":
            heappush(self.high_priority, (time.time(), next(self._seq), request))
        elif request.estimated_tokens < 500:
            heappush(self.batch_priority, (request.estimated_tokens, next(self._seq), request))
        else:
            heappush(self.normal, (time.time(), next(self._seq), request))

    def get_next_batch(self, max_tokens: int) -> list[Request]:
        batch = []
        tokens = 0

        # High priority first (the last pop may push the batch slightly past budget)
        while self.high_priority and tokens < max_tokens:
            *_, req = heappop(self.high_priority)
            batch.append(req)
            tokens += req.estimated_tokens

        # Fill with small requests (good for batching)
        while self.batch_priority and tokens < max_tokens:
            *_, req = heappop(self.batch_priority)
            batch.append(req)
            tokens += req.estimated_tokens

        # Finally drain normal requests so long jobs are not starved forever
        while self.normal and tokens < max_tokens:
            *_, req = heappop(self.normal)
            batch.append(req)
            tokens += req.estimated_tokens

        return batch
```
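For context, a hedged sketch of how this queue might feed a serving loop; `run_batch` and the 8,192-token budget are assumptions, not prescriptions:

```python
import asyncio

smart_queue = SmartQueue()

async def scheduler_loop():
    while True:
        batch = smart_queue.get_next_batch(max_tokens=8192)
        if not batch:
            await asyncio.sleep(0.005)  # nothing waiting; avoid a busy spin
            continue
        await run_batch(batch)  # run_batch: assumed call into the inference engine
```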
The Signals to Watch
Early warning signs that backpressure is needed:
```python
alerts = {
    "queue_depth > 100": "Requests accumulating",
    "queue_time_p99 > 5s": "Users waiting too long",
    "memory_util > 85%": "Approaching OOM",
    "rejection_rate > 1%": "Already shedding load",
    "error_rate_spike": "Something breaking",
}
```
React before the system collapses. Automated scaling or load shedding triggered by these metrics is what keeps the system healthy.
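As one illustration (the thresholds mirror the table above; the `metrics` and `admission` objects are assumed, not from a specific library), a shedding check can be wired straight off those signals:

```python
def should_shed_load() -> bool:
    # Shed as soon as any early-warning signal crosses its threshold,
    # rather than waiting for the system to fall over.
    return (
        metrics.queue_depth() > 100
        or metrics.queue_time_p99_seconds() > 5
        or metrics.memory_utilization() > 0.85
    )

def on_new_request(request: Request) -> AdmissionResult:
    if should_shed_load():
        return AdmissionResult.REJECTED  # becomes the 503 response below
    return admission.admit(request)  # the token-aware controller from earlier
```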
The Response to Rejection
When you reject a request, tell the client what to do:
```python
def rejection_response(reason: str) -> Response:
    return Response(
        status=503,
        headers={
            "Retry-After": "30",  # Hint: try again in 30s
            "X-Queue-Position": "overflow",
        },
        body={
            "error": "service_overloaded",
            "message": f"System at capacity ({reason}). Please retry.",
            "retry_after_seconds": 30,
        },
    )
```
HTTP 503 with Retry-After is the standard. Good clients will back off. Bad clients will keep hammering. Rate limiting handles the latter.
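For completeness, a well-behaved client looks roughly like this; a minimal sketch using `requests`, with an illustrative retry policy:

```python
import time
import requests

def post_with_backoff(url: str, payload: dict, max_attempts: int = 5):
    delay = 1.0
    for _ in range(max_attempts):
        resp = requests.post(url, json=payload, timeout=30)
        if resp.status_code != 503:
            return resp
        # Honor the server's hint when present, otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay = min(delay * 2, 60.0)
    return resp  # still overloaded after max_attempts; surface the last 503
```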
A system that fails gracefully under 10x load is more valuable than one that handles 2x load perfectly and crashes at 3x.