
Managing Load Without Dropping Requests

A system without backpressure is like a balloon with no relief valve. It works fine until it doesn't, and then it bursts.

Traffic spikes happen. Marketing campaigns, viral moments, bot attacks. Your LLM serving system needs a plan.

The Failure Modes

No backpressure (accept everything):

  1. Traffic spike arrives
  2. Queue grows without bound
  3. Memory exhausted
  4. System crashes
  5. All requests fail, including the ones that were in progress

Naive rejection (drop immediately):

  1. Traffic spike arrives
  2. Every request beyond capacity rejected
  3. Users see errors
  4. No degradation, just failure

Smart backpressure:

  1. Traffic spike arrives
  2. Queue accepts requests up to limit
  3. Excess requests get a fast "overloaded" response (sketched below)
  4. Users in queue experience delay but get results
  5. System stays healthy
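
The third pattern hinges on a bounded queue with a fast-fail path in front of it. Here is a minimal sketch of that shape, assuming an asyncio-based server; handle() stands in for your actual inference call, and the queue depth is illustrative:

import asyncio

MAX_QUEUE_DEPTH = 200  # illustrative; size it from your memory budget and latency targets

queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_DEPTH)

async def accept(request) -> dict:
    try:
        # put_nowait fails immediately when the queue is full, so clients
        # get a fast "overloaded" answer instead of a slow timeout.
        queue.put_nowait(request)
    except asyncio.QueueFull:
        return {"status": 503, "error": "service_overloaded", "retry_after_seconds": 30}
    return {"status": 202, "message": "queued"}

async def worker():
    while True:
        request = await queue.get()
        try:
            await handle(request)  # assumed inference call, defined elsewhere
        finally:
            queue.task_done()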

Token-Aware Admission Control

The trick with LLMs: request cost varies wildly. One request might need 500 tokens of KV cache, another might need 50,000.

from enum import Enum, auto

class AdmissionResult(Enum):
    PROCESS_NOW = auto()
    QUEUED = auto()
    REJECTED = auto()

class TokenAwareAdmissionControl:
    def __init__(
        self,
        max_tokens_in_flight: int,
        max_queue_tokens: int
    ):
        self.max_in_flight = max_tokens_in_flight
        self.max_queue = max_queue_tokens
        self.current_in_flight = 0
        self.current_queued = 0

    def admit(self, request: Request) -> AdmissionResult:
        # Estimated prompt + output tokens for this request
        estimated_tokens = estimate_total_tokens(request)

        # Check if we can process immediately
        if self.current_in_flight + estimated_tokens <= self.max_in_flight:
            self.current_in_flight += estimated_tokens
            return AdmissionResult.PROCESS_NOW

        # Check if we can queue
        if self.current_queued + estimated_tokens <= self.max_queue:
            self.current_queued += estimated_tokens
            return AdmissionResult.QUEUED

        # Reject with backpressure signal
        return AdmissionResult.REJECTED

    def start_processing_queued(self, request: Request):
        # A queued request starts running: move its reservation
        # from the queue budget to the in-flight budget.
        estimated_tokens = estimate_total_tokens(request)
        self.current_queued -= estimated_tokens
        self.current_in_flight += estimated_tokens

    def request_complete(self, request: Request, actual_tokens: int):
        # Release the same estimate we reserved in admit(); releasing
        # actual_tokens instead would let the counter drift over time.
        # actual_tokens is still worth logging to calibrate the estimator.
        self.current_in_flight -= estimate_total_tokens(request)

Count tokens, not requests. A single 100k-context request can consume more resources than 100 small ones.
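
What estimate_total_tokens looks like is up to you; a workable sketch sums a prompt estimate with the output cap. The chars-divided-by-four heuristic below is a stand-in for a real tokenizer, and the request fields (prompt, max_tokens) are assumed:

def estimate_total_tokens(request) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Swap in your tokenizer's encode() for an exact prompt count.
    prompt_tokens = len(request.prompt) // 4
    # Budget the worst case for output: the request's max_tokens cap.
    return prompt_tokens + request.max_tokens

Overestimating output is deliberate: admission control should reserve for the worst case and release the reservation when the request completes.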

Graceful Degradation Strategies

When overloaded, you have options beyond rejection:

Reduce quality:

def degraded_request(request: Request) -> Request:
    if system_load() > 0.9:
        return request.with_modifications(
            max_tokens=min(request.max_tokens, 200),  # Shorter responses
            model="llama-8b" if request.model == "llama-70b" else request.model,
        )
    return request

Shorten context:

def truncate_context(request: Request, max_input: int) -> Request:
    # Assumes prompt length and max_input are measured in the same unit (tokens)
    if len(request.prompt) > max_input:
        # Keep system prompt + recent context, drop the middle
        return request.with_truncated_prompt(max_input)
    return request

Serve from cache:

async def maybe_cached_response(request: Request) -> Optional[Response]:
    if system_load() > 0.95:
        cached = cache.get(request.cache_key())
        if cached and cached.age_seconds < 300:
            return cached.response
    return None

Users get a response, even if not ideal. Better than errors.
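
These options compose into a single overload path: cheapest first, rejection last. A sketch of that ordering, using the functions above plus the rejection_response helper defined later in the post; admission (a TokenAwareAdmissionControl instance) and run_inference are assumed to exist, and the 4096-token cap is illustrative:

async def handle_under_load(request: Request) -> Response:
    # Cheapest option: a recent cached answer costs no GPU time.
    cached = await maybe_cached_response(request)
    if cached is not None:
        return cached

    # Next, shrink the work: smaller model, shorter output, trimmed context.
    request = degraded_request(request)
    request = truncate_context(request, max_input=4096)

    # Reject only if admission control still says no.
    result = admission.admit(request)
    if result == AdmissionResult.REJECTED:
        return rejection_response("service_overloaded")

    return await run_inference(request)  # enqueue instead if result is QUEUED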

Queue Design Matters

Not all queues are equal:

FIFO (First In, First Out): Fair but doesn't optimize for system efficiency. Long requests block short ones.

Priority queue: VIP users or urgent requests first. Requires prioritization logic.

Shortest job first: Process short requests before long ones. Better average latency but can starve long requests.

Token-budget batching: Group requests so each batch fills, but does not exceed, a fixed token budget. The queue below combines priority, shortest-job-first, and token budgeting:

import time
from heapq import heappush, heappop
from itertools import count

class SmartQueue:
    def __init__(self):
        self.high_priority = []
        self.normal = []
        self.batch_priority = []
        # Tie-breaker so heapq never has to compare Request objects directly
        self._seq = count()

    def add(self, request: Request):
        if request.priority == "high":
            heappush(self.high_priority, (time.time(), next(self._seq), request))
        elif request.estimated_tokens < 500:
            heappush(self.batch_priority, (request.estimated_tokens, next(self._seq), request))
        else:
            heappush(self.normal, (time.time(), next(self._seq), request))

    def get_next_batch(self, max_tokens: int) -> list[Request]:
        batch = []
        tokens = 0

        # High priority first
        while self.high_priority and tokens < max_tokens:
            _, _, req = heappop(self.high_priority)
            batch.append(req)
            tokens += req.estimated_tokens

        # Fill with small requests (good for batching)
        while self.batch_priority and tokens < max_tokens:
            _, _, req = heappop(self.batch_priority)
            batch.append(req)
            tokens += req.estimated_tokens

        # Finally, drain older normal-priority requests so they don't starve
        while self.normal and tokens < max_tokens:
            _, _, req = heappop(self.normal)
            batch.append(req)
            tokens += req.estimated_tokens

        return batch

The Signals to Watch

Early warning signs that you need to apply backpressure:

alerts = {
    "queue_depth > 100": "Requests accumulating",
    "queue_time_p99 > 5s": "Users waiting too long",
    "memory_util > 85%": "Approaching OOM",
    "rejection_rate > 1%": "Already shedding load",
    "error_rate_spike": "Something breaking",
}

React before the system collapses. Automated scaling or load shedding triggered by these metrics keeps you healthy.
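
Acting on those signals can be as simple as a shed probability that ramps up as the metrics worsen, so load shedding phases in gradually instead of flipping from 0% to 100%. A sketch: the lower thresholds come from the alert table above, the upper bounds (queue depth 500, 95% memory) are illustrative, and the metric values are assumed to come from your monitoring layer:

import random

def shed_probability(queue_depth: int, memory_util: float) -> float:
    # 0.0 below the alert thresholds (queue depth 100, 85% memory),
    # ramping linearly to 1.0 at queue depth 500 or 95% memory.
    queue_pressure = (queue_depth - 100) / 400
    memory_pressure = (memory_util - 0.85) / 0.10
    return max(0.0, min(1.0, max(queue_pressure, memory_pressure)))

def should_shed(queue_depth: int, memory_util: float) -> bool:
    return random.random() < shed_probability(queue_depth, memory_util)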

The Response to Rejection

When you reject a request, tell the client what to do:

def rejection_response(reason: str) -> Response:
    return Response(
        status=503,
        headers={
            "Retry-After": "30",  # Hint: try again in 30s
            "X-Queue-Position": "overflow",
        },
        body={
            "error": "service_overloaded",
            "message": "System at capacity. Please retry.",
            "retry_after_seconds": 30,
        }
    )

HTTP 503 with Retry-After is the standard. Good clients will back off. Bad clients will keep hammering. Rate limiting handles the latter.
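
A well-behaved client honors that header. A minimal retry sketch using the requests library, with exponential backoff and jitter as the fallback when no Retry-After is present; the URL and payload are placeholders:

import random
import time

import requests

def call_with_backoff(url: str, payload: dict, max_attempts: int = 5) -> dict:
    delay = 1.0
    for _ in range(max_attempts):
        resp = requests.post(url, json=payload, timeout=60)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        # Prefer the server's hint; otherwise back off exponentially with jitter.
        wait = float(resp.headers.get("Retry-After", delay))
        time.sleep(wait + random.uniform(0, 1))
        delay = min(delay * 2, 60)
    raise RuntimeError("service still overloaded after retries")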

A system that fails gracefully under 10x load is more valuable than one that handles 2x load perfectly and crashes at 3x.