Balancing Fast Responses and Fair Queuing
At a coffee shop, someone ordering a simple espresso shouldn't wait 20 minutes because the person ahead ordered a seven-layer custom frappuccino.
LLM serving has the same problem. A user sending "Summarize this 50,000-word document" shouldn't monopolize resources while users asking "What's 2+2?" wait.
The Starvation Problem
Without fairness controls, long requests block short ones:
Time 0: Large request arrives (2000 tokens output)
Time 0: Small request 1 arrives (50 tokens)
Time 0: Small request 2 arrives (50 tokens)
...
Time 0: Small request 100 arrives (50 tokens)
Without preemption:
Time 40s: Large request completes
Time 40s: Small requests start processing
Time 41s: All small requests complete
Result: Small requests waited 40 seconds for roughly one second of work.
The large request wasn't wrong to exist. The system was wrong to handle it this way.
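The arithmetic behind that timeline is worth spelling out. A minimal sketch, assuming a decode rate of 50 tokens/s (an illustrative number picked to match the 40-second figure above):

DECODE_RATE = 50  # tokens/s, an illustrative assumption

large_tokens = 2000
small_tokens = 50

large_time = large_tokens / DECODE_RATE   # 40 s to finish the large request
small_time = small_tokens / DECODE_RATE   # ~1 s for the small batch once admitted

# Under first-come-first-served, a small request's latency is almost all queueing:
wait_fraction = large_time / (large_time + small_time)
print(f"{wait_fraction:.0%} of the small request's latency is queueing")  # 98%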
Preemption and Time-Slicing
One solution: don't let any single request monopolize the GPU for too long.
import time

class TimeSlicedScheduler:
    def __init__(self, time_slice_ms: int = 100):
        self.time_slice_ms = time_slice_ms
        self.active_requests = []
        self.preempted_queue = []

    async def run_iteration(self):
        start = time.time()
        while time.time() - start < self.time_slice_ms / 1000:
            if not self.active_requests:
                break
            # Run one decode step for the active batch
            await self.decode_step(self.active_requests)
            # Check for newly completed requests
            self._handle_completions()
        # After the time slice expires, admit waiting requests
        self._admit_pending()

    def _admit_pending(self):
        # Give waiting requests a chance
        if self.preempted_queue:
            # Rotate: move some active to preempted, some preempted to active
            pass
This mirrors an operating system's process scheduler: no single process (here, a single request) runs forever.
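The _admit_pending stub leaves the rotation policy open. One possible policy, sketched below as an assumption rather than a prescription, preempts the longest-running active requests to make room for waiting ones; the running_steps attribute and the max_active limit are hypothetical and not part of the class above.

# A possible drop-in body for TimeSlicedScheduler._admit_pending (hypothetical):
def _admit_pending(self, max_active: int = 32):
    # Assumes each request tracks how many decode steps it has received
    # so far (request.running_steps) -- an illustrative attribute.
    if not self.preempted_queue:
        return
    # Sort so the longest-running requests sit at the tail of the list.
    self.active_requests.sort(key=lambda r: r.running_steps)
    # Preempt longest-running requests to free slots for waiting ones.
    while len(self.active_requests) >= max_active and self.preempted_queue:
        self.preempted_queue.append(self.active_requests.pop())
    # Admit waiting requests (FIFO) up to the batch limit.
    while self.preempted_queue and len(self.active_requests) < max_active:
        self.active_requests.append(self.preempted_queue.pop(0))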
Multi-Tenant Fairness
In a shared system, fairness extends to users/tenants:
class PerTenantFairScheduler:
    def __init__(self, tenants: list[str]):
        self.tenant_queues = {t: [] for t in tenants}
        self.tenant_usage = {t: 0 for t in tenants}  # Tokens processed per tenant
        self.round_robin_index = 0

    def add_request(self, tenant: str, request: Request):
        self.tenant_queues[tenant].append(request)

    def get_next_batch(self) -> list[Request]:
        batch = []
        tokens = 0
        max_tokens = 50000
        # Round-robin across tenants: a different tenant goes first each batch
        tenants = list(self.tenant_queues.keys())
        for i in range(len(tenants)):
            tenant = tenants[(self.round_robin_index + i) % len(tenants)]
            while self.tenant_queues[tenant] and tokens < max_tokens:
                request = self.tenant_queues[tenant][0]
                if tokens + request.total_tokens <= max_tokens:
                    batch.append(self.tenant_queues[tenant].pop(0))
                    tokens += request.total_tokens
                    self.tenant_usage[tenant] += request.total_tokens
                else:
                    break
        # Start the next batch with a different tenant
        self.round_robin_index = (self.round_robin_index + 1) % len(tenants)
        return batch
Tenant A sending 1000 requests shouldn't starve Tenant B's 10 requests.
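A quick sketch of how this plays out. The Request dataclass here is a minimal stand-in carrying only the total_tokens field the scheduler reads, not a real serving type:

from dataclasses import dataclass

@dataclass
class Request:
    total_tokens: int

scheduler = PerTenantFairScheduler(tenants=["tenant_a", "tenant_b"])

# Tenant A floods the queue; Tenant B sends a handful of requests.
for _ in range(1000):
    scheduler.add_request("tenant_a", Request(total_tokens=500))
for _ in range(10):
    scheduler.add_request("tenant_b", Request(total_tokens=500))

first = scheduler.get_next_batch()   # filled by Tenant A (it happens to go first)
second = scheduler.get_next_batch()  # starts with Tenant B, so its 10 requests ship now

Tenant B waits one batch, not one thousand requests.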
Weighted Fair Queuing
Not all tenants are equal. Enterprise pays more than free tier:
class WeightedFairScheduler:
    def __init__(self, tenant_weights: dict[str, float]):
        # Higher weight = more resources
        self.weights = tenant_weights  # e.g., {"enterprise": 10, "pro": 3, "free": 1}
        self.tenant_queues = {t: [] for t in tenant_weights}
        self.virtual_time = {t: 0.0 for t in tenant_weights}

    def add_request(self, tenant: str, request: Request):
        self.tenant_queues[tenant].append(request)

    def get_next_request(self) -> tuple[str, Request] | None:
        # Tenant with the lowest virtual time goes next.
        # Virtual time advances more slowly for higher-weight tenants.
        eligible = [
            (t, self.virtual_time[t])
            for t in self.tenant_queues
            if self.tenant_queues[t]
        ]
        if not eligible:
            return None
        tenant, _ = min(eligible, key=lambda x: x[1])
        request = self.tenant_queues[tenant].pop(0)
        # Update virtual time (inversely proportional to weight)
        self.virtual_time[tenant] += request.total_tokens / self.weights[tenant]
        return tenant, request
The enterprise tenant processes 10x more tokens before its virtual time catches up to the free tier's.
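A small sketch of the virtual-time arithmetic, reusing the hypothetical Request from the previous example:

scheduler = WeightedFairScheduler({"enterprise": 10, "free": 1})

for _ in range(20):
    scheduler.add_request("enterprise", Request(total_tokens=1000))
    scheduler.add_request("free", Request(total_tokens=1000))

served = [scheduler.get_next_request()[0] for _ in range(22)]

# Each 1000-token request advances enterprise's virtual time by 1000/10 = 100
# but free's by 1000/1 = 1000, so "lowest virtual time wins" picks enterprise
# about ten times for every pick of free.
print(served.count("enterprise"), served.count("free"))  # ~10:1 split (20 vs 2)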
The Latency SLA Approach
Another framing: don't think about fairness, think about SLAs.
import time

class SLADrivenScheduler:
    def __init__(self):
        self.sla_targets = {
            "premium": {"p99_latency_ms": 500},
            "standard": {"p99_latency_ms": 2000},
            "batch": {"p99_latency_ms": None},  # Best effort
        }

    def get_next_batch(self):
        now = time.time()

        # Priority: requests closest to missing their SLA
        def urgency(request):
            target = self.sla_targets[request.tier]["p99_latency_ms"]
            if target is None:
                return float('inf')  # Lowest priority
            time_waiting_ms = (now - request.arrival_time) * 1000
            time_remaining = target - time_waiting_ms
            return time_remaining  # Lower = more urgent

        # all_pending() and _make_batch() are assumed to exist elsewhere
        pending = sorted(self.all_pending(), key=urgency)
        return self._make_batch(pending)
Requests that are about to miss their SLA get priority. Requests with no SLA fill gaps.
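The ordering is easy to verify with a worked example. The numbers below are made up; the point is that sorting by remaining slack puts the closest-to-breach request first and best-effort requests last:

# Remaining slack = SLA target - time already waited (both in ms).
premium_slack = 500 - 400     # premium waited 400 ms of a 500 ms target -> 100 ms left
standard_slack = 2000 - 500   # standard waited 500 ms of a 2000 ms target -> 1500 ms left
batch_slack = float('inf')    # batch tier: no SLA, fills whatever capacity remains

order = sorted(
    [("premium", premium_slack), ("standard", standard_slack), ("batch", batch_slack)],
    key=lambda pair: pair[1],
)
print(order)  # [('premium', 100), ('standard', 1500), ('batch', inf)]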
What "Fair" Even Means
There's no single definition:
- Request fairness: Each request waits roughly the same time
- Token fairness: Each token processed has equal priority
- User fairness: Each user gets equal resources over time
- Payment fairness: Resources proportional to payment
- SLA fairness: Everyone meets their contracted latency
Pick the definition that matches your business. Then implement it consistently.
Unfairness breeds resentment. Users notice when their quick question takes forever because someone else is generating a novel.