Balancing Fast Responses and Fair Queuing
At a coffee shop, someone ordering a simple espresso shouldn't wait 20 minutes because the person ahead ordered a seven-layer custom frappuccino.
LLM serving has the same problem. A user sending "Summarize this 50,000-word document" shouldn't monopolize resources while users asking "What's 2+2?" wait.
The Starvation Problem
Without fairness controls, long requests block short ones:
Time 0: Large request arrives (2000 tokens output)
Time 0: Small request 1 arrives (50 tokens)
Time 0: Small request 2 arrives (50 tokens)
...
Time 0: Small request 100 arrives (50 tokens)
Without preemption:
Time 40s: Large request completes
Time 40s: Small requests start processing
Time 41s: All small requests complete
Result: Small requests waited 40 seconds for roughly one second of work.
The large request wasn't wrong to exist. The system was wrong to handle it this way.
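The arithmetic behind that timeline is worth spelling out. A minimal sketch, assuming a decode rate of 50 tokens/s (an illustrative number picked to match the 40-second figure above):

DECODE_RATE = 50  # tokens/s, an illustrative assumption

large_tokens = 2000
small_tokens = 50

large_time = large_tokens / DECODE_RATE   # 40 s to finish the large request
small_time = small_tokens / DECODE_RATE   # ~1 s for the small batch once admitted

# Under first-come-first-served, a small request's latency is almost all queueing:
wait_fraction = large_time / (large_time + small_time)
print(f"{wait_fraction:.0%} of the small request's latency is queueing")  # 98%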
Preemption and Time-Slicing
One solution: don't let any single request monopolize the GPU for too long.
import time

class TimeSlicedScheduler:
    def __init__(self, time_slice_ms: int = 100):
        self.time_slice_ms = time_slice_ms
        self.active_requests = []
        self.preempted_queue = []

    async def run_iteration(self):
        start = time.time()
        while time.time() - start < self.time_slice_ms / 1000:
            if not self.active_requests:
                break
            # Run one decode step for the active batch
            await self.decode_step(self.active_requests)
            # Check for newly completed requests
            self._handle_completions()
        # After the time slice expires, admit waiting requests
        self._admit_pending()

    def _admit_pending(self):
        # Give waiting requests a chance
        if self.preempted_queue:
            # Rotate: move some active to preempted, some preempted to active
            pass
This mirrors an operating system's process scheduler: no single process (here, a single request) runs forever.
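The _admit_pending stub leaves the rotation policy open. One possible policy, sketched below as an assumption rather than a prescription, preempts the longest-running active requests to make room for waiting ones; the running_steps attribute and the max_active limit are hypothetical and not part of the class above.

# A possible drop-in body for TimeSlicedScheduler._admit_pending (hypothetical):
def _admit_pending(self, max_active: int = 32):
    # Assumes each request tracks how many decode steps it has received
    # so far (request.running_steps) -- an illustrative attribute.
    if not self.preempted_queue:
        return
    # Sort so the longest-running requests sit at the tail of the list.
    self.active_requests.sort(key=lambda r: r.running_steps)
    # Preempt longest-running requests to free slots for waiting ones.
    while len(self.active_requests) >= max_active and self.preempted_queue:
        self.preempted_queue.append(self.active_requests.pop())
    # Admit waiting requests (FIFO) up to the batch limit.
    while self.preempted_queue and len(self.active_requests) < max_active:
        self.active_requests.append(self.preempted_queue.pop(0))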
Multi-Tenant Fairness
In a shared system, fairness extends to users/tenants:
class PerTenantFairScheduler:
    def __init__(self, tenants: list[str]):
        self.tenant_queues = {t: [] for t in tenants}
        self.tenant_usage = {t: 0 for t in tenants}  # Tokens processed per tenant
        self.round_robin_index = 0

    def add_request(self, tenant: str, request: Request):
        self.tenant_queues[tenant].append(request)

    def get_next_batch(self) -> list[Request]:
        batch = []
        tokens = 0
        max_tokens = 50000
        # Round-robin across tenants: a different tenant goes first each batch
        tenants = list(self.tenant_queues.keys())
        for i in range(len(tenants)):
            tenant = tenants[(self.round_robin_index + i) % len(tenants)]
            while self.tenant_queues[tenant] and tokens < max_tokens:
                request = self.tenant_queues[tenant][0]
                if tokens + request.total_tokens <= max_tokens:
                    batch.append(self.tenant_queues[tenant].pop(0))
                    tokens += request.total_tokens
                    self.tenant_usage[tenant] += request.total_tokens
                else:
                    break
        # Start the next batch with a different tenant
        self.round_robin_index = (self.round_robin_index + 1) % len(tenants)
        return batch
Tenant A sending 1000 requests shouldn't starve Tenant B's 10 requests.
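A quick sketch of how this plays out. The Request dataclass here is a minimal stand-in carrying only the total_tokens field the scheduler reads, not a real serving type:

from dataclasses import dataclass

@dataclass
class Request:
    total_tokens: int

scheduler = PerTenantFairScheduler(tenants=["tenant_a", "tenant_b"])

# Tenant A floods the queue; Tenant B sends a handful of requests.
for _ in range(1000):
    scheduler.add_request("tenant_a", Request(total_tokens=500))
for _ in range(10):
    scheduler.add_request("tenant_b", Request(total_tokens=500))

first = scheduler.get_next_batch()   # filled by Tenant A (it happens to go first)
second = scheduler.get_next_batch()  # starts with Tenant B, so its 10 requests ship now

Tenant B waits one batch, not one thousand requests.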
Weighted Fair Queuing
Not all tenants are equal. Enterprise pays more than free tier:
class WeightedFairScheduler:
    def __init__(self, tenant_weights: dict[str, float]):
        # Higher weight = more resources
        self.weights = tenant_weights  # e.g., {"enterprise": 10, "pro": 3, "free": 1}
        self.tenant_queues = {t: [] for t in tenant_weights}
        self.virtual_time = {t: 0.0 for t in tenant_weights}

    def add_request(self, tenant: str, request: Request):
        self.tenant_queues[tenant].append(request)

    def get_next_request(self) -> tuple[str, Request] | None:
        # Tenant with the lowest virtual time goes next.
        # Virtual time advances more slowly for higher-weight tenants.
        eligible = [
            (t, self.virtual_time[t])
            for t in self.tenant_queues
            if self.tenant_queues[t]
        ]
        if not eligible:
            return None
        tenant, _ = min(eligible, key=lambda x: x[1])
        request = self.tenant_queues[tenant].pop(0)
        # Update virtual time (inversely proportional to weight)
        self.virtual_time[tenant] += request.total_tokens / self.weights[tenant]
        return tenant, request
The enterprise tenant processes 10x more tokens before its virtual time catches up to the free tier's.
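A small sketch of the virtual-time arithmetic, reusing the hypothetical Request from the previous example:

scheduler = WeightedFairScheduler({"enterprise": 10, "free": 1})

for _ in range(20):
    scheduler.add_request("enterprise", Request(total_tokens=1000))
    scheduler.add_request("free", Request(total_tokens=1000))

served = [scheduler.get_next_request()[0] for _ in range(22)]

# Each 1000-token request advances enterprise's virtual time by 1000/10 = 100
# but free's by 1000/1 = 1000, so "lowest virtual time wins" picks enterprise
# about ten times for every pick of free.
print(served.count("enterprise"), served.count("free"))  # ~10:1 split (20 vs 2)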
The Latency SLA Approach
Another framing: don't think about fairness, think about SLAs.
import time

class SLADrivenScheduler:
    def __init__(self):
        self.sla_targets = {
            "premium": {"p99_latency_ms": 500},
            "standard": {"p99_latency_ms": 2000},
            "batch": {"p99_latency_ms": None},  # Best effort
        }

    def get_next_batch(self):
        now = time.time()

        # Priority: requests closest to missing their SLA
        def urgency(request):
            target = self.sla_targets[request.tier]["p99_latency_ms"]
            if target is None:
                return float('inf')  # Lowest priority
            time_waiting_ms = (now - request.arrival_time) * 1000
            time_remaining = target - time_waiting_ms
            return time_remaining  # Lower = more urgent

        # all_pending() and _make_batch() are assumed to exist elsewhere
        pending = sorted(self.all_pending(), key=urgency)
        return self._make_batch(pending)
Requests that are about to miss their SLA get priority. Requests with no SLA fill gaps.
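The ordering is easy to verify with a worked example. The numbers below are made up; the point is that sorting by remaining slack puts the closest-to-breach request first and best-effort requests last:

# Remaining slack = SLA target - time already waited (both in ms).
premium_slack = 500 - 400     # premium waited 400 ms of a 500 ms target -> 100 ms left
standard_slack = 2000 - 500   # standard waited 500 ms of a 2000 ms target -> 1500 ms left
batch_slack = float('inf')    # batch tier: no SLA, fills whatever capacity remains

order = sorted(
    [("premium", premium_slack), ("standard", standard_slack), ("batch", batch_slack)],
    key=lambda pair: pair[1],
)
print(order)  # [('premium', 100), ('standard', 1500), ('batch', inf)]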
What "Fair" Even Means
There's no single definition:
- Request fairness: Each request waits roughly the same time
- Token fairness: Each token processed has equal priority
- User fairness: Each user gets equal resources over time
- Payment fairness: Resources proportional to payment
- SLA fairness: Everyone meets their contracted latency
Pick the definition that matches your business. Then implement it consistently.
Unfairness breeds resentment. Users notice when their quick question takes forever because someone else is generating a novel.