Balancing Fast Responses and Fair Queuing
A 10,000-token request takes 20 seconds. Behind it, a hundred 50-token requests wait. Is that fair? What even is fair in LLM serving?
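One common answer in practice is to weigh requests by token cost rather than arrival order. Below is a minimal sketch of that idea, assuming a virtual-finish-time queue keyed on estimated tokens; the class, names, and cost model are illustrative, not taken from any particular serving framework.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    virtual_finish: float                    # finish time in fair-share "virtual time"
    request_id: str = field(compare=False)
    estimated_tokens: int = field(compare=False)

class TokenFairQueue:
    """Order requests by virtual finish time so one 10,000-token request
    cannot indefinitely delay a stream of 50-token requests."""

    def __init__(self):
        self._heap = []
        self._virtual_time = 0.0             # advances as tokens are served

    def submit(self, request_id, estimated_tokens):
        # A request "finishes" at the current virtual time plus its own token
        # cost, so short requests sort ahead of a long one that arrived earlier.
        finish = self._virtual_time + estimated_tokens
        heapq.heappush(self._heap, Request(finish, request_id, estimated_tokens))

    def next_request(self):
        request = heapq.heappop(self._heap)
        self._virtual_time = max(self._virtual_time, request.virtual_finish)
        return request

queue = TokenFairQueue()
queue.submit("long", estimated_tokens=10_000)
for i in range(3):
    queue.submit(f"short-{i}", estimated_tokens=50)
print([queue.next_request().request_id for _ in range(4)])
# the three short requests drain before the long one
```

A real scheduler would add aging or per-tenant shares on top of this, otherwise a steady stream of short requests can starve the long one indefinitely.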
A queue of 100 requests sounds like a uniform unit of work. But one 50k-token request consumes more resources than 99 short ones combined. Batch by tokens, not requests.
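A minimal sketch of what that means in practice, assuming a greedy scheduler with a per-batch token budget; the budget value and helper name are made up for illustration.

```python
def build_batch(pending, max_batch_tokens=8192):
    """Fill a batch up to a token budget instead of a request count.

    `pending` is a list of (request_id, token_count) pairs; the 8192 budget
    is illustrative, not a recommendation.
    """
    batch, used = [], 0
    for request_id, tokens in pending:
        if batch and used + tokens > max_batch_tokens:
            break                              # budget spent, ship what we have
        batch.append(request_id)
        used += tokens
    return batch, used

# One huge request fills a batch on its own; short requests pack together.
print(build_batch([("big", 50_000)]))                  # (['big'], 50000)
print(build_batch([("a", 50), ("b", 50), ("c", 50)]))  # (['a', 'b', 'c'], 150)
```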
Traffic spikes 10x. Do you queue requests until OOM, drop them randomly, or gracefully degrade? The answer shapes your system's behavior under pressure.
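One answer that tends to age well is explicit admission control: bound the queue and shed load early with a retriable error instead of queuing until memory runs out. A sketch, assuming asyncio, with limits that are placeholders rather than recommendations:

```python
import asyncio

class AdmissionController:
    """Bound in-flight work explicitly and fail fast when the bound is hit,
    rather than letting an unbounded queue grow until the process dies."""

    def __init__(self, max_concurrent=8, max_queued=64):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_in_flight = max_concurrent + max_queued
        self._in_flight = 0

    async def handle(self, work):
        if self._in_flight >= self._max_in_flight:
            # Graceful degradation: a fast, retriable rejection (an HTTP 429,
            # say) instead of accepting work we cannot serve in time.
            raise RuntimeError("overloaded, retry later")
        self._in_flight += 1
        try:
            async with self._slots:            # wait for a free execution slot
                return await work()
        finally:
            self._in_flight -= 1
```

Dropping randomly or evicting by priority are variations on the same theme: the limit is chosen deliberately instead of being discovered by the OOM killer.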
You can have 1000 tokens per second with 3-second latency, or 200 tokens per second with 200ms latency. You cannot have both. Here's how to choose.
vLLM doesn't use a faster model. It uses memory smarter. PagedAttention treats KV cache like virtual memory, and the results are dramatic.
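A toy sketch of the idea only, not vLLM's actual data structures: the KV cache is carved into fixed-size blocks, and each sequence keeps a table mapping its logical blocks to physical ones, allocated on demand like pages of virtual memory. Block and pool sizes below are arbitrary.

```python
class ToyPagedKVCache:
    """Illustration only: block-granular KV cache bookkeeping in the spirit
    of PagedAttention. No tensors, just the allocation logic."""

    def __init__(self, num_blocks=1024, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}          # seq_id -> list of physical block ids
        self.lengths = {}               # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a new block only when
        the previous one is full instead of pre-reserving the max length."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = ToyPagedKVCache()
for _ in range(40):
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]))  # 3 blocks cover 40 tokens at block_size=16
```

The dramatic part is what this replaces: reserving a contiguous, maximum-length KV region per request, most of which sits empty.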
Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fills those gaps. The difference is 3-5x throughput.
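The control flow, sketched without a real model; request sizes and the batch limit are made up. The point is only that slots are refilled at every decode step, whereas naive static batching returns a batch's results when its slowest member finishes.

```python
from collections import deque

def continuous_batching(pending, max_batch=8):
    """Toy scheduler loop: evict finished sequences and admit waiting ones at
    every decode step, so no slot idles while the slowest request finishes."""
    waiting = deque(pending)
    running = {}                        # request id -> tokens still to generate
    finish_step = {}
    step = 0
    while waiting or running:
        # Refill free slots every step, not only when the whole batch is done.
        while waiting and len(running) < max_batch:
            req_id, tokens = waiting.popleft()
            running[req_id] = tokens
        step += 1
        # One decode step produces one token for every running sequence.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                del running[req_id]
                finish_step[req_id] = step
    return finish_step

requests = [("long", 500)] + [(f"short-{i}", 20) for i in range(20)]
done = continuous_batching(requests)
print(done["short-0"], done["long"])    # 20 500: short requests return in 20
                                        # steps instead of riding along for 500
```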
Single-user latency was 200ms. At 100 concurrent users, it's 3 seconds. The model didn't slow down. Your serving architecture did.
5% of requests fail. You retry 3 times. That's not 5% overhead. It's 15%. And under pressure, it gets much worse.
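The arithmetic, spelled out under the assumption that retried attempts also fail, which is exactly what an incident looks like:

```python
def retry_overhead(failure_rate, max_retries):
    """Extra traffic generated when each failing request is retried
    `max_retries` times and the retries fail too."""
    return failure_rate * max_retries

print(f"{retry_overhead(0.05, 3):.0%}")   # 15% extra load at a 5% failure rate
print(f"{retry_overhead(0.30, 3):.0%}")   # 90% extra load once the backend is struggling
```

And that extra load lands on the very backend that is already failing, which is how retry storms start.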
FlashAttention claims 2-4x speedup. CUDA graphs claim 10x. What actually helps in production, and what's just good marketing?
That benchmark showing 10,000 tokens/second? It probably used batch size 64 and measured mean latency. Here's how to benchmark for reality.
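A sketch of a more honest harness: open-loop Poisson arrivals at the rate you actually expect, and tail percentiles instead of the mean. Here `send_request` is a placeholder for your real client call, and the rate and duration are made up.

```python
import asyncio, random, statistics, time

async def send_request(prompt):
    """Placeholder: swap in a real call to your serving endpoint."""
    await asyncio.sleep(0.2)                    # stand-in for network + inference

async def benchmark(rate_per_s=5.0, duration_s=60, seed=0):
    """Open-loop load: requests keep arriving on a Poisson process no matter
    how slow the server gets, unlike a fixed-batch-size closed loop."""
    rng = random.Random(seed)
    latencies, tasks = [], []

    async def timed_call():
        start = time.perf_counter()
        await send_request("hello")
        latencies.append(time.perf_counter() - start)

    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        tasks.append(asyncio.create_task(timed_call()))
        await asyncio.sleep(rng.expovariate(rate_per_s))   # Poisson arrivals
    await asyncio.gather(*tasks)

    q = statistics.quantiles(latencies, n=100)             # 1st..99th percentile
    print(f"p50={q[49]:.3f}s  p95={q[94]:.3f}s  p99={q[98]:.3f}s")

asyncio.run(benchmark())
```

Report time-to-first-token and total latency separately, and report both as percentiles; a mean over mixed request sizes hides exactly the tail users feel.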