
Choosing Benchmarks That Predict Production

The vendor claims 10,000 tokens per second. You deploy and get 500. The benchmark wasn't lying. It just wasn't measuring what you needed.

How Benchmarks Mislead

Batch size inflation: Throughput scales with batch size. A benchmark at batch size 64 looks great. Your production workload is mostly batch size 1.

Batch size 1:   500 tokens/second, 50ms latency
Batch size 8:   2,000 tokens/second, 100ms latency
Batch size 64:  10,000 tokens/second, 800ms latency

Same hardware, same model, 20x difference in the headline number. And the batch-64 figure works out to roughly 156 tokens per second per request, far less than the 500 tokens per second a single user sees at batch size 1.

Mean vs percentile: Mean latency of 100ms sounds great. P99 of 3 seconds means 1% of users wait 30x longer.
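
The gap is easy to see if you compute both from the same samples. A minimal sketch with made-up latencies:

import statistics

def mean_vs_p99(latencies_ms):
    ordered = sorted(latencies_ms)
    # Nearest-index P99: the value 99% of samples fall below
    p99 = ordered[min(len(ordered) - 1, int(len(ordered) * 0.99))]
    return {"mean_ms": statistics.mean(ordered), "p99_ms": p99}

# 99 fast requests and one slow one: the mean barely notices, the P99 does
print(mean_vs_p99([100.0] * 99 + [3000.0]))
# {'mean_ms': 129.0, 'p99_ms': 3000.0}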

Short sequences only: Benchmarked with 128-token prompts. Your production prompts are 4,000 tokens.

No concurrency: Benchmarked one request at a time. Production has 50 concurrent requests.

What to Benchmark

Match your production workload:

import random
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    prompt_length_distribution: list[int]  # [500, 1000, 2000, 4000] tokens
    prompt_length_weights: list[float]     # [0.3, 0.4, 0.2, 0.1]
    output_length_distribution: list[int]  # [50, 100, 200, 500] tokens
    output_length_weights: list[float]     # [0.2, 0.4, 0.3, 0.1]
    concurrency_levels: list[int]          # [1, 10, 50, 100]

def generate_benchmark_requests(profile: WorkloadProfile, n: int):
    # Sample prompt lengths with the same frequencies production sees
    for _ in range(n):
        prompt_len = random.choices(
            profile.prompt_length_distribution,
            weights=profile.prompt_length_weights,
        )[0]
        yield generate_prompt(prompt_len)  # generate_prompt builds a prompt of ~prompt_len tokens
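
Instantiated with the distributions from the comments above, this gives you a request mix to replay at each concurrency level (the 1,000-request sample size is arbitrary):

profile = WorkloadProfile(
    prompt_length_distribution=[500, 1000, 2000, 4000],
    prompt_length_weights=[0.3, 0.4, 0.2, 0.1],
    output_length_distribution=[50, 100, 200, 500],
    output_length_weights=[0.2, 0.4, 0.3, 0.1],
    concurrency_levels=[1, 10, 50, 100],
)

requests = list(generate_benchmark_requests(profile, n=1000))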

The Metrics That Matter

For interactive applications:

import time

async def benchmark_interactive(endpoint, requests, concurrency):
    async def make_request(prompt):
        start = time.time()
        first_token_time = None
        tokens = 0

        # Stream so time-to-first-token and decode speed can be measured separately
        async for token in stream_request(endpoint, prompt):
            if first_token_time is None:
                first_token_time = time.time()
            tokens += 1

        end = time.time()

        return {
            "ttft": first_token_time - start,
            "total_time": end - start,
            "tokens": tokens,
            # Decode throughput: tokens generated after the first one arrived
            "tokens_per_second": tokens / (end - first_token_time),
        }

    # Run with specified concurrency
    results = await run_concurrent(make_request, requests, concurrency)

    return {
        "ttft_p50": percentile(results, "ttft", 50),
        "ttft_p99": percentile(results, "ttft", 99),
        "throughput_p50": percentile(results, "tokens_per_second", 50),
        "throughput_p99": percentile(results, "tokens_per_second", 99),
    }
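
stream_request depends on your serving stack, but run_concurrent and percentile are generic. Sketches of both; since percentile is called with a key here and directly on raw latencies in the stress test below, this version accepts either form:

import asyncio

async def run_concurrent(fn, requests, concurrency):
    # Keep at most `concurrency` requests in flight at once
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(request):
        async with semaphore:
            return await fn(request)

    return await asyncio.gather(*(bounded(r) for r in requests))

def percentile(results, key, p=None):
    # Nearest-rank percentile; called as percentile(dicts, key, p) or percentile(values, p)
    if p is None:
        values, p = list(results), key
    else:
        values = [r[key] for r in results]
    values.sort()
    return values[min(len(values) - 1, int(len(values) * p / 100))]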

For batch processing:

async def benchmark_batch(endpoint, requests, batch_size):
    start = time.time()
    total_tokens = 0

    # Process the full request set in fixed-size batches
    for batch in chunk(requests, batch_size):
        results = await process_batch(endpoint, batch)
        total_tokens += sum(r.tokens for r in results)

    elapsed = time.time() - start

    return {
        "total_throughput": total_tokens / elapsed,
        "requests_per_second": len(requests) / elapsed,
    }
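
chunk is the only helper here that isn't stack-specific; a minimal version:

def chunk(items, size):
    # Split the request list into consecutive batches of at most `size`
    for i in range(0, len(items), size):
        yield items[i:i + size]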

Concurrency Stress Test

Real production has bursts:

async def stress_test(endpoint, base_concurrency, spike_multiplier):
    # benchmark_at_concurrency is assumed to return one latency figure per round
    results = {"base": [], "spike": [], "recovery": []}

    # Baseline: normal load
    for _ in range(100):
        r = await benchmark_at_concurrency(endpoint, base_concurrency)
        results["base"].append(r)

    # Spike: sudden load increase
    for _ in range(50):
        r = await benchmark_at_concurrency(endpoint, base_concurrency * spike_multiplier)
        results["spike"].append(r)

    # Recovery: back to normal
    for _ in range(100):
        r = await benchmark_at_concurrency(endpoint, base_concurrency)
        results["recovery"].append(r)

    # Did latency recover? Or did queues build up?
    baseline_p99 = percentile(results["base"], 99)
    return {
        "baseline_p99": baseline_p99,
        "spike_p99": percentile(results["spike"], 99),
        "recovery_p99": percentile(results["recovery"], 99),
        "recovery_time": time_to_baseline(results["recovery"], baseline_p99),
    }
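
time_to_baseline is where queue buildup shows up: how many post-spike rounds pass before latency is back near the pre-spike P99. A minimal sketch; the 10% tolerance is an arbitrary choice:

def time_to_baseline(recovery_results, baseline_p99, tolerance=1.1):
    # Index of the first recovery round whose latency is back within
    # tolerance x the baseline P99; None means it never recovered during the test.
    # Multiply by the per-round duration if you want wall-clock recovery time.
    for i, latency in enumerate(recovery_results):
        if latency <= baseline_p99 * tolerance:
            return i
    return None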

Benchmark Checklist

Before trusting any benchmark:

  1. What batch size? If > 1, divide throughput by batch size for real-world estimate
  2. What sequence lengths? Match your production distribution
  3. What concurrency? Test at your expected peak
  4. Mean or percentile? P99 is what users remember
  5. Sustained or burst? Real traffic is bursty
  6. Cold or warm? The first request after idle is slower, so warm up before measuring (see the sketch after this list)
  7. What hardware exactly? Same GPU model can have different memory bandwidth
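
For point 6, the cheap fix is to send a few throwaway requests and discard their timings before measuring anything. A sketch reusing the generator above; the warm-up count is arbitrary:

async def warm_up(endpoint, profile, n=5):
    # Throwaway requests so model load and cache population don't pollute the measurements
    for prompt in generate_benchmark_requests(profile, n):
        async for _ in stream_request(endpoint, prompt):
            pass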

The Honest Report

A benchmark report that's actually useful:

Hardware: 1x H100 80GB
Model: Llama-2-70B, INT8 quantized

| Concurrency | Prompt Len | TTFT P50 | TTFT P99 | Throughput P50 |
|-------------|------------|----------|----------|----------------|
| 1           | 512        | 180ms    | 220ms    | 45 tok/s       |
| 1           | 2048       | 450ms    | 580ms    | 42 tok/s       |
| 10          | 512        | 250ms    | 890ms    | 35 tok/s       |
| 10          | 2048       | 680ms    | 2100ms   | 28 tok/s       |
| 50          | 512        | 890ms    | 3500ms   | 22 tok/s       |

Notes:
- TTFT degrades significantly under concurrency
- P99 is 4x worse than P50 at high concurrency
- Throughput per request drops 50% at 50 concurrent

This tells you what to actually expect. The headline number never does.