
Why Output Tokens Cost 4x More Than Input

OpenAI charges $15 per million input tokens for GPT-4 Turbo. Output tokens? $60 per million. A 4x premium.

This isn't arbitrary pricing. It reflects a fundamental asymmetry in how transformers process tokens.

Two Different Operations

When you send a prompt to an LLM, two distinct phases happen:

Prefill: Process all input tokens at once. Highly parallel. GPU compute units stay busy. Fast.

Decode: Generate output tokens one at a time. Each token depends on the previous. Memory-bound. Slow.

The same GPU that processes 10,000 input tokens per second might only generate 50 output tokens per second. That's a 200x difference in throughput.

Why Prefill Is Fast

Input tokens are processed in parallel. A 1,000-token prompt doesn't take 1,000x longer than a 1-token prompt. The GPU does matrix multiplications across all tokens simultaneously.

  • Prefill: O(1) sequential steps, no matter how long the prompt
  • Every input token is processed independently, in parallel
  • GPU utilization: High (compute-bound)

The attention computation is O(n²) in sequence length, but it's highly parallelizable. Modern GPUs eat this for breakfast.
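
To make that concrete, here is a single-head causal attention pass in NumPy over a whole prompt at once. The sizes and random weights are toy assumptions; the point is that the entire prompt goes through a handful of large matrix multiplications rather than a per-token loop:

import numpy as np

# Toy prefill: one causal attention pass over the entire prompt at once.
T, d = 1000, 64                                   # prompt length, head dimension
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))                       # embeddings for every prompt token
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv                  # all T tokens projected together
scores = Q @ K.T / np.sqrt(d)                     # (T, T): O(n^2), but one big matmul
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf   # causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                                 # (T, d): every position in one pass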

Why Decode Is Slow

Output tokens must be generated sequentially. Token 50 depends on tokens 1-49. You can't parallelize this.

  • Decode: O(n) sequential steps for n output tokens
  • Each token waits for all previous tokens
  • GPU utilization: Low (memory-bound)

Each decode step:

  1. Load the KV cache from memory (slow)
  2. Do a tiny bit of compute (fast)
  3. Write the new KV entry (slow)
  4. Repeat

The GPU spends most of its time waiting for memory, not computing.
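
A toy version of that loop, with random weights standing in for a real model (single attention head, greedy decoding): note that the whole cache is read on every step, while only one new row is written.

import numpy as np

d, vocab = 64, 1000
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab, d))                       # token embedding table
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))

def decode_step(token_id, K_cache, V_cache):
    x = E[token_id]                                   # embed only the newest token
    q, k, v = x @ Wq, x @ Wk, x @ Wv                  # a tiny bit of compute
    K_cache = np.vstack([K_cache, k])                 # write one new KV entry
    V_cache = np.vstack([V_cache, v])
    s = K_cache @ q / np.sqrt(d)                      # read the ENTIRE cache (memory-bound)
    w = np.exp(s - s.max())
    w /= w.sum()
    logits = (w @ V_cache) @ Wo @ E.T
    return int(np.argmax(logits)), K_cache, V_cache   # greedy next token

token, K, V = 0, np.empty((0, d)), np.empty((0, d))
generated = []
for _ in range(50):                                   # strictly one token per iteration
    token, K, V = decode_step(token, K, V)
    generated.append(token)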

The Numbers

For a 70B parameter model on an H100:

Phase     Throughput           Bottleneck
Prefill   5,000+ tokens/sec    Compute
Decode    30-50 tokens/sec     Memory bandwidth

Same hardware. 100x difference. The decode phase is memory-bound because each token generation requires reading the entire KV cache.
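
A back-of-envelope calculation shows where the decode figure comes from. The numbers below are rough assumptions (FP16 weights, batch size 1, H100 SXM-class bandwidth, KV-cache reads ignored):

# Every decode step must stream the full weights from HBM at least once.
params, bytes_per_weight = 70e9, 2            # 70B parameters in FP16
weight_bytes = params * bytes_per_weight      # ~140 GB read per step
hbm_bandwidth = 3.35e12                       # ~3.35 TB/s on an H100 SXM

ceiling = hbm_bandwidth / weight_bytes        # ~24 decode steps per second
print(round(ceiling, 1))

Quantized weights or tensor parallelism across more GPUs can lift the realized number into the table's 30-50 range, but the ceiling is set by bandwidth, not FLOPs. Prefill reuses each weight it loads across thousands of prompt tokens, which is why it ends up compute-bound instead.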

What This Means for Pricing

Providers price based on their costs. If decode uses 100x more GPU-seconds per token than prefill, output tokens should cost more.

The 4x premium is actually generous. It reflects batching efficiencies (multiple users share decode overhead) and competition keeping margins thin.

What This Means for You

Minimize output tokens, not input tokens.

A 2,000-token system prompt costs $0.03 of input per request at GPT-4 Turbo rates. Getting the model to output 500 fewer tokens per response also saves $0.03 per request.

At the 4x premium, every output token you trim is worth four input tokens, and trimming output also cuts latency, because decode is the slow phase.
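
The arithmetic, at the rates quoted at the top of the post:

INPUT_RATE  = 15 / 1_000_000                 # $ per input token
OUTPUT_RATE = 60 / 1_000_000                 # $ per output token

system_prompt_cost = 2_000 * INPUT_RATE      # $0.03 of input per request
output_savings     = 500 * OUTPUT_RATE       # $0.03 of output saved per request
print(system_prompt_cost, output_savings)    # 0.03 0.03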

Use max_tokens aggressively.

If you need 100 tokens, don't let the model ramble to 500. Set max_tokens: 150 and the 350 extra tokens you didn't need are never generated, and never billed at the 4x rate.
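
A minimal sketch with the OpenAI Python SDK (the model name and prompt are placeholders; use whatever you actually call). Note that a response which hits the cap is cut off mid-sentence, so leave some headroom above what you expect to need:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Summarize this ticket in three bullets: ..."}],
    max_tokens=150,  # hard cap: tokens past this are never generated, never billed
)
print(resp.choices[0].message.content)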

Output format matters.

JSON with verbose keys costs more than terse formats:

// Expensive
{"customer_name": "Alice", "purchase_amount": 99.50}

// Cheaper
{"n": "Alice", "a": 99.50}

Extreme? Maybe. But at scale, output verbosity directly impacts costs.
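
You can measure the gap directly with tiktoken (cl100k_base is the GPT-4 Turbo encoding; exact counts depend on the tokenizer and the values):

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
verbose = json.dumps({"customer_name": "Alice", "purchase_amount": 99.50})
terse = json.dumps({"n": "Alice", "a": 99.50})
print(len(enc.encode(verbose)), len(enc.encode(terse)))  # verbose form costs noticeably more tokens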

Batched Decode

Smart inference engines batch multiple decode requests together. User A's token 47 generates alongside User B's token 23.

This improves GPU utilization during decode. The memory bandwidth is still the bottleneck, but you're spreading it across more useful work.
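
Extending the earlier back-of-envelope: the ~140 GB weight read per decode step is paid once per batch, not once per user, so aggregate throughput scales with batch size until KV-cache reads and compute catch up (same rough assumptions as before):

weight_bytes, hbm_bandwidth = 140e9, 3.35e12
steps_per_sec = hbm_bandwidth / weight_bytes           # ~24 decode steps per second
for batch_size in (1, 8, 32, 128):
    # one step now yields one new token for every sequence in the batch
    print(batch_size, round(batch_size * steps_per_sec), "tokens/sec aggregate")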

Batched decode is why API providers can charge "only" 4x for output tokens instead of 100x.

Looking Forward

The input/output price gap will persist as long as autoregressive generation remains sequential. Some approaches try to change this:

  • Speculative decoding: Draft model proposes multiple tokens, main model verifies in parallel (toy sketch after this list)
  • Parallel decoding: Generate multiple candidate continuations simultaneously
  • Non-autoregressive models: Generate all tokens at once (quality tradeoffs)
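
For a feel of the first approach, here is a deliberately tiny sketch of greedy speculative decoding, with lookup tables standing in for the draft and target models. The published method accepts or rejects draft tokens probabilistically and uses a real small LLM as the draft, so treat this purely as an illustration of the control flow:

import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50
# "Models" are just next-token lookup tables; the draft mostly agrees with
# the target but is wrong on a handful of entries.
target_rule = rng.integers(0, VOCAB, size=VOCAB)
draft_rule = target_rule.copy()
wrong = rng.integers(0, VOCAB, size=8)
draft_rule[wrong] = rng.integers(0, VOCAB, size=8)

def speculative_step(last_token, k=4):
    # 1. Draft proposes k tokens sequentially (cheap model, fast steps).
    draft, t = [], last_token
    for _ in range(k):
        t = int(draft_rule[t])
        draft.append(t)
    # 2. Target checks all k positions in ONE parallel, prefill-like pass.
    contexts = [last_token] + draft[:-1]
    verified = [int(target_rule[c]) for c in contexts]
    # 3. Keep the target's token at each position; stop at the first draft
    #    mistake. Progress per expensive pass: between 1 and k tokens.
    accepted = []
    for d, v in zip(draft, verified):
        accepted.append(v)
        if d != v:
            break
    return accepted

tokens = [0]
while len(tokens) < 30:
    tokens += speculative_step(tokens[-1])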

Until these mature, the physics remains: input tokens are cheap, output tokens are expensive.

When optimizing LLM costs, focus on the expensive side first.