Why Doubling Context Quadruples Your Problems
A product manager asks: "Can we double the context window from 8K to 16K?" Seems reasonable. More context means better responses, right?
Here's what they don't know: doubling context doesn't double compute. It quadruples it.
The Math
Standard transformer attention is O(n²) in sequence length. Every token attends to every other token.
| Context Length | Attention Operations | Relative Cost |
|---|---|---|
| 4K tokens | 16M | 1x |
| 8K tokens | 64M | 4x |
| 16K tokens | 256M | 16x |
| 32K tokens | 1,024M | 64x |
| 128K tokens | 16,384M | 1,024x |
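To sanity-check those numbers, treat "attention operations" as the n² token-pair interactions (per layer, per head, with all constants dropped) and compare each length against the 4K baseline:

```python
# Attention cost grows with the square of the context length.
BASELINE = 4_096

for n in (4_096, 8_192, 16_384, 32_768, 131_072):
    ops_m = n**2 / 2**20               # "M" as in the table above (2^20)
    relative = (n / BASELINE) ** 2
    print(f"{n:>7} tokens: {ops_m:>6.0f}M pairwise ops, {relative:>5.0f}x the 4K cost")
```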
That 128K context window on GPT-4? It costs 1,024x more attention compute than a 4K context.
Where It Hurts
The quadratic cost hits hardest during prefill, when the model processes your entire prompt.
A 2,000-token prompt prefills in around 100ms. A 32,000-token document? Closer to 4 seconds. That's roughly 40x slower, not the 16x that linear scaling would predict: attention does 256x more work, but hardware parallelism and the model's linear-cost layers keep the wall-clock penalty somewhere in between.
Still, you feel it.
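Why roughly 40x rather than 16x or 256x? Prefill also has a linear component (feed-forward layers, projections) alongside the quadratic attention term, and which one dominates depends on prompt length. A toy model makes the shape visible; the constants below are illustrative, chosen only to reproduce the numbers above, not measured on any real hardware:

```python
def toy_prefill_ms(n_tokens: int,
                   linear_ms_per_token: float = 0.045,
                   quad_ms_per_pair: float = 2.5e-6) -> float:
    # Linear term ~ feed-forward and projection work; quadratic term ~ attention
    # scores. Both constants are made up for illustration.
    return n_tokens * linear_ms_per_token + n_tokens**2 * quad_ms_per_pair

print(toy_prefill_ms(2_000))   # 100.0 ms -- the linear term dominates
print(toy_prefill_ms(32_000))  # 4000.0 ms -- the quadratic term now dominates
# 16x more tokens ends up ~40x slower: between linear (16x) and pure quadratic (256x).
```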
Why Teams Get Surprised
Context window limits have been increasing rapidly:
- GPT-3: 2K tokens, then 4K in the 3.5-era models
- GPT-4: 8K, then 32K, then 128K
- Claude: 100K, then 200K
Marketing says "200K context!" Engineers assume costs scale linearly. Finance discovers the bill.
The Real Budget
Think of context as a budget with diminishing returns:
- First 2K tokens: Essential (system prompt, user query)
- Next 4K tokens: Useful (immediate context)
- Next 8K tokens: Marginally useful (background info)
- Beyond 16K: Often noise (full document dumps)
Each doubling costs 4x more. Is that next doubling worth it?
What Actually Helps
Prefix caching
If your system prompt is constant across requests, cache its KV representation. Pay the quadratic cost once, reuse forever.
# Without prefix caching: every request prefills the system prompt and query together
cost_per_request = O((system_prompt + user_query)²)

# With prefix caching: the system prompt's KV cache is reused; only the query
# tokens are prefilled, and they attend to the cached prefix instead of recomputing it
cost_per_request = O(user_query × (system_prompt + user_query))  # much smaller when the prompt dominates
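What this looks like in code depends on your serving stack. A minimal sketch against a hypothetical local-inference interface (`model.prefill` and the `kv_cache` argument are stand-ins, not any particular library's API):

```python
SYSTEM_PROMPT = "You are a support assistant for Acme. Answer from the policy docs."

# Pay the system prompt's quadratic prefill cost once, at startup.
# Hypothetical call: returns the prompt's key/value cache.
system_cache = model.prefill(SYSTEM_PROMPT)

def answer(user_query: str) -> str:
    # Per request, only the query tokens are prefilled; they attend to the
    # cached system-prompt keys and values instead of recomputing them.
    return model.generate(user_query, kv_cache=system_cache)
```

Hosted APIs expose the same idea as prompt caching; the mechanics differ, but the effect is identical: the shared prefix's quadratic work is amortized across every request that reuses it.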
Chunked context
Instead of stuffing 50K tokens into one request, process in chunks:
# Bad: One massive context
response = model.generate(prompt=huge_document + query)

# Better: Summarize chunks, then query over the summaries
summaries = [model.summarize(chunk) for chunk in chunks]
response = model.generate(prompt="\n".join(summaries) + "\n" + query)
Trading one O(n²) pass over the full document for several much smaller O(k²) passes, where k is the chunk size.
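Chunking is usually done by token count so each piece stays under a fixed budget. A sketch assuming a `tokenizer` with `encode`/`decode` and the same loosely defined `model` as in the snippets above:

```python
def chunk_by_tokens(text: str, tokenizer, max_tokens: int = 2_000) -> list[str]:
    # Each chunk costs O(max_tokens^2) to process instead of O(len(document)^2).
    token_ids = tokenizer.encode(text)
    return [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

chunks = chunk_by_tokens(huge_document, tokenizer)
summaries = [model.summarize(chunk) for chunk in chunks]
response = model.generate(prompt="\n".join(summaries) + "\n" + query)
```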
Context selection
Most documents contain noise. Embed and retrieve only relevant sections:
# Instead of sending the whole document, retrieve only the relevant sections
relevant_chunks = retriever.search(query, document, top_k=5)
response = model.generate(prompt="\n".join(relevant_chunks) + "\n" + query)
5 chunks of 500 tokens (2,500 total) vs. full 50K document? 400x less compute.
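The `retriever` above can be as simple as an embedding index over pre-split chunks. A sketch using NumPy cosine similarity, with `embed` standing in for whatever embedding model you call:

```python
import numpy as np

def top_k_chunks(query: str, chunks: list[str], embed, k: int = 5) -> list[str]:
    # Score every chunk against the query by cosine similarity, keep the k best.
    chunk_vecs = np.array([embed(c) for c in chunks])
    query_vec = np.asarray(embed(query))
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

relevant_chunks = top_k_chunks(query, chunks, embed, k=5)
response = model.generate(prompt="\n".join(relevant_chunks) + "\n" + query)
```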
When to Pay the Tax
Sometimes the quadratic cost is worth it:
- Legal documents: Missing a clause matters
- Code review: Full file context catches bugs
- Long conversations: User expects continuity
But default to minimal context. Add more only when quality demands it.
The Future
Sub-quadratic alternatives exist: linear-attention variants, plus state-space and recurrent architectures like Mamba and RWKV. They scale O(n) instead of O(n²). The tradeoff: different quality characteristics, less battle-tested.
For now, standard attention dominates. The quadratic tax remains.
When someone asks to "just increase the context window," remember: 2x context = 4x compute. Plan accordingly.