Why Doubling Context Quadruples Your Problems
A product manager asks: "Can we double the context window from 8K to 16K?" Seems reasonable. More context means better responses, right?
Here's what they don't know: doubling context doesn't double compute. It quadruples it.
The Math
Standard transformer attention is O(n²) in sequence length. Every token attends to every other token.
| Context Length | Attention Operations | Relative Cost |
|---|---|---|
| 4K tokens | 16M | 1x |
| 8K tokens | 64M | 4x |
| 16K tokens | 256M | 16x |
| 32K tokens | 1,024M | 64x |
| 128K tokens | 16,384M | 1,024x |
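To sanity-check those numbers, treat "attention operations" as the n² token-pair interactions (per layer, per head, with all constants dropped) and compare each length against the 4K baseline:

```python
# Attention cost grows with the square of the context length.
BASELINE = 4_096

for n in (4_096, 8_192, 16_384, 32_768, 131_072):
    ops_m = n**2 / 2**20               # "M" as in the table above (2^20)
    relative = (n / BASELINE) ** 2
    print(f"{n:>7} tokens: {ops_m:>6.0f}M pairwise ops, {relative:>5.0f}x the 4K cost")
```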
That 128K context window on GPT-4? It costs 1,024x more attention compute than a 4K context.
Where It Hurts
The quadratic cost hits hardest during prefill, when the model processes your entire prompt.
A 2,000-token prompt prefills in around 100ms. A 32,000-token document? Closer to 4 seconds. That's roughly 40x slower, not the 16x that linear scaling would predict: attention does 256x more work, but hardware parallelism and the model's linear-cost layers keep the wall-clock penalty somewhere in between.
Still, you feel it.
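Why roughly 40x rather than 16x or 256x? Prefill also has a linear component (feed-forward layers, projections) alongside the quadratic attention term, and which one dominates depends on prompt length. A toy model makes the shape visible; the constants below are illustrative, chosen only to reproduce the numbers above, not measured on any real hardware:

```python
def toy_prefill_ms(n_tokens: int,
                   linear_ms_per_token: float = 0.045,
                   quad_ms_per_pair: float = 2.5e-6) -> float:
    # Linear term ~ feed-forward and projection work; quadratic term ~ attention
    # scores. Both constants are made up for illustration.
    return n_tokens * linear_ms_per_token + n_tokens**2 * quad_ms_per_pair

print(toy_prefill_ms(2_000))   # 100.0 ms -- the linear term dominates
print(toy_prefill_ms(32_000))  # 4000.0 ms -- the quadratic term now dominates
# 16x more tokens ends up ~40x slower: between linear (16x) and pure quadratic (256x).
```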
Why Teams Get Surprised
Context window limits have been increasing rapidly:
- GPT-3: 2K tokens, then 4K in the 3.5-era models
- GPT-4: 8K, then 32K, then 128K
- Claude: 100K, then 200K
Marketing says "200K context!" Engineers assume costs scale linearly. Finance discovers the bill.
The Real Budget
Think of context as a budget with diminishing returns:
- First 2K tokens: Essential (system prompt, user query)
- Next 4K tokens: Useful (immediate context)
- Next 8K tokens: Marginally useful (background info)
- Beyond 16K: Often noise (full document dumps)
Each doubling costs 4x more. Is that next doubling worth it?
What Actually Helps
Prefix caching
If your system prompt is constant across requests, cache its KV representation. Pay the quadratic cost once, reuse forever.
# Without prefix caching: every request prefills the system prompt and query together
cost_per_request = O((system_prompt + user_query)²)

# With prefix caching: the system prompt's KV cache is reused; only the query
# tokens are prefilled, and they attend to the cached prefix instead of recomputing it
cost_per_request = O(user_query × (system_prompt + user_query))  # much smaller when the prompt dominates
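What this looks like in code depends on your serving stack. A minimal sketch against a hypothetical local-inference interface (`model.prefill` and the `kv_cache` argument are stand-ins, not any particular library's API):

```python
SYSTEM_PROMPT = "You are a support assistant for Acme. Answer from the policy docs."

# Pay the system prompt's quadratic prefill cost once, at startup.
# Hypothetical call: returns the prompt's key/value cache.
system_cache = model.prefill(SYSTEM_PROMPT)

def answer(user_query: str) -> str:
    # Per request, only the query tokens are prefilled; they attend to the
    # cached system-prompt keys and values instead of recomputing them.
    return model.generate(user_query, kv_cache=system_cache)
```

Hosted APIs expose the same idea as prompt caching; the mechanics differ, but the effect is identical: the shared prefix's quadratic work is amortized across every request that reuses it.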
Chunked context
Instead of stuffing 50K tokens into one request, process in chunks:
# Bad: One massive context
response = model.generate(prompt=huge_document + query)

# Better: Summarize chunks, then query over the summaries
summaries = [model.summarize(chunk) for chunk in chunks]
response = model.generate(prompt="\n".join(summaries) + "\n" + query)
Trading one O(n²) pass over the full document for several much smaller O(k²) passes, where k is the chunk size.
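Chunking is usually done by token count so each piece stays under a fixed budget. A sketch assuming a `tokenizer` with `encode`/`decode` and the same loosely defined `model` as in the snippets above:

```python
def chunk_by_tokens(text: str, tokenizer, max_tokens: int = 2_000) -> list[str]:
    # Each chunk costs O(max_tokens^2) to process instead of O(len(document)^2).
    token_ids = tokenizer.encode(text)
    return [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

chunks = chunk_by_tokens(huge_document, tokenizer)
summaries = [model.summarize(chunk) for chunk in chunks]
response = model.generate(prompt="\n".join(summaries) + "\n" + query)
```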
Context selection
Most documents contain noise. Embed and retrieve only relevant sections:
# Instead of sending the whole document, retrieve only the relevant sections
relevant_chunks = retriever.search(query, document, top_k=5)
response = model.generate(prompt="\n".join(relevant_chunks) + "\n" + query)
5 chunks of 500 tokens (2,500 total) vs. full 50K document? 400x less compute.
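The `retriever` above can be as simple as an embedding index over pre-split chunks. A sketch using NumPy cosine similarity, with `embed` standing in for whatever embedding model you call:

```python
import numpy as np

def top_k_chunks(query: str, chunks: list[str], embed, k: int = 5) -> list[str]:
    # Score every chunk against the query by cosine similarity, keep the k best.
    chunk_vecs = np.array([embed(c) for c in chunks])
    query_vec = np.asarray(embed(query))
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

relevant_chunks = top_k_chunks(query, chunks, embed, k=5)
response = model.generate(prompt="\n".join(relevant_chunks) + "\n" + query)
```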
When to Pay the Tax
Sometimes the quadratic cost is worth it:
- Legal documents: Missing a clause matters
- Code review: Full file context catches bugs
- Long conversations: User expects continuity
But default to minimal context. Add more only when quality demands it.
The Future
Sub-quadratic alternatives exist: linear-attention variants, plus state-space and recurrent architectures like Mamba and RWKV. They scale O(n) instead of O(n²). The tradeoff: different quality characteristics, less battle-tested.
For now, standard attention dominates. The quadratic tax remains.
When someone asks to "just increase the context window," remember: 2x context = 4x compute. Plan accordingly.