How to Think About Context as a Budget
Your working memory holds about 7 items. Try to juggle more, and you start dropping things. LLM context works the same way, except the limit is 128K tokens and the drop-off is more gradual.
The difference is what happens when you approach that limit. Human working memory just... fails. LLM attention degrades. Tokens at position 50,000 get less attention than tokens at position 500. The model still processes them. It just doesn't weight them the same.
The Quadratic Reality
Standard transformer attention is O(n²) in sequence length. Every token attends to every other token. Double your context, quadruple your compute.
| Context | Attention ops | Relative cost |
|---|---|---|
| 4K | 16M | 1x |
| 16K | 256M | 16x |
| 64K | 4B | 256x |
| 128K | 16B | 1,024x |
That 128K context window your provider advertises? Filling it costs roughly 1,024 times the attention compute of a 4K context. Per-token pricing hides this because the price scales linearly with tokens while attention compute doesn't. But the compute is real.
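A back-of-the-envelope check of those numbers (attention score count only, ignoring the hidden dimension and constant factors):

```python
def attention_ops(context_tokens: int) -> int:
    # Full self-attention: every token attends to every other token.
    return context_tokens ** 2

base = attention_ops(4 * 1024)
for k in (4, 16, 64, 128):
    ops = attention_ops(k * 1024)
    print(f"{k:>3}K context: {ops:>14,} ops, {ops // base:>5}x relative cost")
```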
Diminishing Returns
Context has diminishing utility. The first 2K tokens are usually essential: system prompt, user query, immediate context. The next 4K are useful: relevant background, recent conversation. Beyond that, you're mostly adding noise:

- First 2K: essential (instructions, query)
- Next 4K: useful (immediate context)
- Next 8K: marginally useful (background)
- Beyond 16K: often noise (document dumps)
Each doubling costs 4x more. Is the marginal context worth 4x the compute?
For most queries, no. But teams reach for max context because it's easier than thinking about what actually needs to be there.
The Budget Mindset
Treat context like a budget you allocate:
```python
CONTEXT_BUDGET = 8192  # your per-request token budget

def allocate_context(system_prompt, user_query, retrieved_docs):
    remaining = CONTEXT_BUDGET

    # System prompt and user query: non-negotiable.
    # tokenize() stands in for your model's real tokenizer.
    remaining -= len(tokenize(system_prompt))
    remaining -= len(tokenize(user_query))

    # Retrieved docs: fit what you can, ranked by relevance.
    included_docs = []
    for doc in sorted(retrieved_docs, key=lambda d: d.relevance, reverse=True):
        doc_tokens = len(tokenize(doc.content))
        if doc_tokens <= remaining:
            included_docs.append(doc)
            remaining -= doc_tokens
        else:
            # Doesn't fit: skip it (or truncate it); a smaller,
            # lower-ranked doc may still fit in the remaining budget.
            continue

    return system_prompt, user_query, included_docs
```
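A usage sketch, with a hypothetical Doc type and a whitespace tokenizer standing in for your retrieval layer and real tokenizer:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    content: str
    relevance: float

def tokenize(text: str) -> list[str]:
    # Crude stand-in; swap in your model's actual tokenizer (e.g. tiktoken).
    return text.split()

docs = [
    Doc("Long background document ...", relevance=0.42),
    Doc("Directly relevant passage ...", relevance=0.91),
]
system, query, included = allocate_context(
    "You are a support assistant.", "How do I reset my password?", docs
)
```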
Explicit budgeting forces prioritization. What's actually important? What can be summarized? What can be dropped?
What Belongs in Context
Not everything deserves context space.
Belongs: Information the model needs to generate this specific response. User's question. Relevant retrieved passages. Current conversation turns. Explicit instructions.
Doesn't belong: Information you're including "just in case." Full documents when excerpts would work. Twenty conversation turns when five are relevant. Verbose formatting that could be concise.
The test: if you removed this chunk, would the response quality drop? If not, it's burning budget for nothing.
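One way to run that test systematically is a leave-one-out ablation: drop each retrieved chunk, regenerate, and see whether the eval score moves. A minimal sketch, assuming hypothetical model.generate, evaluate, and build_prompt helpers like the placeholders used later in this piece:

```python
def chunk_contribution(chunks, query, expected):
    # Score with everything included, then with each chunk removed.
    baseline = evaluate(model.generate(build_prompt(query, chunks)), expected)
    deltas = {}
    for i in range(len(chunks)):
        ablated = chunks[:i] + chunks[i + 1:]
        score = evaluate(model.generate(build_prompt(query, ablated)), expected)
        deltas[i] = baseline - score  # how much quality this chunk was buying
    return deltas  # chunks with near-zero deltas are burning budget for nothing
```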
Compression Strategies
When you need more information than fits:
Summarization: Compress long documents into summaries. A 10K document becomes 500 tokens of summary. You lose detail but keep the signal.
```python
def summarize_for_context(long_doc, max_tokens=500):
    # Use a fast, cheap model for summarization.
    # fast_model is a placeholder for whatever small model you run.
    summary = fast_model.generate(
        f"Summarize the following in at most {max_tokens} tokens:\n\n{long_doc}"
    )
    return summary
```
Chunking + retrieval: Split documents into chunks, embed them, and retrieve only the relevant ones. Five 500-token chunks total 2,500 tokens, not 50,000.
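A sketch of that pipeline, assuming a hypothetical embed() call for whatever embedding model you use; word-based chunking is a crude proxy for real token counts:

```python
import numpy as np

def chunk_text(doc: str, chunk_words: int = 500) -> list[str]:
    # Split on whitespace; words stand in for tokens.
    words = doc.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]

def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    # Rank chunks by cosine similarity to the query embedding.
    qv = np.asarray(embed(query))
    sims = []
    for chunk in chunks:
        cv = np.asarray(embed(chunk))
        sims.append(float(qv @ cv / (np.linalg.norm(qv) * np.linalg.norm(cv))))
    ranked = sorted(range(len(chunks)), key=lambda i: sims[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]
```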
Hierarchical context: Keep summaries of older conversation turns. Expand recent turns fully. The model sees recent history in detail, older history in outline.
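A sketch of that layout, assuming turns are role/content dicts and reusing the summarize_for_context helper above:

```python
def build_history(turns, keep_recent: int = 5):
    # Recent turns pass through verbatim; older turns collapse into one summary.
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    history = []
    if older:
        older_text = "\n".join(f"{t['role']}: {t['content']}" for t in older)
        summary = summarize_for_context(older_text, max_tokens=300)
        history.append({"role": "system",
                        "content": f"Summary of earlier conversation: {summary}"})
    history.extend(recent)
    return history
```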
The Prefix Cache Advantage
If your system prompt is constant across requests, prefix caching changes the economics. The prompt's keys and values are computed once and reused; each later request pays attention cost only for its new tokens, which attend over the cached prefix instead of recomputing it.
```
# Without prefix caching
cost = O((system_prompt + query)²)         # prefix recomputed on every request

# With prefix caching
cost = O(query × (system_prompt + query))  # cached prefix reused; only new tokens computed
```
This makes generous system prompts cheap in ongoing compute terms: the expensive prefill over a 4K system prompt is paid once, and every subsequent request only pays for its new tokens attending over the cached prefix.
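To see the economics concretely, a rough prefill cost model (attention ops only, constants ignored):

```python
def prefill_attention_ops(prefix_tokens: int, new_tokens: int, cached: bool) -> int:
    total = prefix_tokens + new_tokens
    if not cached:
        return total ** 2      # everything recomputed
    return new_tokens * total  # new tokens attend over cached prefix + themselves

# 4K system prompt, 500-token query:
print(prefill_attention_ops(4096, 500, cached=False))  # 21,123,216 ops every request
print(prefill_attention_ops(4096, 500, cached=True))   #  2,298,000 ops after the first request
```

Roughly a 9x reduction in this example, and the larger the static prefix relative to the query, the bigger the win.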
When to Use Full Context
Sometimes you need every token:
Legal document review: Missing a clause matters. Full document, full attention.
Code debugging: The bug could be anywhere. Full file context catches what summaries miss.
Long-form writing: The model needs to maintain consistency across thousands of words.
But these are exceptions. Default to minimal context. Add more only when quality demonstrably improves.
Measuring Context ROI
Track this: does adding more context improve your evals?
```python
from statistics import mean

def context_ablation_study(test_cases, context_sizes):
    # model and evaluate() are placeholders for your model client and eval harness.
    results = {}
    for size in context_sizes:
        scores = []
        for case in test_cases:
            # Assumes case.context is a token list; keep only the first `size` tokens.
            truncated_context = case.context[:size]
            response = model.generate(truncated_context)
            score = evaluate(response, case.expected)
            scores.append(score)
        results[size] = mean(scores)
    return results

# If 8K context scores 95% and 32K scores 96%,
# is 4x the context (and 16x the attention compute) worth 1% quality?
```
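Run it over sizes you'd actually consider shipping. Assuming evaluate() returns scores in [0, 1], something like:

```python
results = context_ablation_study(test_cases, [2048, 4096, 8192, 16384, 32768])
for size, score in sorted(results.items()):
    print(f"{size:>6} tokens: {score:.1%}")
```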
Often, quality plateaus well before max context. You're paying for tokens the model barely attends to.
Context is a budget. Spend it on what matters. Save it where you can. The constraint isn't the window size your provider advertises. It's the compute you're willing to burn.