Why Tokens at Position 50K Get Ignored
Radio signals weaken with distance squared. A station that's crystal clear at 10 miles becomes static at 100 miles. The signal still exists, but it's too weak to be useful. The physics of propagation doesn't care about the radio's claimed range.
Positional encodings face a similar decay. A model might accept 128K tokens, but the influence of a token at position 50K on the token generated at position 100K can be negligible. The position encoding exists, but the attention signal has decayed below usefulness.
How Positional Decay Works
def positional_decay_mechanics():
return {
"attention_score_formula": """
score = Q @ K^T / sqrt(d)
Q and K include positional information.
When positions are far apart, their interaction term can be small.
""",
"rotary_embeddings_rope": {
"mechanism": "Rotate Q and K vectors based on position",
"decay": "cos(θ * Δposition) term in attention",
"effect": "Attention naturally decreases with position distance",
},
"alibi": {
"mechanism": "Add linear bias based on position difference",
"formula": "score = Q @ K^T - m * |i - j|",
"decay": "Explicit linear penalty for distance",
},
"learned_positional": {
"mechanism": "Learned embedding per position",
"problem": "Positions beyond training are extrapolation",
"decay": "Quality degrades for unseen positions",
},
}
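To make the distance terms concrete, here is a minimal NumPy sketch (toy dimensions, single head, illustrative values only) of ALiBi's linear penalty and a RoPE-style rotation. The helper names are mine, not from any particular library.
import numpy as np

def alibi_bias(raw_scores: np.ndarray, m: float = 0.1) -> np.ndarray:
    """ALiBi-style penalty: subtract m * |i - j| from every query/key score."""
    i, j = np.indices(raw_scores.shape)
    return raw_scores - m * np.abs(i - j)

def rope_rotate(x: np.ndarray, position: float, base: float = 10_000.0) -> np.ndarray:
    """RoPE-style rotation: rotate consecutive dimension pairs by position-dependent angles."""
    d = x.shape[-1]
    angles = position * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
# Each row gets an extra -m * |i - j| penalty added to the raw scores.
print(alibi_bias(rng.standard_normal((5, 5)))[0])

q, k = rng.standard_normal(64), rng.standard_normal(64)
for delta in (0, 100, 1_000, 10_000, 50_000):
    # RoPE scores depend only on the relative offset between query and key positions.
    score = rope_rotate(q, 100_000) @ rope_rotate(k, 100_000 - delta) / np.sqrt(64)
    print(f"Δposition={delta:>6}: score = {score:+.3f}")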
Measuring the Decay
def measuring_attention_decay():
return {
"experiment": """
For a model with 128K context:
1. Place a unique fact at various positions
2. Ask about that fact at position 128K
3. Measure retrieval accuracy by position
""",
"typical_results": {
"position_1000": {"accuracy": 0.95, "avg_attention": 0.01},
"position_10000": {"accuracy": 0.85, "avg_attention": 0.001},
"position_50000": {"accuracy": 0.60, "avg_attention": 0.0001},
"position_100000": {"accuracy": 0.40, "avg_attention": 0.00001},
},
"interpretation": """
Even though all positions are 'in context',
the effective attention to distant positions is minimal.
In the illustrative numbers above, a token at position 50K gets
~10x less attention than one at position 10K, and ~100x less than one at position 1K.
""",
}
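Plugging the illustrative figures from the table back in makes those ratios explicit (these are made-up numbers for the experiment sketch, not measurements):
# Illustrative avg_attention values from the table above (not real measurements).
avg_attention = {1_000: 0.01, 10_000: 0.001, 50_000: 0.0001, 100_000: 0.00001}
baseline = avg_attention[1_000]
for position, attention in avg_attention.items():
    print(f"position {position:>7}: attention {attention:.5f} "
          f"({baseline / attention:>5.0f}x weaker than position 1K)")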
RoPE and Extrapolation
def rope_extrapolation():
"""
Rotary Position Embeddings and their limits
"""
return {
"how_rope_works": """
RoPE encodes position by rotating Q and K vectors.
Rotation frequency varies by dimension.
Low-frequency dimensions: Change slowly with position
High-frequency dimensions: Change rapidly
""",
"extrapolation_problem": """
If model trained on 4K context:
- Positions 0-4K: Seen rotations, learned patterns
- Positions 4K-128K: Unseen rotations, must extrapolate
Extrapolation quality degrades with distance from training.
""",
"position_interpolation": {
"technique": "Scale positions to fit training range",
"formula": "scaled_position = position * (training_length / target_length)",
"example": "Position 128K → scaled to equivalent of position 4K",
"tradeoff": "Preserves learned patterns, reduces resolution",
},
"ntk_aware_scaling": {
"technique": "Adjust rotation frequencies, not positions",
"benefit": "Better preserves local relationships",
"used_by": "Many extended-context models",
},
}
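A rough sketch of the two context-extension tricks, assuming a 4K-trained model stretched to 128K. The NTK base adjustment below is one common formulation, not the only one, and the constants are placeholders.
import numpy as np

TRAIN_LEN, TARGET_LEN, DIM, BASE = 4_096, 131_072, 64, 10_000.0

def rope_angles(position: float, base: float = BASE, dim: int = DIM) -> np.ndarray:
    """Per-pair rotation angles for a given absolute position."""
    return position * base ** (-np.arange(0, dim, 2) / dim)

def interpolated_angles(position: int) -> np.ndarray:
    """Position interpolation: squeeze positions back into the trained range."""
    scaled_position = position * (TRAIN_LEN / TARGET_LEN)   # 131072 -> 4096
    return rope_angles(scaled_position)

def ntk_angles(position: int) -> np.ndarray:
    """NTK-aware scaling (one common formulation): keep positions, enlarge the base
    so low-frequency dims stretch while high-frequency dims stay close to original."""
    scale = TARGET_LEN / TRAIN_LEN
    new_base = BASE * scale ** (DIM / (DIM - 2))
    return rope_angles(position, base=new_base)

print(np.allclose(interpolated_angles(TARGET_LEN), rope_angles(TRAIN_LEN)))  # True
# The highest-frequency angle (index 0) is unchanged; lower frequencies stretch.
print(ntk_angles(TARGET_LEN)[:4])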
The Softmax Concentration Problem
def softmax_concentration():
return {
"problem": """
Softmax normalizes attention across all positions.
When attending to 128K positions:
Total attention must sum to 1.0
Even split: 1/128000 = 0.0000078 per position
But attention isn't even - it concentrates on few positions.
""",
"concentration_effect": """
If 1000 nearby positions each have score 10,
and 50000 distant positions each have score 1:
exp(10) / exp(1) ≈ 8,100, so after normalization each nearby position
gets roughly 8,100x the weight of each distant one.
The distant positions get effectively zero attention,
even though they're technically 'attended to'.
""",
"visualization": """
Attention distribution over a 128K context:
Attention
^
|██
|███
|█████
|█████████________________________________
+------------------------------------------> Distance from current token
0 (recent)                      (distant) 128K
Most attention goes to nearby recent tokens; distant tokens receive almost none.
""",
}
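The toy numbers above can be checked directly: with 1,000 positions at score 10 and 50,000 at score 1, the distant portion of the context collectively receives well under 1% of the attention mass.
import numpy as np

# The toy scores from above: 1,000 nearby positions at 10, 50,000 distant at 1.
scores = np.concatenate([np.full(1_000, 10.0), np.full(50_000, 1.0)])
weights = np.exp(scores - scores.max())
weights /= weights.sum()

print(f"per-position ratio: {np.exp(10 - 1):,.0f}x")            # ~8,103x
print(f"nearby 1,000 positions:   {weights[:1_000].sum():.1%}")  # ~99.4% of attention
print(f"distant 50,000 positions: {weights[1_000:].sum():.1%}")  # ~0.6% of attention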
Practical Implications
def practical_implications():
return {
"what_this_means": {
"context_advertising": "128K context ≠ 128K useful context",
"quality_gradient": "Information near end is more accessible",
"position_matters": "Same fact, different position = different accuracy",
},
"design_implications": [
"Put important info at start and end",
"Don't rely on raw middle of long context",
"Use retrieval for position-independent access",
"Test actual retrieval at your expected positions",
],
"when_long_context_works": [
"Recent context matters most (chat, streaming)",
"Hierarchical structure (summaries, then details)",
"Redundant information (fact mentioned multiple times)",
"Tasks that don't require distant retrieval",
],
"when_long_context_fails": [
"Finding needle in middle of haystack",
"Synthesizing info from distant positions",
"Tasks requiring precise retrieval from any position",
],
}
Mitigations
def mitigating_position_decay():
return {
"explicit_retrieval": {
"approach": "Don't stuff context, retrieve relevant portions",
"benefit": "Retrieved content goes in high-attention positions",
"implementation": "RAG, semantic search",
},
"position_rotation": {
"approach": "Rotate important content to end of context",
"benefit": "Leverages recency bias in attention",
"implementation": "Put query-relevant docs last",
},
"chunked_processing": {
"approach": "Process long docs in chunks, aggregate",
"benefit": "Each chunk gets full attention",
"implementation": "Map-reduce style summarization",
},
"architectural_solutions": {
"landmark_attention": "Mark important tokens for preserved attention",
"memory_tokens": "Compressed representations of distant context",
"hierarchical": "Different attention patterns at different scales",
},
"practical_recommendation": """
For most applications:
1. Keep context under 16K if possible
2. Use retrieval for long documents
3. Put most important content at end
4. Test actual retrieval at your positions
5. Don't trust context > 32K for precise retrieval
""",
}
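As one example, the "position_rotation" idea can be as simple as sorting retrieved passages so the most relevant one lands right before the question. The retriever and prompt format below are assumptions for illustration, not a specific library's API.
# `docs` is a list of (relevance_score, text) pairs from whatever retriever you use.
def build_prompt(query: str, docs: list[tuple[float, str]], max_docs: int = 5) -> str:
    top = sorted(docs, key=lambda d: d[0], reverse=True)[:max_docs]
    # Least relevant first, most relevant last, right before the question,
    # where attention is strongest.
    ordered = [text for _, text in reversed(top)]
    return "\n\n".join(ordered) + f"\n\nQuestion: {query}\nAnswer:"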
Testing Position Sensitivity
class PositionSensitivityTest:
"""
Test how position affects information retrieval
"""
def test_retrieval_by_position(
self,
model,
context_length: int,
positions: list
) -> dict:
"""Test fact retrieval at different positions"""
results = {}
for position_ratio in positions:
position = int(context_length * position_ratio)
# Create context with fact at specific position
fact = "The secret code is ALPHA-BETA-GAMMA."
context = self.create_context_with_fact(
total_length=context_length,
fact=fact,
fact_position=position
)
# Query for the fact
query = "What is the secret code?"
response = model.generate(context + "\n\n" + query)  # separate query from context
# Check if fact retrieved correctly
correct = "ALPHA-BETA-GAMMA" in response
results[position_ratio] = {
"position": position,
"correct": correct,
"response_snippet": response[:100],
}
return results
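    # The helper below is referenced above but not defined in the original post;
    # this is a minimal sketch that treats lengths as character counts and pads
    # with neutral filler so the fact lands near the requested offset.
    def create_context_with_fact(
        self,
        total_length: int,
        fact: str,
        fact_position: int
    ) -> str:
        filler = "The weather report mentioned scattered clouds that day. "
        before = (filler * (fact_position // len(filler) + 1))[:fact_position]
        tail_length = max(total_length - len(before) - len(fact), 0)
        after = (filler * (tail_length // len(filler) + 1))[:tail_length]
        return before + fact + " " + after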
def summarize_results(self, results: dict) -> str:
"""Identify effective context length"""
for ratio, result in sorted(results.items(), reverse=True):
if result["correct"]:
return f"Effective retrieval up to {ratio*100}% of context"
return "Retrieval failing at all positions tested"
Positional encodings have practical limits that context window sizes don't reveal. A 128K window doesn't mean 128K tokens of equally accessible context. Test your actual retrieval patterns at the positions you expect to use. The decay is real.