Why 128K Context Doesn't Mean 128K Useful

Stadium speakers can technically project sound to the back row. But the clarity drops, the bass gets muddy, and people in the nosebleeds struggle to understand the announcer. Technical capability doesn't equal quality experience.

Long context models face the same gap. A 128K context window means the model can technically attend to 128K tokens. It doesn't mean it attends to them equally well. Attention quality degrades with distance. Information in position 1,000 influences the output differently than information in position 100,000.

The Attention Distance Problem

def attention_distance_effect():
    return {
        "how_attention_works": """
            Attention scores determine how much each position influences the output.
            Score = softmax(Q * K^T / sqrt(d))

            In theory: All positions equally accessible
            In practice: Nearby positions dominate
        """,

        "distance_degradation": {
            "1k_tokens": "Strong attention, clear influence",
            "10k_tokens": "Moderate attention, some decay",
            "50k_tokens": "Weak attention, often ignored",
            "100k_tokens": "Very weak, lucky if attended to",
        },

        "why_this_happens": [
            "Positional encoding biases toward recency",
            "Softmax concentrates probability on few positions",
            "Training data rarely has long-range dependencies",
            "Model learns to 'ignore' distant context",
        ],
    }
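
To make the decay concrete, here is a toy illustration (not a measurement of any real model): scaled dot-product attention from the final position, with an ALiBi-style linear distance penalty standing in for the recency bias that trained models exhibit. Nearly all the attention mass lands on the nearest positions.

import numpy as np

def toy_attention_decay(seq_len: int = 1000, d_k: int = 64, slope: float = 0.05):
    """Toy demo: attention weights from the last position over seq_len keys,
    with a linear distance penalty (ALiBi-style). Illustrative only."""
    rng = np.random.default_rng(0)
    q = rng.normal(size=d_k)                   # query at the final position
    K = rng.normal(size=(seq_len, d_k))        # random keys
    scores = K @ q / np.sqrt(d_k)              # scaled dot-product scores
    distance = np.arange(seq_len - 1, -1, -1)  # distance of each key from the query
    scores -= slope * distance                 # linear recency bias
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax
    near = weights[-(seq_len // 10):].sum()    # mass on the nearest 10% of positions
    print(f"Attention mass on nearest 10% of positions: {near:.1%}")

toy_attention_decay()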

Measuring Effective Context

class EffectiveContextMeasurement:
    """
    Measure how much of context actually gets used
    """

    def needle_in_haystack(
        self,
        model,
        context_lengths: list[int],
        positions: list[float],
    ) -> dict:
        """
        Test retrieval accuracy at different positions.
        (create_haystack, create_retrieval_question, and check_retrieval are
        assumed helpers that plant a known fact and grade the model's answer.)
        """
        results = {}

        for context_len in context_lengths:
            results[context_len] = {}
            for position_ratio in positions:
                position = int(context_len * position_ratio)

                # Insert fact at position
                context = self.create_haystack(context_len, position)
                question = self.create_retrieval_question()

                # Test if model can retrieve
                response = model.generate(context + question)
                accuracy = self.check_retrieval(response)

                results[context_len][position_ratio] = accuracy

        return results

    def typical_findings(self) -> dict:
        """What needle-in-haystack typically shows"""
        return {
            "4k_context": {
                "beginning": 0.98,
                "middle": 0.95,
                "end": 0.97,
            },
            "32k_context": {
                "beginning": 0.95,
                "middle": 0.75,
                "end": 0.90,
            },
            "128k_context": {
                "beginning": 0.90,
                "middle": 0.50,
                "end": 0.85,
            },
            "pattern": "Middle gets lost at long contexts",
        }
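
A minimal driver for the harness above, assuming a model object that exposes generate(prompt) -> str (any client wrapper will do):

# Sketch: run the needle test over a grid and print accuracy per position.
measure = EffectiveContextMeasurement()
results = measure.needle_in_haystack(
    model,
    context_lengths=[4_000, 32_000, 128_000],
    positions=[0.0, 0.25, 0.5, 0.75, 1.0],
)
for length, by_position in results.items():
    row = "  ".join(f"{pos:.2f}: {acc:.2f}" for pos, acc in by_position.items())
    print(f"{length:>7} tokens | {row}")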

The U-Shaped Attention Curve

def attention_distribution():
    return {
        "typical_pattern": """
            Attention follows a U-shape over position:

            High attention at beginning (primacy effect)
            Low attention in middle (lost in context)
            High attention at end (recency effect)

            This means:
            - System prompt (beginning) gets attended
            - Recent turns (end) get attended
            - Historical context (middle) may be ignored
        """,

        "implications": {
            "good_for": [
                "System prompts (always at start)",
                "Recent conversation (always at end)",
                "Summary-based context (compressed)",
            ],
            "bad_for": [
                "Raw conversation history",
                "Long documents requiring full reading",
                "Information retrieval from any position",
            ],
        },

        "visualization": """
            Attention
            ^
            |__
            |  \\_______________
            |                   \\____
            |                        \\__
            +-------------------------> Position
            Start              Middle               End
        """,
    }

Practical Context Strategies

def context_strategies():
    return {
        "prioritize_position": {
            "strategy": "Put important info at start and end",
            "implementation": """
                context = [
                    system_prompt,           # Position 0 (high attention)
                    relevant_retrieved,      # Early positions
                    compressed_history,      # Middle (lower attention)
                    recent_messages,         # End (high attention)
                    current_query,           # Very end
                ]
            """,
            "why_works": "Leverages U-shaped attention",
        },

        "compress_middle": {
            "strategy": "Summarize historical context",
            "implementation": """
                if context_length > 32000:
                    # Summarize middle portions
                    middle = context[10000:context_length-10000]
                    summary = summarize(middle)
                    context = context[:10000] + summary + context[-10000:]
            """,
            "why_works": "Preserves info in attended positions",
        },

        "retrieval_augmented": {
            "strategy": "Don't stuff context, retrieve on demand",
            "implementation": """
                # Instead of full history in context
                relevant = retrieve_relevant(query, history)
                context = system_prompt + relevant + recent + query
            """,
            "why_works": "Only relevant info, in good positions",
        },
    }
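
Putting the first two strategies together, here is a minimal runnable sketch of position-aware assembly. It assumes a summarize helper (an LLM call or an extractive pass) and counts tokens crudely by whitespace; swap in a real tokenizer for anything serious.

def build_context(system_prompt: str, retrieved: str, history: str,
                  recent: str, query: str, budget: int = 8000) -> str:
    """Assemble context so high-value pieces sit at the start and end,
    compressing the poorly attended middle when over budget."""
    def tokens(text: str) -> int:
        return len(text.split())  # crude proxy; use a real tokenizer in practice

    parts = [system_prompt, retrieved, history, recent, query]
    if sum(tokens(p) for p in parts) > budget:
        history = summarize(history)  # assumed helper: compress the middle
    return "\n\n".join([system_prompt, retrieved, history, recent, query])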

Quality vs Context Length Tradeoff

def quality_context_tradeoff():
    """
    Longer context often means worse per-token quality
    """
    return {
        "measurements": {
            "4k_context": {"quality": 0.95, "cost": 1.0},
            "32k_context": {"quality": 0.90, "cost": 8.0},
            "128k_context": {"quality": 0.80, "cost": 32.0},
        },

        "why_quality_drops": [
            "Attention spread thinner across more tokens",
            "Signal-to-noise ratio decreases",
            "Model 'distracted' by irrelevant context",
            "Positional encodings less reliable at distance",
        ],

        "guidance": """
            More context is not always better.

            Consider:
            1. Does task actually need long context?
            2. Is info in context actually relevant?
            3. Would summarization preserve needed info?
            4. Is retrieval better than stuffing?

            Often 8K of relevant context beats 128K of everything.
        """,
    }
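
Running the illustrative numbers above through a quick quality-per-unit-cost calculation makes the point starkly:

# Quality per unit cost, using the illustrative measurements above
measurements = {
    "4k":   {"quality": 0.95, "cost": 1.0},
    "32k":  {"quality": 0.90, "cost": 8.0},
    "128k": {"quality": 0.80, "cost": 32.0},
}
for name, m in measurements.items():
    print(f"{name:>5}: quality/cost = {m['quality'] / m['cost']:.3f}")
# 4k: 0.950 | 32k: 0.113 | 128k: 0.025 -- 32x the spend for lower quality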

Testing Your Context Usage

class ContextEffectivenessTest:
    """
    Test if your context is actually being used
    """

    def ablation_test(self, model, context: str, query: str) -> dict:
        """
        Test response quality with different context lengths
        """
        full_response = model.generate(context + query)

        results = {
            "full_context": {
                "length": len(context),
                "response": full_response,
            }
        }

        # Test with progressively shorter context
        for ratio in [0.75, 0.5, 0.25, 0.1]:
            shortened = self.truncate_context(context, ratio)
            response = model.generate(shortened + query)

            results[f"{int(ratio*100)}%_context"] = {
                "length": len(shortened),
                "response": response,
                "quality_diff": self.compare_quality(full_response, response),
            }

        return results

    def analyze_results(self, results: dict) -> str:
        """Interpret ablation results"""
        quality_at_50 = results["50%_context"]["quality_diff"]

        if quality_at_50 > 0.95:
            return "Context beyond 50% not contributing significantly"
        elif quality_at_50 > 0.80:
            return "Full context provides moderate benefit"
        else:
            return "Full context is being utilized"

Architectural Approaches

def architectural_solutions():
    return {
        "sliding_window": {
            "approach": "Only attend to recent N tokens",
            "example": "Mistral uses 4K sliding window",
            "tradeoff": "Infinite context but no long-range deps",
        },

        "landmark_attention": {
            "approach": "Mark important tokens for long-range attention",
            "example": "LongLLaMA, Landmark tokens",
            "tradeoff": "Better long-range but needs markers",
        },

        "hierarchical_attention": {
            "approach": "Summarize distant context hierarchically",
            "example": "LLMA, Focused Transformer",
            "tradeoff": "Compressed representation of history",
        },

        "retrieval_augmented": {
            "approach": "External retrieval instead of stuffed context",
            "example": "RAG, RETRO",
            "tradeoff": "Retrieval quality becomes critical",
        },

        "practical_choice": """
            For most applications:
            1. Start with 8-16K context
            2. Use retrieval for longer documents
            3. Put important info at start/end
            4. Summarize historical context
            5. Test if longer context actually helps
        """,
    }

The 95/12 Heuristic

def the_95_12_heuristic():
    """
    Often get 95% of quality with 12% of context
    """
    return {
        "observation": """
            For many tasks:
            - Full 100K context gives 100% quality
            - Best 12K of that context gives 95% quality
            - Random 12K gives 80% quality

            The key is selecting the RIGHT 12K.
        """,

        "implications": {
            "retrieval": "Good retrieval > long context",
            "cost": "12% context = 12% cost = 95% quality",
            "latency": "12% context = much lower latency",
        },

        "how_to_select": [
            "Recency (recent messages)",
            "Relevance (semantic similarity to query)",
            "Importance (system prompt, key facts)",
            "Structure (beginnings of documents)",
        ],
    }
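
A sketch of that selection logic, assuming an embedding model supplies a vector per chunk: score each chunk by relevance (cosine similarity to the query) and recency, keep pinned items (system prompt, key facts) unconditionally, then fill the budget with the top scorers.

import numpy as np

def select_context(chunks: list[dict], query_vec: np.ndarray,
                   budget_tokens: int) -> list[dict]:
    """Pick the highest-value chunks under a token budget.
    Each chunk: {"text", "tokens", "vec", "age", "pinned"} -- `vec` from an
    assumed embedding model, `age` in turns since the chunk appeared."""
    def score(chunk: dict) -> float:
        relevance = float(chunk["vec"] @ query_vec /
                          (np.linalg.norm(chunk["vec"]) * np.linalg.norm(query_vec)))
        recency = 1.0 / (1 + chunk["age"])
        return 0.7 * relevance + 0.3 * recency  # weights are a starting point, not gospel

    pinned = [c for c in chunks if c.get("pinned")]  # system prompt, key facts
    rest = sorted((c for c in chunks if not c.get("pinned")), key=score, reverse=True)
    selected, used = pinned[:], sum(c["tokens"] for c in pinned)
    for c in rest:
        if used + c["tokens"] <= budget_tokens:
            selected.append(c)
            used += c["tokens"]
    return selected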

Long context is a capability, not a recommendation. Just because you can fit 128K tokens doesn't mean you should. Test whether your full context improves results. Often, smart selection of relevant context beats brute-force inclusion of everything.