Why Tokens at Position 50K Get Ignored
Attention scores decay with distance. By position 50K, tokens may have near-zero influence. Positional encodings have practical limits, regardless of window size.
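One way to see this on a real model: export an attention map and bucket the mass each query assigns to keys by distance. A minimal sketch below, assuming a row-stochastic causal attention matrix; the random stand-in data won't show decay by itself, the function is what you'd run on real attention weights.

```python
import numpy as np

def attention_mass_by_distance(attn, bucket_size=512):
    """attn: [seq_len, seq_len] row-stochastic causal attention weights."""
    seq_len = attn.shape[0]
    q_idx, k_idx = np.tril_indices(seq_len)      # causal (query, key) pairs
    bucket = (q_idx - k_idx) // bucket_size      # distance bucket per pair
    mass = np.zeros(bucket.max() + 1)
    np.add.at(mass, bucket, attn[q_idx, k_idx])  # sum attention mass per bucket
    return mass / mass.sum()                     # fraction of attention by distance

# Stand-in data; in practice, export `attn` from a real attention layer.
rng = np.random.default_rng(0)
seq_len = 2048
logits = rng.normal(size=(seq_len, seq_len))
logits[~np.tril(np.ones((seq_len, seq_len), dtype=bool))] = -np.inf
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

print(attention_mass_by_distance(attn))
```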
Full attention is O(n²). Sliding window attention is O(n). The trade: lose long-range dependencies, gain linear scaling. Often worth it.
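A minimal sketch of the mask that makes this work; the window size is a hypothetical choice, not tied to any particular model. Each query attends to at most `window` previous tokens, so attention work grows as O(n · window) instead of O(n²).

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=3)
print(mask.astype(int))
# Row 7 attends only to positions 5, 6, 7. Anything farther back is dropped,
# which is exactly the long-range dependency traded for linear scaling.
```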
Self-attention lets a sequence talk to itself. Cross-attention lets one sequence attend to another. Understanding the difference enables better architectures.
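A minimal sketch, single-head, scaled dot-product only, no masking: the only structural difference is where the keys and values come from.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 64))   # one sequence (e.g. decoder states)
y = rng.normal(size=(20, 64))   # another sequence (e.g. encoder output)

self_attn  = attention(x, x, x)  # the sequence attends to itself
cross_attn = attention(x, y, y)  # queries from x, keys/values from y
print(self_attn.shape, cross_attn.shape)  # (10, 64) (10, 64)
```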
Most queries don't need the full context. Selecting the right 12% often preserves 95% of quality at a fraction of the cost and latency.
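One common way to do the selection is embedding similarity under a fixed budget. A sketch below: the 12% budget echoes the figure above but is a tunable knob, and the embeddings are stand-ins for whatever embedding model you already use.

```python
import numpy as np

def select_context(chunks, query_emb, chunk_embs, budget_fraction=0.12):
    # Cosine similarity between the query and each chunk embedding.
    sims = chunk_embs @ query_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    k = max(1, int(len(chunks) * budget_fraction))
    keep = np.argsort(sims)[-k:]
    # Preserve document order so the model sees chunks in their original sequence.
    return [chunks[i] for i in sorted(keep)]

# Hypothetical usage with random embeddings standing in for real ones.
rng = np.random.default_rng(0)
chunks = [f"chunk {i}" for i in range(50)]
chunk_embs = rng.normal(size=(50, 32))
query_emb = rng.normal(size=32)
print(select_context(chunks, query_emb, chunk_embs))  # keeps 6 of 50 chunks
```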
Models advertise 128K context windows. But attention quality degrades with distance. The last 10% of context often contributes less than the first 10%.
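One way to check this on your own stack: plant a known fact at different depths in a long prompt and measure whether the model retrieves it. A rough sketch, where `call_model` is a placeholder for your inference client and the recall check is deliberately crude.

```python
def position_sweep(filler_tokens, needle, question, call_model,
                   depths=(0.1, 0.5, 0.9)):
    """Insert `needle` at several relative depths and record retrieval success."""
    results = {}
    for depth in depths:
        cut = int(len(filler_tokens) * depth)
        prompt = " ".join(filler_tokens[:cut] + [needle] + filler_tokens[cut:])
        answer = call_model(prompt + "\n\n" + question)
        results[depth] = needle.split()[-1] in answer  # crude recall check
    return results
```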
Optimizing for compute when you're memory bound wastes effort. Optimizing for memory when you're compute bound wastes opportunity. Profile first, then optimize.
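The profiling can start as back-of-the-envelope roofline math. A sketch with illustrative A100-class numbers; swap in your own hardware specs. If the kernel's arithmetic intensity (FLOPs per byte moved) sits below the GPU's compute-to-bandwidth ratio, you're memory bound; above it, compute bound.

```python
peak_flops = 312e12      # ~312 TFLOPS FP16 tensor throughput (illustrative)
mem_bandwidth = 2.0e12   # ~2 TB/s HBM bandwidth (illustrative)
ridge_point = peak_flops / mem_bandwidth  # ~156 FLOPs/byte

# Decode-time GEMV: each FP16 weight (2 bytes) is read once for one multiply-add.
decode_intensity = 2 / 2  # 2 FLOPs per 2 bytes => 1 FLOP/byte

print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
print("memory bound" if decode_intensity < ridge_point else "compute bound")
```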
Tensor parallelism cuts latency by splitting layers across GPUs. Pipeline parallelism increases throughput by splitting the model into stages. Choose based on your constraint.
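Toy arithmetic with made-up numbers, just to show which metric each scheme moves; this is the shape of the trade, not a performance model.

```python
layers = 80
per_layer_ms = 0.5    # hypothetical per-layer compute time on one GPU
gpus = 4
allreduce_ms = 0.05   # hypothetical tensor-parallel communication cost per layer

single_gpu_latency = layers * per_layer_ms

# Tensor parallel: every layer's matmuls split across all GPUs, plus comms.
tp_latency = layers * (per_layer_ms / gpus + allreduce_ms)

# Pipeline parallel: each token still traverses every layer at full per-layer
# cost, so latency barely moves, but 4 stages can work on 4 requests at once.
pp_latency = layers * per_layer_ms
pp_throughput_ceiling = gpus  # upper bound, before bubbles and imbalance

print(f"1 GPU latency: {single_gpu_latency:.1f} ms")
print(f"TP latency:    {tp_latency:.1f} ms")
print(f"PP latency:    {pp_latency:.1f} ms, ~{pp_throughput_ceiling}x throughput ceiling")
```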
Four GPUs don't give you 4x throughput. Communication overhead, load imbalance, and synchronization eat into gains. Know the scaling curve before you buy.
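A toy scaling model with made-up overhead fractions, to show how the measured curve bends away from the ideal line as you add GPUs.

```python
def effective_speedup(gpus, comm_overhead=0.08, imbalance=0.05):
    # comm_overhead: fraction of step time lost to collectives per extra GPU
    # imbalance: fraction lost to the slowest rank and synchronization barriers
    lost = 1 + (comm_overhead + imbalance) * (gpus - 1)
    return gpus / lost

for n in (1, 2, 4, 8):
    print(n, round(effective_speedup(n), 2))
# With these made-up numbers, 4 GPUs land near 2.9x, not 4x.
```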
Four GPUs don't give you 4x the KV cache memory. Sharded weights, activation memory, and per-GPU framework overhead eat into the gains. Plan accordingly.
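Back-of-the-envelope budget for a hypothetical ~70B-parameter FP16 model on 4x80 GB GPUs; every number here is illustrative.

```python
gpus = 4
hbm_per_gpu_gb = 80
weights_gb = 140            # ~70B params * 2 bytes, sharded across the 4 GPUs
activations_gb_per_gpu = 8  # hypothetical activations + runtime buffers
reserve_gb_per_gpu = 6      # hypothetical CUDA context, fragmentation, comms buffers

total_hbm = gpus * hbm_per_gpu_gb  # 320 GB on paper
kv_budget = total_hbm - weights_gb - gpus * (activations_gb_per_gpu + reserve_gb_per_gpu)
print(f"{kv_budget} GB left for KV cache out of {total_hbm} GB")  # 124 of 320
```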
Every configuration lives on a quality-cost curve. Some are on the efficient frontier, most aren't. Map the frontier, then choose your spot deliberately.
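Mapping the frontier is a small computation once you have a (cost, quality) point per configuration. A sketch with made-up points: keep only configurations not dominated by a cheaper, equal-or-better option.

```python
def pareto_frontier(points):
    """points: list of (cost, quality). Returns the non-dominated subset."""
    frontier = []
    best_quality = float("-inf")
    for cost, quality in sorted(points):  # cheapest first
        if quality > best_quality:        # strictly better than anything cheaper
            frontier.append((cost, quality))
            best_quality = quality
    return frontier

configs = [(1.0, 0.70), (1.5, 0.82), (2.0, 0.80), (3.0, 0.90), (4.0, 0.89)]
print(pareto_frontier(configs))  # [(1.0, 0.7), (1.5, 0.82), (3.0, 0.9)]
```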