All Tags

#attention

9 posts tagged with "attention"

Why Tokens at Position 50K Get Ignored

Attention scores decay with distance. By position 50K, tokens may have near-zero influence. Positional encodings have practical limits, regardless of window size.
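
A rough numpy sketch of the effect, assuming an ALiBi-style linear distance penalty and uniform content scores (both illustrative assumptions, not the post's exact setup):

```python
import numpy as np

def alibi_attention_weights(n_keys: int, slope: float = 0.1):
    """Softmax weights for one query attending to n_keys past tokens,
    with a linear distance penalty on the scores (ALiBi-style, assumed)."""
    distances = np.arange(n_keys)[::-1]   # distance from the query to each key
    scores = -slope * distances           # uniform content score + distance bias
    scores -= scores.max()                # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

w = alibi_attention_weights(50_000)
print(f"weight on the most recent token: {w[-1]:.3e}")
print(f"weight on the token 50K back:    {w[0]:.3e}")  # underflows to zero
```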

Trading Full Context for Speed

Full attention is O(n²). Sliding window attention is O(n) for a fixed window size. The trade: lose long-range dependencies, gain linear scaling. Often worth it.
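
A minimal numpy sketch of the windowed computation (the function and its parameters are illustrative, not any particular library's API):

```python
import numpy as np

def sliding_window_attention(q, k, v, window: int):
    """Causal attention where each query sees only the last `window` keys.
    Work is O(n * window) instead of O(n^2)."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)                    # start of this query's window
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ v[lo:i + 1]
    return out

n, d = 1024, 64
q, k, v = (np.random.randn(n, d) for _ in range(3))
print(sliding_window_attention(q, k, v, window=128).shape)  # (1024, 64)
```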

Attention That Fits in Memory

Standard attention needs O(n²) memory. Memory-efficient variants need O(n). Same output, 10x less peak memory.
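
A sketch of the chunked, online-softmax trick behind such variants (illustrative code, not any specific library's kernel):

```python
import numpy as np

def chunked_attention(q, k, v, chunk: int = 256):
    """Exact attention without materializing the full n x n score matrix:
    keys/values are streamed in chunks and merged with a running softmax.
    Peak extra memory per step is O(chunk), not O(n^2)."""
    nq, d = q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full(nq, -np.inf)   # running max of scores per query
    denom = np.zeros(nq)                 # running softmax denominator
    acc = np.zeros_like(q)               # running weighted sum of values

    for start in range(0, k.shape[0], chunk):
        s = q @ k[start:start + chunk].T * scale            # (nq, chunk) only
        new_max = np.maximum(running_max, s.max(axis=1))
        correction = np.exp(running_max - new_max)          # rescale old stats
        p = np.exp(s - new_max[:, None])
        denom = denom * correction + p.sum(axis=1)
        acc = acc * correction[:, None] + p @ v[start:start + chunk]
        running_max = new_max
    return acc / denom[:, None]

n, d = 2048, 64
q, k, v = (np.random.randn(n, d) for _ in range(3))
out = chunked_attention(q, k, v)

# Matches full attention to numerical precision.
s = q @ k.T / np.sqrt(d)
ref = np.exp(s - s.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
print(np.allclose(out, ref))  # True
```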

What Flash Attention Actually Does

Flash Attention doesn't make the math cheaper. It makes attention fit in memory. The speedup is a side effect of better memory access.
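
Back-of-the-envelope numbers, assuming fp16 scores and a 128×128 tile (both illustrative, not the kernel's actual configuration):

```python
# Size of the intermediate score matrix that standard attention materializes
# in off-chip memory, versus the per-tile working set a Flash-style kernel
# keeps on chip.
n = 50_000                      # sequence length
bytes_per_score = 2             # fp16 (assumed)
full_matrix = n * n * bytes_per_score
tile = 128 * 128 * bytes_per_score

print(f"full n x n score matrix: {full_matrix / 1e9:.1f} GB")  # ~5.0 GB
print(f"one 128 x 128 tile:      {tile / 1e3:.1f} KB")         # ~32.8 KB
```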