Trading Full Context for Speed
Full attention is O(n²). Sliding window attention is O(n). The trade: lose long-range dependencies, gain linear scaling. Often worth it.
1 post tagged with "sliding-window"
Full attention is O(n²). Sliding window attention is O(n). The trade: lose long-range dependencies, gain linear scaling. Often worth it.