How Speculative Decoding Works
A small model proposes tokens, a large model verifies in parallel. When predictions match, you get 2-3x speedup. When they don't, you're no worse off.
1 post tagged with "speedup"
A small model proposes tokens, a large model verifies in parallel. When predictions match, you get 2-3x speedup. When they don't, you're no worse off.