#latency

11 posts tagged with "latency"

How Speculative Decoding Works

A small model proposes tokens, and a large model verifies them in parallel. When predictions match, you get a 2-3x speedup. When they don't, you're no worse off.

Why Streaming Breaks and How to Fix It

Your code says streaming is enabled. Your monitoring shows 0% actual streams. The bytes are getting buffered somewhere between your model and the user's screen.