When to Use FP8 for Inference
H100's FP8 gives near-FP16 quality at near-INT8 speed. It's becoming the new default. Here's when and how to use it.
3 posts tagged with "inference"
End-to-end latency (E2EL) = time to first token (TTFT) + generation time. It sounds simple, but where does that time actually go? Breaking the equation down shows where to optimize.
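To make the equation concrete, here is a minimal sketch of the decomposition. The helper names (`LatencyBreakdown`, `estimate_latency`), the per-token decode time `tpot_s`, and all the numbers are illustrative assumptions, not measurements. It also assumes the first output token is covered by TTFT, so generation time is the remaining tokens times the per-token time.

```python
from dataclasses import dataclass


@dataclass
class LatencyBreakdown:
    ttft_s: float        # time to first token, in seconds
    generation_s: float  # time spent producing the remaining tokens

    @property
    def e2el_s(self) -> float:
        # E2EL = TTFT + generation time
        return self.ttft_s + self.generation_s

    @property
    def generation_share(self) -> float:
        # Fraction of end-to-end latency spent after the first token
        return self.generation_s / self.e2el_s


def estimate_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> LatencyBreakdown:
    # Assumption: the first token is accounted for by TTFT, so only
    # (output_tokens - 1) tokens each cost tpot_s of decode time.
    generation_s = max(output_tokens - 1, 0) * tpot_s
    return LatencyBreakdown(ttft_s=ttft_s, generation_s=generation_s)


# Hypothetical request: 200 ms TTFT, 20 ms per token, 101 output tokens
breakdown = estimate_latency(ttft_s=0.2, tpot_s=0.02, output_tokens=101)
print(f"E2EL: {breakdown.e2el_s:.2f} s")
print(f"Share spent generating: {breakdown.generation_share:.0%}")
```

Under these made-up numbers most of the time goes to generation, so shaving TTFT alone would barely move E2EL. With short outputs the balance flips.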
Your monitoring dashboard shows 180ms average latency. Your users say the app is slow. Both are telling the truth. The disconnect is what you're measuring.
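One way both can be true, offered as an assumption rather than as the only explanation, is that an average hides the slow tail. The sketch below uses a fabricated sample (90 requests at 100 ms, 10 at 900 ms) and a hypothetical `summarize` helper to compare the mean against nearest-rank percentiles.

```python
import statistics


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    ordered = sorted(latencies_ms)
    n = len(ordered)

    def percentile(p: int) -> float:
        # Nearest-rank percentile: smallest value with at least p% of
        # samples at or below it (integer ceiling division avoids float error).
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
    }


# Fabricated sample for illustration only: most requests are fast, a few are slow
sample_ms = [100.0] * 90 + [900.0] * 10
stats = summarize(sample_ms)
print(stats)
```

In this sample the mean is 180 ms, yet no single request took 180 ms: the typical request took 100 ms and one in ten took 900 ms. A dashboard showing only the mean would report neither experience.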