5 posts tagged with "throughput"

Adding GPUs Without Linear Speedup
Four GPUs don't give you 4x throughput. Communication overhead, load imbalance, and synchronization eat into gains. Know the scaling curve before you buy.
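A back-of-the-envelope way to see why the curve bends is an Amdahl-style model: if some fraction of each step is communication or synchronization, extra GPUs only speed up the rest. A minimal sketch, where the 10% overhead fraction is an illustrative assumption, not a measurement:

```python
# Amdahl-style scaling estimate: only the parallelizable fraction
# of each step scales with GPU count. comm_fraction is assumed.

def speedup(n_gpus: int, comm_fraction: float) -> float:
    """Amdahl's law with a fixed comm/sync fraction per step."""
    serial = comm_fraction
    parallel = 1.0 - comm_fraction
    return 1.0 / (serial + parallel / n_gpus)

for n in (1, 2, 4, 8):
    s = speedup(n, comm_fraction=0.1)  # assume 10% comm/sync overhead
    print(f"{n} GPUs -> {s:.2f}x speedup ({s / n:.0%} efficiency)")
```

At an assumed 10% overhead, 4 GPUs land around 3.1x, and efficiency keeps falling as you add more.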
You can have 1000 tokens per second with 3-second latency, or 200 tokens per second with 200ms latency. You cannot have both. Here's how to choose.
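The tradeoff comes from batching: a bigger batch amortizes each decode step across more requests, but every request then pays the full step time per token. A toy model, with both cost constants as assumptions:

```python
# Toy batch-size tradeoff. Both costs are illustrative assumptions:
# a fixed per-step overhead plus a marginal cost per sequence.

STEP_OVERHEAD_MS = 8.0   # assumed fixed cost of one decode step
PER_SEQ_MS = 0.5         # assumed marginal cost per sequence per step

def decode_step_ms(batch_size: int) -> float:
    return STEP_OVERHEAD_MS + PER_SEQ_MS * batch_size

for batch in (1, 8, 32, 128):
    step = decode_step_ms(batch)
    throughput = batch / step * 1000  # tokens/s summed across the batch
    print(f"batch={batch:>4}: {throughput:7.0f} tok/s, {step:5.1f} ms/token")
```

Aggregate throughput climbs with batch size while per-token latency climbs with it, which is exactly the choice in the excerpt above.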
Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fills those gaps. The difference is 3-5x throughput.
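A rough simulation makes the gap concrete: static batching holds the GPU for as long as the longest sequence in each batch, while continuous batching refills a slot the moment a sequence finishes. The length distribution below is an assumption, and the continuous figure is an idealized lower bound that ignores scheduling overhead:

```python
# Static batching: each batch runs for max(lengths in batch) steps.
# Continuous batching: total tokens spread over BATCH slots (idealized).
import random

random.seed(0)
# Assumed skewed output lengths; the gain grows with length variance.
lengths = [max(1, int(random.expovariate(1 / 200))) for _ in range(256)]
BATCH = 32

static_steps = sum(max(lengths[i:i + BATCH])
                   for i in range(0, len(lengths), BATCH))
continuous_steps = sum(lengths) / BATCH

print(f"static:     {static_steps} steps")
print(f"continuous: {continuous_steps:.0f} steps")
print(f"speedup:    {static_steps / continuous_steps:.1f}x")
```

With exponentially distributed lengths the ratio lands in the same 3-5x range the post claims; with near-uniform lengths it would be far smaller.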
Single-user latency was 200ms. At 100 concurrent users, it's 3 seconds. The model didn't slow down. Your serving architecture did.
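One way to model that cliff is basic queueing theory: service time per request stays fixed, but as arrival rate approaches capacity, waiting dominates. An M/M/1-style sketch, with the 200ms service time taken from the excerpt and everything else assumed:

```python
# Queueing sketch: latency blows up as utilization approaches 1,
# even though per-request service time never changes. M/M/1 model;
# numbers are illustrative, not measurements.

SERVICE_MS = 200.0              # assumed single-request latency
CAPACITY = 1000.0 / SERVICE_MS  # sustainable requests/s (here, 5 rps)

def mm1_latency_ms(arrival_rps: float) -> float:
    rho = arrival_rps / CAPACITY  # utilization
    if rho >= 1.0:
        return float("inf")       # queue grows without bound
    return SERVICE_MS / (1.0 - rho)

for rps in (1, 3, 4, 4.5, 4.9):
    print(f"{rps:>4} req/s -> {mm1_latency_ms(rps):7.0f} ms")
```

At 20% utilization latency is barely above 200ms; at 98% it is seconds, with the model doing the same work per request throughout.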
Your monitoring dashboard shows 180ms average latency. Your users say the app is slow. Both are telling the truth. The disconnect is what you're measuring.
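The mechanics are easy to reproduce: a skewed latency distribution can have a healthy mean while the tail is what users actually feel. A sketch on synthetic data, where the 97%/3% split and the Gaussian parameters are assumptions chosen to mimic the numbers above:

```python
# Mean vs. tail: a skewed sample where the dashboard average looks
# fine but high percentiles are seconds. Distribution is assumed.
import random
import statistics

random.seed(0)
# 97% fast requests, 3% stuck behind queues or cold caches.
samples = [random.gauss(120, 20) if random.random() < 0.97
           else random.gauss(2500, 500) for _ in range(10_000)]

def percentile(data, p):
    s = sorted(data)
    return s[int(p / 100 * (len(s) - 1))]

print(f"mean: {statistics.mean(samples):6.0f} ms")  # the dashboard's number
print(f"p50:  {percentile(samples, 50):6.0f} ms")
print(f"p95:  {percentile(samples, 95):6.0f} ms")
print(f"p99:  {percentile(samples, 99):6.0f} ms")   # what users feel
```

The mean sits near 190ms while p99 is around 2.5 seconds, which is why tracking percentiles rather than averages closes the gap between the dashboard and the complaints.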