Tuning Batch Size for Your Workload
Batch size 1 wastes GPU compute. Batch size 64 kills latency. Somewhere in between is your sweet spot. Here's how to find it.
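One way to find it is an offline sweep: run the same prompts at several batch sizes, record per-batch latency and throughput, and take the largest batch that still meets your latency target. A minimal sketch, assuming you supply your own `run_batch` inference function; the sizes, trial count, and `best_under_slo` helper are illustrative:

```python
import statistics
import time

def sweep_batch_sizes(run_batch, prompts, sizes=(1, 2, 4, 8, 16, 32, 64), trials=5):
    """Measure median per-batch latency and throughput at each batch size.

    run_batch: your inference function, taking a list of prompts.
    Assumes len(prompts) >= max(sizes).
    """
    results = {}
    for size in sizes:
        batch = prompts[:size]
        latencies = []
        for _ in range(trials):
            start = time.perf_counter()
            run_batch(batch)
            latencies.append(time.perf_counter() - start)
        p50 = statistics.median(latencies)
        results[size] = {
            "p50_latency_s": p50,
            "throughput_rps": size / p50,  # requests completed per second
        }
    return results

def best_under_slo(results, slo_s):
    """Largest batch size whose median latency still meets the SLO."""
    ok = [size for size, r in results.items() if r["p50_latency_s"] <= slo_s]
    return max(ok) if ok else None
```

Plot the two curves and you'll usually see throughput flatten while latency keeps climbing; the knee of that curve is the sweet spot.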
100 requests sounds like 100 requests. But one 50k-token request can consume more memory and compute than the other 99 short ones combined. Batch by tokens, not requests.
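A minimal sketch of a token-budget batcher, assuming each request arrives with a known (or estimated) token count; the 8,192 budget and the request names are illustrative, not recommendations:

```python
def batch_by_tokens(requests, max_tokens_per_batch=8_192):
    """Group requests so no batch exceeds a total-token budget.

    requests: iterable of (request_id, token_count) pairs.
    A request larger than the budget gets a batch to itself.
    """
    batch, batch_tokens = [], 0
    for request_id, token_count in requests:
        if batch and batch_tokens + token_count > max_tokens_per_batch:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(request_id)
        batch_tokens += token_count
    if batch:
        yield batch

reqs = [("a", 50_000), ("b", 200), ("c", 300), ("d", 7_000)]
print(list(batch_by_tokens(reqs)))
# [['a'], ['b', 'c', 'd']] -- the 50k-token request is isolated
```

In production you'd likely also cap request count and budget for expected output tokens, but the admission unit stays the same: tokens, not requests.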
Static batching wastes GPU cycles waiting for the slowest request in the batch to finish. Continuous batching fills those gaps by admitting new requests the moment a slot frees up. The difference is 3-5x throughput.
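Here's a toy simulation of that scheduling difference, where each request needs a fixed number of decode steps and the engine has a fixed number of batch slots. The numbers are made up to show the gap, not to reproduce the 3-5x figure:

```python
from collections import deque

def continuous_batching_steps(waiting: deque, max_slots: int = 8) -> int:
    """Toy decode loop: a finished request frees its slot immediately,
    and a waiting request is admitted on the very next step. Static
    batching would instead hold every slot until the whole batch ends."""
    active = {}  # request_id -> remaining decode steps
    steps = 0
    while waiting or active:
        # Refill any free slots before the next decode step.
        while waiting and len(active) < max_slots:
            request_id, length = waiting.popleft()
            active[request_id] = length
        # One decode step for every active request.
        for request_id in list(active):
            active[request_id] -= 1
            if active[request_id] == 0:
                del active[request_id]  # slot frees up right here
        steps += 1
    return steps

jobs = deque([("long", 100), ("s1", 10), ("s2", 10), ("s3", 10)])
print(continuous_batching_steps(jobs, max_slots=2))
# 100 steps; static batching on the same jobs takes 110
# (batch {long, s1} costs 100 steps, then batch {s2, s3} costs 10).
```

The short requests ride along in the slot the long one never touches; with static batching they'd sit in the queue until the long request drained the whole batch.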