Balancing Fast Responses and Fair Queuing
A 10,000-token request takes 20 seconds. Behind it, a hundred 50-token requests wait. Is that fair? What even is fair in LLM serving?