7 posts tagged with "scaling"

Extending Context Beyond Training Length

Models trained on 4K context can work at 32K with position interpolation. Quality degrades, but predictably. Know the tradeoffs before extending.
Four GPUs don't give you 4x throughput. Communication overhead, load imbalance, and synchronization eat into gains. Know the scaling curve before you buy.
Paged allocation, quantization, prefix caching—which techniques give 4x more concurrent requests and which are hype?
An unbounded queue is a memory leak waiting to happen. A too-small queue drops requests unnecessarily. Here's how to size and manage LLM request queues.
Traffic spikes 10x. Do you queue requests until OOM, drop them randomly, or gracefully degrade? The answer shapes your system's behavior under pressure.
Single-user latency was 200ms. At 100 concurrent users, it's 3 seconds. The model didn't slow down. Your serving architecture did.
Double your context window, quadruple your compute. The O(n²) attention cost catches teams off guard when they scale.
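To make that last teaser's arithmetic concrete, here is a rough back-of-the-envelope sketch (the model dimensions are assumed for illustration, not taken from the post): the score and weighted-sum matmuls in self-attention cost on the order of n² · d FLOPs per layer, so doubling the context length n roughly quadruples that term.

```python
# Rough order-of-magnitude sketch; d_model and n_layers are illustrative assumptions.
def attention_flops(n_tokens: int, d_model: int, n_layers: int) -> float:
    """Approximate FLOPs in the attention score (QK^T) and weighted-sum (attn @ V) matmuls."""
    # Each of the two matmuls is roughly 2 * n^2 * d multiply-adds per layer.
    return 2 * (2 * n_tokens**2 * d_model) * n_layers

base = attention_flops(n_tokens=4096, d_model=4096, n_layers=32)
doubled = attention_flops(n_tokens=8192, d_model=4096, n_layers=32)
print(f"4K context: {base:.2e} FLOPs in attention")
print(f"8K context: {doubled:.2e} FLOPs in attention")
print(f"ratio: {doubled / base:.1f}x")  # 4.0x for 2x the context
```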
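And for the first post on this list, "Extending Context Beyond Training Length", a minimal sketch of the idea behind position interpolation, assuming a RoPE-based model and the 4K-trained, 32K-target lengths from the teaser (names and shapes here are illustrative, not code from the post): rescale position indices so rotation angles stay inside the range the model saw during training instead of extrapolating past it.

```python
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angle per (position, frequency) pair, as in standard RoPE."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)  # (head_dim // 2,)
    return np.outer(positions, inv_freq)                        # (seq_len, head_dim // 2)

train_len, target_len, head_dim = 4096, 32768, 128  # assumed illustrative sizes

# Naive extrapolation: positions 4096..32767 produce angles the model never trained on.
extrapolated = rope_angles(np.arange(target_len), head_dim)

# Position interpolation: squeeze all positions back toward [0, train_len).
scale = train_len / target_len  # 0.125 for 4K -> 32K
interpolated = rope_angles(np.arange(target_len) * scale, head_dim)

print(extrapolated.max())   # ~32767: far outside the trained range
print(interpolated.max())   # ~4096: compressed back into the range seen in training
```

The tradeoff the post alludes to: compressing positions makes nearby tokens harder to tell apart, which is where the predictable quality degradation comes from.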