7 posts tagged with "scaling"

Extending Context Beyond Training Length

Models trained on 4K context can work at 32K with position interpolation. Quality degrades, but predictably. Know the tradeoffs before extending.
Four GPUs don't give you 4x throughput. Communication overhead, load imbalance, and synchronization eat into gains. Know the scaling curve before you buy.
Paged allocation, quantization, prefix caching—which techniques give 4x more concurrent requests and which are hype?
An unbounded queue is a memory leak waiting to happen. A too-small queue drops requests unnecessarily. Here's how to size and manage LLM request queues.
Traffic spikes 10x. Do you queue requests until OOM, drop them randomly, or gracefully degrade? The answer shapes your system's behavior under pressure.
Single-user latency was 200ms. At 100 concurrent users, it's 3 seconds. The model didn't slow down. Your serving architecture did.
Double your context window, quadruple your compute. The O(n²) attention cost catches teams off guard when they scale.
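To make that last teaser's arithmetic concrete, here is a rough back-of-the-envelope sketch (the model dimensions are assumed for illustration, not taken from the post): the score and weighted-sum matmuls in self-attention cost on the order of n² · d FLOPs per layer, so doubling the context length n roughly quadruples that term.

```python
# Rough order-of-magnitude sketch; d_model and n_layers are illustrative assumptions.
def attention_flops(n_tokens: int, d_model: int, n_layers: int) -> float:
    """Approximate FLOPs in the attention score (QK^T) and weighted-sum (attn @ V) matmuls."""
    # Each of the two matmuls is roughly 2 * n^2 * d multiply-adds per layer.
    return 2 * (2 * n_tokens**2 * d_model) * n_layers

base = attention_flops(n_tokens=4096, d_model=4096, n_layers=32)
doubled = attention_flops(n_tokens=8192, d_model=4096, n_layers=32)
print(f"4K context: {base:.2e} FLOPs in attention")
print(f"8K context: {doubled:.2e} FLOPs in attention")
print(f"ratio: {doubled / base:.1f}x")  # 4.0x for 2x the context
```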
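And for the first post on this list, "Extending Context Beyond Training Length", a minimal sketch of the idea behind position interpolation, assuming a RoPE-based model and the 4K-trained, 32K-target lengths from the teaser (names and shapes here are illustrative, not code from the post): rescale position indices so rotation angles stay inside the range the model saw during training instead of extrapolating past it.

```python
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angle per (position, frequency) pair, as in standard RoPE."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)  # (head_dim // 2,)
    return np.outer(positions, inv_freq)                        # (seq_len, head_dim // 2)

train_len, target_len, head_dim = 4096, 32768, 128  # assumed illustrative sizes

# Naive extrapolation: positions 4096..32767 produce angles the model never trained on.
extrapolated = rope_angles(np.arange(target_len), head_dim)

# Position interpolation: squeeze all positions back toward [0, train_len).
scale = train_len / target_len  # 0.125 for 4K -> 32K
interpolated = rope_angles(np.arange(target_len) * scale, head_dim)

print(extrapolated.max())   # ~32767: far outside the trained range
print(interpolated.max())   # ~4096: compressed back into the range seen in training
```

The tradeoff the post alludes to: compressing positions makes nearby tokens harder to tell apart, which is where the predictable quality degradation comes from.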