Matching the Right Model to Each Task
8B models handle classification well. 70B models handle summarization. Code-specialized models beat generalists at code. Match the model to the task.
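As a minimal sketch of that matching, here's a hypothetical task-to-model registry; the model names and the `route()` helper are illustrative assumptions, not a real API:

```python
# Minimal task-to-model router sketch. Model names and the route()
# interface are illustrative, not a real API.
TASK_MODELS = {
    "classification": "llama-3.1-8b",   # small models suffice here
    "summarization":  "llama-3.1-70b",  # benefits from a larger model
    "code":           "codellama-34b",  # code-specialized beats generalist
}

def route(task: str) -> str:
    """Return the model for this task, falling back to the large generalist."""
    return TASK_MODELS.get(task, "llama-3.1-70b")

print(route("classification"))  # llama-3.1-8b
print(route("code"))            # codellama-34b
```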
Merge adapters for single-tenant deployments. Keep them separate for multi-tenant. The serving architecture depends on how many customizations you're running.
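A sketch of both modes, assuming Hugging Face PEFT's adapter API; the model and adapter IDs are placeholders:

```python
# Sketch of the two serving modes with Hugging Face PEFT.
# Model and adapter IDs are placeholders; adjust for your deployment.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-id")

# Single-tenant: merge the adapter into the base weights once.
# One set of weights, no per-request adapter dispatch overhead.
merged = PeftModel.from_pretrained(base, "tenant-a-adapter").merge_and_unload()

# Multi-tenant: keep adapters separate and switch per request, so many
# customizations share a single copy of the base weights.
multi = PeftModel.from_pretrained(base, "tenant-a-adapter", adapter_name="tenant_a")
multi.load_adapter("tenant-b-adapter", adapter_name="tenant_b")
multi.set_adapter("tenant_b")  # activate the adapter for the current tenant
```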
Full attention is O(n²). Sliding window attention is O(n). The trade: lose long-range dependencies, gain linear scaling. Often worth it.
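A toy PyTorch sketch of the sliding-window mask, showing why the work grows linearly in sequence length; the window size and shapes are illustrative:

```python
# Sliding-window attention mask sketch: each query attends only to the
# previous `window` positions, so work grows as O(n * window), not O(n^2).
import torch

def sliding_window_mask(n: int, window: int) -> torch.Tensor:
    """True where attention is allowed (causal, limited to the last `window` tokens)."""
    i = torch.arange(n).unsqueeze(1)  # query positions
    j = torch.arange(n).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(n=8, window=3)
print(mask.int())
print(mask.sum().item(), "allowed pairs vs", 8 * 8, "for full attention")
```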
Self-attention lets a sequence talk to itself. Cross-attention lets one sequence attend to another. Understanding the difference enables better architectures.
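Both are the same scaled-dot-product computation with different inputs. A minimal PyTorch sketch, with illustrative shapes:

```python
# The same attention function implements both: self-attention feeds one
# sequence as Q, K, and V; cross-attention takes Q from one sequence and
# K, V from another. Shapes are (batch, seq, dim).
import torch
import torch.nn.functional as F

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 10, 64)  # e.g. decoder states
y = torch.randn(1, 7, 64)   # e.g. encoder outputs

self_out  = attention(x, x, x)  # sequence attends to itself
cross_out = attention(x, y, y)  # x attends to y
print(self_out.shape, cross_out.shape)  # both (1, 10, 64)
```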
Tensor parallelism cuts latency by splitting each layer's weights across GPUs. Pipeline parallelism increases throughput by splitting the model into sequential stages. Choose based on your constraint.
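A toy NumPy illustration of the tensor-parallel half of that trade: split one layer's weight matrix column-wise across "devices" and the result is unchanged. Real systems shard across GPUs and gather the partial outputs with collectives; this is just the arithmetic:

```python
# Toy tensor parallelism: split a layer's weight matrix column-wise
# across two "devices" and concatenate the partial outputs.
import numpy as np

x = np.random.randn(4, 512)        # activations
W = np.random.randn(512, 1024)     # one layer's weights

# "GPU 0" and "GPU 1" each hold half the columns and compute in parallel.
W0, W1 = np.split(W, 2, axis=1)
y_parallel = np.concatenate([x @ W0, x @ W1], axis=1)

assert np.allclose(x @ W, y_parallel)  # same result, half the work per device
```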
An unbounded queue is a memory leak waiting to happen. A too-small queue drops requests unnecessarily. Here's how to size and manage LLM request queues.
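A minimal sketch of the bounded side, using Python's standard `queue` module; the maxsize of 64 is an illustrative placeholder you'd derive from acceptable wait time times serving throughput:

```python
# Bounded request queue sketch: reject at admission instead of growing
# without limit. Size the queue from (acceptable wait) x (requests/sec).
import queue

requests = queue.Queue(maxsize=64)  # illustrative bound

def admit(request) -> bool:
    """Enqueue if there's room; otherwise shed load explicitly."""
    try:
        requests.put_nowait(request)
        return True
    except queue.Full:
        return False  # caller returns 429 / Retry-After instead of OOMing later
```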
Premium users expect faster responses. Batch jobs can wait. Here's how to implement priority queues that don't starve anyone.
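One way to avoid starvation is aging: a request's effective priority improves the longer it waits. A small sketch, with `AGING_RATE` as an illustrative tuning knob and an O(n) pop for clarity:

```python
# Priority scheduling with aging: batch jobs are never starved because
# their effective priority keeps improving while they wait.
import time

AGING_RATE = 0.1   # priority points forgiven per second of waiting (illustrative)
pending = []       # (base_priority, enqueue_time, request); lower score runs first

def push(request, base_priority: float):
    pending.append((base_priority, time.monotonic(), request))

def pop():
    now = time.monotonic()
    # Effective priority = base priority minus credit for time spent waiting.
    idx = min(range(len(pending)),
              key=lambda i: pending[i][0] - AGING_RATE * (now - pending[i][1]))
    return pending.pop(idx)[2]

push("batch-job", base_priority=10)  # low priority
push("premium", base_priority=1)     # high priority
print(pop())  # premium runs first, but batch-job's score improves as it waits
```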
Traffic spikes 10x. Do you queue requests until OOM, drop them randomly, or gracefully degrade? The answer shapes your system's behavior under pressure.
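A sketch of the graceful-degradation option, assuming illustrative queue-depth thresholds and a hypothetical small-model fallback:

```python
# Admission control for the 10x spike: a hard bound plus a degradation
# tier instead of OOM or random drops. All thresholds are illustrative.
MAX_QUEUE = 200
DEGRADE_AT = 100  # above this depth, serve a cheaper configuration

def admit(queue_depth: int, request: dict):
    if queue_depth >= MAX_QUEUE:
        return {"status": 429, "retry_after_s": 5}  # shed load explicitly
    if queue_depth >= DEGRADE_AT:
        request["max_tokens"] = 256                 # degrade gracefully:
        request["model"] = "small-model"            # shorter, cheaper answers
    return {"status": 202, "request": request}

print(admit(50,  {"prompt": "..."}))   # normal service
print(admit(150, {"prompt": "..."}))   # degraded but served
print(admit(250, {"prompt": "..."}))   # rejected with backpressure
```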
You can have 1000 tokens per second with 3-second latency, or 200 tokens per second with 200ms latency. You cannot have both. Here's how to choose.
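A back-of-envelope sketch of why, assuming a simplified cost model (a fixed per-step overhead plus a small marginal cost per batched sequence); both constants are illustrative:

```python
# Batching trade-off sketch. The cost model is a deliberate simplification:
# each decode step pays a fixed overhead plus a small per-sequence cost.
STEP_OVERHEAD_MS = 4.0   # cost of one forward pass regardless of batch size
PER_SEQ_MS = 0.2         # marginal cost per sequence in the batch

def tokens_per_sec(batch: int) -> float:
    step_ms = STEP_OVERHEAD_MS + PER_SEQ_MS * batch
    return batch * 1000.0 / step_ms        # one token per sequence per step

def per_token_latency_ms(batch: int) -> float:
    return STEP_OVERHEAD_MS + PER_SEQ_MS * batch

for batch in (1, 8, 64):
    print(batch, round(tokens_per_sec(batch)), round(per_token_latency_ms(batch), 1))
# Bigger batches push aggregate throughput up, and per-request latency up with it.
```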
Single-user latency was 200ms. At 100 concurrent users, it's 3 seconds. The model didn't slow down. Your serving architecture did.
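The gap is queueing delay. A rough sketch with illustrative numbers, treating the server as a fixed number of concurrent execution slots:

```python
# Queueing sketch: per-request service time never changed, but time spent
# waiting for a slot did. All numbers are illustrative.
SERVICE_TIME_S = 0.2    # single-user latency: 200ms
CONCURRENT_SLOTS = 8    # requests the server can actually run at once

def expected_latency(concurrent_users: int) -> float:
    waves = -(-concurrent_users // CONCURRENT_SLOTS)  # ceiling division
    return waves * SERVICE_TIME_S                     # queueing delay dominates

print(expected_latency(1))    # 0.2s -- feels like "the model is fast"
print(expected_latency(100))  # 2.6s -- same model, saturated serving path
```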