Matching the Right Model to Each Task
8B models handle classification well. 70B models handle summarization. Code-specialized models beat generalists at code. Match the model to the task.
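As a minimal sketch of that matching, here's a hypothetical task-to-model registry; the model names and the `route()` helper are illustrative assumptions, not a real API:

```python
# Minimal task-to-model router sketch. Model names and the route()
# interface are illustrative, not a real API.
TASK_MODELS = {
    "classification": "llama-3.1-8b",   # small models suffice here
    "summarization":  "llama-3.1-70b",  # benefits from a larger model
    "code":           "codellama-34b",  # code-specialized beats generalist
}

def route(task: str) -> str:
    """Return the model for this task, falling back to the large generalist."""
    return TASK_MODELS.get(task, "llama-3.1-70b")

print(route("classification"))  # llama-3.1-8b
print(route("code"))            # codellama-34b
```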
Merge adapters for single-tenant deployments. Keep them separate for multi-tenant. The serving architecture depends on how many customizations you're running.
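A sketch of both modes, assuming Hugging Face PEFT's adapter API; the model and adapter IDs are placeholders:

```python
# Sketch of the two serving modes with Hugging Face PEFT.
# Model and adapter IDs are placeholders; adjust for your deployment.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-id")

# Single-tenant: merge the adapter into the base weights once.
# One set of weights, no per-request adapter dispatch overhead.
merged = PeftModel.from_pretrained(base, "tenant-a-adapter").merge_and_unload()

# Multi-tenant: keep adapters separate and switch per request, so many
# customizations share a single copy of the base weights.
multi = PeftModel.from_pretrained(base, "tenant-a-adapter", adapter_name="tenant_a")
multi.load_adapter("tenant-b-adapter", adapter_name="tenant_b")
multi.set_adapter("tenant_b")  # activate the adapter for the current tenant
```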
Full attention is O(n²). Sliding window attention is O(n). The trade: lose long-range dependencies, gain linear scaling. Often worth it.
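A toy PyTorch sketch of the sliding-window mask, showing why the work grows linearly in sequence length; the window size and shapes are illustrative:

```python
# Sliding-window attention mask sketch: each query attends only to the
# previous `window` positions, so work grows as O(n * window), not O(n^2).
import torch

def sliding_window_mask(n: int, window: int) -> torch.Tensor:
    """True where attention is allowed (causal, limited to the last `window` tokens)."""
    i = torch.arange(n).unsqueeze(1)  # query positions
    j = torch.arange(n).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(n=8, window=3)
print(mask.int())
print(mask.sum().item(), "allowed pairs vs", 8 * 8, "for full attention")
```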
Self-attention lets a sequence talk to itself. Cross-attention lets one sequence attend to another. Understanding the difference enables better architectures.
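Both are the same scaled-dot-product computation with different inputs. A minimal PyTorch sketch, with illustrative shapes:

```python
# The same attention function implements both: self-attention feeds one
# sequence as Q, K, and V; cross-attention takes Q from one sequence and
# K, V from another. Shapes are (batch, seq, dim).
import torch
import torch.nn.functional as F

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 10, 64)  # e.g. decoder states
y = torch.randn(1, 7, 64)   # e.g. encoder outputs

self_out  = attention(x, x, x)  # sequence attends to itself
cross_out = attention(x, y, y)  # x attends to y
print(self_out.shape, cross_out.shape)  # both (1, 10, 64)
```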
Tensor parallelism cuts latency by splitting each layer's weights across GPUs. Pipeline parallelism increases throughput by splitting the model into sequential stages. Choose based on your constraint.
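A toy NumPy illustration of the tensor-parallel half of that trade: split one layer's weight matrix column-wise across "devices" and the result is unchanged. Real systems shard across GPUs and gather the partial outputs with collectives; this is just the arithmetic:

```python
# Toy tensor parallelism: split a layer's weight matrix column-wise
# across two "devices" and concatenate the partial outputs.
import numpy as np

x = np.random.randn(4, 512)        # activations
W = np.random.randn(512, 1024)     # one layer's weights

# "GPU 0" and "GPU 1" each hold half the columns and compute in parallel.
W0, W1 = np.split(W, 2, axis=1)
y_parallel = np.concatenate([x @ W0, x @ W1], axis=1)

assert np.allclose(x @ W, y_parallel)  # same result, half the work per device
```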
An unbounded queue is a memory leak waiting to happen. A too-small queue drops requests unnecessarily. Here's how to size and manage LLM request queues.
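A minimal sketch of the bounded side, using Python's standard `queue` module; the maxsize of 64 is an illustrative placeholder you'd derive from acceptable wait time times serving throughput:

```python
# Bounded request queue sketch: reject at admission instead of growing
# without limit. Size the queue from (acceptable wait) x (requests/sec).
import queue

requests = queue.Queue(maxsize=64)  # illustrative bound

def admit(request) -> bool:
    """Enqueue if there's room; otherwise shed load explicitly."""
    try:
        requests.put_nowait(request)
        return True
    except queue.Full:
        return False  # caller returns 429 / Retry-After instead of OOMing later
```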
Premium users expect faster responses. Batch jobs can wait. Here's how to implement priority queues that don't starve anyone.
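One way to avoid starvation is aging: a request's effective priority improves the longer it waits. A small sketch, with `AGING_RATE` as an illustrative tuning knob and an O(n) pop for clarity:

```python
# Priority scheduling with aging: batch jobs are never starved because
# their effective priority keeps improving while they wait.
import time

AGING_RATE = 0.1   # priority points forgiven per second of waiting (illustrative)
pending = []       # (base_priority, enqueue_time, request); lower score runs first

def push(request, base_priority: float):
    pending.append((base_priority, time.monotonic(), request))

def pop():
    now = time.monotonic()
    # Effective priority = base priority minus credit for time spent waiting.
    idx = min(range(len(pending)),
              key=lambda i: pending[i][0] - AGING_RATE * (now - pending[i][1]))
    return pending.pop(idx)[2]

push("batch-job", base_priority=10)  # low priority
push("premium", base_priority=1)     # high priority
print(pop())  # premium runs first, but batch-job's score improves as it waits
```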
Traffic spikes 10x. Do you queue requests until OOM, drop them randomly, or gracefully degrade? The answer shapes your system's behavior under pressure.
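A sketch of the graceful-degradation option, assuming illustrative queue-depth thresholds and a hypothetical small-model fallback:

```python
# Admission control for the 10x spike: a hard bound plus a degradation
# tier instead of OOM or random drops. All thresholds are illustrative.
MAX_QUEUE = 200
DEGRADE_AT = 100  # above this depth, serve a cheaper configuration

def admit(queue_depth: int, request: dict):
    if queue_depth >= MAX_QUEUE:
        return {"status": 429, "retry_after_s": 5}  # shed load explicitly
    if queue_depth >= DEGRADE_AT:
        request["max_tokens"] = 256                 # degrade gracefully:
        request["model"] = "small-model"            # shorter, cheaper answers
    return {"status": 202, "request": request}

print(admit(50,  {"prompt": "..."}))   # normal service
print(admit(150, {"prompt": "..."}))   # degraded but served
print(admit(250, {"prompt": "..."}))   # rejected with backpressure
```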
You can have 1000 tokens per second with 3-second latency, or 200 tokens per second with 200ms latency. You cannot have both. Here's how to choose.
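A back-of-envelope sketch of why, assuming a simplified cost model (a fixed per-step overhead plus a small marginal cost per batched sequence); both constants are illustrative:

```python
# Batching trade-off sketch. The cost model is a deliberate simplification:
# each decode step pays a fixed overhead plus a small per-sequence cost.
STEP_OVERHEAD_MS = 4.0   # cost of one forward pass regardless of batch size
PER_SEQ_MS = 0.2         # marginal cost per sequence in the batch

def tokens_per_sec(batch: int) -> float:
    step_ms = STEP_OVERHEAD_MS + PER_SEQ_MS * batch
    return batch * 1000.0 / step_ms        # one token per sequence per step

def per_token_latency_ms(batch: int) -> float:
    return STEP_OVERHEAD_MS + PER_SEQ_MS * batch

for batch in (1, 8, 64):
    print(batch, round(tokens_per_sec(batch)), round(per_token_latency_ms(batch), 1))
# Bigger batches push aggregate throughput up, and per-request latency up with it.
```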
Single-user latency was 200ms. At 100 concurrent users, it's 3 seconds. The model didn't slow down. Your serving architecture did.
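The gap is queueing delay. A rough sketch with illustrative numbers, treating the server as a fixed number of concurrent execution slots:

```python
# Queueing sketch: per-request service time never changed, but time spent
# waiting for a slot did. All numbers are illustrative.
SERVICE_TIME_S = 0.2    # single-user latency: 200ms
CONCURRENT_SLOTS = 8    # requests the server can actually run at once

def expected_latency(concurrent_users: int) -> float:
    waves = -(-concurrent_users // CONCURRENT_SLOTS)  # ceiling division
    return waves * SERVICE_TIME_S                     # queueing delay dominates

print(expected_latency(1))    # 0.2s -- feels like "the model is fast"
print(expected_latency(100))  # 2.6s -- same model, saturated serving path
```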