#architecture

11 posts tagged with "architecture"

Matching the Right Model to Each Task

8B models handle classification well. 70B models handle summarization. Code-specialized models beat generalists at code. Match the model to the task.

Deploying and Serving Fine-tuned Models

Merge adapters for single-tenant deployments. Keep them separate for multi-tenant. The serving architecture depends on how many customizations you're running.

Trading Full Context for Speed

Full attention is O(n²) in sequence length n. Sliding window attention is O(n·w) for a fixed window w, which is linear in n. The trade: lose direct long-range dependencies, gain linear scaling. Often worth it.
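To make the scaling concrete, here is a rough sketch that counts how many query-key pairs get scored under causal attention, with and without a window. The function name `attention_pairs` and the causal-masking assumption are illustrative, not taken from the post.

```python
from typing import Optional


def attention_pairs(n: int, window: Optional[int] = None) -> int:
    """Count query-key pairs scored under causal attention.

    Assumes each position attends to itself and to earlier positions only.
    With `window` set, each position sees at most `window` keys.
    """
    if window is None or window >= n:
        # Full causal attention: 1 + 2 + ... + n pairs, which grows as n².
        return n * (n + 1) // 2
    # The first `window` positions see all of their predecessors;
    # every later position sees exactly `window` keys.
    return window * (window + 1) // 2 + (n - window) * window


if __name__ == "__main__":
    for n in (1024, 8192, 65536):
        full = attention_pairs(n)
        windowed = attention_pairs(n, window=512)
        print(f"n={n}: full={full}, windowed={windowed}")
```

Doubling n roughly quadruples the full count but only doubles the windowed count, which is the trade described above.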

Tensor vs Pipeline Parallelism: When Each Wins

Tensor parallelism cuts latency by splitting each layer's computation across GPUs. Pipeline parallelism increases throughput by splitting the model into sequential stages, each on its own GPU. Choose based on your constraint.

Designing Queues That Don't Explode

An unbounded queue is a memory leak waiting to happen. A too-small queue drops requests unnecessarily. Here's how to size and manage LLM request queues.
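As a minimal sketch of the idea, the queue below rejects new work once it is full rather than growing without limit. The class `BoundedRequestQueue` and the sizing helper `queue_capacity` are hypothetical names, and the sizing rule (service rate times the longest wait you will tolerate) is one common heuristic, not necessarily the one the post uses.

```python
import queue


def queue_capacity(requests_per_second: float, max_wait_seconds: float) -> int:
    """Heuristic (an assumption): hold only as many requests as can be
    served within the longest wait a caller should tolerate."""
    return max(1, int(requests_per_second * max_wait_seconds))


class BoundedRequestQueue:
    """FIFO queue that refuses new requests when full instead of growing."""

    def __init__(self, max_size: int) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=max_size)
        self.rejected = 0  # count of requests turned away

    def try_enqueue(self, request) -> bool:
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            # Fail fast so the caller can retry or back off.
            self.rejected += 1
            return False

    def dequeue(self):
        # Raises queue.Empty if there is nothing to serve.
        return self._q.get_nowait()
```

For example, a server that completes about 5 requests per second and should not keep callers waiting more than 4 seconds would get `queue_capacity(5.0, 4.0)` slots under this heuristic.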

Managing Load Without Dropping Requests

Traffic spikes 10x. Do you queue requests until OOM, drop them randomly, or gracefully degrade? The answer shapes your system's behavior under pressure.
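One way to picture graceful degradation is a three-way admission decision based on queue depth. This is a sketch under assumptions: the function name `admission_decision`, the 70% threshold, and the meaning of "degrade" (for instance, shorter outputs or a smaller model) are all hypothetical.

```python
def admission_decision(queue_depth: int, capacity: int, degrade_at: float = 0.7) -> str:
    """Return "accept", "degrade", or "reject" for an incoming request.

    `degrade_at` is the fraction of capacity (assumed value) at which the
    system starts serving cheaper responses instead of full ones.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if queue_depth >= capacity:
        # Queue is full: reject explicitly rather than queue until OOM.
        return "reject"
    if queue_depth >= degrade_at * capacity:
        # Under pressure: keep serving, but with reduced quality or cost.
        return "degrade"
    return "accept"
```

The point of the middle state is that the system sheds cost before it sheds requests.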