What TPU Economics Look Like in Practice
$4/hour vs $10/hour sounds great. But conversion cost, ecosystem limitations, and operational overhead change the math.
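A rough way to see why the hourly rate alone misleads is to fold the one-time conversion effort and the extra operational overhead into an effective hourly cost. Every number in this sketch is an illustrative assumption, not a benchmark or a quote.

```python
# Back-of-the-envelope: effective hourly cost of a TPU migration.
# Every figure below is an illustrative assumption, not a measured number.

GPU_RATE = 10.0           # $/hour for the GPU baseline (assumed)
TPU_RATE = 4.0            # $/hour for the TPU (assumed)
CONVERSION_COST = 40_000  # one-time engineering cost to port and validate (assumed)
EXTRA_OPS_RATE = 1.0      # $/hour of additional operational overhead (assumed)

def effective_tpu_rate(hours_of_use: float) -> float:
    """TPU cost per hour once the conversion cost is amortized over the usage period."""
    return TPU_RATE + EXTRA_OPS_RATE + CONVERSION_COST / hours_of_use

for months in (1, 3, 6, 12):
    hours = months * 730  # roughly 730 hours in a month
    print(f"{months:>2} months in: effective TPU rate "
          f"${effective_tpu_rate(hours):5.2f}/h vs GPU ${GPU_RATE:.2f}/h")
```

Under these particular assumptions the TPU only undercuts the GPU somewhere around the one-year mark, which is exactly the horizon the sticker price hides.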
An H100 costs roughly 2x as much as an A100 but delivers roughly 2x the memory bandwidth. For decode-bound inference, that math matters.
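To make that arithmetic concrete: decode is largely memory-bandwidth-bound, so cost per generated token scales roughly with hourly price divided by bandwidth. The prices and bandwidth figures below are round placeholder assumptions, not vendor specs.

```python
# Crude decode-bound model: tokens/sec ~ memory bandwidth, so
# $/token ~ hourly price / memory bandwidth.
# Prices and bandwidths are round placeholder assumptions.

cards = {
    "A100": {"price_per_hour": 2.0, "mem_bw_tb_s": 2.0},
    "H100": {"price_per_hour": 4.0, "mem_bw_tb_s": 4.0},  # ~2x price, ~2x bandwidth (assumed)
}

for name, c in cards.items():
    cost_per_token = c["price_per_hour"] / c["mem_bw_tb_s"]  # arbitrary units
    print(f"{name}: relative cost per decoded token = {cost_per_token:.2f}")
```

With those ratios the cost per decoded token comes out the same; the H100's edge shows up as lower latency or fewer replicas rather than a cheaper token.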
GPUs dominate LLM inference. TPUs offer interesting economics. Here's how to think about the choice.
vLLM serves 10x more requests than naive PyTorch. PagedAttention, continuous batching, and memory management make the difference.
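For a sense of what the optimized path looks like from the caller's side, here is a minimal vLLM offline-inference snippet; the model name is a placeholder, and it assumes vLLM is installed with a supported GPU.

```python
# Minimal vLLM usage: batching, PagedAttention, and scheduling happen inside the engine.
# Model name is a placeholder; requires vLLM installed and a compatible GPU.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching raise GPU utilization?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```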
vLLM, SGLang, TensorRT-LLM—each optimizes for different things. Here's how to pick without running a 6-month bake-off.
When does self-hosting break even? Here's the formula, the variables, and the 6-month reality check most teams skip.
Everyone wants to self-host LLMs to save money. Most shouldn't. Here's the math on when it actually makes sense.
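As a sketch of what that break-even comparison looks like, here is monthly API spend versus self-hosted cost. Every input is an assumption to swap for your own workload and quotes.

```python
# Break-even sketch: hosted API vs. self-hosted serving.
# Every number is an assumption to replace with your own figures.

API_PRICE_PER_1M_TOKENS = 10.0   # $ blended input+output price (assumed)
MONTHLY_TOKENS_M = 500           # millions of tokens per month (assumed)

GPU_RATE = 4.0                   # $/hour per GPU (assumed)
REPLICAS = 2                     # GPUs needed for latency + availability (assumed)
OPS_COST_PER_MONTH = 6_000       # slice of engineering time spent on serving (assumed)

api_cost = API_PRICE_PER_1M_TOKENS * MONTHLY_TOKENS_M
self_host_cost = GPU_RATE * 730 * REPLICAS + OPS_COST_PER_MONTH

print(f"Hosted API:  ${api_cost:9,.0f}/month")
print(f"Self-hosted: ${self_host_cost:9,.0f}/month")
print("Self-hosting pays off" if self_host_cost < api_cost else "Stay on the API for now")
```

With these particular numbers the API still wins; the crossover only arrives with much higher token volume or a smaller ops burden, which is why the 6-month reality check matters.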
An unbounded queue is a memory leak waiting to happen. A too-small queue drops requests unnecessarily. Here's how to size and manage LLM request queues.
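One way to pick the bound is Little's law: the depth you need is roughly the arrival rate times the longest wait you are willing to impose. The rates and timeout below are assumptions, and `enqueue` is a hypothetical admission wrapper.

```python
# Bounded request queue sized with Little's law: depth ~ arrival rate x tolerable wait.
# Requests beyond that bound are rejected up front instead of growing memory without limit.
import asyncio

ARRIVAL_RATE = 20        # requests/second the service should absorb (assumed)
MAX_QUEUE_WAIT_S = 2.0   # longest acceptable wait before rejecting is kinder (assumed)
MAX_DEPTH = int(ARRIVAL_RATE * MAX_QUEUE_WAIT_S)  # Little's law: L = lambda * W

request_queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_DEPTH)

def enqueue(request: dict) -> bool:
    """Admit a request, or shed load immediately when the queue is full."""
    try:
        request_queue.put_nowait(request)
        return True
    except asyncio.QueueFull:
        return False  # caller turns this into HTTP 429 or a retry hint
```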
Batch size 1 wastes GPU. Batch size 64 kills latency. Somewhere in between is your sweet spot. Here's how to find it.
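A toy way to frame that search: model step time as a fixed overhead plus a per-sequence cost, then take the largest batch that still fits the latency budget. The constants are assumptions you would replace with profiled numbers from your own model and hardware.

```python
# Toy model for finding a batch-size sweet spot under a latency budget.
# step_time = FIXED_OVERHEAD + PER_SEQ_COST * batch is a simplification;
# real numbers come from profiling your own model and hardware.

FIXED_OVERHEAD_MS = 20.0   # per-forward-pass overhead (assumed)
PER_SEQ_COST_MS = 1.5      # marginal cost per sequence in the batch (assumed)
LATENCY_BUDGET_MS = 60.0   # per-step latency you are willing to tolerate (assumed)

best = None
for batch in (1, 2, 4, 8, 16, 32, 64):
    step_ms = FIXED_OVERHEAD_MS + PER_SEQ_COST_MS * batch
    throughput = batch / (step_ms / 1000)  # sequences per second
    within_budget = step_ms <= LATENCY_BUDGET_MS
    print(f"batch {batch:>2}: {step_ms:5.1f} ms/step, {throughput:7.1f} seq/s"
          f"{'' if within_budget else '  <- over budget'}")
    if within_budget and (best is None or throughput > best[1]):
        best = (batch, throughput)

print(f"Sweet spot under {LATENCY_BUDGET_MS:.0f} ms budget: batch {best[0]}")
```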
Premium users expect faster responses. Batch jobs can wait. Here's how to implement priority queues that don't starve anyone.
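A common anti-starvation trick is aging: premium traffic gets a better base priority, but every second a request waits improves its effective score, so batch jobs are delayed rather than starved forever. The tier names, weights, and helpers below are illustrative assumptions.

```python
# Priority scheduling with aging: lower score wins, waiting lowers the score,
# so batch work eventually overtakes fresh premium traffic instead of starving.
# Tier names, weights, and helper names are illustrative assumptions.
import time
from dataclasses import dataclass, field

BASE_PRIORITY = {"premium": 0, "standard": 10, "batch": 100}
AGING_PER_SECOND = 2.0  # priority points gained per second of waiting (assumed)

@dataclass
class Pending:
    payload: dict
    tier: str
    enqueued_at: float = field(default_factory=time.monotonic)

    def score(self, now: float) -> float:
        # Lower is better; time spent waiting steadily improves the score.
        return BASE_PRIORITY[self.tier] - AGING_PER_SECOND * (now - self.enqueued_at)

_queue: list = []

def submit(payload: dict, tier: str) -> None:
    _queue.append(Pending(payload, tier))

def next_request():
    """Pick the best-scored request; an O(n) scan is fine for modest queue depths."""
    if not _queue:
        return None
    now = time.monotonic()
    best = min(_queue, key=lambda p: p.score(now))
    _queue.remove(best)
    return best.payload
```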