The Latency You're Not Measuring
Model latency is 200ms. End-to-end latency is 800ms. Where did 600ms go? Probably somewhere you're not looking.
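A minimal way to find the missing time is to time every stage yourself, not just the model call. The sketch below uses stand-in stages (build_prompt, call_model, and post_process are placeholders, and the sleep durations are made up); swap in your real calls and print the breakdown.

```python
import time

def timed(label, fn, timings):
    """Run fn() and record its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    result = fn()
    timings[label] = (time.perf_counter() - start) * 1000
    return result

# Stand-ins for the real stages of a request; swap in your own calls.
def build_prompt():  time.sleep(0.05)   # template rendering, retrieval, etc.
def call_model():    time.sleep(0.20)   # the part most dashboards measure
def post_process():  time.sleep(0.03)   # parsing, validation, guardrails

timings = {}
timed("build_prompt", build_prompt, timings)
timed("model_call", call_model, timings)
timed("post_process", post_process, timings)

total = sum(timings.values())
for label, ms in timings.items():
    print(f"{label:>13}: {ms:6.1f} ms ({ms / total:5.1%})")
print(f"{'end_to_end':>13}: {total:6.1f} ms")
```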
Your code says streaming is enabled. Your load balancer says otherwise. Here's where streaming breaks and how to fix it.
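One quick check, assuming your endpoint streams over HTTP: time when each chunk actually arrives at the client. If everything lands in a single burst at the end, something between you and the model (a buffering proxy or load balancer) is collapsing the stream. The url, payload, and 0.5s threshold below are placeholders to tune for your own responses.

```python
import time
import httpx  # assumption: your endpoint speaks HTTP and supports streamed responses

def check_streaming(url, payload):
    """Record arrival time of each chunk; if they all land together, something is buffering."""
    arrivals = []
    start = time.perf_counter()
    with httpx.stream("POST", url, json=payload, timeout=60) as response:
        for chunk in response.iter_bytes():
            arrivals.append(time.perf_counter() - start)
    if not arrivals:
        return "no data received"
    spread = arrivals[-1] - arrivals[0]
    # Heuristic threshold: a real stream spreads chunks out over the generation.
    return "streaming" if spread > 0.5 else "buffered: all chunks arrived in one burst"
```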
Every LLM request has two distinct phases, prefill and decode, with very different performance characteristics. Understanding the split is the key to targeted optimization.
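A sketch of how to see the two phases in your own traffic: wrap any streamed response and record time-to-first-token separately from per-token decode time. The fake_stream generator and its sleep values are stand-ins for a real streaming client.

```python
import time

def profile_phases(token_stream):
    """Split one streamed response into its two phases:
    time-to-first-token (dominated by prefill) and decode time per token."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start          # queueing + prefill + first decode step
        count += 1
    total = time.perf_counter() - start
    decode_time = total - (ttft or 0.0)
    per_token = decode_time / max(count - 1, 1)
    return {"ttft_s": ttft, "decode_s": decode_time, "s_per_token": per_token}

# Usage with any iterator of streamed tokens/chunks, e.g. a fake stream:
def fake_stream():
    time.sleep(0.3)                      # stands in for prefill
    for _ in range(50):
        time.sleep(0.02)                 # stands in for each decode step
        yield "tok"

print(profile_phases(fake_stream()))
```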
Median latency is 200ms. One in a hundred requests takes 8 seconds. Your dashboard shows green. Your users are churning.
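To see the gap for yourself, compute tail percentiles rather than the median. The sketch below uses synthetic latencies (most around 200ms, roughly one in a hundred stalling at 8 seconds) and a simple nearest-rank percentile; your metrics library's percentile function will do the same job.

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile; fine for a dashboard, no interpolation."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Synthetic latencies: most requests are fast, ~1 in 100 stalls at 8s.
latencies = [random.gauss(0.2, 0.03) for _ in range(988)] + [8.0] * 12

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.2f}s")
# p50 looks healthy (~0.20s); p99 exposes the 8s requests the median hides.
```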
Users don't perceive throughput. They perceive the silence before the first token appears. TTFT is the metric that determines whether your app feels fast.
By the time you see the invoice, the damage is done. Real-time spend monitoring catches runaway costs before they compound.
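A minimal version of that monitoring, assuming you can see token counts per response: price each request as it completes and alert when spend in a sliding window crosses a budget. The prices, window, and budget below are placeholders for your own rates.

```python
import time
from collections import deque

# Assumed per-1K-token prices; substitute your provider's actual rates.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}

class SpendMonitor:
    """Track spend in a sliding window and flag runaway usage immediately,
    instead of discovering it on next month's invoice."""
    def __init__(self, window_s=3600, budget_usd=5.00):
        self.window_s = window_s
        self.budget_usd = budget_usd
        self.events = deque()  # (timestamp, cost)

    def record(self, input_tokens, output_tokens):
        cost = (input_tokens / 1000) * PRICE_PER_1K["input"] \
             + (output_tokens / 1000) * PRICE_PER_1K["output"]
        now = time.time()
        self.events.append((now, cost))
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        spend = sum(c for _, c in self.events)
        if spend > self.budget_usd:
            # Hook up real alerting here (Slack, PagerDuty, etc.).
            print(f"ALERT: ${spend:.2f} spent in the last window, budget ${self.budget_usd:.2f}")
        return cost

monitor = SpendMonitor(window_s=3600, budget_usd=5.00)
monitor.record(input_tokens=20_000, output_tokens=1_500)
```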
Your LLM bill is one number. Your product has twenty features. Without cost attribution, you're optimizing in the dark.
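A sketch of attribution under the same assumed prices: tag every request with the feature that issued it and aggregate per feature. The feature names and token counts below are invented for illustration.

```python
from collections import defaultdict

# Assumed per-1K-token prices; adjust to your provider.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}

spend_by_feature = defaultdict(float)

def record_usage(feature, input_tokens, output_tokens):
    """Attribute each request's cost to the product feature that made it."""
    cost = (input_tokens / 1000) * PRICE_PER_1K["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K["output"]
    spend_by_feature[feature] += cost
    return cost

# Example: three features sharing one API key and one invoice.
record_usage("summarize_thread", 12_000, 400)
record_usage("autocomplete", 600, 40)
record_usage("weekly_digest", 45_000, 2_000)

for feature, usd in sorted(spend_by_feature.items(), key=lambda kv: -kv[1]):
    print(f"{feature:>18}: ${usd:.4f}")
```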
Your API has rate limits. Your database has connection limits. Your LLM endpoints should have token limits. Here's how to add them without breaking production.
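One way to do it, sketched as a token-bucket limiter denominated in tokens per minute rather than requests per minute, so one enormous prompt can't starve everyone else. The 90K tokens-per-minute budget is an arbitrary example; callers that can't acquire should queue, shed load, or return a 429.

```python
import time
import threading

class TokenBudgetLimiter:
    """Token-bucket limiter measured in LLM tokens per minute."""
    def __init__(self, tokens_per_minute=90_000):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.refill_rate = tokens_per_minute / 60.0   # tokens per second
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self, estimated_tokens):
        with self.lock:
            now = time.monotonic()
            self.available = min(self.capacity,
                                 self.available + (now - self.updated) * self.refill_rate)
            self.updated = now
            if estimated_tokens <= self.available:
                self.available -= estimated_tokens
                return True
            return False   # caller should queue, shed, or return 429

limiter = TokenBudgetLimiter()
if not limiter.try_acquire(estimated_tokens=3_000):
    raise RuntimeError("token budget exhausted, try again shortly")
```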
A 128K context window doesn't mean you should use 128K tokens. Context is a budget with diminishing returns and escalating costs.
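A sketch of treating context as a budget: keep the system prompt plus only as many of the most recent turns as fit under a token cap. The chars-divided-by-four estimate is a rough placeholder; use your model's real tokenizer in practice.

```python
def fit_to_budget(system_prompt, history, budget_tokens,
                  count_tokens=lambda s: len(s) // 4):
    """Keep the system prompt plus as many recent turns as fit the budget.
    count_tokens defaults to a crude chars/4 heuristic."""
    used = count_tokens(system_prompt)
    kept = []
    for turn in reversed(history):           # newest turns first
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    return [system_prompt] + list(reversed(kept)), used

messages, used = fit_to_budget(
    "You are a support assistant.",
    ["user: my export failed", "assistant: which format?", "user: CSV, 2GB file"],
    budget_tokens=4_000,
)
print(used, messages)
```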
E2EL = TTFT + generation time sounds simple. But where does that time actually go? Understanding the equation reveals where to optimize.
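A back-of-envelope version of the equation with assumed numbers (not measurements): once the first token is out, generation time is roughly output length divided by decode throughput, so the two terms respond to different optimizations.

```python
# Assumed numbers for illustration only.
ttft_s = 0.45            # queueing + prefill + first token
output_tokens = 300
tokens_per_second = 60   # steady-state decode throughput

generation_s = (output_tokens - 1) / tokens_per_second
e2el_s = ttft_s + generation_s
print(f"E2EL ~ {e2el_s:.2f}s  (TTFT {ttft_s:.2f}s + generation {generation_s:.2f}s)")
# Shortening the output shrinks generation time; shrinking the prompt mostly moves TTFT.
```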