PeakInfer
Archives
105 posts organized by date.
2025 (105 posts)
December (9 posts)
Dec 31, 2025 - A Year of LLM Inference: Lessons Learned
Dec 27, 2025 - Matching the Right Model to Each Task
Dec 24, 2025 - What Happens When Your Primary Model Fails
Dec 20, 2025 - The Techniques That Actually Cut Costs
Dec 17, 2025 - Security Considerations for LLM Serving
Dec 13, 2025 - Running Multiple Customers on One GPU
Dec 10, 2025 - Where Speculative Decoding Actually Helps
Dec 6, 2025 - How Speculative Decoding Works
Dec 3, 2025 - The Formula for Offloading Decisions
November (9 posts)
Nov 29, 2025 - Reducing KV Cache Size Without Quality Loss
Nov 26, 2025 - Understanding What Your Model Attends To
Nov 22, 2025 - Extending Context Beyond Training Length
Nov 19, 2025 - Testing Fine-tuned Model Quality
Nov 15, 2025 - How Much Data You Actually Need
Nov 12, 2025 - Switching LoRA Adapters at Runtime
Nov 8, 2025 - Deploying and Serving Fine-tuned Models
Nov 5, 2025 - The Real Cost: Fine-tuning vs Prompting
Nov 1, 2025 - What Actually Works with LoRA
October (9 posts)
Oct 29, 2025 - Running Fine-tuned Models in Production
Oct 25, 2025 - When LoRA Makes Sense
Oct 22, 2025 - Why Tokens at Position 50K Get Ignored
Oct 18, 2025 - Trading Full Context for Speed
Oct 15, 2025 - When to Use Self-Attention vs Cross-Attention
Oct 11, 2025 - Getting 95% Quality at 12% Cost
Oct 8, 2025 - Why 128K Context Doesn't Mean 128K Useful
Oct 4, 2025 - Knowing If You're Memory or Compute Limited
Oct 1, 2025 - Tensor vs Pipeline Parallelism: When Each Wins
September (8 posts)
Sep 27, 2025 - Adding GPUs Without Linear Speedup
Sep 24, 2025 - Memory Planning for Multi-GPU Deployments
Sep 20, 2025 - Mapping Quality Against Cost
Sep 17, 2025 - How to Catch Quality Regressions
Sep 13, 2025 - When to Use LLM-as-Judge
Sep 10, 2025 - Treating Evals as Non-Negotiable Constraints
Sep 6, 2025 - How the Big Labs Actually Do Evals
Sep 3, 2025 - Building Evals That Catch Real Problems
August (9 posts)
Aug 30, 2025 - Evaluating Millions of LLM Responses
Aug 27, 2025 - What to Monitor in LLM Systems
Aug 23, 2025 - Degrading Gracefully Under Load
Aug 20, 2025 - Using Rate Limits to Control Spend
Aug 16, 2025 - Starting Cheap and Escalating When Needed
Aug 13, 2025 - Deciding Which Model Handles Each Request
Aug 9, 2025 - Managing Model Versions Without Downtime
Aug 6, 2025 - Safe Rollouts for LLM Changes
Aug 2, 2025 - What Production LLM Systems Need to Survive
July (9 posts)
Jul 30, 2025 - Attention That Fits in Memory
Jul 26, 2025 - Finding Memory Leaks in LLM Serving
Jul 23, 2025 - The Performance Wins from Fusing Kernels
Jul 19, 2025 - What Flash Attention Actually Does
Jul 16, 2025 - Testing Quality After Quantization
Jul 12, 2025 - When to Use FP8 for Inference
Jul 9, 2025 - How Much Quality Loss Is Acceptable
Jul 5, 2025 - When to Use AWQ vs GPTQ
Jul 2, 2025 - Compressing the Cache Not Just the Model
June (8 posts)
Jun 28, 2025 - Calculating If Quantization Pays Off
Jun 25, 2025 - Choosing the Right Precision for Your Model
Jun 21, 2025 - Quantizing Without Breaking Your App
Jun 18, 2025 - Taking PyTorch Models to Production
Jun 14, 2025 - The GPU Memory Techniques That Actually Scale
Jun 11, 2025 - When to Move Data Off the GPU
Jun 7, 2025 - Finding the KV Cache Problem Before Your Bill Does
Jun 4, 2025 - The Cache That Makes LLMs Possible
May (9 posts)
May 31, 2025 - What Senior Engineers Know About GPU Memory
May 28, 2025 - The Checklist Before You Deploy
May 24, 2025 - Evaluating Custom Inference Hardware
May 21, 2025 - Why Your GPU Utilization Numbers Lie
May 17, 2025 - Using Spot Instances for Inference Workloads
May 14, 2025 - Cost Per Token Across Hardware Options
May 10, 2025 - The Costs You're Not Tracking
May 7, 2025 - Understanding Inference Platform Economics
May 3, 2025 - Using Multiple Providers to Cut Costs
April (9 posts)
Apr 30, 2025 - What TPU Economics Look Like in Practice
Apr 26, 2025 - H100 vs A100: Which One for Inference
Apr 23, 2025 - Making the GPU vs TPU Decision
Apr 19, 2025 - Understanding What Makes vLLM Fast
Apr 16, 2025 - Choosing Between vLLM, SGLang, and TensorRT
Apr 12, 2025 - The Math on Self-Hosting vs API
Apr 9, 2025 - When Self-Hosting Actually Saves Money
Apr 5, 2025 - Designing Queues That Don't Explode
Apr 2, 2025 - Tuning Batch Size for Your Workload
March (9 posts)
Mar 29, 2025 - Implementing Request Priority in LLM Serving
Mar 26, 2025 - Balancing Fast Responses and Fair Queuing
Mar 22, 2025 - Why Token Count Matters More Than Request Count
Mar 19, 2025 - Managing Load Without Dropping Requests
Mar 15, 2025 - The Tradeoff Every Inference System Makes
Mar 12, 2025 - How vLLM Serves 10x More Requests
Mar 8, 2025 - Moving Beyond Simple Request Batching
Mar 5, 2025 - What Changes When 100 Users Hit Your LLM
Mar 1, 2025 - How Failed Requests Inflate Your Bill
February (8 posts)
Feb 26, 2025 - Separating Real Speedups from Benchmarketing
Feb 22, 2025 - Choosing Benchmarks That Predict Production
Feb 19, 2025 - The Latency You're Not Measuring
Feb 15, 2025 - The Streaming Bug That Costs You 3 Seconds
Feb 12, 2025 - Prefill vs Decode: The Two Phases That Shape Latency
Feb 8, 2025 - What P99 Latency Tells You That P50 Hides
Feb 5, 2025 - Why First Token Latency Determines User Experience
Feb 1, 2025 - Catching Cost Spikes Before Month-End
January (9 posts)
Jan 29, 2025 - Knowing Which Feature Burns Money
Jan 25, 2025 - Adding Token Budgets to Your Deploy Process
Jan 22, 2025 - How to Think About Context as a Budget
Jan 18, 2025 - Calculating End-to-End Latency Correctly
Jan 15, 2025 - Why Your System Prompt Costs $50K/Month
Jan 11, 2025 - Why Doubling Context Quadruples Your Problems
Jan 8, 2025 - Why Output Tokens Cost 4x More Than Input
Jan 4, 2025 - Why Streaming Breaks and How to Fix It
Jan 1, 2025 - Four Metrics That Actually Matter for LLM Inference