When Self-Hosting Actually Saves Money
Restaurants can buy bread or bake it. Small cafes buy. Large chains with consistent volume bake their own. The decision isn't about capability—it's about scale and predictability.
Self-hosting LLMs follows the same economics. The question isn't "can we?" It's "should we, given our volume and team?"
The $30K/Month Threshold
Below $30K/month in API spend, self-hosting rarely makes sense:
```python
def should_self_host(monthly_api_spend: float) -> dict:
    # Infrastructure costs (conservative)
    gpu_cost = 8_000           # 2x H100, reserved pricing
    engineering_cost = 25_000  # 0.5-1 FTE to maintain
    ops_overhead = 5_000       # Monitoring, on-call, misc
    self_host_cost = gpu_cost + engineering_cost + ops_overhead

    return {
        "api_cost": monthly_api_spend,
        "self_host_cost": self_host_cost,
        "savings": monthly_api_spend - self_host_cost,
        "recommendation": "self-host" if monthly_api_spend > self_host_cost else "use API",
    }

# The math, with self-hosting at $38K/month all-in:
#   $30K API spend:  -$8K/month (lose money)
#   $50K API spend: +$12K/month (save money)
#  $100K API spend: +$62K/month (significant savings)
```
The Engineering Cost Nobody Counts
Self-hosting isn't "deploy and forget":
```python
class SelfHostingReality:
    def __init__(self):
        self.obvious_costs = {
            "gpu_compute": "What you planned for",
            "networking": "Usually underestimated",
            "storage": "Checkpoints add up",
        }
        self.hidden_costs = {
            "on_call": "GPU failures at 3am",
            "model_updates": "New versions need testing",
            "security_patches": "CUDA, drivers, OS",
            "capacity_planning": "Scaling isn't automatic",
            "debugging": "Why is latency high today?",
        }

    def true_eng_time_per_month(self) -> str:
        # Per-engineer estimate
        return """
        Incidents:    10-20 hours
        Updates:       8-16 hours
        Optimization: 10-20 hours
        Planning:      5-10 hours
        Total: 33-66 hours/month = 0.2-0.4 FTE minimum
        """
```
The Comparison Framework
```python
def total_cost_of_ownership(
    monthly_tokens: int,
    api_price_per_million: float,
    growth_rate: float,  # Monthly
) -> dict:
    # API costs (scale linearly with volume)
    monthly_api = monthly_tokens * api_price_per_million / 1_000_000
    year_one_api = sum(
        monthly_api * (1 + growth_rate) ** month for month in range(12)
    )

    # Self-host costs
    setup_cost = 50_000     # Engineering time to get running
    monthly_infra = 15_000  # GPUs, networking, storage
    monthly_eng = 20_000    # Ongoing maintenance (0.5 FTE + partial others)
    year_one_self_host = setup_cost + (monthly_infra + monthly_eng) * 12

    # Break-even: the first month cumulative API spend exceeds cumulative
    # self-host spend (None if the lines never cross within five years)
    break_even_month = None
    api_cumulative = 0.0
    for month in range(60):
        api_cumulative += monthly_api * (1 + growth_rate) ** month
        self_host_cumulative = setup_cost + (monthly_infra + monthly_eng) * (month + 1)
        if api_cumulative > self_host_cumulative:
            break_even_month = month + 1
            break

    return {
        "api_year_one": year_one_api,
        "self_host_year_one": year_one_self_host,
        "break_even_month": break_even_month,
    }
```
When Self-Hosting Makes Sense
Clear yes:
- Predictable, high volume: 100M+ tokens/day, consistent
- Existing ML team: Already have GPU expertise
- Latency requirements: Need <100ms and APIs can't deliver
- Data sovereignty: Data can't leave your infrastructure
- Customization: Fine-tuned models, specialized serving
Clear no:
- Spiky traffic: 10x variation day to day
- Small team: No one dedicated to ML infra
- Rapid experimentation: Trying different models weekly
- Low volume: Under $30K/month in API costs
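The yes/no lists can be folded into a rough scorecard. A sketch, not a decision engine: the criterion names and weights here are our own, with the hard blockers weighted heaviest.

```python
# Rough scorecard over the criteria above. Positive weights favor
# self-hosting; negative weights favor the API. Tune to taste.
CRITERIA = {
    "predictable_high_volume": +2,
    "existing_ml_team": +1,
    "strict_latency_needs": +1,
    "data_sovereignty": +2,
    "needs_customization": +1,
    "spiky_traffic": -2,
    "no_dedicated_infra_team": -2,
    "rapid_model_churn": -1,
    "under_30k_monthly_spend": -2,
}

def self_host_score(signals: set[str]) -> int:
    return sum(weight for name, weight in CRITERIA.items() if name in signals)

# Example: high volume and sovereignty needs, but a small team
score = self_host_score({
    "predictable_high_volume", "data_sovereignty", "no_dedicated_infra_team",
})
print(score)  # 2 + 2 - 2 = 2 -> leans self-host, but staff up first
```

A score near zero is itself informative: it usually means the hybrid path below, not a coin flip.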
The Hybrid Path
Most successful teams don't go all-in:
```python
from dataclasses import dataclass

@dataclass
class Request:
    is_batch_job: bool = False
    requires_latest_model: bool = False
    latency_critical: bool = False
    in_region: bool = False

class HybridStrategy:
    def route_request(self, request: Request) -> str:
        if request.is_batch_job:
            return "self_hosted"  # Predictable, high volume
        if request.requires_latest_model:
            return "api"          # APIs get the newest models first
        if request.latency_critical and request.in_region:
            return "self_hosted"  # Control the latency
        return "api"              # Default to API
```
Self-host your predictable baseline. Use APIs for peaks and experiments.
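Sizing that split is the practical question: provision self-hosted capacity for the steady floor of demand and let the API absorb everything above it. A minimal sketch, assuming a daily token series and treating the minimum daily demand as the baseline (a low percentile works too); the numbers are illustrative:

```python
# Split daily demand into a self-hosted baseline and API overflow.
daily_tokens = [80, 85, 90, 200, 95, 82, 300]  # Millions/day, illustrative

baseline = min(daily_tokens)  # Capacity to provision yourself
overflow = [max(0, d - baseline) for d in daily_tokens]

self_hosted_share = baseline * len(daily_tokens) / sum(daily_tokens)
print(f"Baseline {baseline}M/day covers {self_hosted_share:.0%} of tokens; "
      f"API absorbs the {sum(overflow)}M overflow")
```

Even with two 2-3x spike days in the sample week, the fixed baseline still serves a majority of tokens, which is exactly the predictable volume where self-hosting wins.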
The right question isn't "build or buy?" It's "what's worth building given what we'd need to maintain?"