
When Self-Hosting Actually Saves Money

Restaurants can buy bread or bake it. Small cafes buy. Large chains with consistent volume bake their own. The decision isn't about capability—it's about scale and predictability.

Self-hosting LLMs follows the same economics. The question isn't "can we?" It's "should we, given our volume and team?"

The $30K/Month Threshold

Below $30K/month in API spend, self-hosting rarely makes sense. In the rough model below, the all-in cost of self-hosting lands around $38K/month:

def should_self_host(monthly_api_spend: float) -> dict:
    """Rough monthly comparison of API spend vs. self-hosting cost."""
    # Infrastructure costs (conservative)
    gpu_cost = 8_000           # 2x H100, reserved pricing
    engineering_cost = 25_000  # 0.5-1 FTE to maintain
    ops_overhead = 5_000       # Monitoring, on-call, misc

    self_host_cost = gpu_cost + engineering_cost + ops_overhead  # $38K/month

    return {
        "api_cost": monthly_api_spend,
        "self_host_cost": self_host_cost,
        "savings": monthly_api_spend - self_host_cost,
        "recommendation": "self-host" if monthly_api_spend > self_host_cost else "use API",
    }

# The math:
# $30K API spend: -$8K/month (lose money)
# $50K API spend: +$12K/month (save money)
# $100K API spend: +$62K/month (significant savings)

The Engineering Cost Nobody Counts

Self-hosting isn't "deploy and forget":

class SelfHostingReality:
    def __init__(self):
        self.obvious_costs = {
            "gpu_compute": "What you planned for",
            "networking": "Usually underestimated",
            "storage": "Checkpoints add up",
        }

        self.hidden_costs = {
            "on_call": "GPU failures at 3am",
            "model_updates": "New versions need testing",
            "security_patches": "CUDA, drivers, OS",
            "capacity_planning": "Scaling isn't automatic",
            "debugging": "Why is latency high today?",
        }

    def true_eng_time_per_month(self) -> str:
        # Per engineer estimate
        return """
        Incidents: 10-20 hours
        Updates: 8-16 hours
        Optimization: 10-20 hours
        Planning: 5-10 hours

        Total: 33-66 hours/month = 0.2-0.4 FTE minimum
        """

The Comparison Framework

def total_cost_of_ownership(
    monthly_tokens: int,
    api_price_per_million: float,
    growth_rate: float,  # Monthly, e.g. 0.05 for 5%
) -> dict:

    # API costs (scale with volume)
    year_one_api = sum(
        monthly_tokens * api_price_per_million / 1_000_000 * (1 + growth_rate) ** month
        for month in range(12)
    )

    # Self-host costs (roughly flat)
    setup_cost = 50_000     # Engineering time to get running
    monthly_infra = 15_000  # GPUs, networking, storage
    monthly_eng = 20_000    # Ongoing maintenance (0.5 FTE + partial others)

    year_one_self_host = setup_cost + (monthly_infra + monthly_eng) * 12

    return {
        "api_year_one": year_one_api,
        "self_host_year_one": year_one_self_host,
        "break_even_point": "Calculate when lines cross",
    }

When Self-Hosting Makes Sense

Clear yes:

  • Predictable, high volume: 100M+ tokens/day, consistent
  • Existing ML team: Already have GPU expertise
  • Latency requirements: Need <100ms and APIs can't deliver
  • Data sovereignty: Data can't leave your infrastructure
  • Customization: Fine-tuned models, specialized serving

Clear no:

  • Spiky traffic: 10x variation day to day
  • Small team: No one dedicated to ML infra
  • Rapid experimentation: Trying different models weekly
  • Low volume: Under $30K/month in API costs
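One way to encode these criteria as a first-pass filter. The thresholds and argument names are mine, lifted from the lists above; treat it as a sketch, not a decision engine:

```python
def self_host_signal(
    monthly_api_spend: float,
    tokens_per_day_millions: float,
    traffic_peak_to_trough: float,   # e.g. 10.0 for 10x daily swings
    has_ml_infra_team: bool,
    needs_data_sovereignty: bool,
) -> str:
    # Hard disqualifiers from the "clear no" list
    if monthly_api_spend < 30_000:
        return "use API"
    if traffic_peak_to_trough >= 10 or not has_ml_infra_team:
        return "use API"
    # Strong signals from the "clear yes" list
    if tokens_per_day_millions >= 100 or needs_data_sovereignty:
        return "self-host"
    return "run the TCO numbers"

print(self_host_signal(50_000, 150, 2, True, False))  # -> self-host
print(self_host_signal(20_000, 150, 2, True, False))  # -> use API
```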

The Hybrid Path

Most successful teams don't go all-in:

from dataclasses import dataclass

@dataclass
class Request:
    is_batch_job: bool = False
    requires_latest_model: bool = False
    latency_critical: bool = False
    in_region: bool = False

class HybridStrategy:
    def route_request(self, request: Request) -> str:
        if request.is_batch_job:
            return "self_hosted"  # Predictable, high volume

        if request.requires_latest_model:
            return "api"  # API providers ship newest models first

        if request.latency_critical and request.in_region:
            return "self_hosted"  # Control the latency

        return "api"  # Default to API

Self-host your predictable baseline. Use APIs for peaks and experiments.
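Sizing that baseline can be as simple as provisioning self-hosted capacity near the floor of your daily traffic and overflowing the rest to APIs. A sketch, with the 20th percentile as an arbitrary (and mine, not the author's) choice of floor:

```python
def split_capacity(daily_tokens: list[float], baseline_pct: float = 0.2) -> dict:
    """Size self-hosted capacity at roughly the traffic floor
    (here: the 20th percentile of daily volume) and send the
    overflow above that line to APIs."""
    ordered = sorted(daily_tokens)
    idx = int(baseline_pct * (len(ordered) - 1))
    baseline = ordered[idx]
    self_hosted = sum(min(t, baseline) for t in daily_tokens)
    overflow = sum(max(0.0, t - baseline) for t in daily_tokens)
    return {
        "baseline_capacity": baseline,     # tokens/day to provision
        "self_hosted_tokens": self_hosted,
        "api_tokens": overflow,
    }

# Millions of tokens/day over a week, with one 3x spike
traffic = [80, 100, 90, 300, 110, 95, 85]
print(split_capacity(traffic))
# -> {'baseline_capacity': 85, 'self_hosted_tokens': 590, 'api_tokens': 270}
```

The spike day (300M tokens) mostly overflows to the API, so you never pay for idle GPUs sized to your worst day.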

The right question isn't "build or buy?" It's "what's worth building given what we'd need to maintain?"