
Catching Cost Spikes Before Month-End

The invoice arrives on the 3rd. By then, you've already spent a month burning money you didn't intend to spend. A bug that doubled token usage on January 5th runs unchecked for nearly four weeks before anyone notices.

This is why real-time spend monitoring exists.

The Rolling Average Approach

Simple threshold alerts don't work for LLM costs. Usage varies by time of day, day of week, and season. A fixed threshold of "$500/hour" triggers constantly during peaks and misses problems during valleys.

Better approach: compare current spend to a rolling average.

from collections import deque
from datetime import datetime

class SpendMonitor:
    def __init__(self, window_hours: int = 168):  # Keep 7 days of hourly samples
        self.hourly_costs = deque(maxlen=window_hours)
        self.alert_threshold = 2.0  # Alert if 2x above baseline

    def record_hourly_cost(self, cost: float, hour: datetime):
        self.hourly_costs.append((hour, cost))

    def get_baseline(self, hour: datetime) -> float:
        # Average cost for this hour of day across the window,
        # so 2pm is compared to past 2pms rather than to 3am.
        same_hour = [c for h, c in self.hourly_costs if h.hour == hour.hour]
        if not same_hour:
            return 0.0
        return sum(same_hour) / len(same_hour)

    def check_anomaly(self, current_cost: float, hour: datetime) -> bool:
        if len(self.hourly_costs) < 24:  # Need at least a day of data
            return False
        baseline = self.get_baseline(hour)
        return baseline > 0 and current_cost > baseline * self.alert_threshold

If today's 2pm costs 2x more than the average 2pm over the past week, something changed. Maybe it's a feature launch. Maybe it's a bug. Either way, you want to know.
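
A minimal usage sketch, assuming hourly costs are pushed in from whatever metering or billing pipeline you already have (the numbers here are made up):

from datetime import datetime, timedelta, timezone

monitor = SpendMonitor()

# Backfill the window with a week of historical hourly costs.
now = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)
for i in range(168, 0, -1):
    monitor.record_hourly_cost(cost=12.0, hour=now - timedelta(hours=i))

# At the top of each hour, check the hour that just closed.
if monitor.check_anomaly(current_cost=31.0, hour=now):
    print("Spend anomaly: this hour is 2x+ its same-hour baseline")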

Layered Alerts

One threshold isn't enough. Different severities need different responses:

ALERT_LEVELS = {
    "warning": {
        "threshold": 1.5,  # 50% above baseline
        "channel": "slack-cost-alerts",
        "action": "notify"
    },
    "critical": {
        "threshold": 2.5,  # 150% above baseline
        "channel": "pagerduty",
        "action": "page-oncall"
    },
    "emergency": {
        "threshold": 5.0,  # 400% above baseline
        "channel": "pagerduty",
        "action": "auto-rate-limit"
    }
}

The emergency tier can automatically engage rate limiting. When costs spike 5x, waiting for a human to respond might cost thousands.
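
One way to consume these tiers is a small evaluator that picks the most severe level the current spend trips; the function and return shape below are illustrative, not a prescribed API:

def evaluate_alert(current_cost: float, baseline: float) -> dict | None:
    # Return the most severe alert level tripped by current spend, if any.
    if baseline <= 0:
        return None
    ratio = current_cost / baseline
    tripped = [
        {"level": name, "ratio": ratio, **config}
        for name, config in ALERT_LEVELS.items()
        if ratio >= config["threshold"]
    ]
    if not tripped:
        return None
    return max(tripped, key=lambda alert: alert["threshold"])

# evaluate_alert(current_cost=300.0, baseline=100.0)
# -> {"level": "critical", "ratio": 3.0, "threshold": 2.5,
#     "channel": "pagerduty", "action": "page-oncall"}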

Per-Feature Monitoring

Aggregate spend alerts catch big problems but miss feature-specific issues. A bug in one feature might only cause a 20% overall increase while that feature's costs spike 10x.

def check_feature_anomalies(current_hour_costs: dict[str, float]):
    alerts = []
    for feature_id, cost in current_hour_costs.items():
        baseline = get_feature_baseline(feature_id)
        if baseline == 0:
            continue

        ratio = cost / baseline
        if ratio > 2.0:
            alerts.append({
                "feature": feature_id,
                "current": cost,
                "baseline": baseline,
                "ratio": ratio
            })

    return sorted(alerts, key=lambda x: x["ratio"], reverse=True)

When the summarization feature suddenly costs 8x its baseline while everything else is normal, you know exactly where to look.
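
get_feature_baseline is left undefined above. One possible shape, assuming per-feature hourly costs are kept in a simple in-memory history (swap in whatever store your metering pipeline already uses):

from collections import defaultdict, deque

# Last 7 days of hourly costs per feature, filled by your metering pipeline.
feature_history: dict[str, deque] = defaultdict(lambda: deque(maxlen=168))

def record_feature_cost(feature_id: str, cost: float):
    feature_history[feature_id].append(cost)

def get_feature_baseline(feature_id: str) -> float:
    history = feature_history[feature_id]
    if not history:
        return 0.0  # No history yet; the caller skips this feature
    return sum(history) / len(history)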

The Month-End Forecast

Beyond anomaly detection, forecasting prevents end-of-month surprises:

def forecast_monthly_spend(daily_costs: list[float]) -> dict:
    days_elapsed = len(daily_costs)
    if days_elapsed == 0:
        raise ValueError("Need at least one day of costs to forecast")
    total_so_far = sum(daily_costs)
    daily_avg = total_so_far / days_elapsed

    days_in_month = 30  # Simplification
    projected = daily_avg * days_in_month

    return {
        "current_total": total_so_far,
        "daily_average": daily_avg,
        "projected_monthly": projected,
        "days_remaining": days_in_month - days_elapsed
    }

On the 10th of the month, if your projected spend is 40% over budget, you have 20 days to fix it instead of discovering it on the invoice.
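
For example, with ten days of made-up daily costs and a hypothetical $6,500 monthly budget:

daily_costs = [210, 195, 230, 220, 480, 460, 250, 240, 255, 500]
forecast_monthly_spend(daily_costs)
# {'current_total': 3040, 'daily_average': 304.0,
#  'projected_monthly': 9120.0, 'days_remaining': 20}
# Against a $6,500 budget, that projection is ~40% over, with 20 days left to react.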

What Actually Triggers Spikes

In practice, cost spikes come from predictable sources:

Runaway retries: A downstream service starts timing out. Retry logic kicks in, and each retry costs tokens. If every failed request burns all 3 of its retries, a 5% failure rate means 15% extra spend, but a 50% failure rate means 150% extra spend.

Prompt regression: Someone adds "please be thorough" to a system prompt. Output length doubles, and the output-token bill doubles with it.

Traffic surge: A feature goes viral. Good problem, but still needs detection.

Context accumulation: Conversation memory grows without bounds. Day 1: 500 tokens. Day 30: 50,000 tokens per message.

Each of these has a different signature in the data. Retries show up as increased request count with higher failure rates. Prompt regressions show up as increased output tokens per request. Traffic surges show up as request count spikes. Context accumulation shows up as steadily increasing input tokens per request.
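
Those signatures are mechanical enough to check automatically. A rough heuristic sketch, where the metric names and thresholds are illustrative assumptions rather than a tuned detector:

def classify_spike(current: dict, baseline: dict) -> str:
    # Both dicts are assumed to hold: requests, failure_rate,
    # input_tokens_per_request, output_tokens_per_request.
    def ratio(key: str) -> float:
        return current[key] / baseline[key] if baseline[key] else float("inf")

    if current["failure_rate"] > 2 * baseline["failure_rate"] and ratio("requests") > 1.2:
        return "runaway-retries"       # more requests, more failures
    if ratio("output_tokens_per_request") > 1.5:
        return "prompt-regression"     # same traffic, longer outputs
    if ratio("input_tokens_per_request") > 1.5:
        return "context-accumulation"  # inputs growing request over request
    if ratio("requests") > 1.5:
        return "traffic-surge"         # everything scales with request count
    return "unknown"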

The Dashboard That Matters

Most cost dashboards show total spend over time. More useful:

  • Spend per request (are individual requests getting more expensive?)
  • Spend per user (is one user or customer driving costs?)
  • Spend per feature (which product area is growing?)
  • Spend vs baseline (is today abnormal?)

A simple total doesn't tell you where to look. These breakdowns do.
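
The first three views fall out of the same raw request log (spend vs baseline comes from the monitor above). A minimal sketch, assuming each record carries a cost plus user and feature identifiers; the field names are illustrative:

from collections import defaultdict

def spend_breakdowns(request_log: list[dict]) -> dict:
    per_user: dict[str, float] = defaultdict(float)
    per_feature: dict[str, float] = defaultdict(float)
    total = 0.0
    for r in request_log:
        total += r["cost"]
        per_user[r["user_id"]] += r["cost"]
        per_feature[r["feature_id"]] += r["cost"]
    return {
        "spend_per_request": total / len(request_log) if request_log else 0.0,
        "spend_per_user": dict(per_user),
        "spend_per_feature": dict(per_feature),
    }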

The goal isn't zero surprises. It's catching surprises on day 1 instead of day 30.