Adding Token Budgets to Your Deploy Process
In 2023, a startup's GPT-4 bill went from $3,000 to $150,000 in one weekend. A runaway feature was concatenating conversation history without limits. Each request used more context than the last. By Sunday, some requests were hitting 100K tokens.
No alerts fired. The feature was working exactly as coded. It just wasn't working as intended.
Token budgets are the circuit breakers you didn't know you needed.
The Problem with Unlimited Tokens
Most LLM integrations start simple: take user input, add system prompt, call API, return response. No limits.
# How most codebases start
def chat(user_message):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )
This works until:
- A user pastes a 50-page document
- A bug causes recursive context expansion
- A prompt injection tricks your system into verbose output
- Product ships a feature that nobody tested with long inputs
You don't have rate limits on tokens. You have rate limits on requests. Ten requests each using 50K tokens count the same against those limits as ten requests using 500 tokens each. But they cost 100x as much.
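Back-of-the-envelope, with a hypothetical per-1K-token price (the exact rate depends on the model; the ratio is what matters):

# Hypothetical input price -- the absolute number doesn't matter, the ratio does
PRICE_PER_1K_INPUT = 0.03

normal = 10 * 500      # ten small requests:   5,000 input tokens
runaway = 10 * 50_000  # ten bloated requests: 500,000 input tokens

print(normal / 1000 * PRICE_PER_1K_INPUT)   # 0.15
print(runaway / 1000 * PRICE_PER_1K_INPUT)  # 15.0 -- same request count, 100x the cost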
Token Budgets by Feature
Different features have different token needs. Treat them differently.
TOKEN_BUDGETS = {
    "chat": {
        "max_input": 4000,
        "max_output": 1000,
        "model": "gpt-4"
    },
    "summarize": {
        "max_input": 16000,
        "max_output": 500,
        "model": "gpt-4-turbo"
    },
    "code_review": {
        "max_input": 32000,
        "max_output": 2000,
        "model": "gpt-4-turbo"
    },
    "quick_answer": {
        "max_input": 1000,
        "max_output": 200,
        "model": "gpt-3.5-turbo"
    }
}
Each feature gets:
- Input limit: maximum tokens you'll send
- Output limit: the max_tokens parameter, which hard-caps the response
- Model assignment: expensive models for high-value features
Quick answers don't need GPT-4. Code review does. Encoding this in configuration makes costs predictable.
Enforcement
Budget enforcement happens before the API call:
import tiktoken

class TokenBudgetExceeded(Exception):
    """Raised when a request's input exceeds its feature budget."""
    pass

def enforce_budget(feature: str, messages: list[dict]) -> list[dict]:
    budget = TOKEN_BUDGETS[feature]
    enc = tiktoken.encoding_for_model(budget["model"])
    total_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    if total_tokens > budget["max_input"]:
        # Option 1: Truncate
        # Option 2: Summarize older context
        # Option 3: Reject with error
        raise TokenBudgetExceeded(
            f"Input {total_tokens} exceeds budget {budget['max_input']}"
        )
    return messages
def call_llm(feature: str, messages: list[dict]):
    budget = TOKEN_BUDGETS[feature]
    enforced = enforce_budget(feature, messages)
    return openai.chat.completions.create(
        model=budget["model"],
        messages=enforced,
        max_tokens=budget["max_output"]  # Output capped
    )
The max_tokens parameter does the output enforcement for you. Input enforcement you do yourself.
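Callers then go through call_llm and handle the rejection. A minimal sketch, using a hypothetical handle_chat_request entry point:

def handle_chat_request(user_message: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    try:
        response = call_llm("chat", messages)
        return response.choices[0].message.content
    except TokenBudgetExceeded:
        # Fail loudly and cheaply instead of silently sending an oversized request
        return "That message is too long for chat. Try trimming it down."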
Graceful Degradation
Hard rejection is one option. Graceful degradation is often better:
def degrade_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """
    Trim older messages to fit budget.
    Keep system prompt and recent turns.
    """
    enc = tiktoken.encoding_for_model("gpt-4")

    # System prompt is non-negotiable
    system = messages[0]
    system_tokens = len(enc.encode(system["content"]))
    remaining = max_tokens - system_tokens
    kept = [system]

    # Iterate from most recent, keep what fits
    for msg in reversed(messages[1:]):
        msg_tokens = len(enc.encode(msg["content"]))
        if msg_tokens <= remaining:
            kept.insert(1, msg)  # After system, in order
            remaining -= msg_tokens
        else:
            # Could truncate this message, or skip
            break

    return kept
Users get responses with less context rather than error messages. Quality might drop. But they're not blocked.
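One way to wire this in is a softer variant of enforce_budget that trims instead of raising. A sketch, reusing the pieces above (enforce_budget_soft is a hypothetical name):

def enforce_budget_soft(feature: str, messages: list[dict]) -> list[dict]:
    budget = TOKEN_BUDGETS[feature]
    enc = tiktoken.encoding_for_model(budget["model"])
    total_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    if total_tokens > budget["max_input"]:
        # Trim older context to fit instead of rejecting the request
        return degrade_to_budget(messages, budget["max_input"])
    return messages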
Per-User Budgets
Feature budgets protect against bugs. User budgets protect against abuse.
class UserTokenTracker:
    def __init__(self, redis_client):
        self.redis = redis_client

    def can_spend(self, user_id: str, tokens: int) -> bool:
        key = f"tokens:{user_id}:{today()}"
        current = int(self.redis.get(key) or 0)
        limit = self.get_user_limit(user_id)
        return current + tokens <= limit

    def spend(self, user_id: str, tokens: int):
        key = f"tokens:{user_id}:{today()}"
        self.redis.incrby(key, tokens)
        self.redis.expire(key, 86400)  # 24h TTL

    def get_user_limit(self, user_id: str) -> int:
        # Free tier: 10K tokens/day
        # Pro: 100K tokens/day
        # Enterprise: 1M tokens/day
        tier = get_user_tier(user_id)
        return {"free": 10_000, "pro": 100_000, "enterprise": 1_000_000}[tier]
Daily limits reset automatically. Users see their remaining budget. Abuse gets throttled before it hurts your bill.
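In the request path, the tracker sits in front of the feature budget. A sketch, assuming a shared tracker instance and the call_llm defined earlier; the pre-check uses the feature's worst case, and the actual spend comes from the API's usage field:

tracker = UserTokenTracker(redis_client)  # assumed shared instance

def call_llm_for_user(user_id: str, feature: str, messages: list[dict]):
    budget = TOKEN_BUDGETS[feature]

    # Pre-check against the worst case: full input budget plus full output budget
    worst_case = budget["max_input"] + budget["max_output"]
    if not tracker.can_spend(user_id, worst_case):
        raise TokenBudgetExceeded(f"Daily token limit reached for user {user_id}")

    response = call_llm(feature, messages)

    # Charge the user for what was actually consumed, not the estimate
    tracker.spend(user_id, response.usage.total_tokens)
    return response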
Deployment Safeguards
Add these before shipping any LLM feature:
Global rate limit: Max tokens per minute across all users. Prevents bill explosion from any source.
GLOBAL_TPM_LIMIT = 1_000_000  # 1M tokens per minute max

def check_global_limit(tokens: int) -> bool:
    key = f"global:tokens:{current_minute()}"
    current = int(redis.get(key) or 0)
    return current + tokens <= GLOBAL_TPM_LIMIT
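The check only reads the counter, so something has to write it after each call. A minimal sketch of the recording side (record_global_tokens is a hypothetical name):

def record_global_tokens(tokens: int):
    key = f"global:tokens:{current_minute()}"
    redis.incrby(key, tokens)
    redis.expire(key, 120)  # let stale per-minute counters expire on their own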
Cost alerts: Notify when hourly spend exceeds a threshold.
def track_cost(tokens_in: int, tokens_out: int, model: str):
    cost = calculate_cost(tokens_in, tokens_out, model)
    hour_key = f"cost:{current_hour()}"
    redis.incrbyfloat(hour_key, cost)
    if float(redis.get(hour_key) or 0) > HOURLY_COST_ALERT:
        send_alert(f"Hourly LLM spend exceeded ${HOURLY_COST_ALERT}")
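calculate_cost is just a price-table lookup. The rates below are placeholders, not current pricing; keep the real numbers from your provider's pricing page in one table so updates stay a one-line change:

# Placeholder (input, output) prices per 1K tokens -- substitute real rates
PRICE_PER_1K = {
    "gpt-4": (0.03, 0.06),
    "gpt-4-turbo": (0.01, 0.03),
    "gpt-3.5-turbo": (0.0005, 0.0015),
}

def calculate_cost(tokens_in: int, tokens_out: int, model: str) -> float:
    price_in, price_out = PRICE_PER_1K[model]
    return (tokens_in / 1000) * price_in + (tokens_out / 1000) * price_out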
Feature flags: Ability to disable features without a deploy.
def is_feature_enabled(feature: str) -> bool:
    return redis.get(f"feature:{feature}:enabled") == "true"
When something goes wrong, you can cut specific features immediately.
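A guard at the top of the call path is what makes that kill switch effective; a sketch, with a hypothetical FeatureDisabled error:

class FeatureDisabled(Exception):
    pass

def call_llm_guarded(feature: str, messages: list[dict]):
    # Refuse disabled features before spending a single token
    if not is_feature_enabled(feature):
        raise FeatureDisabled(f"{feature} is currently disabled")
    return call_llm(feature, messages)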
What Token Budgets Buy You
With budgets in place:
- A runaway bug hits the per-feature limit, not your credit limit
- Cost per feature becomes predictable and measurable
- Users on free tiers can't abuse their way to enterprise bills
- You can disable expensive features without taking down the whole system
Without budgets, you're trusting that nobody will paste a novel, no bug will cause exponential context growth, and no user will discover that your API is essentially a free GPT-4 proxy.
That's not a bet I'd take.