Adding Token Budgets to Your Deploy Process
In 2023, a startup's GPT-4 bill went from $3,000 to $150,000 in one weekend. A runaway feature was concatenating conversation history without limits. Each request used more context than the last. By Sunday, some requests were hitting 100K tokens.
No alerts fired. The feature was working exactly as coded. It just wasn't working as intended.
Token budgets are the circuit breakers you didn't know you needed.
The Problem with Unlimited Tokens
Most LLM integrations start simple: take user input, add system prompt, call API, return response. No limits.
# How most codebases start
def chat(user_message):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )
This works until:
- A user pastes a 50-page document
- A bug causes recursive context expansion
- A prompt injection tricks your system into verbose output
- Product ships a feature that nobody tested with long inputs
You don't have rate limits on tokens. You have rate limits on requests. Ten requests each using 50K tokens count the same against those limits as ten requests using 500 tokens each. But they cost 100x as much.
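Back-of-the-envelope, with a hypothetical per-1K-token price (the exact rate depends on the model; the ratio is what matters):

# Hypothetical input price -- the absolute number doesn't matter, the ratio does
PRICE_PER_1K_INPUT = 0.03

normal = 10 * 500      # ten small requests:   5,000 input tokens
runaway = 10 * 50_000  # ten bloated requests: 500,000 input tokens

print(normal / 1000 * PRICE_PER_1K_INPUT)   # 0.15
print(runaway / 1000 * PRICE_PER_1K_INPUT)  # 15.0 -- same request count, 100x the cost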
Token Budgets by Feature
Different features have different token needs. Treat them differently.
TOKEN_BUDGETS = {
    "chat": {
        "max_input": 4000,
        "max_output": 1000,
        "model": "gpt-4"
    },
    "summarize": {
        "max_input": 16000,
        "max_output": 500,
        "model": "gpt-4-turbo"
    },
    "code_review": {
        "max_input": 32000,
        "max_output": 2000,
        "model": "gpt-4-turbo"
    },
    "quick_answer": {
        "max_input": 1000,
        "max_output": 200,
        "model": "gpt-3.5-turbo"
    }
}
Each feature gets:
- Input limit: maximum tokens you'll send
- Output limit: the max_tokens parameter, which hard-caps the response
- Model assignment: expensive models for high-value features
Quick answers don't need GPT-4. Code review does. Encoding this in configuration makes costs predictable.
Enforcement
Budget enforcement happens before the API call:
import tiktoken

class TokenBudgetExceeded(Exception):
    """Raised when a request's input exceeds its feature budget."""
    pass

def enforce_budget(feature: str, messages: list[dict]) -> list[dict]:
    budget = TOKEN_BUDGETS[feature]
    enc = tiktoken.encoding_for_model(budget["model"])
    total_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    if total_tokens > budget["max_input"]:
        # Option 1: Truncate
        # Option 2: Summarize older context
        # Option 3: Reject with error
        raise TokenBudgetExceeded(
            f"Input {total_tokens} exceeds budget {budget['max_input']}"
        )
    return messages
def call_llm(feature: str, messages: list[dict]):
    budget = TOKEN_BUDGETS[feature]
    enforced = enforce_budget(feature, messages)
    return openai.chat.completions.create(
        model=budget["model"],
        messages=enforced,
        max_tokens=budget["max_output"]  # Output capped
    )
The max_tokens parameter does the output enforcement for you. Input enforcement you do yourself.
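Callers then go through call_llm and handle the rejection. A minimal sketch, using a hypothetical handle_chat_request entry point:

def handle_chat_request(user_message: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    try:
        response = call_llm("chat", messages)
        return response.choices[0].message.content
    except TokenBudgetExceeded:
        # Fail loudly and cheaply instead of silently sending an oversized request
        return "That message is too long for chat. Try trimming it down."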
Graceful Degradation
Hard rejection is one option. Graceful degradation is often better:
def degrade_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """
    Trim older messages to fit budget.
    Keep system prompt and recent turns.
    """
    enc = tiktoken.encoding_for_model("gpt-4")

    # System prompt is non-negotiable
    system = messages[0]
    system_tokens = len(enc.encode(system["content"]))
    remaining = max_tokens - system_tokens
    kept = [system]

    # Iterate from most recent, keep what fits
    for msg in reversed(messages[1:]):
        msg_tokens = len(enc.encode(msg["content"]))
        if msg_tokens <= remaining:
            kept.insert(1, msg)  # After system, in order
            remaining -= msg_tokens
        else:
            # Could truncate this message, or skip
            break

    return kept
Users get responses with less context rather than error messages. Quality might drop. But they're not blocked.
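One way to wire this in is a softer variant of enforce_budget that trims instead of raising. A sketch, reusing the pieces above (enforce_budget_soft is a hypothetical name):

def enforce_budget_soft(feature: str, messages: list[dict]) -> list[dict]:
    budget = TOKEN_BUDGETS[feature]
    enc = tiktoken.encoding_for_model(budget["model"])
    total_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    if total_tokens > budget["max_input"]:
        # Trim older context to fit instead of rejecting the request
        return degrade_to_budget(messages, budget["max_input"])
    return messages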
Per-User Budgets
Feature budgets protect against bugs. User budgets protect against abuse.
class UserTokenTracker:
    def __init__(self, redis_client):
        self.redis = redis_client

    def can_spend(self, user_id: str, tokens: int) -> bool:
        key = f"tokens:{user_id}:{today()}"
        current = int(self.redis.get(key) or 0)
        limit = self.get_user_limit(user_id)
        return current + tokens <= limit

    def spend(self, user_id: str, tokens: int):
        key = f"tokens:{user_id}:{today()}"
        self.redis.incrby(key, tokens)
        self.redis.expire(key, 86400)  # 24h TTL

    def get_user_limit(self, user_id: str) -> int:
        # Free tier: 10K tokens/day
        # Pro: 100K tokens/day
        # Enterprise: 1M tokens/day
        tier = get_user_tier(user_id)
        return {"free": 10_000, "pro": 100_000, "enterprise": 1_000_000}[tier]
Daily limits reset automatically. Users see their remaining budget. Abuse gets throttled before it hurts your bill.
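In the request path, the tracker sits in front of the feature budget. A sketch, assuming a shared tracker instance and the call_llm defined earlier; the pre-check uses the feature's worst case, and the actual spend comes from the API's usage field:

tracker = UserTokenTracker(redis_client)  # assumed shared instance

def call_llm_for_user(user_id: str, feature: str, messages: list[dict]):
    budget = TOKEN_BUDGETS[feature]

    # Pre-check against the worst case: full input budget plus full output budget
    worst_case = budget["max_input"] + budget["max_output"]
    if not tracker.can_spend(user_id, worst_case):
        raise TokenBudgetExceeded(f"Daily token limit reached for user {user_id}")

    response = call_llm(feature, messages)

    # Charge the user for what was actually consumed, not the estimate
    tracker.spend(user_id, response.usage.total_tokens)
    return response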
Deployment Safeguards
Add these before shipping any LLM feature:
Global rate limit: Max tokens per minute across all users. Prevents bill explosion from any source.
GLOBAL_TPM_LIMIT = 1_000_000  # 1M tokens per minute max

def check_global_limit(tokens: int) -> bool:
    key = f"global:tokens:{current_minute()}"
    current = int(redis.get(key) or 0)
    return current + tokens <= GLOBAL_TPM_LIMIT
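The check only reads the counter, so something has to write it after each call. A minimal sketch of the recording side (record_global_tokens is a hypothetical name):

def record_global_tokens(tokens: int):
    key = f"global:tokens:{current_minute()}"
    redis.incrby(key, tokens)
    redis.expire(key, 120)  # let stale per-minute counters expire on their own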
Cost alerts: Notify when hourly spend exceeds a threshold.
def track_cost(tokens_in: int, tokens_out: int, model: str):
    cost = calculate_cost(tokens_in, tokens_out, model)
    hour_key = f"cost:{current_hour()}"
    redis.incrbyfloat(hour_key, cost)
    if float(redis.get(hour_key) or 0) > HOURLY_COST_ALERT:
        send_alert(f"Hourly LLM spend exceeded ${HOURLY_COST_ALERT}")
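calculate_cost is just a price-table lookup. The rates below are placeholders, not current pricing; keep the real numbers from your provider's pricing page in one table so updates stay a one-line change:

# Placeholder (input, output) prices per 1K tokens -- substitute real rates
PRICE_PER_1K = {
    "gpt-4": (0.03, 0.06),
    "gpt-4-turbo": (0.01, 0.03),
    "gpt-3.5-turbo": (0.0005, 0.0015),
}

def calculate_cost(tokens_in: int, tokens_out: int, model: str) -> float:
    price_in, price_out = PRICE_PER_1K[model]
    return (tokens_in / 1000) * price_in + (tokens_out / 1000) * price_out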
Feature flags: Ability to disable features without a deploy.
def is_feature_enabled(feature: str) -> bool:
    return redis.get(f"feature:{feature}:enabled") == "true"
When something goes wrong, you can cut specific features immediately.
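A guard at the top of the call path is what makes that kill switch effective; a sketch, with a hypothetical FeatureDisabled error:

class FeatureDisabled(Exception):
    pass

def call_llm_guarded(feature: str, messages: list[dict]):
    # Refuse disabled features before spending a single token
    if not is_feature_enabled(feature):
        raise FeatureDisabled(f"{feature} is currently disabled")
    return call_llm(feature, messages)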
What Token Budgets Buy You
With budgets in place:
- A runaway bug hits the per-feature limit, not your credit limit
- Cost per feature becomes predictable and measurable
- Users on free tiers can't abuse their way to enterprise bills
- You can disable expensive features without taking down the whole system
Without budgets, you're trusting that nobody will paste a novel, no bug will cause exponential context growth, and no user will discover that your API is essentially a free GPT-4 proxy.
That's not a bet I'd take.