How Failed Requests Inflate Your Bill
Your LLM API has a 99% success rate. Seems fine. You retry failures up to 3 times. Also seems fine.
But a 1% failure rate with up to 3 retries means anywhere from roughly 1% extra API calls (when failures are independent) to 3% (when every retry fails too). At scale, that's real money.
The Retry Math
Simple case: 5% failure rate, 3 retries max.
Original requests: 1,000
Failed on first try: 50 (5%)
Retry 1: 50 attempts, 47.5 succeed, 2.5 fail
Retry 2: 2.5 attempts, 2.4 succeed, 0.1 fail
Retry 3: 0.1 attempts, 0.1 succeed
Total API calls: 1,000 + 50 + 2.5 + 0.1 = 1,052.6
Overhead: 5.26%
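This arithmetic collapses into a one-line formula. A minimal sketch to sanity-check it (the expected_call_overhead helper is mine, and it assumes each attempt fails independently at the same rate):

def expected_call_overhead(failure_rate: float, max_retries: int) -> float:
    """Expected API calls per original request, assuming independent failures."""
    # Attempt k only happens if the previous k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))

# 5% failure rate, 3 retries: ~1.0526 calls per request, i.e. ~5.26% overhead
print(expected_call_overhead(0.05, 3))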
That's the good case. Now consider correlated failures.
When Failures Correlate
API failures aren't random. They cluster during:
- Rate limit periods
- Provider outages
- Network issues
- Your own traffic spikes
During these periods, failure rate might jump to 30-50%. Your retry logic amplifies the problem:
Original requests: 1,000
Failure rate during incident: 40%
Failed on first try: 400
Retry 1: 400 attempts, 240 succeed, 160 fail
Retry 2: 160 attempts, 96 succeed, 64 fail
Retry 3: 64 attempts, 38 succeed, 26 fail
Total API calls: 1,000 + 400 + 160 + 64 = 1,624
Overhead: 62.4%
You're paying 62% more during the incident. And making the provider's overload worse.
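The same expected_call_overhead sketch from earlier reproduces the incident figure, still assuming each attempt fails independently at the elevated rate:

# 40% failure rate, 3 retries: ~1.624 calls per request, i.e. ~62.4% overhead
print(expected_call_overhead(0.40, 3))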
Retry Strategies That Don't Bankrupt You
Exponential backoff with jitter:
import asyncio
import random

class RetryableError(Exception):
    """Whatever errors your client treats as retryable (timeouts, 429s, 5xx)."""

class MaxRetriesExceeded(Exception):
    """Safety net if the loop exits without returning or re-raising."""

async def retry_with_backoff(
    func,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
):
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except RetryableError:
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.1)
            await asyncio.sleep(delay + jitter)
    raise MaxRetriesExceeded()
The jitter prevents thundering herd: all your retries hitting at the same moment.
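Usage is a one-liner; llm.generate here is a stand-in for whatever client call you're wrapping:

result = await retry_with_backoff(lambda: llm.generate(prompt))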
Circuit breaker:
import time

class CircuitOpen(Exception):
    """Raised while the breaker is open and calls are being rejected."""

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
    ):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    async def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"
            else:
                raise CircuitOpen()
        try:
            result = await func()
            if self.state == "half-open":
                self.state = "closed"
                self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "open"
            raise
When failures accumulate, stop trying. Wait for recovery. This prevents retry storms from making outages worse.
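A sketch of how the two pieces compose; the names come from the snippets above, and putting the breaker outside the retry loop is one reasonable design choice, not the only one:

breaker = CircuitBreaker()

async def call_llm(prompt):
    # One exhausted retry loop counts as one breaker failure,
    # so a sustained outage opens the circuit after a handful of requests.
    return await breaker.call(
        lambda: retry_with_backoff(lambda: llm.generate(prompt))
    )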
Retry budget:
import time

class RetryBudget:
    def __init__(self, budget_percent: float = 10.0, window_seconds: float = 60.0):
        self.budget_percent = budget_percent
        self.window_seconds = window_seconds
        self.requests = []
        self.retries = []

    def record_request(self):
        now = time.time()
        self.requests.append(now)
        self._cleanup(now)

    def can_retry(self) -> bool:
        now = time.time()
        self._cleanup(now)
        if len(self.requests) == 0:
            return True
        retry_rate = len(self.retries) / len(self.requests) * 100
        return retry_rate < self.budget_percent

    def record_retry(self):
        self.retries.append(time.time())

    def _cleanup(self, now):
        cutoff = now - self.window_seconds
        self.requests = [t for t in self.requests if t > cutoff]
        self.retries = [t for t in self.retries if t > cutoff]
Instead of unlimited retries, set a budget: retries can't exceed 10% of total requests. This caps your downside.
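One way to wire the budget in, shown here as a single budgeted retry (a sketch; call_with_budget and the gating logic are mine, reusing names from the earlier snippets):

budget = RetryBudget(budget_percent=10.0)

async def call_with_budget(prompt):
    budget.record_request()
    try:
        return await llm.generate(prompt)
    except RetryableError:
        if not budget.can_retry():
            raise  # over budget: fail fast instead of piling on
        budget.record_retry()
        return await llm.generate(prompt)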
Partial Failures Are Expensive
The worst case: the request partially succeeded before failing.
# You paid for 500 input tokens
# Model generated 300 output tokens
# Then the connection dropped
# Total cost: input + partial output
# Value received: zero (incomplete response)
# This is worse than a clean failure
For long generations, consider:
- Checkpointing (save partial results)
- Smaller max_tokens with continuation
- Streaming with client-side buffering (sketched below)
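A minimal sketch of the streaming option, assuming a client whose stream_generate method yields text chunks (a hypothetical interface; the exact streaming API varies by SDK):

async def generate_with_buffer(prompt) -> str:
    """Stream and buffer client-side so a dropped connection still leaves
    you with the tokens you already paid for."""
    chunks = []
    try:
        async for chunk in llm.stream_generate(prompt):  # hypothetical streaming API
            chunks.append(chunk)
    except ConnectionError:
        # Keep the partial output; the caller can continue from here
        # instead of re-paying for the whole generation.
        pass
    return "".join(chunks)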
Monitoring Retry Cost
Track retry metrics explicitly:
from prometheus_client import Counter

request_total = Counter('llm_requests_total', 'Total requests', ['status'])
retry_total = Counter('llm_retries_total', 'Total retries', ['attempt'])
retry_cost = Counter('llm_retry_cost_tokens', 'Tokens spent on retries')

max_retries = 3

async def tracked_request(prompt):
    request_total.labels(status='attempted').inc()
    for attempt in range(max_retries + 1):
        if attempt > 0:
            # Count the retry (and its prompt-token cost) whether or not it succeeds
            retry_total.labels(attempt=str(attempt)).inc()
            retry_cost.inc(count_tokens(prompt))
        try:
            result = await llm.generate(prompt)
            request_total.labels(status='success').inc()
            return result
        except RetryableError:
            if attempt == max_retries:
                request_total.labels(status='failed').inc()
                raise
Dashboard showing retry rate spiking from 2% to 15%? That's your signal to investigate before the bill arrives.
The best retry is the one you don't need. Invest in reliability before retry logic.