How Failed Requests Inflate Your Bill
Your LLM API has a 99% success rate. Seems fine. You retry failures up to 3 times. Also seems fine.
But a 1% failure rate with up to 3 retries means anywhere from roughly 1% extra API calls (when failures are independent) to 3% (when every retry fails too). At scale, that's real money.
The Retry Math
Simple case: 5% failure rate, 3 retries max.
Original requests: 1,000
Failed on first try: 50 (5%)
Retry 1: 50 attempts, 47.5 succeed, 2.5 fail
Retry 2: 2.5 attempts, 2.4 succeed, 0.1 fail
Retry 3: 0.1 attempts, 0.1 succeed
Total API calls: 1,000 + 50 + 2.5 + 0.1 = 1,052.6
Overhead: 5.26%
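This arithmetic collapses into a one-line formula. A minimal sketch to sanity-check it (the expected_call_overhead helper is mine, and it assumes each attempt fails independently at the same rate):

def expected_call_overhead(failure_rate: float, max_retries: int) -> float:
    """Expected API calls per original request, assuming independent failures."""
    # Attempt k only happens if the previous k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))

# 5% failure rate, 3 retries: ~1.0526 calls per request, i.e. ~5.26% overhead
print(expected_call_overhead(0.05, 3))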
That's the good case. Now consider correlated failures.
When Failures Correlate
API failures aren't random. They cluster during:
- Rate limit periods
- Provider outages
- Network issues
- Your own traffic spikes
During these periods, failure rate might jump to 30-50%. Your retry logic amplifies the problem:
Original requests: 1,000
Failure rate during incident: 40%
Failed on first try: 400
Retry 1: 400 attempts, 240 succeed, 160 fail
Retry 2: 160 attempts, 96 succeed, 64 fail
Retry 3: 64 attempts, 38 succeed, 26 fail
Total API calls: 1,000 + 400 + 160 + 64 = 1,624
Overhead: 62.4%
You're paying 62% more during the incident. And making the provider's overload worse.
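The same expected_call_overhead sketch from earlier reproduces the incident figure, still assuming each attempt fails independently at the elevated rate:

# 40% failure rate, 3 retries: ~1.624 calls per request, i.e. ~62.4% overhead
print(expected_call_overhead(0.40, 3))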
Retry Strategies That Don't Bankrupt You
Exponential backoff with jitter:
import asyncio
import random

class RetryableError(Exception):
    """Whatever errors your client treats as retryable (timeouts, 429s, 5xx)."""

class MaxRetriesExceeded(Exception):
    """Safety net if the loop exits without returning or re-raising."""

async def retry_with_backoff(
    func,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
):
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except RetryableError:
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.1)
            await asyncio.sleep(delay + jitter)
    raise MaxRetriesExceeded()
The jitter prevents thundering herd: all your retries hitting at the same moment.
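Usage is a one-liner; llm.generate here is a stand-in for whatever client call you're wrapping:

result = await retry_with_backoff(lambda: llm.generate(prompt))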
Circuit breaker:
import time

class CircuitOpen(Exception):
    """Raised while the breaker is open and calls are being rejected."""

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
    ):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    async def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"
            else:
                raise CircuitOpen()
        try:
            result = await func()
            if self.state == "half-open":
                self.state = "closed"
                self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "open"
            raise
When failures accumulate, stop trying. Wait for recovery. This prevents retry storms from making outages worse.
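A sketch of how the two pieces compose; the names come from the snippets above, and putting the breaker outside the retry loop is one reasonable design choice, not the only one:

breaker = CircuitBreaker()

async def call_llm(prompt):
    # One exhausted retry loop counts as one breaker failure,
    # so a sustained outage opens the circuit after a handful of requests.
    return await breaker.call(
        lambda: retry_with_backoff(lambda: llm.generate(prompt))
    )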
Retry budget:
import time

class RetryBudget:
    def __init__(self, budget_percent: float = 10.0, window_seconds: float = 60.0):
        self.budget_percent = budget_percent
        self.window_seconds = window_seconds
        self.requests = []
        self.retries = []

    def record_request(self):
        now = time.time()
        self.requests.append(now)
        self._cleanup(now)

    def can_retry(self) -> bool:
        now = time.time()
        self._cleanup(now)
        if len(self.requests) == 0:
            return True
        retry_rate = len(self.retries) / len(self.requests) * 100
        return retry_rate < self.budget_percent

    def record_retry(self):
        self.retries.append(time.time())

    def _cleanup(self, now):
        cutoff = now - self.window_seconds
        self.requests = [t for t in self.requests if t > cutoff]
        self.retries = [t for t in self.retries if t > cutoff]
Instead of unlimited retries, set a budget: retries can't exceed 10% of total requests. This caps your downside.
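One way to wire the budget in, shown here as a single budgeted retry (a sketch; call_with_budget and the gating logic are mine, reusing names from the earlier snippets):

budget = RetryBudget(budget_percent=10.0)

async def call_with_budget(prompt):
    budget.record_request()
    try:
        return await llm.generate(prompt)
    except RetryableError:
        if not budget.can_retry():
            raise  # over budget: fail fast instead of piling on
        budget.record_retry()
        return await llm.generate(prompt)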
Partial Failures Are Expensive
The worst case: the request partially succeeded before failing.
# You paid for 500 input tokens
# Model generated 300 output tokens
# Then the connection dropped
# Total cost: input + partial output
# Value received: zero (incomplete response)
# This is worse than a clean failure
For long generations, consider:
- Checkpointing (save partial results)
- Smaller max_tokens with continuation
- Streaming with client-side buffering (sketched below)
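A minimal sketch of the streaming option, assuming a client whose stream_generate method yields text chunks (a hypothetical interface; the exact streaming API varies by SDK):

async def generate_with_buffer(prompt) -> str:
    """Stream and buffer client-side so a dropped connection still leaves
    you with the tokens you already paid for."""
    chunks = []
    try:
        async for chunk in llm.stream_generate(prompt):  # hypothetical streaming API
            chunks.append(chunk)
    except ConnectionError:
        # Keep the partial output; the caller can continue from here
        # instead of re-paying for the whole generation.
        pass
    return "".join(chunks)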
Monitoring Retry Cost
Track retry metrics explicitly:
from prometheus_client import Counter

request_total = Counter('llm_requests_total', 'Total requests', ['status'])
retry_total = Counter('llm_retries_total', 'Total retries', ['attempt'])
retry_cost = Counter('llm_retry_cost_tokens', 'Tokens spent on retries')

max_retries = 3

async def tracked_request(prompt):
    request_total.labels(status='attempted').inc()
    for attempt in range(max_retries + 1):
        if attempt > 0:
            # Count the retry (and its prompt-token cost) whether or not it succeeds
            retry_total.labels(attempt=str(attempt)).inc()
            retry_cost.inc(count_tokens(prompt))
        try:
            result = await llm.generate(prompt)
            request_total.labels(status='success').inc()
            return result
        except RetryableError:
            if attempt == max_retries:
                request_total.labels(status='failed').inc()
                raise
Dashboard showing retry rate spiking from 2% to 15%? That's your signal to investigate before the bill arrives.
The best retry is the one you don't need. Invest in reliability before retry logic.