
LLM API Error Handling and Retry Strategies: Avoid Wasting Money on Failed Requests

Failed API requests cost you tokens without results. Here's how to handle errors properly, retry intelligently, and stop wasting money on requests that don't need to fail.

Every failed LLM API request is wasted money. You paid for input tokens, waited for a response, and got an error. If you're not handling this correctly, 5-15% of your API spend could be going to requests that fail and shouldn't have — or succeed but get retried anyway.

This guide covers error handling across all 10 major providers, retry strategies that actually save money, and the code patterns that prevent common failures.

Every Error Code and What It Means

Not all errors are equal. Some are retryable, some are not, and some will cost you money if you retry them. Here's the breakdown:

Code  Name                             Retryable?  Cost Impact
429   Rate Limit / Too Many Requests   Yes         Input tokens billed, no output
500   Internal Server Error            Yes         Usually no charge
502   Bad Gateway                      Yes         Usually no charge
503   Service Unavailable              Yes         Usually no charge
408   Request Timeout                  Depends     Input tokens may be billed
401   Unauthorized                     No          No charge
403   Forbidden                        No          No charge
400   Bad Request                      No          No charge
404   Model Not Found                  No          No charge
413   Payload Too Large                No          No charge
529   Overloaded (Anthropic)           Yes         Usually no charge

The cost trap: 429 errors

Rate limit errors are the sneakiest. OpenAI and Anthropic both bill you for input tokens even on 429 errors in some cases — especially if the request was partially processed before being rejected. If you send a 10K token request and hit a rate limit, you might still be charged for those input tokens with no output to show for it.

Google's Gemini and DeepSeek are generally more generous — 429 errors typically don't incur charges. But don't assume: always check your billing after a spike in rate limit errors.

The Exponential Backoff Pattern

The standard retry strategy for LLM APIs is exponential backoff with jitter. Here's the pattern that works across all providers:

async function callWithRetry(fn, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error;
      if (!isRetryable(error.status)) throw error;

      const baseDelay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
      const jitter = Math.random() * 1000;            // 0-1s random
      const delay = baseDelay + jitter;

      await new Promise(r => setTimeout(r, delay));
    }
  }
}

function isRetryable(status) {
  return [429, 500, 502, 503, 529].includes(status);
}

Key points:

  • Only retry on retryable errors — 400, 401, 403, 404 should never be retried
  • Add jitter — Without jitter, all clients retry at the same time, causing a thundering herd
  • Cap retries at 3 — If 3 retries fail, the problem is likely persistent
  • Respect Retry-After headers — Some providers include them in 429 responses
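Honoring Retry-After can be folded into the delay calculation. A minimal sketch, assuming the thrown error carries a fetch-style Response as `error.response` (SDKs expose headers differently, so adjust the lookup for yours):

```javascript
// Prefer the server's Retry-After hint over computed backoff.
// Assumes `error.response` is a fetch-style Response — an assumption,
// not a guarantee; check how your SDK surfaces response headers.
function retryDelayMs(error, attempt) {
  const header = error.response?.headers?.get?.('retry-after');
  if (header) {
    const seconds = Number(header);
    if (!Number.isNaN(seconds)) return seconds * 1000;
    // Retry-After may also be an HTTP date
    const date = Date.parse(header);
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  }
  // Fall back to exponential backoff with jitter
  return Math.pow(2, attempt) * 1000 + Math.random() * 1000;
}
```

Drop this in place of the `baseDelay`/`jitter` lines in `callWithRetry` and the server's hint takes priority whenever it's present.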

Provider-Specific Error Behavior

Each provider handles errors differently. Here's what you need to know:

Provider Error Behavior Comparison

  • OpenAI — 429: Bills input tokens on partial processing. Use the Retry-After header.
  • OpenAI — 500/502/503: No charge. Safe to retry immediately.
  • Anthropic — 429: No charge. Built-in retry in the SDK.
  • Anthropic — 529: Overloaded. No charge. Retry with backoff.
  • Google Gemini — 429: No charge. Generous default quotas.
  • DeepSeek — 429: No charge. Aggressive rate limits on free tier.
  • Mistral — 429: No charge. Limits based on subscription tier.

Common Failure Patterns and Fixes

1. Context window overflow

The most expensive error. You build a massive prompt, send it, and get a 400 error because you exceeded the model's context window. The fix: always count tokens before sending.

// Before sending, validate token count
const maxTokens = {
  'gpt-4o': 128000, 'gpt-5': 272000, 'gpt-5.5': 1000000,
  'claude-sonnet-4.6': 1000000, 'claude-haiku-4.5': 200000,
  'gemini-2.0-flash': 1000000, 'deepseek-v4-pro': 1000000
};

const limit = maxTokens[model] || 128000;
if (estimatedTokens > limit * 0.9) {
  // Truncate or summarize before sending
  prompt = truncateToTokenLimit(prompt, limit * 0.85);
}
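Neither `estimatedTokens` nor `truncateToTokenLimit` is defined above. A rough sketch using the common ~4 characters per token heuristic — a real implementation should use the provider's own tokenizer (e.g. tiktoken for OpenAI models) for billing-accurate counts:

```javascript
// Rough heuristic: ~4 characters per token for English text.
// Good enough for a pre-flight sanity check, not for billing math.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Truncate by character count derived from the same heuristic.
function truncateToTokenLimit(text, tokenLimit) {
  const maxChars = tokenLimit * 4;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}
```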

2. Streaming connection drops

Streaming responses (SSE) can disconnect mid-response, especially on long outputs. The result: you've consumed input tokens but get no complete output. Fix: implement streaming resume or fall back to non-streaming for critical requests.

// Streaming error handling with an inactivity timeout
const controller = new AbortController();
let timeout = setTimeout(() => controller.abort(), 60000);

try {
  const response = await fetch(url, {
    ...options,
    signal: controller.signal
  });

  let fullResponse = '';
  // Node 18+: response.body is async-iterable
  for await (const chunk of response.body) {
    fullResponse += parseChunk(chunk);
    // Reset the inactivity timeout on each chunk
    clearTimeout(timeout);
    timeout = setTimeout(() => controller.abort(), 60000);
  }
} catch (error) {
  if (error.name === 'AbortError') {
    // Timeout — retry with non-streaming
    return await callWithRetry(() => fetchNonStreaming(options));
  }
  throw error;
} finally {
  clearTimeout(timeout);
}

3. Concurrent request collisions

Sending many requests simultaneously often triggers rate limits. The fix: implement a request queue with concurrency limits.

class RequestQueue {
  constructor(concurrency = 5) {
    this.concurrency = concurrency;
    this.running = 0;
    this.queue = [];
  }

  async add(fn) {
    if (this.running >= this.concurrency) {
      await new Promise(resolve => this.queue.push(resolve));
    }
    this.running++;
    try {
      return await fn();
    } finally {
      this.running--;
      if (this.queue.length) this.queue.shift()();
    }
  }
}
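A concurrency cap alone doesn't address per-minute quotas, which most providers also enforce. A sliding-window limiter can sit alongside the queue — a sketch with illustrative limits, not tied to any provider's actual quotas:

```javascript
// Sliding-window rate limiter: allows at most `limit` calls per
// `windowMs` milliseconds; waits out the window when full.
class RateLimiter {
  constructor(limit, windowMs = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.timestamps = [];
  }

  async acquire() {
    const now = Date.now();
    // Drop timestamps that have aged out of the window
    this.timestamps = this.timestamps.filter(t => now - t < this.windowMs);
    if (this.timestamps.length >= this.limit) {
      const waitMs = this.windowMs - (now - this.timestamps[0]);
      await new Promise(r => setTimeout(r, waitMs));
      return this.acquire();
    }
    this.timestamps.push(now);
  }
}
```

Call `await limiter.acquire()` before each request inside the queue's `add()` callback to respect both limits at once.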

Cost-Aware Error Handling

The biggest mistake: retrying expensive requests blindly. If you're retrying a request that costs $0.50 per attempt, three retries cost $1.50 for a single logical operation.

Cost-Aware Retry Rules
  • Low-cost requests (<$0.01): Retry up to 3 times, aggressive backoff
  • Mid-cost requests ($0.01-$0.10): Retry up to 2 times, conservative backoff
  • High-cost requests (>$0.10): Retry once, then fall back to a cheaper model
  • Always: Set a max budget per operation, abort if retries exceed it
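The tiers above translate directly into a retry-count lookup. A minimal sketch — how you estimate a request's per-attempt cost is up to your own pricing data:

```javascript
// Map an estimated per-attempt cost (USD) to a max retry count,
// following the cost-aware tiers above.
function retriesForCost(costUsd) {
  if (costUsd < 0.01) return 3;  // low-cost: retry aggressively
  if (costUsd <= 0.10) return 2; // mid-cost: conservative backoff
  return 1;                      // high-cost: one retry, then fall back
}
```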

Model fallback chains

When your primary model fails, falling back to a cheaper model saves both time and money:

Recommended Fallback Chains

  • Premium chain: GPT-5 → Claude Sonnet 4.6 → Gemini 2.5 Pro
  • Mid-tier chain: GPT-5 mini → Claude Haiku 4.5 → Gemini 2.0 Flash
  • Budget chain: DeepSeek V4 Pro → Gemini Flash Lite → GPT-4o mini
  • Free/cheap chain: Gemini Flash Lite → GPT-oss 20B → Llama 3.1 8B
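These chains can be encoded as a simple lookup — the shape the `getFallbackChain` helper in the complete handler later in this post assumes. A sketch; the model identifiers are illustrative, not exact API model IDs:

```javascript
// Fallback chains keyed by primary model. Identifiers are
// illustrative — substitute your provider's real model IDs.
const FALLBACK_CHAINS = {
  'gpt-5': ['gpt-5', 'claude-sonnet-4.6', 'gemini-2.5-pro'],
  'gpt-5-mini': ['gpt-5-mini', 'claude-haiku-4.5', 'gemini-2.0-flash'],
  'deepseek-v4-pro': ['deepseek-v4-pro', 'gemini-flash-lite', 'gpt-4o-mini'],
};

function getFallbackChain(model) {
  // Unknown models get a single-entry chain (no fallback)
  return FALLBACK_CHAINS[model] || [model];
}
```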

Calculate the real cost of your error rate

If 10% of your requests fail and you retry them, you're paying double for that 10%. Use our calculator to see the impact.


Monitoring and Alerting

Handle errors in code, but monitor them in production. Set up alerts for:

  • Error rate >5% — Indicates a systemic issue, not random failures
  • Retry rate >20% — You're retrying too aggressively, wasting money
  • P95 latency >30s — Approaching timeout thresholds
  • Monthly error cost >$50 — Set a hard cap and alert on it
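A minimal in-process counter is enough to compute the first two ratios. A sketch — wire the alert decision into whatever paging or dashboard system you already use:

```javascript
// Track request outcomes and compute error/retry rates for alerting.
class ErrorStats {
  constructor() {
    this.requests = 0;
    this.errors = 0;
    this.retries = 0;
  }

  record({ failed = false, retried = false } = {}) {
    this.requests++;
    if (failed) this.errors++;
    if (retried) this.retries++;
  }

  errorRate() {
    return this.requests ? this.errors / this.requests : 0;
  }

  retryRate() {
    return this.requests ? this.retries / this.requests : 0;
  }

  // Thresholds from the list above: 5% errors, 20% retries
  shouldAlert() {
    return this.errorRate() > 0.05 || this.retryRate() > 0.20;
  }
}
```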

Use our AI API cost monitoring guide to set up real-time tracking of error-related spend.

Provider-Specific SDK Retry Defaults

Most providers have built-in retry logic in their official SDKs. Here's what you get by default:

Provider       SDK Retries  Default Strategy          Customizable
OpenAI         2            Exponential backoff       Yes — maxRetries param
Anthropic      2            Exponential backoff       Yes — maxRetries param
Google Gemini  0 (manual)   N/A — implement yourself  N/A
DeepSeek       0 (manual)   N/A — implement yourself  N/A
Mistral        0 (manual)   N/A — implement yourself  N/A

Important: Don't rely solely on SDK defaults. OpenAI and Anthropic retry on 429 and 500 errors, but they won't implement cost-aware retry logic or model fallbacks. Build your own wrapper on top.

The Complete Error Handler

Here's a production-ready error handler that combines everything:

async function resilientLLMCall(options) {
  const { model, prompt, maxCost = 0.50 } = options;
  let totalCost = 0;

  const fallbackChain = getFallbackChain(model);

  for (const currentModel of fallbackChain) {
    // Estimated cost of one attempt with this model
    const cost = getModelCost(currentModel);
    if (totalCost + cost > maxCost) break;

    try {
      const result = await callWithRetry(
        () => callLLM(currentModel, prompt),
        // 2 retries for OpenAI/Anthropic models, 1 for everything else
        currentModel.includes('gpt') || currentModel.includes('claude') ? 2 : 1
      );
      return { ...result, model: currentModel };
    } catch (error) {
      // Count the failed attempt against the budget before falling back
      totalCost += cost;
      console.warn(`${currentModel} failed: ${error.message}. Falling back.`);
    }
  }

  throw new Error(`All models in fallback chain failed (spent ~$${totalCost.toFixed(2)})`);
}


Quick Reference: Error Handling Checklist

  • Do: Retry 429, 500, 502, 503, 529 with exponential backoff
  • Do: Add jitter to prevent thundering herd
  • Do: Set a max budget per operation
  • Do: Implement model fallback chains
  • Do: Validate token count before sending
  • Do: Monitor error rates and costs
  • Don't: Retry 400, 401, 403, 404 errors
  • Don't: Retry without checking if input tokens were billed
  • Don't: Use the same retry count for expensive and cheap models
  • Don't: Ignore Retry-After headers
