
LLM API Error Handling and Retry Strategies: Avoid Wasting Money on Failed Requests

Failed API requests cost you tokens without results. Here's how to handle errors properly, retry intelligently, and stop wasting money on requests that don't need to fail.

Every failed LLM API request is wasted money. You paid for input tokens, waited for a response, and got an error. If you're not handling this correctly, 5-15% of your API spend could be going to requests that fail and shouldn't have — or succeed but get retried anyway.

This guide covers error handling across all 10 major providers, retry strategies that actually save money, and the code patterns that prevent common failures.

Every Error Code and What It Means

Not all errors are equal. Some are retryable, some are not, and some will cost you money if you retry them. Here's the breakdown:

Code  Name                             Retryable?  Cost Impact
429   Rate Limit / Too Many Requests   Yes         Input tokens billed, no output
500   Internal Server Error            Yes         Usually no charge
502   Bad Gateway                      Yes         Usually no charge
503   Service Unavailable              Yes         Usually no charge
408   Request Timeout                  Depends     Input tokens may be billed
401   Unauthorized                     No          No charge
403   Forbidden                        No          No charge
400   Bad Request                      No          No charge
404   Model Not Found                  No          No charge
413   Payload Too Large                No          No charge
529   Overloaded (Anthropic)           Yes         Usually no charge

The cost trap: 429 errors

Rate limit errors are the sneakiest. OpenAI and Anthropic both bill you for input tokens even on 429 errors in some cases — especially if the request was partially processed before being rejected. If you send a 10K token request and hit a rate limit, you might still be charged for those input tokens with no output to show for it.

Google's Gemini and DeepSeek are generally more generous — 429 errors typically don't incur charges. But don't assume: always check your billing after a spike in rate limit errors.

The Exponential Backoff Pattern

The standard retry strategy for LLM APIs is exponential backoff with jitter. Here's the pattern that works across all providers:

async function callWithRetry(fn, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error;
      if (!isRetryable(error.status)) throw error;

      const baseDelay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
      const jitter = Math.random() * 1000;            // 0-1s random
      const delay = baseDelay + jitter;

      await new Promise(r => setTimeout(r, delay));
    }
  }
}

function isRetryable(status) {
  return [429, 500, 502, 503, 529].includes(status);
}

Key points:

  • Only retry on retryable errors — 400, 401, 403, 404 should never be retried
  • Add jitter — Without jitter, all clients retry at the same time, causing a thundering herd
  • Cap retries at 3 — If 3 retries fail, the problem is likely persistent
  • Respect Retry-After headers — Some providers include them in 429 responses
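Honoring Retry-After can be folded into the delay calculation. A minimal sketch, assuming the thrown error carries a fetch-style Response as `error.response` (SDKs expose headers differently, so adjust the lookup for yours):

```javascript
// Prefer the server's Retry-After hint over computed backoff.
// Assumes `error.response` is a fetch-style Response — an assumption,
// not a guarantee; check how your SDK surfaces response headers.
function retryDelayMs(error, attempt) {
  const header = error.response?.headers?.get?.('retry-after');
  if (header) {
    const seconds = Number(header);
    if (!Number.isNaN(seconds)) return seconds * 1000;
    // Retry-After may also be an HTTP date
    const date = Date.parse(header);
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  }
  // Fall back to exponential backoff with jitter
  return Math.pow(2, attempt) * 1000 + Math.random() * 1000;
}
```

Drop this in place of the `baseDelay`/`jitter` lines in `callWithRetry` and the server's hint takes priority whenever it's present.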

Provider-Specific Error Behavior

Each provider handles errors differently. Here's what you need to know:

Provider Error Behavior Comparison

  • OpenAI — 429: Bills input tokens on partial processing. Use the Retry-After header.
  • OpenAI — 500/502/503: No charge. Safe to retry immediately.
  • Anthropic — 429: No charge. Built-in retry in the SDK.
  • Anthropic — 529: Overloaded. No charge. Retry with backoff.
  • Google Gemini — 429: No charge. Generous default quotas.
  • DeepSeek — 429: No charge. Aggressive rate limits on free tier.
  • Mistral — 429: No charge. Limits based on subscription tier.

Common Failure Patterns and Fixes

1. Context window overflow

The most expensive error. You build a massive prompt, send it, and get a 400 error because you exceeded the model's context window. The fix: always count tokens before sending.

// Before sending, validate token count
const maxTokens = {
  'gpt-4o': 128000, 'gpt-5': 272000, 'gpt-5.5': 1000000,
  'claude-sonnet-4.6': 1000000, 'claude-haiku-4.5': 200000,
  'gemini-2.0-flash': 1000000, 'deepseek-v4-pro': 1000000
};

const limit = maxTokens[model] || 128000;
if (estimatedTokens > limit * 0.9) {
  // Truncate or summarize before sending
  prompt = truncateToTokenLimit(prompt, limit * 0.85);
}
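Neither `estimatedTokens` nor `truncateToTokenLimit` is defined above. A rough sketch using the common ~4 characters per token heuristic — a real implementation should use the provider's own tokenizer (e.g. tiktoken for OpenAI models) for billing-accurate counts:

```javascript
// Rough heuristic: ~4 characters per token for English text.
// Good enough for a pre-flight sanity check, not for billing math.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Truncate by character count derived from the same heuristic.
function truncateToTokenLimit(text, tokenLimit) {
  const maxChars = tokenLimit * 4;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}
```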

2. Streaming connection drops

Streaming responses (SSE) can disconnect mid-response, especially on long outputs. The result: you've consumed input tokens but get no complete output. Fix: implement streaming resume or fall back to non-streaming for critical requests.

// Streaming error handling with an inactivity timeout
const controller = new AbortController();
let timeout = setTimeout(() => controller.abort(), 60000);

try {
  const response = await fetch(url, {
    ...options,
    signal: controller.signal
  });

  let fullResponse = '';
  // Node 18+: response.body is async-iterable
  for await (const chunk of response.body) {
    fullResponse += parseChunk(chunk);
    // Reset the inactivity timeout on each chunk
    clearTimeout(timeout);
    timeout = setTimeout(() => controller.abort(), 60000);
  }
} catch (error) {
  if (error.name === 'AbortError') {
    // Timeout — retry with non-streaming
    return await callWithRetry(() => fetchNonStreaming(options));
  }
  throw error;
} finally {
  clearTimeout(timeout);
}

3. Concurrent request collisions

Sending many requests simultaneously often triggers rate limits. The fix: implement a request queue with concurrency limits.

class RequestQueue {
  constructor(concurrency = 5) {
    this.concurrency = concurrency;
    this.running = 0;
    this.queue = [];
  }

  async add(fn) {
    if (this.running >= this.concurrency) {
      await new Promise(resolve => this.queue.push(resolve));
    }
    this.running++;
    try {
      return await fn();
    } finally {
      this.running--;
      if (this.queue.length) this.queue.shift()();
    }
  }
}
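A concurrency cap alone doesn't address per-minute quotas, which most providers also enforce. A sliding-window limiter can sit alongside the queue — a sketch with illustrative limits, not tied to any provider's actual quotas:

```javascript
// Sliding-window rate limiter: allows at most `limit` calls per
// `windowMs` milliseconds; waits out the window when full.
class RateLimiter {
  constructor(limit, windowMs = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.timestamps = [];
  }

  async acquire() {
    const now = Date.now();
    // Drop timestamps that have aged out of the window
    this.timestamps = this.timestamps.filter(t => now - t < this.windowMs);
    if (this.timestamps.length >= this.limit) {
      const waitMs = this.windowMs - (now - this.timestamps[0]);
      await new Promise(r => setTimeout(r, waitMs));
      return this.acquire();
    }
    this.timestamps.push(now);
  }
}
```

Call `await limiter.acquire()` before each request inside the queue's `add()` callback to respect both limits at once.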

Cost-Aware Error Handling

The biggest mistake: retrying expensive requests blindly. If you're retrying a request that costs $0.50 per attempt, three retries cost $1.50 for a single logical operation.

Cost-Aware Retry Rules
  • Low-cost requests (<$0.01): Retry up to 3 times, aggressive backoff
  • Mid-cost requests ($0.01-$0.10): Retry up to 2 times, conservative backoff
  • High-cost requests (>$0.10): Retry once, then fall back to a cheaper model
  • Always: Set a max budget per operation, abort if retries exceed it
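The tiers above translate directly into a retry-count lookup. A minimal sketch — how you estimate a request's per-attempt cost is up to your own pricing data:

```javascript
// Map an estimated per-attempt cost (USD) to a max retry count,
// following the cost-aware tiers above.
function retriesForCost(costUsd) {
  if (costUsd < 0.01) return 3;  // low-cost: retry aggressively
  if (costUsd <= 0.10) return 2; // mid-cost: conservative backoff
  return 1;                      // high-cost: one retry, then fall back
}
```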

Model fallback chains

When your primary model fails, falling back to a cheaper model saves both time and money:

Recommended Fallback Chains

  • Premium chain: GPT-5 → Claude Sonnet 4.6 → Gemini 2.5 Pro
  • Mid-tier chain: GPT-5 mini → Claude Haiku 4.5 → Gemini 2.0 Flash
  • Budget chain: DeepSeek V4 Pro → Gemini Flash Lite → GPT-4o mini
  • Free/cheap chain: Gemini Flash Lite → GPT-oss 20B → Llama 3.1 8B
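These chains can be encoded as a simple lookup — the shape the `getFallbackChain` helper in the complete handler later in this post assumes. A sketch; the model identifiers are illustrative, not exact API model IDs:

```javascript
// Fallback chains keyed by primary model. Identifiers are
// illustrative — substitute your provider's real model IDs.
const FALLBACK_CHAINS = {
  'gpt-5': ['gpt-5', 'claude-sonnet-4.6', 'gemini-2.5-pro'],
  'gpt-5-mini': ['gpt-5-mini', 'claude-haiku-4.5', 'gemini-2.0-flash'],
  'deepseek-v4-pro': ['deepseek-v4-pro', 'gemini-flash-lite', 'gpt-4o-mini'],
};

function getFallbackChain(model) {
  // Unknown models get a single-entry chain (no fallback)
  return FALLBACK_CHAINS[model] || [model];
}
```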

Calculate the real cost of your error rate

If 10% of your requests fail and you retry them, you're paying double for that 10%. Use our calculator to see the impact.


Monitoring and Alerting

Handle errors in code, but monitor them in production. Set up alerts for:

  • Error rate >5% — Indicates a systemic issue, not random failures
  • Retry rate >20% — You're retrying too aggressively, wasting money
  • P95 latency >30s — Approaching timeout thresholds
  • Monthly error cost >$50 — Set a hard cap and alert on it
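A minimal in-process counter is enough to compute the first two ratios. A sketch — wire the alert decision into whatever paging or dashboard system you already use:

```javascript
// Track request outcomes and compute error/retry rates for alerting.
class ErrorStats {
  constructor() {
    this.requests = 0;
    this.errors = 0;
    this.retries = 0;
  }

  record({ failed = false, retried = false } = {}) {
    this.requests++;
    if (failed) this.errors++;
    if (retried) this.retries++;
  }

  errorRate() {
    return this.requests ? this.errors / this.requests : 0;
  }

  retryRate() {
    return this.requests ? this.retries / this.requests : 0;
  }

  // Thresholds from the list above: 5% errors, 20% retries
  shouldAlert() {
    return this.errorRate() > 0.05 || this.retryRate() > 0.20;
  }
}
```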

Use our AI API cost monitoring guide to set up real-time tracking of error-related spend.

Provider-Specific SDK Retry Defaults

Most providers have built-in retry logic in their official SDKs. Here's what you get by default:

Provider       SDK Retries  Default Strategy          Customizable
OpenAI         2            Exponential backoff       Yes — maxRetries param
Anthropic      2            Exponential backoff       Yes — maxRetries param
Google Gemini  0 (manual)   N/A — implement yourself  N/A
DeepSeek       0 (manual)   N/A — implement yourself  N/A
Mistral        0 (manual)   N/A — implement yourself  N/A

Important: Don't rely solely on SDK defaults. OpenAI and Anthropic retry on 429 and 500 errors, but they won't implement cost-aware retry logic or model fallbacks. Build your own wrapper on top.

The Complete Error Handler

Here's a production-ready error handler that combines everything:

async function resilientLLMCall(options) {
  const { model, prompt, maxCost = 0.50 } = options;
  let totalCost = 0;

  const fallbackChain = getFallbackChain(model);

  for (const currentModel of fallbackChain) {
    // Estimated cost of one attempt with this model
    const cost = getModelCost(currentModel);
    if (totalCost + cost > maxCost) break;

    try {
      const result = await callWithRetry(
        () => callLLM(currentModel, prompt),
        // 2 retries for OpenAI/Anthropic models, 1 for everything else
        currentModel.includes('gpt') || currentModel.includes('claude') ? 2 : 1
      );
      return { ...result, model: currentModel };
    } catch (error) {
      // Count the failed attempt against the budget before falling back
      totalCost += cost;
      console.warn(`${currentModel} failed: ${error.message}. Falling back.`);
    }
  }

  throw new Error(`All models in fallback chain failed (spent ~$${totalCost.toFixed(2)})`);
}


Quick Reference: Error Handling Checklist

  • Do: Retry 429, 500, 502, 503, 529 with exponential backoff
  • Do: Add jitter to prevent thundering herd
  • Do: Set a max budget per operation
  • Do: Implement model fallback chains
  • Do: Validate token count before sending
  • Do: Monitor error rates and costs
  • Don't: Retry 400, 401, 403, 404 errors
  • Don't: Retry without checking if input tokens were billed
  • Don't: Use the same retry count for expensive and cheap models
  • Don't: Ignore Retry-After headers
