LLM API Error Handling and Retry Strategies: Avoid Wasting Money on Failed Requests
Failed API requests cost you tokens without results. Here's how to handle errors properly, retry intelligently, and stop wasting money on requests that don't need to fail.
Every failed LLM API request is wasted money. You paid for input tokens, waited for a response, and got an error. If you're not handling this correctly, 5-15% of your API spend could be going to requests that fail and shouldn't have — or succeed but get retried anyway.
This guide covers error handling across the major providers, retry strategies that actually save money, and the code patterns that prevent common failures.
Every Error Code and What It Means
Not all errors are equal. Some are retryable, some are not, and some will cost you money if you retry them. Here's the breakdown:
| Error Code | Name | Retryable? | Cost Impact |
|---|---|---|---|
| 429 | Rate Limit / Too Many Requests | Yes | Input tokens billed, no output |
| 500 | Internal Server Error | Yes | Usually no charge |
| 502 | Bad Gateway | Yes | Usually no charge |
| 503 | Service Unavailable | Yes | Usually no charge |
| 408 | Request Timeout | Depends | Input tokens may be billed |
| 401 | Unauthorized | No | No charge |
| 403 | Forbidden | No | No charge |
| 400 | Bad Request | No | No charge |
| 404 | Model Not Found | No | No charge |
| 413 | Payload Too Large | No | No charge |
| 529 | Overloaded (Anthropic) | Yes | Usually no charge |
The cost trap: 429 errors
Rate limit errors are the sneakiest. OpenAI and Anthropic both bill you for input tokens even on 429 errors in some cases — especially if the request was partially processed before being rejected. If you send a 10K token request and hit a rate limit, you might still be charged for those input tokens with no output to show for it.
Google's Gemini and DeepSeek are generally more generous — 429 errors typically don't incur charges. But don't assume: always check your billing after a spike in rate limit errors.
The Exponential Backoff Pattern
The standard retry strategy for LLM APIs is exponential backoff with jitter. Here's the pattern that works across all providers:
```javascript
async function callWithRetry(fn, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error;
      if (!isRetryable(error.status)) throw error;
      const baseDelay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
      const jitter = Math.random() * 1000; // 0-1s random
      const delay = baseDelay + jitter;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}

function isRetryable(status) {
  return [429, 500, 502, 503, 529].includes(status);
}
```
Key points:
- Only retry on retryable errors — 400, 401, 403, 404 should never be retried
- Add jitter — Without jitter, all clients retry at the same time, causing a thundering herd
- Cap retries at 3 — If 3 retries fail, the problem is likely persistent
- Respect Retry-After headers — Some providers include them in 429 responses
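For the 429 case specifically, a server-supplied `Retry-After` header should override the computed delay. A minimal sketch; the `retryAfterMs` helper name is mine, not a provider API:

```javascript
// Prefer the server's Retry-After hint over computed backoff.
// Accepts either a fetch Headers object or a plain header map.
function retryAfterMs(headers, attempt) {
  const header = headers?.get?.('retry-after') ?? headers?.['retry-after'];
  if (header) {
    const seconds = Number(header);
    if (!Number.isNaN(seconds)) return seconds * 1000; // delta-seconds form
    const date = Date.parse(header); // HTTP-date form
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  }
  // No hint: fall back to exponential backoff with jitter
  return Math.pow(2, attempt) * 1000 + Math.random() * 1000;
}
```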
Provider-Specific Error Behavior
Each provider handles errors differently. Anthropic uses the nonstandard 529 code for overload, OpenAI and Anthropic may bill input tokens on some 429s (see the cost trap above), and built-in SDK retry behavior varies widely (see the SDK defaults table below).
Common Failure Patterns and Fixes
1. Context window overflow
The most expensive error. You build a massive prompt, send it, and get a 400 error because you exceeded the model's context window. The fix: always count tokens before sending.
```javascript
// Before sending, validate the token count against the model's context window.
// `estimatedTokens` and `truncateToTokenLimit` are your own helpers.
const maxTokens = {
  'gpt-4o': 128000, 'gpt-5': 272000, 'gpt-5.5': 1000000,
  'claude-sonnet-4.6': 1000000, 'claude-haiku-4.5': 200000,
  'gemini-2.0-flash': 1000000, 'deepseek-v4-pro': 1000000
};
const limit = maxTokens[model] || 128000;
if (estimatedTokens > limit * 0.9) {
  // Truncate or summarize before sending, leaving headroom for the response
  prompt = truncateToTokenLimit(prompt, limit * 0.85);
}
```
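One way to get `estimatedTokens` without a provider round trip is a character-count heuristic. This is a rough sketch; for exact counts, use the provider's real tokenizer (tiktoken for OpenAI, for example):

```javascript
// Rough heuristic: English text averages about 4 characters per token.
// Overestimates are safer than underestimates for context-limit checks.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}
```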
2. Streaming connection drops
Streaming responses (SSE) can disconnect mid-response, especially on long outputs. The result: you've consumed input tokens but get no complete output. Fix: implement streaming resume or fall back to non-streaming for critical requests.
```javascript
// Streaming error handling with an idle timeout that resets on each chunk
const controller = new AbortController();
let timeout = setTimeout(() => controller.abort(), 60000);
try {
  const response = await fetch(url, {
    ...options,
    signal: controller.signal
  });
  let fullResponse = '';
  for await (const chunk of response.body) {
    fullResponse += parseChunk(chunk);
    clearTimeout(timeout); // Reset the idle timeout on each chunk
    timeout = setTimeout(() => controller.abort(), 60000);
  }
  clearTimeout(timeout);
  return fullResponse;
} catch (error) {
  clearTimeout(timeout);
  if (error.name === 'AbortError') {
    // Timed out mid-stream: retry with non-streaming
    return await callWithRetry(() => fetchNonStreaming(options));
  }
  throw error;
}
```
3. Concurrent request collisions
Sending many requests simultaneously often triggers rate limits. The fix: implement a request queue with concurrency limits.
```javascript
class RequestQueue {
  constructor(concurrency = 5) {
    this.concurrency = concurrency;
    this.running = 0;
    this.queue = [];
  }

  async add(fn) {
    if (this.running >= this.concurrency) {
      // Wait until a running request finishes and hands us its slot
      await new Promise(resolve => this.queue.push(resolve));
    }
    this.running++;
    try {
      return await fn();
    } finally {
      this.running--;
      if (this.queue.length) this.queue.shift()();
    }
  }
}
```
Cost-Aware Error Handling
The biggest mistake: retrying expensive requests blindly. If you're retrying a request that costs $0.50 per attempt, three retries cost $1.50 for a single logical operation.
- Low-cost requests (<$0.01): Retry up to 3 times, aggressive backoff
- Mid-cost requests ($0.01-$0.10): Retry up to 2 times, conservative backoff
- High-cost requests (>$0.10): Retry once, then fall back to a cheaper model
- Always: Set a max budget per operation, abort if retries exceed it
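Those tiers can be encoded as a small policy lookup. The thresholds below mirror the list above; the `retryPolicy` helper and its delay values are illustrative:

```javascript
// Map an estimated per-attempt cost (USD) to a retry budget.
// Cheaper requests get more retries and shorter backoff.
function retryPolicy(estimatedCost) {
  if (estimatedCost < 0.01) return { maxRetries: 3, baseDelayMs: 500 };
  if (estimatedCost <= 0.10) return { maxRetries: 2, baseDelayMs: 2000 };
  return { maxRetries: 1, baseDelayMs: 4000, fallbackToCheaperModel: true };
}
```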
Model fallback chains
When your primary model fails, falling back to a cheaper model saves both time and money.
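A minimal sketch of such a chain, using model names from the context-window snippet above. The orderings and the `getFallbackChain` helper are illustrative assumptions, not provider recommendations:

```javascript
// Each primary model maps to an ordered list: itself first, then cheaper fallbacks.
const FALLBACK_CHAINS = {
  'gpt-5': ['gpt-5', 'gpt-4o', 'deepseek-v4-pro'],
  'claude-sonnet-4.6': ['claude-sonnet-4.6', 'claude-haiku-4.5', 'gemini-2.0-flash'],
};

function getFallbackChain(model) {
  // Unknown models get a single-entry chain (no fallback)
  return FALLBACK_CHAINS[model] ?? [model];
}
```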
Calculate the real cost of your error rate
If 10% of your requests fail and you retry them, you're paying roughly double for that 10% of your spend. Factor that into your error budget.
Monitoring and Alerting
Handle errors in code, but monitor them in production. Set up alerts for:
- Error rate >5% — Indicates a systemic issue, not random failures
- Retry rate >20% — You're retrying too aggressively, wasting money
- P95 latency >30s — Approaching timeout thresholds
- Monthly error cost >$50 — Set a hard cap and alert on it
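A rolling check against those thresholds might look like this sketch. The field names on `stats` are assumptions; wire the function to your own metrics pipeline and alert delivery:

```javascript
// Compare aggregated request stats against the alert thresholds above.
// Returns the list of triggered alerts; delivery (Slack, pager) is up to you.
function checkAlerts(stats) {
  const alerts = [];
  if (stats.errors / stats.requests > 0.05) alerts.push('error rate >5%');
  if (stats.retries / stats.requests > 0.20) alerts.push('retry rate >20%');
  if (stats.p95LatencyMs > 30000) alerts.push('P95 latency >30s');
  if (stats.monthlyErrorCostUsd > 50) alerts.push('monthly error cost >$50');
  return alerts;
}
```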
Use our AI API cost monitoring guide to set up real-time tracking of error-related spend.
Provider-Specific SDK Retry Defaults
Most providers have built-in retry logic in their official SDKs. Here's what you get by default:
| Provider | SDK Retries | Default Strategy | Customizable |
|---|---|---|---|
| OpenAI | 2 | Exponential backoff | Yes — maxRetries param |
| Anthropic | 2 | Exponential backoff | Yes — maxRetries param |
| Google Gemini | 0 (manual) | N/A — implement yourself | N/A |
| DeepSeek | 0 (manual) | N/A — implement yourself | N/A |
| Mistral | 0 (manual) | N/A — implement yourself | N/A |
Important: Don't rely solely on SDK defaults. OpenAI and Anthropic retry on 429 and 500 errors, but they won't implement cost-aware retry logic or model fallbacks. Build your own wrapper on top.
The Complete Error Handler
Here's a production-ready error handler that combines everything:
```javascript
async function resilientLLMCall(options) {
  const { model, prompt, maxCost = 0.50 } = options;
  let totalCost = 0;
  const fallbackChain = getFallbackChain(model);
  for (const currentModel of fallbackChain) {
    const cost = getModelCost(currentModel);
    if (totalCost + cost > maxCost) break; // stay under the per-operation budget
    try {
      const result = await callWithRetry(
        () => callLLM(currentModel, prompt),
        // Fewer retries for the pricier frontier models
        currentModel.includes('gpt') || currentModel.includes('claude') ? 2 : 1
      );
      return { ...result, model: currentModel };
    } catch (error) {
      totalCost += cost; // a failed attempt may still have billed input tokens
      console.warn(`${currentModel} failed: ${error.message}. Falling back.`);
    }
  }
  throw new Error('All models in fallback chain failed');
}
```
Quick Reference: Error Handling Checklist
- Do: Retry 429, 500, 502, 503, 529 with exponential backoff
- Do: Add jitter to prevent thundering herd
- Do: Set a max budget per operation
- Do: Implement model fallback chains
- Do: Validate token count before sending
- Do: Monitor error rates and costs
- Don't: Retry 400, 401, 403, 404 errors
- Don't: Retry without checking if input tokens were billed
- Don't: Use the same retry count for expensive and cheap models
- Don't: Ignore Retry-After headers
Related Reading
- AI API Rate Limits Compared: Every Provider's Limits in 2026
- AI API Cost Monitoring: How to Track, Predict, and Control Spending
- AI API Cost Optimization: The Complete Guide
- 7 AI API Pricing Mistakes That Cost Developers Thousands
- How to Set Up AI API Cost Alerts: Never Get Surprise Bills Again
- Compare model prices side by side →