AI API Rate Limits Compared: 2026 Guide to RPM, TPM, and Quotas
Your AI API integration works perfectly in development — then you hit production and start seeing 429 Too Many Requests errors. Rate limits are the invisible ceiling that determines how fast your app can scale, and every provider handles them differently.
This guide compares rate limits across all major AI API providers — OpenAI, Anthropic, Google, DeepSeek, Mistral, Cohere, xAI, and others — with specific numbers for each model tier. Plus practical strategies to handle rate limits without losing users.
Rate Limit Basics: RPM, TPM, and RPD
Every AI API provider enforces rate limits using three primary metrics:
| Metric | Full Name | What It Measures | Why It Matters |
|---|---|---|---|
| RPM | Requests per Minute | How many API calls you can make in 60 seconds | Limits concurrent users and real-time features |
| TPM | Tokens per Minute | Total tokens (input + output) processed per minute | Limits throughput for long-context or high-volume workloads |
| RPD | Requests per Day | Total API calls allowed in a 24-hour window | Limits daily batch processing and cron jobs |
Most providers enforce all three simultaneously. You might hit the TPM limit before RPM, or vice versa. The tightest constraint is your actual bottleneck.
Quick Rule of Thumb
If you're processing short requests (chat, classification), RPM is your bottleneck. If you're processing long requests (document analysis, code generation), TPM is your bottleneck. Plan for both.
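To find which limit binds for your workload, convert the TPM cap into an equivalent request rate at your average request size and take the minimum. A minimal sketch — the limit figures in the example calls are illustrative Tier 1-style numbers, not a quote of any provider's current limits:

```javascript
// Effective request rate given both RPM and TPM caps and the
// average tokens (input + output) per request.
function effectiveRpm({ rpm, tpm }, avgTokensPerRequest) {
  const tpmBound = Math.floor(tpm / avgTokensPerRequest);
  return Math.min(rpm, tpmBound);
}

// Short classification calls (~50 tokens): RPM is the bottleneck.
console.log(effectiveRpm({ rpm: 500, tpm: 40000 }, 50));   // 500
// Document analysis (~4,000 tokens): TPM caps you at 10 requests/min.
console.log(effectiveRpm({ rpm: 500, tpm: 40000 }, 4000)); // 10
```

Run this against your own average request size before choosing a provider — a generous RPM number means little if the TPM ceiling binds first.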
Provider Rate Limits: Complete Comparison
OpenAI Rate Limits
OpenAI uses a tiered system based on your total spend. Higher tiers unlock higher limits. New accounts start at Tier 1.
| Model | Tier 1 RPM | Tier 2 RPM | Tier 3 RPM | Tier 4 RPM | Tier 5 RPM |
|---|---|---|---|---|---|
| GPT-5.5 | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
| GPT-5 | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
| GPT-5 Mini | 1,000 | 4,000 | 10,000 | 15,000 | 20,000 |
| GPT-4o | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
| GPT-4o mini | 1,000 | 4,000 | 10,000 | 15,000 | 20,000 |
| GPT-oss 120B | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
OpenAI TPM limits: Tier 1 starts at 40K TPM for GPT-5, scaling to 2M TPM at Tier 5. GPT-5 Mini gets 2x the TPM of GPT-5 at each tier. Budget models (GPT-oss, GPT-4o mini) get higher limits than flagship models.
How to tier up: Spend $5+ for Tier 2, $50+ for Tier 3, $100+ for Tier 4, $250+ for Tier 5. Tiers unlock automatically based on cumulative spend.
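The spend thresholds above can be expressed as a simple lookup. This is a sketch of the thresholds as quoted here only — OpenAI may also factor in account history when unlocking tiers, so don't treat it as an authoritative reimplementation:

```javascript
// Map cumulative spend (USD) to the usage tier it unlocks,
// using the thresholds quoted above.
function openAITier(cumulativeSpendUsd) {
  if (cumulativeSpendUsd >= 250) return 5;
  if (cumulativeSpendUsd >= 100) return 4;
  if (cumulativeSpendUsd >= 50) return 3;
  if (cumulativeSpendUsd >= 5) return 2;
  return 1;
}

console.log(openAITier(75)); // 3
```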
Anthropic Rate Limits
Anthropic uses a simpler tier system based on spend. Limits are per-model and scale aggressively with usage.
| Model | Tier 1 RPM | Tier 2 RPM | Tier 3 RPM | Tier 4 RPM |
|---|---|---|---|---|
| Claude Opus 4.7 | 200 | 1,000 | 2,000 | 4,000 |
| Claude Sonnet 4.6 | 500 | 2,000 | 4,000 | 8,000 |
| Claude Haiku 4.5 | 1,000 | 4,000 | 8,000 | 15,000 |
Anthropic TPM limits: Tier 1 starts at 40K TPM for Opus, 80K for Sonnet, and 100K for Haiku. Tier 4 scales to 2M TPM for Sonnet and Haiku, 1M for Opus.
Batch API bonus: Anthropic's Batch API has separate, higher rate limits — typically 2-3x the standard limits. If you have non-urgent workloads, Batch API gives you both 50% cost savings AND higher throughput.
Google Gemini Rate Limits
Google offers the most generous free-tier limits of any provider. Paid tiers scale to very high RPM.
| Model | Free RPM | Pay-as-you-go RPM | Free TPM | Pay TPM |
|---|---|---|---|---|
| Gemini 3.1 Pro | 5 | 2,000 | 32K | 4M |
| Gemini 2.5 Pro | 5 | 2,000 | 32K | 4M |
| Gemini 2.0 Flash | 15 | 4,000 | 1M | 8M |
| Gemini 2.0 Flash Lite | 30 | 6,000 | 1M | 8M |
Google's free tier is unmatched. Gemini 2.0 Flash Lite gives you 30 RPM for free — enough for side projects and prototypes. No other provider offers this. The pay-as-you-go TPM limits (4-8M) are also the highest in the industry.
DeepSeek Rate Limits
DeepSeek uses a flat rate limit system — no tiers, no spend-based scaling. Simple and predictable.
| Model | RPM | TPM | Notes |
|---|---|---|---|
| DeepSeek V4 Pro | 60 | 100K | Same limits for all users |
| DeepSeek V4 Flash | 60 | 100K | Same limits for all users |
| DeepSeek V3 | 60 | 100K | Legacy model |
DeepSeek Limitation
DeepSeek's 60 RPM is the lowest of any major provider. For a chatbot handling 100 concurrent users making 1 request/minute each, you'd need 100 RPM — exceeding DeepSeek's limit. At scale, you'll need request queuing or multiple API keys.
Mistral Rate Limits
| Model | RPM | TPM | Notes |
|---|---|---|---|
| Mistral Large 3 | 60 | 200K | Scale plan available |
| Mistral Small 4 | 60 | 200K | Scale plan available |
Other Providers
| Provider | Model | RPM | TPM |
|---|---|---|---|
| xAI | Grok 3 | 60 | 100K |
| Moonshot | Kimi K2.6 | 60 | 100K |
| Cohere | Command R+ | 100 | 200K |
| Cohere | Command R | 100 | 200K |
| AI21 | Jamba 1.5 Large | 60 | 200K |
| Together.ai | Llama 3.1 70B | 60 | 200K |
| Together.ai | Llama 3.1 8B | 60 | 200K |
RPM Comparison: All Providers Side by Side
Here's the base-tier RPM (lowest paid tier) and maximum RPM for each provider's leading models:
| Provider | Flagship Model | Base RPM | Max RPM | Base TPM |
|---|---|---|---|---|
| Google | Gemini 2.0 Flash Lite | 6,000 | 6,000 | 8M |
| Google | Gemini 2.0 Flash | 4,000 | 4,000 | 8M |
| OpenAI | GPT-5 Mini | 1,000 | 20,000 | 40K |
| Anthropic | Claude Haiku 4.5 | 1,000 | 15,000 | 100K |
| OpenAI | GPT-5 | 500 | 10,000 | 40K |
| Anthropic | Claude Sonnet 4.6 | 500 | 8,000 | 80K |
| Cohere | Command R+ | 100 | 100 | 200K |
| DeepSeek | DeepSeek V4 Pro | 60 | 60 | 100K |
| Mistral | Mistral Large 3 | 60 | 60 | 200K |
| xAI | Grok 3 | 60 | 60 | 100K |
Google dominates on rate limits. Gemini 2.0 Flash Lite at 6,000 RPM and 8M TPM is 100x DeepSeek's RPM and 80x its TPM. If throughput is your primary constraint, Google is the clear winner. OpenAI scales well with spend. Anthropic is solid in the middle. DeepSeek, Mistral, and xAI all cap at 60 RPM.
How to Handle 429 Rate Limit Errors
When you hit a rate limit, the API returns a 429 Too Many Requests status code with a Retry-After header. Here's how to handle it gracefully:
1. Exponential Backoff
The standard approach: wait longer after each failed attempt.
async function callWithRetry(apiCall, maxRetries = 5) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await apiCall();
    } catch (error) {
      if (error.status === 429 && i < maxRetries - 1) {
        const delay = Math.pow(2, i) * 1000; // 1s, 2s, 4s, 8s
        // Prefer the server's Retry-After hint (in seconds) when present
        const retryAfter = error.headers?.['retry-after'];
        const waitMs = retryAfter ? Number(retryAfter) * 1000 : delay;
        console.log(`Rate limited. Retrying in ${waitMs}ms...`);
        await new Promise(r => setTimeout(r, waitMs));
      } else {
        throw error;
      }
    }
  }
}
2. Request Queue
For high-volume applications, queue requests and process them at a controlled rate:
class RateLimiter {
  constructor(rpm) {
    this.interval = 60000 / rpm; // ms between requests
    this.queue = [];
    this.processing = false;
  }

  async add(apiCall) {
    return new Promise((resolve, reject) => {
      this.queue.push({ apiCall, resolve, reject });
      if (!this.processing) this.process();
    });
  }

  async process() {
    this.processing = true;
    while (this.queue.length > 0) {
      const { apiCall, resolve, reject } = this.queue.shift();
      try {
        const result = await apiCall();
        resolve(result);
      } catch (e) {
        reject(e);
      }
      if (this.queue.length > 0) {
        await new Promise(r => setTimeout(r, this.interval));
      }
    }
    this.processing = false;
  }
}

// Usage: stay under DeepSeek's 60 RPM
const limiter = new RateLimiter(55); // 55 RPM leaves a safety buffer
const result = await limiter.add(() => callDeepSeek(prompt));
3. Multi-Key Rotation
For providers with low RPM (DeepSeek, Mistral), multiple API keys can multiply your effective rate limit — but check your provider's terms of service first, as some prohibit circumventing limits this way:
const keys = [process.env.KEY_1, process.env.KEY_2, process.env.KEY_3];
let keyIndex = 0;

function getNextKey() {
  const key = keys[keyIndex];
  keyIndex = (keyIndex + 1) % keys.length;
  return key;
}

// Effective RPM: 60 x 3 keys = 180 RPM
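Naive round-robin can still push one key over its limit during a burst, so each key needs its own budget. A sketch of least-loaded selection with per-key headroom tracking — the key names and counters are illustrative, and a real version would reset the counters every 60 seconds:

```javascript
// Track per-key usage this minute and pick the least-loaded key
// that still has headroom under its own RPM cap.
const pool = [
  { key: 'KEY_1', usedThisMinute: 0 },
  { key: 'KEY_2', usedThisMinute: 0 },
  { key: 'KEY_3', usedThisMinute: 0 },
];

function pickKey(rpmPerKey) {
  // Least-loaded key wins; ties resolve to the earliest entry.
  const candidate = pool.reduce((a, b) =>
    a.usedThisMinute <= b.usedThisMinute ? a : b);
  if (candidate.usedThisMinute >= rpmPerKey) return null; // all keys saturated
  candidate.usedThisMinute++;
  return candidate.key;
}
```

Returning null when every key is saturated lets the caller queue the request instead of burning a call that is guaranteed to 429.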
4. Model Fallback Chain
When your primary model is rate-limited, fall back to a cheaper alternative:
const fallbackChain = [
  { model: 'claude-sonnet-4.6', rpm: 500 },
  { model: 'gpt-5-mini', rpm: 1000 },
  { model: 'gemini-2.0-flash', rpm: 4000 },
];

async function callWithFallback(prompt) {
  for (const { model } of fallbackChain) {
    try {
      return await callModel(model, prompt);
    } catch (e) {
      if (e.status === 429) continue;
      throw e;
    }
  }
  throw new Error('All models rate-limited');
}
Pro Tip: Combine Strategies
The most robust approach combines all four: a request queue keeps you under the limit, exponential backoff handles spikes, multi-key rotation multiplies capacity, and a fallback chain ensures availability. Start with a queue + backoff, add multi-key if you need more throughput.
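A compact end-to-end sketch of queue + backoff composed together, with a stubbed API call that fails once with a 429. The throttle and retry helpers are simplified versions of the patterns above, not a production implementation:

```javascript
// Throttle: space calls at least 60000/rpm ms apart.
function withThrottle(fn, rpm) {
  const interval = 60000 / rpm;
  let next = 0;
  return async (...args) => {
    const wait = Math.max(0, next - Date.now());
    next = Date.now() + wait + interval;
    if (wait > 0) await new Promise(r => setTimeout(r, wait));
    return fn(...args);
  };
}

// Backoff: retry 429s with exponentially growing delays.
async function withRetry(fn, maxRetries = 3) {
  for (let i = 0; ; i++) {
    try { return await fn(); }
    catch (e) {
      if (e.status !== 429 || i >= maxRetries - 1) throw e;
      await new Promise(r => setTimeout(r, 2 ** i * 1000));
    }
  }
}

// Stubbed API call: one 429, then success.
let calls = 0;
async function flakyCall() {
  calls++;
  if (calls === 1) { const e = new Error('rate limited'); e.status = 429; throw e; }
  return 'ok';
}

const throttled = withThrottle(() => withRetry(flakyCall), 120);
throttled().then(r => console.log(r)); // logs "ok" after one 1s retry
```

The layering order matters: throttle on the outside keeps your steady-state rate legal, while retry on the inside absorbs the occasional 429 the throttle can't prevent (shared limits, provider-side variance).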
Rate Limits vs Cost: The Tradeoff
You might expect higher rate limits to come bundled with higher prices — but the data says otherwise. Here's how rate limits correlate with pricing for each provider's budget model (per-1K-request figures are estimates for a typical short chat request):
| Model | Input ($/1M) | Output ($/1M) | Base RPM | Cost per 1K Requests |
|---|---|---|---|---|
| Gemini 2.0 Flash Lite | $0.075 | $0.30 | 6,000 | $0.08 |
| DeepSeek V4 Flash | $0.14 | $0.28 | 60 | $0.10 |
| GPT-oss 20B | $0.08 | $0.35 | 500 | $0.11 |
| GPT-4o mini | $0.15 | $0.60 | 1,000 | $0.18 |
| Mistral Small 4 | $0.15 | $0.60 | 60 | $0.18 |
| DeepSeek V4 Pro | $0.44 | $0.87 | 60 | $0.26 |
| GPT-5 Mini | $0.25 | $2.00 | 1,000 | $0.58 |
| Claude Haiku 4.5 | $1.00 | $5.00 | 1,000 | $1.50 |
Gemini 2.0 Flash Lite is the clear winner on both cost AND rate limits. At $0.075/$0.30 per million tokens and 6,000 RPM, it's the cheapest option with the highest throughput ceiling. For high-volume, cost-sensitive workloads, Flash Lite is hard to beat.
Choosing a Provider by Throughput Needs
| Your Throughput Need | Best Provider | Why |
|---|---|---|
| Prototype / Side project | Google (Free tier) | 30 RPM free, no credit card required |
| Chatbot (100-500 users) | Google or OpenAI | 4,000-6,000 RPM handles concurrent users |
| High-volume API (>1K RPM) | Google Gemini Flash | 6,000 RPM at $0.075/$0.30 |
| Enterprise (10K+ RPM) | OpenAI (Tier 5) | 20,000 RPM for GPT-5 Mini with spend |
| Batch processing | Any provider | Batch APIs have separate, higher limits |
| Budget + moderate throughput | DeepSeek or Mistral | 60 RPM covers a few dozen concurrent users |
| Quality + moderate throughput | Anthropic | 500-1,000 RPM for Sonnet/Haiku at good prices |
Practical Throughput Calculations
How many concurrent users can each provider handle? Assume 2 requests per user per minute (typical for a chatbot):
Concurrent Users at Base Tier
At 2 requests per user per minute, DeepSeek can only handle 30 concurrent users. That's fine for internal tools and small chatbots, but not for production apps with real traffic. Google Flash Lite handles 100x more users at comparable or lower per-token prices.
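The arithmetic behind those numbers, as a one-liner you can adapt to your own traffic profile (the RPM figures in the example calls come from the comparison tables above):

```javascript
// Concurrent users a rate limit supports, given how many
// requests each user makes per minute.
const maxConcurrentUsers = (rpm, reqPerUserPerMin = 2) =>
  Math.floor(rpm / reqPerUserPerMin);

console.log(maxConcurrentUsers(60));   // DeepSeek: 30 users
console.log(maxConcurrentUsers(6000)); // Gemini Flash Lite: 3000 users
```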
The Bottom Line
For most developers, rate limits are not the bottleneck — cost is. But if you're building a high-traffic chatbot or real-time API, rate limits become critical. Google Gemini Flash Lite offers the best combination of high RPM (6,000), high TPM (8M), and low cost ($0.075/$0.30). OpenAI scales well with spend. DeepSeek and Mistral are limited to 60 RPM — fine for low-traffic apps, but you'll hit the ceiling fast.
Strategy: Start with the cheapest model that meets your quality needs. If you hit rate limits, add request queuing and exponential backoff before switching providers. If you still need more throughput, use multi-key rotation or upgrade to a provider with higher limits. Use the APIpulse calculator to model costs at your target throughput.