
AI API Rate Limits Compared: 2026 Guide to RPM, TPM, and Quotas

Your AI API integration works perfectly in development — then you hit production and start seeing 429 Too Many Requests errors. Rate limits are the invisible ceiling that determines how fast your app can scale, and every provider handles them differently.

This guide compares rate limits across all major AI API providers — OpenAI, Anthropic, Google, DeepSeek, Mistral, Cohere, xAI, and others — with specific numbers for each model tier. Plus practical strategies to handle rate limits without losing users.

Rate Limit Basics: RPM, TPM, and RPD

Every AI API provider enforces rate limits using three primary metrics:

| Metric | Full Name | What It Measures | Why It Matters |
|---|---|---|---|
| RPM | Requests per Minute | How many API calls you can make in 60 seconds | Limits concurrent users and real-time features |
| TPM | Tokens per Minute | Total tokens (input + output) processed per minute | Limits throughput for long-context or high-volume workloads |
| RPD | Requests per Day | Total API calls allowed in a 24-hour window | Limits daily batch processing and cron jobs |

Most providers enforce all three simultaneously. You might hit the TPM limit before RPM, or vice versa. The tightest constraint is your actual bottleneck.

Quick Rule of Thumb

If you're processing short requests (chat, classification), RPM is your bottleneck. If you're processing long requests (document analysis, code generation), TPM is your bottleneck. Plan for both.
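To check which limit binds for your workload, estimate both ceilings from your average request size. Here's a minimal sketch in JavaScript (the numbers are illustrative, taken from the OpenAI Tier 1 limits covered below):

// Given a model's RPM and TPM limits and your average request size,
// work out which limit you hit first and your real throughput ceiling.
function effectiveThroughput(rpm, tpm, avgTokensPerRequest) {
  const tokenCeiling = Math.floor(tpm / avgTokensPerRequest); // requests/min allowed by TPM
  return {
    maxRequestsPerMinute: Math.min(rpm, tokenCeiling),
    boundBy: tokenCeiling < rpm ? 'TPM' : 'RPM',
  };
}

// Example: 500 RPM and 40K TPM (GPT-5 at Tier 1, per the table below)
// with 2,000-token requests: 40,000 / 2,000 = 20 requests/min, so TPM
// binds long before RPM does.
console.log(effectiveThroughput(500, 40000, 2000)); // { maxRequestsPerMinute: 20, boundBy: 'TPM' }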

Provider Rate Limits: Complete Comparison

OpenAI Rate Limits

OpenAI uses a tiered system based on your total spend. Higher tiers unlock higher limits. New accounts start at Tier 1.

| Model | Tier 1 RPM | Tier 2 RPM | Tier 3 RPM | Tier 4 RPM | Tier 5 RPM |
|---|---|---|---|---|---|
| GPT-5.5 | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
| GPT-5 | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
| GPT-5 Mini | 1,000 | 4,000 | 10,000 | 15,000 | 20,000 |
| GPT-4o | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
| GPT-4o mini | 1,000 | 4,000 | 10,000 | 15,000 | 20,000 |
| GPT-oss 120B | 500 | 2,000 | 5,000 | 8,000 | 10,000 |

OpenAI TPM limits: Tier 1 starts at 40K TPM for GPT-5, scaling to 2M TPM at Tier 5. GPT-5 Mini gets 2x the TPM of GPT-5 at each tier. Budget models (GPT-oss, GPT-4o mini) get higher limits than flagship models.

How to tier up: Spend $5+ for Tier 2, $50+ for Tier 3, $100+ for Tier 4, $250+ for Tier 5. Tiers unlock automatically based on cumulative spend.

Anthropic Rate Limits

Anthropic uses a simpler tier system based on spend. Limits are per-model and scale aggressively with usage.

| Model | Tier 1 RPM | Tier 2 RPM | Tier 3 RPM | Tier 4 RPM |
|---|---|---|---|---|
| Claude Opus 4.7 | 200 | 1,000 | 2,000 | 4,000 |
| Claude Sonnet 4.6 | 500 | 2,000 | 4,000 | 8,000 |
| Claude Haiku 4.5 | 1,000 | 4,000 | 8,000 | 15,000 |

Anthropic TPM limits: Tier 1 starts at 40K TPM for Opus, 80K for Sonnet, and 100K for Haiku. Tier 4 scales to 2M TPM for Sonnet and Haiku, 1M for Opus.

Batch API bonus: Anthropic's Batch API has separate, higher rate limits — typically 2-3x the standard limits. If you have non-urgent workloads, Batch API gives you both 50% cost savings AND higher throughput.
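If you want to try it, here's a minimal sketch using Anthropic's message batches endpoint (the model ID follows this article's naming, and the exact request fields may vary by SDK version, so treat this as illustrative rather than definitive):

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Submit non-urgent requests as a batch: 50% cheaper, separate rate limits.
const batch = await client.messages.batches.create({
  requests: [
    {
      custom_id: 'doc-1', // your identifier for matching results later
      params: {
        model: 'claude-haiku-4.5', // article's naming; check your SDK's model IDs
        max_tokens: 1024,
        messages: [{ role: 'user', content: 'Summarize this document...' }],
      },
    },
  ],
});

console.log(batch.id, batch.processing_status); // poll until 'ended', then fetch results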

Google Gemini Rate Limits

Google offers the most generous free-tier limits of any provider. Paid tiers scale to very high RPM.

| Model | Free RPM | Pay-as-you-go RPM | Free TPM | Paid TPM |
|---|---|---|---|---|
| Gemini 3.1 Pro | 5 | 2,000 | 32K | 4M |
| Gemini 2.5 Pro | 5 | 2,000 | 32K | 4M |
| Gemini 2.0 Flash | 15 | 4,000 | 1M | 8M |
| Gemini 2.0 Flash Lite | 30 | 6,000 | 1M | 8M |

Google's free tier is unmatched. Gemini 2.0 Flash Lite gives you 30 RPM for free — enough for side projects and prototypes. No other provider offers this. The pay-as-you-go TPM limits (4-8M) are also the highest in the industry.

DeepSeek Rate Limits

DeepSeek uses a flat rate limit system — no tiers, no spend-based scaling. Simple and predictable.

| Model | RPM | TPM | Notes |
|---|---|---|---|
| DeepSeek V4 Pro | 60 | 100K | Same limits for all users |
| DeepSeek V4 Flash | 60 | 100K | Same limits for all users |
| DeepSeek V3 | 60 | 100K | Legacy model |

DeepSeek Limitation

DeepSeek's 60 RPM is the lowest of any major provider. For a chatbot handling 100 concurrent users making 1 request/minute each, you'd need 100 RPM — exceeding DeepSeek's limit. At scale, you'll need request queuing or multiple API keys.

Mistral Rate Limits

| Model | RPM | TPM | Notes |
|---|---|---|---|
| Mistral Large 3 | 60 | 200K | Scale plan available |
| Mistral Small 4 | 60 | 200K | Scale plan available |

Other Providers

| Provider | Model | RPM | TPM |
|---|---|---|---|
| xAI | Grok 3 | 60 | 100K |
| Moonshot | Kimi K2.6 | 60 | 100K |
| Cohere | Command R+ | 100 | 200K |
| Cohere | Command R | 100 | 200K |
| AI21 | Jamba 1.5 Large | 60 | 200K |
| Together.ai | Llama 3.1 70B | 60 | 200K |
| Together.ai | Llama 3.1 8B | 60 | 200K |

RPM Comparison: All Providers Side by Side

Here's the base-tier RPM (lowest paid tier) and maximum RPM for each provider's leading models:

| Provider | Model | Base RPM | Max RPM | Base TPM |
|---|---|---|---|---|
| Google | Gemini 2.0 Flash Lite | 6,000 | 6,000 | 8M |
| Google | Gemini 2.0 Flash | 4,000 | 4,000 | 8M |
| OpenAI | GPT-5 Mini | 1,000 | 20,000 | 80K |
| Anthropic | Claude Haiku 4.5 | 1,000 | 15,000 | 100K |
| OpenAI | GPT-5 | 500 | 10,000 | 40K |
| Anthropic | Claude Sonnet 4.6 | 500 | 8,000 | 80K |
| Cohere | Command R+ | 100 | 100 | 200K |
| DeepSeek | DeepSeek V4 Pro | 60 | 60 | 100K |
| Mistral | Mistral Large 3 | 60 | 60 | 200K |
| xAI | Grok 3 | 60 | 60 | 100K |

Google dominates on rate limits. Gemini 2.0 Flash Lite at 6,000 RPM and 8M TPM is 100x DeepSeek's RPM and 80x its TPM. If throughput is your primary constraint, Google is the clear winner. OpenAI scales well with spend. Anthropic is solid in the middle. DeepSeek, Mistral, and xAI all cap at 60 RPM.

How to Handle 429 Rate Limit Errors

When you hit a rate limit, the API returns a 429 Too Many Requests status code with a Retry-After header. Here's how to handle it gracefully:

1. Exponential Backoff

The standard approach: wait longer after each failed attempt.

async function callWithRetry(apiCall, maxRetries = 5) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await apiCall();
    } catch (error) {
      if (error.status === 429 && i < maxRetries - 1) {
        const delay = Math.pow(2, i) * 1000; // 1s, 2s, 4s, 8s, 16s
        // Prefer the server's Retry-After hint (in seconds) when present
        const retryAfter = Number(error.headers?.['retry-after']);
        const waitMs = retryAfter > 0 ? retryAfter * 1000 : delay;
        console.log(`Rate limited. Retrying in ${waitMs}ms...`);
        await new Promise(r => setTimeout(r, waitMs));
      } else {
        throw error; // non-429 error, or retries exhausted
      }
    }
  }
}
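Usage is just a wrapper around whatever client call you already make (callOpenAI below is a placeholder, not a real SDK function):

const reply = await callWithRetry(() => callOpenAI(prompt));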

2. Request Queue

For high-volume applications, queue requests and process them at a controlled rate:

class RateLimiter {
  constructor(rpm) {
    this.interval = 60000 / rpm; // ms between requests
    this.queue = [];
    this.processing = false;
  }

  async add(apiCall) {
    return new Promise((resolve, reject) => {
      this.queue.push({ apiCall, resolve, reject });
      if (!this.processing) this.process();
    });
  }

  async process() {
    this.processing = true;
    while (this.queue.length > 0) {
      const { apiCall, resolve, reject } = this.queue.shift();
      try {
        const result = await apiCall();
        resolve(result);
      } catch (e) {
        reject(e);
      }
      if (this.queue.length > 0) {
        await new Promise(r => setTimeout(r, this.interval));
      }
    }
    this.processing = false;
  }
}

// Usage: Stay under DeepSeek's 60 RPM
const limiter = new RateLimiter(55); // 55 RPM (buffer)
const result = await limiter.add(() => callDeepSeek(prompt));

3. Multi-Key Rotation

For providers with low RPM (DeepSeek, Mistral), you can use multiple API keys to multiply your effective rate limit (check your provider's terms of service first; some restrict multiple keys per organization):

const keys = [process.env.KEY_1, process.env.KEY_2, process.env.KEY_3];
let keyIndex = 0;

function getNextKey() {
  const key = keys[keyIndex];
  keyIndex = (keyIndex + 1) % keys.length;
  return key;
}

// Effective RPM: 60 x 3 keys = 180 RPM
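To keep each key under its own 60 RPM ceiling, pair rotation with the RateLimiter class from strategy 2, one limiter per key. A sketch, assuming callDeepSeek accepts the API key as a second argument:

// One limiter per key, each pinned safely below the 60 RPM cap.
const limiters = keys.map(() => new RateLimiter(55));
let slot = 0;

async function callRotated(prompt) {
  const i = slot;
  slot = (slot + 1) % keys.length;
  return limiters[i].add(() => callDeepSeek(prompt, keys[i]));
}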

4. Model Fallback Chain

When your primary model is rate-limited, fall back to a cheaper alternative:

const fallbackChain = [
  { model: 'claude-sonnet-4.6', rpm: 500 },
  { model: 'gpt-5-mini', rpm: 1000 },
  { model: 'gemini-2.0-flash', rpm: 4000 },
];

async function callWithFallback(prompt) {
  for (const { model } of fallbackChain) {
    try {
      return await callModel(model, prompt);
    } catch (e) {
      if (e.status === 429) continue;
      throw e;
    }
  }
  throw new Error('All models rate-limited');
}

Pro Tip: Combine Strategies

The most robust approach combines all four: a request queue keeps you under the limit, exponential backoff handles spikes, multi-key rotation multiplies capacity, and a fallback chain ensures availability. Start with a queue + backoff, add multi-key if you need more throughput.
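In code, the combination is mostly a matter of nesting the pieces from this section:

// The queue keeps steady traffic under the limit; backoff and the
// fallback chain absorb whatever slips through during spikes.
const limiter = new RateLimiter(450); // a buffer below a 500 RPM base tier
const reply = await limiter.add(() =>
  callWithRetry(() => callModel('gpt-5', prompt))
);
// Swap the inner call for callWithFallback(prompt) if you'd rather
// switch models on a 429 than wait out the backoff.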

Rate Limits vs Cost: The Tradeoff

Higher rate limits often come with higher costs. Here's how rate limits correlate with pricing for each provider's budget model:

| Model | Input ($/1M) | Output ($/1M) | Base RPM | Cost per 1K Requests |
|---|---|---|---|---|
| Gemini 2.0 Flash Lite | $0.075 | $0.30 | 6,000 | $0.08 |
| DeepSeek V4 Flash | $0.14 | $0.28 | 60 | $0.10 |
| GPT-oss 20B | $0.08 | $0.35 | 500 | $0.11 |
| GPT-4o mini | $0.15 | $0.60 | 1,000 | $0.18 |
| Mistral Small 4 | $0.15 | $0.60 | 60 | $0.18 |
| DeepSeek V4 Pro | $0.44 | $0.87 | 60 | $0.26 |
| GPT-5 Mini | $0.25 | $2.00 | 1,000 | $0.58 |
| Claude Haiku 4.5 | $1.00 | $5.00 | 1,000 | $1.50 |

Gemini 2.0 Flash Lite is the clear winner on both cost and rate limits. At $0.075/$0.30 per million tokens and 6,000 RPM, it is both the cheapest and the highest-throughput option in this comparison. For high-throughput, cost-sensitive workloads, Flash Lite is hard to beat.
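The per-request figures above depend on an assumed request shape. To reproduce the comparison for your own workload, the math is straightforward (the 500/150 token split below is an illustrative assumption, not necessarily the table's exact basis):

// Cost of 1,000 requests, given per-million-token prices and a request shape.
function costPer1kRequests(inputPerM, outputPerM, inTokens, outTokens) {
  const perRequest = (inTokens / 1e6) * inputPerM + (outTokens / 1e6) * outputPerM;
  return 1000 * perRequest;
}

// Gemini 2.0 Flash Lite with a 500-token prompt and 150-token reply:
console.log(costPer1kRequests(0.075, 0.30, 500, 150).toFixed(2)); // "0.08"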

Choosing a Provider by Throughput Needs

| Your Throughput Need | Best Provider | Why |
|---|---|---|
| Prototype / side project | Google (free tier) | 30 RPM free, no credit card required |
| Chatbot (100-500 users) | Google or OpenAI | 4,000-6,000 RPM handles concurrent users |
| High-volume API (>1K RPM) | Google Gemini Flash | 6,000 RPM at $0.075/$0.30 |
| Enterprise (10K+ RPM) | OpenAI (Tier 5) | 20,000 RPM for GPT-5 Mini with spend |
| Batch processing | Any provider | Batch APIs have separate, higher limits |
| Budget + moderate throughput | DeepSeek or Mistral | 60 RPM is enough for most apps under 100 users |
| Quality + moderate throughput | Anthropic | 500-1,000 RPM for Sonnet/Haiku at good prices |

Practical Throughput Calculations

How many concurrent users can each provider handle? Assume 2 requests per user per minute (typical for a chatbot):

Concurrent Users at Base Tier

| Model (base tier) | Concurrent Users |
|---|---|
| Google Flash Lite (6,000 RPM) | 3,000 users |
| Google Flash (4,000 RPM) | 2,000 users |
| OpenAI GPT-5 Mini Tier 1 (1,000 RPM) | 500 users |
| Anthropic Haiku Tier 1 (1,000 RPM) | 500 users |
| OpenAI GPT-5 Tier 1 (500 RPM) | 250 users |
| Anthropic Sonnet Tier 1 (500 RPM) | 250 users |
| DeepSeek V4 Pro (60 RPM) | 30 users |
| Mistral Large (60 RPM) | 30 users |

At 2 requests per user per minute, DeepSeek can only handle 30 concurrent users. That's fine for internal tools and small chatbots, but not for production apps with real traffic. Google Flash Lite handles 100x more users at a fraction of the cost.

The Bottom Line

For most developers, rate limits are not the bottleneck — cost is. But if you're building a high-traffic chatbot or real-time API, rate limits become critical. Google Gemini Flash Lite offers the best combination of high RPM (6,000), high TPM (8M), and low cost ($0.075/$0.30). OpenAI scales well with spend. DeepSeek and Mistral are limited to 60 RPM — fine for low-traffic apps, but you'll hit the ceiling fast.

Strategy: Start with the cheapest model that meets your quality needs. If you hit rate limits, add request queuing and exponential backoff before switching providers. If you still need more throughput, use multi-key rotation or upgrade to a provider with higher limits. Use the APIpulse calculator to model costs at your target throughput.

Modeling AI API costs at your target throughput? Enter your usage patterns and see exact monthly costs — plus rate limit guidance for every model.


Want to optimize your AI API costs?

APIpulse Pro ($29 one-time) includes saved scenarios, cost report exports, and personalized recommendations that can save you up to 40%.
