AI API Rate Limits Compared: 2026 Guide to RPM, TPM, and Quotas
Your AI API integration works perfectly in development — then you hit production and start seeing 429 Too Many Requests errors. Rate limits are the invisible ceiling that determines how fast your app can scale, and every provider handles them differently.
This guide compares rate limits across all major AI API providers — OpenAI, Anthropic, Google, DeepSeek, Mistral, Cohere, xAI, and others — with specific numbers for each model tier. Plus practical strategies to handle rate limits without losing users.
Rate Limit Basics: RPM, TPM, and RPD
Every AI API provider enforces rate limits using three primary metrics:
| Metric | Full Name | What It Measures | Why It Matters |
|---|---|---|---|
| RPM | Requests per Minute | How many API calls you can make in 60 seconds | Limits concurrent users and real-time features |
| TPM | Tokens per Minute | Total tokens (input + output) processed per minute | Limits throughput for long-context or high-volume workloads |
| RPD | Requests per Day | Total API calls allowed in a 24-hour window | Limits daily batch processing and cron jobs |
Most providers enforce all three simultaneously. You might hit the TPM limit before RPM, or vice versa. The tightest constraint is your actual bottleneck.
Quick Rule of Thumb
If you're processing short requests (chat, classification), RPM is your bottleneck. If you're processing long requests (document analysis, code generation), TPM is your bottleneck. Plan for both.
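To find which limit binds for your workload, convert the TPM cap into an equivalent request rate at your average request size and take the minimum. A minimal sketch — the limit figures in the example calls are illustrative Tier 1-style numbers, not a quote of any provider's current limits:

```javascript
// Effective request rate given both RPM and TPM caps and the
// average tokens (input + output) per request.
function effectiveRpm({ rpm, tpm }, avgTokensPerRequest) {
  const tpmBound = Math.floor(tpm / avgTokensPerRequest);
  return Math.min(rpm, tpmBound);
}

// Short classification calls (~50 tokens): RPM is the bottleneck.
console.log(effectiveRpm({ rpm: 500, tpm: 40000 }, 50));   // 500
// Document analysis (~4,000 tokens): TPM caps you at 10 requests/min.
console.log(effectiveRpm({ rpm: 500, tpm: 40000 }, 4000)); // 10
```

Run this against your own average request size before choosing a provider — a generous RPM number means little if the TPM ceiling binds first.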
Provider Rate Limits: Complete Comparison
OpenAI Rate Limits
OpenAI uses a tiered system based on your total spend. Higher tiers unlock higher limits. New accounts start at Tier 1.
| Model | Tier 1 RPM | Tier 2 RPM | Tier 3 RPM | Tier 4 RPM | Tier 5 RPM |
|---|---|---|---|---|---|
| GPT-5.5 | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
| GPT-5 | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
| GPT-5 Mini | 1,000 | 4,000 | 10,000 | 15,000 | 20,000 |
| GPT-4o | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
| GPT-4o mini | 1,000 | 4,000 | 10,000 | 15,000 | 20,000 |
| GPT-oss 120B | 500 | 2,000 | 5,000 | 8,000 | 10,000 |
OpenAI TPM limits: Tier 1 starts at 40K TPM for GPT-5, scaling to 2M TPM at Tier 5. GPT-5 Mini gets 2x the TPM of GPT-5 at each tier. Budget models (GPT-oss, GPT-4o mini) get higher limits than flagship models.
How to tier up: Spend $5+ for Tier 2, $50+ for Tier 3, $100+ for Tier 4, $250+ for Tier 5. Tiers unlock automatically based on cumulative spend.
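The spend thresholds above can be expressed as a simple lookup. This is a sketch of the thresholds as quoted here only — OpenAI may also factor in account history when unlocking tiers, so don't treat it as an authoritative reimplementation:

```javascript
// Map cumulative spend (USD) to the usage tier it unlocks,
// using the thresholds quoted above.
function openAITier(cumulativeSpendUsd) {
  if (cumulativeSpendUsd >= 250) return 5;
  if (cumulativeSpendUsd >= 100) return 4;
  if (cumulativeSpendUsd >= 50) return 3;
  if (cumulativeSpendUsd >= 5) return 2;
  return 1;
}

console.log(openAITier(75)); // 3
```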
Anthropic Rate Limits
Anthropic uses a simpler tier system based on spend. Limits are per-model and scale aggressively with usage.
| Model | Tier 1 RPM | Tier 2 RPM | Tier 3 RPM | Tier 4 RPM |
|---|---|---|---|---|
| Claude Opus 4.7 | 200 | 1,000 | 2,000 | 4,000 |
| Claude Sonnet 4.6 | 500 | 2,000 | 4,000 | 8,000 |
| Claude Haiku 4.5 | 1,000 | 4,000 | 8,000 | 15,000 |
Anthropic TPM limits: Tier 1 starts at 40K TPM for Opus, 80K for Sonnet, and 100K for Haiku. Tier 4 scales to 2M TPM for Sonnet and Haiku, 1M for Opus.
Batch API bonus: Anthropic's Batch API has separate, higher rate limits — typically 2-3x the standard limits. If you have non-urgent workloads, Batch API gives you both 50% cost savings AND higher throughput.
Google Gemini Rate Limits
Google offers the most generous free-tier limits of any provider. Paid tiers scale to very high RPM.
| Model | Free RPM | Pay-as-you-go RPM | Free TPM | Pay TPM |
|---|---|---|---|---|
| Gemini 3.1 Pro | 5 | 2,000 | 32K | 4M |
| Gemini 2.5 Pro | 5 | 2,000 | 32K | 4M |
| Gemini 2.0 Flash | 15 | 4,000 | 1M | 8M |
| Gemini 2.0 Flash Lite | 30 | 6,000 | 1M | 8M |
Google's free tier is unmatched. Gemini 2.0 Flash Lite gives you 30 RPM for free — enough for side projects and prototypes. No other provider offers this. The pay-as-you-go TPM limits (4-8M) are also the highest in the industry.
DeepSeek Rate Limits
DeepSeek uses a flat rate limit system — no tiers, no spend-based scaling. Simple and predictable.
| Model | RPM | TPM | Notes |
|---|---|---|---|
| DeepSeek V4 Pro | 60 | 100K | Same limits for all users |
| DeepSeek V4 Flash | 60 | 100K | Same limits for all users |
| DeepSeek V3 | 60 | 100K | Legacy model |
DeepSeek Limitation
DeepSeek's 60 RPM is the lowest of any major provider. For a chatbot handling 100 concurrent users making 1 request/minute each, you'd need 100 RPM — exceeding DeepSeek's limit. At scale, you'll need request queuing or multiple API keys.
Mistral Rate Limits
| Model | RPM | TPM | Notes |
|---|---|---|---|
| Mistral Large 3 | 60 | 200K | Scale plan available |
| Mistral Small 4 | 60 | 200K | Scale plan available |
Other Providers
| Provider | Model | RPM | TPM |
|---|---|---|---|
| xAI | Grok 3 | 60 | 100K |
| Moonshot | Kimi K2.6 | 60 | 100K |
| Cohere | Command R+ | 100 | 200K |
| Cohere | Command R | 100 | 200K |
| AI21 | Jamba 1.5 Large | 60 | 200K |
| Together.ai | Llama 3.1 70B | 60 | 200K |
| Together.ai | Llama 3.1 8B | 60 | 200K |
RPM Comparison: All Providers Side by Side
Here's the base-tier RPM (lowest paid tier) and maximum RPM for each provider's leading models:
| Provider | Flagship Model | Base RPM | Max RPM | Base TPM |
|---|---|---|---|---|
| Google | Gemini 2.0 Flash Lite | 6,000 | 6,000 | 8M |
| Google | Gemini 2.0 Flash | 4,000 | 4,000 | 8M |
| OpenAI | GPT-5 Mini | 1,000 | 20,000 | 40K |
| Anthropic | Claude Haiku 4.5 | 1,000 | 15,000 | 100K |
| OpenAI | GPT-5 | 500 | 10,000 | 40K |
| Anthropic | Claude Sonnet 4.6 | 500 | 8,000 | 80K |
| Cohere | Command R+ | 100 | 100 | 200K |
| DeepSeek | DeepSeek V4 Pro | 60 | 60 | 100K |
| Mistral | Mistral Large 3 | 60 | 60 | 200K |
| xAI | Grok 3 | 60 | 60 | 100K |
Google dominates on rate limits. Gemini 2.0 Flash Lite at 6,000 RPM and 8M TPM is 100x DeepSeek's RPM and 80x its TPM. If throughput is your primary constraint, Google is the clear winner. OpenAI scales well with spend. Anthropic is solid in the middle. DeepSeek, Mistral, and xAI all cap at 60 RPM.
How to Handle 429 Rate Limit Errors
When you hit a rate limit, the API returns a 429 Too Many Requests status code with a Retry-After header. Here's how to handle it gracefully:
1. Exponential Backoff
The standard approach: wait longer after each failed attempt.
async function callWithRetry(apiCall, maxRetries = 5) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await apiCall();
    } catch (error) {
      if (error.status === 429 && i < maxRetries - 1) {
        const delay = Math.pow(2, i) * 1000; // 1s, 2s, 4s, 8s
        // Prefer the server's Retry-After hint (in seconds) when present
        const retryAfter = error.headers?.['retry-after'];
        const waitMs = retryAfter ? Number(retryAfter) * 1000 : delay;
        console.log(`Rate limited. Retrying in ${waitMs}ms...`);
        await new Promise(r => setTimeout(r, waitMs));
      } else {
        throw error;
      }
    }
  }
}
2. Request Queue
For high-volume applications, queue requests and process them at a controlled rate:
class RateLimiter {
  constructor(rpm) {
    this.interval = 60000 / rpm; // ms between requests
    this.queue = [];
    this.processing = false;
  }

  async add(apiCall) {
    return new Promise((resolve, reject) => {
      this.queue.push({ apiCall, resolve, reject });
      if (!this.processing) this.process();
    });
  }

  async process() {
    this.processing = true;
    while (this.queue.length > 0) {
      const { apiCall, resolve, reject } = this.queue.shift();
      try {
        const result = await apiCall();
        resolve(result);
      } catch (e) {
        reject(e);
      }
      if (this.queue.length > 0) {
        await new Promise(r => setTimeout(r, this.interval));
      }
    }
    this.processing = false;
  }
}

// Usage: stay under DeepSeek's 60 RPM
const limiter = new RateLimiter(55); // 55 RPM leaves a safety buffer
const result = await limiter.add(() => callDeepSeek(prompt));
3. Multi-Key Rotation
For providers with low RPM (DeepSeek, Mistral), multiple API keys can multiply your effective rate limit — but check your provider's terms of service first, as some prohibit circumventing limits this way:
const keys = [process.env.KEY_1, process.env.KEY_2, process.env.KEY_3];
let keyIndex = 0;

function getNextKey() {
  const key = keys[keyIndex];
  keyIndex = (keyIndex + 1) % keys.length;
  return key;
}

// Effective RPM: 60 x 3 keys = 180 RPM
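Naive round-robin can still push one key over its limit during a burst, so each key needs its own budget. A sketch of least-loaded selection with per-key headroom tracking — the key names and counters are illustrative, and a real version would reset the counters every 60 seconds:

```javascript
// Track per-key usage this minute and pick the least-loaded key
// that still has headroom under its own RPM cap.
const pool = [
  { key: 'KEY_1', usedThisMinute: 0 },
  { key: 'KEY_2', usedThisMinute: 0 },
  { key: 'KEY_3', usedThisMinute: 0 },
];

function pickKey(rpmPerKey) {
  // Least-loaded key wins; ties resolve to the earliest entry.
  const candidate = pool.reduce((a, b) =>
    a.usedThisMinute <= b.usedThisMinute ? a : b);
  if (candidate.usedThisMinute >= rpmPerKey) return null; // all keys saturated
  candidate.usedThisMinute++;
  return candidate.key;
}
```

Returning null when every key is saturated lets the caller queue the request instead of burning a call that is guaranteed to 429.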
4. Model Fallback Chain
When your primary model is rate-limited, fall back to a cheaper alternative:
const fallbackChain = [
  { model: 'claude-sonnet-4.6', rpm: 500 },
  { model: 'gpt-5-mini', rpm: 1000 },
  { model: 'gemini-2.0-flash', rpm: 4000 },
];

async function callWithFallback(prompt) {
  for (const { model } of fallbackChain) {
    try {
      return await callModel(model, prompt);
    } catch (e) {
      if (e.status === 429) continue;
      throw e;
    }
  }
  throw new Error('All models rate-limited');
}
Pro Tip: Combine Strategies
The most robust approach combines all four: a request queue keeps you under the limit, exponential backoff handles spikes, multi-key rotation multiplies capacity, and a fallback chain ensures availability. Start with a queue + backoff, add multi-key if you need more throughput.
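A compact end-to-end sketch of queue + backoff composed together, with a stubbed API call that fails once with a 429. The throttle and retry helpers are simplified versions of the patterns above, not a production implementation:

```javascript
// Throttle: space calls at least 60000/rpm ms apart.
function withThrottle(fn, rpm) {
  const interval = 60000 / rpm;
  let next = 0;
  return async (...args) => {
    const wait = Math.max(0, next - Date.now());
    next = Date.now() + wait + interval;
    if (wait > 0) await new Promise(r => setTimeout(r, wait));
    return fn(...args);
  };
}

// Backoff: retry 429s with exponentially growing delays.
async function withRetry(fn, maxRetries = 3) {
  for (let i = 0; ; i++) {
    try { return await fn(); }
    catch (e) {
      if (e.status !== 429 || i >= maxRetries - 1) throw e;
      await new Promise(r => setTimeout(r, 2 ** i * 1000));
    }
  }
}

// Stubbed API call: one 429, then success.
let calls = 0;
async function flakyCall() {
  calls++;
  if (calls === 1) { const e = new Error('rate limited'); e.status = 429; throw e; }
  return 'ok';
}

const throttled = withThrottle(() => withRetry(flakyCall), 120);
throttled().then(r => console.log(r)); // logs "ok" after one 1s retry
```

The layering order matters: throttle on the outside keeps your steady-state rate legal, while retry on the inside absorbs the occasional 429 the throttle can't prevent (shared limits, provider-side variance).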
Rate Limits vs Cost: The Tradeoff
You might expect higher rate limits to come bundled with higher prices — but the data says otherwise. Here's how rate limits correlate with pricing for each provider's budget model (per-1K-request figures are estimates for a typical short chat request):
| Model | Input ($/1M) | Output ($/1M) | Base RPM | Cost per 1K Requests |
|---|---|---|---|---|
| Gemini 2.0 Flash Lite | $0.075 | $0.30 | 6,000 | $0.08 |
| DeepSeek V4 Flash | $0.14 | $0.28 | 60 | $0.10 |
| GPT-oss 20B | $0.08 | $0.35 | 500 | $0.11 |
| GPT-4o mini | $0.15 | $0.60 | 1,000 | $0.18 |
| Mistral Small 4 | $0.15 | $0.60 | 60 | $0.18 |
| DeepSeek V4 Pro | $0.44 | $0.87 | 60 | $0.26 |
| GPT-5 Mini | $0.25 | $2.00 | 1,000 | $0.58 |
| Claude Haiku 4.5 | $1.00 | $5.00 | 1,000 | $1.50 |
Gemini 2.0 Flash Lite is the clear winner on both cost AND rate limits. At $0.075/$0.30 per million tokens and 6,000 RPM, it's the cheapest option with the highest throughput ceiling. For high-volume, cost-sensitive workloads, Flash Lite is hard to beat.
Choosing a Provider by Throughput Needs
| Your Throughput Need | Best Provider | Why |
|---|---|---|
| Prototype / Side project | Google (Free tier) | 30 RPM free, no credit card required |
| Chatbot (100-500 users) | Google or OpenAI | 4,000-6,000 RPM handles concurrent users |
| High-volume API (>1K RPM) | Google Gemini Flash | 6,000 RPM at $0.075/$0.30 |
| Enterprise (10K+ RPM) | OpenAI (Tier 5) | 20,000 RPM for GPT-5 Mini with spend |
| Batch processing | Any provider | Batch APIs have separate, higher limits |
| Budget + moderate throughput | DeepSeek or Mistral | 60 RPM covers a few dozen concurrent users |
| Quality + moderate throughput | Anthropic | 500-1,000 RPM for Sonnet/Haiku at good prices |
Practical Throughput Calculations
How many concurrent users can each provider handle? Assume 2 requests per user per minute (typical for a chatbot):
Concurrent Users at Base Tier
At 2 requests per user per minute, DeepSeek can only handle 30 concurrent users. That's fine for internal tools and small chatbots, but not for production apps with real traffic. Google Flash Lite handles 100x more users at comparable or lower per-token prices.
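The arithmetic behind those numbers, as a one-liner you can adapt to your own traffic profile (the RPM figures in the example calls come from the comparison tables above):

```javascript
// Concurrent users a rate limit supports, given how many
// requests each user makes per minute.
const maxConcurrentUsers = (rpm, reqPerUserPerMin = 2) =>
  Math.floor(rpm / reqPerUserPerMin);

console.log(maxConcurrentUsers(60));   // DeepSeek: 30 users
console.log(maxConcurrentUsers(6000)); // Gemini Flash Lite: 3000 users
```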
The Bottom Line
For most developers, rate limits are not the bottleneck — cost is. But if you're building a high-traffic chatbot or real-time API, rate limits become critical. Google Gemini Flash Lite offers the best combination of high RPM (6,000), high TPM (8M), and low cost ($0.075/$0.30). OpenAI scales well with spend. DeepSeek and Mistral are limited to 60 RPM — fine for low-traffic apps, but you'll hit the ceiling fast.
Strategy: Start with the cheapest model that meets your quality needs. If you hit rate limits, add request queuing and exponential backoff before switching providers. If you still need more throughput, use multi-key rotation or upgrade to a provider with higher limits. Use the APIpulse calculator to model costs at your target throughput.