AI API Rate Limits 2026: OpenAI, Anthropic, Gemini RPM, TPM & 429 Fixes

AI API rate limits decide whether your app can scale. The price can look perfect, but if your provider throttles you at a low RPM or TPM ceiling, production traffic quickly turns into 429 errors, retries, and queue delays.

This guide compares current 2026 rate-limit behavior across OpenAI, Anthropic Claude, Google Gemini, DeepSeek, xAI Grok, and Mistral. It focuses on the numbers developers search for most: requests per minute, tokens per minute, usage tiers, free limits, and practical ways to avoid 429 errors.

Quick Answer: Which API Has The Best Rate Limits?

Provider	Public limit model	Practical takeaway
OpenAI	Usage tiers by spend and account age	Strongest published high-tier throughput for production teams
Anthropic Claude	RPM plus separate input/output TPM	Great for Claude workloads, but Tier 1 is tight and ITPM/OTPM must be planned separately
Google Gemini	AI Studio / quota-tier dependent	Often generous, but live project quota should be treated as the source of truth
DeepSeek	Dynamic concurrency, no fixed public RPM/TPM table	Very cheap, but production apps need queues, timeouts, and fallback routing
xAI Grok	Free credits plus scaling paid limits	Useful for experimentation and X-related workflows
Mistral	Moderate published paid RPM	Not the highest throughput, but useful for EU/compliance-sensitive workloads

If you are capacity planning, use the API Throughput Planner alongside this guide. If you are optimizing cost at the same time, use the AI Model Pricing Calculator.

Key Terms

Before diving into the numbers, here are the three metrics every provider uses:

RPM (Requests Per Minute) — The maximum number of API calls you can make in a 60-second window.
TPM (Tokens Per Minute) — The maximum number of tokens (input + output combined) the API will process for you in a 60-second window.
RPD (Requests Per Day) — Some providers also cap total daily requests, especially on free tiers.

In practice, TPM is the limit that matters most for production applications.

Rate Limits by Provider

OpenAI — Tier-Based System (Free through Tier 5)

OpenAI’s rate limits scale with your cumulative platform spend and account age. Tiers upgrade automatically, but exact limits vary by model.

GPT-5.4 and GPT-5 Rate Limits:

Model	Tier	Qualification	RPM	TPM
GPT-5.4	Free	Not supported	—	—
GPT-5.4	Tier 1	$5 paid	500	500,000
GPT-5.4	Tier 2	$50 paid + 7 days	5,000	1,000,000
GPT-5.4	Tier 3	$100 paid + 7 days	5,000	2,000,000
GPT-5.4	Tier 4	$250 paid + 14 days	10,000	4,000,000
GPT-5.4	Tier 5	$1,000 paid + 30 days	15,000	40,000,000
GPT-5	Free	Not supported	—	—
GPT-5	Tier 1	$5 paid	500	500,000
GPT-5	Tier 2	$50 paid + 7 days	5,000	1,000,000
GPT-5	Tier 3	$100 paid + 7 days	5,000	2,000,000
GPT-5	Tier 4	$250 paid + 14 days	10,000	4,000,000
GPT-5	Tier 5	$1,000 paid + 30 days	15,000	40,000,000

Selected reasoning model examples (Tier 3):

Model	RPM	TPM
o3	5,000	800,000
o3-mini	5,000	4,000,000

Anthropic (Claude) — Four-Tier System

Anthropic uses a spend-based tier system. Its Messages API limits are measured separately as RPM, input tokens per minute (ITPM), and output tokens per minute (OTPM), so a single combined TPM number can be misleading.

Claude Opus 4.6 & Sonnet 4.6 Rate Limits:

Tier	RPM	ITPM	OTPM
Tier 1	50	30,000	8,000
Tier 2	1,000	450,000	90,000
Tier 3	2,000	800,000	160,000
Tier 4	4,000	2,000,000	400,000

Anthropic publishes Opus 4.x and Sonnet 4.x as shared family pools rather than separate limits for each model version. Cached reads generally do not count against ITPM for current Claude models, which can make effective throughput higher for cache-heavy workloads.

Google Gemini — Tier-Dependent Throughput

Google structures its rate limits based on “Usage Tiers.” Your actual limits depend on whether you are using the Free of charge tier or the Pay-as-you-go tier.

Google’s public Gemini rate-limit page no longer exposes a complete stable RPM/TPM table in the docs page itself. It says active limits depend on quota tier and should be checked in AI Studio, and that listed limits are not guaranteed. The table below keeps only conservative, planner-facing 2.5-series baselines where the site already uses published presets.

Gemini 2.5 Series (published baseline presets):

Plan	RPM	TPM	RPD
Tier 1 (2.5 Pro)	150	2,000,000 input TPM	10,000
Tier 1 (2.5 Flash)	1,000	1,000,000 input TPM	10,000
Tier 1 (2.5 Flash-Lite)	4,000	4,000,000 input TPM	Check AI Studio

xAI (Grok) — Free Credits + Scaling Tiers

xAI provides $25 in free signup credits and structures rate limits that scale with usage.

Grok 3 & 4 Rate Limits:

Model	Free Tier RPM	Free Tier TPM	Paid RPM	Paid TPM
Grok 4	60	100,000	Up to 2,000	Up to 1,000,000
Grok 3 Mini	100	200,000	Up to 4,000	Up to 2,000,000

DeepSeek — Dynamic RPM/TPM, Published Concurrency Caps

DeepSeek does not publish fixed RPM/TPM tables for V4. Its official docs say concurrency can be affected by server load and short-term usage history, and the pricing page publishes current concurrency caps of 2,500 for V4 Flash and 500 for V4 Pro. When the platform is busy, requests may wait on an open HTTP connection, return keep-alive lines, or receive a 429 if the dynamic limit is reached.

DeepSeek V4 behavior:

Model	Public RPM/TPM	Published concurrency	Practical note
DeepSeek V4 Flash (`deepseek-v4-flash`)	Dynamic	2,500	Best default for low-cost agent traffic
DeepSeek V4 Pro (`deepseek-v4-pro`)	Dynamic	500	Stronger model at official 1/4-of-original pricing after May 31, 2026
`deepseek-chat` / `deepseek-reasoner`	Compatibility aliases	-	Scheduled for deprecation on July 24, 2026

Key observations about DeepSeek:

Extremely low effective cost. V4 Flash is $0.14/M cache-miss input, $0.0028/M cached input, and $0.28/M output, so cache-heavy agent workloads can cost far less than fixed-price tables suggest.
Concurrency is the real bottleneck. DeepSeek can still return 429 or delay scheduling under high load, so production systems should keep timeouts, queues, and fallback providers.
No paid tier ladder. There is no public “spend $X to unlock Y TPM” path, so plan around dynamic capacity rather than fixed guarantees.

Mistral — Free Tier Available

Mistral offers a free tier for experimentation and paid plans with competitive limits.

Mistral Rate Limits:

Model	Free Tier RPM	Paid RPM
Mistral Large 3	Lower (varies)	300
Mistral Medium 3	Lower (varies)	300
Mistral Small 3.1	Lower (varies)	300

Key observations about Mistral:

Free tier is available without a credit card, similar to Google. Useful for evaluation and prototyping.
300 RPM across all paid models is moderate — higher than Anthropic Tier 1 (50), but well below OpenAI Tier 2 (5,000) and Gemini (2,000-4,000). DeepSeek should be evaluated separately because its public docs describe dynamic concurrency rather than a fixed RPM table.
European data residency is Mistral’s unique advantage. Rate limits are not their differentiator — compliance is.

The Master Comparison Table

Here is every provider side by side at the paid tier that most startups and production applications use.

Provider	Model	Tier/Plan	RPM	TPM	Min Spend
OpenAI	GPT-5.4 / GPT-5	Tier 5	15,000	40,000,000	$1,000 + 30 days
OpenAI	GPT-5.4 / GPT-5	Tier 3	5,000	2,000,000	$100 + 7 days
Google	Gemini 2.5 Flash-Lite	Tier 1 baseline	4,000	4,000,000 input TPM	Billing enabled
Google	Gemini 2.5 Pro	Tier 1 baseline	150	2,000,000 input TPM	Billing enabled
DeepSeek	V4 Flash	Dynamic	Dynamic	Dynamic	$0
Mistral	Large 3	Paid	500	2,000,000	$0
xAI	Grok 4	Paid (high)	2,000	1,000,000	—
Anthropic	Sonnet 4.6	Tier 4	4,000	2,000,000 ITPM / 400,000 OTPM	Standard Tier 4

Ranking by TPM (tokens per minute):

OpenAI Tier 5 — 40,000,000 TPM on GPT-5.4 / GPT-5 (requires $1,000+ spend and 30+ days)
Google Gemini 2.5 Flash-Lite Tier 1 baseline — 4,000,000 input TPM, with active limits shown in AI Studio
OpenAI Tier 3 — 2,000,000 TPM on GPT-5.4 / GPT-5 (requires $100+ spend and 7+ days)
Anthropic Tier 4 — 2,000,000 ITPM and 400,000 OTPM for Sonnet 4.x / Opus 4.x
DeepSeek V4 — dynamic concurrency, very low token cost

The pattern is clear: OpenAI offers the highest published standard ceiling (Tier 5), while Google and DeepSeek require more live-account verification because Gemini limits are project-specific and DeepSeek uses dynamic scheduling. Anthropic remains output-token constrained, but its split ITPM/OTPM design and cache-aware accounting are more nuanced than a single TPM comparison suggests.

Comparison by Use Case

For Prototyping and Evaluation (Free Tier)

If you are just getting started, experimenting with models, or building a proof of concept, here is what each provider offers at zero cost:

Provider	Free RPM	Free TPM	Credit Card Required?	Notes
Gemini 2.5 Flash / Flash-Lite	Check AI Studio	Check AI Studio	No	Useful free evaluation tier, but active quota is project-specific
Grok 4	60	100,000	No	$25 free credit
DeepSeek V4	Dynamic	Dynamic	No	Very cheap, but capacity changes with server load
OpenAI GPT-5	—	—	Yes	API docs list GPT-5 as not supported on Free tier
Claude Sonnet 4.6	—	—	Yes	No free tier

Winner: Gemini or DeepSeek for low-cost evaluation, depending on whether you prefer published quota controls in AI Studio or DeepSeek’s dynamic low-cost capacity.

Worst for free-tier API throughput: OpenAI GPT-5, because the current model docs list it as not supported on Free tier.

For Startups (Medium Volume: 1K-10K Requests/Day)

At this scale, you are past prototyping and need reliable throughput for real users. The key question is which provider gives you enough headroom without requiring a large upfront spend.

Provider	Model	RPM	TPM	Monthly Min Spend
Gemini	2.5 Pro	150 baseline	2,000,000 input TPM baseline	Pay-as-you-go; verify in AI Studio
Gemini	2.5 Flash-Lite	4,000 baseline	4,000,000 input TPM baseline	Pay-as-you-go; verify in AI Studio
OpenAI	GPT-5 (Tier 2)	5,000	1,000,000	$50+ cumulative + 7 days
OpenAI	GPT-5 (Tier 3)	5,000	2,000,000	$100+ cumulative + 7 days
DeepSeek	V4 Flash	Dynamic	Dynamic	Pay-as-you-go
Anthropic	Sonnet 4.x (Tier 2)	1,000	450,000 ITPM / 90,000 OTPM	Standard Tier 2
xAI	Grok 3	Up to 1,200	Up to 600,000	Pay-as-you-go

Winner: OpenAI Tier 2-3 for the highest published RPM among these standard examples, or Gemini when your own AI Studio quota shows higher project-specific headroom.

Watch Anthropic output tokens. Tier 2 allows much more input than the old combined-TPM summary implied, but 90K output tokens per minute can still be the binding limiter for generation-heavy workloads.

For Enterprise (High Volume: 50K+ Requests/Day)

At enterprise scale, all providers offer custom rate limits through sales engagements. But here is what you get on standard plans:

Provider	Model	Best Standard RPM	Best Standard TPM
OpenAI	GPT-5 / GPT-5.4 (Tier 5)	15,000	40,000,000
Google	Gemini 2.5 Flash-Lite	4,000	4,000,000 input TPM baseline
Google	Gemini 2.5 Pro	150	2,000,000 input TPM baseline
Anthropic	Sonnet 4.x / Opus 4.x (Tier 4)	4,000	2,000,000 ITPM / 400,000 OTPM
DeepSeek	V4 Flash	Dynamic	Dynamic

Winner: OpenAI Tier 5 at 40M TPM is in a class of its own among published standard model limits. If you are at enterprise scale and need the highest standard published throughput, OpenAI is the clearest documented ceiling. Gemini may still be strong, but your exact project quota should be read from AI Studio.

Note on custom limits: At $5,000+/month spend, every major provider will negotiate custom rate limits. Contact sales teams directly for Anthropic, OpenAI, Google, and xAI if standard limits are insufficient.

How Rate Limits Affect Your Architecture

Rate limits are not just an API annoyance — they should influence your entire system design. Here are the architectural patterns that matter.

1. Queue Management and Backpressure

When your application receives more requests than your API rate limit can handle, you need a queue. The simplest approach is a token bucket algorithm that tracks your remaining RPM and TPM budget and delays requests when limits are close.

The critical mistake is not accounting for TPM limits separately from RPM limits. A system that only tracks RPM will work fine for short messages but fail spectacularly when a user submits a 50K-token document that consumes half your TPM budget in a single request.

2. Multi-Provider Failover

The most robust architecture uses multiple providers as fallbacks. When your primary provider returns a 429 (rate limit exceeded), route to a secondary:

Primary: OpenAI GPT-5 (best overall quality)
Failover 1: Gemini 2.5 Pro (same price, higher TPM)
Failover 2: DeepSeek V4 Flash (much cheaper, dynamic capacity)

This gives you effective throughput across independent provider pools instead of relying on one account. With OpenAI Tier 3 + Gemini + DeepSeek, you get two published TPM pools plus a very cheap dynamic DeepSeek overflow route.

3. Token Estimation Before Sending

Pre-counting tokens before sending a request lets you predict whether it will push you over your TPM limit. This avoids wasting an API call (and consuming RPM budget) on a request that will be rejected anyway.

Use our AI Token Counter to understand token counts for different models. For programmatic estimation, the tiktoken library (Python) or gpt-tokenizer (JavaScript) provides exact counts for OpenAI models, and approximate counts for others.

4. Separate Rate Limit Pools per Model

OpenAI and Anthropic both allocate rate limits per model, not per account. This means using GPT-5 and GPT-5 Nano simultaneously gives you two separate pools. Architect your system to spread load across models:

Route simple tasks to budget models (GPT-5 Nano, Gemini Flash, Haiku)
Route complex tasks to flagship models (GPT-5, Claude Sonnet, Gemini Pro)

Each model has its own RPM and TPM allocation, effectively multiplying your total throughput.

Tips to Maximize Throughput

1. Use Batch API for Non-Real-Time Workloads

Both OpenAI and Anthropic offer Batch APIs that process requests asynchronously (typically within 24 hours). Batch requests are exempt from standard rate limits and come with a 50% price discount. If any part of your workload — content generation, data extraction, evaluation, nightly processing — does not need real-time responses, move it to the Batch API immediately.

This is the single highest-impact optimization for throughput-constrained applications.

2. Implement Exponential Backoff with Jitter

When you hit a rate limit (HTTP 429), do not retry immediately. Use exponential backoff with random jitter to spread retry attempts:

import time
import random
from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-5",
                messages=messages
            )
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s
            base_wait = 2 ** attempt
            # Add jitter: random 0-50% extra
            jitter = base_wait * random.uniform(0, 0.5)
            wait = base_wait + jitter
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
    raise Exception("Max retries exceeded")

The jitter is important because without it, all your retry attempts (and those of other clients) happen at exactly the same time, causing another burst of 429s. Jitter spreads the retries across the backoff window.

3. Pre-Count Tokens to Avoid Wasted Requests

Every rejected request (429 error) wastes your RPM budget. By estimating token counts before sending, you can hold requests in a local queue until you have enough TPM headroom:

import tiktoken

def estimate_tokens(messages, model="gpt-5"):
    """Estimate total tokens for a request."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += len(enc.encode(msg["content"])) + 4  # message overhead
    total += 2  # reply priming
    return total

# Before sending, check if we have budget
estimated = estimate_tokens(messages)
if estimated > remaining_tpm_budget:
    # Queue the request instead of sending immediately
    request_queue.append(messages)
else:
    remaining_tpm_budget -= estimated
    response = client.chat.completions.create(model="gpt-5", messages=messages)

4. Route Burst Traffic to High-Limit Providers

If your application experiences traffic spikes, route the excess to the provider with the highest available limits. In practice, this means:

Normal traffic: Use your preferred provider (e.g., OpenAI or Claude)
Burst traffic: Overflow to Gemini when your AI Studio quota shows available headroom, or to DeepSeek V4 Flash when your workload can tolerate dynamic capacity and lower guarantees

This pattern keeps your primary provider’s quality for most requests while preventing 429 errors during peaks.

5. Upgrade Tiers Strategically

For OpenAI and Anthropic, your tier is based on cumulative spend, not monthly spend. This means:

If you know you will need Tier 3+ limits, front-load your spending by purchasing credits early.
OpenAI: $100 cumulative spend plus 7+ days since first successful payment unlocks Tier 3 (2M TPM for GPT-5). That is a one-time threshold, not monthly.
Anthropic: Tier 3 raises Sonnet 4.x to 800K ITPM and 160K OTPM. Check the Console for the exact current spend and workspace limits.

Plan your tier progression based on your growth projections, and purchase credits slightly ahead of when you need the higher limits.

6. Use Streaming to Improve Perceived Throughput

Streaming responses does not change your actual rate limits, but it allows you to start displaying output to users before the full response is complete. This reduces perceived latency and makes rate-limit-induced delays less noticeable. All major providers support streaming via server-sent events (SSE).

Rate Limit Error Handling — Production Pattern

Here is a more complete production-ready pattern that handles rate limits across multiple providers with automatic failover:

import time
import random
from openai import OpenAI, RateLimitError

# Initialize clients for multiple providers
openai_client = OpenAI()
gemini_client = OpenAI(
    api_key="your-gemini-key",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)
deepseek_client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

PROVIDERS = [
    {"client": openai_client, "model": "gpt-5", "name": "OpenAI"},
    {"client": gemini_client, "model": "gemini-2.5-pro", "name": "Gemini"},
    {"client": deepseek_client, "model": "deepseek-v4-flash", "name": "DeepSeek"},
]

def call_with_failover(messages, max_retries=3):
    """Try each provider in order, with retries per provider."""
    for provider in PROVIDERS:
        for attempt in range(max_retries):
            try:
                response = provider["client"].chat.completions.create(
                    model=provider["model"],
                    messages=messages
                )
                return response, provider["name"]
            except RateLimitError:
                if attempt < max_retries - 1:
                    wait = (2 ** attempt) + random.uniform(0, 1)
                    time.sleep(wait)
                else:
                    print(f"{provider['name']} exhausted. Trying next provider.")
                    break
    raise Exception("All providers rate-limited. Consider queuing this request.")

This pattern ensures your application stays responsive even when individual providers are throttling you. The key is that rate limits are per-provider, so being rate-limited on OpenAI says nothing about your remaining capacity on Gemini or DeepSeek.

Provider Recommendation by Daily Volume

Daily Requests	Best Provider	Why
Under 1,000	Any provider	All handle this volume comfortably
1,000 - 5,000	OpenAI (Tier 2) or Gemini	5,000 RPM on OpenAI Tier 2; Gemini depends on AI Studio quota
5,000 - 20,000	Gemini 2.5 Flash-Lite or OpenAI Tier 3	High published Gemini baseline or 5,000 RPM on OpenAI
20,000 - 50,000	Gemini + DeepSeek failover	Higher combined capacity, but DeepSeek remains dynamically scheduled
50,000+	OpenAI Tier 5 or custom enterprise	40M TPM on GPT-5 / GPT-5.4, or negotiated limits

For token-heavy workloads (long documents, large context):

Daily Token Volume	Best Provider	Why
Under 10M tokens	Any provider	All handle this at paid tier
10M - 100M tokens	Gemini or DeepSeek	Gemini has published high TPM; DeepSeek is much cheaper but dynamically scheduled
100M - 500M tokens	OpenAI Tier 3+	2M TPM at Tier 3, scaling to 40M at Tier 5
500M+ tokens	OpenAI Tier 5 + Gemini	Use OpenAI’s 40M TPM standard ceiling plus verified Gemini quota, or contact sales for custom

The Hidden Cost of Low Rate Limits

Rate limits have a real financial impact beyond just throttled requests. When your application hits a 429, several things happen:

Wasted compute. Your server processed the user’s request, built the prompt, estimated tokens — all before discovering the API will not accept it.
User-facing latency. The retry delay (even with exponential backoff) adds seconds or minutes to response times. Users notice.
Queue depth explosion. If incoming requests exceed your API throughput, queues grow unboundedly. You need either a cap (reject requests) or a very large buffer.
Over-provisioning costs. To avoid hitting limits, many teams over-provision by buying higher tiers than their average usage requires — paying for headroom they rarely use.

This is why rate limits should be factored into your total cost analysis alongside per-token pricing. A provider that costs 20% more per token but offers 10x the throughput may actually be cheaper when you account for infrastructure complexity, queue management, and user experience impact.

Bottom Line

Rate limits in May 2026 vary dramatically across providers, and the differences are larger than most developers realize:

Google Gemini can offer strong throughput, but current Google docs direct you to AI Studio for active project limits and warn that listed limits are not guaranteed. Treat public presets as planning baselines, not contractual ceilings.
OpenAI has the highest published standard ceiling at 40M TPM (Tier 5), but you need $1,000+ in cumulative spend and 30+ days since first successful payment to unlock it. For most startups, Tier 2-3 at 1-2M TPM is more realistic and still competitive.
Anthropic (Claude) is output-token constrained compared with OpenAI’s highest standard tier, but its current docs split ITPM and OTPM and exclude cached reads from ITPM for most current models. If you choose Claude for quality, budget for output-token and acceleration limits.
DeepSeek offers a unique value proposition: extremely low V4 Flash pricing, automatic context caching, and no public paid tier ladder. Best as a high-volume low-cost route when you can tolerate dynamic scheduling and keep fallback providers.
xAI (Grok) sits in the middle with reasonable limits and free credits for getting started, though it cannot match Gemini or OpenAI on raw throughput.

For most production applications, the optimal strategy is: start with Gemini for free-tier development, add OpenAI for high-confidence tasks at Tier 2+, and overweight DeepSeek V4 Flash for low-cost repeated-context agent traffic. Use the Batch API for everything that does not need real-time responses.

Rate limits will continue to change as providers scale their infrastructure, but as of May 6, 2026, these are the official-doc baselines you should verify against your own provider dashboards before production launch.

Related tools and guides:

AI Token Counter — Pre-count tokens before sending API requests
AI Model Pricing Calculator — Compare costs across 40+ models
How to Cut AI API Costs by 80% — 8 optimization strategies including batch API and model routing
AI API Pricing Comparison 2026 — Full pricing table for all 7 major providers
OpenAI API Pricing Guide 2026 — GPT-5, GPT-5 Nano, o3 pricing and tier details
Claude API Pricing Guide 2026 — Opus, Sonnet, Haiku pricing and prompt caching
Google Gemini API Pricing Guide 2026 — Gemini 2.5 Pro/Flash, free tier, 1M context
Grok API Pricing Guide 2026 — Grok 3 pricing, $25 free credits
DeepSeek API Pricing Guide 2026 — V4 Flash cache-hit pricing and agent cost math
Mistral API Pricing Guide 2026 — EU-compliant, open-weight options