AI API Rate Limits 2026: OpenAI vs Claude vs Gemini vs Grok — Full Comparison
February 2026 comparison of AI API rate limits: Gemini tops out at 4M TPM with no tiers, OpenAI scales from 200K to 30M TPM by tier, Claude from 40K to 400K, and Grok up to 600K. Tier systems, free tiers, and how to maximize throughput.
Rate limits are the second most important factor — after price — when choosing an AI API provider. You can find the cheapest model on the market, but if its rate limits throttle you at 50 requests per minute, your production application grinds to a halt the moment real traffic arrives. Rate limits determine how many requests you can send per minute, how many tokens you can process per minute, and ultimately whether your application can scale.
The problem is that every provider structures rate limits differently. OpenAI uses a five-tier system based on cumulative spend. Anthropic uses a four-tier system. Google offers a free tier with surprisingly generous limits. xAI gives you $25 in free credits. DeepSeek keeps things flat and simple. Mistral bundles everything under a free-to-start plan.
This guide compares every major AI API provider’s rate limits side by side, breaks down which provider wins at each usage level, and shows you exactly how to architect your application to maximize throughput without hitting 429 errors.
Key Terms
Before diving into the numbers, here are the three metrics every provider uses:
- RPM (Requests Per Minute) — The maximum number of API calls you can make in a 60-second window. Each call counts as one request regardless of how many tokens it contains.
- TPM (Tokens Per Minute) — The maximum number of tokens (input + output combined) the API will process for you in a 60-second window. This is typically the binding constraint for applications with long prompts or large outputs.
- RPD (Requests Per Day) — Some providers also cap total daily requests, especially on free tiers. Less common on paid tiers.
In practice, TPM is the limit that matters most for production applications. A chatbot sending 500 short messages per minute will hit RPM limits first. A document processing pipeline sending 10 requests with 100K tokens each will hit TPM limits first. You need to understand both.
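As a back-of-the-envelope check, you can compute which limit a steady workload hits first. This is a sketch with illustrative numbers (the 500 RPM / 200K TPM figures happen to match a typical Tier 1 allocation):

```python
def binding_limit(requests_per_min: int, tokens_per_request: int,
                  rpm: int, tpm: int) -> str:
    """Return which rate limit a steady workload saturates first."""
    rpm_utilization = requests_per_min / rpm
    tpm_utilization = (requests_per_min * tokens_per_request) / tpm
    return "RPM" if rpm_utilization >= tpm_utilization else "TPM"

# Chatbot: 500 short messages/min at ~200 tokens each, on 500 RPM / 200K TPM
print(binding_limit(500, 200, rpm=500, tpm=200_000))     # RPM binds first
# Document pipeline: 10 requests/min at 100K tokens each, same limits
print(binding_limit(10, 100_000, rpm=500, tpm=200_000))  # TPM binds first
```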
Rate Limits by Provider
OpenAI — Tier-Based System (Free through Tier 5)
OpenAI’s rate limits scale with your cumulative platform spend. The more you have paid over your account’s lifetime, the higher your tier and limits. Tiers upgrade automatically.
GPT-5 Rate Limits:
| Tier | Spend Required | RPM | TPM | RPD |
|---|---|---|---|---|
| Free | $0 | 3 | 40,000 | 200 |
| Tier 1 | $5+ | 500 | 200,000 | — |
| Tier 2 | $50+ | 5,000 | 2,000,000 | — |
| Tier 3 | $100+ | 5,000 | 4,000,000 | — |
| Tier 4 | $250+ | 10,000 | 10,000,000 | — |
| Tier 5 | $1,000+ | 10,000 | 30,000,000 | — |
GPT-4.1 and GPT-4o Rate Limits (Tier 1):
| Model | RPM | TPM |
|---|---|---|
| GPT-4.1 | 500 | 200,000 |
| GPT-4o | 500 | 200,000 |
| GPT-5 Mini | 500 | 200,000 |
| GPT-4.1 Nano | 1,000 | 4,000,000 |
| o3 | 500 | 200,000 |
Key observations about OpenAI’s system:
- Budget models get higher limits. GPT-4.1 Nano at Tier 1 already has 4M TPM — 20x more than GPT-5 at the same tier. OpenAI clearly expects high-volume usage of their cheapest model and provisions accordingly.
- Tier 5 is extremely generous. At 30M TPM for GPT-5, you can process roughly 500K tokens per second. Very few applications need more than this.
- The free tier is almost unusable. At 3 RPM with a 200-request daily cap, you can barely run manual tests, let alone serve traffic. It exists for experimentation only.
- Rate limits are per-model. Using GPT-5 and o3 simultaneously gives you separate allocations for each. This is a meaningful advantage for applications that route across multiple models.
Anthropic (Claude) — Four-Tier System
Anthropic uses a spend-based tier system similar to OpenAI, but with four tiers instead of five. The limits are notably lower than OpenAI at equivalent spend levels.
Claude Sonnet 4.5 Rate Limits:
| Tier | Spend Required | RPM | TPM | RPD |
|---|---|---|---|---|
| Tier 1 | $5+ credit | 50 | 40,000 | — |
| Tier 2 | $40+ | 1,000 | 80,000 | — |
| Tier 3 | $200+ | 2,000 | 160,000 | — |
| Tier 4 | $400+ | 4,000 | 400,000 | — |
Claude Opus 4.5 Rate Limits:
| Tier | RPM | TPM |
|---|---|---|
| Tier 1 | 50 | 20,000 |
| Tier 2 | 1,000 | 40,000 |
| Tier 3 | 2,000 | 80,000 |
| Tier 4 | 4,000 | 200,000 |
Claude Haiku 4.5 Rate Limits:
| Tier | RPM | TPM |
|---|---|---|
| Tier 1 | 50 | 50,000 |
| Tier 2 | 1,000 | 100,000 |
| Tier 3 | 2,000 | 200,000 |
| Tier 4 | 4,000 | 400,000 |
Key observations about Anthropic’s system:
- Claude has no free tier. You must add at least $5 in credits to get any API access. This is a barrier for prototyping and evaluation compared to providers like Google and xAI.
- Tier 1 is very restrictive. 50 RPM and 40K TPM for Sonnet means you can realistically handle about 10-20 concurrent users before hitting limits. Production apps need Tier 2 at minimum.
- The gap with OpenAI is significant. At the $50+ spend level, OpenAI gives you 5,000 RPM and 2M TPM for GPT-5. Anthropic gives you 1,000 RPM and 80K TPM for Sonnet — 5x fewer requests and 25x fewer tokens per minute.
- Opus limits are even tighter. Opus 4.5 at Tier 2 gets only 40K TPM. If you are using the most expensive Claude model, you are also the most constrained on throughput.
Google Gemini — The Most Generous Free Tier
Google structures its rate limits differently from OpenAI and Anthropic. There is a free tier with genuinely useful limits, and a paid tier (pay-as-you-go) with high throughput.
Gemini 2.5 Pro Rate Limits:
| Plan | RPM | TPM | RPD |
|---|---|---|---|
| Free | 25 | 250,000 | 50 |
| Pay-as-you-go | 2,000 | 4,000,000 | — |
Gemini 2.5 Flash Rate Limits:
| Plan | RPM | TPM | RPD |
|---|---|---|---|
| Free | 500 | 1,000,000 | — |
| Pay-as-you-go | 4,000 | 4,000,000 | — |
Gemini 2.0 Flash (Legacy) Rate Limits:
| Plan | RPM | TPM | RPD |
|---|---|---|---|
| Free | 500 | 1,000,000 | — |
| Pay-as-you-go | 4,000 | 4,000,000 | — |
Key observations about Google’s system:
- The free tier is production-viable for small apps. Gemini 2.5 Flash at 500 RPM free is more than what Claude Sonnet gives you at Tier 1 (50 RPM) after paying $5. You can serve a small application with real users entirely for free.
- 4M TPM is the highest default limit of any provider. At the paid tier, both Pro and Flash deliver 4 million tokens per minute. OpenAI only reaches this level at Tier 3 ($100+ spend) for GPT-5, and Claude never reaches it on any standard tier.
- No tier system on paid plans. Once you switch to pay-as-you-go, you immediately get full rate limits. No need to spend $50, $100, or $1,000 to unlock higher tiers.
- Free tier Pro has a daily cap. 50 RPD for Gemini 2.5 Pro on the free tier is limiting, but the Flash free tier has no daily cap at all.
xAI (Grok) — Free Credits + Scaling Tiers
xAI provides $25 in free signup credits, and its rate limits scale with usage.
Grok Rate Limits:
| Model | Free Tier RPM | Free Tier TPM | Paid RPM | Paid TPM |
|---|---|---|---|---|
| Grok 3 | 60 | 100,000 | Up to 1,200 | Up to 600,000 |
| Grok 3 Mini | 60 | 100,000 | Up to 1,200 | Up to 600,000 |
Key observations about xAI’s system:
- Free credits with reasonable limits. $25 of free credit at 60 RPM and 100K TPM is more useful than OpenAI’s free tier (3 RPM) and competitive with Anthropic’s Tier 1 (50 RPM). Enough to build and test a prototype.
- Paid limits scale gradually. Unlike Google’s flat structure, Grok’s limits increase as your usage grows, similar to OpenAI’s tier system but less formally defined.
- 600K TPM at higher tiers is solid — higher than Claude’s maximum standard tier (400K for Sonnet) but well below Gemini’s 4M.
DeepSeek — Flat Rate, No Tiers
DeepSeek takes the simplest approach of any major provider: flat rate limits with no tier system.
DeepSeek Rate Limits:
| Model | RPM | TPM |
|---|---|---|
| DeepSeek V3.2 (deepseek-chat) | 60 | 1,000,000 |
| DeepSeek R1 (deepseek-reasoner) | 60 | 1,000,000 |
Key observations about DeepSeek:
- Extremely generous TPM for the price. 1 million tokens per minute at $0.27/$1.10 per million tokens means you get massive throughput at rock-bottom pricing. No other provider offers this combination.
- RPM is the bottleneck. 60 RPM is low for high-concurrency applications. If you are making many short requests, you will hit the RPM limit long before TPM. This makes DeepSeek better suited for long-context, batch-style workloads than for chatbots serving many concurrent users.
- No tiers, no spend thresholds. You get the same limits from day one. No need to build up spend to unlock higher tiers. Simple and predictable.
Mistral — Free Tier Available
Mistral offers a free tier for experimentation and paid plans with competitive limits.
Mistral Rate Limits:
| Model | Free Tier RPM | Paid RPM |
|---|---|---|
| Mistral Large 3 | Lower (varies) | 300 |
| Mistral Medium 3 | Lower (varies) | 300 |
| Mistral Small 3.1 | Lower (varies) | 300 |
Key observations about Mistral:
- Free tier is available without a credit card, similar to Google. Useful for evaluation and prototyping.
- 300 RPM across all paid models is moderate — higher than DeepSeek (60) and Anthropic Tier 1 (50), but well below OpenAI Tier 2 (5,000) and Gemini (2,000-4,000).
- European data residency is Mistral’s unique advantage. Rate limits are not their differentiator — compliance is.
The Master Comparison Table
Here is every provider side by side at the paid tier that most startups and production applications use.
| Provider | Model | Tier/Plan | RPM | TPM | Min Spend |
|---|---|---|---|---|---|
| Google | Gemini 2.5 Flash | Pay-as-you-go | 4,000 | 4,000,000 | $0 |
| Google | Gemini 2.5 Pro | Pay-as-you-go | 2,000 | 4,000,000 | $0 |
| OpenAI | GPT-5 | Tier 3 | 5,000 | 4,000,000 | $100 |
| OpenAI | GPT-5 | Tier 5 | 10,000 | 30,000,000 | $1,000 |
| DeepSeek | V3.2 | Flat | 60 | 1,000,000 | $0 |
| xAI | Grok 3 | Paid (high) | 1,200 | 600,000 | — |
| Anthropic | Sonnet 4.5 | Tier 4 | 4,000 | 400,000 | $400 |
| Anthropic | Haiku 4.5 | Tier 4 | 4,000 | 400,000 | $400 |
| Mistral | Large 3 | Paid | 300 | — | $0 |
Ranking by TPM (tokens per minute):
- OpenAI Tier 5 — 30,000,000 TPM (requires $1,000+ spend)
- Google Gemini — 4,000,000 TPM (immediate, no spend threshold)
- OpenAI Tier 3 — 4,000,000 TPM (requires $100+ spend)
- DeepSeek — 1,000,000 TPM (flat, no tiers)
- xAI Grok — 600,000 TPM (higher paid tiers)
- Anthropic Tier 4 — 400,000 TPM (requires $400+ spend)
The pattern is clear: Google offers the highest immediately-available throughput, OpenAI catches up and eventually surpasses everyone at Tier 5, and Anthropic consistently trails on raw throughput at every spend level.
Comparison by Use Case
For Prototyping and Evaluation (Free Tier)
If you are just getting started, experimenting with models, or building a proof of concept, here is what each provider offers at zero cost:
| Provider / Model | Free RPM | Free TPM | Credit Card Required? | Notes |
|---|---|---|---|---|
| Gemini 2.5 Flash | 500 | 1,000,000 | No | Best free tier overall |
| Gemini 2.5 Pro | 25 | 250,000 | No | 50 RPD cap |
| Grok 3 | 60 | 100,000 | No | $25 free credit |
| DeepSeek V3.2 | 60 | 1,000,000 | No | Full limits from start |
| Mistral Small 3.1 | Varies | Varies | No | Rate-limited free access |
| OpenAI GPT-5 | 3 | 40,000 | No | Very limited |
| Claude Sonnet 4.5 | — | — | Yes | No free tier |
Winner: Gemini 2.5 Flash at 500 RPM with no credit card. You can build and serve a small production application entirely for free. DeepSeek is the runner-up with its massive 1M TPM, though the 60 RPM cap limits concurrency.
Worst: Anthropic has no free tier at all, and OpenAI’s 3 RPM free tier is borderline unusable for anything beyond single manual tests.
For Startups (Medium Volume: 1K-10K Requests/Day)
At this scale, you are past prototyping and need reliable throughput for real users. The key question is which provider gives you enough headroom without requiring a large upfront spend.
| Provider | Model | RPM | TPM | Min Spend |
|---|---|---|---|---|
| Gemini | 2.5 Pro | 2,000 | 4,000,000 | Pay-as-you-go |
| Gemini | 2.5 Flash | 4,000 | 4,000,000 | Pay-as-you-go |
| OpenAI | GPT-5 (Tier 2) | 5,000 | 2,000,000 | $50+ cumulative |
| OpenAI | GPT-5 (Tier 3) | 5,000 | 4,000,000 | $100+ cumulative |
| DeepSeek | V3.2 | 60 | 1,000,000 | Pay-as-you-go |
| Anthropic | Sonnet 4.5 (Tier 2) | 1,000 | 80,000 | $40+ cumulative |
| xAI | Grok 3 | Up to 1,200 | Up to 600,000 | Pay-as-you-go |
Winner: Gemini for raw throughput (4M TPM immediately), or OpenAI Tier 2-3 for the highest RPM (5,000) if your workload consists of many short requests.
Worst: Anthropic Tier 2 at 80K TPM. If your startup processes documents averaging 10K tokens each, you can only handle about 8 documents per minute. That is not enough for most production use cases, and you need to spend $200+ just to reach Tier 3’s 160K TPM.
For Enterprise (High Volume: 50K+ Requests/Day)
At enterprise scale, all providers offer custom rate limits through sales engagements. But here is what you get on standard plans:
| Provider | Model | Best Standard RPM | Best Standard TPM |
|---|---|---|---|
| OpenAI | GPT-5 (Tier 5) | 10,000 | 30,000,000 |
| Google | Gemini 2.5 Pro | 2,000 | 4,000,000 |
| Google | Gemini 2.5 Flash | 4,000 | 4,000,000 |
| Anthropic | Sonnet 4.5 (Tier 4) | 4,000 | 400,000 |
| DeepSeek | V3.2 | 60 | 1,000,000 |
Winner: OpenAI Tier 5 at 30M TPM is in a class of its own. If you are at enterprise scale and need the absolute highest throughput on standard plans, OpenAI is the clear choice. Gemini’s 4M TPM is the next best, available immediately without tier requirements.
Note on custom limits: At $5,000+/month spend, every major provider will negotiate custom rate limits. Contact sales teams directly for Anthropic, OpenAI, Google, and xAI if standard limits are insufficient.
How Rate Limits Affect Your Architecture
Rate limits are not just an API annoyance — they should influence your entire system design. Here are the architectural patterns that matter.
1. Queue Management and Backpressure
When your application receives more requests than your API rate limit can handle, you need a queue. The simplest approach is a token bucket algorithm that tracks your remaining RPM and TPM budget and delays requests when limits are close.
The critical mistake is not accounting for TPM limits separately from RPM limits. A system that only tracks RPM will work fine for short messages but fail spectacularly when a user submits a 50K-token document that consumes half your TPM budget in a single request.
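One way to implement this is a pair of budgets that refill continuously and must both have headroom before a request is released. A minimal sketch (not production code; the injectable clock is there for testability):

```python
import time

class DualBucket:
    """Token-bucket limiter tracking both RPM and TPM budgets.

    A request is admitted only when there is headroom in both buckets.
    """

    def __init__(self, rpm: int, tpm: int, clock=time.monotonic):
        self.rpm, self.tpm = rpm, tpm
        self.req_budget, self.tok_budget = float(rpm), float(tpm)
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Budgets refill continuously at rpm/60 and tpm/60 per second,
        # capped at the per-minute limit.
        elapsed = self.clock() - self.last
        self.last = self.clock()
        self.req_budget = min(self.rpm, self.req_budget + elapsed * self.rpm / 60)
        self.tok_budget = min(self.tpm, self.tok_budget + elapsed * self.tpm / 60)

    def try_acquire(self, estimated_tokens: int) -> bool:
        """Reserve budget for one request, or return False to signal 'queue it'."""
        self._refill()
        if self.req_budget >= 1 and self.tok_budget >= estimated_tokens:
            self.req_budget -= 1
            self.tok_budget -= estimated_tokens
            return True
        return False
```

Note how a single 50K-token request can drain the token bucket while barely touching the request bucket; a limiter that tracks only RPM would miss that entirely.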
2. Multi-Provider Failover
The most robust architecture uses multiple providers as fallbacks. When your primary provider returns a 429 (rate limit exceeded), route to a secondary:
- Primary: OpenAI GPT-5 (best overall quality)
- Failover 1: Gemini 2.5 Pro (same price, higher TPM)
- Failover 2: DeepSeek V3.2 (much cheaper, 1M TPM)
This gives you effective throughput that is the sum of all providers’ limits, not just one. With OpenAI Tier 3 + Gemini + DeepSeek, you get 4M + 4M + 1M = 9M TPM combined.
3. Token Estimation Before Sending
Pre-counting tokens before sending a request lets you predict whether it will push you over your TPM limit. This avoids wasting an API call (and consuming RPM budget) on a request that will be rejected anyway.
Use our AI Token Counter to understand token counts for different models. For programmatic estimation, the tiktoken library (Python) or gpt-tokenizer (JavaScript) provides exact counts for OpenAI models, and approximate counts for others.
4. Separate Rate Limit Pools per Model
OpenAI and Anthropic both allocate rate limits per model, not per account. This means using GPT-5 and GPT-4.1 Nano simultaneously gives you two separate pools. Architect your system to spread load across models:
- Route simple tasks to budget models (GPT-4.1 Nano, Gemini Flash, Haiku)
- Route complex tasks to flagship models (GPT-5, Claude Sonnet, Gemini Pro)
Each model has its own RPM and TPM allocation, effectively multiplying your total throughput.
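A hypothetical router illustrating the idea (the TPM figures come from the Tier 1 tables above; the routing rule and pool bookkeeping are assumptions, not a provider API):

```python
# Per-model TPM budgets, tracked separately because limits are per-model
POOLS = {
    "gpt-4.1-nano": {"tpm_left": 4_000_000},  # budget model
    "gpt-5":        {"tpm_left": 200_000},    # flagship model
}

def pick_model(task_tokens: int, complex_task: bool) -> str:
    """Route complex work to the flagship pool and simple work to the
    budget pool, falling back to whichever pool still has headroom."""
    preferred = "gpt-5" if complex_task else "gpt-4.1-nano"
    fallback = "gpt-4.1-nano" if complex_task else "gpt-5"
    for model in (preferred, fallback):
        if POOLS[model]["tpm_left"] >= task_tokens:
            POOLS[model]["tpm_left"] -= task_tokens
            return model
    raise RuntimeError("All model pools exhausted; queue the request")
```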
Tips to Maximize Throughput
1. Use Batch API for Non-Real-Time Workloads
Both OpenAI and Anthropic offer Batch APIs that process requests asynchronously (typically within 24 hours). Batch requests are exempt from standard rate limits and come with a 50% price discount. If any part of your workload — content generation, data extraction, evaluation, nightly processing — does not need real-time responses, move it to the Batch API immediately.
This is the single highest-impact optimization for throughput-constrained applications.
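For OpenAI's Batch API, each request is one JSON line in an uploaded file. A sketch of building that file locally (the upload and batch-creation calls are shown as comments only; verify the exact parameters against the official Batch API docs):

```python
import json

def build_batch_line(custom_id: str, prompt: str, model: str = "gpt-5") -> str:
    """Serialize one request in OpenAI's batch JSONL format."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

lines = [build_batch_line(f"job-{i}", p)
         for i, p in enumerate(["summarize doc A", "summarize doc B"])]
batch_file_content = "\n".join(lines)

# Then, roughly (assumed flow; check the current Batch API reference):
#   batch_file = client.files.create(file=..., purpose="batch")
#   client.batches.create(input_file_id=batch_file.id,
#                         endpoint="/v1/chat/completions",
#                         completion_window="24h")
```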
2. Implement Exponential Backoff with Jitter
When you hit a rate limit (HTTP 429), do not retry immediately. Use exponential backoff with random jitter to spread retry attempts:
```python
import time
import random

from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-5",
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s
            base_wait = 2 ** attempt
            # Add jitter: random 0-50% extra
            jitter = base_wait * random.uniform(0, 0.5)
            wait = base_wait + jitter
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
    raise Exception("Max retries exceeded")
```
The jitter is important because without it, all your retry attempts (and those of other clients) happen at exactly the same time, causing another burst of 429s. Jitter spreads the retries across the backoff window.
3. Pre-Count Tokens to Avoid Wasted Requests
Every rejected request (429 error) wastes your RPM budget. By estimating token counts before sending, you can hold requests in a local queue until you have enough TPM headroom:
```python
import tiktoken

def estimate_tokens(messages, model="gpt-5"):
    """Estimate total tokens for a request."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Newer model names may not be in tiktoken's registry yet
        enc = tiktoken.get_encoding("o200k_base")
    total = 0
    for msg in messages:
        total += len(enc.encode(msg["content"])) + 4  # per-message overhead
    total += 2  # reply priming
    return total

# Before sending, check whether we have TPM budget left.
# (client, messages, remaining_tpm_budget, and request_queue are defined elsewhere.)
estimated = estimate_tokens(messages)
if estimated > remaining_tpm_budget:
    # Queue the request instead of sending immediately
    request_queue.append(messages)
else:
    remaining_tpm_budget -= estimated
    response = client.chat.completions.create(model="gpt-5", messages=messages)
```
4. Route Burst Traffic to High-Limit Providers
If your application experiences traffic spikes, route the excess to the provider with the highest available limits. In practice, this means:
- Normal traffic: Use your preferred provider (e.g., OpenAI or Claude)
- Burst traffic: Overflow to Gemini (4M TPM, no tier requirements) or DeepSeek (1M TPM, flat limits)
This pattern keeps your primary provider’s quality for most requests while preventing 429 errors during peaks.
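A minimal sketch of the overflow rule (the headroom threshold and provider names are illustrative assumptions):

```python
def choose_provider(current_rpm_usage: int, primary_rpm_limit: int,
                    headroom: float = 0.8) -> str:
    """Send to the primary provider until it nears its RPM limit,
    then overflow burst traffic to a high-limit secondary."""
    if current_rpm_usage < primary_rpm_limit * headroom:
        return "primary"   # e.g. OpenAI or Claude
    return "overflow"      # e.g. Gemini (4M TPM) or DeepSeek (1M TPM)
```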
5. Upgrade Tiers Strategically
For OpenAI and Anthropic, your tier is based on cumulative spend, not monthly spend. This means:
- If you know you will need Tier 3+ limits, front-load your spending by purchasing credits early.
- OpenAI: $100 cumulative spend unlocks Tier 3 (4M TPM for GPT-5). That is a one-time threshold, not monthly.
- Anthropic: $200 cumulative spend unlocks Tier 3 (160K TPM for Sonnet). Again, one-time.
Plan your tier progression based on your growth projections, and purchase credits slightly ahead of when you need the higher limits.
6. Use Streaming to Improve Perceived Throughput
Streaming responses does not change your actual rate limits, but it allows you to start displaying output to users before the full response is complete. This reduces perceived latency and makes rate-limit-induced delays less noticeable. All major providers support streaming via server-sent events (SSE).
Rate Limit Error Handling — Production Pattern
Here is a more complete production-ready pattern that handles rate limits across multiple providers with automatic failover:
```python
import time
import random

from openai import OpenAI, RateLimitError

# Initialize clients for multiple providers via their OpenAI-compatible endpoints
openai_client = OpenAI()
gemini_client = OpenAI(
    api_key="your-gemini-key",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)
deepseek_client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com",
)

PROVIDERS = [
    {"client": openai_client, "model": "gpt-5", "name": "OpenAI"},
    {"client": gemini_client, "model": "gemini-2.5-pro", "name": "Gemini"},
    {"client": deepseek_client, "model": "deepseek-chat", "name": "DeepSeek"},
]

def call_with_failover(messages, max_retries=3):
    """Try each provider in order, with retries per provider."""
    for provider in PROVIDERS:
        for attempt in range(max_retries):
            try:
                response = provider["client"].chat.completions.create(
                    model=provider["model"],
                    messages=messages,
                )
                return response, provider["name"]
            except RateLimitError:
                if attempt < max_retries - 1:
                    wait = (2 ** attempt) + random.uniform(0, 1)
                    time.sleep(wait)
                else:
                    print(f"{provider['name']} exhausted. Trying next provider.")
                    break
    raise Exception("All providers rate-limited. Consider queuing this request.")
```
This pattern ensures your application stays responsive even when individual providers are throttling you. The key is that rate limits are per-provider, so being rate-limited on OpenAI says nothing about your remaining capacity on Gemini or DeepSeek.
Provider Recommendation by Daily Volume
| Daily Requests | Best Provider | Why |
|---|---|---|
| Under 1,000 | Any provider | All handle this volume comfortably |
| 1,000 - 5,000 | OpenAI (Tier 2) or Gemini | 5,000 RPM (OpenAI) or 4,000 RPM (Gemini) |
| 5,000 - 20,000 | Gemini 2.5 Flash | 4,000 RPM + 4M TPM with no tier requirements |
| 20,000 - 50,000 | Gemini + DeepSeek failover | Combined 5M+ TPM, lowest effective cost |
| 50,000+ | OpenAI Tier 5 or custom enterprise | 30M TPM (OpenAI) or negotiated limits |
For token-heavy workloads (long documents, large context):
| Daily Token Volume | Best Provider | Why |
|---|---|---|
| Under 10M tokens | Any provider | All handle this at paid tier |
| 10M - 100M tokens | Gemini or DeepSeek | 4M and 1M TPM respectively, no tiers needed |
| 100M - 500M tokens | OpenAI Tier 3+ | 4M TPM at Tier 3, scaling to 30M at Tier 5 |
| 500M+ tokens | OpenAI Tier 5 + Gemini | Combined 34M TPM, or contact sales for custom |
The Hidden Cost of Low Rate Limits
Rate limits have a real financial impact beyond just throttled requests. When your application hits a 429, several things happen:
- Wasted compute. Your server processed the user’s request, built the prompt, estimated tokens — all before discovering the API will not accept it.
- User-facing latency. The retry delay (even with exponential backoff) adds seconds or minutes to response times. Users notice.
- Queue depth explosion. If incoming requests exceed your API throughput, queues grow unboundedly. You need either a cap (reject requests) or a very large buffer.
- Over-provisioning costs. To avoid hitting limits, many teams over-provision by buying higher tiers than their average usage requires — paying for headroom they rarely use.
This is why rate limits should be factored into your total cost analysis alongside per-token pricing. A provider that costs 20% more per token but offers 10x the throughput may actually be cheaper when you account for infrastructure complexity, queue management, and user experience impact.
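To make that concrete, here is a rough calculation (illustrative workload size) of how long a fixed daily volume takes to push through different TPM ceilings at full saturation:

```python
def hours_to_process(total_tokens: int, tpm: int) -> float:
    """Wall-clock hours to push a workload through a TPM ceiling at saturation."""
    return total_tokens / tpm / 60

daily_workload = 200_000_000  # hypothetical 200M tokens/day

slow = hours_to_process(daily_workload, tpm=400_000)    # e.g. Claude Tier 4
fast = hours_to_process(daily_workload, tpm=4_000_000)  # e.g. Gemini paid tier
```

At 400K TPM the workload saturates the limit for more than eight hours a day; at 4M TPM it clears in under one. A modest per-token premium for the higher ceiling can be the cheaper system overall once queueing and latency are counted.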
Bottom Line
Rate limits in February 2026 vary dramatically across providers, and the differences are larger than most developers realize:
- Google Gemini offers the highest immediately-available throughput at 4M TPM with no tier system and no minimum spend. Combined with a genuinely useful free tier (500 RPM for Flash), Gemini is the best choice for applications that need reliable, high throughput from day one.
- OpenAI has the highest theoretical ceiling at 30M TPM (Tier 5), but you need $1,000+ in cumulative spend to unlock it. For most startups, Tier 2-3 at 2-4M TPM is more realistic and still competitive.
- Anthropic (Claude) has the most restrictive rate limits of any major provider. Even at Tier 4 ($400+ spend), Sonnet 4.5 maxes out at 400K TPM — 10x lower than Gemini’s standard paid tier. If you choose Claude for its quality, budget for the throughput constraints.
- DeepSeek offers a unique value proposition: 1M TPM with flat pricing and no tiers, but only 60 RPM. Best for batch-style workloads with long contexts rather than high-concurrency chatbots.
- xAI (Grok) sits in the middle with reasonable limits and free credits for getting started, though it cannot match Gemini or OpenAI on raw throughput.
For most production applications, the optimal strategy is: start with Gemini for free-tier development, add OpenAI as your primary provider at Tier 2+, and keep DeepSeek or Gemini Flash as a high-TPM failover for burst traffic. Use the Batch API for everything that does not need real-time responses.
Rate limits will continue to increase as providers scale their infrastructure, but in February 2026, these are the numbers you need to plan around.
Related tools and guides:
- AI Token Counter — Pre-count tokens before sending API requests
- AI Model Pricing Calculator — Compare costs across 40+ models
- How to Cut AI API Costs by 80% — 8 optimization strategies including batch API and model routing
- AI API Pricing Comparison 2026 — Full pricing table for all 7 major providers
- OpenAI API Pricing Guide 2026 — GPT-5, GPT-4.1, o3 pricing and tier details
- Claude API Pricing Guide 2026 — Opus, Sonnet, Haiku pricing and prompt caching
- Google Gemini API Pricing Guide 2026 — Gemini 2.5 Pro/Flash, free tier, 1M context
- Grok API Pricing Guide 2026 — Grok 3 pricing, $25 free credits
- DeepSeek API Pricing Guide 2026 — The cheapest capable AI model
- Mistral API Pricing Guide 2026 — EU-compliant, open-weight options