DevTk.AI
AI API Rate LimitsAPI ThroughputOpenAI LimitsClaude LimitsGemini Limits

AI API Rate Limits 2026: OpenAI, Anthropic, Gemini RPM, TPM & 429 Fixes

Current AI API rate limits for OpenAI, Anthropic Claude, Gemini, DeepSeek, xAI, and Mistral. Compare RPM, TPM, usage tiers, free limits, and how to avoid 429 errors.

DevTk.AI 2026-02-24 Updated 2026-05-24 22 min read

AI API rate limits decide whether your app can scale. The price can look perfect, but if your provider throttles you at a low RPM or TPM ceiling, production traffic quickly turns into 429 errors, retries, and queue delays.

This guide compares current 2026 rate-limit behavior across OpenAI, Anthropic Claude, Google Gemini, DeepSeek, xAI Grok, and Mistral. It focuses on the numbers developers search for most: requests per minute, tokens per minute, usage tiers, free limits, and practical ways to avoid 429 errors.

Quick Answer: Which API Has The Best Rate Limits?

ProviderPublic limit modelPractical takeaway
OpenAIUsage tiers by spend and account ageStrongest published high-tier throughput for production teams
Anthropic ClaudeRPM plus separate input/output TPMGreat for Claude workloads, but Tier 1 is tight and ITPM/OTPM must be planned separately
Google GeminiAI Studio / quota-tier dependentOften generous, but live project quota should be treated as the source of truth
DeepSeekDynamic concurrency, no fixed public RPM/TPM tableVery cheap, but production apps need queues, timeouts, and fallback routing
xAI GrokFree credits plus scaling paid limitsUseful for experimentation and X-related workflows
MistralModerate published paid RPMNot the highest throughput, but useful for EU/compliance-sensitive workloads

If you are capacity planning, use the API Throughput Planner alongside this guide. If you are optimizing cost at the same time, use the AI Model Pricing Calculator.

Key Terms

Before diving into the numbers, here are the three metrics every provider uses:

  • RPM (Requests Per Minute) — The maximum number of API calls you can make in a 60-second window.
  • TPM (Tokens Per Minute) — The maximum number of tokens (input + output combined) the API will process for you in a 60-second window.
  • RPD (Requests Per Day) — Some providers also cap total daily requests, especially on free tiers.

In practice, TPM is the limit that matters most for production applications.

Rate Limits by Provider

OpenAI — Tier-Based System (Free through Tier 5)

OpenAI’s rate limits scale with your cumulative platform spend and account age. Tiers upgrade automatically, but exact limits vary by model.

GPT-5.4 and GPT-5 Rate Limits:

ModelTierQualificationRPMTPM
GPT-5.4FreeNot supported
GPT-5.4Tier 1$5 paid500500,000
GPT-5.4Tier 2$50 paid + 7 days5,0001,000,000
GPT-5.4Tier 3$100 paid + 7 days5,0002,000,000
GPT-5.4Tier 4$250 paid + 14 days10,0004,000,000
GPT-5.4Tier 5$1,000 paid + 30 days15,00040,000,000
GPT-5FreeNot supported
GPT-5Tier 1$5 paid500500,000
GPT-5Tier 2$50 paid + 7 days5,0001,000,000
GPT-5Tier 3$100 paid + 7 days5,0002,000,000
GPT-5Tier 4$250 paid + 14 days10,0004,000,000
GPT-5Tier 5$1,000 paid + 30 days15,00040,000,000

Selected reasoning model examples (Tier 3):

ModelRPMTPM
o35,000800,000
o3-mini5,0004,000,000

Anthropic (Claude) — Four-Tier System

Anthropic uses a spend-based tier system. Its Messages API limits are measured separately as RPM, input tokens per minute (ITPM), and output tokens per minute (OTPM), so a single combined TPM number can be misleading.

Claude Opus 4.6 & Sonnet 4.6 Rate Limits:

TierRPMITPMOTPM
Tier 15030,0008,000
Tier 21,000450,00090,000
Tier 32,000800,000160,000
Tier 44,0002,000,000400,000

Anthropic publishes Opus 4.x and Sonnet 4.x as shared family pools rather than separate limits for each model version. Cached reads generally do not count against ITPM for current Claude models, which can make effective throughput higher for cache-heavy workloads.

Google Gemini — Tier-Dependent Throughput

Google structures its rate limits based on “Usage Tiers.” Your actual limits depend on whether you are using the Free of charge tier or the Pay-as-you-go tier.

Google’s public Gemini rate-limit page no longer exposes a complete stable RPM/TPM table in the docs page itself. It says active limits depend on quota tier and should be checked in AI Studio, and that listed limits are not guaranteed. The table below keeps only conservative, planner-facing 2.5-series baselines where the site already uses published presets.

Gemini 2.5 Series (published baseline presets):

PlanRPMTPMRPD
Tier 1 (2.5 Pro)1502,000,000 input TPM10,000
Tier 1 (2.5 Flash)1,0001,000,000 input TPM10,000
Tier 1 (2.5 Flash-Lite)4,0004,000,000 input TPMCheck AI Studio

xAI (Grok) — Free Credits + Scaling Tiers

xAI provides $25 in free signup credits and structures rate limits that scale with usage.

Grok 3 & 4 Rate Limits:

ModelFree Tier RPMFree Tier TPMPaid RPMPaid TPM
Grok 460100,000Up to 2,000Up to 1,000,000
Grok 3 Mini100200,000Up to 4,000Up to 2,000,000

DeepSeek — Dynamic RPM/TPM, Published Concurrency Caps

DeepSeek does not publish fixed RPM/TPM tables for V4. Its official docs say concurrency can be affected by server load and short-term usage history, and the pricing page publishes current concurrency caps of 2,500 for V4 Flash and 500 for V4 Pro. When the platform is busy, requests may wait on an open HTTP connection, return keep-alive lines, or receive a 429 if the dynamic limit is reached.

DeepSeek V4 behavior:

ModelPublic RPM/TPMPublished concurrencyPractical note
DeepSeek V4 Flash (deepseek-v4-flash)Dynamic2,500Best default for low-cost agent traffic
DeepSeek V4 Pro (deepseek-v4-pro)Dynamic500Stronger model at official 1/4-of-original pricing after May 31, 2026
deepseek-chat / deepseek-reasonerCompatibility aliases-Scheduled for deprecation on July 24, 2026

Key observations about DeepSeek:

  • Extremely low effective cost. V4 Flash is $0.14/M cache-miss input, $0.0028/M cached input, and $0.28/M output, so cache-heavy agent workloads can cost far less than fixed-price tables suggest.
  • Concurrency is the real bottleneck. DeepSeek can still return 429 or delay scheduling under high load, so production systems should keep timeouts, queues, and fallback providers.
  • No paid tier ladder. There is no public “spend $X to unlock Y TPM” path, so plan around dynamic capacity rather than fixed guarantees.

Mistral — Free Tier Available

Mistral offers a free tier for experimentation and paid plans with competitive limits.

Mistral Rate Limits:

ModelFree Tier RPMPaid RPM
Mistral Large 3Lower (varies)300
Mistral Medium 3Lower (varies)300
Mistral Small 3.1Lower (varies)300

Key observations about Mistral:

  • Free tier is available without a credit card, similar to Google. Useful for evaluation and prototyping.
  • 300 RPM across all paid models is moderate — higher than Anthropic Tier 1 (50), but well below OpenAI Tier 2 (5,000) and Gemini (2,000-4,000). DeepSeek should be evaluated separately because its public docs describe dynamic concurrency rather than a fixed RPM table.
  • European data residency is Mistral’s unique advantage. Rate limits are not their differentiator — compliance is.

The Master Comparison Table

Here is every provider side by side at the paid tier that most startups and production applications use.

ProviderModelTier/PlanRPMTPMMin Spend
OpenAIGPT-5.4 / GPT-5Tier 515,00040,000,000$1,000 + 30 days
OpenAIGPT-5.4 / GPT-5Tier 35,0002,000,000$100 + 7 days
GoogleGemini 2.5 Flash-LiteTier 1 baseline4,0004,000,000 input TPMBilling enabled
GoogleGemini 2.5 ProTier 1 baseline1502,000,000 input TPMBilling enabled
DeepSeekV4 FlashDynamicDynamicDynamic$0
MistralLarge 3Paid5002,000,000$0
xAIGrok 4Paid (high)2,0001,000,000
AnthropicSonnet 4.6Tier 44,0002,000,000 ITPM / 400,000 OTPMStandard Tier 4

Ranking by TPM (tokens per minute):

  1. OpenAI Tier 5 — 40,000,000 TPM on GPT-5.4 / GPT-5 (requires $1,000+ spend and 30+ days)
  2. Google Gemini 2.5 Flash-Lite Tier 1 baseline — 4,000,000 input TPM, with active limits shown in AI Studio
  3. OpenAI Tier 3 — 2,000,000 TPM on GPT-5.4 / GPT-5 (requires $100+ spend and 7+ days)
  4. Anthropic Tier 4 — 2,000,000 ITPM and 400,000 OTPM for Sonnet 4.x / Opus 4.x
  5. DeepSeek V4 — dynamic concurrency, very low token cost

The pattern is clear: OpenAI offers the highest published standard ceiling (Tier 5), while Google and DeepSeek require more live-account verification because Gemini limits are project-specific and DeepSeek uses dynamic scheduling. Anthropic remains output-token constrained, but its split ITPM/OTPM design and cache-aware accounting are more nuanced than a single TPM comparison suggests.

Comparison by Use Case

For Prototyping and Evaluation (Free Tier)

If you are just getting started, experimenting with models, or building a proof of concept, here is what each provider offers at zero cost:

ProviderFree RPMFree TPMCredit Card Required?Notes
Gemini 2.5 Flash / Flash-LiteCheck AI StudioCheck AI StudioNoUseful free evaluation tier, but active quota is project-specific
Grok 460100,000No$25 free credit
DeepSeek V4DynamicDynamicNoVery cheap, but capacity changes with server load
OpenAI GPT-5YesAPI docs list GPT-5 as not supported on Free tier
Claude Sonnet 4.6YesNo free tier

Winner: Gemini or DeepSeek for low-cost evaluation, depending on whether you prefer published quota controls in AI Studio or DeepSeek’s dynamic low-cost capacity.

Worst for free-tier API throughput: OpenAI GPT-5, because the current model docs list it as not supported on Free tier.

For Startups (Medium Volume: 1K-10K Requests/Day)

At this scale, you are past prototyping and need reliable throughput for real users. The key question is which provider gives you enough headroom without requiring a large upfront spend.

ProviderModelRPMTPMMonthly Min Spend
Gemini2.5 Pro150 baseline2,000,000 input TPM baselinePay-as-you-go; verify in AI Studio
Gemini2.5 Flash-Lite4,000 baseline4,000,000 input TPM baselinePay-as-you-go; verify in AI Studio
OpenAIGPT-5 (Tier 2)5,0001,000,000$50+ cumulative + 7 days
OpenAIGPT-5 (Tier 3)5,0002,000,000$100+ cumulative + 7 days
DeepSeekV4 FlashDynamicDynamicPay-as-you-go
AnthropicSonnet 4.x (Tier 2)1,000450,000 ITPM / 90,000 OTPMStandard Tier 2
xAIGrok 3Up to 1,200Up to 600,000Pay-as-you-go

Winner: OpenAI Tier 2-3 for the highest published RPM among these standard examples, or Gemini when your own AI Studio quota shows higher project-specific headroom.

Watch Anthropic output tokens. Tier 2 allows much more input than the old combined-TPM summary implied, but 90K output tokens per minute can still be the binding limiter for generation-heavy workloads.

For Enterprise (High Volume: 50K+ Requests/Day)

At enterprise scale, all providers offer custom rate limits through sales engagements. But here is what you get on standard plans:

ProviderModelBest Standard RPMBest Standard TPM
OpenAIGPT-5 / GPT-5.4 (Tier 5)15,00040,000,000
GoogleGemini 2.5 Flash-Lite4,0004,000,000 input TPM baseline
GoogleGemini 2.5 Pro1502,000,000 input TPM baseline
AnthropicSonnet 4.x / Opus 4.x (Tier 4)4,0002,000,000 ITPM / 400,000 OTPM
DeepSeekV4 FlashDynamicDynamic

Winner: OpenAI Tier 5 at 40M TPM is in a class of its own among published standard model limits. If you are at enterprise scale and need the highest standard published throughput, OpenAI is the clearest documented ceiling. Gemini may still be strong, but your exact project quota should be read from AI Studio.

Note on custom limits: At $5,000+/month spend, every major provider will negotiate custom rate limits. Contact sales teams directly for Anthropic, OpenAI, Google, and xAI if standard limits are insufficient.

How Rate Limits Affect Your Architecture

Rate limits are not just an API annoyance — they should influence your entire system design. Here are the architectural patterns that matter.

1. Queue Management and Backpressure

When your application receives more requests than your API rate limit can handle, you need a queue. The simplest approach is a token bucket algorithm that tracks your remaining RPM and TPM budget and delays requests when limits are close.

The critical mistake is not accounting for TPM limits separately from RPM limits. A system that only tracks RPM will work fine for short messages but fail spectacularly when a user submits a 50K-token document that consumes half your TPM budget in a single request.

2. Multi-Provider Failover

The most robust architecture uses multiple providers as fallbacks. When your primary provider returns a 429 (rate limit exceeded), route to a secondary:

  • Primary: OpenAI GPT-5 (best overall quality)
  • Failover 1: Gemini 2.5 Pro (same price, higher TPM)
  • Failover 2: DeepSeek V4 Flash (much cheaper, dynamic capacity)

This gives you effective throughput across independent provider pools instead of relying on one account. With OpenAI Tier 3 + Gemini + DeepSeek, you get two published TPM pools plus a very cheap dynamic DeepSeek overflow route.

3. Token Estimation Before Sending

Pre-counting tokens before sending a request lets you predict whether it will push you over your TPM limit. This avoids wasting an API call (and consuming RPM budget) on a request that will be rejected anyway.

Use our AI Token Counter to understand token counts for different models. For programmatic estimation, the tiktoken library (Python) or gpt-tokenizer (JavaScript) provides exact counts for OpenAI models, and approximate counts for others.

4. Separate Rate Limit Pools per Model

OpenAI and Anthropic both allocate rate limits per model, not per account. This means using GPT-5 and GPT-5 Nano simultaneously gives you two separate pools. Architect your system to spread load across models:

  • Route simple tasks to budget models (GPT-5 Nano, Gemini Flash, Haiku)
  • Route complex tasks to flagship models (GPT-5, Claude Sonnet, Gemini Pro)

Each model has its own RPM and TPM allocation, effectively multiplying your total throughput.

Tips to Maximize Throughput

1. Use Batch API for Non-Real-Time Workloads

Both OpenAI and Anthropic offer Batch APIs that process requests asynchronously (typically within 24 hours). Batch requests are exempt from standard rate limits and come with a 50% price discount. If any part of your workload — content generation, data extraction, evaluation, nightly processing — does not need real-time responses, move it to the Batch API immediately.

This is the single highest-impact optimization for throughput-constrained applications.

2. Implement Exponential Backoff with Jitter

When you hit a rate limit (HTTP 429), do not retry immediately. Use exponential backoff with random jitter to spread retry attempts:

import time
import random
from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-5",
                messages=messages
            )
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s
            base_wait = 2 ** attempt
            # Add jitter: random 0-50% extra
            jitter = base_wait * random.uniform(0, 0.5)
            wait = base_wait + jitter
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
    raise Exception("Max retries exceeded")

The jitter is important because without it, all your retry attempts (and those of other clients) happen at exactly the same time, causing another burst of 429s. Jitter spreads the retries across the backoff window.

3. Pre-Count Tokens to Avoid Wasted Requests

Every rejected request (429 error) wastes your RPM budget. By estimating token counts before sending, you can hold requests in a local queue until you have enough TPM headroom:

import tiktoken

def estimate_tokens(messages, model="gpt-5"):
    """Estimate total tokens for a request."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += len(enc.encode(msg["content"])) + 4  # message overhead
    total += 2  # reply priming
    return total

# Before sending, check if we have budget
estimated = estimate_tokens(messages)
if estimated > remaining_tpm_budget:
    # Queue the request instead of sending immediately
    request_queue.append(messages)
else:
    remaining_tpm_budget -= estimated
    response = client.chat.completions.create(model="gpt-5", messages=messages)

4. Route Burst Traffic to High-Limit Providers

If your application experiences traffic spikes, route the excess to the provider with the highest available limits. In practice, this means:

  • Normal traffic: Use your preferred provider (e.g., OpenAI or Claude)
  • Burst traffic: Overflow to Gemini when your AI Studio quota shows available headroom, or to DeepSeek V4 Flash when your workload can tolerate dynamic capacity and lower guarantees

This pattern keeps your primary provider’s quality for most requests while preventing 429 errors during peaks.

5. Upgrade Tiers Strategically

For OpenAI and Anthropic, your tier is based on cumulative spend, not monthly spend. This means:

  • If you know you will need Tier 3+ limits, front-load your spending by purchasing credits early.
  • OpenAI: $100 cumulative spend plus 7+ days since first successful payment unlocks Tier 3 (2M TPM for GPT-5). That is a one-time threshold, not monthly.
  • Anthropic: Tier 3 raises Sonnet 4.x to 800K ITPM and 160K OTPM. Check the Console for the exact current spend and workspace limits.

Plan your tier progression based on your growth projections, and purchase credits slightly ahead of when you need the higher limits.

6. Use Streaming to Improve Perceived Throughput

Streaming responses does not change your actual rate limits, but it allows you to start displaying output to users before the full response is complete. This reduces perceived latency and makes rate-limit-induced delays less noticeable. All major providers support streaming via server-sent events (SSE).

Rate Limit Error Handling — Production Pattern

Here is a more complete production-ready pattern that handles rate limits across multiple providers with automatic failover:

import time
import random
from openai import OpenAI, RateLimitError

# Initialize clients for multiple providers
openai_client = OpenAI()
gemini_client = OpenAI(
    api_key="your-gemini-key",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)
deepseek_client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

PROVIDERS = [
    {"client": openai_client, "model": "gpt-5", "name": "OpenAI"},
    {"client": gemini_client, "model": "gemini-2.5-pro", "name": "Gemini"},
    {"client": deepseek_client, "model": "deepseek-v4-flash", "name": "DeepSeek"},
]

def call_with_failover(messages, max_retries=3):
    """Try each provider in order, with retries per provider."""
    for provider in PROVIDERS:
        for attempt in range(max_retries):
            try:
                response = provider["client"].chat.completions.create(
                    model=provider["model"],
                    messages=messages
                )
                return response, provider["name"]
            except RateLimitError:
                if attempt < max_retries - 1:
                    wait = (2 ** attempt) + random.uniform(0, 1)
                    time.sleep(wait)
                else:
                    print(f"{provider['name']} exhausted. Trying next provider.")
                    break
    raise Exception("All providers rate-limited. Consider queuing this request.")

This pattern ensures your application stays responsive even when individual providers are throttling you. The key is that rate limits are per-provider, so being rate-limited on OpenAI says nothing about your remaining capacity on Gemini or DeepSeek.

Provider Recommendation by Daily Volume

Daily RequestsBest ProviderWhy
Under 1,000Any providerAll handle this volume comfortably
1,000 - 5,000OpenAI (Tier 2) or Gemini5,000 RPM on OpenAI Tier 2; Gemini depends on AI Studio quota
5,000 - 20,000Gemini 2.5 Flash-Lite or OpenAI Tier 3High published Gemini baseline or 5,000 RPM on OpenAI
20,000 - 50,000Gemini + DeepSeek failoverHigher combined capacity, but DeepSeek remains dynamically scheduled
50,000+OpenAI Tier 5 or custom enterprise40M TPM on GPT-5 / GPT-5.4, or negotiated limits

For token-heavy workloads (long documents, large context):

Daily Token VolumeBest ProviderWhy
Under 10M tokensAny providerAll handle this at paid tier
10M - 100M tokensGemini or DeepSeekGemini has published high TPM; DeepSeek is much cheaper but dynamically scheduled
100M - 500M tokensOpenAI Tier 3+2M TPM at Tier 3, scaling to 40M at Tier 5
500M+ tokensOpenAI Tier 5 + GeminiUse OpenAI’s 40M TPM standard ceiling plus verified Gemini quota, or contact sales for custom

The Hidden Cost of Low Rate Limits

Rate limits have a real financial impact beyond just throttled requests. When your application hits a 429, several things happen:

  1. Wasted compute. Your server processed the user’s request, built the prompt, estimated tokens — all before discovering the API will not accept it.
  2. User-facing latency. The retry delay (even with exponential backoff) adds seconds or minutes to response times. Users notice.
  3. Queue depth explosion. If incoming requests exceed your API throughput, queues grow unboundedly. You need either a cap (reject requests) or a very large buffer.
  4. Over-provisioning costs. To avoid hitting limits, many teams over-provision by buying higher tiers than their average usage requires — paying for headroom they rarely use.

This is why rate limits should be factored into your total cost analysis alongside per-token pricing. A provider that costs 20% more per token but offers 10x the throughput may actually be cheaper when you account for infrastructure complexity, queue management, and user experience impact.

Bottom Line

Rate limits in May 2026 vary dramatically across providers, and the differences are larger than most developers realize:

  • Google Gemini can offer strong throughput, but current Google docs direct you to AI Studio for active project limits and warn that listed limits are not guaranteed. Treat public presets as planning baselines, not contractual ceilings.
  • OpenAI has the highest published standard ceiling at 40M TPM (Tier 5), but you need $1,000+ in cumulative spend and 30+ days since first successful payment to unlock it. For most startups, Tier 2-3 at 1-2M TPM is more realistic and still competitive.
  • Anthropic (Claude) is output-token constrained compared with OpenAI’s highest standard tier, but its current docs split ITPM and OTPM and exclude cached reads from ITPM for most current models. If you choose Claude for quality, budget for output-token and acceleration limits.
  • DeepSeek offers a unique value proposition: extremely low V4 Flash pricing, automatic context caching, and no public paid tier ladder. Best as a high-volume low-cost route when you can tolerate dynamic scheduling and keep fallback providers.
  • xAI (Grok) sits in the middle with reasonable limits and free credits for getting started, though it cannot match Gemini or OpenAI on raw throughput.

For most production applications, the optimal strategy is: start with Gemini for free-tier development, add OpenAI for high-confidence tasks at Tier 2+, and overweight DeepSeek V4 Flash for low-cost repeated-context agent traffic. Use the Batch API for everything that does not need real-time responses.

Rate limits will continue to change as providers scale their infrastructure, but as of May 6, 2026, these are the official-doc baselines you should verify against your own provider dashboards before production launch.

Related tools and guides:

Related Posts