Skip to main content

Rate Limiting

Rate limiting is a critical feature for managing AI service usage, controlling costs, and protecting your upstream services from abuse. AISIX provides a flexible rate limiting engine that can be applied to both API Keys and Models.

How Rate Limiting Works

The RateLimitHook is a default hook that enforces rate limits. It operates in both the pre_call and post_call stages:

  • pre_call: Before forwarding a request, the hook checks if it would exceed the configured requests-per-minute/day limit. If so, the request is rejected with a 429 Too Many Requests error.
  • post_call: After a successful response is received, the hook inspects the token usage and updates the tokens-per-minute/day counters.

Configuring Rate Limits

Rate limits can be defined in the rate_limit field of both ApiKey and Model entities for granular control.

Rate Limit Metrics

You can set limits based on five metrics:

MetricDescription
rpmRequests Per Minute
rpdRequests Per Day
tpmTokens Per Minute
tpdTokens Per Day
concurrencyRequest Concurrency

Example Configuration

Here is how to configure a rate limit on a Model entity. The same structure applies to ApiKey entities.

# Create a model with a rate limit
curl -X POST http://127.0.0.1:3001/aisix/admin/models \
-H "Authorization: Bearer your-strong-admin-key-here" \
-H "Content-Type: application/json" \
-d '{
"name": "limited-model",
"model": "openai/gpt-4.1-mini",
"provider_config": { "api_key": "..." },
"rate_limit": {
"rpm": 100,
"tpm": 10000,
"concurrency": 10
}
}'

This configuration limits the limited-model to 100 requests per minute, 10,000 tokens per minute, and 10 concurrent requests.

Independent Limits

When rate limits are configured on both an ApiKey and a Model for a request, they are evaluated independently. Both limits must be satisfied for the request to proceed.

  • The API Key's limit controls the usage for that client.
  • The Model's limit controls the aggregate usage for that model.

If either limit is exceeded, the request is rejected with a 429 error, and the error message indicates which entity and metric caused the rejection.

Quota Timing

Request-based quotas (RPM/RPD) on an API Key are consumed during the pre_call stage before model-level checks run. If a subsequent model pre-check rejects the request, the API Key quota has already been decremented.

Rate Limit Headers

For every request that passes through the RateLimitHook, AISIX adds HTTP headers to the response. These headers give the client real-time visibility into their rate limit status, allowing them to self-regulate their request rate.

Request-Based Limit Headers

HeaderDescription
x-ratelimit-limit-requestsThe total request limit for the current window.
x-ratelimit-remaining-requestsThe number of requests remaining in the current window.
x-ratelimit-reset-requestsThe time remaining until the request limit window resets, in a human-readable format (e.g., 55s).

Concurrency Limit Headers

HeaderDescription
x-ratelimit-limit-concurrentThe maximum number of concurrent requests allowed.
x-ratelimit-remaining-concurrentThe number of available concurrent request slots.

Token-Based Limit Headers

HeaderDescription
x-ratelimit-limit-tokensThe total token limit for the current window.
x-ratelimit-remaining-tokensThe number of tokens remaining in the current window.
x-ratelimit-reset-tokensThe time remaining until the token limit window resets.

When limits are on both the API Key and the Model, the headers reflect the strictest of the two limits.

Error Response

When a rate limit is exceeded, AISIX returns a 429 Too Many Requests error with a JSON body with details:

{
"error": {
"message": "Rate limit exceeded for API key ID: my-app. Limited on rpm, current limit: 100, remaining: 0",
"type": "rate_limit_error",
"code": "rate_limit_exceeded"
}
}

The response also includes a Retry-After header indicating how many seconds the client should wait before making another request.

Concurrency Limit

When the concurrent request limit is exceeded, AISIX returns a 429 Too Many Requests error with a different error code:

{
"error": {
"message": "Concurrency limit exceeded for model 'openai-gpt4-mini'",
"type": "rate_limit_error",
"code": "concurrency_limit_exceeded"
}
}
API7.ai Logo

The digital world is connected by APIs,
API7.ai exists to make APIs more efficient, reliable, and secure.

Sign up for API7 newsletter

Product

API7 Gateway

SOC2 Type IIISO 27001HIPAAGDPRRed Herring

Copyright © APISEVEN PTE. LTD 2019 – 2026. Apache, Apache APISIX, APISIX, and associated open source project names are trademarks of the Apache Software Foundation