Rate Limits
Rate limits protect upstream providers and keep one caller or model alias from consuming shared gateway capacity. They are useful when applications share the same model alias, upstream credential, or provider quota.
In this guide, you will add a caller-specific request limit to a self-hosted gateway. You will send traffic through the gateway and confirm that AISIX rejects requests after the quota is exceeded. You can also use similar limits on a model alias.
Prerequisites
Before starting, prepare the following:
- A self-hosted AISIX gateway with the admin and proxy listeners available.
- The admin key from the gateway config.yaml.
- A working model alias that can serve proxy requests. If you have not created one yet, configure Provider Credentials and Model Aliases first.
Choose a Rate-Limit Scope
In self-hosted gateways, configure rate limits on caller API keys or models. Choose the smallest scope that matches the quota you want to protect:
- A caller API key, when one application or tenant should have its own quota.
- A model, when several caller API keys share the same expensive or capacity-constrained model alias.
The example below protects one application by applying the limit to that application's caller API key.
Configure a Caller Limit
Choose the caller API key value that the application will send, set the model alias, and hash the key before creating the admin resource. This example allows the key to send one request per minute to the configured model alias:
export AISIX_ADMIN_KEY="YOUR_ADMIN_KEY"
export AISIX_API_KEY="YOUR_CALLER_API_KEY"
export AISIX_MODEL="gpt-4o-prod"
AISIX_API_KEY_HASH=$(printf '%s' "${AISIX_API_KEY}" | shasum -a 256 | awk '{print $1}')
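If keys are provisioned from application code rather than a shell, the same digest can be computed in Python. This is a minimal sketch that mirrors the shasum pipeline above; the helper name `hash_caller_key` is illustrative and not part of AISIX.

```python
import hashlib


def hash_caller_key(api_key: str) -> str:
    """Return the lowercase hex SHA-256 digest of the caller API key.

    Mirrors: printf '%s' "$KEY" | shasum -a 256 | awk '{print $1}'
    """
    # printf '%s' emits no trailing newline, so hash the raw key bytes only.
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(hash_caller_key("YOUR_CALLER_API_KEY"))
```

The result is a 64-character hexadecimal string, the same shape as the key_hash value sent to the admin API.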
Create a caller API key resource with a one-request-per-minute limit. The request goes to the caller API key admin route, and the rate_limit field in the request body defines the limit. The rpm field means requests per minute:
curl -sS -X POST "http://127.0.0.1:3001/admin/v1/apikeys" \
  -H "Authorization: Bearer ${AISIX_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "key_hash": "'"${AISIX_API_KEY_HASH}"'",
    "allowed_models": ["'"${AISIX_MODEL}"'"],
    "rate_limit": {
      "rpm": 1
    }
  }'
You should see a response similar to the following:
{
  "id": "4ae2b1b8-5e2c-4f44-8d8a-2f6a6f5ef7f8",
  "value": {
    "key_hash": "4b4f91305bd7f14a04ef6c850b3f4d0a8ce9ac67bc63f8b342ccdfd0d2f5b8f8",
    "allowed_models": [
      "gpt-4o-prod"
    ],
    "rate_limit": {
      "rpm": 1
    }
  },
  "revision": 1
}
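The same admin call can be assembled with Python's standard library. The sketch below assumes the admin address and route from the curl example; `build_apikey_request` is a hypothetical helper name, and nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import hashlib
import json
import urllib.request


def build_apikey_request(admin_base, admin_key, caller_key, model, rpm):
    """Build (but do not send) the POST that creates a rate-limited caller key."""
    body = {
        "key_hash": hashlib.sha256(caller_key.encode("utf-8")).hexdigest(),
        "allowed_models": [model],
        "rate_limit": {"rpm": rpm},
    }
    return urllib.request.Request(
        url=f"{admin_base}/admin/v1/apikeys",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_apikey_request(
    "http://127.0.0.1:3001", "YOUR_ADMIN_KEY", "YOUR_CALLER_API_KEY", "gpt-4o-prod", 1
)
# To send it against a running gateway: urllib.request.urlopen(req)
```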
Verify Rate Limiting
Before AISIX sends a request to the upstream provider, it checks every matching limit that has been configured. If any one limit has no remaining capacity, AISIX rejects the request with 429.
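The "every matching limit" rule can be pictured as a logical AND across the caller-key and model limits. The following sketch is illustrative only: the dictionary of remaining capacity, the labels, and the name `admit` are assumptions, not AISIX internals.

```python
def admit(remaining_by_limit: dict[str, int]) -> tuple[bool, list[str]]:
    """Admit a request only if every matching limit still has capacity.

    `remaining_by_limit` maps a label such as "apikey.rpm" to the capacity
    left in that limit. Returns (admitted, exhausted_labels).
    """
    exhausted = [name for name, left in remaining_by_limit.items() if left <= 0]
    return (not exhausted, exhausted)


# The caller key still has room, but the shared model limit is used up,
# so this request would be rejected with 429.
decision = admit({"apikey.rpm": 3, "model.rpm": 0})
```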
Send three requests with the rate-limited caller API key:
for i in 1 2 3; do
  printf "request %s: " "${i}"
  curl -sS -o /dev/null -w "%{http_code}\n" -X POST "http://127.0.0.1:3000/v1/chat/completions" \
    -H "Authorization: Bearer ${AISIX_API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "'"${AISIX_MODEL}"'",
      "messages": [
        {"role": "user", "content": "Hello from AISIX."}
      ]
    }'
done
The first request should reach the upstream model. The second and third requests should exceed the one-request-per-minute limit and be rejected:
request 1: 200
request 2: 429
request 3: 429
When AISIX rejects the request, the response starts with HTTP/1.1 429 Too Many Requests. It uses the proxy error format and includes Retry-After when the limiter can calculate a retry window:
{
  "error": {
    "message": "request limit exceeded (requests)",
    "type": "rate_limit_exceeded"
  }
}
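A client can use the status code and the optional Retry-After header to decide how long to back off. This is a minimal client-side sketch, not AISIX code: `retry_delay_seconds` and its `fallback` parameter are hypothetical, and only the delta-seconds form of Retry-After is handled.

```python
def retry_delay_seconds(status: int, headers: dict[str, str], fallback: float = 1.0):
    """Return how long to wait before retrying, or None if no retry is needed.

    A missing or non-numeric Retry-After value falls back to `fallback`.
    """
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("retry-after", "").strip()
    if value.isascii() and value.isdigit():
        return float(value)
    return fallback
```

Because the header is only present when the limiter can calculate a retry window, the fallback path is expected in normal operation and should not be treated as an error.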
Rate Limit Fields
The example above limits request count. Rate limits attached to caller API keys and model aliases can also limit token usage and in-flight requests. Each field is optional. When a field is omitted, AISIX does not enforce that limit.
| Field | Meaning | Window |
|---|---|---|
| rps | Requests per second | 1 second |
| rpm | Requests per minute | 60 seconds |
| rph | Requests per hour | 3,600 seconds |
| rpd | Requests per day | 86,400 seconds |
| tpm | Tokens per minute | 60 seconds |
| tpd | Tokens per day | 86,400 seconds |
| concurrency | In-flight requests | Not windowed |
Request fields count requests before AISIX sends them upstream. Token fields cap prompt and completion tokens. AISIX records token usage after the upstream response returns provider-reported usage, so a large response can consume remaining token capacity and cause a later request to be rejected. Concurrency caps how many requests can be in progress at the same time.
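The difference between counting requests before the upstream call and recording tokens after it can be seen in a toy limiter. The sketch below uses fixed 60-second windows as a simplifying assumption; this guide does not say which windowing algorithm AISIX uses, and the class and method names are hypothetical.

```python
import time


class WindowLimiter:
    """Toy fixed-window limiter for one scope with optional rpm and tpm caps."""

    def __init__(self, rpm=None, tpm=None, clock=time.monotonic):
        self.rpm, self.tpm, self.clock = rpm, tpm, clock
        self.window = None  # index of the current 60-second window
        self.requests = 0
        self.tokens = 0

    def _roll(self):
        current = int(self.clock() // 60)
        if current != self.window:
            self.window, self.requests, self.tokens = current, 0, 0

    def try_admit(self) -> bool:
        """Before the upstream call: check both caps, then count the request."""
        self._roll()
        if self.rpm is not None and self.requests >= self.rpm:
            return False
        if self.tpm is not None and self.tokens >= self.tpm:
            return False
        self.requests += 1
        return True

    def record_usage(self, prompt_tokens: int, completion_tokens: int) -> None:
        """After the upstream response: add the provider-reported token usage."""
        self._roll()
        self.tokens += prompt_tokens + completion_tokens


now = [0.0]  # fake clock so the example is deterministic
limiter = WindowLimiter(rpm=10, tpm=1000, clock=lambda: now[0])
first = limiter.try_admit()        # admitted: no tokens recorded yet
limiter.record_usage(200, 900)     # one large response overshoots the token cap
blocked = not limiter.try_admit()  # a later request in the same minute is rejected
now[0] = 60.0
reopened = limiter.try_admit()     # the next window starts with fresh counters
```

Because usage is only known after the response, the request that overshoots the token cap is itself served; only later requests in the same window are rejected.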
Token limits support minute and day windows only.
By default, rate-limit counters are local to each gateway process. In multi-instance self-hosted deployments that use the default memory backend, each instance enforces the cap independently, so account for the number of instances when you set caps. When a quota must be enforced more tightly, route the same tenant, caller API key, or model alias to a consistent gateway group.
You can attach the same rate_limit object to a model when the limit should be shared by every caller of that model alias. Add it when you create or update the model through /admin/v1/models. The following example allows up to 300 requests per minute and 20 concurrent requests:
{
  "display_name": "gpt-4o-prod",
  "provider": "openai",
  "model_name": "gpt-4o",
  "provider_key_id": "YOUR_PROVIDER_KEY_ID",
  "rate_limit": {
    "rpm": 300,
    "concurrency": 20
  }
}
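Before sending a rate_limit object to the admin API, a client can check it against the fields listed in the table above. This is a hypothetical pre-flight check, not AISIX's own validation; in particular, the positive-integer rule is an assumption, since this guide does not state how zero or fractional values are handled.

```python
KNOWN_FIELDS = {"rps", "rpm", "rph", "rpd", "tpm", "tpd", "concurrency"}


def check_rate_limit(rate_limit: dict) -> list[str]:
    """Return a list of problems found in a rate_limit object (empty if none)."""
    problems = []
    for field, value in rate_limit.items():
        if field not in KNOWN_FIELDS:
            problems.append(f"unknown field: {field}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        elif isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{field} must be a positive integer")
    return problems


model_limit = {"rpm": 300, "concurrency": 20}
problems = check_rate_limit(model_limit)
```

A typo such as tph is reported as an unknown field, which is useful because token limits only have minute and day windows.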
Choose Counter Storage for Self-Hosted Gateways
Self-hosted gateways use startup configuration to decide where rate-limit counters live.
The default memory backend keeps counters in each gateway process. This is exact for a single gateway instance. In a multi-instance deployment behind a load balancer, each instance counts only the traffic it handles. As a result, the effective cluster-wide cap can be as high as the configured limit multiplied by the number of instances.
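The arithmetic behind that effect is simple enough to sketch. Both helpers below are hypothetical planning aids, and the second assumes the load balancer spreads traffic roughly evenly across instances.

```python
def worst_case_cluster_cap(per_instance_limit: int, instances: int) -> int:
    """Upper bound on cluster-wide throughput with per-process counters."""
    return per_instance_limit * instances


def per_instance_limit_for(target_cluster_cap: int, instances: int) -> int:
    """Per-instance cap that keeps the cluster at or below the target.

    Rounds down; never returns less than 1, so a target smaller than the
    instance count cannot be honoured exactly with the memory backend.
    """
    return max(1, target_cluster_cap // instances)


cluster_cap = worst_case_cluster_cap(300, 4)
instance_cap = per_instance_limit_for(300, 4)
```

Dividing the cap this way keeps the cluster under the target, but an instance that receives more than its share of traffic will start rejecting requests early.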
Use the Redis backend when several gateway instances must enforce one shared quota window:
ratelimit:
  backend: redis
  redis:
    mode: single
    url: redis://127.0.0.1:6379/
With the Redis backend, request, token, and concurrency counters are shared across gateway instances that use the same Redis backend. The gateway requires the ratelimit.redis block when the Redis backend is selected. Concurrency slots are reclaimed after concurrency_ttl_secs, which defaults to 300 seconds.
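The TTL on concurrency slots exists so that a slot is not held forever if it is never released, for example when a gateway process stops mid-request. The in-memory sketch below illustrates that idea only; how AISIX stores slots in Redis is not described in this guide, and the class name and methods are hypothetical.

```python
import time


class ConcurrencySlots:
    """Toy concurrency limiter whose slots expire after a TTL."""

    def __init__(self, limit, ttl_secs=300, clock=time.monotonic):
        self.limit, self.ttl, self.clock = limit, ttl_secs, clock
        self.held = {}  # request id -> time the slot was taken

    def acquire(self, request_id) -> bool:
        now = self.clock()
        # Reclaim slots whose holder never released them.
        self.held = {r: t for r, t in self.held.items() if now - t < self.ttl}
        if len(self.held) >= self.limit:
            return False
        self.held[request_id] = now
        return True

    def release(self, request_id) -> None:
        self.held.pop(request_id, None)


now = [0.0]  # fake clock so the example is deterministic
slots = ConcurrencySlots(limit=1, ttl_secs=300, clock=lambda: now[0])
took_a = slots.acquire("a")     # slot taken, then the holder "crashes"
refused_b = slots.acquire("b")  # limit reached, so b is refused
now[0] = 300.0
took_b = slots.acquire("b")     # a's slot has expired and is reclaimed
```

The trade-off is that a request running longer than the TTL can have its slot reclaimed while it is still in flight, so the TTL should comfortably exceed the longest expected request.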
Redis Connection Modes
The ratelimit.redis.mode field selects how AISIX connects to Redis.
Use single for one Redis endpoint:
ratelimit:
  backend: redis
  redis:
    mode: single
    url: redis://127.0.0.1:6379/
Use cluster for Redis Cluster seed nodes:
ratelimit:
  backend: redis
  redis:
    mode: cluster
    nodes:
      - redis://10.0.0.1:6379/
      - redis://10.0.0.2:6379/
Use sentinel for a Sentinel-managed master:
ratelimit:
  backend: redis
  redis:
    mode: sentinel
    sentinels:
      - redis://10.0.0.1:26379/
      - redis://10.0.0.2:26379/
    master_name: mymaster
For cluster and sentinel modes, set username and password when Redis data nodes require ACL authentication. In sentinel mode, sentinel-node credentials belong in the sentinel URLs, while username, password, and database apply to the discovered Redis master.
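A deployment script can sanity-check the ratelimit.redis block before the gateway starts. The required keys below are inferred only from the three examples above, so treat the table as an assumption rather than AISIX's schema; `missing_redis_keys` is a hypothetical helper that takes the already-parsed redis mapping.

```python
REQUIRED_BY_MODE = {
    "single": ["url"],
    "cluster": ["nodes"],
    "sentinel": ["sentinels", "master_name"],
}


def missing_redis_keys(redis_cfg: dict) -> list[str]:
    """Return the keys that appear to be missing for the selected mode."""
    mode = redis_cfg.get("mode")
    if mode not in REQUIRED_BY_MODE:
        return ["mode"]
    # Treat empty strings and empty lists as missing.
    return [key for key in REQUIRED_BY_MODE[mode] if not redis_cfg.get(key)]


sentinel_cfg = {
    "mode": "sentinel",
    "sentinels": ["redis://10.0.0.1:26379/", "redis://10.0.0.2:26379/"],
}
missing = missing_redis_keys(sentinel_cfg)
```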
AISIX Cloud Rate-Limit Policies
AISIX Cloud supports the caller API key and model limits described above. It also provides shared rate-limit policies for API keys, models, teams, members, and per-member team defaults.
Use a shared policy when you need a second or hour request-count window, or when the quota should apply to a team or member scope. Shared policies use window with max_requests, max_tokens, or both. Request-count limits can use second, minute, or hour windows. Token limits should use a minute window.
The following example shows a shared policy that limits one team bucket to 1,000,000 tokens per minute:
{
  "name": "team-acme-tpm",
  "scope": "team",
  "scope_ref": "team-uuid-acme",
  "window": "minute",
  "max_tokens": 1000000
}
When a shared policy includes both request and token caps, use a minute window so both caps are enforced by the same policy.
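The window rules for shared policies can be captured in a small pre-flight check. This sketch encodes only the rules stated in this section; `check_shared_policy` is a hypothetical client-side helper, not part of AISIX Cloud.

```python
def check_shared_policy(policy: dict) -> list[str]:
    """Return a list of problems with a shared policy's window and caps."""
    problems = []
    window = policy.get("window")
    has_requests = "max_requests" in policy
    has_tokens = "max_tokens" in policy
    if not (has_requests or has_tokens):
        problems.append("set max_requests, max_tokens, or both")
    if has_requests and window not in {"second", "minute", "hour"}:
        problems.append("request-count limits use a second, minute, or hour window")
    if has_tokens and window != "minute":
        problems.append("token limits should use a minute window")
    return problems


team_policy = {
    "name": "team-acme-tpm",
    "scope": "team",
    "scope_ref": "team-uuid-acme",
    "window": "minute",
    "max_tokens": 1000000,
}
policy_problems = check_shared_policy(team_policy)
```

A policy that sets both caps passes only with a minute window, which matches the guidance above.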
Next Steps
You have now configured a caller-specific rate limit and seen how AISIX rejects traffic after the quota is exceeded. Next, continue with Caching to reuse eligible chat-completion responses.