Control AI Costs with Token-Based Rate Limiting
This guide covers how to implement token-based rate limiting for LLM traffic using the ai-rate-limiting plugin. You will learn how to set token budgets per route and per model instance.
Overview
Traditional request-based rate limiting is insufficient for LLM traffic — a single request can consume anywhere from 10 to 100,000 tokens depending on the prompt and response. Token-based rate limiting lets you control costs by budgeting based on actual token consumption.
Prerequisites
- Install Docker.
- Install cURL to send requests to the gateway for validation.
- Have a running API7 Enterprise Gateway instance. See the Getting Started Guide for setup instructions.
Token-Based vs. Request-Based Rate Limiting
| Aspect | Request-Based | Token-Based |
|---|---|---|
| Unit | Number of HTTP requests | Number of LLM tokens consumed |
| Precision | Coarse — all requests treated equally | Fine — limits match actual resource consumption |
| Cost control | Weak — a few verbose requests can exhaust budgets | Strong — directly tied to provider billing |
| Best for | Traditional APIs | LLM traffic |
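The token-based column can be sketched as a fixed-window counter that charges each request by its token usage rather than counting it as 1. The sketch below is a simplified illustration of the idea, not the plugin's actual implementation; the `limit_strategy` values mirror the plugin's `total_tokens`, `prompt_tokens`, and `completion_tokens` options.

```python
import time

class TokenRateLimiter:
    """Simplified fixed-window token budget (illustration only,
    not the ai-rate-limiting plugin's implementation)."""

    def __init__(self, limit, time_window, limit_strategy="total_tokens"):
        self.limit = limit                  # tokens allowed per window
        self.time_window = time_window      # window length in seconds
        self.limit_strategy = limit_strategy
        self.window_start = time.monotonic()
        self.used = 0

    def allow(self, usage):
        """usage: dict like {"prompt_tokens": 120, "completion_tokens": 480}."""
        now = time.monotonic()
        if now - self.window_start >= self.time_window:
            self.window_start, self.used = now, 0  # window reset
        if self.limit_strategy == "total_tokens":
            cost = usage["prompt_tokens"] + usage["completion_tokens"]
        else:
            cost = usage[self.limit_strategy]
        if self.used + cost > self.limit:
            return False  # the gateway would answer HTTP 429 here
        self.used += cost
        return True

limiter = TokenRateLimiter(limit=10000, time_window=3600)
print(limiter.allow({"prompt_tokens": 2000, "completion_tokens": 7000}))  # True
print(limiter.allow({"prompt_tokens": 500, "completion_tokens": 1000}))   # False: 9000 + 1500 > 10000
```

Note how a single verbose request (9,000 tokens) consumes most of the budget, while a request-based limiter would have counted it the same as a one-token request.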
Configure Token Rate Limits
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "ai-rate-limited",
  "service_id": "'"$SERVICE_ID"'",
"paths": ["/ai"],
"plugins": {
"ai-proxy": {
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" }
},
"ai-rate-limiting": {
"limit_strategy": "total_tokens",
"limit": 10000,
"time_window": 3600
}
}
}'
❶ Count total tokens (prompt + completion). Other options: "prompt_tokens" or "completion_tokens".
❷ Allow 10,000 tokens per time window.
❸ Reset the limit every 3,600 seconds (1 hour).
services:
- name: AI Rate Limited
routes:
- uris:
- /ai
name: ai-rate-limited
plugins:
ai-proxy:
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
ai-rate-limiting:
limit_strategy: total_tokens
limit: 10000
time_window: 3600
❶ Count total tokens (prompt + completion). Other options: "prompt_tokens" or "completion_tokens".
❷ Allow 10,000 tokens per time window.
❸ Reset the limit every 3,600 seconds (1 hour).
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
When the limit is exceeded, the gateway returns HTTP 429 with rate limit headers:
- X-AI-RateLimit-Limit-{name} — The configured token limit.
- X-AI-RateLimit-Remaining-{name} — Tokens remaining in the current window.
- X-AI-RateLimit-Reset-{name} — Seconds until the window resets.
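Clients can use these headers to back off until the window resets instead of retrying blindly. A minimal sketch of the client side, assuming the {name} suffix resolves to ai-rate-limiting (the suffix depends on how the limit is named in your configuration):

```python
def retry_after_seconds(status_code, headers, name="ai-rate-limiting"):
    """Return how long a client should wait before retrying, based on the
    gateway's rate limit headers. 0 means the request can proceed now."""
    if status_code != 429:
        return 0
    reset = headers.get(f"X-AI-RateLimit-Reset-{name}")
    return int(reset) if reset is not None else 1  # fall back to a short wait

# Hypothetical 429 response whose window resets in 42 seconds.
headers = {
    "X-AI-RateLimit-Limit-ai-rate-limiting": "10000",
    "X-AI-RateLimit-Remaining-ai-rate-limiting": "0",
    "X-AI-RateLimit-Reset-ai-rate-limiting": "42",
}
print(retry_after_seconds(429, headers))  # 42
print(retry_after_seconds(200, headers))  # 0
```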
Per-Instance Rate Limits (Multi-Model)
When using ai-proxy-multi, set per-instance token budgets to limit expensive models more aggressively:
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "ai-per-instance-limits",
  "service_id": "'"$SERVICE_ID"'",
"paths": ["/ai"],
"plugins": {
"ai-proxy-multi": {
"instances": [
{
"name": "gpt-4o-mini",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o-mini" },
"weight": 1
},
{
"name": "gpt-4o",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" },
"weight": 1
}
]
},
"ai-rate-limiting": {
"limit_strategy": "total_tokens",
"instances": [
{
"name": "gpt-4o",
"limit": 5000,
"time_window": 3600
}
]
}
}
}'
❶ Apply the rate limit only to the gpt-4o instance with its own limit and time_window. Traffic to gpt-4o-mini is not affected by this limit.
services:
- name: AI Per-Instance Limits
routes:
- uris:
- /ai
name: ai-per-instance-limits
plugins:
ai-proxy-multi:
instances:
- name: gpt-4o-mini
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o-mini
weight: 1
- name: gpt-4o
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
weight: 1
ai-rate-limiting:
limit_strategy: total_tokens
instances:
- name: gpt-4o
limit: 5000
time_window: 3600
❶ Apply the rate limit only to the gpt-4o instance with its own limit and time_window. Traffic to gpt-4o-mini is not affected by this limit.
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
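The per-instance behavior configured above can be sketched as a lookup table of budgets: only instances listed under ai-rate-limiting are counted, and the rest pass through unlimited. This is an illustration of the semantics, not the plugin's code, and it omits window resets for brevity.

```python
# Only gpt-4o has a configured budget; gpt-4o-mini is unlimited.
instance_limits = {"gpt-4o": {"limit": 5000, "used": 0}}

def allow(instance, tokens):
    entry = instance_limits.get(instance)
    if entry is None:
        return True  # no per-instance limit configured for this instance
    if entry["used"] + tokens > entry["limit"]:
        return False  # the gateway would answer 429 for this instance
    entry["used"] += tokens
    return True

print(allow("gpt-4o", 4000))       # True
print(allow("gpt-4o", 2000))       # False: 4000 + 2000 > 5000
print(allow("gpt-4o-mini", 9999))  # True: no limit configured
```

This is why the pattern suits mixed fleets: the expensive model gets a hard budget while the cheaper model keeps absorbing traffic.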
Scaling with Redis
For multi-instance gateway deployments, use Redis to share rate limit counters across Data Plane nodes:
Redis-backed rate limiting (policy, redis_host, redis_port, allow_degradation) is available in API7 Enterprise edition only.
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "ai-rate-redis",
  "service_id": "'"$SERVICE_ID"'",
"paths": ["/ai"],
"plugins": {
"ai-proxy": {
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" }
},
"ai-rate-limiting": {
"limit_strategy": "total_tokens",
"limit": 100000,
"time_window": 3600,
"policy": "redis",
"redis_host": "redis.example.com",
"redis_port": 6379,
"allow_degradation": true
}
}
}'
❶ Set the policy to redis for shared counters. Other options: "redis-cluster", "redis-sentinel".
❷ Configure the Redis connection.
❸ When set to true, the gateway continues serving requests if Redis is unavailable (fail-open).
services:
- name: AI Rate Redis
routes:
- uris:
- /ai
name: ai-rate-redis
plugins:
ai-proxy:
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
ai-rate-limiting:
limit_strategy: total_tokens
limit: 100000
time_window: 3600
policy: redis
redis_host: redis.example.com
redis_port: 6379
allow_degradation: true
❶ Set the policy to redis for shared counters. Other options: "redis-cluster", "redis-sentinel".
❷ Configure the Redis connection.
❸ When set to true, the gateway continues serving requests if Redis is unavailable (fail-open).
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
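Conceptually, the redis policy moves the token counter into shared storage so every Data Plane node increments the same value, and allow_degradation decides what happens when that storage is unreachable. A minimal sketch of the fail-open behavior, with a stub standing in for a Redis INCRBY call (illustration only, not the plugin's implementation):

```python
class SharedCounter:
    """Stand-in for a Redis counter shared by all gateway nodes."""
    def __init__(self):
        self.value = 0
        self.available = True  # flip to False to simulate a Redis outage

    def incrby(self, n):
        if not self.available:
            raise ConnectionError("redis unavailable")
        self.value += n
        return self.value

def allow(counter, tokens, limit, allow_degradation=True):
    try:
        return counter.incrby(tokens) <= limit
    except ConnectionError:
        # allow_degradation=true: keep serving requests when Redis is down
        # (fail-open); false would reject them instead (fail-closed).
        return allow_degradation

counter = SharedCounter()
print(allow(counter, 60000, 100000))  # True: 60000 <= 100000
print(allow(counter, 60000, 100000))  # False: 120000 > 100000
counter.available = False
print(allow(counter, 1, 100000))      # True: fail-open during the outage
```

Fail-open trades strict budget enforcement for availability during a Redis outage; choose it when dropped traffic costs more than a temporarily unenforced limit.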
Verify
Send requests until the rate limit is reached:
curl "http://127.0.0.1:9080/ai" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "user", "content": "Write a 500-word essay about API gateways." }
]
}'
After exceeding the token limit, the gateway returns:
HTTP/1.1 429 Too Many Requests
Next Steps
- AI Observability and Cost Tracking — Monitor token consumption and build cost dashboards.
- Multi-LLM Routing and Fallback — Combine rate limiting with multi-model routing.
- For the full configuration reference, see ai-rate-limiting.