Control AI Costs with Token-Based Rate Limiting

This guide covers how to implement token-based rate limiting for LLM traffic using the ai-rate-limiting plugin. You will learn how to set token budgets per route and per model instance.

Overview

Traditional request-based rate limiting is insufficient for LLM traffic — a single request can consume anywhere from 10 to 100,000 tokens depending on the prompt and response. Token-based rate limiting lets you control costs by budgeting based on actual token consumption.
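The idea behind token budgeting can be illustrated with a minimal fixed-window counter. This is a sketch of the concept only, not the plugin's implementation; the names (`consume`, `LIMIT`, `WINDOW`) are hypothetical:

```shell
LIMIT=10000        # token budget per window (cf. "limit" below)
WINDOW=3600        # window length in seconds (cf. "time_window" below)
used=0
window_start=$(date +%s)

consume() {
  tokens=$1
  now=$(date +%s)
  # Reset the counter once the window has elapsed
  if [ $((now - window_start)) -ge "$WINDOW" ]; then
    used=0
    window_start=$now
  fi
  # Reject the request if it would push consumption over the budget
  if [ $((used + tokens)) -gt "$LIMIT" ]; then
    echo "429: over budget ($used/$LIMIT tokens used)"
    return 1
  fi
  used=$((used + tokens))
  echo "ok: $used/$LIMIT tokens used"
}

consume 4000           # ok: 4000/10000 tokens used
consume 4000           # ok: 8000/10000 tokens used
consume 4000 || true   # 429: over budget (8000/10000 tokens used)
```

Note that one verbose request can consume as much budget as hundreds of short ones, which is exactly what a request counter cannot see.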

Prerequisites

  • Install Docker.
  • Install cURL to send requests to the services for validation.
  • Have a running API7 Enterprise Gateway instance. See the Getting Started Guide for setup instructions.

Token-Based vs. Request-Based Rate Limiting

| Aspect | Request-Based | Token-Based |
| --- | --- | --- |
| Unit | Number of HTTP requests | Number of LLM tokens consumed |
| Precision | Coarse: all requests treated equally | Fine: limits match actual resource consumption |
| Cost control | Weak: a few verbose requests can exhaust budgets | Strong: directly tied to provider billing |
| Best for | Traditional APIs | LLM traffic |

Configure Token Rate Limits

curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
  -H "X-API-KEY: $ADMIN_API_KEY" \
  -d '{
    "id": "ai-rate-limited",
    "service_id": "'"$SERVICE_ID"'",
    "paths": ["/ai"],
    "plugins": {
      "ai-proxy": {
        "provider": "openai",
        "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
        "options": { "model": "gpt-4o" }
      },
      "ai-rate-limiting": {
        "limit_strategy": "total_tokens",
        "limit": 10000,
        "time_window": 3600
      }
    }
  }'

  • limit_strategy — count total tokens (prompt + completion). Other options: "prompt_tokens" or "completion_tokens".
  • limit — allow 10,000 tokens per time window.
  • time_window — reset the limit every 3,600 seconds (1 hour).

When the limit is exceeded, the gateway returns HTTP 429 with rate limit headers:

  • X-AI-RateLimit-Limit-{name} — The configured token limit.
  • X-AI-RateLimit-Remaining-{name} — Tokens remaining in the current window.
  • X-AI-RateLimit-Reset-{name} — Seconds until the window resets.
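A client can read these headers to decide how long to back off before retrying. A sketch of parsing the reset header with standard tools, assuming a `{name}` suffix of `limiter` (the actual suffix depends on your configuration):

```shell
# Hypothetical 429 response headers; "-limiter" is an assumed {name} suffix.
headers='HTTP/1.1 429 Too Many Requests
X-AI-RateLimit-Limit-limiter: 10000
X-AI-RateLimit-Remaining-limiter: 0
X-AI-RateLimit-Reset-limiter: 42'

# Extract the reset delay so the client knows when to retry
reset=$(printf '%s\n' "$headers" | awk -F': ' '/X-AI-RateLimit-Reset/ {print $2}')
echo "retry in ${reset}s"   # retry in 42s
```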

Per-Instance Rate Limits (Multi-Model)

When using ai-proxy-multi, set per-instance token budgets to limit expensive models more aggressively:

curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
  -H "X-API-KEY: $ADMIN_API_KEY" \
  -d '{
    "id": "ai-per-instance-limits",
    "service_id": "'"$SERVICE_ID"'",
    "paths": ["/ai"],
    "plugins": {
      "ai-proxy-multi": {
        "instances": [
          {
            "name": "gpt-4o-mini",
            "provider": "openai",
            "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
            "options": { "model": "gpt-4o-mini" },
            "weight": 1
          },
          {
            "name": "gpt-4o",
            "provider": "openai",
            "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
            "options": { "model": "gpt-4o" },
            "weight": 1
          }
        ]
      },
      "ai-rate-limiting": {
        "limit_strategy": "total_tokens",
        "instances": [
          {
            "name": "gpt-4o",
            "limit": 5000,
            "time_window": 3600
          }
        ]
      }
    }
  }'

The instances entry in ai-rate-limiting applies the rate limit only to the gpt-4o instance, with its own limit and time_window. Traffic to gpt-4o-mini is not affected by this limit.
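The effect of the per-instance configuration above can be sketched with illustrative counters (names like `route_tokens` and `gpt4o_used` are hypothetical, not plugin internals):

```shell
gpt4o_limit=5000   # mirrors the per-instance "limit" above
gpt4o_used=0

route_tokens() {
  instance=$1; tokens=$2
  case $instance in
    gpt-4o)
      # Only gpt-4o has a budget in this route
      if [ $((gpt4o_used + tokens)) -gt "$gpt4o_limit" ]; then
        echo "gpt-4o: 429"
        return 1
      fi
      gpt4o_used=$((gpt4o_used + tokens))
      echo "gpt-4o: ok ($gpt4o_used/$gpt4o_limit)"
      ;;
    *)
      # gpt-4o-mini has no per-instance limit configured
      echo "$instance: ok (unlimited)"
      ;;
  esac
}

route_tokens gpt-4o 3000              # gpt-4o: ok (3000/5000)
route_tokens gpt-4o 3000 || true      # gpt-4o: 429
route_tokens gpt-4o-mini 3000         # gpt-4o-mini: ok (unlimited)
```

This is the typical pattern: budget the expensive model tightly while letting the cheaper fallback absorb overflow traffic.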

Scaling with Redis

For multi-instance gateway deployments, use Redis to share rate limit counters across Data Plane nodes:

Enterprise Only

Redis-backed rate limiting (policy, redis_host, redis_port, allow_degradation) is available in API7 Enterprise edition only.

curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
  -H "X-API-KEY: $ADMIN_API_KEY" \
  -d '{
    "id": "ai-rate-redis",
    "service_id": "'"$SERVICE_ID"'",
    "paths": ["/ai"],
    "plugins": {
      "ai-proxy": {
        "provider": "openai",
        "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
        "options": { "model": "gpt-4o" }
      },
      "ai-rate-limiting": {
        "limit_strategy": "total_tokens",
        "limit": 100000,
        "time_window": 3600,
        "policy": "redis",
        "redis_host": "redis.example.com",
        "redis_port": 6379,
        "allow_degradation": true
      }
    }
  }'

  • policy — set to redis for shared counters. Other options: "redis-cluster", "redis-sentinel".
  • redis_host and redis_port — the Redis connection details.
  • allow_degradation — when set to true, the gateway continues serving requests if Redis is unavailable (fail-open).
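The fail-open decision that allow_degradation enables can be sketched as follows. This is illustrative logic only, not the plugin's implementation; `redis_get` and `check_budget` are hypothetical stand-ins:

```shell
redis_get() {
  # Stand-in for reading the shared counter from Redis;
  # always fails here to simulate an unreachable Redis.
  return 1
}

ALLOW_DEGRADATION=true   # mirrors allow_degradation in the route above

check_budget() {
  if remaining=$(redis_get "ai-rate:total_tokens"); then
    # Normal path: enforce the shared budget
    if [ "$remaining" -gt 0 ]; then echo "allow"; else echo "reject (429)"; fi
  elif [ "$ALLOW_DEGRADATION" = "true" ]; then
    echo "allow (Redis down, failing open)"
  else
    echo "reject (Redis down, failing closed)"
  fi
}

check_budget   # allow (Redis down, failing open)
```

Failing open trades strict cost control for availability during a Redis outage; set allow_degradation to false if exceeding the budget is worse than rejecting traffic.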

Verify

Send requests until the rate limit is reached:

curl "http://127.0.0.1:9080/ai" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Write a 500-word essay about API gateways." }
    ]
  }'

After exceeding the token limit, the gateway returns:

HTTP/1.1 429 Too Many Requests
