Control AI Costs with Token-Based Rate Limiting
This guide covers how to implement token-based rate limiting for LLM traffic using the ai-rate-limiting plugin. You will learn how to set token budgets per route and per model instance.
Overview
Traditional request-based rate limiting is insufficient for LLM traffic — a single request can consume anywhere from 10 to 100,000 tokens depending on the prompt and response. Token-based rate limiting lets you control costs by budgeting based on actual token consumption.
Prerequisites
- Install Docker.
- Install cURL to send requests to the gateway for validation.
- Have a running API7 Enterprise Gateway instance. See the Getting Started Guide for setup instructions.
Token-Based vs. Request-Based Rate Limiting
| Aspect | Request-Based | Token-Based |
|---|---|---|
| Unit | Number of HTTP requests | Number of LLM tokens consumed |
| Precision | Coarse — all requests treated equally | Fine — limits match actual resource consumption |
| Cost control | Weak — a few verbose requests can exhaust budgets | Strong — directly tied to provider billing |
| Best for | Traditional APIs | LLM traffic |
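The token-based column can be sketched as a fixed-window counter that charges each request by its token usage rather than counting it as 1. The sketch below is a simplified illustration of the idea, not the plugin's actual implementation; the `limit_strategy` values mirror the plugin's `total_tokens`, `prompt_tokens`, and `completion_tokens` options.

```python
import time

class TokenRateLimiter:
    """Simplified fixed-window token budget (illustration only,
    not the ai-rate-limiting plugin's implementation)."""

    def __init__(self, limit, time_window, limit_strategy="total_tokens"):
        self.limit = limit                  # tokens allowed per window
        self.time_window = time_window      # window length in seconds
        self.limit_strategy = limit_strategy
        self.window_start = time.monotonic()
        self.used = 0

    def allow(self, usage):
        """usage: dict like {"prompt_tokens": 120, "completion_tokens": 480}."""
        now = time.monotonic()
        if now - self.window_start >= self.time_window:
            self.window_start, self.used = now, 0  # window reset
        if self.limit_strategy == "total_tokens":
            cost = usage["prompt_tokens"] + usage["completion_tokens"]
        else:
            cost = usage[self.limit_strategy]
        if self.used + cost > self.limit:
            return False  # the gateway would answer HTTP 429 here
        self.used += cost
        return True

limiter = TokenRateLimiter(limit=10000, time_window=3600)
print(limiter.allow({"prompt_tokens": 2000, "completion_tokens": 7000}))  # True
print(limiter.allow({"prompt_tokens": 500, "completion_tokens": 1000}))   # False: 9000 + 1500 > 10000
```

Note how a single verbose request (9,000 tokens) consumes most of the budget, while a request-based limiter would have counted it the same as a one-token request.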
Configure Token Rate Limits
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "ai-rate-limited",
  "service_id": "'"$SERVICE_ID"'",
"paths": ["/ai"],
"plugins": {
"ai-proxy": {
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" }
},
"ai-rate-limiting": {
"limit_strategy": "total_tokens",
"limit": 10000,
"time_window": 3600
}
}
}'
❶ Count total tokens (prompt + completion). Other options: "prompt_tokens" or "completion_tokens".
❷ Allow 10,000 tokens per time window.
❸ Reset the limit every 3,600 seconds (1 hour).
services:
- name: AI Rate Limited
routes:
- uris:
- /ai
name: ai-rate-limited
plugins:
ai-proxy:
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
ai-rate-limiting:
limit_strategy: total_tokens
limit: 10000
time_window: 3600
❶ Count total tokens (prompt + completion). Other options: "prompt_tokens" or "completion_tokens".
❷ Allow 10,000 tokens per time window.
❸ Reset the limit every 3,600 seconds (1 hour).
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
When the limit is exceeded, the gateway returns HTTP 429 with rate limit headers:
- X-AI-RateLimit-Limit-{name} — The configured token limit.
- X-AI-RateLimit-Remaining-{name} — Tokens remaining in the current window.
- X-AI-RateLimit-Reset-{name} — Seconds until the window resets.
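Clients can use these headers to back off until the window resets instead of retrying blindly. A minimal sketch of the client side, assuming the {name} suffix resolves to ai-rate-limiting (the suffix depends on how the limit is named in your configuration):

```python
def retry_after_seconds(status_code, headers, name="ai-rate-limiting"):
    """Return how long a client should wait before retrying, based on the
    gateway's rate limit headers. 0 means the request can proceed now."""
    if status_code != 429:
        return 0
    reset = headers.get(f"X-AI-RateLimit-Reset-{name}")
    return int(reset) if reset is not None else 1  # fall back to a short wait

# Hypothetical 429 response whose window resets in 42 seconds.
headers = {
    "X-AI-RateLimit-Limit-ai-rate-limiting": "10000",
    "X-AI-RateLimit-Remaining-ai-rate-limiting": "0",
    "X-AI-RateLimit-Reset-ai-rate-limiting": "42",
}
print(retry_after_seconds(429, headers))  # 42
print(retry_after_seconds(200, headers))  # 0
```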
Per-Instance Rate Limits (Multi-Model)
When using ai-proxy-multi, set per-instance token budgets to limit expensive models more aggressively:
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "ai-per-instance-limits",
  "service_id": "'"$SERVICE_ID"'",
"paths": ["/ai"],
"plugins": {
"ai-proxy-multi": {
"instances": [
{
"name": "gpt-4o-mini",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o-mini" },
"weight": 1
},
{
"name": "gpt-4o",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" },
"weight": 1
}
]
},
"ai-rate-limiting": {
"limit_strategy": "total_tokens",
"instances": [
{
"name": "gpt-4o",
"limit": 5000,
"time_window": 3600
}
]
}
}
}'
❶ Apply the rate limit only to the gpt-4o instance with its own limit and time_window. Traffic to gpt-4o-mini is not affected by this limit.
services:
- name: AI Per-Instance Limits
routes:
- uris:
- /ai
name: ai-per-instance-limits
plugins:
ai-proxy-multi:
instances:
- name: gpt-4o-mini
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o-mini
weight: 1
- name: gpt-4o
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
weight: 1
ai-rate-limiting:
limit_strategy: total_tokens
instances:
- name: gpt-4o
limit: 5000
time_window: 3600
❶ Apply the rate limit only to the gpt-4o instance with its own limit and time_window. Traffic to gpt-4o-mini is not affected by this limit.
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
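The per-instance behavior configured above can be sketched as a lookup table of budgets: only instances listed under ai-rate-limiting are counted, and the rest pass through unlimited. This is an illustration of the semantics, not the plugin's code, and it omits window resets for brevity.

```python
# Only gpt-4o has a configured budget; gpt-4o-mini is unlimited.
instance_limits = {"gpt-4o": {"limit": 5000, "used": 0}}

def allow(instance, tokens):
    entry = instance_limits.get(instance)
    if entry is None:
        return True  # no per-instance limit configured for this instance
    if entry["used"] + tokens > entry["limit"]:
        return False  # the gateway would answer 429 for this instance
    entry["used"] += tokens
    return True

print(allow("gpt-4o", 4000))       # True
print(allow("gpt-4o", 2000))       # False: 4000 + 2000 > 5000
print(allow("gpt-4o-mini", 9999))  # True: no limit configured
```

This is why the pattern suits mixed fleets: the expensive model gets a hard budget while the cheaper model keeps absorbing traffic.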
Scaling with Redis
For multi-instance gateway deployments, use Redis to share rate limit counters across Data Plane nodes:
Redis-backed rate limiting (policy, redis_host, redis_port, allow_degradation) is available in API7 Enterprise edition only.
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "ai-rate-redis",
  "service_id": "'"$SERVICE_ID"'",
"paths": ["/ai"],
"plugins": {
"ai-proxy": {
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" }
},
"ai-rate-limiting": {
"limit_strategy": "total_tokens",
"limit": 100000,
"time_window": 3600,
"policy": "redis",
"redis_host": "redis.example.com",
"redis_port": 6379,
"allow_degradation": true
}
}
}'
❶ Set the policy to redis for shared counters. Other options: "redis-cluster", "redis-sentinel".
❷ Configure the Redis connection.
❸ When set to true, the gateway continues serving requests if Redis is unavailable (fail-open).
services:
- name: AI Rate Redis
routes:
- uris:
- /ai
name: ai-rate-redis
plugins:
ai-proxy:
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
ai-rate-limiting:
limit_strategy: total_tokens
limit: 100000
time_window: 3600
policy: redis
redis_host: redis.example.com
redis_port: 6379
allow_degradation: true
❶ Set the policy to redis for shared counters. Other options: "redis-cluster", "redis-sentinel".
❷ Configure the Redis connection.
❸ When set to true, the gateway continues serving requests if Redis is unavailable (fail-open).
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
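Conceptually, the redis policy moves the token counter into shared storage so every Data Plane node increments the same value, and allow_degradation decides what happens when that storage is unreachable. A minimal sketch of the fail-open behavior, with a stub standing in for a Redis INCRBY call (illustration only, not the plugin's implementation):

```python
class SharedCounter:
    """Stand-in for a Redis counter shared by all gateway nodes."""
    def __init__(self):
        self.value = 0
        self.available = True  # flip to False to simulate a Redis outage

    def incrby(self, n):
        if not self.available:
            raise ConnectionError("redis unavailable")
        self.value += n
        return self.value

def allow(counter, tokens, limit, allow_degradation=True):
    try:
        return counter.incrby(tokens) <= limit
    except ConnectionError:
        # allow_degradation=true: keep serving requests when Redis is down
        # (fail-open); false would reject them instead (fail-closed).
        return allow_degradation

counter = SharedCounter()
print(allow(counter, 60000, 100000))  # True: 60000 <= 100000
print(allow(counter, 60000, 100000))  # False: 120000 > 100000
counter.available = False
print(allow(counter, 1, 100000))      # True: fail-open during the outage
```

Fail-open trades strict budget enforcement for availability during a Redis outage; choose it when dropped traffic costs more than a temporarily unenforced limit.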
Verify
Send requests until the rate limit is reached:
curl "http://127.0.0.1:9080/ai" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "user", "content": "Write a 500-word essay about API gateways." }
]
}'
After exceeding the token limit, the gateway returns:
HTTP/1.1 429 Too Many Requests
Next Steps
- AI Observability and Cost Tracking — Monitor token consumption and build cost dashboards.
- Multi-LLM Routing and Fallback — Combine rate limiting with multi-model routing.
- For the full configuration reference, see ai-rate-limiting.