Set Up Multi-LLM Routing and Automatic Fallback
This guide covers how to distribute AI traffic across multiple models and providers using the ai-proxy-multi plugin. You will learn how to configure weighted load balancing, automatic failover, and priority-based routing.
Overview
Relying on a single LLM provider creates risks: outages, rate limit exhaustion, and cost spikes. The ai-proxy-multi plugin solves this by routing traffic across multiple model instances with configurable load balancing, health checks, and fallback strategies.
Common use cases:
- Cost optimization — Route most traffic to a cheaper model, fall back to a premium model for quality.
- High availability — Automatic failover when a provider is down or rate-limited.
- Capacity distribution — Spread load across providers to stay under individual rate limits.
Prerequisites
- Install Docker.
- Install cURL to send requests to the gateway for validation.
- Have a running API7 Enterprise Gateway instance. See the Getting Started Guide for setup instructions.
Weighted Load Balancing
Distribute traffic across models based on cost and performance trade-offs:
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "multi-llm-weighted",
"service_id": "$SERVICE_ID",
"paths": ["/ai"],
"plugins": {
"ai-proxy-multi": {
"balancer": {
"algorithm": "roundrobin"
},
"instances": [
{
"name": "gpt-4o-mini",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o-mini" },
"weight": 8
},
{
"name": "gpt-4o",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" },
"weight": 2
}
]
}
}
}'
❶ Set the balancer algorithm. roundrobin distributes requests based on instance weights.
❷ Assign weight 8 to gpt-4o-mini — roughly 80% of traffic goes here.
❸ Assign weight 2 to gpt-4o — 20% of traffic for premium quality.
services:
- name: Multi-LLM Weighted
routes:
- uris:
- /ai
name: multi-llm-weighted
plugins:
ai-proxy-multi:
balancer:
algorithm: roundrobin
instances:
- name: gpt-4o-mini
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o-mini
weight: 8
- name: gpt-4o
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
weight: 2
❶ Set the balancer algorithm. roundrobin distributes requests based on instance weights.
❷ Assign weight 8 to gpt-4o-mini — roughly 80% of traffic goes here.
❸ Assign weight 2 to gpt-4o — 20% of traffic for premium quality.
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
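As a quick check on the weights above: an instance's expected share of traffic is its weight divided by the sum of all weights. A minimal sketch of that arithmetic:

```shell
# Expected share = weight / sum(weights); weights match the example above.
awk 'BEGIN {
  w_mini = 8; w_full = 2
  total = w_mini + w_full                      # total weight = 10
  printf "gpt-4o-mini: %d%%\n", w_mini * 100 / total
  printf "gpt-4o: %d%%\n", w_full * 100 / total
}'
# Prints:
# gpt-4o-mini: 80%
# gpt-4o: 20%
```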
Automatic Failover
Configure fallback strategies so that traffic reroutes automatically when a provider is unavailable or rate-limited.
The fallback_strategy field supports two modes:
- Single strategy (string): "instance_health_and_rate_limiting", "http_429", or "http_5xx".
- Combined strategy (array): triggers fallback on any matched condition, for example ["rate_limiting", "http_429", "http_5xx"].
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "multi-llm-failover",
"service_id": "$SERVICE_ID",
"paths": ["/ai"],
"plugins": {
"ai-proxy-multi": {
"fallback_strategy": ["http_429", "http_5xx"],
"instances": [
{
"name": "openai-primary",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" },
"weight": 1,
"priority": 1
},
{
"name": "anthropic-fallback",
"provider": "anthropic",
"auth": { "header": { "Authorization": "Bearer '"$ANTHROPIC_API_KEY"'" } },
"options": { "model": "claude-sonnet-4-20250514" },
"weight": 1,
"priority": 2
}
]
}
}
}'
❶ Trigger fallback when the current instance returns HTTP 429 (rate limited) or 5xx (server error).
❷ OpenAI is the primary instance with the highest priority (1).
❸ Anthropic serves as fallback with lower priority (2). Traffic routes here only when OpenAI returns 429 or 5xx.
services:
- name: Multi-LLM Failover
routes:
- uris:
- /ai
name: multi-llm-failover
plugins:
ai-proxy-multi:
fallback_strategy:
- http_429
- http_5xx
instances:
- name: openai-primary
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
weight: 1
priority: 1
- name: anthropic-fallback
provider: anthropic
auth:
header:
Authorization: "Bearer sk-ant-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: claude-sonnet-4-20250514
weight: 1
priority: 2
❶ Trigger fallback when the current instance returns HTTP 429 (rate limited) or 5xx (server error).
❷ OpenAI is the primary instance with the highest priority (1).
❸ Anthropic serves as fallback with lower priority (2). Traffic routes here only when OpenAI returns 429 or 5xx.
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
Cross-Provider Routing Strategies
Combine different providers for specific goals:
Cost Optimization
Route to the cheapest model first, with a premium fallback:
| Instance | Provider | Model | Priority | Purpose |
|---|---|---|---|---|
| deepseek-primary | DeepSeek | deepseek-chat | 1 | Lowest cost per token |
| gpt-4o-mini-secondary | OpenAI | gpt-4o-mini | 2 | Moderate cost fallback |
| gpt-4o-premium | OpenAI | gpt-4o | 3 | Highest quality fallback |
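The tiering above can be sketched as an ADC configuration. This is illustrative only: the `deepseek` provider name, the placeholder API keys, and the reuse of the failover section's fallback_strategy are assumptions.

```yaml
services:
  - name: Multi-LLM Cost Optimized
    routes:
      - uris:
          - /ai
        name: multi-llm-cost-optimized
        plugins:
          ai-proxy-multi:
            fallback_strategy:
              - http_429
              - http_5xx
            instances:
              - name: deepseek-primary
                provider: deepseek
                auth:
                  header:
                    Authorization: "Bearer sk-xxxxxxxxxxxxxxxxxxxxxxxx"
                options:
                  model: deepseek-chat
                weight: 1
                priority: 1
              - name: gpt-4o-mini-secondary
                provider: openai
                auth:
                  header:
                    Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
                options:
                  model: gpt-4o-mini
                weight: 1
                priority: 2
              - name: gpt-4o-premium
                provider: openai
                auth:
                  header:
                    Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
                options:
                  model: gpt-4o
                weight: 1
                priority: 3
```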
Capacity Distribution
Spread load across providers to stay under individual rate limits:
| Instance | Provider | Model | Weight | Purpose |
|---|---|---|---|---|
| openai-pool | OpenAI | gpt-4o | 5 | 50% of traffic |
| anthropic-pool | Anthropic | claude-sonnet-4-20250514 | 3 | 30% of traffic |
| deepseek-pool | DeepSeek | deepseek-chat | 2 | 20% of traffic |
Response Streaming
The ai-proxy-multi plugin handles Server-Sent Events (SSE) streaming transparently. When a client sends "stream": true, the gateway streams tokens from whichever instance handles the request, regardless of the provider.
No additional configuration is required for streaming with multi-model routing. Use the proxy-buffering plugin to disable NGINX proxy_buffering if SSE events are being buffered.
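To try streaming through the multi-model route, set "stream": true in the request body and pass curl's -N flag to disable client-side buffering. A sketch, assuming one of the routes configured above is active:

```shell
curl "http://127.0.0.1:9080/ai" -N -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      { "role": "user", "content": "Hello" }
    ]
  }'
```

Tokens arrive as SSE data: events from whichever instance handled the request.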
Verify
Send a request to test the multi-model route:
curl "http://127.0.0.1:9080/ai" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "user", "content": "Hello" }
]
}'
You should receive a response from one of the configured instances. The model field in the response indicates which instance handled the request.
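To observe the traffic split from the weighted example, send a batch of requests and tally the model field across responses. A sketch, assuming the weighted route is active and the responses are OpenAI-compatible JSON:

```shell
# Send 50 requests and count which model served each one.
for i in $(seq 1 50); do
  curl -s "http://127.0.0.1:9080/ai" -X POST \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}]}' |
    grep -o '"model": *"[^"]*"'
done | sort | uniq -c
```

With weights 8 and 2, roughly 40 of the 50 responses should name gpt-4o-mini.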
Next Steps
- Token Rate Limiting — Set per-instance token budgets.
- AI Observability — Monitor which instances handle traffic and track costs.
- ai-proxy-multi — Full configuration reference for the plugin.