Set Up Multi-LLM Routing and Automatic Fallback
This guide covers how to distribute AI traffic across multiple models and providers using the ai-proxy-multi plugin. You will learn how to configure weighted load balancing, automatic failover, and priority-based routing.
Overview
Relying on a single LLM provider creates risks: outages, rate limit exhaustion, and cost spikes. The ai-proxy-multi plugin solves this by routing traffic across multiple model instances with configurable load balancing, health checks, and fallback strategies.
Common use cases:
- Cost optimization — Route most traffic to a cheaper model, fall back to a premium model for quality.
- High availability — Automatic failover when a provider is down or rate-limited.
- Capacity distribution — Spread load across providers to stay under individual rate limits.
Prerequisites
- Install Docker.
- Install cURL to send requests to the gateway for validation.
- Have a running API7 Enterprise Gateway instance. See the Getting Started Guide for setup instructions.
Weighted Load Balancing
Distribute traffic across models based on cost and performance trade-offs:
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "multi-llm-weighted",
"service_id": "$SERVICE_ID",
"paths": ["/ai"],
"plugins": {
"ai-proxy-multi": {
"balancer": {
"algorithm": "roundrobin"
},
"instances": [
{
"name": "gpt-4o-mini",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o-mini" },
"weight": 8
},
{
"name": "gpt-4o",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" },
"weight": 2
}
]
}
}
}'
❶ Set the balancer algorithm. roundrobin distributes requests based on instance weights.
❷ Assign weight 8 to gpt-4o-mini — roughly 80% of traffic goes here.
❸ Assign weight 2 to gpt-4o — 20% of traffic for premium quality.
services:
- name: Multi-LLM Weighted
routes:
- uris:
- /ai
name: multi-llm-weighted
plugins:
ai-proxy-multi:
balancer:
algorithm: roundrobin
instances:
- name: gpt-4o-mini
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o-mini
weight: 8
- name: gpt-4o
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
weight: 2
❶ Set the balancer algorithm. roundrobin distributes requests based on instance weights.
❷ Assign weight 8 to gpt-4o-mini — roughly 80% of traffic goes here.
❸ Assign weight 2 to gpt-4o — 20% of traffic for premium quality.
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
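As a quick check on the weights above: an instance's expected share of traffic is its weight divided by the sum of all weights. A minimal sketch of that arithmetic:

```shell
# Expected share = weight / sum(weights); weights match the example above.
awk 'BEGIN {
  w_mini = 8; w_full = 2
  total = w_mini + w_full                      # total weight = 10
  printf "gpt-4o-mini: %d%%\n", w_mini * 100 / total
  printf "gpt-4o: %d%%\n", w_full * 100 / total
}'
# Prints:
# gpt-4o-mini: 80%
# gpt-4o: 20%
```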
Automatic Failover
Configure fallback strategies so that traffic reroutes automatically when a provider is unavailable or rate-limited.
The fallback_strategy field supports two modes:
- Single strategy (string): "instance_health_and_rate_limiting", "http_429", or "http_5xx".
- Combined strategy (array): triggers fallback on any matched condition, for example ["rate_limiting", "http_429", "http_5xx"].
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "multi-llm-failover",
"service_id": "$SERVICE_ID",
"paths": ["/ai"],
"plugins": {
"ai-proxy-multi": {
"fallback_strategy": ["http_429", "http_5xx"],
"instances": [
{
"name": "openai-primary",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" },
"weight": 1,
"priority": 1
},
{
"name": "anthropic-fallback",
"provider": "anthropic",
"auth": { "header": { "Authorization": "Bearer '"$ANTHROPIC_API_KEY"'" } },
"options": { "model": "claude-sonnet-4-20250514" },
"weight": 1,
"priority": 2
}
]
}
}
}'
❶ Trigger fallback when the current instance returns HTTP 429 (rate limited) or 5xx (server error).
❷ OpenAI is the primary instance with the highest priority (1).
❸ Anthropic serves as fallback with lower priority (2). Traffic routes here only when OpenAI returns 429 or 5xx.
services:
- name: Multi-LLM Failover
routes:
- uris:
- /ai
name: multi-llm-failover
plugins:
ai-proxy-multi:
fallback_strategy:
- http_429
- http_5xx
instances:
- name: openai-primary
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
weight: 1
priority: 1
- name: anthropic-fallback
provider: anthropic
auth:
header:
Authorization: "Bearer sk-ant-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: claude-sonnet-4-20250514
weight: 1
priority: 2
❶ Trigger fallback when the current instance returns HTTP 429 (rate limited) or 5xx (server error).
❷ OpenAI is the primary instance with the highest priority (1).
❸ Anthropic serves as fallback with lower priority (2). Traffic routes here only when OpenAI returns 429 or 5xx.
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
Cross-Provider Routing Strategies
Combine different providers for specific goals:
Cost Optimization
Route to the cheapest model first, with a premium fallback:
| Instance | Provider | Model | Priority | Purpose |
|---|---|---|---|---|
| deepseek-primary | DeepSeek | deepseek-chat | 1 | Lowest cost per token |
| gpt-4o-mini-secondary | OpenAI | gpt-4o-mini | 2 | Moderate cost fallback |
| gpt-4o-premium | OpenAI | gpt-4o | 3 | Highest quality fallback |
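The tiering above can be sketched as an ADC configuration. This is illustrative only: the `deepseek` provider name, the placeholder API keys, and the reuse of the failover section's fallback_strategy are assumptions.

```yaml
services:
  - name: Multi-LLM Cost Optimized
    routes:
      - uris:
          - /ai
        name: multi-llm-cost-optimized
        plugins:
          ai-proxy-multi:
            fallback_strategy:
              - http_429
              - http_5xx
            instances:
              - name: deepseek-primary
                provider: deepseek
                auth:
                  header:
                    Authorization: "Bearer sk-xxxxxxxxxxxxxxxxxxxxxxxx"
                options:
                  model: deepseek-chat
                weight: 1
                priority: 1
              - name: gpt-4o-mini-secondary
                provider: openai
                auth:
                  header:
                    Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
                options:
                  model: gpt-4o-mini
                weight: 1
                priority: 2
              - name: gpt-4o-premium
                provider: openai
                auth:
                  header:
                    Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
                options:
                  model: gpt-4o
                weight: 1
                priority: 3
```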
Capacity Distribution
Spread load across providers to stay under individual rate limits:
| Instance | Provider | Model | Weight | Purpose |
|---|---|---|---|---|
| openai-pool | OpenAI | gpt-4o | 5 | 50% of traffic |
| anthropic-pool | Anthropic | claude-sonnet-4-20250514 | 3 | 30% of traffic |
| deepseek-pool | DeepSeek | deepseek-chat | 2 | 20% of traffic |
Response Streaming
The ai-proxy-multi plugin handles Server-Sent Events (SSE) streaming transparently. When a client sends "stream": true, the gateway streams tokens from whichever instance handles the request, regardless of the provider.
No additional configuration is required for streaming with multi-model routing. Use the proxy-buffering plugin to disable NGINX proxy_buffering if SSE events are being buffered.
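To try streaming through the multi-model route, set "stream": true in the request body and pass curl's -N flag to disable client-side buffering. A sketch, assuming one of the routes configured above is active:

```shell
curl "http://127.0.0.1:9080/ai" -N -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      { "role": "user", "content": "Hello" }
    ]
  }'
```

Tokens arrive as SSE data: events from whichever instance handled the request.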
Verify
Send a request to test the multi-model route:
curl "http://127.0.0.1:9080/ai" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "user", "content": "Hello" }
]
}'
You should receive a response from one of the configured instances. The model field in the response indicates which instance handled the request.
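To observe the traffic split from the weighted example, send a batch of requests and tally the model field across responses. A sketch, assuming the weighted route is active and the responses are OpenAI-compatible JSON:

```shell
# Send 50 requests and count which model served each one.
for i in $(seq 1 50); do
  curl -s "http://127.0.0.1:9080/ai" -X POST \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}]}' |
    grep -o '"model": *"[^"]*"'
done | sort | uniq -c
```

With weights 8 and 2, roughly 40 of the 50 responses should name gpt-4o-mini.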
Next Steps
- Token Rate Limiting — Set per-instance token budgets.
- AI Observability — Monitor which instances handle traffic and track costs.
- ai-proxy-multi — Full configuration reference for the plugin.