Version: 3.9.x

Set Up Multi-LLM Routing and Automatic Fallback

This guide covers how to distribute AI traffic across multiple models and providers using the ai-proxy-multi plugin. You will learn how to configure weighted load balancing, automatic failover, and priority-based routing.

Overview

Relying on a single LLM provider creates risks: outages, rate limit exhaustion, and cost spikes. The ai-proxy-multi plugin solves this by routing traffic across multiple model instances with configurable load balancing, health checks, and fallback strategies.

Common use cases:

  • Cost optimization — Route most traffic to a cheaper model, fall back to a premium model for quality.
  • High availability — Automatic failover when a provider is down or rate-limited.
  • Capacity distribution — Spread load across providers to stay under individual rate limits.

Prerequisites

  • Install Docker.

  • Install cURL to send requests to the services for validation.

  • Have a running API7 Gateway instance.

  • Create a token from the Dashboard and save it to an environment variable:

    export API_KEY=your-dashboard-token   # replace with your Dashboard token
  • Replace {gateway_group_id} with your gateway group ID. Use default if you are following the quickstart.

  • If you are following the Admin API examples, create or reuse a service in API7 Gateway. If you do not have one yet, follow Create or Reuse a Service, then save its ID to an environment variable:

    export SERVICE_ID=your-service-id         # replace with your service ID
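
  • Obtain API keys for the LLM providers used in the examples (OpenAI and Anthropic below) and save them to environment variables so the route configurations can reference them:

    export OPENAI_API_KEY=your-openai-api-key           # replace with your OpenAI API key
    export ANTHROPIC_API_KEY=your-anthropic-api-key     # replace with your Anthropic API key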

Weighted Load Balancing

Distribute traffic across models based on cost and performance trade-offs:

curl -k "https://localhost:7443/apisix/admin/routes?gateway_group_id={gateway_group_id}" -X PUT \
-H "X-API-KEY: ${API_KEY}" \
-d '{
"id": "multi-llm-weighted",
"service_id": "'"$SERVICE_ID"'",
"paths": ["/ai"],
"plugins": {
"ai-proxy-multi": {
"balancer": {
"algorithm": "roundrobin"
},
"instances": [
{
"name": "gpt-4o-mini",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o-mini" },
"weight": 8
},
{
"name": "gpt-4o",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" },
"weight": 2
}
]
}
}
}'

❶ Set the balancer algorithm. roundrobin distributes requests based on instance weights.

❷ Assign weight 8 to gpt-4o-mini — roughly 80% of traffic goes here.

❸ Assign weight 2 to gpt-4o — 20% of traffic for premium quality.
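
To see the split in practice, send a batch of requests and tally the model reported in each response. The loop below assumes the route is reachable on the gateway's default HTTP port (9080) and that jq is installed:

# Send 10 requests and count which model answered each one
for i in $(seq 1 10); do
  curl -s "http://127.0.0.1:9080/ai" -X POST \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}]}' \
    | jq -r '.model'
done | sort | uniq -c

With weights 8 and 2, roughly eight out of ten responses should report a gpt-4o-mini model.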

Automatic Failover

Configure fallback strategies so that traffic reroutes automatically when a provider is unavailable or rate-limited.

The fallback_strategy field supports two modes:

  • Single strategy (string): "instance_health_and_rate_limiting", "http_429", or "http_5xx".
  • Combined strategy (array): Triggers fallback on any matched condition, for example ["rate_limiting", "http_429", "http_5xx"].

instance_health_and_rate_limiting is kept for backward compatibility and is functionally the same as rate_limiting. In new configurations, prefer rate_limiting when you use the array form.
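
Inside the ai-proxy-multi block, the two forms look like this (a minimal sketch using the values listed above):

"fallback_strategy": "instance_health_and_rate_limiting"

"fallback_strategy": ["rate_limiting", "http_429", "http_5xx"]

The complete route below uses the combined form: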

curl -k "https://localhost:7443/apisix/admin/routes?gateway_group_id={gateway_group_id}" -X PUT \
-H "X-API-KEY: ${API_KEY}" \
-d '{
"id": "multi-llm-failover",
"service_id": "'"$SERVICE_ID"'",
"paths": ["/ai"],
"plugins": {
"ai-proxy-multi": {
"fallback_strategy": ["http_429", "http_5xx"],
"instances": [
{
"name": "openai-primary",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" },
"weight": 1,
"priority": 1
},
{
"name": "anthropic-fallback",
"provider": "anthropic",
"auth": { "header": { "Authorization": "Bearer '"$ANTHROPIC_API_KEY"'" } },
"options": { "model": "claude-sonnet-4-20250514" },
"weight": 1,
"priority": 2
}
]
}
}
}'

❶ Trigger fallback when the current instance returns HTTP 429 (rate limited) or 5xx (server error).

❷ OpenAI is the primary instance with the highest priority (1).

❸ Anthropic serves as fallback with lower priority (2). Traffic routes here only when OpenAI returns 429 or 5xx.

Cross-Provider Routing Strategies

Combine different providers for specific goals:

Cost Optimization

Route to the cheapest model first, with a premium fallback:

Instance                Provider    Model           Priority   Purpose
deepseek-primary        DeepSeek    deepseek-chat   1          Lowest cost per token
gpt-4o-mini-secondary   OpenAI      gpt-4o-mini     2          Moderate cost fallback
gpt-4o-premium          OpenAI      gpt-4o          3          Highest quality fallback
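
A sketch of the matching instances configuration, following the same structure and priority convention as the failover route above (the deepseek provider name and the $DEEPSEEK_API_KEY variable are assumptions; adjust them for your deployment):

"ai-proxy-multi": {
  "fallback_strategy": ["http_429", "http_5xx"],
  "instances": [
    {
      "name": "deepseek-primary",
      "provider": "deepseek",
      "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } },
      "options": { "model": "deepseek-chat" },
      "weight": 1,
      "priority": 1
    },
    {
      "name": "gpt-4o-mini-secondary",
      "provider": "openai",
      "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
      "options": { "model": "gpt-4o-mini" },
      "weight": 1,
      "priority": 2
    },
    {
      "name": "gpt-4o-premium",
      "provider": "openai",
      "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
      "options": { "model": "gpt-4o" },
      "weight": 1,
      "priority": 3
    }
  ]
}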

Capacity Distribution

Spread load across providers to stay under individual rate limits:

Instance          Provider    Model                      Weight   Purpose
openai-pool       OpenAI      gpt-4o                     5        50% of traffic
anthropic-pool    Anthropic   claude-sonnet-4-20250514   3        30% of traffic
deepseek-pool     DeepSeek    deepseek-chat              2        20% of traffic
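
A sketch of the matching configuration, reusing the weighted roundrobin balancer from the first example (the deepseek provider name and the $DEEPSEEK_API_KEY variable are again assumptions):

"ai-proxy-multi": {
  "balancer": { "algorithm": "roundrobin" },
  "instances": [
    {
      "name": "openai-pool",
      "provider": "openai",
      "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
      "options": { "model": "gpt-4o" },
      "weight": 5
    },
    {
      "name": "anthropic-pool",
      "provider": "anthropic",
      "auth": { "header": { "Authorization": "Bearer '"$ANTHROPIC_API_KEY"'" } },
      "options": { "model": "claude-sonnet-4-20250514" },
      "weight": 3
    },
    {
      "name": "deepseek-pool",
      "provider": "deepseek",
      "auth": { "header": { "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" } },
      "options": { "model": "deepseek-chat" },
      "weight": 2
    }
  ]
}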

Response Streaming

The ai-proxy-multi plugin handles Server-Sent Events (SSE) streaming transparently. When a client sends "stream": true, the gateway streams tokens from whichever instance handles the request, regardless of the provider.

No additional configuration is required for streaming with multi-model routing. Use the proxy-buffering plugin to disable NGINX proxy_buffering if SSE events are being buffered.
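
For example, a streaming request against either route above could look like the following; the -N flag keeps cURL from buffering the streamed output:

curl -N "http://127.0.0.1:9080/ai" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      { "role": "user", "content": "Tell me a short story" }
    ]
  }'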

Verify

Send a request to test the multi-model route:

curl "http://127.0.0.1:9080/ai" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "user", "content": "Hello" }
]
}'

You should receive a response from one of the configured instances. The model field in the response indicates which instance handled the request.
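
For example, assuming jq is installed, you can extract just that field:

curl -s "http://127.0.0.1:9080/ai" -X POST \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}' \
  | jq -r '.model'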
