Version: latest

Set Up Multi-LLM Routing and Automatic Fallback

This guide covers how to distribute AI traffic across multiple models and providers using the ai-proxy-multi plugin. You will learn how to configure weighted load balancing, automatic failover, and priority-based routing.

Overview

Relying on a single LLM provider creates risks: outages, rate limit exhaustion, and cost spikes. The ai-proxy-multi plugin solves this by routing traffic across multiple model instances with configurable load balancing, health checks, and fallback strategies.

Common use cases:

  • Cost optimization — Route most traffic to a cheaper model, fall back to a premium model for quality.
  • High availability — Automatic failover when a provider is down or rate-limited.
  • Capacity distribution — Spread load across providers to stay under individual rate limits.

Prerequisites

  • Install Docker.
  • Install curl to send verification requests to the gateway.
  • Have a running API7 Enterprise Gateway instance. See the Getting Started Guide for setup instructions.

Weighted Load Balancing

Distribute traffic across models based on cost and performance trade-offs:

```shell
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
  -H "X-API-KEY: $ADMIN_API_KEY" \
  -d '{
    "id": "multi-llm-weighted",
    "service_id": "'"$SERVICE_ID"'",
    "paths": ["/ai"],
    "plugins": {
      "ai-proxy-multi": {
        "balancer": {
          "algorithm": "roundrobin"
        },
        "instances": [
          {
            "name": "gpt-4o-mini",
            "provider": "openai",
            "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
            "options": { "model": "gpt-4o-mini" },
            "weight": 8
          },
          {
            "name": "gpt-4o",
            "provider": "openai",
            "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
            "options": { "model": "gpt-4o" },
            "weight": 2
          }
        ]
      }
    }
  }'
```

❶ Set the balancer algorithm. roundrobin distributes requests based on instance weights.

❷ Assign weight 8 to gpt-4o-mini — roughly 80% of traffic goes here.

❸ Assign weight 2 to gpt-4o — 20% of traffic for premium quality.
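To see how these weights shape the actual request sequence, the following sketch simulates smooth weighted round-robin — the scheme NGINX uses for weighted balancing — with the weights from the route above. The gateway's internal scheduler may differ in detail, but the long-run proportions are the same: over any 10 consecutive picks, 8 go to gpt-4o-mini and 2 to gpt-4o. Requires bash (arrays):

```shell
# Smooth weighted round-robin sketch: each round, every instance's current
# score grows by its weight; the highest score wins and is penalized by the
# total weight. Weights 8 and 2 as configured above.
names=("gpt-4o-mini" "gpt-4o")
weights=(8 2)
current=(0 0)
total=$((8 + 2))
picks=()
for _ in $(seq 1 10); do
  best=0
  for i in 0 1; do
    current[i]=$(( current[i] + weights[i] ))
    if (( current[i] > current[best] )); then best=$i; fi
  done
  current[best]=$(( current[best] - total ))
  picks+=("${names[best]}")
done
printf '%s\n' "${picks[@]}"
```

Note that the heavier instance's picks are interleaved rather than bursty, which keeps the premium model's traffic smooth instead of arriving in clumps.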

Automatic Failover

Configure fallback strategies so that traffic reroutes automatically when a provider is unavailable or rate-limited.

The fallback_strategy field supports two modes:

  • Single strategy (string): "instance_health_and_rate_limiting", "http_429", or "http_5xx".
  • Combined strategy (array): Triggers fallback on any matched condition, for example ["rate_limiting", "http_429", "http_5xx"].

```shell
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
  -H "X-API-KEY: $ADMIN_API_KEY" \
  -d '{
    "id": "multi-llm-failover",
    "service_id": "'"$SERVICE_ID"'",
    "paths": ["/ai"],
    "plugins": {
      "ai-proxy-multi": {
        "fallback_strategy": ["http_429", "http_5xx"],
        "instances": [
          {
            "name": "openai-primary",
            "provider": "openai",
            "auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
            "options": { "model": "gpt-4o" },
            "weight": 1,
            "priority": 1
          },
          {
            "name": "anthropic-fallback",
            "provider": "anthropic",
            "auth": { "header": { "Authorization": "Bearer '"$ANTHROPIC_API_KEY"'" } },
            "options": { "model": "claude-sonnet-4-20250514" },
            "weight": 1,
            "priority": 2
          }
        ]
      }
    }
  }'
```

❶ Trigger fallback when the current instance returns HTTP 429 (rate limited) or 5xx (server error).

❷ OpenAI is the primary instance with the highest priority (1).

❸ Anthropic serves as fallback with lower priority (2). Traffic routes here only when OpenAI returns 429 or 5xx.

Cross-Provider Routing Strategies

Combine different providers for specific goals:

Cost Optimization

Route to the cheapest model first, with a premium fallback:

| Instance | Provider | Model | Priority | Purpose |
| --- | --- | --- | --- | --- |
| deepseek-primary | DeepSeek | deepseek-chat | 1 | Lowest cost per token |
| gpt-4o-mini-secondary | OpenAI | gpt-4o-mini | 2 | Moderate cost fallback |
| gpt-4o-premium | OpenAI | gpt-4o | 3 | Highest quality fallback |
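As a sketch, the table above maps onto an instances array along these lines. The auth headers and option keys follow the earlier examples in this guide; the DeepSeek provider name and the `<..._API_KEY>` placeholders are assumptions to be replaced with your own values:

```json
"ai-proxy-multi": {
  "fallback_strategy": ["http_429", "http_5xx"],
  "instances": [
    {
      "name": "deepseek-primary",
      "provider": "deepseek",
      "auth": { "header": { "Authorization": "Bearer <DEEPSEEK_API_KEY>" } },
      "options": { "model": "deepseek-chat" },
      "weight": 1,
      "priority": 1
    },
    {
      "name": "gpt-4o-mini-secondary",
      "provider": "openai",
      "auth": { "header": { "Authorization": "Bearer <OPENAI_API_KEY>" } },
      "options": { "model": "gpt-4o-mini" },
      "weight": 1,
      "priority": 2
    },
    {
      "name": "gpt-4o-premium",
      "provider": "openai",
      "auth": { "header": { "Authorization": "Bearer <OPENAI_API_KEY>" } },
      "options": { "model": "gpt-4o" },
      "weight": 1,
      "priority": 3
    }
  ]
}
```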

Capacity Distribution

Spread load across providers to stay under individual rate limits:

| Instance | Provider | Model | Weight | Purpose |
| --- | --- | --- | --- | --- |
| openai-pool | OpenAI | gpt-4o | 5 | 50% of traffic |
| anthropic-pool | Anthropic | claude-sonnet-4-20250514 | 3 | 30% of traffic |
| deepseek-pool | DeepSeek | deepseek-chat | 2 | 20% of traffic |
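The weighted variant has the same shape but distributes by weight instead of priority. A sketch, with auth omitted for brevity (each instance still needs an auth block as in the earlier examples; the DeepSeek provider name is an assumption):

```json
"ai-proxy-multi": {
  "balancer": { "algorithm": "roundrobin" },
  "instances": [
    { "name": "openai-pool", "provider": "openai", "options": { "model": "gpt-4o" }, "weight": 5 },
    { "name": "anthropic-pool", "provider": "anthropic", "options": { "model": "claude-sonnet-4-20250514" }, "weight": 3 },
    { "name": "deepseek-pool", "provider": "deepseek", "options": { "model": "deepseek-chat" }, "weight": 2 }
  ]
}
```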

Response Streaming

The ai-proxy-multi plugin handles Server-Sent Events (SSE) streaming transparently. When a client sends "stream": true, the gateway streams tokens from whichever instance handles the request, regardless of the provider.

No additional configuration is required for streaming with multi-model routing. Use the proxy-buffering plugin to disable NGINX proxy_buffering if SSE events are being buffered.
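For example, a client opts into streaming per request by setting the stream field in the body (the field name follows the OpenAI-style chat completion format used throughout this guide):

```json
{
  "messages": [
    { "role": "user", "content": "Tell me a story" }
  ],
  "stream": true
}
```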

Verify

Send a request to test the multi-model route:

```shell
curl "http://127.0.0.1:9080/ai" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Hello" }
    ]
  }'
```

You should receive a response from one of the configured instances. The model field in the response indicates which instance handled the request.
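To script checks on which instance answered, you can pull the model field out of the response body. A sketch using sed; the sample body below is illustrative (real responses carry more fields), and jq is more robust if it is available:

```shell
# Extract the "model" field from an OpenAI-style chat completion response.
response='{"id":"chatcmpl-1","model":"gpt-4o-mini","choices":[{"message":{"role":"assistant","content":"Hi"}}]}'
model=$(printf '%s' "$response" | sed -n 's/.*"model":"\([^"]*\)".*/\1/p')
echo "$model"   # prints: gpt-4o-mini
```

Running this in a loop against the weighted route and tallying the results is a quick way to confirm the configured traffic split.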
