Semantic Routing

A semantic router lets callers use one model alias while AISIX forwards each request to a different upstream model based on the meaning of the request. Callers always send the same model. AISIX reads the latest user message, embeds it, and compares it against example utterances you attach to each route. It then forwards the request to the best-matching route's target, or to a default model when nothing matches.

It is a virtual-model shape alongside direct models, routing groups (load balancing), and ensemble models. Like those, a semantic router is a model alias with a configuration block — here, a semantic block.

Use semantic routing when one entry point should fan out by topic. For example, send legal questions to a stronger reasoning model and translation requests to a cheaper multilingual model, with everything else going to a general-purpose default. The application never chooses a model per request.

How a Request Flows

For each chat request to a semantic router, AISIX:

Embeds the latest user message through the configured embedding model.
Scores it (cosine similarity) against every route's example embeddings.
Takes each route's best example score and keeps the highest route that clears its threshold.
Dispatches to that route's target model — or to the default model when no route clears its threshold.

Example utterances are embedded once and cached in the data plane, so the steady-state cost of a request is a single embedding call for the prompt plus local arithmetic.

What You Configure

Semantic routing uses two model kinds. Create the embedding model and the route target models first, then create the semantic router that references them.

Embedding Model

An embedding model is a model with kind set to embedding that points at an OpenAI-compatible /v1/embeddings endpoint, plus an embedding block recording its output dimensionality. It can also be called directly through /v1/embeddings.

{
  "kind": "embedding",
  "display_name": "bge-m3",
  "model_name": "bge-m3",
  "provider_key_id": "<provider key for the embeddings endpoint>",
  "embedding": {
    "dimensions": 1024,
    "normalize": true
  }
}

dimensions is required and must match the endpoint's output vector size. normalize defaults to true; set it to false only when the endpoint does not already return unit-length vectors. The endpoint must speak the OpenAI /v1/embeddings shape and return float vectors. Self-hosted runners such as Text Embeddings Inference and Ollama, as well as hosted providers, all qualify. The endpoint must be reachable from the control plane for the test and threshold helpers described below.

Semantic Router

A semantic router is a model with kind set to semantic and a semantic block that references the embedding model, lists the routes, and names a default. All model references are by model ID.

{
  "kind": "semantic",
  "display_name": "prod-chat",
  "semantic": {
    "embedding_model_id": "<id of the embedding model>",
    "default_model_id": "<id of a direct model>",
    "threshold": 0.75,
    "routes": [
      {
        "name": "legal",
        "target_model_id": "<id of a direct model>",
        "description": "Contract and legal-risk analysis",
        "examples": [
          "analyze this contract for legal risk",
          "review this NDA for liability exposure",
          "这条赔偿条款合法吗"
        ],
        "threshold": 0.8
      },
      {
        "name": "translate",
        "target_model_id": "<id of a direct model>",
        "examples": ["translate this paragraph to French"]
      }
    ]
  }
}

Each route needs a name, a target_model_id pointing at a direct model, and at least one examples entry. description is optional. A route's own threshold overrides the router-level threshold for that route. The embedding model, default model, and every route target must already exist in the same environment.

In the dashboard, the Models page exposes Embedding and Semantic Router sections that build these blocks for you: pick the embedding model, the default model, and a target model per route from dropdowns, and add example utterances per route.

How Matching Works

Only the latest user message is embedded. When that message has several text parts, they are concatenated; non-text content is ignored. System, assistant, and tool turns do not affect routing.
Each route's score is the maximum cosine similarity between the request and that route's example utterances.
A route matches when its score is at least its effective threshold — its own threshold, otherwise the router-level threshold. Among matching routes, the highest score wins. When no route matches, the request goes to the default model.

Cross-lingual matching works with a multilingual embedding model: a Chinese prompt can match English examples and the reverse. Cosine scores for related-but-not-identical text typically fall in the 0.4–0.65 range. Tune thresholds against your own examples rather than assuming high cutoffs.

Tuning Thresholds

Two dashboard helpers, available on the semantic router form, make tuning concrete instead of guesswork. Both call the embedding endpoint from the control plane, so the endpoint must be publicly reachable.

Test routing — enter a prompt and see which route it resolves to, along with each route's similarity score and whether it cleared its threshold. Use it to confirm that representative prompts land where you expect before saving.
Suggest thresholds — computes a recommended threshold per route from the geometry of your example sets (how tightly each route's own examples cluster versus how far apart different routes sit). It needs no live traffic; it works from the configured examples alone. Apply the suggestions as a starting point, then refine with Test routing.

What the Caller Sees

The response body reports the resolved upstream model. Two response headers expose the routing decision:

x-aisix-route — the name of the route that matched. It is absent when the request fell through to the default model.
x-aisix-served-by — the display name of the direct model that served the request.

When Embedding Fails

If the embedding call errors or times out, the router applies its on_embedding_failure policy. Set an optional embedding_timeout_ms to bound how long the router waits for the embedding call.

{
  "semantic": {
    "embedding_timeout_ms": 500,
    "on_embedding_failure": { "mode": "default" }
  }
}

mode is one of:

Mode	Behavior
`default`	Route to the router's default model. This is the default behavior.
`fail`	Reject the request with `503`.
`target`	Route to a specific safe model named by `target_model_id`.

target_model_id is required when, and only when, mode is target.

Notes

A semantic router is mutually exclusive with direct-upstream fields and with routing groups and ensembles. The embedding block is valid only on a direct model. The Admin API rejects invalid combinations with 400 INVALID_REQUEST.
Route example vectors are recomputed automatically when you change an example's text, the embedding model, or its dimensions.
Routing decisions can be influenced by adversarial prompts, so run input guardrails before routing and treat similarity scores as operator-only signals.

For the full model resource shape, see Model Aliases.

How a Request Flows​

What You Configure​

Embedding Model​

Semantic Router​

How Matching Works​

Tuning Thresholds​

What the Caller Sees​

When Embedding Fails​

Notes​