
Bring Your Own Endpoint

In this guide, you will connect AISIX AI Gateway to a private OpenAI-compatible endpoint, such as a vLLM or SGLang inference server, an Ollama host, or a self-hosted proxy in front of your own models.

Use a BYO endpoint when applications should keep calling AISIX with the OpenAI-compatible API while AISIX forwards traffic to a private or air-gapped model service. The endpoint must accept OpenAI-compatible chat-completions requests.

Prerequisites

Before starting, prepare the following:

  • A running AISIX gateway with the admin API on :3001 and the proxy on :3000.
  • The admin key from the gateway config.yaml.
  • A reachable OpenAI-compatible endpoint. The examples below assume vLLM at http://10.0.0.5:8000/v1 serving meta-llama/Llama-3.1-8B-Instruct.
  • The endpoint root your server expects, such as http://host:8000/v1 for vLLM, http://host:30000/v1 for SGLang, or http://host:11434/v1 for Ollama.
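
Before configuring AISIX, it can help to confirm that the endpoint root is reachable and to see which model IDs it serves. The sketch below assumes the server exposes the OpenAI-style GET /models route under the endpoint root, as OpenAI-compatible servers generally do. The helper names join_url and list_upstream_models are illustrative and not part of AISIX.

```shell
# Join an endpoint root and a path without doubling the slash.
join_url() {
  printf '%s/%s\n' "${1%/}" "${2#/}"
}

# List models from an OpenAI-compatible endpoint root.
# ASSUMPTION: the server implements GET <root>/models.
# $1: endpoint root, $2: optional bearer token (placeholder if unauthenticated).
list_upstream_models() {
  curl -sS --fail --max-time 10 \
    -H "Authorization: Bearer ${2:-not-used}" \
    "$(join_url "$1" models)"
}

# Usage (run from the gateway host so the network path matches):
#   list_upstream_models "http://10.0.0.5:8000/v1"
```

The IDs in the returned list are the values to use later as the upstream model ID.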

Configure the BYO Endpoint

Create a provider key, model alias, and caller API key for the private endpoint.

Create a Provider Key

Many self-hosted inference servers do not require an API key. For an unauthenticated endpoint, use a non-empty placeholder in the provider key; AISIX sends it as the bearer token, and your server can ignore it.

Create a provider key for the private OpenAI-compatible endpoint:

export AISIX_ADMIN_KEY="admin-local-only-change-me"

curl -sS -X POST "http://127.0.0.1:3001/admin/v1/provider_keys" \
  -H "Authorization: Bearer ${AISIX_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "display_name": "vllm-private",
    "provider": "vllm",
    "adapter": "openai",
    "secret": "not-used-by-vllm",
    "api_base": "http://10.0.0.5:8000/v1"
  }'

Provider key secrets follow the credential-handling behavior described in Provider Keys.

provider is any short label that makes sense for your environment.

adapter selects the OpenAI-compatible upstream format.

secret is a non-empty placeholder for unauthenticated endpoints.

api_base is the endpoint root. Include /v1 when that is part of the server's route.

Save the returned provider key ID for the model resource.

Create a Model

Map a caller-facing alias to the upstream model ID your endpoint serves:

export PROVIDER_KEY_ID="YOUR_PROVIDER_KEY_ID"

curl -sS -X POST "http://127.0.0.1:3001/admin/v1/models" \
  -H "Authorization: Bearer ${AISIX_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "display_name": "llama-3-private",
    "provider": "vllm",
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",
    "provider_key_id": "'"${PROVIDER_KEY_ID}"'",
    "cost": {
      "input_per_1k": 0.0,
      "output_per_1k": 0.0
    }
  }'

display_name is the alias callers send in model.

model_name is the upstream ID your endpoint expects. For vLLM and SGLang, use the served model name. For Ollama, use the local model tag, such as llama3.1:8b.

provider_key_id attaches the model alias to the provider key you created.

cost is optional; see Pricing Metadata.

Create a Caller API Key

Choose the plaintext caller API key that the application will send to AISIX, then hash it for the admin resource:

export AISIX_API_KEY="sk-byo-caller"

CALLER_KEY_HASH=$(printf '%s' "${AISIX_API_KEY}" | shasum -a 256 | awk '{print $1}')
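
The shasum command above ships with macOS and many Perl installations, while minimal Linux images often provide only sha256sum from GNU coreutils. As a sketch, the hypothetical helper hash_key below uses whichever tool is present. Both tools print the hex digest followed by a file name, so awk keeps the first field. printf '%s' matters here: a trailing newline in the input would produce a different hash.

```shell
# Print the hex SHA-256 digest of the exact string in $1 (no trailing newline hashed).
hash_key() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | awk '{print $1}'
  else
    printf '%s' "$1" | shasum -a 256 | awk '{print $1}'
  fi
}

# Usage:
#   CALLER_KEY_HASH="$(hash_key "${AISIX_API_KEY}")"
```

The result should always be 64 hexadecimal characters; anything else suggests the tool output was not parsed as expected.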

Create an API key resource with access to the private model alias:

curl -sS -X POST "http://127.0.0.1:3001/admin/v1/apikeys" \
  -H "Authorization: Bearer ${AISIX_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "key_hash": "'"${CALLER_KEY_HASH}"'",
    "allowed_models": ["llama-3-private"]
  }'

Each entry in allowed_models must match a model alias (display_name) you created.

Pricing Metadata

Catalog providers carry pricing from the models.dev catalog. A BYO endpoint is not in that catalog, so set pricing metadata yourself if you need token-cost accounting.

Attach a cost block to the model to enable per-token budget accounting:

{
  "cost": {
    "input_per_1k": 0.10,
    "output_per_1k": 0.30
  }
}

Both values are in USD per 1,000 tokens. input_per_1k applies to prompt tokens and output_per_1k to completion tokens. Both fields are required when the cost block is present.

Self-hosted deployments store this metadata but do not enforce budget checks from it at request time. Include it on a BYO model so a managed deployment, or your own usage-event consumer, has the per-token rate available. See Models and Budgets.
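
To illustrate how a usage-event consumer might apply these rates, the sketch below computes the cost of a single request from its token counts. It assumes the counts come from the prompt_tokens and completion_tokens fields of an OpenAI-style usage object; request_cost is a hypothetical helper, not an AISIX command.

```shell
# Cost in USD for one request, given token counts and per-1,000-token rates.
# $1: prompt tokens, $2: completion tokens, $3: input_per_1k, $4: output_per_1k
request_cost() {
  awk -v p="$1" -v c="$2" -v i="$3" -v o="$4" \
    'BEGIN { printf "%.4f\n", (p / 1000) * i + (c / 1000) * o }'
}

# 1,200 prompt tokens and 350 completion tokens at the example rates:
# (1200 / 1000) * 0.10 + (350 / 1000) * 0.30 = 0.12 + 0.105
request_cost 1200 350 0.10 0.30   # prints 0.2250
```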

Verify the Upstream

Send a request through the proxy with the caller API key and model alias you created:

curl -sS -X POST "http://127.0.0.1:3000/v1/chat/completions" \
-H "Authorization: Bearer ${AISIX_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3-private",
"messages": [
{"role": "user", "content": "Say hello from the private model."}
]
}'

The response should be an OpenAI-compatible chat-completions response that echoes the caller-facing alias. To confirm the request reached the upstream, check the endpoint's access log for a POST /v1/chat/completions entry from AISIX.
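
The alias check can also be scripted. The sketch below assumes that "echoes the alias" means the alias appears in the top-level model field of the response, as in the standard OpenAI chat-completions shape. The helpers response_model and echoes_alias are illustrative, and the sed pattern is a rough substitute for a real JSON parser such as jq.

```shell
# Extract the first "model" string from a chat-completions response on stdin.
response_model() {
  sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed only if the response on stdin reports the expected alias in "model".
# ASSUMPTION: the alias is echoed in the "model" field.
echoes_alias() {
  [ "$(response_model)" = "$1" ]
}

# Usage: save the proxy response to a file, then check it.
#   echoes_alias "llama-3-private" < response.json && echo "alias echoed"
```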

If AISIX returns an upstream route or connection error, check api_base, the served model name, and endpoint reachability.
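
One way to narrow down such an error is to send the same kind of request straight to the upstream, bypassing AISIX. The sketch below assumes the upstream serves chat completions at the configured api_base plus /chat/completions, which matches the access-log path mentioned above. chat_payload and probe_upstream are hypothetical helpers.

```shell
# Build a minimal chat-completions body. Inputs must not contain double quotes
# or backslashes, because this helper does no JSON escaping.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}\n' "$1" "$2"
}

# Send the request directly to the upstream.
# $1: api_base as configured on the provider key, $2: upstream model_name.
probe_upstream() {
  curl -sS --max-time 30 -X POST "${1%/}/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer not-used-by-vllm" \
    -d "$(chat_payload "$2" "ping")"
}

# Usage (run from the gateway host):
#   probe_upstream "http://10.0.0.5:8000/v1" "meta-llama/Llama-3.1-8B-Instruct"
```

If the direct request succeeds from the gateway host, the upstream itself is healthy and the provider key or model resource is the more likely culprit. A curl exit code of 7 (could not connect) or 28 (timed out) points at reachability, while an HTTP error that names the model typically points at the served model name.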

Next Steps

You have now connected a private OpenAI-compatible endpoint to AISIX. Use the same pattern for other private OpenAI-compatible servers by changing the provider label, endpoint root, model ID, and optional pricing metadata.
