Bring Your Own Endpoint
In this guide, you will connect AISIX AI Gateway to a private OpenAI-compatible endpoint, such as a vLLM or SGLang inference server, an Ollama host, or a self-hosted proxy in front of your own models.
Use a BYO endpoint when applications should keep calling AISIX with the OpenAI-compatible API while AISIX forwards traffic to a private or air-gapped model service. The endpoint must accept OpenAI-compatible chat-completions requests.
Prerequisites
Before starting, prepare the following:
- A gateway with admin on :3001 and proxy on :3000.
- The admin key from the gateway config.yaml.
- A reachable OpenAI-compatible endpoint. The examples below assume vLLM at http://10.0.0.5:8000/v1 serving meta-llama/Llama-3.1-8B-Instruct.
- The endpoint root your server expects, such as http://host:8000/v1 for vLLM, http://host:30000/v1 for SGLang, or http://host:11434/v1 for Ollama.
Configure the BYO Endpoint
Create a provider key, model alias, and caller API key for the private endpoint.
Create a Provider Key
Many self-hosted inference servers do not require an API key. For an unauthenticated endpoint, use a non-empty placeholder in the provider key; AISIX sends it as the bearer token, and your server can ignore it.
Create a provider key for the private OpenAI-compatible endpoint:
export AISIX_ADMIN_KEY="admin-local-only-change-me"
curl -sS -X POST "http://127.0.0.1:3001/admin/v1/provider_keys" \
  -H "Authorization: Bearer ${AISIX_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "display_name": "vllm-private",
    "provider": "vllm",
    "adapter": "openai",
    "secret": "not-used-by-vllm",
    "api_base": "http://10.0.0.5:8000/v1"
  }'
Provider key secrets follow the credential-handling behavior described in Provider Keys.
❶ provider is any short label that makes sense for your environment.
❷ adapter selects the OpenAI-compatible upstream format.
❸ secret is a non-empty placeholder for unauthenticated endpoints.
❹ api_base is the endpoint root. Include /v1 when that is part of the server's route.
Save the returned provider key ID for the model resource.
Create a Model
Map a caller-facing alias to the upstream model ID your endpoint serves:
export PROVIDER_KEY_ID="YOUR_PROVIDER_KEY_ID"
curl -sS -X POST "http://127.0.0.1:3001/admin/v1/models" \
  -H "Authorization: Bearer ${AISIX_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "display_name": "llama-3-private",
    "provider": "vllm",
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",
    "provider_key_id": "'"${PROVIDER_KEY_ID}"'",
    "cost": {
      "input_per_1k": 0.0,
      "output_per_1k": 0.0
    }
  }'
❶ display_name is the alias callers send in model.
❷ model_name is the upstream ID your endpoint expects. For vLLM and SGLang, use the served model name. For Ollama, use the local model tag, such as llama3.1:8b.
❸ provider_key_id attaches the model alias to the provider key you created.
❹ cost is optional.
Create a Caller API Key
Choose the plaintext caller API key that the application will send to AISIX, then hash it for the admin resource:
export AISIX_API_KEY="sk-byo-caller"
CALLER_KEY_HASH=$(printf '%s' "${AISIX_API_KEY}" | shasum -a 256 | awk '{print $1}')
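The command above relies on shasum, which is not always present on minimal Linux images; those usually carry sha256sum from coreutils instead. The following is a minimal sketch of a wrapper that uses whichever tool is installed. The name hash_caller_key is hypothetical and not part of AISIX, and the sketch assumes, as the command above implies, that the admin resource expects the lowercase hex SHA-256 digest of the plaintext key.

```shell
# Hypothetical helper (not part of AISIX): print the lowercase hex SHA-256
# digest of the first argument, using whichever tool is installed.
hash_caller_key() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | awk '{print $1}'
  elif command -v shasum >/dev/null 2>&1; then
    printf '%s' "$1" | shasum -a 256 | awk '{print $1}'
  else
    echo "no SHA-256 tool found" >&2
    return 1
  fi
}

# Falls back to the example key from this guide if AISIX_API_KEY is unset.
CALLER_KEY_HASH=$(hash_caller_key "${AISIX_API_KEY:-sk-byo-caller}")
echo "${CALLER_KEY_HASH}"
```

Both branches use printf '%s' rather than echo so that no trailing newline is included in the hashed input.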
Create an API key resource with access to the private model alias:
curl -sS -X POST "http://127.0.0.1:3001/admin/v1/apikeys" \
  -H "Authorization: Bearer ${AISIX_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "key_hash": "'"${CALLER_KEY_HASH}"'",
    "allowed_models": ["llama-3-private"]
  }'
❶ allowed_models must match the model alias you created.
Pricing Metadata
Catalog providers carry pricing from the models.dev catalog. A BYO endpoint is not in that catalog, so set pricing metadata yourself if you need token-cost accounting.
Attach a cost block to the model to enable per-token budget accounting:
{
  "cost": {
    "input_per_1k": 0.10,
    "output_per_1k": 0.30
  }
}
Both values are in USD per 1,000 tokens. input_per_1k applies to prompt tokens and output_per_1k to completion tokens. Both fields are required when the cost block is present.
Self-hosted deployments store this metadata but do not enforce budget checks from it at request time. Include it on a BYO model so a managed deployment, or your own usage-event consumer, has the per-token rate available. See Models and Budgets.
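As a worked example of the per-token rates, the sketch below shows how a usage-event consumer might turn token counts into a USD figure. The helper name request_cost is hypothetical, and the formula is simply the per-1,000-token definition above applied to prompt and completion tokens; it is not taken from AISIX's own accounting code.

```shell
# Hypothetical helper: cost in USD for one request, given token counts and
# per-1,000-token rates in the same shape as the cost block.
request_cost() {
  # $1 = prompt tokens, $2 = completion tokens,
  # $3 = input_per_1k,  $4 = output_per_1k
  awk -v p="$1" -v c="$2" -v in_rate="$3" -v out_rate="$4" \
    'BEGIN { printf "%.4f\n", (p / 1000) * in_rate + (c / 1000) * out_rate }'
}

# 1,200 prompt tokens and 300 completion tokens at the example rates:
request_cost 1200 300 0.10 0.30   # prints 0.2100
```

Here the prompt side contributes 1.2 × 0.10 = 0.12 and the completion side 0.3 × 0.30 = 0.09, for a total of 0.21 USD.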
Verify the Upstream
Send a request through the proxy with the caller API key and model alias you created:
curl -sS -X POST "http://127.0.0.1:3000/v1/chat/completions" \
  -H "Authorization: Bearer ${AISIX_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-private",
    "messages": [
      {"role": "user", "content": "Say hello from the private model."}
    ]
  }'
The response should be an OpenAI-compatible chat-completions response that echoes the caller-facing alias. Check the endpoint access log for a POST /v1/chat/completions entry from AISIX.
If AISIX returns an upstream route or connection error, check api_base, the served model name, and endpoint reachability.
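To separate gateway problems from upstream problems, it can help to query the endpoint directly from the gateway host. The sketch below makes two assumptions: that the upstream exposes an OpenAI-style GET /models route under its endpoint root (vLLM does; confirm this for other servers), and that AISIX appends /chat/completions to api_base, as the access-log path above suggests. The helpers upstream_url and check_upstream are hypothetical names for illustration.

```shell
# Hypothetical helper: join an endpoint root and a path with exactly one slash.
upstream_url() {
  printf '%s/%s\n' "${1%/}" "${2#/}"
}

# Hypothetical helper: list the models the upstream serves, bypassing AISIX.
# Assumes an OpenAI-style GET /models route under the endpoint root.
check_upstream() {
  curl -sS --max-time 5 "$(upstream_url "$1" models)"
}

API_BASE="http://10.0.0.5:8000/v1"
# The URL the gateway is expected to call for chat completions:
upstream_url "${API_BASE}" chat/completions
```

Running check_upstream "${API_BASE}" from the gateway host should return a model listing whose ID matches model_name. If the printed URL shows a doubled or missing /v1 segment, adjust api_base.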
Next Steps
You have now connected a private OpenAI-compatible endpoint to AISIX. Use the same pattern for other private OpenAI-compatible servers by changing the provider label, endpoint root, model ID, and optional pricing metadata.