Speech and Audio

Audio routes let applications send speech-to-text, speech translation, and text-to-speech requests through the same gateway authentication and model-alias layer used by the other OpenAI-compatible proxy APIs.

AISIX forwards OpenAI-style audio requests to an upstream that supports the same audio route. It resolves the caller-facing model alias, applies caller access checks and supported policy checks, rewrites the upstream model ID, and returns the upstream response body and content type without converting it into a chat-style response.

In this guide, you will send a text-to-speech request through AISIX and review the request and response behavior that differs across audio endpoints.

Prerequisites

Before starting, prepare the following:

A running AISIX gateway that can serve proxy requests.
A caller API key that can access the model alias.
A model alias backed by a provider and model that support the audio route you want to call.

Send a Speech Request

Send a speech-generation request through the gateway proxy with the AISIX model alias in the request body:

curl -sS -X POST "http://127.0.0.1:3000/v1/audio/speech" \
  -H "Authorization: Bearer YOUR_CALLER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-prod",
    "input": "Hello from AISIX.",
    "voice": "alloy"
  }' \
  --output aisix-speech.mp3

For a successful request, the output file should contain the audio bytes returned by the upstream provider. Handle the response as a binary file, not as a chat-style JSON response.

Check that the file was written as audio output:

file aisix-speech.mp3

You should see output that identifies the file as audio. The exact wording depends on the operating system and the upstream response format:

aisix-speech.mp3: MPEG ADTS, layer III, v2, 160 kbps, 24 kHz, Monaural
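
Because curl's `--output` writes whatever body comes back, a failed request can leave an error payload in a file named `.mp3`. The following is a minimal sketch of a wrapper that keeps the file only when the HTTP status is 200. The function name `speak_to_file` and the `AISIX_BASE_URL` and `AISIX_API_KEY` environment variables are hypothetical, and the sketch assumes that gateway or upstream errors arrive with a non-200 status.

```shell
# Sketch: keep the audio file only when the gateway answers 200.
# Assumption: failures are reported with a non-200 HTTP status.
speak_to_file() {
  # $1 = JSON request body, $2 = output path
  local tmp status
  tmp="$(mktemp)" || return 1
  # -o writes the body to a temp file; -w prints only the status code.
  status="$(curl -sS -o "$tmp" -w '%{http_code}' \
    -X POST "${AISIX_BASE_URL:-http://127.0.0.1:3000}/v1/audio/speech" \
    -H "Authorization: Bearer ${AISIX_API_KEY:-YOUR_CALLER_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$1")"
  if [ "$status" = "200" ]; then
    mv "$tmp" "$2"
  else
    echo "speech request failed with HTTP $status:" >&2
    cat "$tmp" >&2   # show the error body instead of saving it as audio
    rm -f "$tmp"
    return 1
  fi
}
```

With a running gateway, it could be called as `speak_to_file '{"model": "tts-prod", "input": "Hello from AISIX.", "voice": "alloy"}' aisix-speech.mp3`.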

Audio Endpoint Behavior

Audio endpoints do not all use the same request or response shape:

Endpoint	Request body	Response body
Transcriptions	Multipart form with an audio file and model alias	Upstream JSON transcription result
Translations	Multipart form with an audio file and model alias	Upstream JSON translation result
Speech	JSON body with model alias, text input, and voice	Binary audio bytes

For transcription and translation requests, AISIX rebuilds the multipart form with the upstream model ID before forwarding it. It preserves the other form fields, including the uploaded file name and content type when they are present.
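
As an illustration of the multipart shape described above, here is a hedged sketch of a transcription request. It assumes the route path mirrors the OpenAI-style `/v1/audio/transcriptions`; the helper name `transcribe`, the model alias `stt-prod`, and the `AISIX_BASE_URL` and `AISIX_API_KEY` environment variables are hypothetical.

```shell
# Sketch: multipart transcription request through the gateway.
# Assumptions: OpenAI-style route path; "stt-prod" is a hypothetical alias.
transcribe() {
  # $1 = model alias, $2 = path to audio file, $3 = optional MIME type
  local model="$1" audio="$2" mime="${3:-audio/mpeg}"
  # -F builds a multipart/form-data body; "@" uploads the file contents,
  # and ";type=" sets the content type of the file part.
  curl -sS -X POST "${AISIX_BASE_URL:-http://127.0.0.1:3000}/v1/audio/transcriptions" \
    -H "Authorization: Bearer ${AISIX_API_KEY:-YOUR_CALLER_API_KEY}" \
    -F "model=$model" \
    -F "file=@$audio;type=$mime"
}
```

A call such as `transcribe stt-prod ./meeting.mp3 audio/mpeg` sends the alias in the `model` field and the audio in the `file` field. Do not set the `Content-Type` header by hand here; curl generates the multipart boundary itself.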

For speech requests, AISIX rewrites the model field in the JSON body and forwards the remaining request fields to the upstream provider.

The gateway relays the upstream response body and content type. Clients should handle transcription and translation as JSON responses, and speech as binary audio output.
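
Since the content type is relayed from the upstream, a client can branch on it rather than on the route alone. The sketch below shows one way to do that in a shell script; the function name `response_kind` is hypothetical, and the exact content types depend on the upstream provider.

```shell
# Sketch: decide how to treat a gateway response from its Content-Type.
response_kind() {
  # $1 = Content-Type header value, e.g. "audio/mpeg"
  #      or "application/json; charset=utf-8"
  local ctype
  # Media types are case-insensitive, so normalize before matching.
  ctype="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$ctype" in
    application/json*) echo "json" ;;    # transcription, translation, errors
    audio/*)           echo "audio" ;;   # speech output
    *)                 echo "unknown" ;; # provider-specific; inspect manually
  esac
}
```

curl can report the response content type with `-w '%{content_type}'`, so the captured value can be passed straight to `response_kind` before deciding whether to parse the body or save it as a file.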

Provider Support

Audio support depends on the resolved provider and model. AISIX does not translate audio formats across provider families.

Use these routes with upstreams that expose matching OpenAI-style audio endpoints. If the upstream does not support the requested audio route, the failure is a provider capability or base-URL issue, not a caller-authentication issue.

Successful audio requests are attributed in gateway usage events. Token counts are populated only when the upstream response includes recognized token usage. AISIX does not infer speech output or duration-based audio cost from the binary response.

Input guardrails can inspect the text input on speech requests before AISIX calls the provider. Transcription and translation file bytes are forwarded as audio data and are not scanned as text by keyword guardrails.

Endpoint Selection and Checks

Use transcriptions for speech-to-text, translations for speech-to-text with translation semantics, and speech for text-to-audio output.

If a speech request succeeds but the client fails while parsing the response as JSON, change the client to read the body as binary data. Speech returns audio bytes.

If a transcription or translation request returns 400 from AISIX or the upstream, check the multipart form construction. The request must include a model field and the expected audio file field.
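
Some of these 400 responses can be caught before the request leaves the client. The following is a small pre-flight sketch, assuming the form is built from a model alias and a local audio file; the function name `check_audio_form` is hypothetical.

```shell
# Sketch: pre-flight checks before sending a transcription or translation form.
check_audio_form() {
  # $1 = model alias, $2 = path to the audio file
  if [ -z "$1" ]; then
    echo "missing model field" >&2
    return 1
  fi
  # -s is true only when the file exists and is not empty.
  if [ ! -s "$2" ]; then
    echo "audio file missing or empty: $2" >&2
    return 1
  fi
}
```

It only checks the client-side inputs. A 400 can still come from the upstream for other reasons, such as an unsupported audio format.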

If a speech guardrail does not block a request, check the request text. Speech guardrails inspect the input text, not the generated audio bytes.

Next Steps

You have now seen how AISIX forwards OpenAI-style audio requests and where audio response handling differs from JSON proxy routes. Continue with Provider Passthrough when you need a provider-native route that AISIX does not model directly.