Audio APIs
Audio routes let applications send speech-to-text, speech translation, and text-to-speech requests through the same gateway authentication and model-alias layer used by the other OpenAI-compatible proxy APIs.
AISIX forwards OpenAI-style audio requests to an upstream that supports the same audio route. It resolves the caller-facing model alias, applies caller access checks and supported policy checks, rewrites the upstream model ID, and returns the upstream response body and content type without converting it into a chat-style response.
In this guide, you will send a text-to-speech request through AISIX and review the request and response behavior that differs across audio endpoints.
Prerequisites
Before starting, prepare the following:
- A running AISIX gateway that can serve proxy requests.
- A caller API key that can access the model alias.
- A model alias backed by a provider and model that support the audio route you want to call.
Send a Speech Request
Send a speech-generation request through the gateway proxy with the AISIX model alias in the request body:
curl -sS -X POST "http://127.0.0.1:3000/v1/audio/speech" \
  -H "Authorization: Bearer YOUR_CALLER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-prod",
    "input": "Hello from AISIX.",
    "voice": "alloy"
  }' \
  --output aisix-speech.mp3
For a successful request, the output file should contain the audio bytes returned by the upstream provider. Handle the response as a binary file, not as a chat-style JSON response.
Check that the file was written as audio output:
file aisix-speech.mp3
You should see output that identifies the file as audio. The exact wording depends on the operating system and the upstream response format:
aisix-speech.mp3: MPEG ADTS, layer III, v2, 160 kbps, 24 kHz, Monaural
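The same request can be issued from application code. The following is a minimal Python sketch that uses only the standard library and assumes the gateway address, model alias, and voice from the curl example above. The helper names `build_speech_request` and `save_speech` are illustrative and not part of AISIX.

```python
import json
import urllib.request

# Assumed gateway address, copied from the curl example; adjust for your deployment.
GATEWAY_URL = "http://127.0.0.1:3000/v1/audio/speech"


def build_speech_request(api_key: str, model_alias: str, text: str, voice: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the speech route."""
    payload = json.dumps({"model": model_alias, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def save_speech(request: urllib.request.Request, path: str) -> int:
    """Send the request and write the raw response bytes to a file."""
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(path, "wb") as out:  # binary mode: the body is audio, not JSON
        out.write(audio)
    return len(audio)
```

Calling `save_speech(build_speech_request("YOUR_CALLER_API_KEY", "tts-prod", "Hello from AISIX.", "alloy"), "aisix-speech.mp3")` sends the request and returns the number of bytes written. Building and sending are kept separate so the request can be inspected without a running gateway.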
Audio Endpoint Behavior
Audio endpoints do not all use the same request or response shape:
| Endpoint | Request body | Response body |
|---|---|---|
| Transcriptions | Multipart form with an audio file and model alias | Upstream JSON transcription result |
| Translations | Multipart form with an audio file and model alias | Upstream JSON translation result |
| Speech | JSON body with model alias, text input, and voice | Binary audio bytes |
For transcription and translation requests, AISIX rebuilds the multipart form with the upstream model ID before forwarding it. It preserves the other form fields, including the uploaded file name and content type when they are present.
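To make the multipart rewrite concrete, here is a conceptual Python sketch. It is not the AISIX implementation: `FormPart` and `rewrite_model_part` are hypothetical names, and the form is modeled as an in-memory list of parts rather than a streamed request body.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FormPart:
    """One part of a multipart form (hypothetical in-memory representation)."""
    name: str
    value: bytes
    filename: Optional[str] = None
    content_type: Optional[str] = None


def rewrite_model_part(parts: list[FormPart], upstream_model: str) -> list[FormPart]:
    """Return a new part list in which only the `model` value is replaced."""
    rebuilt = []
    for part in parts:
        if part.name == "model" and part.filename is None:
            rebuilt.append(replace(part, value=upstream_model.encode("utf-8")))
        else:
            # Other fields pass through; file name and content type stay attached.
            rebuilt.append(part)
    return rebuilt
```

With an assumed alias `stt-prod` and an assumed upstream ID `whisper-1`, the `model` part changes from `stt-prod` to `whisper-1`, while the file part keeps its bytes, file name, and content type.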
For speech requests, AISIX rewrites the model field in the JSON body and forwards the remaining request fields to the upstream provider.
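The JSON case is simpler. The sketch below illustrates the idea under the assumption that the body is a JSON object. `rewrite_speech_body` is a hypothetical helper, not an AISIX function.

```python
import json


def rewrite_speech_body(raw_body: bytes, upstream_model: str) -> bytes:
    """Replace the `model` value and keep every other field unchanged."""
    body = json.loads(raw_body)
    if not isinstance(body, dict) or "model" not in body:
        raise ValueError("speech request body must be a JSON object with a model field")
    body["model"] = upstream_model
    return json.dumps(body).encode("utf-8")
```

For example, a body that names the alias `tts-prod` would leave this function with `model` set to whatever upstream ID the alias resolves to, and with `input` and `voice` untouched.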
The gateway relays the upstream response body and content type. Clients should handle transcription and translation as JSON responses, and speech as binary audio output.
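A client that calls more than one audio route can branch on the relayed content type instead of hard-coding the route. The following Python sketch shows one way to do that. `decode_audio_response` is an illustrative helper name.

```python
import json
from typing import Union


def decode_audio_response(content_type: str, body: bytes) -> Union[dict, list, bytes]:
    """Parse JSON responses; return any other body as raw bytes."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json" or media_type.endswith("+json"):
        return json.loads(body)
    return body  # for example audio/mpeg from the speech route
```

A transcription response with `Content-Type: application/json` comes back as a parsed object, while a speech response with an audio content type comes back as bytes that can be written to a file.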
Provider Support
Audio support depends on the resolved provider and model. AISIX does not translate audio formats across provider families.
Use these routes with upstreams that expose matching OpenAI-style audio endpoints. If the upstream does not support the requested audio route, the failure is a provider capability or base-URL issue, not a caller-authentication issue.
Successful audio requests are attributed in gateway usage events. Token counts are populated only when the upstream response includes recognized token usage. Speech output and duration-based audio cost are not inferred from the binary response.
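The "only when present" rule can be pictured with a short Python sketch. The field names below (`usage`, `input_tokens`, `output_tokens`, `total_tokens`) are assumptions modeled on OpenAI-style usage objects. This guide does not specify which fields AISIX recognizes, so treat the list as a placeholder.

```python
from typing import Optional

# Assumed field names; the set AISIX actually recognizes may differ.
TOKEN_FIELDS = ("input_tokens", "output_tokens", "total_tokens")


def extract_token_usage(response_json: dict) -> Optional[dict]:
    """Return token counts if the response carries them, else None."""
    usage = response_json.get("usage")
    if not isinstance(usage, dict):
        return None
    counts = {key: usage[key] for key in TOKEN_FIELDS if isinstance(usage.get(key), int)}
    return counts or None  # nothing is inferred when no token fields are present
```

A response without a usage object, or with a usage object that reports only duration, yields `None` here, which mirrors the behavior of leaving token counts unpopulated.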
Input guardrails can inspect the text input on speech requests before AISIX calls the provider. Transcription and translation file bytes are forwarded as audio data and are not scanned as text by keyword guardrails.
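As a rough illustration of that scope, the Python sketch below checks only the `input` text of a speech request against a keyword list. The case-insensitive substring match is an assumption made for the example. AISIX guardrail matching rules are configured in the gateway and may behave differently.

```python
def contains_blocked_keyword(text: str, blocked: set[str]) -> bool:
    """Case-insensitive substring match (assumed matching rule)."""
    lowered = text.casefold()
    return any(word.casefold() in lowered for word in blocked)


def allows_speech_request(body: dict, blocked: set[str]) -> bool:
    """Return True when the speech request text passes the keyword check."""
    # Only the text input is inspected; generated audio is never examined here.
    return not contains_blocked_keyword(str(body.get("input", "")), blocked)
```

Nothing equivalent runs against the uploaded file in a transcription or translation request, because the file part is audio data rather than text.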
Endpoint Selection and Checks
Use transcriptions for speech-to-text in the spoken language, translations for speech-to-text where the upstream also translates the transcript (into English on the OpenAI API), and speech for text-to-speech output.
If a speech request succeeds but the client expects JSON, adjust the response handling. Speech returns audio bytes.
If a transcription or translation request returns an HTTP 400 response from AISIX or the upstream, check the multipart form construction. The request must include a model field and the expected audio file field.
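One way to rule out form-construction errors is to build the multipart body explicitly and fail early when a required field is missing. The Python sketch below uses only the standard library. The file field name `file` follows the OpenAI convention and is an assumption here, as is the helper name `build_transcription_form`.

```python
import uuid


def build_transcription_form(
    model_alias: str,
    filename: str,
    audio: bytes,
    content_type: str = "audio/wav",
) -> tuple[bytes, str]:
    """Return (body, Content-Type header value) for a multipart upload."""
    if not model_alias:
        raise ValueError("model field is required")
    if not audio:
        raise ValueError("audio file field is required")
    boundary = uuid.uuid4().hex  # random boundary, unlikely to appear in the audio
    chunks = [
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model"\r\n\r\n',
        model_alias.encode("utf-8") + b"\r\n",
        f"--{boundary}\r\n".encode(),
        # "file" is the assumed field name for the uploaded audio.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        audio + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"
```

The returned header value must be sent as the request `Content-Type` so that the boundary in the header matches the boundary in the body.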
If an input guardrail does not block a speech request as expected, check the input text in the request. Guardrails on speech requests inspect the input text, not the generated audio bytes.
Next Steps
You have now seen how AISIX forwards OpenAI-style audio requests and where audio response handling differs from JSON proxy routes. Next, continue with Image Generation to review image request behavior through AISIX.