Metrics Reference
AISIX exposes Prometheus metrics on GET /metrics through the dedicated metrics listener configured with observability.metrics.prometheus.addr.
The /metrics endpoint is unauthenticated by design. Keep the listener private to your monitoring network.
Metric families are registered lazily on first observation. Immediately after boot, /metrics can return an empty body. Send one request through the proxy and scrape again; each series appears after its first observation.
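Because families appear only after their first observation, a missing series can mean "no traffic yet" rather than a broken scrape. A minimal sketch, assuming Prometheus is already scraping the metrics listener and using the standard absent() function; the job name "aisix" is a hypothetical value, not something AISIX defines:

```promql
# Returns one sample with value 1 while no aisix_llm_requests_total series exist,
# and returns nothing once at least one series has been scraped.
absent(aisix_llm_requests_total)

# Scrape target is up but has not recorded any LLM request yet.
# job="aisix" is a hypothetical scrape job name.
up{job="aisix"} == 1 and on() absent(aisix_llm_requests_total)
```

The second expression helps separate "listener unreachable" from "listener reachable, nothing observed yet" on a freshly booted instance.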
Request and Latency
| Metric | Type | Labels | Description |
|---|---|---|---|
aisix_requests_total | counter | provider, model, status, outcome | Total proxy requests (legacy series). outcome is success, client_error, upstream_error, or rate_limited. |
aisix_request_duration_seconds | summary | provider, model, status | End-to-end proxy request latency (legacy series). |
aisix_llm_requests_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, is_fallback, status, outcome | LLM-shaped requests through the proxy. Counts both successful and failed requests, so a success rate is computable from outcome. |
aisix_llm_request_duration_seconds | summary | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, status, outcome | End-to-end latency for LLM requests. Filter stream="true" to compare its P90 against aisix_llm_time_to_first_token_seconds on the same streaming-only sample. |
aisix_llm_api_latency_seconds | summary | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Upstream API latency only, excluding gateway overhead. |
aisix_llm_time_to_first_token_seconds | summary | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Time from request entry to first generated token chunk. Streaming paths only. |
The following labels are shared across the request and usage series:
- stream (true/false): whether the client requested a streaming response. Present on the request counters and the E2E duration metric. Because time-to-first-token is measured only on streaming requests, restrict the E2E latency to stream="true" before comparing its percentiles against aisix_llm_time_to_first_token_seconds.
- is_fallback (true/false): whether the request was served via a fallback routing target. Present on the request counters only (aisix_llm_requests_total, aisix_proxy_requests_total, aisix_proxy_failed_requests_total), not on the duration metrics, so it can refine a success rate without multiplying every latency series.
- provider_key_name, user_name: human-readable companions to provider_key_id and user_id. They are one-to-one with the ids, so they add no extra series. user_name is populated by the control plane and reads unknown until then.
- inbound_protocol: bounded protocol family. Values include openai, anthropic, mcp, and other.
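The stream and is_fallback labels described above can be used directly as matchers or grouping keys on the request counters. A short sketch using only the metric and label names listed in this section:

```promql
# Fraction of LLM requests served by a fallback routing target over the last 5 minutes.
sum(rate(aisix_llm_requests_total{is_fallback="true"}[5m]))
/
sum(rate(aisix_llm_requests_total[5m]))

# Request rate split into streaming and non-streaming traffic.
sum by (stream) (rate(aisix_llm_requests_total[5m]))
```

If no fallback request has been observed yet, the numerator has no series and the first expression returns an empty result rather than 0.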
Usage and Cost
| Metric | Type | Labels | Description |
|---|---|---|---|
aisix_tokens_consumed_total | counter | provider, model | Sum of usage.total_tokens across completed non-streaming calls. |
aisix_llm_input_tokens_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Input tokens reported by the upstream. |
aisix_llm_output_tokens_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Output tokens reported by the upstream. |
aisix_llm_total_tokens_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Total tokens reported by the upstream. |
aisix_llm_spend_micro_usd_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Estimated spend in micro-USD (1 USD = 1,000,000 micro-USD). |
aisix_llm_tokens_by_client_total | counter | client_type, token_type | Token volume broken down by inbound client type. token_type is input or output. A dedicated low-cardinality series so the client breakdown never multiplies the per-key token series above. Emitted for /v1/chat/completions and /v1/messages. |
client_type is derived from the inbound User-Agent and normalized to a bounded allowlist of known clients, so a client-controlled header can never grow Prometheus cardinality. Recognized values include openai-python, openai-node, anthropic-python, anthropic-typescript, claude-code, codex, cline, aider, langchain, llamaindex, litellm, curl, python-requests, httpx, aiohttp, okhttp, go-http-client, node, postman, and browser, plus other for any unrecognized agent and unknown for a missing one. The full user-agent string and its version are intentionally not metric labels — they are unbounded and client-controlled, so they are kept in request logs and analytics instead.
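Since the spend counter is expressed in micro-USD, dashboards usually divide by 1,000,000 to show USD. The sketch below uses only metrics from the table above; the 24h and 1h windows are arbitrary choices for illustration:

```promql
# Estimated spend per team over the last 24 hours, converted from micro-USD to USD.
sum by (team_id) (increase(aisix_llm_spend_micro_usd_total[24h])) / 1e6

# Output-to-input token ratio per model over the last hour.
sum by (model) (increase(aisix_llm_output_tokens_total[1h]))
/
sum by (model) (increase(aisix_llm_input_tokens_total[1h]))
```

increase() extrapolates to the edges of the window, so these figures are estimates and can differ slightly from billing-grade totals.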
Proxy Health
| Metric | Type | Labels | Description |
|---|---|---|---|
aisix_proxy_requests_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, is_fallback, status, outcome | All proxy requests with full label granularity. Counts both successful and failed requests. |
aisix_proxy_failed_requests_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, is_fallback, status, outcome | Subset of aisix_proxy_requests_total where outcome is not success. |
aisix_proxy_request_duration_seconds | summary | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, status, outcome | End-to-end latency with full label granularity. |
aisix_proxy_in_flight_requests | gauge | endpoint, inbound_protocol | Currently active proxy requests. |
MCP requests use endpoint="/mcp" and inbound_protocol="mcp" on in-flight request metrics.
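The proxy health counters can be combined into a failure ratio, and the in-flight gauge can be summed per endpoint. A minimal sketch using the metric and label names from the table above:

```promql
# Share of proxy requests that did not succeed, per endpoint.
sum by (endpoint) (rate(aisix_proxy_failed_requests_total[5m]))
/
sum by (endpoint) (rate(aisix_proxy_requests_total[5m]))

# Requests currently in flight on the MCP endpoint.
sum(aisix_proxy_in_flight_requests{endpoint="/mcp", inbound_protocol="mcp"})
```

An endpoint that has never recorded a failure has no aisix_proxy_failed_requests_total series, so the ratio returns no result for it instead of 0; treat a missing value as zero failures.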
Rate Limits and Budgets
| Metric | Type | Labels | Description |
|---|---|---|---|
aisix_ratelimit_rejections_total | counter | scope | Rate-limit rejections by scope, such as requests or tokens. |
aisix_ratelimit_remaining_requests | gauge | api_key_id, model | Remaining request quota for the key/model pair. |
aisix_ratelimit_remaining_tokens | gauge | api_key_id, model | Remaining token quota for the key/model pair. |
aisix_budget_limit_usd | gauge | api_key_id, team_id, user_id | Budget limit in USD. |
aisix_budget_spent_usd | gauge | api_key_id, team_id, user_id | Budget spent in USD. |
aisix_budget_remaining_usd | gauge | api_key_id, team_id, user_id | Budget remaining in USD. |
aisix_budget_reset_seconds | gauge | api_key_id, team_id, user_id | Seconds until the budget period resets. |
aisix_budget_details_present | gauge | api_key_id, team_id, user_id | 1 when budget gauges are populated, 0 when cleared. |
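
The budget gauges share the same label set, so they can be divided directly. The sketch below assumes the gauges for one budget always carry identical api_key_id, team_id, and user_id values; the 10% threshold is an arbitrary example:

```promql
# Fraction of each budget already spent; "> 0" drops zero limits to avoid dividing by zero.
aisix_budget_spent_usd / (aisix_budget_limit_usd > 0)

# Budgets with less than 10% remaining, limited to entries whose gauges are populated.
(aisix_budget_remaining_usd / (aisix_budget_limit_usd > 0) < 0.1)
and
(aisix_budget_details_present == 1)

# Rate-limit rejections per second, by scope.
sum by (scope) (rate(aisix_ratelimit_rejections_total[5m]))
```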
Deployment and Routing
| Metric | Type | Labels | Description |
|---|---|---|---|
aisix_deployment_requests_total | counter | provider, model, upstream_model, provider_key_id | Total requests dispatched to a target model. |
aisix_deployment_success_responses_total | counter | provider, model, upstream_model, provider_key_id | Successful upstream responses from a target model. |
aisix_deployment_failure_responses_total | counter | provider, model, upstream_model, provider_key_id | Failed upstream responses from a target model. |
aisix_deployment_state | gauge | provider, model, upstream_model, provider_key_id | Runtime health state: 0 = healthy, 1 = partial failure, 2 = down. |
aisix_deployment_cooled_down_total | counter | provider, model, upstream_model, provider_key_id | Times a target model entered cooldown. |
aisix_routing_successful_fallbacks_total | counter | model | Successful failovers to the next routing candidate. |
aisix_routing_failed_fallbacks_total | counter | model | Failed failovers where no candidate was available. |
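
The deployment counters share one label set, which makes a per-target failure ratio straightforward. A minimal sketch using only names from the table above:

```promql
# Failure ratio per deployment target over the last 5 minutes.
sum by (provider, model, upstream_model, provider_key_id) (rate(aisix_deployment_failure_responses_total[5m]))
/
sum by (provider, model, upstream_model, provider_key_id) (rate(aisix_deployment_requests_total[5m]))

# Targets that are not fully healthy (1 = partial failure, 2 = down).
aisix_deployment_state > 0

# Failed failovers per second, by model.
sum by (model) (rate(aisix_routing_failed_fallbacks_total[5m]))
```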
Guardrails
| Metric | Type | Labels | Description |
|---|---|---|---|
aisix_guardrail_blocks_total | counter | None | Requests rejected by a guardrail on input or output. |
aisix_guardrail_bypasses_total | counter | reason | Fail-open events where a remote guardrail was unreachable but fail_open allowed the request through. reason values include bedrock_5xx, bedrock_timeout, bedrock_throttled. |
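
Fail-open events are easy to miss because the request still succeeds, so it can help to chart them by reason. A short sketch; the one-hour window is an arbitrary choice:

```promql
# Fail-open events in the last hour, broken down by reason.
sum by (reason) (increase(aisix_guardrail_bypasses_total[1h]))

# Guardrail block rate; the metric has no labels, so no aggregation is needed.
rate(aisix_guardrail_blocks_total[5m])
```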
Usage Events and Exporters
| Metric | Type | Labels | Description |
|---|---|---|---|
aisix_usage_events_emitted_total | counter | handler, status_code, inbound_protocol | Usage events successfully queued for delivery. status_code is bucketed as 2xx, 3xx, 4xx, 5xx, or other. handler is the endpoint name, such as chat, embeddings, messages, or mcp. |
aisix_usage_event_drops_total | counter | reason | Usage events dropped because the sink was full or closed. |
aisix_otlp_fanout_drops_total | counter | exporter, reason | OTLP trace spans dropped during fan-out. |
aisix_otlp_fanout_failures_total | counter | exporter | OTLP trace span delivery failures. |
MCP tool calls emit usage events with handler="mcp" and inbound_protocol="mcp". These events identify the MCP server and tool in the usage-event payload. Token and cost fields are zero for MCP tool calls.
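The usage-event and exporter counters above can be watched with plain rate queries. A minimal sketch using only the listed metric and label names:

```promql
# Usage events queued per second, by handler and status bucket.
sum by (handler, status_code) (rate(aisix_usage_events_emitted_total[5m]))

# Usage events dropped per second, by reason.
sum by (reason) (rate(aisix_usage_event_drops_total[5m]))

# OTLP spans dropped per second, by exporter and reason.
sum by (exporter, reason) (rate(aisix_otlp_fanout_drops_total[5m]))
```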
Cache
| Metric | Type | Labels | Description |
|---|---|---|---|
aisix_redis_failures_total | counter | operation | Redis cache operation failures when the Redis backend is configured. |
Common Queries
Success Rate
Because aisix_llm_requests_total counts both successful and failed requests, a success rate is the ratio of success outcomes to all outcomes:
sum(rate(aisix_llm_requests_total{outcome="success"}[5m]))
/
sum(rate(aisix_llm_requests_total[5m]))
To measure the success rate of the primary path only — over the requests that were not served by a fallback target — restrict both the numerator and denominator with is_fallback="false":
sum(rate(aisix_llm_requests_total{outcome="success", is_fallback="false"}[5m]))
/
sum(rate(aisix_llm_requests_total{is_fallback="false"}[5m]))
Whether rate-limited requests count against the success rate is a policy choice. 429 responses carry outcome="rate_limited". If a client hitting its own quota should not be treated as a gateway failure, exclude those requests from the denominator with outcome!="rate_limited"; the numerator already selects only outcome="success".
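As an illustration of that policy choice, the variant below leaves rate-limited requests out of the calculation entirely. It is a sketch built from the success-rate query above, not a separate metric:

```promql
# Success rate that ignores rate-limited requests.
# The numerator already selects outcome="success", so only the denominator needs the extra matcher.
sum(rate(aisix_llm_requests_total{outcome="success"}[5m]))
/
sum(rate(aisix_llm_requests_total{outcome!="rate_limited"}[5m]))
```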
Streaming TTFT and End-to-End Latency
Time-to-first-token is recorded only for streaming requests, so compare it against the E2E latency restricted to the same streaming sample. AISIX exposes these latency series as Prometheus summaries, so read the precomputed quantile label directly:
# P90 time-to-first-token (streaming requests only)
aisix_llm_time_to_first_token_seconds{quantile="0.9"}
# P90 end-to-end latency, restricted to the same streaming sample
aisix_llm_request_duration_seconds{stream="true", quantile="0.9"}
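Summary quantiles are computed per series and cannot be meaningfully averaged or summed across label sets, so the quantile selectors above return one value per label combination. For a single figure across keys and users, a mean can be derived instead. The sketch below assumes the summaries expose the standard _sum and _count series of the Prometheus exposition format:

```promql
# Mean time-to-first-token across all streaming requests.
sum(rate(aisix_llm_time_to_first_token_seconds_sum[5m]))
/
sum(rate(aisix_llm_time_to_first_token_seconds_count[5m]))

# Mean end-to-end latency on the same streaming-only sample.
sum(rate(aisix_llm_request_duration_seconds_sum{stream="true"}[5m]))
/
sum(rate(aisix_llm_request_duration_seconds_count{stream="true"}[5m]))
```

A mean hides tail latency, so keep the per-series quantiles for P90 comparisons and narrow them with label matchers such as model or provider.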
Token Volume by Client
sum by (client_type) (rate(aisix_llm_tokens_by_client_total[5m]))
Add token_type to separate input from output, for example sum by (client_type, token_type) (...).
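Written out in full, the split by token_type and a ranking of the busiest clients might look like the sketch below; the choice of five clients is arbitrary:

```promql
# Token rate split by client and by direction (input or output).
sum by (client_type, token_type) (rate(aisix_llm_tokens_by_client_total[5m]))

# Five clients with the highest output-token rate.
topk(5, sum by (client_type) (rate(aisix_llm_tokens_by_client_total{token_type="output"}[5m])))
```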