Metrics Reference

AISIX exposes Prometheus metrics on GET /metrics through the dedicated metrics listener configured with observability.metrics.prometheus.addr.

The /metrics endpoint is unauthenticated by design. Keep the listener private to your monitoring network.

Metric families are registered lazily on first observation. Immediately after boot, /metrics can return an empty body. Send one request through the proxy, then scrape again for series to appear.
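
Because families appear only after their first observation, a dashboard or alert expression can evaluate to an empty result rather than zero right after boot. One possible way to handle this, shown as an illustration rather than a required pattern, is to fall back to a constant with or vector(0):

```promql
# Request rate across all series; falls back to 0 while the family is still unregistered.
sum(rate(aisix_llm_requests_total[5m])) or vector(0)
```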

Request and Latency

Metric	Type	Labels	Description
`aisix_requests_total`	counter	`provider`, `model`, `status`, `outcome`	Total proxy requests (legacy series). `outcome` is `success`, `client_error`, `upstream_error`, or `rate_limited`.
`aisix_request_duration_seconds`	summary	`provider`, `model`, `status`	End-to-end proxy request latency (legacy series).
`aisix_llm_requests_total`	counter	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`, `stream`, `is_fallback`, `status`, `outcome`	LLM-shaped requests through the proxy. Counts both successful and failed requests, so a success rate is computable from `outcome`.
`aisix_llm_request_duration_seconds`	summary	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`, `stream`, `status`, `outcome`	End-to-end latency for LLM requests. Filter `stream="true"` to compare its P90 against `aisix_llm_time_to_first_token_seconds` on the same streaming-only sample.
`aisix_llm_api_latency_seconds`	summary	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`	Upstream API latency only, excluding gateway overhead.
`aisix_llm_time_to_first_token_seconds`	summary	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`	Time from request entry to first generated token chunk. Streaming paths only.
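
As a rough illustration of how the end-to-end and upstream latency series relate, the mean gateway overhead can be approximated from their sums and counts. This sketch assumes the summaries expose the standard Prometheus _sum and _count series, and that both metrics are recorded for the same set of requests, which may not hold for requests that fail before reaching the upstream:

```promql
# Mean end-to-end latency minus mean upstream latency over the last 5 minutes.
(
  sum(rate(aisix_llm_request_duration_seconds_sum[5m]))
    /
  sum(rate(aisix_llm_request_duration_seconds_count[5m]))
)
-
(
  sum(rate(aisix_llm_api_latency_seconds_sum[5m]))
    /
  sum(rate(aisix_llm_api_latency_seconds_count[5m]))
)
```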

The following labels are shared across the request and usage series:

stream (true / false): whether the client requested a streaming response. Present on the request counters and the end-to-end duration metrics. Because time-to-first-token is measured only on streaming requests, restrict the end-to-end latency to stream="true" before comparing its percentiles against aisix_llm_time_to_first_token_seconds.
is_fallback (true / false): whether the request was served via a fallback routing target. Present on the request counters only (aisix_llm_requests_total, aisix_proxy_requests_total, aisix_proxy_failed_requests_total), not on the duration metrics, so it can refine a success rate without multiplying every latency series.
provider_key_name, user_name: human-readable companions to provider_key_id and user_id. Each name maps one-to-one to its id, so the names add no extra series. user_name is populated by the control plane and reads unknown until the control plane supplies it.
inbound_protocol: bounded protocol family. Values include openai, anthropic, mcp, and other.
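
The stream and is_fallback labels described above can be used to slice the request counters. The following queries are a minimal sketch using only the metric and label names documented on this page:

```promql
# Share of LLM requests that were served by a fallback routing target.
sum(rate(aisix_llm_requests_total{is_fallback="true"}[5m]))
  /
sum(rate(aisix_llm_requests_total[5m]))

# Request rate split by streaming vs. non-streaming, per inbound protocol.
sum by (inbound_protocol, stream) (rate(aisix_llm_requests_total[5m]))
```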

Usage and Cost

Metric	Type	Labels	Description
`aisix_tokens_consumed_total`	counter	`provider`, `model`	Sum of `usage.total_tokens` across completed non-streaming calls.
`aisix_llm_input_tokens_total`	counter	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`	Input tokens reported by the upstream.
`aisix_llm_output_tokens_total`	counter	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`	Output tokens reported by the upstream.
`aisix_llm_total_tokens_total`	counter	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`	Total tokens reported by the upstream.
`aisix_llm_spend_micro_usd_total`	counter	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`	Estimated spend in micro-USD (1 USD = 1,000,000 micro-USD).
`aisix_llm_tokens_by_client_total`	counter	`client_type`, `token_type`	Token volume broken down by inbound client type. `token_type` is `input` or `output`. This is a dedicated low-cardinality series, so the client breakdown never multiplies the per-key token series above. Emitted for `/v1/chat/completions` and `/v1/messages`.

client_type is derived from the inbound User-Agent and normalized to a bounded allowlist of known clients, so a client-controlled header can never grow Prometheus cardinality. Recognized values include openai-python, openai-node, anthropic-python, anthropic-typescript, claude-code, codex, cline, aider, langchain, llamaindex, litellm, curl, python-requests, httpx, aiohttp, okhttp, go-http-client, node, postman, and browser, plus other for any unrecognized agent and unknown for a missing one. The full user-agent string and its version are intentionally not metric labels — they are unbounded and client-controlled, so they are kept in request logs and analytics instead.
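
As an illustration of working with the usage counters, the sketch below converts the micro-USD spend counter to USD and reads token throughput per model. The 24h and 5m windows are arbitrary choices for the example:

```promql
# Estimated spend per team over the last 24 hours, converted from micro-USD to USD.
sum by (team_id) (increase(aisix_llm_spend_micro_usd_total[24h])) / 1e6

# Output tokens per second, by model.
sum by (model) (rate(aisix_llm_output_tokens_total[5m]))
```

Note that increase() extrapolates over the window, so the result is an estimate rather than an exact billing figure.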

Proxy Health

Metric	Type	Labels	Description
`aisix_proxy_requests_total`	counter	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`, `stream`, `is_fallback`, `status`, `outcome`	All proxy requests with full label granularity. Counts both successful and failed requests.
`aisix_proxy_failed_requests_total`	counter	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`, `stream`, `is_fallback`, `status`, `outcome`	Subset of `aisix_proxy_requests_total` where `outcome` is not `success`.
`aisix_proxy_request_duration_seconds`	summary	`endpoint`, `inbound_protocol`, `provider`, `model`, `upstream_model`, `provider_key_id`, `provider_key_name`, `api_key_id`, `team_id`, `user_id`, `user_name`, `stream`, `status`, `outcome`	End-to-end latency with full label granularity.
`aisix_proxy_in_flight_requests`	gauge	`endpoint`, `inbound_protocol`	Currently active proxy requests.

MCP requests use endpoint="/mcp" and inbound_protocol="mcp" on in-flight request metrics.
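
A failure ratio and a concurrency view can be derived from the proxy health series. This is a hedged sketch based only on the metrics listed above:

```promql
# Fraction of proxy requests whose outcome was not success, per endpoint.
sum by (endpoint) (rate(aisix_proxy_failed_requests_total[5m]))
  /
sum by (endpoint) (rate(aisix_proxy_requests_total[5m]))

# Currently active requests, per endpoint and inbound protocol.
sum by (endpoint, inbound_protocol) (aisix_proxy_in_flight_requests)
```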

Rate Limits and Budgets

Metric	Type	Labels	Description
`aisix_ratelimit_rejections_total`	counter	`scope`	Rate-limit rejections by scope, such as `requests` or `tokens`.
`aisix_ratelimit_remaining_requests`	gauge	`api_key_id`, `model`	Remaining request quota for the key/model pair.
`aisix_ratelimit_remaining_tokens`	gauge	`api_key_id`, `model`	Remaining token quota for the key/model pair.
`aisix_budget_limit_usd`	gauge	`api_key_id`, `team_id`, `user_id`	Budget limit in USD.
`aisix_budget_spent_usd`	gauge	`api_key_id`, `team_id`, `user_id`	Budget spent in USD.
`aisix_budget_remaining_usd`	gauge	`api_key_id`, `team_id`, `user_id`	Budget remaining in USD.
`aisix_budget_reset_seconds`	gauge	`api_key_id`, `team_id`, `user_id`	Seconds until the budget period resets.
`aisix_budget_details_present`	gauge	`api_key_id`, `team_id`, `user_id`	`1` when budget gauges are populated, `0` when cleared.
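
The budget gauges share the same label set, so they can be combined directly. The queries below are a sketch: the 0.8 threshold is an arbitrary example value, and the check on aisix_budget_details_present assumes that cleared gauges should be ignored:

```promql
# Budgets that are more than 80% spent, considering only populated budgets
# with a non-zero limit.
(
  aisix_budget_spent_usd / (aisix_budget_limit_usd > 0)
) > 0.8
and
(aisix_budget_details_present == 1)

# Rate-limit rejections per second, by scope.
sum by (scope) (rate(aisix_ratelimit_rejections_total[5m]))
```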

Deployment and Routing

Metric	Type	Labels	Description
`aisix_deployment_requests_total`	counter	`provider`, `model`, `upstream_model`, `provider_key_id`	Total requests dispatched to a target model.
`aisix_deployment_success_responses_total`	counter	`provider`, `model`, `upstream_model`, `provider_key_id`	Successful upstream responses from a target model.
`aisix_deployment_failure_responses_total`	counter	`provider`, `model`, `upstream_model`, `provider_key_id`	Failed upstream responses from a target model.
`aisix_deployment_state`	gauge	`provider`, `model`, `upstream_model`, `provider_key_id`	Runtime health state: `0` = healthy, `1` = partial failure, `2` = down.
`aisix_deployment_cooled_down_total`	counter	`provider`, `model`, `upstream_model`, `provider_key_id`	Times a target model entered cooldown.
`aisix_routing_successful_fallbacks_total`	counter	`model`	Successful failovers to the next routing candidate.
`aisix_routing_failed_fallbacks_total`	counter	`model`	Failed failovers where no candidate was available.
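
To illustrate how the deployment counters and state gauge fit together, the following sketch computes a per-target failure ratio and lists targets that are not healthy. It uses only the series documented above:

```promql
# Failure ratio per upstream target.
sum by (provider, upstream_model, provider_key_id) (rate(aisix_deployment_failure_responses_total[5m]))
  /
sum by (provider, upstream_model, provider_key_id) (rate(aisix_deployment_requests_total[5m]))

# Targets currently in partial failure (1) or down (2).
aisix_deployment_state > 0

# Failovers per second that ended with no candidate available, by model.
sum by (model) (rate(aisix_routing_failed_fallbacks_total[5m]))
```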

Guardrails

Metric	Type	Labels	Description
`aisix_guardrail_blocks_total`	counter	None	Requests rejected by a guardrail on input or output.
`aisix_guardrail_bypasses_total`	counter	`reason`	Fail-open events where a remote guardrail was unreachable but `fail_open` allowed the request through. `reason` values include `bedrock_5xx`, `bedrock_timeout`, `bedrock_throttled`.
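
Fail-open bypasses are easy to miss because the request still succeeds. A possible way to surface them, shown here as a sketch with an arbitrary one-hour window, is:

```promql
# Guardrail bypasses over the last hour, by reason.
sum by (reason) (increase(aisix_guardrail_bypasses_total[1h]))

# Requests blocked by a guardrail per second.
rate(aisix_guardrail_blocks_total[5m])
```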

Usage Events and Exporters

Metric	Type	Labels	Description
`aisix_usage_events_emitted_total`	counter	`handler`, `status_code`, `inbound_protocol`	Usage events successfully queued for delivery. `status_code` is bucketed as `2xx`, `3xx`, `4xx`, `5xx`, or `other`. `handler` is the endpoint name, such as `chat`, `embeddings`, `messages`, or `mcp`.
`aisix_usage_event_drops_total`	counter	`reason`	Usage events dropped because the sink was full or closed.
`aisix_otlp_fanout_drops_total`	counter	`exporter`, `reason`	OTLP trace spans dropped during fan-out.
`aisix_otlp_fanout_failures_total`	counter	`exporter`	OTLP trace span delivery failures.

MCP tool calls emit usage events with handler="mcp" and inbound_protocol="mcp". These events identify the MCP server and tool in the usage-event payload. Token and cost fields are zero for MCP tool calls.
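
The delivery counters can be watched for data loss. The sketch below assumes that any non-zero drop or failure rate is worth investigating, which is a policy choice rather than something this reference prescribes:

```promql
# Usage events dropped per second, by reason.
sum by (reason) (rate(aisix_usage_event_drops_total[5m])) > 0

# OTLP span delivery failures per second, by exporter.
sum by (exporter) (rate(aisix_otlp_fanout_failures_total[5m])) > 0

# Usage events emitted for MCP tool calls.
sum(rate(aisix_usage_events_emitted_total{handler="mcp"}[5m]))
```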

Cache

Metric	Type	Labels	Description
`aisix_redis_failures_total`	counter	`operation`	Redis cache operation failures when the Redis backend is configured.

Common Queries

Success Rate

Because aisix_llm_requests_total counts both successful and failed requests, a success rate is the ratio of success outcomes to all outcomes:

sum(rate(aisix_llm_requests_total{outcome="success"}[5m]))
  /
sum(rate(aisix_llm_requests_total[5m]))

To measure the success rate of the primary path only — over the requests that were not served by a fallback target — restrict both the numerator and denominator with is_fallback="false":

sum(rate(aisix_llm_requests_total{outcome="success", is_fallback="false"}[5m]))
  /
sum(rate(aisix_llm_requests_total{is_fallback="false"}[5m]))

Whether rate-limited requests count against the success rate is a policy choice. 429 responses carry outcome="rate_limited", so exclude them from both the numerator and the denominator (for example with outcome!="rate_limited") if a client hitting its own quota should not be treated as a gateway failure.
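
A success rate that ignores rate-limited requests could look like the following sketch. The numerator already selects only outcome="success", so the exclusion only needs to be written in the denominator:

```promql
sum(rate(aisix_llm_requests_total{outcome="success"}[5m]))
  /
sum(rate(aisix_llm_requests_total{outcome!="rate_limited"}[5m]))
```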

Streaming TTFT and End-to-End Latency

Time-to-first-token is recorded only for streaming requests, so compare it against the E2E latency restricted to the same streaming sample. AISIX exposes these latency series as Prometheus summaries, so read the precomputed quantile label directly:

# P90 time-to-first-token (streaming requests only)
aisix_llm_time_to_first_token_seconds{quantile="0.9"}

# P90 end-to-end latency, restricted to the same streaming sample
aisix_llm_request_duration_seconds{stream="true", quantile="0.9"}
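
Precomputed summary quantiles are reported per series and cannot be meaningfully averaged or summed across label combinations. Where an aggregate figure is needed, a mean can be computed instead. This sketch assumes the summaries expose the standard Prometheus _sum and _count series:

```promql
# Mean time-to-first-token across all streaming requests.
sum(rate(aisix_llm_time_to_first_token_seconds_sum[5m]))
  /
sum(rate(aisix_llm_time_to_first_token_seconds_count[5m]))

# Mean end-to-end latency on the same streaming-only sample.
sum(rate(aisix_llm_request_duration_seconds_sum{stream="true"}[5m]))
  /
sum(rate(aisix_llm_request_duration_seconds_count{stream="true"}[5m]))
```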

Token Volume by Client

sum by (client_type) (rate(aisix_llm_tokens_by_client_total[5m]))

Add token_type to separate input from output, for example sum by (client_type, token_type) (...).
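
Written out in full, the split by token type and a top-clients view might look like the sketch below. The limit of 5 is an arbitrary example value:

```promql
# Token rate per client, split into input and output.
sum by (client_type, token_type) (rate(aisix_llm_tokens_by_client_total[5m]))

# The five clients with the highest total token rate.
topk(5, sum by (client_type) (rate(aisix_llm_tokens_by_client_total[5m])))
```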
