Metrics Reference

AISIX exposes Prometheus metrics on GET /metrics through the dedicated metrics listener configured with observability.metrics.prometheus.addr.

The /metrics endpoint is unauthenticated by design. Keep the listener private to your monitoring network.

Metric families are registered lazily on first observation. Immediately after boot, /metrics can return an empty body. Send one request through the proxy, then scrape again for series to appear.

Request and Latency

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| aisix_requests_total | counter | provider, model, status, outcome | Total proxy requests (legacy series). outcome is success, client_error, upstream_error, or rate_limited. |
| aisix_request_duration_seconds | summary | provider, model, status | End-to-end proxy request latency (legacy series). |
| aisix_llm_requests_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, is_fallback, status, outcome | LLM-shaped requests through the proxy. Counts both successful and failed requests, so a success rate is computable from outcome. |
| aisix_llm_request_duration_seconds | summary | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, status, outcome | End-to-end latency for LLM requests. Filter stream="true" to compare its P90 against aisix_llm_time_to_first_token_seconds on the same streaming-only sample. |
| aisix_llm_api_latency_seconds | summary | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Upstream API latency only, excluding gateway overhead. |
| aisix_llm_time_to_first_token_seconds | summary | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Time from request entry to first generated token chunk. Streaming paths only. |

The following labels are shared across the request and usage series:

  • stream (true / false) — whether the client requested a streaming response. Present on the request counters and the E2E duration metric. Because time-to-first-token is measured only on streaming requests, restrict the E2E latency to stream="true" before comparing its percentiles against aisix_llm_time_to_first_token_seconds.
  • is_fallback (true / false) — whether the request was served via a fallback routing target. Present on the request counters only (aisix_llm_requests_total, aisix_proxy_requests_total, aisix_proxy_failed_requests_total), not on the duration metrics, so it can refine a success rate without multiplying every latency series.
  • provider_key_name, user_name — human-readable companions to provider_key_id and user_id. They are one-to-one with the ids, so they add no extra series. user_name is populated by the control plane and reads unknown until then.
  • inbound_protocol — bounded protocol family. Values include openai, anthropic, mcp, and other.
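
The table above lists end-to-end latency and upstream-only latency as separate summaries, which suggests a rough way to estimate gateway overhead. The sketch below is an assumption-laden example, not a documented AISIX query: it assumes both summaries expose the standard Prometheus _sum and _count series and that both are recorded over a comparable set of requests. Aggregating by provider and model is needed because the two series carry different label sets.

```promql
# Mean end-to-end latency minus mean upstream API latency, per provider/model.
# Assumes the standard summary _sum and _count series exist for both metrics.
  sum by (provider, model) (rate(aisix_llm_request_duration_seconds_sum[5m]))
/ sum by (provider, model) (rate(aisix_llm_request_duration_seconds_count[5m]))
-
  sum by (provider, model) (rate(aisix_llm_api_latency_seconds_sum[5m]))
/ sum by (provider, model) (rate(aisix_llm_api_latency_seconds_count[5m]))
```

The result is a difference of means, not of percentiles. If upstream latency is not recorded for some requests (for example, requests rejected before reaching the upstream), the two averages cover different populations and the difference is only indicative.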

Usage and Cost

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| aisix_tokens_consumed_total | counter | provider, model | Sum of usage.total_tokens across completed non-streaming calls. |
| aisix_llm_input_tokens_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Input tokens reported by the upstream. |
| aisix_llm_output_tokens_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Output tokens reported by the upstream. |
| aisix_llm_total_tokens_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Total tokens reported by the upstream. |
| aisix_llm_spend_micro_usd_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name | Estimated spend in micro-USD (1 USD = 1,000,000). |
| aisix_llm_tokens_by_client_total | counter | client_type, token_type | Token volume broken down by inbound client type. token_type is input or output. A dedicated low-cardinality series so the client breakdown never multiplies the per-key token series above. Emitted for /v1/chat/completions and /v1/messages. |

client_type is derived from the inbound User-Agent and normalized to a bounded allowlist of known clients, so a client-controlled header can never grow Prometheus cardinality. Recognized values include openai-python, openai-node, anthropic-python, anthropic-typescript, claude-code, codex, cline, aider, langchain, llamaindex, litellm, curl, python-requests, httpx, aiohttp, okhttp, go-http-client, node, postman, and browser, plus other for any unrecognized agent and unknown for a missing one. The full user-agent string and its version are intentionally not metric labels — they are unbounded and client-controlled, so they are kept in request logs and analytics instead.
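
Because aisix_llm_spend_micro_usd_total is a counter in micro-USD, dividing by 1,000,000 converts it to USD. A minimal sketch of estimated spend per team; the 24h window and the grouping by team_id are illustrative choices, not recommendations from this reference:

```promql
# Estimated spend in USD per team over the last 24 hours
sum by (team_id) (increase(aisix_llm_spend_micro_usd_total[24h])) / 1e6

# Output tokens per second, per model
sum by (model) (rate(aisix_llm_output_tokens_total[5m]))
```

Note that increase() extrapolates over the window, so the result is an estimate and may not match billing figures exactly.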

Proxy Health

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| aisix_proxy_requests_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, is_fallback, status, outcome | All proxy requests with full label granularity. Counts both successful and failed requests. |
| aisix_proxy_failed_requests_total | counter | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, is_fallback, status, outcome | Subset of aisix_proxy_requests_total where outcome is not success. |
| aisix_proxy_request_duration_seconds | summary | endpoint, inbound_protocol, provider, model, upstream_model, provider_key_id, provider_key_name, api_key_id, team_id, user_id, user_name, stream, status, outcome | End-to-end latency with full label granularity. |
| aisix_proxy_in_flight_requests | gauge | endpoint, inbound_protocol | Currently active proxy requests. |

MCP requests use endpoint="/mcp" and inbound_protocol="mcp" on in-flight request metrics.
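
Since aisix_proxy_failed_requests_total is described as a subset of aisix_proxy_requests_total, a failure ratio can be sketched as their quotient. The grouping by endpoint and the 5m window are illustrative assumptions:

```promql
# Fraction of proxy requests that did not succeed, per endpoint
sum by (endpoint) (rate(aisix_proxy_failed_requests_total[5m]))
/
sum by (endpoint) (rate(aisix_proxy_requests_total[5m]))

# Requests currently in flight, per inbound protocol
sum by (inbound_protocol) (aisix_proxy_in_flight_requests)
```

Because metric families are registered lazily, the first query returns no data for an endpoint that has not yet recorded a failure; an empty result there does not by itself indicate a problem.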

Rate Limits and Budgets

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| aisix_ratelimit_rejections_total | counter | scope | Rate-limit rejections by scope, such as requests or tokens. |
| aisix_ratelimit_remaining_requests | gauge | api_key_id, model | Remaining request quota for the key/model pair. |
| aisix_ratelimit_remaining_tokens | gauge | api_key_id, model | Remaining token quota for the key/model pair. |
| aisix_budget_limit_usd | gauge | api_key_id, team_id, user_id | Budget limit in USD. |
| aisix_budget_spent_usd | gauge | api_key_id, team_id, user_id | Budget spent in USD. |
| aisix_budget_remaining_usd | gauge | api_key_id, team_id, user_id | Budget remaining in USD. |
| aisix_budget_reset_seconds | gauge | api_key_id, team_id, user_id | Seconds until the budget period resets. |
| aisix_budget_details_present | gauge | api_key_id, team_id, user_id | 1 when budget gauges are populated, 0 when cleared. |
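
The budget gauges share the same label set, so they can be combined directly. The following is a sketch of a budget-utilization check; the 0.8 threshold is a hypothetical value chosen for illustration:

```promql
# Keys/teams/users that have consumed more than 80% of their budget.
# The (> 0) filter drops series with a zero limit to avoid division by zero.
aisix_budget_spent_usd / (aisix_budget_limit_usd > 0) > 0.8

# Rate-limit rejections per second, by scope
sum by (scope) (rate(aisix_ratelimit_rejections_total[5m]))
```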

Deployment and Routing

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| aisix_deployment_requests_total | counter | provider, model, upstream_model, provider_key_id | Total requests dispatched to a target model. |
| aisix_deployment_success_responses_total | counter | provider, model, upstream_model, provider_key_id | Successful upstream responses from a target model. |
| aisix_deployment_failure_responses_total | counter | provider, model, upstream_model, provider_key_id | Failed upstream responses from a target model. |
| aisix_deployment_state | gauge | provider, model, upstream_model, provider_key_id | Runtime health state: 0 = healthy, 1 = partial failure, 2 = down. |
| aisix_deployment_cooled_down_total | counter | provider, model, upstream_model, provider_key_id | Times a target model entered cooldown. |
| aisix_routing_successful_fallbacks_total | counter | model | Successful failovers to the next routing candidate. |
| aisix_routing_failed_fallbacks_total | counter | model | Failed failovers where no candidate was available. |
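
As an illustration of how the deployment series might be combined, the sketch below lists targets that are not healthy and computes a per-target failure ratio. It assumes the failure and request counters are incremented for the same dispatches, which this page does not state explicitly:

```promql
# Targets currently in partial failure (1) or down (2)
aisix_deployment_state > 0

# Share of dispatched requests that ended in a failed upstream response
sum by (provider, model, upstream_model, provider_key_id) (rate(aisix_deployment_failure_responses_total[5m]))
/
sum by (provider, model, upstream_model, provider_key_id) (rate(aisix_deployment_requests_total[5m]))
```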

Guardrails

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| aisix_guardrail_blocks_total | counter | None | Requests rejected by a guardrail on input or output. |
| aisix_guardrail_bypasses_total | counter | reason | Fail-open events where a remote guardrail was unreachable but fail_open allowed the request through. reason values include bedrock_5xx, bedrock_timeout, bedrock_throttled. |
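
Fail-open bypasses mean requests passed without a guardrail verdict, so they are worth watching separately from blocks. A minimal sketch, with the 1h window chosen only for illustration:

```promql
# Fail-open bypasses over the last hour, by reason
sum by (reason) (increase(aisix_guardrail_bypasses_total[1h]))

# Guardrail blocks per second
sum(rate(aisix_guardrail_blocks_total[5m]))
```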

Usage Events and Exporters

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| aisix_usage_events_emitted_total | counter | handler, status_code, inbound_protocol | Usage events successfully queued for delivery. status_code is bucketed as 2xx, 3xx, 4xx, 5xx, or other. handler is the endpoint name, such as chat, embeddings, messages, or mcp. |
| aisix_usage_event_drops_total | counter | reason | Usage events dropped because the sink was full or closed. |
| aisix_otlp_fanout_drops_total | counter | exporter, reason | OTLP trace spans dropped during fan-out. |
| aisix_otlp_fanout_failures_total | counter | exporter | OTLP trace span delivery failures. |

MCP tool calls emit usage events with handler="mcp" and inbound_protocol="mcp". These events identify the MCP server and tool in the usage-event payload. Token and cost fields are zero for MCP tool calls.
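
A drop ratio for usage events can be sketched from the two usage-event counters. This example assumes that emitted and dropped events are disjoint, so that their sum approximates all events produced; the page does not confirm this:

```promql
# Fraction of usage events dropped (assumes emitted and dropped are disjoint)
sum(rate(aisix_usage_event_drops_total[5m]))
/
(
  sum(rate(aisix_usage_events_emitted_total[5m]))
  + sum(rate(aisix_usage_event_drops_total[5m]))
)
```

Until the first drop is observed, the drops series may not exist and the query returns no data rather than 0.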

Cache

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| aisix_redis_failures_total | counter | operation | Redis cache operation failures when the Redis backend is configured. |

Common Queries

Success Rate

Because aisix_llm_requests_total counts both successful and failed requests, a success rate is the ratio of success outcomes to all outcomes:

sum(rate(aisix_llm_requests_total{outcome="success"}[5m]))
/
sum(rate(aisix_llm_requests_total[5m]))

To measure the success rate of the primary path only, that is, over requests that were not served by a fallback target, restrict both the numerator and denominator with is_fallback="false":

sum(rate(aisix_llm_requests_total{outcome="success", is_fallback="false"}[5m]))
/
sum(rate(aisix_llm_requests_total{is_fallback="false"}[5m]))

Whether rate-limited requests count against the success rate is a policy choice. 429 responses carry outcome="rate_limited", so if a client hitting its own quota should not be treated as a gateway failure, exclude them from both numerator and denominator (for example, with outcome!="rate_limited").
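
Excluding rate-limited requests, as described above, can be sketched as follows. The numerator already selects only outcome="success", so the extra matcher is needed only in the denominator:

```promql
# Success rate with rate-limited (429) requests left out of the total
sum(rate(aisix_llm_requests_total{outcome="success"}[5m]))
/
sum(rate(aisix_llm_requests_total{outcome!="rate_limited"}[5m]))
```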

Streaming TTFT and End-to-End Latency

Time-to-first-token is recorded only for streaming requests, so compare it against the E2E latency restricted to the same streaming sample. AISIX exposes these latency series as Prometheus summaries, so read the precomputed quantile label directly:

# P90 time-to-first-token (streaming requests only)
aisix_llm_time_to_first_token_seconds{quantile="0.9"}

# P90 end-to-end latency, restricted to the same streaming sample
aisix_llm_request_duration_seconds{stream="true", quantile="0.9"}
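
Each of the queries above returns one series per label combination, and precomputed summary quantiles cannot be meaningfully summed or averaged across series. When an aggregate view is needed, a mean is one alternative. The sketch below assumes the summary exposes the standard Prometheus _sum and _count series; grouping by model is an illustrative choice:

```promql
# Mean time-to-first-token per model over the last 5 minutes
sum by (model) (rate(aisix_llm_time_to_first_token_seconds_sum[5m]))
/
sum by (model) (rate(aisix_llm_time_to_first_token_seconds_count[5m]))
```

A mean hides tail behavior, so it complements the per-series P90 rather than replacing it.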

Token Volume by Client

sum by (client_type) (rate(aisix_llm_tokens_by_client_total[5m]))

Add token_type to separate input from output, for example sum by (client_type, token_type) (...).
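
Written out in full, the per-client breakdown by token type looks like the first query below. The second query is an illustrative variation that keeps only the five busiest clients; the limit of 5 is an arbitrary choice:

```promql
# Token rate per client, split into input and output
sum by (client_type, token_type) (rate(aisix_llm_tokens_by_client_total[5m]))

# Five clients with the highest total token rate
topk(5, sum by (client_type) (rate(aisix_llm_tokens_by_client_total[5m])))
```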


Copyright © APISEVEN PTE. LTD 2019 – 2026. Apache, Apache APISIX, APISIX, and associated open source project names are trademarks of the Apache Software Foundation