Monitor AI Traffic and Track LLM Costs
This guide explains how to observe AI traffic and estimate LLM costs using built-in AI logging, APISIX context variables, and standard observability integrations.
Overview
AI observability is different from traditional API observability. For LLM workloads, you need token-level visibility, model attribution, and latency breakdowns such as time to first token.
With API7 AI Gateway, you can collect:
- Request and response model metadata.
- Prompt and completion token counts.
- End-to-end and upstream timing signals.
- Optional payload-level logs for request/response content.
The gateway does not calculate billing directly. Cost tracking is derived by mapping token usage to provider pricing.
Prerequisites
- Install Docker.
- Install cURL to send requests to the services for validation.
- Have a running API7 Enterprise Gateway instance. See the Getting Started Guide for setup instructions.
Built-in AI Logging
ai-proxy and ai-proxy-multi support a logging configuration with:
- `logging.summaries` (boolean): logs `request_model`, `model`, `duration`, `prompt_tokens`, `completion_tokens`, and `upstream_response_time`.
- `logging.payloads` (boolean): logs request messages, the stream flag, and response text content.
Enable logging at route scope (or in shared plugin policy where applicable):
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "ai-observability",
"service_id": "$SERVICE_ID",
"paths": ["/ai"],
"plugins": {
"ai-proxy": {
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o" },
"logging": {
"summaries": true,
"payloads": false
}
}
}
}'
❶ Enable AI logging on the ai-proxy plugin.
❷ Log summary-level fields (model + token/timing metadata) for cost and performance analysis.
❸ Keep payload logging disabled by default to reduce sensitive content exposure.
services:
- name: AI Observability
routes:
- uris:
- /ai
name: ai-observability
plugins:
ai-proxy:
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o
logging:
summaries: true
payloads: false
❶ Enable AI logging on the ai-proxy plugin.
❷ Log summary-level fields (model + token/timing metadata) for cost and performance analysis.
❸ Keep payload logging disabled by default to reduce sensitive content exposure.
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
For multi-model routes, apply the same logging fields in each ai-proxy-multi instance:
- Admin API
- ADC
curl "http://127.0.0.1:7080/apisix/admin/routes?gateway_group_id=default" -X PUT \
-H "X-API-KEY: $ADMIN_API_KEY" \
-d '{
"id": "ai-observability-multi",
"service_id": "$SERVICE_ID",
"paths": ["/ai-multi"],
"plugins": {
"ai-proxy-multi": {
"instances": [
{
"name": "openai-primary",
"provider": "openai",
"auth": { "header": { "Authorization": "Bearer '"$OPENAI_API_KEY"'" } },
"options": { "model": "gpt-4o-mini" },
"logging": {
"summaries": true,
"payloads": false
},
"weight": 1
}
]
}
}
}'
❶ Configure logging.summaries and logging.payloads per instance on ai-proxy-multi.
services:
- name: AI Observability Multi
routes:
- uris:
- /ai-multi
name: ai-observability-multi
plugins:
ai-proxy-multi:
instances:
- name: openai-primary
provider: openai
auth:
header:
Authorization: "Bearer sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
options:
model: gpt-4o-mini
logging:
summaries: true
payloads: false
weight: 1
❶ Configure logging.summaries and logging.payloads per instance on ai-proxy-multi.
Synchronize the configuration to API7 Gateway:
adc sync -f adc.yaml
For the full configuration reference, see ai-proxy and ai-proxy-multi.
APISIX Variables for AI Traffic
The following APISIX context variables can be used in access logs, external logging plugins, and custom observability pipelines:
| Variable | Description |
|---|---|
| `llm_time_to_first_token` | Time to first token (TTFT) for streaming responses. |
| `llm_prompt_tokens` | Prompt token count reported by the LLM provider. |
| `llm_completion_tokens` | Completion token count reported by the LLM provider. |
| `llm_model` | Actual model that generated the response. |
| `request_llm_model` | Model requested by the client. |
| `llm_response_text` | Response text content captured in context. |
| `llm_raw_usage` | Raw usage object returned by the provider. |
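These variables can be referenced in a custom log format for an APISIX-style logging plugin (for example, `http-logger` configured through plugin metadata). The sketch below is illustrative; the field names on the left are arbitrary, and you should verify the exact metadata endpoint and variable support in your gateway version:

```json
{
  "log_format": {
    "model": "$llm_model",
    "request_model": "$request_llm_model",
    "prompt_tokens": "$llm_prompt_tokens",
    "completion_tokens": "$llm_completion_tokens",
    "ttft": "$llm_time_to_first_token",
    "latency": "$latency",
    "upstream_latency": "$upstream_latency"
  }
}
```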
Prometheus Integration
Use the standard prometheus plugin to export gateway metrics. For AI traffic, the confirmed built-in metric is:
`apisix_llm_active_connections` (gauge)
Example PromQL queries:
# Current active LLM connections across all routes
sum(apisix_llm_active_connections)
# Top routes by active LLM connections
topk(10, sum by (route) (apisix_llm_active_connections))
# Alert condition example: sustained high active connections
avg_over_time(sum(apisix_llm_active_connections)[5m:]) > 200
Adjust labels and thresholds to your deployment and traffic profile.
Logging AI Requests to External Systems
You can forward structured logs with AI metadata to systems such as ELK, Loki, or Splunk by combining APISIX logging plugins with AI context variables.
Example structured log payload:
{
"route_id": "27bcdea2-7586-47ad-9262-a3adf4b6699e",
"service_id": "d8402098-2f80-4e08-afab-21b9ebc62090",
"model": "deepseek-chat",
"request_model": "",
"prompt_tokens": 11,
"completion_tokens": 16,
"duration_ms": 3991,
"ttft_ms": 2159,
"upstream_response_time": 1133
}
- `route_id` and `service_id` are the gateway-assigned UUIDs for the route and service.
- `request_model` is empty when the client does not specify a model explicitly; the model is then determined by the plugin configuration.
- `duration_ms` is not a built-in AI log field; use `$latency` (total request time) and `$upstream_latency` (upstream response time) in your log format configuration to measure request duration.
Use summary logging for default production telemetry. Enable payload logging only when you need short-term debugging with appropriate data handling controls.
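As an illustration of how these summary fields combine in downstream analysis, the small Python helper below (not part of the gateway) derives token throughput from a log record shaped like the example above, assuming `duration_ms` and `ttft_ms` fields are present:

```python
def generation_rate(record: dict) -> float:
    """Tokens per second during the generation phase (after the first token)."""
    gen_ms = record["duration_ms"] - record["ttft_ms"]
    if gen_ms <= 0:
        return 0.0
    return record["completion_tokens"] / (gen_ms / 1000.0)

# Using the values from the example log payload above:
record = {"completion_tokens": 16, "duration_ms": 3991, "ttft_ms": 2159}
rate = generation_rate(record)  # ~8.73 tokens/s over the 1832 ms generation window
```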
Building a Cost Dashboard
API7 Gateway exposes token usage signals, while cost is calculated externally:
estimated_cost = (prompt_tokens × prompt_price_per_token) + (completion_tokens × completion_price_per_token)
Typical dashboard panels in Grafana:
- Cost by consumer (hourly/daily).
- Cost by model.
- Prompt vs. completion token split.
- Spend trend and budget threshold alerts.
Implementation approach:
- Export AI logs/metrics.
- Enrich records with provider pricing metadata.
- Compute estimated cost in your log or metrics pipeline.
- Visualize and alert in Grafana.
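The enrichment step above can be sketched in Python. The per-token prices here are placeholders, not real provider rates; substitute your provider's published pricing:

```python
# Hypothetical per-token prices in USD (placeholders, not real rates).
PRICING = {
    "gpt-4o": {"prompt": 2.50e-6, "completion": 10.00e-6},
    "gpt-4o-mini": {"prompt": 0.15e-6, "completion": 0.60e-6},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Apply the formula: prompt and completion tokens times their unit prices."""
    price = PRICING[model]
    return prompt_tokens * price["prompt"] + completion_tokens * price["completion"]

# Enrich a summary log record with an estimated cost field.
record = {"model": "gpt-4o", "prompt_tokens": 11, "completion_tokens": 16}
record["estimated_cost_usd"] = estimate_cost(
    record["model"], record["prompt_tokens"], record["completion_tokens"]
)
```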
For token governance controls, see Token-Based Rate Limiting and Quota Management.
OpenTelemetry Tracing
Use the standard opentelemetry plugin to trace AI request lifecycle latency across:
- Client request ingress.
- Gateway processing.
- Upstream LLM call.
- Response egress.
This helps isolate whether latency is caused by client-side behavior, gateway plugins, or upstream model/provider response time.
Next Steps
- Token-Based Rate Limiting and Quota Management — Enforce token budgets per route, consumer, and model.
- Multi-LLM Routing and Fallback — Improve reliability and optimize provider/model usage.
- For plugin details, see `ai-proxy`, `ai-proxy-multi`, `prometheus`, and `opentelemetry`.