Performance and Sizing
AISIX AI Gateway's data plane is a compiled Rust proxy, so it adds very little latency of its own to a request. This page reports measured proxy latency and throughput on a reference machine. It also gives a formula to size CPU for your traffic.
These figures follow the same benchmark pattern that Kong, LiteLLM, and TensorZero publish, so you can compare AISIX against their published numbers. AISIX was pinned to 4 vCPUs on a dedicated AWS c7i.4xlarge, proxying OpenAI-compatible /v1/chat/completions with a ~1000-token prompt. The upstream is a near-zero-latency mock and no traffic-control policies are attached, so the reported latency is essentially the gateway's own overhead.
What to Expect
The gateway's own processing latency stays below 1 ms at p50 through 60% of saturation and reaches about 1.1 ms at p50 (2.5 ms at p99) at 80%. Because the test upstream returns in ~0.07 ms, the gateway latency below is effectively the overhead AISIX adds on top of a real upstream:
| Offered load (% of saturation throughput) | Throughput (req/s) | Gateway latency p50 / p95 / p99 |
|---|---|---|
| Light (20%) | 5,700 | 0.31 / 0.51 / 0.59 ms |
| Moderate (40%) | 11,300 | 0.52 / 0.89 / 1.04 ms |
| Busy (60%) | 17,000 | 0.82 / 1.37 / 1.69 ms |
| Heavy (80%) | 22,600 | 1.12 / 2.14 / 2.54 ms |
| Saturated | 28,300 | — |
Subtracting the direct-to-upstream latency measured at the same rate isolates the overhead further: p50 is just 0.24 ms at light load and rises to 0.99 ms at heavy load. A real LLM call takes hundreds of milliseconds to several seconds, so this gateway overhead is well under 1% of end-to-end latency.
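To put those numbers in context, the minimal sketch below expresses the gateway overhead as a share of end-to-end latency. The overhead values are the p50 figures quoted above; the 500 ms upstream latency is an assumed, illustrative value for a real LLM call, not a measurement from this benchmark.

```python
# Gateway overhead at p50, in milliseconds, as quoted on this page.
OVERHEAD_MS = {"light": 0.24, "heavy": 0.99}


def overhead_share(overhead_ms: float, upstream_ms: float) -> float:
    """Return gateway overhead as a fraction of total end-to-end latency."""
    return overhead_ms / (upstream_ms + overhead_ms)


if __name__ == "__main__":
    upstream_ms = 500.0  # assumption: illustrative LLM response time
    for load, overhead in OVERHEAD_MS.items():
        share = overhead_share(overhead, upstream_ms)
        print(f"{load}: {share:.3%} of end-to-end latency")
```

Substitute your own upstream latency for `upstream_ms` to see how the share changes for faster or slower models.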
Throughput and CPU
On 4 vCPUs, a single AISIX instance sustained about 28,300 req/s for this workload. CPU usage scales almost perfectly linearly with request rate:
CPU% ≈ 14 + 0.0144 × (req/s) # per instance, in % of one vCPU
That is roughly 0.14 ms of one CPU core per request plus a small fixed runtime cost of about 14% of one vCPU. This linear fit holds across the tested 20–80% range. Near saturation the curve flattens against the 4-vCPU ceiling: the 28,300 req/s peak drew about 383% CPU, not the ~421% a naive extrapolation implies. That flattening is one more reason to size below saturation, as described below. Throughput scales in two ways: vertically, by adding vCPUs to an instance, or horizontally, by adding replicas behind a load balancer.
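The linear fit can be evaluated directly. The sketch below is a plain transcription of the formula on this page; the function names are illustrative, and the request rates are the throughput values from the latency table.

```python
def cpu_percent(req_per_s: float) -> float:
    """Linear fit from this page: CPU use in % of one vCPU, per instance."""
    return 14.0 + 0.0144 * req_per_s


def cpu_ms_per_request() -> float:
    """Marginal CPU time per request implied by the slope, in milliseconds."""
    # 0.0144 % of one vCPU per (req/s) equals 0.000144 CPU-seconds per request.
    return 0.0144 / 100.0 * 1000.0


if __name__ == "__main__":
    for rate in (5_700, 11_300, 17_000, 22_600, 28_300):
        print(f"{rate:>6,} req/s -> {cpu_percent(rate):6.1f}% CPU (fit)")
    print(f"marginal cost: {cpu_ms_per_request():.3f} ms CPU per request")
```

At 28,300 req/s the fit predicts more than the 400% that 4 vCPUs can supply, which is why the measured figure (about 383%) falls below it and why the fit should only be trusted inside the tested 20–80% range.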
Streaming Responses
For streaming (SSE) responses, AISIX relays tokens to the client as they arrive from the upstream rather than buffering the whole response. Time-to-first-token overhead is about 0.65 ms, and the total stream duration matches the upstream — the gateway does not hold the stream.
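The difference between relaying and buffering can be illustrated with a small, self-contained sketch. This is not AISIX code (the data plane is written in Rust); `mock_upstream`, `relay`, and `buffered` are hypothetical names, and the 10 ms chunk interval is an arbitrary assumption chosen to make the effect visible.

```python
import time
from typing import Iterable, Iterator


def mock_upstream(chunks: Iterable[str], delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical upstream that emits one chunk every delay_s seconds."""
    for chunk in chunks:
        time.sleep(delay_s)
        yield chunk


def relay(upstream: Iterator[str]) -> Iterator[str]:
    """Forward each chunk as soon as it arrives from the upstream."""
    for chunk in upstream:
        yield chunk


def buffered(upstream: Iterator[str]) -> Iterator[str]:
    """Collect the whole response before emitting anything."""
    yield from list(upstream)


def time_to_first_chunk(stream: Iterator[str]) -> float:
    """Seconds elapsed until the first chunk is available to the client."""
    start = time.perf_counter()
    next(stream)
    return time.perf_counter() - start


if __name__ == "__main__":
    tokens = ["Hel", "lo", ",", " wor", "ld"]
    t_relay = time_to_first_chunk(relay(mock_upstream(tokens)))
    t_buffered = time_to_first_chunk(buffered(mock_upstream(tokens)))
    print(f"relay:    first chunk after {t_relay * 1000:.1f} ms")
    print(f"buffered: first chunk after {t_buffered * 1000:.1f} ms")
```

With relaying, the first chunk is available after one upstream interval; with buffering, the client waits for the entire response, so time to first token grows with response length.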
Sizing Your Deployment
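Estimate the vCPUs a single instance needs for a target request rate Q (req/s). The result is the raw CPU demand from the linear fit, before any headroom, and rates above the measured 28,300 req/s peak are extrapolations: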
vCPUs ≈ (14 + 0.0144 × Q) / 100
| Target throughput | vCPUs (approx) |
|---|---|
| 5,000 req/s | ~0.9 |
| 10,000 req/s | ~1.6 |
| 25,000 req/s | ~3.7 |
| 50,000 req/s | ~7.3 |
| 100,000 req/s | ~14.5 |
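The table can be reproduced from the formula with a few lines of code. This is a minimal sketch; `vcpus_needed` is an illustrative name, and the result is raw CPU demand with no headroom added.

```python
def vcpus_needed(target_rps: float) -> float:
    """vCPUs implied by the linear fit on this page, with no headroom."""
    return (14.0 + 0.0144 * target_rps) / 100.0


if __name__ == "__main__":
    for q in (5_000, 10_000, 25_000, 50_000, 100_000):
        print(f"{q:>7,} req/s -> ~{vcpus_needed(q):.1f} vCPUs")
```

Pass your own target rate to `vcpus_needed` to get a starting estimate, then validate it against a benchmark with representative traffic.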
Guidance:
- Leave headroom. Size for roughly 70–80% of saturation, not 100%, so latency stays low under bursts.
- Scale out for HA and aggregate throughput. Run multiple replicas behind a load balancer; per-instance overhead stays flat as you add replicas.
- Traffic controls add cost. These figures are a pure-proxy baseline. Authentication, rate limiting, guardrails, caching, and request logging each add per-request work — measure with your policy set enabled.
- Results depend on request shape. Larger prompts and response bodies cost more per request; streaming and non-streaming differ. Benchmark with representative traffic before committing capacity.
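The headroom and scale-out guidance can be combined into a rough replica count. The sketch below assumes every replica is a 4-vCPU instance with the 28,300 req/s saturation point measured on this page, targets 75% of saturation (the middle of the 70–80% range), and keeps at least two replicas for availability. The two-replica minimum and the function name are assumptions for illustration, and the saturation figure applies only to this pure-proxy workload.

```python
import math

# Measured peak for one 4-vCPU instance on this page (pure proxy, no policies).
SATURATION_RPS = 28_300


def replicas_needed(target_rps: float, utilization: float = 0.75,
                    min_replicas: int = 2) -> int:
    """Replicas required to serve target_rps at the given share of saturation."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    usable_rps = SATURATION_RPS * utilization  # per-replica budget with headroom
    return max(min_replicas, math.ceil(target_rps / usable_rps))


if __name__ == "__main__":
    for q in (10_000, 50_000, 100_000):
        print(f"{q:>7,} req/s -> {replicas_needed(q)} replicas of 4 vCPUs")
```

Replace `SATURATION_RPS` with the peak you measure for your own policy set and request shape before relying on the result.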
How These Were Measured
The reference machine was a dedicated AWS c7i.4xlarge (16 vCPU). AISIX ran pinned to 4 vCPUs, with a load generator and a canned near-zero-latency mock upstream on separate cores. That isolation keeps the upstream and the load generator from ever becoming the bottleneck. Reported latency is the gateway's own overhead at a given rate, with no policies attached. Your results will vary with hardware, request shape, and enabled policies.