Performance and Sizing

AISIX AI Gateway's data plane is a compiled Rust proxy, so it adds very little latency of its own to a request. This page reports measured proxy latency and throughput on a reference machine. It also gives a formula to size CPU for your traffic.

These figures follow the same benchmark pattern that Kong, LiteLLM, and TensorZero publish, so you can compare AISIX against their published numbers. AISIX was pinned to 4 vCPUs on a dedicated AWS c7i.4xlarge, proxying OpenAI-compatible /v1/chat/completions with a ~1000-token prompt. The upstream is a near-zero-latency mock and no traffic-control policies are attached, so the reported latency is essentially the gateway's own overhead.

What to Expect

Median gateway latency stays under 1 ms through 60% load, and p99 stays under 3 ms at 80% load. Because the test upstream returns in ~0.07 ms, the gateway latency below is effectively the overhead AISIX adds on top of a real upstream:

Offered load	Throughput (req/s)	Gateway latency p50 / p95 / p99
Light (20%)	5,700	0.31 / 0.51 / 0.59 ms
Moderate (40%)	11,300	0.52 / 0.89 / 1.04 ms
Busy (60%)	17,000	0.82 / 1.37 / 1.69 ms
Heavy (80%)	22,600	1.12 / 2.14 / 2.54 ms
Saturated	28,300	—

Subtracting the direct-to-upstream latency at the same request rate from the gateway latency isolates the overhead further: the p50 overhead is 0.24 ms at light load and 0.99 ms at heavy load. A real LLM call takes hundreds of milliseconds to several seconds, so this gateway overhead is negligible end to end.
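
To put that overhead in proportion, the sketch below computes the gateway's share of total request time. The 0.24 ms and 0.99 ms figures come from the measurements above; the 500 ms upstream latency and the function name overhead_share are illustrative assumptions, not measured values.

```python
def overhead_share(gateway_overhead_ms: float, upstream_ms: float) -> float:
    """Fraction of end-to-end latency spent in the gateway."""
    return gateway_overhead_ms / (gateway_overhead_ms + upstream_ms)


# p50 overhead figures from this page, in milliseconds.
MEASURED_OVERHEAD_MS = {"light": 0.24, "heavy": 0.99}

# Assumption: a 500 ms upstream LLM call, at the low end of "hundreds of ms".
ASSUMED_UPSTREAM_MS = 500.0

if __name__ == "__main__":
    for load, overhead_ms in MEASURED_OVERHEAD_MS.items():
        share = overhead_share(overhead_ms, ASSUMED_UPSTREAM_MS)
        print(f"{load}: {share:.2%} of end-to-end latency")
```

Under that assumption the gateway accounts for well under 1% of the request, even at heavy load.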

Throughput and CPU

On 4 vCPUs, a single AISIX instance sustained about 28,300 req/s for this workload. CPU usage scales almost perfectly linearly with request rate:

CPU% ≈ 14 + 0.0144 × (req/s)     # per instance, in % of one vCPU

That is roughly 0.14 ms of one CPU core per request plus a small fixed runtime cost. The linear fit holds across the tested 20–80% load range. Near saturation the curve flattens against the 4-vCPU ceiling: the 28,300 req/s peak drew about 383% CPU, not the ~421% that extrapolating the fit implies. That flattening is one more reason to size below saturation, as described under Sizing Your Deployment. To raise throughput, add vCPUs to an instance (scale up) or add replicas behind a load balancer (scale out).
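
The arithmetic behind those two figures can be checked with a short Python sketch. The coefficients are the ones from the fit above; the constant and function names are illustrative.

```python
# Coefficients of the linear fit reported on this page.
FIXED_CPU_PCT = 14.0       # fixed runtime cost, in % of one vCPU
CPU_PCT_PER_RPS = 0.0144   # marginal cost, in % of one vCPU per req/s


def predicted_cpu_pct(req_per_s: float) -> float:
    """CPU use predicted by the linear fit, in % of one vCPU."""
    return FIXED_CPU_PCT + CPU_PCT_PER_RPS * req_per_s


def cpu_ms_per_request() -> float:
    """Marginal CPU time per request, in milliseconds of one core."""
    # 1% of one vCPU sustained for one second equals 10 ms of CPU time.
    return CPU_PCT_PER_RPS * 10.0


if __name__ == "__main__":
    print(f"{cpu_ms_per_request():.3f} ms of CPU per request")
    print(f"{predicted_cpu_pct(28_300):.1f}% predicted at 28,300 req/s")
```

The second value is the extrapolated figure, which overshoots the roughly 383% measured at saturation.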

Streaming Responses

For streaming (SSE) responses, AISIX relays tokens to the client as they arrive from the upstream rather than buffering the whole response. Time-to-first-token overhead is about 0.65 ms, and the total stream duration matches the upstream — the gateway does not hold the stream.
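
The difference between relaying and buffering can be illustrated with a generic Python sketch. This is not AISIX's Rust implementation; relay, buffer_then_send, and counting_upstream are hypothetical names used only to show why a relay delivers the first chunk without waiting for the rest of the stream.

```python
from typing import Iterable, Iterator, List


def relay(upstream: Iterable[bytes]) -> Iterator[bytes]:
    """Forward each chunk as soon as the upstream produces it."""
    for chunk in upstream:
        yield chunk


def buffer_then_send(upstream: Iterable[bytes]) -> Iterator[bytes]:
    """Contrast case: read the whole response before sending anything."""
    chunks = list(upstream)
    yield from chunks


def counting_upstream(chunks: List[bytes], pulled: List[bytes]) -> Iterator[bytes]:
    """Hypothetical upstream that records each chunk as it is produced."""
    for chunk in chunks:
        pulled.append(chunk)
        yield chunk


if __name__ == "__main__":
    tokens = [b"data: Hel\n\n", b"data: lo\n\n", b"data: [DONE]\n\n"]

    pulled: List[bytes] = []
    next(relay(counting_upstream(tokens, pulled)))
    print("relay: upstream chunks read before first send =", len(pulled))

    pulled = []
    next(buffer_then_send(counting_upstream(tokens, pulled)))
    print("buffer: upstream chunks read before first send =", len(pulled))
```

The relay has read only one upstream chunk when the client receives its first one, while the buffering variant has already read all of them.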

Sizing Your Deployment

Estimate the vCPUs a single instance needs for a target request rate Q (req/s):

vCPUs ≈ (14 + 0.0144 × Q) / 100

Target throughput	vCPUs (approx)
5,000 req/s	~0.9
10,000 req/s	~1.6
25,000 req/s	~3.7
50,000 req/s	~7.3
100,000 req/s	~14.5
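
The table can be reproduced directly from the formula. The sketch below is a minimal Python version; vcpus_for_rate is an illustrative name. The fit was measured on a 4-vCPU instance, so values above 4 vCPUs are extrapolations.

```python
def vcpus_for_rate(q: float) -> float:
    """Approximate vCPUs one instance needs for q requests per second."""
    return (14 + 0.0144 * q) / 100


if __name__ == "__main__":
    for q in (5_000, 10_000, 25_000, 50_000, 100_000):
        print(f"{q:>7,} req/s -> ~{vcpus_for_rate(q):.1f} vCPUs")
```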

Guidance:

Leave headroom. Size for roughly 70–80% of saturation, not 100%, so latency stays low under bursts.
Scale out for HA and aggregate throughput. Run multiple replicas behind a load balancer; per-instance overhead stays flat as you add replicas.
Traffic controls add cost. These figures are a pure-proxy baseline. Authentication, rate limiting, guardrails, caching, and request logging each add per-request work — measure with your policy set enabled.
Results depend on request shape. Larger prompts and response bodies cost more per request; streaming and non-streaming differ. Benchmark with representative traffic before committing capacity.
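
One way to turn the headroom and scale-out guidance into a replica count is sketched below. It assumes every replica matches the 4-vCPU reference instance and workload (about 28,300 req/s at saturation), targets 75% of saturation, and keeps at least two replicas for availability. The function name, the 75% default, and the two-replica minimum are illustrative assumptions, not AISIX settings.

```python
import math

SATURATION_RPS = 28_300  # measured peak for one 4-vCPU instance on this page


def replicas_needed(
    target_rps: float,
    utilization: float = 0.75,  # assumption: middle of the 70-80% guidance
    min_replicas: int = 2,      # assumption: two replicas for availability
) -> int:
    """Replicas of the reference instance needed to serve target_rps."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    usable_rps = SATURATION_RPS * utilization
    return max(min_replicas, math.ceil(target_rps / usable_rps))


if __name__ == "__main__":
    for rps in (20_000, 50_000, 100_000):
        print(f"{rps:,} req/s -> {replicas_needed(rps)} replicas")
```

Because the baseline excludes traffic-control policies and depends on request shape, replace SATURATION_RPS with a figure measured on your own workload before relying on the result.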

How These Were Measured

The reference machine was a dedicated AWS c7i.4xlarge (16 vCPU). AISIX ran pinned to 4 vCPUs, with a load generator and a canned near-zero-latency mock upstream on separate cores. That isolation keeps the upstream and the load generator from ever becoming the bottleneck. Reported latency is the gateway's own overhead at a given rate, with no policies attached. Your results will vary with hardware, request shape, and enabled policies.
