Performance and Sizing

AISIX AI Gateway's data plane is a compiled Rust proxy, so it adds very little latency of its own to a request. This page reports measured proxy latency and throughput on a reference machine. It also gives a formula to size CPU for your traffic.

These figures follow the same benchmark pattern that Kong, LiteLLM, and TensorZero publish, so you can compare AISIX against their published numbers. AISIX was pinned to 4 vCPUs on a dedicated AWS c7i.4xlarge, proxying OpenAI-compatible /v1/chat/completions with a ~1000-token prompt. The upstream is a near-zero-latency mock and no traffic-control policies are attached, so the reported latency is essentially the gateway's own overhead.

What to Expect

Median gateway latency stays below 1 ms through about 60% load, and p99 stays under 3 ms even at 80% load. Because the test upstream returns in ~0.07 ms, the gateway latency below is effectively the overhead AISIX adds on top of a real upstream:

| Offered load | Throughput (req/s) | Gateway latency p50 / p95 / p99 (ms) |
|---|---|---|
| Light (20%) | 5,700 | 0.31 / 0.51 / 0.59 |
| Moderate (40%) | 11,300 | 0.52 / 0.89 / 1.04 |
| Busy (60%) | 17,000 | 0.82 / 1.37 / 1.69 |
| Heavy (80%) | 22,600 | 1.12 / 2.14 / 2.54 |
| Saturated | 28,300 | |

Subtracting the direct-to-upstream latency at the same rate from the gateway latency isolates the overhead further: the net p50 is just 0.24 ms at light load and 0.99 ms at heavy load. A real LLM call takes hundreds of milliseconds to several seconds, so this gateway overhead is negligible end to end.
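To put "negligible" into numbers, the short sketch below computes the gateway's share of end-to-end latency. The 0.24 ms and 0.99 ms figures are the net p50 overheads quoted above; the 300 ms and 3,000 ms upstream latencies are illustrative assumptions, not measurements, and `overhead_share` is a hypothetical helper.

```python
def overhead_share(gateway_overhead_ms: float, upstream_ms: float) -> float:
    """Fraction of end-to-end latency attributable to the gateway."""
    return gateway_overhead_ms / (gateway_overhead_ms + upstream_ms)


# Net p50 overheads from this page; upstream latencies are assumed for illustration.
for label, overhead_ms in [("light", 0.24), ("heavy", 0.99)]:
    for upstream_ms in (300.0, 3000.0):
        share = overhead_share(overhead_ms, upstream_ms)
        print(f"{label:>5} load, {upstream_ms:>6.0f} ms upstream: {share:.3%}")
```

Under these assumptions the gateway accounts for well under 1% of total latency in every case.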

Throughput and CPU

On 4 vCPUs, a single AISIX instance sustained about 28,300 req/s for this workload. CPU usage scales almost perfectly linearly with request rate:

CPU% ≈ 14 + 0.0144 × (req/s)     # per instance, in % of one vCPU

That is roughly 0.14 ms of one CPU core per request plus a small fixed runtime cost. This linear fit holds across the tested 20–80% range. Near saturation the curve flattens against the 4-vCPU ceiling: the 28,300 req/s peak drew about 383% CPU, not the ~421% a naive extrapolation implies. That flattening is one more reason to size below saturation, as described below. To raise throughput, add vCPUs to an instance (scale up) or add replicas behind a load balancer (scale out).
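The arithmetic behind these statements can be checked with a few lines of Python. This is only a sketch of the linear fit quoted above; the constant and function names are illustrative, and the 383% figure is the measured value reported on this page.

```python
BASE_CPU_PCT = 14.0       # fixed runtime cost, in % of one vCPU
CPU_PCT_PER_RPS = 0.0144  # marginal cost, in % of one vCPU per req/s


def cpu_percent(req_per_s: float) -> float:
    """Predicted CPU usage (% of one vCPU) from the linear fit."""
    return BASE_CPU_PCT + CPU_PCT_PER_RPS * req_per_s


def cpu_ms_per_request() -> float:
    """Marginal CPU time per request, in milliseconds of one core."""
    # 0.0144 % of a core per (req/s) equals 0.000144 core-seconds per request.
    return CPU_PCT_PER_RPS / 100.0 * 1000.0


MEASURED_PEAK_CPU_PCT = 383.0  # observed at 28,300 req/s
predicted = cpu_percent(28_300)
print(f"per-request CPU: {cpu_ms_per_request():.3f} ms")
print(f"predicted at peak: {predicted:.1f}%, measured: {MEASURED_PEAK_CPU_PCT:.0f}%")
```

The gap between the predicted and measured peak values is the flattening described above, which is why the fit should only be trusted inside the tested 20–80% range.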

Streaming Responses

For streaming (SSE) responses, AISIX relays tokens to the client as they arrive from the upstream rather than buffering the whole response. Time-to-first-token overhead is about 0.65 ms, and the total stream duration matches the upstream's, because the gateway does not hold the stream.
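The difference between relaying and buffering can be illustrated with a toy model. The sketch below is not AISIX code (the data plane is written in Rust); it is a hypothetical Python illustration in which `relay`, `buffered`, and `counting_upstream` are made-up names. It counts how many upstream chunks must be read before the client receives its first chunk.

```python
from typing import Callable, Iterable, Iterator, List


def relay(upstream: Iterable[str]) -> Iterator[str]:
    """Forward each chunk as soon as it is read from the upstream."""
    for chunk in upstream:
        yield chunk


def buffered(upstream: Iterable[str]) -> Iterator[str]:
    """Read the whole upstream response before forwarding anything."""
    chunks = list(upstream)
    yield from chunks


def counting_upstream(chunks: List[str], reads: List[int]) -> Iterator[str]:
    """Hypothetical upstream that records how many chunks it has produced."""
    for chunk in chunks:
        reads[0] += 1
        yield chunk


def chunks_read_before_first_output(
    proxy: Callable[[Iterable[str]], Iterator[str]], chunks: List[str]
) -> int:
    """Number of upstream chunks consumed before the first chunk reaches the client."""
    reads = [0]
    stream = proxy(counting_upstream(chunks, reads))
    next(stream)  # the client receives its first chunk here
    return reads[0]


events = ["data: Hel\n\n", "data: lo\n\n", "data: [DONE]\n\n"]
print(chunks_read_before_first_output(relay, events))     # 1
print(chunks_read_before_first_output(buffered, events))  # 3
```

With relaying, time to first token depends only on the first upstream chunk plus the proxy's own overhead; with buffering, it would grow with the length of the whole response.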

Sizing Your Deployment

Estimate the vCPUs a single instance needs for a target request rate Q (req/s):

vCPUs ≈ (14 + 0.0144 × Q) / 100

| Target throughput | vCPUs (approx.) |
|---|---|
| 5,000 req/s | ~0.9 |
| 10,000 req/s | ~1.6 |
| 25,000 req/s | ~3.7 |
| 50,000 req/s | ~7.3 |
| 100,000 req/s | ~14.5 |
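The sizing formula is easy to apply to your own target rate. The sketch below evaluates it for the same request rates as the table; `vcpus_needed` is an illustrative name, and the result is a pure-proxy estimate that does not include headroom or policy cost.

```python
def vcpus_needed(target_rps: float) -> float:
    """Estimated vCPUs for one instance, from vCPUs = (14 + 0.0144 * Q) / 100."""
    return (14.0 + 0.0144 * target_rps) / 100.0


for q in (5_000, 10_000, 25_000, 50_000, 100_000):
    print(f"{q:>7,} req/s -> ~{vcpus_needed(q):.1f} vCPUs")
```

Keep in mind that the fit was measured on a 4-vCPU instance, so estimates far beyond that size are extrapolations and worth confirming with your own benchmark.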

Guidance:

  • Leave headroom. Size for roughly 70–80% of saturation, not 100%, so latency stays low under bursts.
  • Scale out for HA and aggregate throughput. Run multiple replicas behind a load balancer; per-instance overhead stays flat as you add replicas.
  • Traffic controls add cost. These figures are a pure-proxy baseline. Authentication, rate limiting, guardrails, caching, and request logging each add per-request work — measure with your policy set enabled.
  • Results depend on request shape. Larger prompts and response bodies cost more per request; streaming and non-streaming differ. Benchmark with representative traffic before committing capacity.
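The headroom and scale-out guidance can be combined into a rough replica count. The sketch below assumes 4-vCPU instances with the 28,300 req/s saturation point measured here, a 75% utilization target (inside the 70–80% range suggested above), and one spare replica for availability. The spare replica and the function name `replicas_needed` are assumptions for illustration, not AISIX recommendations, and the numbers apply only to the pure-proxy baseline.

```python
import math

SATURATION_RPS_4VCPU = 28_300  # measured peak of one 4-vCPU instance in this benchmark


def replicas_needed(target_rps: float, utilization: float = 0.75, spare: int = 1) -> int:
    """Replica count that keeps each 4-vCPU instance at or below the utilization target."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    usable_rps = SATURATION_RPS_4VCPU * utilization  # per-instance rate with headroom
    return math.ceil(target_rps / usable_rps) + spare


for q in (20_000, 50_000, 100_000):
    print(f"{q:>7,} req/s -> {replicas_needed(q)} replicas")
```

With policies enabled or larger request bodies, replace the saturation constant with the figure from your own benchmark before relying on the result.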

How These Were Measured

The reference machine was a dedicated AWS c7i.4xlarge (16 vCPU). AISIX ran pinned to 4 vCPUs, with a load generator and a canned near-zero-latency mock upstream on separate cores. That isolation keeps the upstream and the load generator from ever becoming the bottleneck. Reported latency is the gateway's own overhead at a given rate, with no policies attached. Your results will vary with hardware, request shape, and enabled policies.
