
Version: 3.9.x

Implement RAG at the Gateway Layer

This guide shows how to configure Retrieval-Augmented Generation (RAG) in API7 AI Gateway so requests are enriched with context from your knowledge base before reaching the LLM.

Overview

Current limitation: RAG in API7 AI Gateway is Azure-only today. You must use Azure OpenAI for embeddings and Azure AI Search for vector search. Support for additional providers is planned but not implemented.

RAG augments LLM prompts with relevant context retrieved at request time from your vector knowledge base. Implementing this at the gateway layer centralizes enrichment logic and avoids application-side RAG orchestration in every service.

Architecture flow:

  1. Client sends a chat request to API7 AI Gateway.
  2. Gateway uses ai-rag to generate an embedding of the query via Azure OpenAI and run a vector search against Azure AI Search.
  3. Gateway injects retrieved context into the prompt.
  4. Gateway forwards the enriched request to Azure OpenAI through ai-proxy.
  5. Client receives a grounded answer.
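
Conceptually, the enrichment in step 3 turns the client's original messages into a prompt that carries the retrieved passages. The sketch below shows what the forwarded request body might look like; the exact injection format is determined by the ai-rag plugin and may differ between versions:

```json
{
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "user",
      "content": "<retrieved passage 1>\n<retrieved passage 2>\n<retrieved passage 3>\n\nBased on our internal docs, what are the main capabilities of API7 AI Gateway?"
    }
  ]
}
```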

Prerequisites

  • Install Docker.

  • Install cURL to send validation requests to the services.

  • Have a running API7 Gateway instance with the ai-proxy and ai-rag plugins available.

  • Create a token from the Dashboard and save it to an environment variable:

    export API_KEY=your-dashboard-token   # replace with your Dashboard token
  • Replace {gateway_group_id} with your gateway group ID. Use default if you are following the quickstart.

  • If you are following the Admin API examples, create or reuse a service in API7 Gateway. If you do not have one yet, follow Create or Reuse a Service, then save its ID to an environment variable:

    export SERVICE_ID=your-service-id         # replace with your service ID
  • An Azure OpenAI resource with a deployment for your generation model (for example, gpt-4o-mini).

  • An Azure OpenAI deployment of an embeddings model (for example, text-embedding-3-large).

  • Azure AI Search service with an index populated from your knowledge base.

Configure the RAG Plugin

Configure ai-rag and ai-proxy on the same route so retrieval and generation happen in one request path.

curl -k "https://localhost:7443/apisix/admin/routes?gateway_group_id={gateway_group_id}" -X PUT \
  -H "X-API-KEY: ${API_KEY}" \
  -d '{
    "id": "rag-azure",
    "service_id": "'"$SERVICE_ID"'",
    "paths": ["/v1/chat/completions"],
    "plugins": {
      "ai-proxy": {
        "provider": "azure-openai",
        "auth": {
          "header": {
            "api-key": "YOUR_AZURE_OPENAI_KEY"
          }
        },
        "options": {
          "model": "gpt-4o-mini"
        },
        "override": {
          "endpoint": "https://YOUR-RESOURCE.openai.azure.com/openai/deployments/YOUR-DEPLOYMENT/chat/completions?api-version=2024-10-21"
        }
      },
      "ai-rag": {
        "embeddings_provider": {
          "azure_openai": {
            "endpoint": "https://YOUR-RESOURCE.openai.azure.com/openai/deployments/text-embedding-3-large/embeddings?api-version=2023-05-15",
            "api_key": "YOUR_AZURE_OPENAI_KEY"
          }
        },
        "vector_search_provider": {
          "azure_ai_search": {
            "endpoint": "https://YOUR-SEARCH.search.windows.net/indexes/YOUR-INDEX/docs/search?api-version=2024-07-01",
            "api_key": "YOUR_AZURE_SEARCH_KEY"
          }
        }
      }
    }
  }'

ai-proxy handles generation and must set provider to "azure-openai" on this route.

ai-rag is configured with Azure-only backends: Azure OpenAI for embeddings (embeddings_provider) and Azure AI Search for vector retrieval (vector_search_provider).

The endpoint fields must be full Azure URLs, including your resource name, deployment name, and API version.

For the full configuration reference, see ai-rag and ai-proxy.
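
After applying the configuration, you can confirm the route exists by reading it back through the same Admin API (this assumes the Admin API address, token, and gateway group from the prerequisites):

```shell
# Fetch the route by the ID used above and check that both plugins are present
curl -k "https://localhost:7443/apisix/admin/routes/rag-azure?gateway_group_id={gateway_group_id}" \
  -H "X-API-KEY: ${API_KEY}"
```

The response should echo the route definition with both ai-proxy and ai-rag configured.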

Validate the Configuration

Send a request that includes the ai_rag field. The request must provide vector_search.fields and embeddings.input.

curl "http://127.0.0.1:9080/v1/chat/completions" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Based on our internal docs, what are the main capabilities of API7 AI Gateway?"
      }
    ],
    "ai_rag": {
      "vector_search": {
        "fields": {
          "content": "content",
          "title": "title",
          "url": "url"
        },
        "top_k": 3
      },
      "embeddings": {
        "input": "API7 AI Gateway capabilities overview"
      }
    }
  }'

ai_rag.vector_search.fields maps the Azure AI Search document fields used during retrieval.

ai_rag.embeddings.input is the text embedded for vector search and is required for retrieval.

Use a question that depends on your indexed knowledge base so you can confirm grounding behavior.

You should receive a standard chat completion response with an answer grounded in your indexed documents. Compared with a request without ai_rag, the answer should be more specific and aligned with your internal knowledge base.
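
To observe the grounding effect directly, send the same question without the ai_rag field and compare the two answers:

```shell
# Same route and question, but no ai_rag field: retrieval is skipped
curl "http://127.0.0.1:9080/v1/chat/completions" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Based on our internal docs, what are the main capabilities of API7 AI Gateway?"
      }
    ]
  }'
```

Without retrieval, the model answers only from its training data, so the response is typically more generic and may not reflect your internal documentation.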

Best Practices

  • Keep your knowledge base and index up to date. Stale documents reduce answer quality.
  • Tune top_k for your data shape. Too low can miss context; too high can dilute prompts.
  • Monitor token usage closely. Added context increases prompt tokens and cost.
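
For the token-usage point above, one lightweight check is to inspect the usage object returned with each completion (this assumes jq is installed and that the response follows the standard chat completion format; prompt_tokens grows with the injected context):

```shell
# Compare prompt_tokens for the same question with and without ai_rag
curl -s "http://127.0.0.1:9080/v1/chat/completions" -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What are the main capabilities of API7 AI Gateway?"}],
    "ai_rag": {
      "vector_search": {"fields": {"content": "content"}, "top_k": 3},
      "embeddings": {"input": "API7 AI Gateway capabilities overview"}
    }
  }' | jq '.usage'
```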
