Implement Prompt Guardrails

Prompt guardrails provide additional safeguards when working with large language models (LLMs). They help protect user privacy, prevent unintended or harmful model behaviors, discourage hallucinated responses, and maintain compliance with responsible AI standards.

In this document, you will learn a few recommended practices for implementing prompt guardrails, including defining allow and deny patterns, moderating content for toxicity, redacting sensitive information, and discouraging undesired outputs and hallucinations.

Prerequisite(s)

Implement Allow and Deny Patterns

During LLM integration, implementing allow and deny patterns is a key practice for enhancing security and controlling the quality of user interactions. By defining explicit rules that permit or block specific types of input, organizations can prevent confidential or inappropriate content from reaching the model. This approach not only protects against potential misuse and harmful outputs but also ensures compliance with regulatory standards and internal policies. Such guardrails are crucial for maintaining the integrity and reliability of AI systems, especially when handling sensitive data or user-generated content.

The ai-prompt-guard plugin helps enforce these guardrails by inspecting and validating incoming prompt messages at the gateway. It checks the content of requests against user-defined allow and deny patterns to ensure that only approved inputs are forwarded to the upstream LLM. Based on its configuration, the plugin can examine either just the latest message or the entire conversation history, and it can be set to check prompts from all roles or only from end users.
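
A minimal configuration sketch is shown below, assuming an OpenAI-compatible upstream, a local Admin API listener, a placeholder admin key, and a hypothetical route ID. The ai-prompt-guard fields used here (allow_patterns, deny_patterns, match_all_conversation_history, match_all_roles) reflect the behavior described above; confirm the exact schema in the plugin reference for your version.

```python
# Hypothetical sketch: attaching the ai-prompt-guard plugin to a route via the
# APISIX Admin API. Addresses, keys, and the route ID are placeholders.
import requests

ADMIN_API = "http://127.0.0.1:9180/apisix/admin"   # placeholder Admin API address
ADMIN_KEY = "your-admin-api-key"                    # placeholder admin key

route = {
    "uri": "/v1/chat/completions",
    "upstream": {
        "type": "roundrobin",
        "scheme": "https",
        "nodes": {"api.openai.com:443": 1},
    },
    "plugins": {
        "ai-prompt-guard": {
            # Only prompts matching at least one allow pattern are let through.
            "allow_patterns": ["billing", "order status"],
            # Prompts matching any deny pattern are rejected at the gateway.
            "deny_patterns": ["password", "credit card"],
            # Inspect only the latest message rather than the full history.
            "match_all_conversation_history": False,
            # Check prompts from end users only, not from all roles.
            "match_all_roles": False,
        }
    },
}

resp = requests.put(
    f"{ADMIN_API}/routes/ai-guarded-route",          # hypothetical route ID
    headers={"X-API-KEY": ADMIN_KEY},
    json=route,
    timeout=10,
)
resp.raise_for_status()
```

With this configuration, a prompt containing a denied term such as "credit card" is rejected at the gateway before it reaches the model, while prompts matching an allow pattern are forwarded as usual.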

Moderate Content for Toxicity

Content moderation for toxicity in user prompts helps ensure a safe and respectful environment for users. Because LLMs generate responses based on user input, it is crucial to handle and filter out harmful content, such as profanity, hate speech, insults, harassment, violence, and threats, before it is processed by the model.

The ai-aws-content-moderation and ai-aliyun-content-moderation plugins enforce these guardrails by analyzing input prompts for toxic or unsafe content and evaluating them against configurable thresholds for each moderation category. If a request exceeds any configured threshold, it is rejected at the gateway before being forwarded to the upstream LLM.
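
The sketch below, reusing the placeholder route and admin key from the previous example, enables the ai-aws-content-moderation plugin with per-category thresholds. The field names (comprehend, moderation_categories) and category labels are assumptions based on the plugin's documented behavior and AWS Comprehend's toxicity labels; verify them against the plugin reference. The ai-aliyun-content-moderation plugin is configured in a similar way using Alibaba Cloud credentials.

```python
# Hypothetical sketch: enabling toxicity moderation on the same route via the
# APISIX Admin API. Credentials, region, and field names are placeholders or
# assumptions; check the ai-aws-content-moderation plugin reference.
import requests

ADMIN_API = "http://127.0.0.1:9180/apisix/admin"   # placeholder Admin API address
ADMIN_KEY = "your-admin-api-key"                    # placeholder admin key

plugins = {
    "ai-aws-content-moderation": {
        # AWS credentials used to call the Comprehend moderation service.
        "comprehend": {
            "access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
            "secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
            "region": "us-east-1",
        },
        # Requests whose score in any category exceeds its threshold are rejected.
        "moderation_categories": {
            "PROFANITY": 0.5,
            "HATE_SPEECH": 0.5,
            "VIOLENCE_OR_THREAT": 0.5,
        },
    }
}

resp = requests.patch(
    f"{ADMIN_API}/routes/ai-guarded-route",          # hypothetical route ID
    headers={"X-API-KEY": ADMIN_KEY},
    json={"plugins": plugins},
    timeout=10,
)
resp.raise_for_status()
```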

Discourage Hallucinations and Undesired Output

Hallucinations refer to instances where the model generates information that is factually incorrect, misleading, or entirely fabricated, even though it may sound plausible or confident. There are different approaches to mitigate hallucinations, one of which is to pre-engineer system prompts. For example, you can configure the following system prompt:

Before you respond to the user message, on a scale of 0 to 10, how confident are you with your response? If your confidence level is lower than 8/10, respond with "Sorry, I do not have an answer that I am confident with" and explain the reasoning. If your confidence level is higher than or equal to 8/10, you may return the response to the user.
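
For illustration, the sketch below places this guardrail in the system role of an OpenAI-style chat request sent through a gateway route. The endpoint, model name, and payload shape are assumptions; adapt them to your provider and route, or inject the prompt at the gateway as described in the next section.

```python
# Illustrative sketch: prepending the confidence-gating system prompt to a chat
# request. The gateway address, route path, and model name are placeholders.
import requests

CONFIDENCE_GUARD = (
    "Before you respond to the user message, on a scale of 0 to 10, how confident "
    "are you with your response? If your confidence level is lower than 8/10, "
    'respond with "Sorry, I do not have an answer that I am confident with" and '
    "explain the reasoning. If your confidence level is higher than or equal to "
    "8/10, you may return the response to the user."
)

payload = {
    "model": "gpt-4o",  # placeholder model name
    "messages": [
        {"role": "system", "content": CONFIDENCE_GUARD},
        {"role": "user", "content": "What was our company's Q3 revenue?"},
    ],
}

resp = requests.post(
    "http://127.0.0.1:9080/v1/chat/completions",  # placeholder gateway route
    json=payload,
    timeout=30,
)
print(resp.json())
```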

You can also pre-engineer system prompts to discourage undesired output. For example, you may want responses to avoid quoting copyrighted content or referencing controversial sources. You can configure the following system prompt:

Provide all responses based on factual information, avoiding any quotes from copyrighted materials. Do not reference or include information from controversial or unreliable sources. Ensure that all content is original, non-derivative, and based on widely accepted, publicly available information.

See Configure Prompt Decorators to learn how you can configure these pre-engineered prompts.
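
As a sketch of that approach, the ai-prompt-decorator plugin can prepend the pre-engineered system prompt at the gateway so that every request on the route carries it. The route ID, admin key, and the prepend, role, and content fields shown here are assumptions; confirm them in the plugin reference.

```python
# Hypothetical sketch: injecting the pre-engineered system prompt at the gateway
# with the ai-prompt-decorator plugin. Route ID and admin key are placeholders.
import requests

ADMIN_API = "http://127.0.0.1:9180/apisix/admin"   # placeholder Admin API address
ADMIN_KEY = "your-admin-api-key"                    # placeholder admin key

plugins = {
    "ai-prompt-decorator": {
        # Prepend a system message so the guardrail applies to every request.
        "prepend": [
            {
                "role": "system",
                "content": (
                    "Provide all responses based on factual information, avoiding "
                    "any quotes from copyrighted materials. Do not reference or "
                    "include information from controversial or unreliable sources. "
                    "Ensure that all content is original, non-derivative, and based "
                    "on widely accepted, publicly available information."
                ),
            }
        ]
    }
}

resp = requests.patch(
    f"{ADMIN_API}/routes/ai-guarded-route",          # hypothetical route ID
    headers={"X-API-KEY": ADMIN_KEY},
    json={"plugins": plugins},
    timeout=10,
)
resp.raise_for_status()
```

Because the prompt is injected at the gateway, client applications do not need to include it themselves, and it can be updated centrally without any client changes.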

Redact Sensitive Information (PII)

Redacting sensitive information is a key aspect of prompt guardrails, especially when handling user-generated content. By detecting and masking Personally Identifiable Information (PII) in prompts, you can reduce the risk of accidental data exposure, support compliance with privacy regulations, and prevent sensitive data from being sent to upstream LLMs.

API7 Enterprise will soon provide support for integration with external guardrail solutions, such as Amazon Bedrock Guardrails.

Additionally, API7 Enterprise provides the data-mask plugin, which masks sensitive information in request headers, bodies, and URL query parameters when requests are logged by logging plugins. Note that this plugin does not modify the actual request or response traffic. The plugin is not available in APISIX.
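
As a purely conceptual illustration of redaction, not how the data-mask plugin works internally, the sketch below masks two common PII patterns in application code before a prompt is sent upstream. The regular expressions are deliberately simplistic and only meant to convey the idea.

```python
# Conceptual sketch only: mask email addresses and US-style phone numbers in a
# prompt before it leaves the application. Real PII detection typically relies
# on dedicated services rather than hand-written regular expressions.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace recognized PII with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

print(redact_pii("Contact me at jane.doe@example.com or 555-123-4567."))
# -> Contact me at [REDACTED_EMAIL] or [REDACTED_PHONE].
```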

Next Steps

You have now learned a few recommended practices for implementing prompt guardrails to provide additional safeguards when integrating with LLM service providers.

Other types of guardrails, such as denied topics and content filters, also exist, along with alternative implementation strategies. Reviewing additional resources and exploring different approaches can help you determine the strategies that best align with your organization's requirements.
