Version: 3.8.x

Trigger Gateway Alerts

Abnormal traffic patterns or errors in API Gateway usage can indicate problems or malicious attacks. By setting up alerts for certain thresholds and activities, you can quickly detect and gain insights into patterns that might indicate a security breach, abuse, or abnormal usage.

This tutorial guides you through creating alert policies to receive email and webhook notifications for specific events.

Prerequisites

Install API7 Enterprise.
Have a running API on the gateway group.
Get the webhook URL of your notification system.

Set Up SMTP Server

Select Organization from the top navigation bar, and then select Settings.
Click the SMTP Server tab.
Click Enable.
In the dialog box, do the following:

In the SMTP Server Address field, enter the address of your SMTP server. For example, 127.0.0.1.
In the Username and Password fields, enter the credentials used to connect to your SMTP server.
In the From Name field, enter API7 Enterprise to display this name as the sender in the email.
In the From Email Address field, enter noreply@api7.ai. This is used as the actual sender address.
Click Enable.

Add Contact Points

A Contact Point defines a set of email addresses or webhook URLs that can be used by multiple alert policies.

Add an Email Contact Point

Select Organization from the top navigation bar, and then select Contact Points.
Click Add Contact Points.
In the dialog box, do the following:

In the Name field, enter Emergency Team Email List.
In the Type field, choose Email.
In the Email Addresses field, enter the email addresses of the recipients, for example, emergencyteamlist@api7.ai.
Click Add.

Add a Webhook Contact Point

Use a Slack incoming webhook to post messages from API7 Enterprise into Slack.

Select Organization from the top navigation bar, and then select Contact Points.
Click Add Contact Points.
In the dialog box, do the following:

In the Name field, enter Slack Notification.
In the Type field, choose Webhook.
In the URL field, enter https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX. Replace the placeholder IDs with those of your own Slack incoming webhook.
Click Add.
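
Before relying on the contact point, it can help to confirm that the webhook URL accepts messages. The following is a minimal sketch, not part of API7 Enterprise: it builds the JSON body that Slack incoming webhooks accept (an object with a `text` field) and posts it. The `SLACK_WEBHOOK_URL` environment variable and the function names are assumptions made for this example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// slackPayload builds the JSON body for a Slack incoming webhook:
// a single object with a "text" field.
func slackPayload(text string) ([]byte, error) {
	return json.Marshal(map[string]string{"text": text})
}

// postJSON sends the payload to the webhook URL and treats any
// non-2xx response as an error.
func postJSON(url string, body []byte) error {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("webhook returned %s", resp.Status)
	}
	return nil
}

func main() {
	body, err := slackPayload("Test message for the Slack Notification contact point")
	if err != nil {
		fmt.Fprintln(os.Stderr, "build payload:", err)
		os.Exit(1)
	}
	// SLACK_WEBHOOK_URL is a hypothetical variable used only by this sketch.
	url := os.Getenv("SLACK_WEBHOOK_URL")
	if url == "" {
		fmt.Println("SLACK_WEBHOOK_URL not set; payload would be:", string(body))
		return
	}
	if err := postJSON(url, body); err != nil {
		fmt.Fprintln(os.Stderr, "post failed:", err)
		os.Exit(1)
	}
	fmt.Println("webhook accepted the test message")
}
```

If the test message appears in the Slack channel, the same URL should work when entered in the URL field of the contact point.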

Add Essential Alert Policies

The following alert policies are crucial for most users, so configuring them is strongly recommended.

Monitor Control Plane to Data Plane mTLS Certificate Expiration

The API7 control plane certificate and the API7 control plane CA certificate enable secure mTLS communication between the control plane and the data plane. They are activated when a gateway instance is deployed and have a 13-month validity period.

To proactively monitor and alert on expiring certificates on gateway instances, implement a daily task to check certificate expiration dates. If a gateway instance's certificate is nearing expiration (within 30 days), send email alerts to the emergency team and post a notification to Slack.

Select Alert from the side navigation bar, then click Policies.
Click Add Alert Policy.
In the dialog box, do the following:

In the Name field, enter Gateway Instance Certificate Expired.
In the Severity field, choose High.
In the Check Interval field, enter 1440 minutes.
In the Conditions field, do the following:
- In the Operator field, choose Meet all of the following conditions(AND).
- In the Event field, choose mTLS certificate between control plane and data plane will expire.
- In the Trigger Gateway Group field, choose Select all.
- In the Rule field, fill in the blanks so that the rule reads: mTLS certificate between data plane and control plane will expire in 30 days.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Email.
- In the Contact Points field, choose Emergency Team Email List.
- In the Alert Email Subject field, enter:
```
[API7 Alert] Gateway Instance Certificate Expiration Warning
```
- In the Alert Email Content field, enter:
```
Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}.
```
- Click Add.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Webhook.
- In the Contact Points field, choose Slack Notification.
- In the Alert Message field, enter:
```
"text": "{{.AlertDetail}}"
```
- Click Add.

Click Add.

Validate

Imagine a control plane certificate expiring on 2024-12-31. On 2024-12-10, the alert policy triggers:

An email with the following content:

* Subject: [API7 Alert] Gateway Instance Certificate Expiration Warning
* Alert Time: 2024 Dec 10 17:00:00, Detail: The certificate for gateway instance: gateway 123 will expire in 21 days.

A message in Slack:

The certificate for gateway instance: gateway 123 will expire in 21 days.

Select Alert from the side navigation bar, then click History.
An alert record corresponding to the event will be displayed. Click Detail:

Alert Policy: Gateway Instance Certificate Expired
Severity: High
Alert Time: 5 minutes ago
Trigger Gateway Group: production group
Alert Detail: The certificate for gateway instance: gateway 123 will expire in 21 days.

Detect Gateway Instance Offline

If a gateway instance (data plane node) has not reported a heartbeat to the control plane for more than 2 hours and this state persists for 7 days, the data plane node will be automatically removed and marked offline.

Implement an hourly task to detect offline gateway instances, and send email alerts to the emergency team and Slack notifications in case of issues so that someone can try to recover the offline instances.

Select Alert from the side navigation bar, then click Policies.
Click Add Alert Policy.
In the dialog box, do the following:

In the Name field, enter Gateway Instance Offline.
In the Severity field, choose High.
In the Check Interval field, enter 60 minutes.
In the Conditions field, do the following:
- In the Operator field, choose Meet all of the following conditions(AND).
- In the Event field, choose Gateway instance offline.
- In the Trigger Gateway Group field, choose Select all.
- In the Rule field, fill in the blanks so that the rule reads: Any gateway instance in the gateway group offline for more than 1 hour.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Email.
- In the Contact Points field, choose Emergency Team Email List.
- In the Alert Email Subject field, enter:
```
[API7 Alert] Gateway Instance Offline Warning
```
- In the Alert Email Content field, enter:
```
Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}.
```
- Click Add.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Webhook.
- In the Contact Points field, choose Slack Notification.
- In the Alert Message field, enter:
```
"text": "{{.AlertDetail}}"
```
- Click Add.

Click Add.

Validate

Imagine that two gateway instances went offline at 2024-12-31 14:00:00 and 2024-12-31 13:00:00. On 2024-12-31 17:00:00, the alert policy triggers:

An email with the following content:

* Subject: [API7 Alert] Gateway Instance Offline Warning
* Alert Time: 2024 Dec 31 17:00:00, Detail: Gateway instance: gateway 123 in the gateway group: production group has been offline for 3 hours.
Gateway instance: gateway 456 in the gateway group: test group has been offline for 4 hours.

A message in Slack:

Gateway instance: gateway 123 in the gateway group: production group has been offline for 3 hours. 
Gateway instance: gateway 456 in the gateway group: test group has been offline for 4 hours.

A record in the alert history. Select Alert from the side navigation bar, then click History to see it.
Click Detail on the record to see the following:

Alert Policy: Gateway Instance Offline
Severity: High
Alert Time: 5 minutes ago
Trigger Gateway Group: production group
Alert Detail: Gateway instance: gateway 123 in the gateway group: production group has been offline for 3 hours. Gateway instance: gateway 456 in the gateway group: test group has been offline for 4 hours.

Detect CPU Cores Exceeding Quota

If the total CPU core usage across all gateway groups exceeds the licensed CPU core limit for seven consecutive days, adding or modifying resources will be restricted. However, existing services and routes will continue to function.

Implement an hourly task to check all gateway groups in production environments, and send email alerts to the emergency team and Slack notifications in case of issues.

Select Alert from the side navigation bar, then click Policies.
Click Add Alert Policy.
In the dialog box, do the following:

In the Name field, enter CPU Cores Exceeding Quota.
In the Severity field, choose High.
In the Check Interval field, enter 60 minutes.
In the Conditions field, do the following:
- In the Operator field, choose Meet all of the following conditions(AND).
- In the Event field, choose Allowed License CPU Quota Exceeded.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Email.
- In the Contact Points field, choose Emergency Team Email List.
- In the Alert Email Subject field, enter:
```
[API7 Alert] CPU Cores Exceeding Quota
```
- In the Alert Email Content field, enter:
```
Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}.
```
- Click Add.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Webhook.
- In the Contact Points field, choose Slack Notification.
- In the Alert Message field, enter:
```
"text": "{{.AlertDetail}}"
```
- Click Add.

Click Add.

Validate

Assume that your API7 Enterprise license has a limit of 100 CPU cores. On 2024-12-31 17:00:00, the alert policy triggers:

An email with the following content:

* Subject: [API7 Alert] CPU Cores Exceeding Quota
* Alert Time: 2024 Dec 31 17:00:00, Detail: Total CPU usage 110c has exceeded the allowed license CPU quota 100c.

A message in Slack:

Total CPU usage 110c has exceeded the allowed license CPU quota 100c.

A record in the alert history. Select Alert from the side navigation bar, then click History to see it.
Click Detail on the record to see the following:

Alert Policy: CPU Cores Exceeding Quota
Severity: High
Alert Time: 5 minutes ago
Alert Detail: Total CPU usage 110c has exceeded the allowed license CPU quota 100c.

More Alert Policy Examples

Monitor SSL Certificate Expiration

To proactively monitor and alert on expiring SSL certificates, implement a daily task to check certificate expiration dates.

If a certificate is nearing expiration (within 30 days), send email alerts to the emergency team and post a notification to Slack.

Select Alert from the side navigation bar, then click Policies.
Click Add Alert Policy.
In the dialog box, do the following:

In the Name field, enter SSL Certificate Expired.
In the Severity field, choose Medium.
In the Check Interval field, enter 1440 minutes.
In the Conditions field, do the following:
- In the Operator field, choose Meet all of the following conditions(AND).
- In the Event field, choose SSL Certificate will expire.
- In the Trigger Gateway Group field, choose Select all.
- In the Rule field, fill in the blanks so that the rule reads: SSL certificate will expire in 30 days.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Email.
- In the Contact Points field, choose Emergency Team Email List.
- In the Alert Email Subject field, enter:
```
[API7 Alert] SSL Certificate Expiration Warning
```
- In the Alert Email Content field, enter:
```
Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}.
```
- Click Add.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Webhook.
- In the Contact Points field, choose Slack Notification.
- In the Alert Message field, enter:
```
"text": "{{.AlertDetail}}"
```
- Click Add.

Click Add.

Validate

Suppose that an SSL certificate will expire on 2024-12-31. On 2024-12-10, the alert policy will trigger:

An email with the following content:

* Subject: [API7 Alert] SSL Certificate Expiration Warning
* Alert Time: 2024 Dec 10 17:00:00, Detail: SSL Certificate: sslcert123 in gateway group: production group expires in 21 days.

A message in Slack:

SSL Certificate: sslcert123 in gateway group: production group expires in 21 days.

A record in the alert history. Select Alert from the side navigation bar, then click History to see it.
Click Detail on the record to see the following:

Alert Policy: SSL Certificate Expired
Severity: Medium
Alert Time: 5 minutes ago
Trigger Gateway Group: production group
Alert Detail: SSL Certificate: sslcert123 in gateway group: production group expires in 21 days.

Count Healthy Gateway Instances in a Gateway Group

If the number of healthy gateway instances in a gateway group falls below a critical threshold, it indicates potential service disruptions and impacts on traffic handling. This scenario is particularly relevant in Kubernetes deployments, where gateway instances may experience failures or be scaled down unexpectedly.

Implement a high-frequency task to detect this condition, and send email alerts to the emergency team and Slack notifications in case of issues.

Select Alert from the side navigation bar, then click Policies.
Click Add Alert Policy.
In the dialog box, do the following:

In the Name field, enter Not Enough Healthy Gateway Instances in Production Group.
In the Severity field, choose Medium.
In the Check Interval field, enter 30 minutes.
In the Trigger Gateway Group field, choose Production Group.
In the Conditions field, do the following:
- In the Operator field, choose Meet all of the following conditions(AND).
- In the Event field, choose Number of healthy gateway instances.
- In the Rule field, fill in the blanks so that the rule reads: Number of gateway instances in the gateway group is less than 50.
Click Add Notification.

In the dialog box, do the following:
- In the Type field, choose Email.
- In the Contact Points field, choose Emergency Team Email List.
- In the Alert Email Subject field, enter `[API7 Alert] Not Enough Healthy Gateway Instances in Production Group`
- In the Alert Email Content field, enter `Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}.`
- Click Add.

Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Webhook.
- In the Contact Points field, choose Slack Notification.
- In the Alert Message field, enter:
```
"text": "{{.AlertDetail}}"
```
- Click Add.

Click Add.

Validate

Assume that your gateway group requires a minimum of 50 healthy gateway instances, but as of 2024-12-31 only 40 instances are operational. This shortfall may lead to service degradation and potential outages and requires immediate attention, so the alert policy triggers:

An email with the following content:

* Subject: [API7 Alert] Not Enough Healthy Gateway Instances in Production Group
* Alert Time: 2024 Dec 31 17:00:00, Detail: The number of healthy gateway instances 40 in gateway group: Production Group is less than the minimum requirement of 50.

A message in Slack:

The number of healthy gateway instances 40 in gateway group: Production Group is less than the minimum requirement of 50.

A record in the alert history. Select Alert from the side navigation bar, then click History to see it.
Click Detail on the record to see the following:

Alert Policy: Not Enough Healthy Gateway Instances in Production Group
Severity: Medium
Alert Time: 5 minutes ago
Trigger Gateway Group: Production Group
Alert Detail: The number of healthy gateway instances 40 in gateway group: Production Group is less than the minimum requirement of 50.

Monitor Status Code

If the number of API responses with a specific status code exceeds a threshold, for example too many 500 errors, it indicates potential service disruptions and impacts on traffic handling.

Implement a high-frequency task to detect this condition, and send email alerts to the emergency team and Slack notifications in case of issues.

Select Alert from the side navigation bar, then click Policies.
Click Add Alert Policy.
In the dialog box, do the following:

In the Name field, enter Too many 500 status code in production gateway groups.
In the Severity field, choose Medium.
In the Check Interval field, enter 30 minutes.
In the Trigger Gateway Group field, select Match Label, then enter the key/value pair envType: production.
In the Conditions field, do the following:
- In the Operator field, choose Meet any of the following conditions(OR).
- In the Event field, choose Number of status code 500.
- In the Rule field, fill in the blanks so that the rule reads: Number of requests with status code 500 received by all published services of any one of the gateway groups has reached or exceeded 100 times in the last 60 minutes.
Click Add Condition.
In the Conditions field, do the following:
- In the Event field, choose Ratio of status code 500.
- In the Rule field, fill in the blanks so that the rule reads: Ratio of requests with status code 500 received by all published services of any one of the gateway groups has reached or exceeded 10% in the last 60 minutes.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Email.
- In the Contact Points field, choose Emergency Team Email List.
- In the Alert Email Subject field, enter:
```
[API7 Alert] Too many 500 status code in {{.TriggerGatewayGroup}}.
```
- In the Alert Email Content field, enter:
```
Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}.
```
- Click Add.
Click Add Notification.
In the dialog box, do the following:
- In the Type field, choose Webhook.
- In the Contact Points field, choose Slack Notification.
- In the Alert Message field, enter:
```
"text": "{{.AlertDetail}}"
```
- Click Add.

Click Add.

Validate

Assume that the gateway group VIP Group, which has the label envType:production, experienced a 15% error rate between 16:00 and 17:00 on December 31, 2024: out of 1000 requests, 150 resulted in 500 errors. The gateway group US Group, which also has the label envType:production, experienced a 10% error rate in the same period: out of 500 requests, 50 resulted in 500 errors. The alert policy triggers:

An email with the following content:

* Subject: [API7 Alert] Too many 500 status code in VIP Group,US Group
* Alert Time: 2024 Dec 31 17:00:00, Detail: The number of 500 status code requests received by all published services in gateway group: VIP Group exceeded the threshold of 100 with a count of 150 in the last 60 minutes. Details: 100 requests for get-ip-route within httpbin-service, 40 requests for get-address-route within httpbin-service, and 10 unmatched requests.
500 status code request ratio for gateway group: VIP Group was 15% in the last 60 minutes (Total requests: 1000). Details: 100 requests (10%) for get-ip-route in httpbin-service, 40 requests (4%) for get-address-route in httpbin-service, and 10 (1%) for unmatched requests.
500 status code request ratio for gateway group: US Group was 10% in the last 60 minutes (Total requests: 500). Details: 50 requests (10%) for get-ip-route in httpbin-service.

A message in Slack:

The number of 500 status code requests received by all published services in gateway group: VIP Group exceeded the threshold of 100 with a count of 150 in the last 60 minutes. Details: 100 requests for get-ip-route within httpbin-service, 40 requests for get-address-route within httpbin-service, and 10 unmatched requests.
500 status code request ratio for gateway group: VIP Group was 15% in the last 60 minutes (Total requests: 1000). Details: 100 requests (10%) for get-ip-route in httpbin-service, 40 requests (4%) for get-address-route in httpbin-service, and 10 (1%) for unmatched requests.
500 status code request ratio for gateway group: US Group was 10% in the last 60 minutes (Total requests: 500). Details: 50 requests (10%) for get-ip-route in httpbin-service.

A record in the alert history. Select Alert from the side navigation bar, then click History to see it.
Click Detail on the record to see the following:

Alert Policy: Too many 500 status code in production gateway groups
Severity: Medium
Alert Time: 5 minutes ago
Trigger Gateway Group: VIP Group
Alert Detail: The number of 500 status code requests received by all published services in gateway group: VIP Group exceeded the threshold of 100 with a count of 150 in the last 60 minutes. Details: 100 requests for get-ip-route within httpbin-service, 40 requests for get-address-route within httpbin-service, and 10 unmatched requests. 500 status code request ratio for gateway group: VIP Group was 15% in the last 60 minutes (Total requests: 1000). Details: 100 requests (10%) for get-ip-route in httpbin-service, 40 requests (4%) for get-address-route in httpbin-service, and 10 (1%) for unmatched requests. 500 status code request ratio for gateway group: US Group was 10% in the last 60 minutes (Total requests: 500). Details: 50 requests (10%) for get-ip-route in httpbin-service.
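
This policy combines two conditions with OR, so a gateway group triggers the alert when either its count of 500 responses reaches 100 or its ratio of 500 responses reaches 10%. The sketch below works through the two groups from this scenario. The `groupStats` type and `evaluate` function are hypothetical and only illustrate the threshold logic, not how API7 Enterprise implements it.

```go
package main

import "fmt"

// groupStats holds request counts for one gateway group in the
// evaluation window (the last 60 minutes in this scenario).
type groupStats struct {
	Name   string
	Total  int // all requests in the window
	Errors int // requests that returned status code 500
}

// evaluate reports which conditions a group meets: the count condition
// (Errors >= countThreshold) and the ratio condition
// (Errors/Total >= ratioPercent%). Integer math avoids rounding issues.
func evaluate(s groupStats, countThreshold, ratioPercent int) (countHit, ratioHit bool) {
	countHit = s.Errors >= countThreshold
	if s.Total > 0 {
		ratioHit = s.Errors*100 >= ratioPercent*s.Total
	}
	return countHit, ratioHit
}

func main() {
	groups := []groupStats{
		{Name: "VIP Group", Total: 1000, Errors: 150},
		{Name: "US Group", Total: 500, Errors: 50},
	}
	for _, g := range groups {
		countHit, ratioHit := evaluate(g, 100, 10)
		// OR operator: either condition is enough to trigger the alert.
		fmt.Printf("%s: count condition %t, ratio condition %t, alert %t\n",
			g.Name, countHit, ratioHit, countHit || ratioHit)
	}
}
```

VIP Group meets both conditions (150 errors, 15%), while US Group meets only the ratio condition (50 errors, 10%). That is why the alert detail contains a count entry and a ratio entry for VIP Group but only a ratio entry for US Group.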

Additional Resources

References
- Alert Templates
