Trigger Gateway Alerts
Abnormal traffic patterns or errors in API Gateway usage can indicate problems or malicious attacks. By setting up alerts for certain thresholds and activities, you can quickly detect and gain insights into patterns that might indicate a security breach, abuse, or abnormal usage.
This tutorial guides you through creating alert policies to receive email and webhook notifications for specific events.
Prerequisites
- Install API7 Enterprise.
- Have a running API on the gateway group.
- Get the webhook URL of your notification system.
Set Up SMTP Server
- Select Organization from the top navigation bar, and then select Settings.
- Click the SMTP Server tab.
- Click Enable.
- In the dialog box, do the following:
  - In the SMTP Server Address field, enter the address of your SMTP server. For example, `127.0.0.1`.
  - In the Username and Password fields, enter the credentials to connect to your SMTP server.
  - In the From Name field, enter `API7 Enterprise` to display this name as the sender in the email.
  - In the From Email Address field, enter `noreply@api7.ai`. This will be used as the actual sender address.
  - Click Enable.
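Before enabling the SMTP server in API7 Enterprise, it can help to confirm that the server accepts your credentials. The sketch below is one way to do that with Python's standard library; it is not part of API7 Enterprise. The port (587), the use of STARTTLS, and the recipient address are assumptions, so adjust them to match your mail server.

```python
import smtplib
from email.message import EmailMessage


def build_test_email(from_name: str, from_addr: str, to_addr: str) -> EmailMessage:
    """Build a small test message using the sender values from the steps above."""
    msg = EmailMessage()
    msg["From"] = f"{from_name} <{from_addr}>"
    msg["To"] = to_addr
    msg["Subject"] = "SMTP connectivity test"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_test_email(host: str, port: int, username: str, password: str,
                    msg: EmailMessage) -> None:
    """Send the message. The port and STARTTLS usage are assumptions about your server."""
    with smtplib.SMTP(host, port, timeout=10) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(msg)


msg = build_test_email("API7 Enterprise", "noreply@api7.ai", "emergencyteamlist@api7.ai")
print(msg["From"])
# To actually send the message, call for example:
# send_test_email("127.0.0.1", 587, "<username>", "<password>", msg)
```

If the test message arrives, the same address and credentials should work in the SMTP Server dialog box.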
Add Contact Points
A Contact Point defines a set of email addresses or webhook URLs that can be used by multiple alert policies.
Add an Email Contact Point
- Select Organization from the top navigation bar, and then select Contact Points.
- Click Add Contact Points.
- In the dialog box, do the following:
  - In the Name field, enter `Emergency Team Email List`.
  - In the Type field, choose `Email`.
  - In the Email Addresses field, enter `emergencyteamlist@api7.ai`.
  - Click Add.
Add a Webhook Contact Point
Use a Slack incoming webhook to post messages from API7 Enterprise into Slack.
- Select Organization from the top navigation bar, and then select Contact Points.
- Click Add Contact Points.
- In the dialog box, do the following:
  - In the Name field, enter `Slack Notification`.
  - In the Type field, choose `Webhook`.
  - In the URL field, enter `https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX`.
  - Click Add.
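To check that the webhook URL works before relying on it for alerts, you can post a test message yourself. The following is a minimal sketch using only Python's standard library, based on the fact that Slack incoming webhooks accept a JSON body with a `text` field. The function names are illustrative, and the URL in the comment is the placeholder from the steps above, not a working endpoint.

```python
import json
import urllib.request


def build_slack_payload(text: str) -> bytes:
    """Encode a message as the JSON body expected by a Slack incoming webhook."""
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(webhook_url: str, text: str) -> int:
    """POST the message to the webhook and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


payload = build_slack_payload("Test message from the API7 Enterprise alert setup")
print(payload.decode("utf-8"))
# To send for real, replace the placeholder with your own webhook URL:
# post_to_slack("https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX", "hello")
```

The `text` key here is the same key used in the Alert Message field of the webhook notifications configured later in this tutorial.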
Add an Alert Policy
For more alert email content examples, see Alert Variables and Templates. Alert policies start monitoring immediately after creation or update, with the initial check occurring at the specified check interval.
Monitor SSL Certificate Expiration
To proactively monitor and alert on expiring SSL certificates, implement a daily task to check certificate expiration dates.
If a certificate is nearing expiration (within 30 days), send email alerts to the emergency team and post a notification to Slack.
- Select Alert from the side navigation bar, then click Policies.
- Click Add Alert Policy.
- In the dialog box, do the following:
  - In the Name field, enter `Control Plane Certificate Expired`.
  - In the Severity field, choose `High`.
  - In the Check Interval field, enter `1440` minutes.
  - In the Conditions field, do the following:
    - In the Operator field, choose `Meet all of the following conditions(AND)`.
    - In the Event field, choose `Control Plane Certificate will expire`.
    - In the Trigger Gateway Group field, choose `Select all`.
    - In the Rule field, fill in the blanks to `Control plane certificate will expire in 30 days`.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Email`.
    - In the Contact Points field, choose `Emergency Team Email List`.
    - In the Alert Email Subject field, enter `[API7 Alert] [{{.Severity}}]Control Plane Certificate Expiration Warning`.
    - In the Alert Email Content field, enter `Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}`.
    - Click Add.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Webhook`.
    - In the Contact Points field, choose `Slack Notification`.
    - In the Alert Message field, enter `"text": "{{.AlertDetail}}."`.
    - Click Add.
  - Click Add.
Validate
Imagine that a gateway instance `gateway 123` in the gateway group `production group` has a control plane certificate that will expire on 2024-12-31.
- Select Alert from the side navigation bar, then click History.
- You should see an alert record for this case. Click Detail:
  - Alert Policy: Control Plane Certificate Expired
  - Severity: High
  - Alert Time: 5 minutes ago
  - Trigger Gateway Group: production group
  - Alert Detail: Certificate of gateway instance: gateway 123 will expire in 21 days.
- The email sent will look like this:
  * Subject: [API7 Alert] [High]Control Plane Certificate Expiration Warning
  * Alert Time: 2024 Dec 10 17:00:00, Detail: Certificate of gateway instance: gateway 123 will expire in 21 days.
- You will receive a message in Slack:
  Certificate of gateway instance: gateway 123 will expire in 21 days.
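The `{{.AlertTime.Format "2006 Jan 02 15:04:05"}}` expression in the Alert Email Content field appears to follow Go's template and time layout conventions, where the layout is written as an example rendering of a fixed reference time rather than with `%`-style directives. As a rough illustration, the sketch below reproduces the same layout in Python. The `strftime` mapping and the function name are assumptions made for this example, not part of API7 Enterprise.

```python
from datetime import datetime

# Assumed strftime equivalent of the Go layout "2006 Jan 02 15:04:05":
# four-digit year, abbreviated month name, zero-padded day, 24-hour time.
STRFTIME_EQUIVALENT = "%Y %b %d %H:%M:%S"


def render_email_content(alert_time: datetime, alert_detail: str) -> str:
    """Mimic the Alert Email Content template used in the steps above."""
    return f"Alert Time: {alert_time.strftime(STRFTIME_EQUIVALENT)}, Detail:{alert_detail}"


content = render_email_content(
    datetime(2024, 12, 10, 17, 0, 0),
    "Certificate of gateway instance: gateway 123 will expire in 21 days.",
)
print(content)
```

Note that `%b` depends on the process locale; in the default C locale it yields an English abbreviation such as `Dec`.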
Monitor Control Plane to Data Plane mTLS Certificate Expiration
The API7 control plane certificate and the API7 control plane CA certificate enable secure mTLS communication between the control plane and the data plane. They are activated when a gateway instance is deployed and have a 13-month validity period.
To proactively monitor and alert on expiring certificates on gateway instances, implement a daily task to check certificate expiration dates. If a gateway instance's certificate is nearing expiration (within 30 days), send email alerts to the emergency team and post a notification to Slack.
- Select Alert from the side navigation bar, then click Policies.
- Click Add Alert Policy.
- In the dialog box, do the following:
  - In the Name field, enter `Gateway Instance Certificate Expired`.
  - In the Severity field, choose `High`.
  - In the Check Interval field, enter `1440` minutes.
  - In the Conditions field, do the following:
    - In the Operator field, choose `Meet all of the following conditions(AND)`.
    - In the Event field, choose `mTLS certificate between control plane and data plane will expire`.
    - In the Trigger Gateway Group field, choose `Select all`.
    - In the Rule field, fill in the blanks to `mTLS certificate between data plane and control plane will expire in 30 days`.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Email`.
    - In the Contact Points field, choose `Emergency Team Email List`.
    - In the Alert Email Subject field, enter `[API7 Alert] Gateway Instance Certificate Expiration Warning`.
    - In the Alert Email Content field, enter `Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}`.
    - Click Add.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Webhook`.
    - In the Contact Points field, choose `Slack Notification`.
    - In the Alert Message field, enter `"text": "{{.AlertDetail}}."`.
    - Click Add.
  - Click Add.
Validate
Imagine that a gateway instance `gateway 123` in the gateway group `production group` has a control plane CA certificate that will expire on 2024-12-31.
- Select Alert from the side navigation bar, then click History.
- You should see an alert record for this case. Click Detail:
  - Alert Policy: Gateway Instance Certificate Expired
  - Severity: High
  - Alert Time: 5 minutes ago
  - Trigger Gateway Group: production group
  - Alert Detail: CA Certificate of gateway instance: gateway 123 will expire in 21 days.
- The email sent will look like this:
  * Subject: [API7 Alert] Gateway Instance Certificate Expiration Warning
  * Alert Time: 2024 Dec 10 17:00:00, Detail: CA Certificate of gateway instance: gateway 123 will expire in 21 days.
- You will receive a message in Slack:
  CA Certificate of gateway instance: gateway 123 will expire in 21 days.
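The "will expire in 21 days" figure in this scenario follows from simple date arithmetic: a certificate expiring on 2024-12-31 is 21 days away when checked on 2024-12-10, which is inside the 30-day window in the rule. The sketch below illustrates that calculation. How API7 Enterprise evaluates the rule internally is not documented here, so the function names and the inclusive comparison are assumptions.

```python
from datetime import date


def days_until_expiry(today: date, not_after: date) -> int:
    """Number of whole days between the check date and the expiration date."""
    return (not_after - today).days


def should_alert(today: date, not_after: date, threshold_days: int = 30) -> bool:
    """Assumed rule: alert when the certificate expires within the threshold."""
    return days_until_expiry(today, not_after) <= threshold_days


check_date = date(2024, 12, 10)
expiry = date(2024, 12, 31)
print(days_until_expiry(check_date, expiry), should_alert(check_date, expiry))
```

With a 1440-minute check interval, this evaluation would run once per day, so the remaining-days figure in successive alerts would decrease by one each day.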
Detect Gateway Instance Offline
If a gateway instance (data plane node) has not reported a heartbeat to the control plane for more than 2 hours, and this state persists for 7 days, the data plane node is automatically removed and marked `offline`.
Implement an hourly task to detect offline gateway instances and send email alerts to the emergency team and Slack notifications in case of issues. The team can then try to recover the offline gateway instances.
- Select Alert from the side navigation bar, then click Policies.
- Click Add Alert Policy.
- In the dialog box, do the following:
  - In the Name field, enter `Gateway Instance Offline`.
  - In the Severity field, choose `Medium`.
  - In the Check Interval field, enter `60` minutes.
  - In the Conditions field, do the following:
    - In the Operator field, choose `Meet all of the following conditions(AND)`.
    - In the Event field, choose `Gateway instance offline`.
    - In the Trigger Gateway Group field, choose `Select all`.
    - In the Rule field, fill in the blanks to `Any gateway instance in the gateway group offline for more than 1 hour`.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Email`.
    - In the Contact Points field, choose `Emergency Team Email List`.
    - In the Alert Email Subject field, enter `[API7 Alert] Gateway Instance Offline Warning`.
    - In the Alert Email Content field, enter `Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}`.
    - Click Add.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Webhook`.
    - In the Contact Points field, choose `Slack Notification`.
    - In the Alert Message field, enter `"text": "{{.AlertDetail}}"`.
    - Click Add.
  - Click Add.
Validate
Imagine that gateway instance `gateway 123` in the gateway group `production group` went offline at 2024-12-31 14:00:00, and gateway instance `gateway 456` in the gateway group `test group` went offline at 2024-12-31 13:00:00. The alert policy was triggered at 2024-12-31 17:00:00.
- Select Alert from the side navigation bar, then click History.
- You should see an alert record for this case. Click Detail:
  - Alert Policy: Gateway Instance Offline
  - Severity: Medium
  - Alert Time: 5 minutes ago
  - Trigger Gateway Group: production group
  - Alert Detail: Gateway instance: gateway 123 in the gateway group: production group offline for 3 hours. Gateway instance: gateway 456 in the gateway group: test group offline for 4 hours.
- The email sent will look like this:
  * Subject: [API7 Alert] Gateway Instance Offline Warning
  * Alert Time: 2024 Dec 31 17:00:00, Detail: Gateway instance: gateway 123 in the gateway group: production group offline for 3 hours. Gateway instance: gateway 456 in the gateway group: test group offline for 4 hours.
- You will receive a message in Slack:
  Gateway instance: gateway 123 in the gateway group: production group offline for 3 hours.
  Gateway instance: gateway 456 in the gateway group: test group offline for 4 hours.
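The durations in this scenario come from subtracting each instance's offline time from the time of the check, then comparing the result with the 1-hour threshold in the rule. The sketch below works through that arithmetic. The function name is illustrative, and `gateway 789` is a hypothetical third instance added only to show a case that stays under the threshold.

```python
from datetime import datetime, timedelta


def offline_hours(offline_since: dict, now: datetime,
                  threshold: timedelta = timedelta(hours=1)) -> dict:
    """Return hours offline for each instance that exceeds the threshold."""
    result = {}
    for name, since in offline_since.items():
        elapsed = now - since
        if elapsed > threshold:
            result[name] = elapsed.total_seconds() / 3600
    return result


now = datetime(2024, 12, 31, 17, 0, 0)
offline_since = {
    "gateway 123": datetime(2024, 12, 31, 14, 0, 0),
    "gateway 456": datetime(2024, 12, 31, 13, 0, 0),
    "gateway 789": datetime(2024, 12, 31, 16, 30, 0),  # hypothetical, offline for 30 minutes
}
result = offline_hours(offline_since, now)
print(result)
```

Only the two instances from the scenario are reported, at 3 and 4 hours respectively; the hypothetical instance has been offline for less than an hour and is skipped.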
Detect CPU Cores Outrage
If the total CPU core usage across all gateway groups exceeds the licensed CPU core limit for seven consecutive days, adding or modifying resources will be restricted. However, existing services and routes will continue to function.
Implement an hourly task to check all gateway groups for production environments, and send email alerts to the emergency team and Slack notifications in case of issues.
- Select Alert from the side navigation bar, then click Policies.
- Click Add Alert Policy.
- In the dialog box, do the following:
  - In the Name field, enter `CPU cores Outrage`.
  - In the Severity field, choose `High`.
  - In the Check Interval field, enter `60` minutes.
  - In the Conditions field, do the following:
    - In the Operator field, choose `Meet all of the following conditions(AND)`.
    - In the Event field, choose `Allowed License CPU Quota Exceeded`.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Email`.
    - In the Contact Points field, choose `Emergency Team Email List`.
    - In the Alert Email Subject field, enter `[API7 Alert] CPU Cores Outrage`.
    - In the Alert Email Content field, enter `Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}`.
    - Click Add.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Webhook`.
    - In the Contact Points field, choose `Slack Notification`.
    - In the Alert Message field, enter `"text": "{{.AlertDetail}}"`.
    - Click Add.
  - Click Add.
Validate
Assume that your API7 Enterprise license limit of 100 CPU cores has been exceeded for 21 consecutive days, starting from 2024-12-01.
- Select Alert from the side navigation bar, then click History.
- You should see an alert record for this case. Click Detail:
  - Alert Policy: CPU Cores Outrage
  - Severity: High
  - Alert Time: 5 minutes ago
  - Trigger Gateway Group: production group
  - Alert Detail: Total CPU usage for all gateway groups is 110c, exceeded allowed license CPU quota 100c.
- The email sent will look like this:
  * Subject: [API7 Alert] CPU Cores Outrage
  * Alert Time: 2024 Dec 31 17:00:00, Detail: Total CPU usage for all gateway groups is 110c, exceeded allowed license CPU quota 100c.
- You will receive a message in Slack:
  Total CPU usage for all gateway groups is 110c, exceeded allowed license CPU quota 100c.
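The quota check in this scenario compares the sum of CPU cores across all gateway groups with the licensed limit. The sketch below illustrates that comparison and builds a message in the same shape as the alert detail. The per-group split (70 and 40 cores) is hypothetical; the scenario only states the total of 110 cores and the limit of 100.

```python
def check_cpu_quota(cores_by_group: dict, licensed_cores: int):
    """Sum CPU cores across gateway groups and compare with the licensed limit."""
    total = sum(cores_by_group.values())
    return total, total > licensed_cores


# Hypothetical per-group usage that adds up to the 110 cores in the scenario.
cores_by_group = {"production group": 70, "test group": 40}
total, exceeded = check_cpu_quota(cores_by_group, licensed_cores=100)

if exceeded:
    print(
        f"Total CPU usage for all gateway groups is {total}c, "
        f"exceeded allowed license CPU quota 100c."
    )
```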
Count Healthy Gateway Instances in a Gateway Group
If the number of healthy gateway instances in a gateway group falls below a critical threshold, it indicates potential service disruptions and impacts on traffic handling. This scenario is particularly relevant in Kubernetes deployments, where gateway instances may experience failures or be scaled down unexpectedly.
Implement a high-frequency task to detect this condition and send email alerts to the emergency team and Slack notifications in case of issues.
- Select Alert from the side navigation bar, then click Policies.
- Click Add Alert Policy.
- In the dialog box, do the following:
  - In the Name field, enter `No Enough Healthy Gateway Instances in Production Group`.
  - In the Severity field, choose `High`.
  - In the Check Interval field, enter `30` minutes.
  - In the Trigger Gateway Group field, choose `Production Group`.
  - In the Conditions field, do the following:
    - In the Operator field, choose `Meet all of the following conditions(AND)`.
    - In the Event field, choose `Number of healthy gateway instances`.
    - In the Rule field, fill in the blanks to `Number of gateway instances in the gateway group is less than 50`.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Email`.
    - In the Contact Points field, choose `Emergency Team Email List`.
    - In the Alert Email Subject field, enter `[API7 Alert] No Enough Healthy Gateway Instances in Production Group`.
    - In the Alert Email Content field, enter `Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}`.
    - Click Add.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Webhook`.
    - In the Contact Points field, choose `Slack Notification`.
    - In the Alert Message field, enter `"text": "{{.AlertDetail}}"`.
    - Click Add.
  - Click Add.
Validate
Assume that your gateway group `Production Group` requires a minimum of 50 healthy gateway instances. However, as of 2024-12-31 17:00:00, only 40 instances are operational. This shortfall may lead to service degradation and potential outages, and needs immediate attention.
- Select Alert from the side navigation bar, then click History.
- You should see an alert record for this case. Click Detail:
  - Alert Policy: No Enough Healthy Gateway Instances in Production Group
  - Severity: High
  - Alert Time: 5 minutes ago
  - Trigger Gateway Group: Production Group
  - Alert Detail: Number of healthy gateway instances in the gateway group: Production Group is 40.
- The email sent will look like this:
  * Subject: [API7 Alert] No Enough Healthy Gateway Instances in Production Group
  * Alert Time: 2024 Dec 31 17:00:00, Detail: Number of healthy gateway instances in the gateway group: Production Group is 40.
- You will receive a message in Slack:
  Number of healthy gateway instances in the gateway group: Production Group is 40.
Monitor Status Code
If the number of API responses with a specific status code exceeds a threshold, for example, too many 500 errors, it indicates potential service disruptions and impacts on traffic handling.
Implement a high-frequency task to detect this condition and send email alerts to the emergency team and Slack notifications in case of issues.
- Select Alert from the side navigation bar, then click Policies.
- Click Add Alert Policy.
- In the dialog box, do the following:
  - In the Name field, enter `Too many 500 status code in production gateway groups`.
  - In the Severity field, choose `High`.
  - In the Check Interval field, enter `30` minutes.
  - In the Trigger Gateway Group field, select `Match Label`, then enter the key/value `envType: production`.
  - In the Conditions field, do the following:
    - In the Operator field, choose `Meet any of the following conditions(OR)`.
    - In the Event field, choose `Number of status code 500`.
    - In the Rule field, fill in the blanks to `Number of requests with status code 500 received by all published services of any one of the gateway groups has reached or exceeded 100 times in the last 60 minutes`.
  - Click Add Condition.
  - In the Conditions field, do the following:
    - In the Event field, choose `Ratio of status code 500`.
    - In the Rule field, fill in the blanks to `Ratio of requests with status code 500 received by all published services of any one of the gateway groups has reached or exceeded 10% in the last 60 minutes`.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Email`.
    - In the Contact Points field, choose `Emergency Team Email List`.
    - In the Alert Email Subject field, enter `[API7 Alert] Too many 500 status code in {{.TriggerGatewayGroup}}`.
    - In the Alert Email Content field, enter `Alert Time: {{.AlertTime.Format "2006 Jan 02 15:04:05"}}, Detail:{{.AlertDetail}}`.
    - Click Add.
  - Click Add Notification.
  - In the dialog box, do the following:
    - In the Type field, choose `Webhook`.
    - In the Contact Points field, choose `Slack Notification`.
    - In the Alert Message field, enter `"text": "{{.AlertDetail}}"`.
    - Click Add.
  - Click Add.
Validate
Assume that the gateway group `VIP Group`, which has the label `envType:production`, experienced a 15% error rate between 16:00 and 17:00 on December 31, 2024: out of 1000 requests, 150 resulted in 500 errors. The gateway group `US Group`, which also has the label `envType:production`, experienced a 10% error rate over the same period: out of 500 requests, 50 resulted in 500 errors.
- Select Alert from the side navigation bar, then click History.
- You should see an alert record for this case. Click Detail:
  - Alert Policy: Too many 500 status code in production gateway groups
  - Severity: High
  - Alert Time: 5 minutes ago
  - Trigger Gateway Group: VIP Group
  - Alert Detail: Number of requests with status code 500 received by all published services of the gateway group: VIP Group is 150 times in the last 60 minutes. Ratio of requests with status code 500 received by all published services of the gateway group: VIP Group is 15% in the last 60 minutes. Ratio of requests with status code 500 received by all published services of the gateway group: US Group is 10% in the last 60 minutes.
- The email sent will look like this:
  * Subject: [API7 Alert] Too many 500 status code in VIP Group,US Group
  * Alert Time: 2024 Dec 31 17:00:00, Detail:Number of requests with status code 500 received by all published services of the gateway group: VIP Group is 150 times in the last 60 minutes.
    Ratio of requests with status code 500 received by all published services of the gateway group: VIP Group is 15% in the last 60 minutes.
    Ratio of requests with status code 500 received by all published services of the gateway group: US Group is 10% in the last 60 minutes.
- You will receive a message in Slack:
  Number of requests with status code 500 received by all published services of the gateway group: VIP Group is 150 times in the last 60 minutes.
  Ratio of requests with status code 500 received by all published services of the gateway group: VIP Group is 15% in the last 60 minutes.
  Ratio of requests with status code 500 received by all published services of the gateway group: US Group is 10% in the last 60 minutes.
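The three lines in this alert follow from evaluating the two conditions independently for each matching gateway group and reporting every condition that is met. VIP Group meets both the count condition (150 is at least 100) and the ratio condition (15% is at least 10%), while US Group meets only the ratio condition (50 is below 100, but 10% reaches the threshold). The sketch below reproduces that reasoning. The function and field names are illustrative and do not reflect how API7 Enterprise implements the check.

```python
def evaluate_or_conditions(stats: dict, min_count: int = 100,
                           min_ratio_percent: int = 10) -> list:
    """Return one entry per (group, condition) pair that is met.

    stats maps a gateway group name to (total_requests, status_500_requests).
    """
    hits = []
    for group, (total, errors) in stats.items():
        if errors >= min_count:
            hits.append((group, "count", errors))
        # Integer comparison avoids floating-point rounding at the 10% boundary.
        if total > 0 and errors * 100 >= min_ratio_percent * total:
            hits.append((group, "ratio_percent", errors * 100 // total))
    return hits


stats = {"VIP Group": (1000, 150), "US Group": (500, 50)}
hits = evaluate_or_conditions(stats)
for group, condition, value in hits:
    print(group, condition, value)
```

Because the conditions are combined with OR, a group appears in the alert as soon as either threshold is reached, which is why US Group is included even though its error count is below 100.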