Incident Response Automation
Restart Linux service on memory utilization alert
The following example workflow shows how Fylamynt can automate the remediation of an alert received from your existing performance monitoring or logging tools.
The use case we are addressing is one where Linux-based servers are monitored for high memory utilization, caused for example by an application with a memory leak. When an alert is received for a specific server, the remediation is to SSH to that Linux server, which sits in an isolated network or behind a firewall, and restart the service. This might sound straightforward, but what if the incident occurs at 2 a.m.? And how does the SRE get access to the isolated environment to authenticate and execute commands on the Linux server in question?
Before we review the workflow in Fylamynt, let's look at the different tools and integrations required to make this happen. First, we use New Relic to monitor the EC2 instance for memory utilization. For this particular workflow, New Relic can easily be replaced by other APM tools such as Datadog or Sumo Logic, for which we also have integrations. The policy in New Relic has a notification channel configured that sends the incident to a PagerDuty service. For the PagerDuty service, we configure a webhook integration to Fylamynt, which lets us monitor incident generation on the service and trigger the workflow automatically. Our Teleport integration is then used to authenticate and execute the SSH command on the specific Linux-based EC2 instance. Lastly, we require a Slack integration that can send messages and approval notifications to your Slack team.
Our example workflow is triggered by a PagerDuty alert. The trigger retrieves the alert body, in JSON format, from the PagerDuty service.
The alert body output from PagerDuty is then used as the input for the JSONPATH action node, which uses a path expression to extract only the hostname of the Linux server that raised the memory utilization alert.
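As a rough illustration of this extraction step, the sketch below resolves a simple path expression against an alert body. The payload shape, the field names, and the `extract` helper are all assumptions made for illustration. A real PagerDuty alert body will look different, and full JSONPath syntax (as used by the JSONPATH node) supports far more than the dotted keys and integer indexes handled here.

```python
import json

# Hypothetical alert body; a real PagerDuty payload has a different shape.
ALERT_BODY = json.loads("""
{
  "event": {
    "data": {
      "id": "PABC123",
      "title": "High memory utilization",
      "details": {"hostname": "app-server-01"}
    }
  }
}
""")


def extract(document, path):
    """Resolve a simplified JSONPath such as '$.a.b[0].c'.

    Only dotted keys and integer list indexes are supported.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = document
    # Turn '[0]' into '.0' so keys and indexes can be split uniformly.
    tokens = path[1:].replace("[", ".").replace("]", "").split(".")
    for token in tokens:
        if not token:
            continue
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


hostname = extract(ALERT_BODY, "$.event.data.details.hostname")
print(hostname)
```

The extracted value is what the later steps would consume as the SSH target host.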
An approval request to execute the SSH command on the affected server is then sent to a Slack channel.
The Teleport SSH Execute Action node takes the output of the JSONPATH node, which contains the matched hostname, as the SSH Target Host, and executes the command provided to restart the service.
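To make this step concrete, here is a minimal sketch of how the equivalent command could be assembled by hand with Teleport's `tsh ssh` client. It only builds the argument list and does not run anything. The login name, the service name, and the use of `systemctl` are assumptions, and the Teleport SSH Execute Action node may work differently internally. Because the hostname comes from an alert payload, the sketch validates it before placing it in a command.

```python
import re
import shlex

# Conservative patterns: letters, digits, dots, and hyphens for hostnames.
_HOSTNAME_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9.-]{0,251}[A-Za-z0-9])?")
_SERVICE_RE = re.compile(r"[A-Za-z0-9_.@-]+")


def build_restart_command(hostname, service, login="ec2-user"):
    """Return the argv list for restarting `service` on `hostname` via tsh.

    `login` and `service` are hypothetical values chosen for illustration.
    """
    if not _HOSTNAME_RE.fullmatch(hostname):
        raise ValueError(f"refusing suspicious hostname: {hostname!r}")
    if not _SERVICE_RE.fullmatch(service):
        raise ValueError(f"refusing suspicious service name: {service!r}")
    # Assumes a systemd-based host; quote the service name defensively.
    remote_command = f"sudo systemctl restart {shlex.quote(service)}"
    return ["tsh", "ssh", f"{login}@{hostname}", remote_command]


print(build_restart_command("app-server-01", "myapp.service"))
```

Returning a list rather than a single shell string means the result can be passed to `subprocess.run` without a local shell interpreting the hostname.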
The workflow then converts the JSON output to a string, and a message is sent to a specified Slack channel, with the hostname as a variable, notifying the team that the service on the host was successfully restarted.
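The conversion and notification can be sketched as follows, assuming the path-expression step returns its match as a single-element list and that the message is delivered as a Slack incoming-webhook style payload with a `text` field. The helper names and the message wording are hypothetical, and the sketch only builds the payload without sending it.

```python
import json


def match_to_string(match):
    """Convert a path-expression result (list or scalar) to a plain string."""
    if isinstance(match, list):
        if len(match) != 1:
            raise ValueError("expected exactly one match")
        match = match[0]
    return str(match)


def build_slack_message(hostname, service):
    """Build a message payload with the hostname filled in as a variable."""
    return {
        "text": f"Service {service} on host {hostname} was successfully restarted."
    }


host = match_to_string(["app-server-01"])
print(json.dumps(build_slack_message(host, "myapp.service")))
```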
The alert body output from PagerDuty is used again as the input for a JSONPATH action node, which extracts the PagerDuty Incident ID.
Lastly, the workflow uses the retrieved Incident ID to automatically resolve the PagerDuty incident that triggered this workflow.
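For readers curious what resolving an incident by ID can look like outside the workflow, the sketch below prepares a request against PagerDuty's REST API v2 incident update endpoint. It is an assumption that this matches what the workflow does internally. The token, email address, and incident ID are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request

PAGERDUTY_API = "https://api.pagerduty.com"


def build_resolve_request(incident_id, api_token, from_email):
    """Prepare (but do not send) a request that marks an incident resolved."""
    body = {"incident": {"type": "incident_reference", "status": "resolved"}}
    return urllib.request.Request(
        url=f"{PAGERDUTY_API}/incidents/{incident_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            # The API expects the email of the user making the change.
            "From": from_email,
        },
    )


# Placeholder values for illustration only.
request = build_resolve_request("PABC123", "example-token", "sre@example.com")
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` with a valid token.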