6. Creating your first Incident Response workflow
Performing repetitive tasks typically produces boredom, which can result in errors and mistakes.
Description
As SREs and IT administrators can attest, software applications sometimes seem to have a mind of their own and behave in ways that are hard to understand. Faults and errors can stem from many factors: hardware, resource allocation, operating systems, the network, DNS, cloud provider outages, and the list goes on.
Let’s be honest: at some point, and probably still today, your business has depended on an old legacy application that holds critical data and that you have to support. Unfortunately, replacing these systems is not always easy because of internal knowledge gaps, application EOL, or migrations that are costly and time-consuming.
Use case
The repetitive task we want to address arises when a distributed application leaks memory, causing the Linux operating system to run out of memory and creating performance issues. A simplified runbook to remediate such a scenario might look something like this:
Alert is received from APM tool for high server memory utilization
The server is identified by either the name or IP address
Connect and authenticate via SSH to the Linux server
The server might be in an isolated or firewalled network, which requires a Bastion host or VPN server for connectivity
Restart the service of the application
Verify memory utilization
If memory utilization is still high, create a Jira ticket
If resolved, close the incident
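The runbook above can be sketched as a small function, which makes the decision points explicit before they are mapped onto workflow nodes. This is a minimal illustration, not Fylamynt code: the `remediate` function, the alert keys `hostname` and `ip`, and the two injected callables are all hypothetical names, and the default threshold of 80 simply mirrors the 80% figure this article uses for its conditional rule.

```python
from typing import Callable


def remediate(
    alert: dict,
    restart_service: Callable[[str], None],
    read_memory_percent: Callable[[str], float],
    threshold: float = 80.0,
) -> str:
    """Walk the runbook once and return the follow-up action."""
    # Identify the server by name, falling back to its IP address.
    # The keys "hostname" and "ip" are assumptions made for this sketch.
    host = alert.get("hostname") or alert.get("ip")
    if not host:
        raise ValueError("alert does not identify a server")

    # Restart the application service. Connecting over SSH, a Bastion host,
    # or a VPN is left to the injected callable.
    restart_service(host)

    # Verify memory utilization after the restart.
    if read_memory_percent(host) > threshold:
        return "create_ticket"
    return "close_incident"


# Usage with stand-in callables instead of real SSH and monitoring calls.
restarted = []
action = remediate({"hostname": "app-01"}, restarted.append, lambda host: 42.0)
```

Passing the restart and measurement steps in as callables keeps the decision logic testable without touching a real server.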
This might sound simple enough to tackle manually, but having to perform this task repeatedly, often at the dreaded 2 a.m., becomes a drag.
So let’s see how you can use Fylamynt’s low-code workflow engine to automate the remediation of an alert received from a performance monitoring tool.
Integrations
Before building the workflow, you need to configure and authorize the required integrations.
Log in to Fylamynt
Select Settings
Select and configure the following integrations to be used in this workflow:
Trigger workflow execution with a selected New Relic Policy
Alternatively, you can use Datadog, Sumo Logic, Humio, Instana, or Splunk On-Call
Securely authenticate and access your SSH servers to execute commands.
Send messages and notifications to your teams
Create or resolve incidents
Alternatively, you can use Jira, Twilio, or ServiceNow
Creating a workflow in Fylamynt
Now that we have our integrations connected, let’s create the workflow.
Step 1: Create a new trigger-based workflow
Log in to Fylamynt
On the workflow page, click “New Workflow”
Provide a workflow name
Select New Relic as the trigger type
You are now presented with the Workflow Editor, where you drag and drop Fylamynt’s action nodes as steps. It’s as simple as that.
Step 2: Add JSONPath node
The New Relic trigger is added by default and provides the alert body as JSON output, which you can consume in any downstream node. Because the data is in JSON format, you need to extract the relevant information: in this case, the hostname of the server that is experiencing high memory utilization and that you have to SSH into to restart the service.
To add and configure the JSONPath node, here are the steps:
From the left menu bar, drag and drop the JSONPath action node onto the canvas and connect it to the New Relic node
Select the new action node
On the right menu, select the JSON input
For demonstration purposes, I will first pre-populate the JSON input with a New Relic alert to show what the JSONPath expression returns, and then switch the input back to the output of the New Relic trigger node.
Change the JSON Input to “Trigger 1”
For Previous Step Output select “output_json”
Enter the JSON Path expression to extract only the relevant name
“$.targets[0].labels.fullHostname”
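To make the expression concrete, here is a minimal Python sketch of what it selects. The sample payload is hypothetical and trimmed to the fields the expression touches; a real New Relic alert body carries many more. The `extract` helper is written for this illustration only and handles just dotted keys and integer indexes, not the full JSONPath syntax.

```python
import re

# Hypothetical, trimmed alert body. Only the structure addressed by the
# expression is shown, and the hostname value is invented.
sample_alert = {
    "targets": [
        {"labels": {"fullHostname": "app-server-01.internal"}}
    ]
}


def extract(document, path):
    """Resolve a simple path such as $.a[0].b against nested dicts and lists."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = document
    # Each token is either ".key" or "[index]".
    for key, index in re.findall(r"\.([A-Za-z_][\w-]*)|\[(\d+)\]", path[1:]):
        current = current[key] if key else current[int(index)]
    return current


hostname = extract(sample_alert, "$.targets[0].labels.fullHostname")
```

Read left to right, the expression starts at the document root, takes the first element of `targets`, then its `labels` object, and finally the `fullHostname` value.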
Step 3: Add Teleport SSH Execute node
For this example workflow, the Teleport integration is used to authenticate and access the Linux server that runs on an isolated network. Fylamynt also supports adding SSH Targets for servers that are publicly accessible, such as your Bastion hosts; combined with the SSH Execute action node, these let you run commands and retrieve the results.
To add and configure the Teleport SSH Execute node, here are the steps:
From the left menu bar, drag and drop the action node onto the canvas and connect it to the previous JSONPath node
Select the new action node
On the right menu, select the input tab
Enter the SSH User
For the SSH Target Host, you will retrieve the host information from the JSONPath’s output.
Add the SSH Command you want to execute on the server
“systemctl restart newrelic-infra.service && journalctl --unit=newrelic-infra.service -n 100 --no-pager”
Optionally, you can also add an S3 bucket where the execution logs will be stored.
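If you need the same restart-and-show-logs command for other services, it can help to generate it rather than edit it by hand. The sketch below is one way to do that in Python; `build_restart_command` and its parameters are hypothetical names for this illustration, and the output is simply the string you would paste into the SSH Command field.

```python
import shlex


def build_restart_command(unit: str, log_lines: int = 100) -> str:
    """Build a command that restarts a systemd unit and prints its recent logs."""
    # Quote the unit name so unexpected characters cannot alter the command.
    quoted = shlex.quote(unit)
    return (
        f"systemctl restart {quoted} && "
        f"journalctl --unit={quoted} -n {int(log_lines)} --no-pager"
    )


command = build_restart_command("newrelic-infra.service")
```

Because the two commands are joined with `&&`, the journal output is only fetched when the restart itself exits successfully.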
Step 4: Add String Transformation node
The next step is to transform the JSON output into a string that the Slack node can easily consume.
To add and configure the String Transformation node, here are the steps:
From the left menu bar, drag and drop the String Transformation action node onto the canvas and connect it to the previous Teleport SSH Execute node
Select the new action node
On the right menu, select the input tab
For the JSON Input, you will retrieve the host information from the JSONPath’s output
For the operation, select To Lowercase
Step 5: Add Slack Send Message node
The Slack node is added to notify the users in the specified Slack channel that the service was restarted successfully.
To add and configure the Slack Send Message node, here are the steps:
From the left menu bar, drag and drop the Slack Send Message action node onto the canvas and connect it to the previous String Transformation node
Select the new action node
On the right menu, select the input tab
Select the Slack Channel you want to send the message to
Click Add Slack Variables
Enter the variable name
For label select the String Transformation node, and as the previous step output select “string_output”
Click Save Variable
Now in the Message Text field, you can consume the variable in the following way:
“Service restart on the host {{hostname}} has been successfully carried out.”
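The combined effect of the To Lowercase operation and the Slack variable can be imitated in a few lines of Python. This is only a sketch of the idea: how Fylamynt resolves `{{hostname}}` internally is not described here, so the `render` helper and the sample hostname are assumptions made for illustration.

```python
import re


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching variable value."""
    def substitute(match):
        # A missing variable raises KeyError instead of being silently dropped.
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Hypothetical hostname, lowercased as the String Transformation step would.
hostname = "APP-Server-01.Internal".lower()
message = render(
    "Service restart on the host {{hostname}} has been successfully carried out.",
    {"hostname": hostname},
)
```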
Step 6: Save the workflow
Click the Save New Version button
Every change made to the workflow within the editor will be saved as a new version. You can also very easily revert to previous versions.
To manage versions, select the workflow name in the top corner of the editor, or click the manage versions button.
Optional action nodes
Fylamynt has over 100 actions across 38 services, with multiple integrations that you can use.
Here are some additional steps that you can add to enhance the workflow.
Approval
The approval node will send a message to a Slack channel where a user can approve or deny the restart of the service
Slack Send Message
This action can also send a notification at the beginning of the workflow that the memory alert was received from New Relic and that the service of the application will be restarted on the affected server.
New Relic NRQL Query
This action node allows you to perform a query and retrieve data that can be used to verify the memory utilization metric on the host after the service was restarted.
Conditional
The conditional node can be used to review the new memory utilization metric after the service was restarted.
The rule would check whether the memory utilization is still above 80% and, if so, create a ticket or incident in one of Fylamynt’s other integrations, such as Jira, ServiceNow, or PagerDuty.
Step 7: Automate the workflow trigger
In the next and final step, complete the Incident Management configuration: set up Incident Types and Incident Type associations so that the workflow runs automatically when an alert is received from your trigger type integrations.