6. Creating your first Incident Response workflow

Performing repetitive tasks typically produces boredom, which can result in errors and mistakes.

Description

As SREs and IT administrators can attest, software applications sometimes seem to have a mind of their own and behave in ways that are hard to understand. Faults and errors can stem from many factors: hardware, resource allocation, operating systems, the network, DNS, cloud provider services going down, and the list goes on and on.

Let’s be honest: at some point, and probably still today, your business has depended on an old legacy application that holds critical data and that you have to support. Unfortunately, it is not always easy to replace these systems due to internal knowledge gaps, application EOL, or migrations that can be costly and time-consuming.

Use case

The repetitive task we want to address arises when a distributed application leaks memory, causing the Linux operating system to run out of memory and creating performance issues. A simplified runbook to remediate such a scenario might look something like this:

Alert is received from APM tool for high server memory utilization
The server is identified by either the name or IP address
Connect and authenticate via SSH to the Linux server
- The server might be in an isolated or firewalled network, which requires a Bastion host or VPN server for connectivity
Restart the service of the application
Verify memory utilization
If memory utilization is still high, create a Jira ticket
If resolved, close the incident
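
The runbook above can be sketched as a single function. This is only an illustration of the control flow, not how Fylamynt executes a workflow: the alert field names (`hostname`, `ip_address`), the `RunbookResult` type, and the two injected callables are hypothetical, and the default threshold of 80 is borrowed from the Conditional node described later on this page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunbookResult:
    host: str
    memory_percent: float
    action: str  # "create_ticket" or "close_incident"


def remediate_high_memory(
    alert: dict,
    restart_service: Callable[[str], None],
    read_memory_percent: Callable[[str], float],
    threshold: float = 80.0,
) -> RunbookResult:
    """Walk through the runbook once for a single alert."""
    # Identify the server by name, falling back to its IP address.
    host = alert.get("hostname") or alert["ip_address"]
    # Connecting over SSH and restarting the service is left to the caller.
    restart_service(host)
    # Verify memory utilization after the restart.
    memory = read_memory_percent(host)
    # Still high: open a ticket. Otherwise the incident can be closed.
    action = "create_ticket" if memory > threshold else "close_incident"
    return RunbookResult(host=host, memory_percent=memory, action=action)


# Usage with stand-in callables (no real SSH or monitoring calls are made).
restarted_hosts = []
result = remediate_high_memory(
    {"hostname": "app-01.example.internal"},
    restart_service=restarted_hosts.append,
    read_memory_percent=lambda host: 42.0,
)
print(result.action)
```

Passing the restart and measurement steps in as callables keeps the decision logic testable without touching a real server.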

This might sound simple enough to tackle manually, but having to repeatedly perform this task, and at the dreaded 2 am, just becomes a drag.

So let’s see how you can, using Fylamynt’s low-code workflow engine, automate the remediation of an alert received from a performance monitoring tool.

Integrations

Before building the workflow, you need to configure and authorize the required integrations.

Log in to Fylamynt
Select Settings
Select and configure the following integrations to be used in this workflow:
- New Relic
  1. Trigger workflow execution with a selected New Relic Policy
  2. Alternatively, you can use Datadog, Sumo Logic, Humio, Instana, or Splunk On-Call
- Teleport
  1. Securely authenticate and access your SSH servers to execute commands.
- Slack
  1. Send messages and notifications to your teams
- PagerDuty
  1. Create or resolve incidents
  2. Alternatively, you can use Jira, Twilio, or ServiceNow

Creating a workflow in Fylamynt

Now that we have our integrations connected, let’s create the workflow.

Step 1: Create a new trigger-based workflow

Log in to Fylamynt
On the workflow page, click “New Workflow”
Provide a workflow name
Select New Relic as the trigger type
You are now presented with the Workflow Editor, where you drag and drop Fylamynt’s action nodes as steps. It’s as simple as that.

Step 2: Add JSONPath node

The New Relic trigger is added by default and provides the alert body as JSON output, which you can consume in any downstream node. Because the data is in JSON format, you need to extract the relevant information: in this case, the hostname of the server that is experiencing high memory utilization and that you have to SSH into to restart the service.

To add and configure the JSONPath node, here are the steps:

From the left menu bar, drag and drop the JSONPath action node onto the canvas and connect it to the New Relic node
Select the new action node
On the right menu, select the JSON input
1. For demonstration purposes I am going to pre-populate the JSON input with the New Relic alert first, just to show how the JSON path expression delivers the output, and then will change back to retrieve the output of the New Relic trigger node as input for the JSONPath.
Change the JSON Input to “Trigger 1”
For Previous Step Output select “output_json”
Enter the JSON Path expression to extract only the relevant name
1. “$.targets[0].labels.fullHostname”
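
To see what this expression selects, the sketch below resolves it against a small payload in plain Python. In JSONPath, `$` is the root of the document, `.name` selects a member of an object, and `[0]` selects the first element of an array. The `resolve_path` helper only handles that small subset, and the sample alert is a hypothetical, heavily trimmed payload: a real New Relic alert body contains many more fields.

```python
import re
from typing import Any

# Matches either a ".name" member step or a "[index]" array step.
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve_path(document: Any, expression: str) -> Any:
    """Resolve a simple path such as $.targets[0].labels.fullHostname."""
    if not expression.startswith("$"):
        raise ValueError("expression must start with '$'")
    rest = expression[1:]
    position = 0
    current = document
    while position < len(rest):
        match = _STEP.match(rest, position)
        if match is None:
            raise ValueError(f"unsupported syntax at offset {position + 1}")
        name, index = match.group(1), match.group(2)
        # Member steps look up a dict key; index steps look up a list position.
        current = current[name] if name is not None else current[int(index)]
        position = match.end()
    return current


# Hypothetical sample payload, reduced to the fields the expression touches.
sample_alert = {
    "targets": [
        {"labels": {"fullHostname": "APP-01.example.internal"}}
    ]
}

hostname = resolve_path(sample_alert, "$.targets[0].labels.fullHostname")
print(hostname)
```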

Step 3: Add Teleport SSH Execute node

For this example workflow, the Teleport integration is used to authenticate and access the Linux server that runs on an isolated network. Fylamynt also supports adding SSH Targets for specific servers that are publicly accessible, for instance your Bastion hosts; in conjunction with the SSH Execute action node, you can then run commands and retrieve the results.

To add and configure the Teleport SSH Execute node, here are the steps:

From the left menu bar, drag and drop the action node onto the canvas and connect it to the previous JSONPath node
Select the new action node
On the right menu, select the input tab
1. Enter the SSH User
2. For the SSH Target Host, you will retrieve the host information from the JSONPath’s output.
Add the SSH Command you want to execute on the server
1. “systemctl restart newrelic-infra.service && journalctl --unit=newrelic-infra.service -n 100 --no-pager”
Optionally, you can also add an S3 bucket where the execution logs will be stored.
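
The SSH command above chains two programs with `&&`, so `journalctl` only prints the last 100 log lines of the unit if `systemctl restart` exits successfully. If you restart different services from different workflows, a small helper like the one below could build the same command for any unit name. The `restart_and_tail` function is a hypothetical convenience and not part of Fylamynt or Teleport.

```python
import shlex


def restart_and_tail(unit: str, lines: int = 100) -> str:
    """Build a shell command that restarts a systemd unit and prints its recent log."""
    # Quote the unit name so unexpected characters cannot alter the command.
    quoted = shlex.quote(unit)
    return (
        f"systemctl restart {quoted} && "
        f"journalctl --unit={quoted} -n {int(lines)} --no-pager"
    )


print(restart_and_tail("newrelic-infra.service"))
```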

Step 4: Add String Transformation node

The next step is to transform the JSON output into a string so that it can be easily consumed in the Slack node.

To add and configure the String Transformation node, here are the steps:

From the left menu bar, drag and drop the String Transformation action node onto the canvas and connect it to the previous Teleport SSH Execute node
Select the new action node
On the right menu, select the input tab
1. For the JSON Input, you will retrieve the host information from the JSONPath’s output
For the operation, select To Lowercase

Step 5: Add Slack Send Message node

The Slack node is added to notify the users in the specified Slack channel that the service was restarted successfully.

To add and configure the Slack Send Message node, here are the steps:

From the left menu bar, drag and drop the Slack Send Message action node onto the canvas and connect it to the previous String Transformation node
Select the new action node
On the right menu, select the input tab
1. Select the Slack Channel you want to send the message to
2. Click Add Slack Variables
  1. Enter the variable name
  2. For label select the String Transformation node, and as the previous step output select “string_output”
3. Click Save Variable
Now in the Message Text field, you can consume the variable in the following way:
1. “Service restart on the host {{hostname}} has been successfully carried out.”
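
The combined effect of the String Transformation and Slack Send Message nodes can be pictured with the following sketch. It assumes that a `{{name}}` placeholder is simply replaced by the value of the variable with that name. How Fylamynt renders the message internally is not documented here, so the `render` helper and the sample hostname are hypothetical.

```python
import re

# Matches {{name}}, allowing optional spaces inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching variable value."""
    def substitute(match):
        # A missing variable raises KeyError rather than being silently dropped.
        return str(variables[match.group(1)])

    return _PLACEHOLDER.sub(substitute, template)


raw_hostname = "APP-01.Example.Internal"  # hypothetical JSONPath output
string_output = raw_hostname.lower()      # the "To Lowercase" operation
message = render(
    "Service restart on the host {{hostname}} has been successfully carried out.",
    {"hostname": string_output},
)
print(message)
```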

Step 6: Save the workflow

Click the Save New Version button

Every change made to the workflow within the editor will be saved as a new version. You can also very easily revert to previous versions.

Select the Workflow name in the top-level corner, or click on the manage versions button.

Optional action nodes

Fylamynt has over 100 actions across 38 services, with multiple integrations that you can use.

Here are some additional steps that you can add to enhance the workflow.

Approval

The approval node will send a message to a Slack channel where a user can approve or deny the restart of the service.

Slack Send Message

This action can also send a notification at the beginning of the workflow that the memory alert was received from New Relic and that the service of the application will be restarted on the affected server.

New Relic NRQL Query

This action node allows you to perform a query and retrieve data that can be used to verify the memory utilization metric on the host after the service was restarted.

Conditional

The conditional node can be used to review the new memory utilization metric after the service was restarted.

The rule would check whether the memory utilization is still above 80%, and if that is the case, create a ticket or incident in one of Fylamynt’s other integrations like Jira, ServiceNow, PagerDuty, etc.
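
As a rough illustration of such a rule, the sketch below evaluates a comparison against a metric value. The rule layout (`metric`, `operator`, `value`) and the metric name `memory_used_percent` are hypothetical and do not reflect the Conditional node's actual configuration. Note that "above 80%" is read here as a strict comparison, so a reading of exactly 80 does not create a ticket.

```python
import operator

# Comparison operators a rule may name (hypothetical, not Fylamynt's schema).
_OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def rule_matches(rule: dict, metrics: dict) -> bool:
    """Return True when the metric named by the rule satisfies its comparison."""
    compare = _OPERATORS[rule["operator"]]
    return compare(metrics[rule["metric"]], rule["value"])


still_high = {"metric": "memory_used_percent", "operator": ">", "value": 80.0}

for observed in (91.5, 80.0, 42.0):
    escalate = rule_matches(still_high, {"memory_used_percent": observed})
    print(observed, "create ticket" if escalate else "close incident")
```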

Step 7: Automate the workflow trigger

In the next and final step, you have to complete the Incident Management configuration to set up Incident types and Incident Type associations in order for the workflow to automatically run when an alert is received from your trigger type integrations.


Last updated 3 years ago
