6. Creating your first Incident Response workflow

Performing repetitive tasks typically produces boredom, which can result in errors and mistakes.

Description

As SREs and IT administrators can attest, software applications can sometimes have a mind of their own and behave in ways that are not easy to understand. Faults and errors can be attributed to many factors, such as hardware, resource allocation, operating systems, the network, DNS, or cloud provider services going down, and the list goes on.

Let’s be honest: at some point, and probably still today, your business has depended on an old legacy application that holds critical data and that you have to support. Unfortunately, it is not always easy to replace these systems due to internal knowledge gaps, application EOL, or migrations that can be costly and time-consuming.

Use case

The repetitive task we want to address is one where a distributed application leaks memory, causing the Linux operating system to run out of memory and creating performance issues. A simplified runbook to remediate such a scenario might look something like this:

  • Alert is received from APM tool for high server memory utilization

  • The server is identified by either the name or IP address

  • Connect and authenticate via SSH to the Linux server

    • The server might be in an isolated or firewalled network, which requires a Bastion host or VPN server for connectivity

  • Restart the service of the application

  • Verify memory utilization

  • If memory utilization is still high, create a Jira ticket

  • If resolved, close the incident
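As a rough illustration, the runbook above can be sketched as a single function. This is only a sketch under stated assumptions, not how Fylamynt implements it: the four callables are hypothetical stand-ins for the SSH, monitoring, and ticketing steps, and the 80% threshold is borrowed from the Conditional node described later in this guide.

```python
from typing import Callable


def remediate_high_memory(
    host: str,
    restart_service: Callable[[str], None],      # hypothetical: restarts the service over SSH
    read_memory_percent: Callable[[str], float],  # hypothetical: reads memory utilization (0-100)
    create_ticket: Callable[[str], None],         # hypothetical: opens a Jira ticket
    close_incident: Callable[[str], None],        # hypothetical: closes the incident
    threshold_percent: float = 80.0,              # assumed threshold for "still high"
) -> str:
    """Run the runbook once for `host` and report which branch was taken."""
    restart_service(host)
    memory_percent = read_memory_percent(host)
    if memory_percent > threshold_percent:
        # Memory is still high after the restart: escalate.
        create_ticket(host)
        return "ticket_created"
    # Memory is back under the threshold: the incident is resolved.
    close_incident(host)
    return "incident_closed"
```

Passing the side-effecting steps in as callables keeps the decision logic separate from the tooling, so the same flow can be exercised with stubs before it touches a real server.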

This might sound simple enough to tackle manually, but having to perform this task repeatedly, often at the dreaded 2 am, just becomes a drag.

So let’s see how you can use Fylamynt’s low-code workflow engine to automate the remediation of an alert received from a performance monitoring tool.

Integrations

Before building the workflow, you need to configure and authorize the required integrations.

  1. Log in to Fylamynt

  2. Select Settings

  3. Select and configure the integrations used in this workflow: New Relic, Teleport, and Slack

Creating a workflow in Fylamynt

Now that we have our integrations connected, let’s create the workflow.

Step 1: Create a new trigger-based workflow

  1. Log in to Fylamynt

  2. On the workflow page, click “New Workflow”

  3. Provide a workflow name

  4. Select New Relic as the trigger type

  5. You are now presented with the Workflow Editor, where you drag and drop Fylamynt’s action nodes as steps. It’s as simple as that.

Step 2: Add JSONPath node

The New Relic trigger is added by default and provides the alert body as JSON output, which you can consume in any downstream node. Because the data is in JSON format, you need to extract the relevant information: in this case, the hostname of the server that is experiencing high memory utilization and that you have to SSH into to restart the service.

To add and configure the JSONPath node, here are the steps:

  1. From the left menu bar, drag and drop the JSONPath action node onto the canvas and connect it to the New Relic node

  2. Select the new action node

  3. On the right menu, select the JSON input

    1. For demonstration purposes, I first pre-populate the JSON input with the New Relic alert to show how the JSON Path expression delivers the output, and then switch the input back to the output of the New Relic trigger node.

  4. Change the JSON Input to “Trigger 1”

  5. For Previous Step Output select “output_json”

  6. Enter the JSON Path expression to extract only the relevant name

    1. “$.targets[0].labels.fullHostname”
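To make the expression concrete, here is a minimal sketch of what it selects. The `sample_alert` payload is hypothetical and contains only the fields the expression names; a real New Relic alert body carries more. The `extract` helper is an illustrative resolver for `.key` and `[index]` steps only, not a full JSONPath implementation and not Fylamynt's own.

```python
import re
from typing import Any

# Matches one path step: either ".key" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def extract(document: Any, path: str) -> Any:
    """Resolve a simple JSONPath made only of `.key` and `[index]` steps."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = document
    position = 1  # skip the leading "$"
    while position < len(path):
        match = _STEP.match(path, position)
        if match is None:
            raise ValueError(f"unsupported path syntax at offset {position}")
        key, index = match.groups()
        # Descend into a dict by key, or into a list by integer index.
        current = current[key] if key is not None else current[int(index)]
        position = match.end()
    return current


# Hypothetical alert body, reduced to the fields used by the expression.
sample_alert = {
    "targets": [
        {"labels": {"fullHostname": "App-Server-01.example.internal"}}
    ]
}

hostname = extract(sample_alert, "$.targets[0].labels.fullHostname")
```

Read left to right, the expression starts at the document root (`$`), takes the first element of `targets`, then descends through `labels` to `fullHostname`.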

Step 3: Add Teleport SSH Execute node

For this example workflow, the Teleport integration is used to authenticate and access the Linux server that runs on an isolated network. Fylamynt also supports adding SSH Targets for specific servers that are publicly accessible, for instance your Bastion hosts; used in conjunction with the SSH Execute action node, these let you run commands and retrieve the results.

To add and configure the Teleport SSH Execute node, here are the steps:

  1. From the left menu bar, drag and drop the Teleport SSH Execute action node onto the canvas and connect it to the previous JSONPath node

  2. Select the new action node

  3. On the right menu, select the input tab

    1. Enter the SSH User

    2. For the SSH Target Host, you will retrieve the host information from the JSONPath’s output.

  4. Add the SSH Command you want to execute on the server

    1. “systemctl restart newrelic-infra.service && journalctl --unit=newrelic-infra.service -n 100 --no-pager”

  5. Optionally, you can also add an S3 bucket where the execution logs will be stored.
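If you restart different services from similar workflows, the SSH command from step 4 can be assembled rather than typed by hand. The following is a small sketch; `build_restart_command` is a hypothetical helper, not part of Fylamynt.

```python
import shlex


def build_restart_command(unit: str, log_lines: int = 100) -> str:
    """Build the restart-then-show-logs command for a systemd unit."""
    # Quote the unit name so unusual characters cannot alter the shell command.
    quoted_unit = shlex.quote(unit)
    return (
        f"systemctl restart {quoted_unit} && "
        f"journalctl --unit={quoted_unit} -n {int(log_lines)} --no-pager"
    )


command = build_restart_command("newrelic-infra.service")
```

Because the two commands are joined with `&&`, `journalctl` runs only if `systemctl restart` exits successfully. `-n 100` limits the output to the most recent 100 journal lines, and `--no-pager` prints them directly instead of opening a pager, which suits a non-interactive session.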

Step 4: Add String Transformation node

The next step is to transform the JSON output into a string that can be easily consumed in the Slack node.

To add and configure the String Transformation node, here are the steps:

  1. From the left menu bar, drag and drop the String Transformation action node onto the canvas and connect it to the previous Teleport SSH Execute node

  2. Select the new action node

  3. On the right menu, select the input tab

    1. For the JSON Input, you will retrieve the host information from the JSONPath’s output

  4. For the operation, select To Lowercase

Step 5: Add Slack Send Message node

The Slack node is added to notify the users in the specified Slack channel that the service was restarted successfully.

To add and configure the Slack Send Message node, here are the steps:

  1. From the left menu bar, drag and drop the Slack Send Message action node onto the canvas and connect it to the previous String Transformation node

  2. Select the new action node

  3. On the right menu, select the input tab

    1. Select the Slack Channel you want to send the message to

    2. Click Add Slack Variables

      1. Enter the variable name

      2. For label select the String Transformation node, and as the previous step output select “string_output”

    3. Click Save Variable

  4. Now in the Message Text field, you can consume the variable in the following way:

    1. “Service restart on the host {{hostname}} has been successfully carried out.”
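The effect of Steps 4 and 5 together can be sketched as follows: the hostname is lowercased and then substituted for the `{{hostname}}` placeholder. The `render_message` helper and the sample hostname are hypothetical, and the sketch only assumes that `{{name}}` is replaced by the variable's value; how Fylamynt renders the message internally is not described here.

```python
import re
from typing import Mapping

# Matches placeholders such as {{hostname}}.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_message(template: str, variables: Mapping[str, str]) -> str:
    """Replace each {{name}} placeholder with the matching variable value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value supplied for variable '{name}'")
        return variables[name]

    return _PLACEHOLDER.sub(substitute, template)


# Step 4 equivalent: the "To Lowercase" operation on a hypothetical hostname.
hostname = "App-Server-01.example.internal".lower()

# Step 5 equivalent: consume the variable in the message text.
message = render_message(
    "Service restart on the host {{hostname}} has been successfully carried out.",
    {"hostname": hostname},
)
```

Raising an error for an undefined variable, rather than leaving the placeholder in place, makes a misnamed variable visible before a confusing message reaches the channel.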

Step 6: Save the workflow

Click the Save New Version button

Every change made to the workflow within the editor will be saved as a new version. You can also very easily revert to previous versions.

  1. Select the Workflow name in the top-level corner, or click on the manage versions button.

Optional action nodes

Fylamynt has over 100 actions across 38 services, with multiple integrations that you can use.

Here are some additional steps that you can add to enhance the workflow.

Approval

The approval node will send a message to a Slack channel where a user can approve or deny the restart of the service.

Slack Send Message

This action can also send a notification at the beginning of the workflow that the memory alert was received from New Relic and that the service of the application will be restarted on the affected server.

New Relic NRQL Query

This action node allows you to perform a query and retrieve data that can be used to verify the memory utilization metric on the host after the service was restarted.
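As an illustration of the kind of query this node could run, the sketch below builds an NRQL statement for the average memory utilization of one host. It assumes the New Relic infrastructure agent is reporting `SystemSample` events with the `memoryUsedPercent` and `fullHostname` attributes; check the attribute names against your own account's data. `memory_query` is a hypothetical helper, and the five-minute window is an arbitrary choice.

```python
import re

# Conservative allow-list for hostnames embedded in the query string.
_HOSTNAME = re.compile(r"[A-Za-z0-9.-]+")


def memory_query(hostname: str, minutes: int = 5) -> str:
    """Build an NRQL query for a host's average memory utilization."""
    if not _HOSTNAME.fullmatch(hostname):
        # Reject anything that could break out of the quoted string literal.
        raise ValueError(f"unexpected characters in hostname: {hostname!r}")
    return (
        "SELECT average(memoryUsedPercent) FROM SystemSample "
        f"WHERE fullHostname = '{hostname}' SINCE {int(minutes)} minutes ago"
    )


query = memory_query("app-server-01.example.internal")
```

The hostname is validated rather than escaped because it is interpolated into a quoted string, and a stray quote in the value would otherwise change the meaning of the query.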

Conditional

The conditional node can be used to review the new memory utilization metric after the service was restarted.

The rule would check whether the memory utilization is still above 80% and, if so, create a ticket or incident in one of Fylamynt’s other integrations, such as Jira, ServiceNow, or PagerDuty.

Step 7: Automate the workflow trigger

In this final step, complete the Incident Management configuration by setting up Incident Types and Incident Type associations, so that the workflow runs automatically when an alert is received from your trigger type integrations.
