Application Performance Monitoring

Application Performance Monitoring Featured Workflow

Category:

Workflow name:

Restart Linux service on memory utilization alert

Description:

The following example workflow monitors a SaaS application for performance-related issues, notifies the team, and provides remediation.

The use case addressed here is a synthetic smoke API test that runs in Datadog against a SaaS application and, on failure, sends an alert notification to a Fylamynt webhook. The workflow then retrieves the following error metrics and, based on the conditional, either creates a Jira ticket or remediates the issue after an approval request.

AWS Lambda Errors
API Gateway Latency
DynamoDB ReadThrottleEvents
API Gateway 5XXError
API Gateway Count

Integrations:

Before we review the workflow in Fylamynt, let's look at the tools and integrations it requires. First, a Datadog monitor triggers the workflow based on the Synthetic Test alert condition. This trigger can easily be replaced by your own APM tool, such as Sumo Logic, New Relic, or an AWS-native solution. We then configure an AWS Target account, which enables the no-code, drag-and-drop AWS Execution action node. This node can call any AWS API endpoint supported by the boto3 library; in this case it calls CloudWatch Get Metric Statistics for several services. The workflow also integrates with Jira to create issues, Slack to send messages, and Twilio to send urgent SMS messages to users about the issue at hand.

Workflow review:

Our example workflow is triggered by a Datadog alert. The trigger retrieves the alert body, in JSON format, from the preconfigured Datadog monitor.

The API results are retrieved from Datadog and presented to the team in a Slack message that summarizes the workflow. This summary is sent at the end of the workflow.

The first metric retrieved across a specified date range is the AWS Lambda errors using the AWS Execution action node.
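As a rough illustration of what the AWS Execution action node does in this step, the sketch below calls CloudWatch Get Metric Statistics through a boto3-style client. The function name fetch_metric_sum, the one-hour window, and the 300-second period are assumptions made for this example, not values taken from the workflow.

```python
from datetime import datetime, timedelta, timezone


def fetch_metric_sum(client, namespace, metric_name, dimensions, hours=1, period=300):
    """Sum a CloudWatch metric over the last `hours` hours.

    `client` is any object exposing get_metric_statistics, for example
    boto3.client("cloudwatch"). The window and period defaults are
    illustrative assumptions.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=period,
        Statistics=["Sum"],
    )
    # Each datapoint covers one period; add them up for the whole window.
    return sum(point["Sum"] for point in response.get("Datapoints", []))
```

With boto3 installed and credentials configured, a call such as fetch_metric_sum(boto3.client("cloudwatch"), "AWS/Lambda", "Errors", [{"Name": "FunctionName", "Value": "my-function"}]) would return the error count for a placeholder function named my-function.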

The output of the AWS Lambda error metrics is then used as the input for the conditional node and compared to the comparison value. If the output is greater than the comparison value, a Jira issue is created.
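The conditional step can be pictured as a simple greater-than check followed by building an issue. The sketch below is only an approximation of that logic: the helper names, the project key, and the "Bug" issue type are hypothetical, and the dictionary follows the general shape accepted by Jira's create-issue REST endpoint (API version 2).

```python
def exceeds_threshold(metric_value, comparison_value):
    """Mirror the conditional node: true only when the metric is strictly greater."""
    return metric_value > comparison_value


def build_jira_issue(project_key, metric_name, metric_value, comparison_value):
    """Build a create-issue payload; field values here are illustrative."""
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": f"{metric_name} exceeded threshold",
            "description": (
                f"{metric_name} was {metric_value}, "
                f"above the comparison value of {comparison_value}."
            ),
            "issuetype": {"name": "Bug"},  # assumed issue type
        }
    }


def evaluate(project_key, metric_name, metric_value, comparison_value):
    """Return an issue payload when the threshold is exceeded, else None."""
    if exceeds_threshold(metric_value, comparison_value):
        return build_jira_issue(project_key, metric_name, metric_value, comparison_value)
    return None
```

For example, evaluate("OPS", "Lambda Errors", 7, 5) would produce a payload, while evaluate("OPS", "Lambda Errors", 5, 5) would return None because the value is not strictly greater.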

The next metric retrieved across a specified date range is AWS API Gateway latency, using the AWS Execution action node.

The output of the AWS API Gateway latency metric is then used as the input for the conditional node and compared to the comparison value. If the output is greater than the comparison value, a Jira issue is created that corresponds to this observation.

The next metric retrieved across a specified date range is AWS DynamoDB ReadThrottleEvents for a specific table, using the AWS Execution action node.

The output of the AWS DynamoDB ReadThrottleEvents metric is then used as the input for the conditional node and compared to the comparison value. If the output is greater than the comparison value, an SMS message is sent to a specific recipient using the Twilio action node.
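To give a sense of what sending the SMS involves outside the Twilio action node, the sketch below builds (but does not send) a request against Twilio's Messages REST endpoint using only the standard library. The function name and all credential and phone-number values are placeholders, and this is not necessarily how the Twilio action node is implemented.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_twilio_sms_request(account_sid, auth_token, from_number, to_number, body):
    """Build a POST request for Twilio's Messages endpoint without sending it."""
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Messages.json"
    # Twilio expects form-encoded To, From, and Body fields.
    data = urlencode({"To": to_number, "From": from_number, "Body": body}).encode()
    # HTTP Basic auth with the account SID as user and the auth token as password.
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    return Request(
        url,
        data=data,
        headers={"Authorization": f"Basic {credentials}"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would actually send the message, so keep real credentials out of source code.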

Next, an approval request is sent to a Slack channel to increase the ReadCapacityUnits to 5 to unblock the table.

If the request is approved, the AWS Execution action node is used to increase the ReadCapacityUnits to 5.
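The remediation step might look roughly like the following when expressed with a boto3-style DynamoDB client. This is a sketch that assumes the table uses provisioned capacity mode. The function name is hypothetical, and the check that skips the update when capacity is already high enough is an addition for this example, not something the workflow description states.

```python
def raise_read_capacity(client, table_name, new_read_units=5):
    """Raise a table's ReadCapacityUnits, keeping its current write capacity.

    `client` is any object exposing describe_table and update_table,
    for example boto3.client("dynamodb"). Returns True if an update was sent.
    """
    table = client.describe_table(TableName=table_name)["Table"]
    current = table["ProvisionedThroughput"]
    # Assumed guard: never lower capacity or send a no-op update.
    if current["ReadCapacityUnits"] >= new_read_units:
        return False
    client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": new_read_units,
            # update_table needs both values, so reuse the existing write capacity.
            "WriteCapacityUnits": current["WriteCapacityUnits"],
        },
    )
    return True
```

Reading the current write capacity first avoids changing it by accident, since the throughput setting is submitted as a pair.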

A message is also sent to a Slack channel to inform the team of the DynamoDB table change that was made.

The last error metric retrieved is the AWS API Gateway 5XXError metric, using the AWS Execution action node.

The output of the AWS API Gateway 5XXError metric is then used as the input for the conditional node and compared to the comparison value. If the output is greater than the comparison value, a Slack message is sent to notify the team that API Gateway 5XX errors exceeded the default ratio.
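One plausible reading of the ratio in this step is the number of 5XX errors divided by the total request count (the API Gateway Count metric listed earlier). The sketch below works under that assumption. The 0.05 default is an invented placeholder, since the page does not state the actual default ratio.

```python
def error_ratio(error_count, request_count):
    """Fraction of requests that returned a 5XX error; 0.0 when there was no traffic."""
    if request_count <= 0:
        return 0.0
    return error_count / request_count


def should_notify(error_count, request_count, max_ratio=0.05):
    """True when the 5XX ratio is strictly above max_ratio (assumed default)."""
    return error_ratio(error_count, request_count) > max_ratio
```

Guarding against a zero request count keeps the check from failing during quiet periods when no traffic reached the API.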

The final step in the workflow is to send a summary of all the information collected.
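The summary step could be approximated by collecting each metric result and formatting it as a single Slack message. In the sketch below, the function name and the layout of the message are assumptions. The JSON body with a "text" field is the basic format accepted by Slack incoming webhooks.

```python
import json


def build_summary_payload(results):
    """Format collected metric results as a Slack incoming-webhook JSON body.

    `results` maps a metric label to its retrieved value.
    """
    lines = ["Workflow summary:"]
    for name, value in results.items():
        lines.append(f"- {name}: {value}")
    return json.dumps({"text": "\n".join(lines)})
```

For example, build_summary_payload({"Lambda Errors": 3, "DynamoDB ReadThrottleEvents": 12}) yields a JSON string that could be posted to a webhook URL configured for the team's channel.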


Last updated 3 years ago
