EC2 System Status Check Failure

Implementing automated recovery from EC2 system status check failure using Terraform

EC2 System Status Checks

AWS has two types of status checks that automatically monitor your EC2 instances: system status checks and instance status checks. I had been unaware of the details until, while investigating an outage, I found that an EC2 instance had suffered a system status check failure.

The system status check description includes:

Monitor the AWS systems on which your instance runs. These checks detect underlying problems with your instance that require AWS involvement to repair.

You can either wait for AWS to resolve the problem, or trigger recovery yourself.

Terraform

I decided to implement automatic recovery and add it to my Terraform configuration.

AWS provide instructions on how to configure a CloudWatch alarm to recover an instance.

Here is all it takes to implement it using Terraform.

data "aws_region" "current" {}

resource "aws_cloudwatch_metric_alarm" "auto_recovery_alarm" {
  alarm_name        = "EC2 System Status Check Recovery (demo)"
  alarm_description = "Recover EC2 instance on Status Check failure"

  # Alarm when StatusCheckFailed_System exceeds 0 in a single 60-second period
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_System"
  period              = 60
  evaluation_periods  = 1
  statistic           = "Maximum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0

  dimensions = {
    InstanceId = aws_instance.demo.id
  }

  # The built-in recover action migrates the instance to healthy hardware
  alarm_actions = ["arn:aws:automate:${data.aws_region.current.name}:ec2:recover"]
}
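The alarm references an aws_instance.demo resource that isn't shown above. A minimal sketch might look like the following; the AMI ID and instance type are placeholders (note that the recover action is only supported on certain instance types, and not on instances with instance store volumes):

data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "demo" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro" # placeholder; must be a type that supports recovery

  tags = {
    Name = "demo"
  }
}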

Did It Work?

I saw my first system check failure on an EC2 instance at work. Within a couple of weeks, a personal EC2 instance was also hit.

Both of these outages lasted about an hour before AWS automatically restarted the instances. I was curious to see how quickly the automated recovery worked in comparison.

The alarm has now been in place for 5 months, and has never been triggered. The EC2 instance being monitored has not suffered a system check failure. So I don’t know if it works.

If the alarm triggers, I’ll update this post.

Update: November 2, 2020

It’s been another 7 months, and the alarm triggered this morning. I received an email from AWS with the subject “Amazon EC2 instance recovery”.

Event timeline:

  • 06:39:24 Syslog is missing expected message from dhclient
  • 06:41:57 CloudWatch alarm changed state to “in alarm”
  • 06:41:58 The recover action was successfully executed
  • 06:43:20 Syslog resumes as the EC2 instance boots
  • 06:43:22 EC2 instance connected to network
  • 06:46:57 CloudWatch alarm changed state to “OK”

Success! The alarm reduced the outage from the hour or so seen previously to around 5 minutes.
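The email came from AWS itself, but alarm_actions accepts multiple ARNs, so you can also notify yourself directly. A hedged sketch, assuming a hypothetical SNS topic named demo-alerts with an email subscription:

resource "aws_sns_topic" "demo_alerts" {
  name = "demo-alerts"
}

# Then extend the alarm's actions to send a notification
# alongside the recover action:
#
#   alarm_actions = [
#     "arn:aws:automate:${data.aws_region.current.name}:ec2:recover",
#     aws_sns_topic.demo_alerts.arn,
#   ]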