AWS health checks - overview of our experience

Monitoring is a key part of an infrastructure team's day-to-day job. It serves several purposes, including:

Reporting
Troubleshooting
Resource status monitoring

AWS provides different approaches to address these needs. Health checks are how AWS users do “resource status monitoring”: they verify whether resources such as EC2 instances are running.

A few types of health checks AWS users can configure are:

EC2 health checks
ELB health checks
Custom health checks

These health checks serve different purposes and help services like Auto Scaling and R53, as well as application endpoint monitoring, manage AWS resources. Let us take a quick look at these services before I describe what we learned while using them.

Auto Scaling:

Auto Scaling adds or removes EC2 instances so that capacity follows demand. From the AWS documentation:

“Amazon EC2 Auto Scaling helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You create collections of EC2 instances, called Auto Scaling groups.”

You can specify the minimum number of instances in each Auto Scaling group, and Amazon EC2 Auto Scaling ensures that your group never goes below this size. You can specify the maximum number of instances in each Auto Scaling group, and Amazon EC2 Auto Scaling ensures that your group never goes above this size.

Auto Scaling groups use health checks to maintain the configuration you define. EC2 status checks are the default for Auto Scaling: if an instance fails these status checks, Auto Scaling considers the instance unhealthy and replaces it. If the ASG has a load balancer or target groups attached, you can also use ELB health checks to determine an instance's health. Note that attaching a load balancer or target group to the ASG does not enable ELB health checks by default; you need to enable them explicitly.
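As a rough sketch of that explicit step, the snippet below builds the arguments for the boto3 update_auto_scaling_group call that switches a group from EC2 to ELB health checks. The group name and the 300-second grace period are made-up values for illustration, and the actual AWS call sits in a separate function so the example can be read and run without credentials.

```python
def elb_health_check_params(group_name, grace_period_seconds=300):
    """Build kwargs for autoscaling.update_auto_scaling_group.

    The group name and grace period are illustrative assumptions.
    """
    return {
        "AutoScalingGroupName": group_name,
        "HealthCheckType": "ELB",  # the default type is "EC2"
        "HealthCheckGracePeriod": grace_period_seconds,
    }


def apply_elb_health_check(group_name):
    """Send the update to AWS (needs boto3 and valid credentials)."""
    import boto3  # imported here so the sketch loads without boto3

    client = boto3.client("autoscaling")
    client.update_auto_scaling_group(**elb_health_check_params(group_name))


params = elb_health_check_params("my-web-asg")  # hypothetical group name
print(params["HealthCheckType"])
```

The grace period matters in practice: it gives a new instance time to boot before a failing ELB check can mark it unhealthy.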

Apart from LB/target group checks, if you have your own health check system for EC2 instances, custom health checks are the way to feed its results into the ASG.
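One way to picture a custom health check: your own probe decides whether an instance is fine, and the result is reported to Auto Scaling through the boto3 set_instance_health call. The sketch below only builds the arguments for that call. The instance ID and the probe result are hypothetical; how you probe the instance is up to your own system.

```python
def custom_health_params(instance_id, probe_ok):
    """Build kwargs for autoscaling.set_instance_health.

    probe_ok is the result of your own (hypothetical) health probe.
    """
    return {
        "InstanceId": instance_id,
        "HealthStatus": "Healthy" if probe_ok else "Unhealthy",
        # Do not mark a still-booting instance unhealthy too early.
        "ShouldRespectGracePeriod": True,
    }


def report_health(instance_id, probe_ok):
    """Report the probe result to AWS (needs boto3 and credentials)."""
    import boto3  # imported here so the sketch loads without boto3

    client = boto3.client("autoscaling")
    client.set_instance_health(**custom_health_params(instance_id, probe_ok))


failed = custom_health_params("i-0123456789abcdef0", probe_ok=False)
print(failed["HealthStatus"])
```

Once an instance is reported as Unhealthy, the ASG replaces it the same way it would after a failed EC2 status check.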

R53

Amazon Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources. Each health check that you create can monitor one of the following:

The health of a specified resource, such as a web server
The status of other health checks
The status of an Amazon CloudWatch alarm

Based on business needs, users can pick the type of health check they would like to configure. For example, to monitor a specified resource, Route 53 submits automated requests over the internet, at regular intervals that you specify, to your application, server, or other resource to verify that it's reachable, available, and functional.

The “status of other health checks” approach is useful when you have multiple resources that perform the same function, such as multiple web servers, and your chief concern is whether some minimum number of them are healthy. You can create a health check for each resource without configuring notifications for those health checks, then create a health check that monitors the status of the others and notifies you only when the number of healthy resources drops below a threshold that you specify.
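The logic behind such a parent health check is simple enough to model locally. The sketch below is not the Route 53 API, only an illustration of the rule: the parent is healthy when at least a threshold number of child checks are healthy. The server names and statuses are invented.

```python
def parent_is_healthy(child_statuses, health_threshold):
    """Return True when enough child health checks are healthy.

    child_statuses maps a child check name to True (healthy) or
    False (unhealthy); health_threshold is the required minimum.
    """
    healthy_count = sum(1 for ok in child_statuses.values() if ok)
    return healthy_count >= health_threshold


# Hypothetical fleet of three web servers, one of them down.
web_servers = {"web-1": True, "web-2": False, "web-3": True}

print(parent_is_healthy(web_servers, health_threshold=2))  # True
print(parent_is_healthy(web_servers, health_threshold=3))  # False
```

With a threshold of 2, a single failed web server does not page anyone; the alert fires only when a second one goes down.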

Lessons learned with R53

The key lesson we learned is that, while configuring R53 health checks, security groups can make your life complicated (especially if you want to restrict application access to your company IPs).

Load balancers make your life easy, though they may add to your bill: To discover the availability of your EC2 instances, a load balancer periodically pings, attempts connections, or sends requests to test the EC2 instances. These tests are called health checks. The status of the instances that are healthy at the time of the health check is InService. With a load balancer in front of the instances, the setup is:

The load balancer accepts incoming requests
Configure the instance security group to allow requests only from the LB security group
Map the load balancer to a private DNS name, or configure the LB security group to allow only the IPs you choose
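The second step above can be sketched as a security group rule that names the load balancer's security group as the source instead of an IP range. The snippet builds the IpPermissions structure that the boto3 authorize_security_group_ingress call accepts. The port and both security group IDs are placeholders.

```python
def ingress_from_lb_only(app_port, lb_security_group_id):
    """Build IpPermissions that allow app_port only from the LB group."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": app_port,
            "ToPort": app_port,
            # Source is a security group, not a CIDR block.
            "UserIdGroupPairs": [{"GroupId": lb_security_group_id}],
        }
    ]


def apply_rule(instance_sg_id, app_port, lb_sg_id):
    """Add the rule in AWS (needs boto3 and valid credentials)."""
    import boto3  # imported here so the sketch loads without boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=instance_sg_id,
        IpPermissions=ingress_from_lb_only(app_port, lb_sg_id),
    )


rule = ingress_from_lb_only(8080, "sg-0aaa1111bbbb2222c")  # placeholder ID
print(rule[0]["UserIdGroupPairs"])
```

Because the rule references a group rather than addresses, it keeps working when the load balancer's own IPs change.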

Non-ELB health checks can be complicated: Users may choose non-alias records (pointing straight at an EC2 instance) for reasons like reducing the AWS bill, especially for their dev/QA environments. Things become a little challenging in this setup. If you're routing traffic to resources that you can't create alias records for, such as EC2 instances, you create a record and a health check for each resource. Then you associate each health check with the applicable record.

If your application is public facing, the setup is simple:

Configure R53 (in whatever HA/failover mode you need)
Configure the security group on the EC2 instance to allow communication on port 80/8080 (or whatever port your application runs on)
Configure the R53 health check to ping the open port or URL
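For the last step, a minimal sketch of an HTTP health check request for the boto3 route53 create_health_check call might look like the following. The IP address comes from a documentation-only range, and the path, interval and failure threshold are assumed values rather than settings from our environment.

```python
import uuid


def http_health_check_request(ip_address, port=80, path="/"):
    """Build kwargs for route53.create_health_check (HTTP endpoint check)."""
    return {
        # Any unique string; lets AWS detect retried requests.
        "CallerReference": str(uuid.uuid4()),
        "HealthCheckConfig": {
            "IPAddress": ip_address,
            "Port": port,
            "Type": "HTTP",
            "ResourcePath": path,
            "RequestInterval": 30,   # seconds between checks (assumed)
            "FailureThreshold": 3,   # consecutive failures (assumed)
        },
    }


def create_check(ip_address, port=80, path="/"):
    """Create the health check in AWS (needs boto3 and credentials)."""
    import boto3  # imported here so the sketch loads without boto3

    client = boto3.client("route53")
    return client.create_health_check(
        **http_health_check_request(ip_address, port, path)
    )


request = http_health_check_request("203.0.113.10", port=8080, path="/health")
print(request["HealthCheckConfig"]["Type"])
```

The health check ID returned by AWS is what you then attach to the DNS record for that instance.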

If your application is private facing, the setup is a little more complicated:

Configure R53 (in whatever HA/failover mode you need)
Configure the security group on the EC2 instance to allow the application port only from internal IPs. This is the step that complicates the setup: because the app is private, it doesn't make sense to open the application port to everyone
Configure the R53 health check to ping the open port or URL

Results? Health checks will ALWAYS fail. Why?

When Route 53 checks the health of an endpoint, it sends an HTTP, HTTPS, or TCP request to the IP address and port that you specified when you created the health check. For a health check to succeed, your security group must allow inbound traffic from the IP addresses that the Route 53 health checkers use. R53 has health checkers in locations around the world, so their requests arrive from public IP addresses outside your internal ranges.
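The failure is easy to reproduce on paper. The sketch below mimics how a security group with only internal CIDR blocks treats two sources: an internal client and a health checker coming from the public internet. Both addresses are invented stand-ins (the public one is from a documentation range, not a real checker IP).

```python
import ipaddress


def is_allowed(source_ip, allowed_cidrs):
    """Return True if source_ip falls inside any allowed CIDR block."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Security group restricted to internal addresses only.
internal_only = ["10.0.0.0/8"]

print(is_allowed("10.1.2.3", internal_only))      # True: internal client
print(is_allowed("198.51.100.7", internal_only))  # False: external checker
```

Every request from the health checkers gets the second result, so the endpoint looks down to Route 53 even though the application is running.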

How can we solve this issue?
The quick and dirty fix is to open the instance security group to the whole internet on the health check ports. Sometimes this makes sense, but most of the time users do not want to expose the application to everyone on the internet.

Another solution is to use an ELB (alias record) instead of a bare EC2 instance. If cost is not a concern, this is a cleaner solution.

One other solution is to configure the security group to allow traffic from the IP address ranges associated with Route 53 health checkers. This saves cost and keeps the application private. The disadvantage of this approach is the maintenance headache: these IP address ranges can change at any time, so you need to check them regularly and update the security group whenever they change.
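The maintenance chore can be scripted. AWS publishes its address ranges as JSON at ip-ranges.amazonaws.com, with the health checker ranges listed under the service name ROUTE53_HEALTHCHECKS. The sketch below filters those ranges and works out which CIDR blocks to add to or remove from a security group. The sample data uses documentation-only prefixes, and applying the result to a real security group is left out.

```python
import json
import urllib.request

IP_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"


def fetch_ip_ranges(url=IP_RANGES_URL):
    """Download the published AWS IP ranges (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def route53_checker_cidrs(ip_ranges):
    """Return the sorted IPv4 CIDRs used by Route 53 health checkers."""
    return sorted(
        {
            prefix["ip_prefix"]
            for prefix in ip_ranges.get("prefixes", [])
            if prefix.get("service") == "ROUTE53_HEALTHCHECKS"
        }
    )


def plan_update(current_cidrs, published_cidrs):
    """Return (to_add, to_remove) so the group matches the published list."""
    current, published = set(current_cidrs), set(published_cidrs)
    return sorted(published - current), sorted(current - published)


# Made-up sample in the same shape as ip-ranges.json.
sample = {
    "prefixes": [
        {"ip_prefix": "192.0.2.0/26", "service": "ROUTE53_HEALTHCHECKS"},
        {"ip_prefix": "198.51.100.0/26", "service": "ROUTE53_HEALTHCHECKS"},
        {"ip_prefix": "203.0.113.0/24", "service": "EC2"},
    ]
}

published = route53_checker_cidrs(sample)
to_add, to_remove = plan_update(["192.0.2.0/26", "192.0.2.64/26"], published)
print(to_add, to_remove)
```

Run on a schedule, a script like this turns the "validate and update frequently" step into a small diff to review instead of a manual comparison.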

Have any questions on cloud savings? Talk to us, we could help you.

Using schedulers to save cloud hosting costs? You should read the blog AWS EC2 Schedulers - not good enough tool to reduce AWS bill
