If the NAT instance is terminated or stopped, the status of its route becomes "Black Hole" in the route table. Instances in private subnets associated with that route table can no longer connect to the Internet until the route is updated to point at another NAT instance.
To make the NAT instance resilient, we can put it in an Auto Scaling Group so that it heals itself. With MinSize and MaxSize both set to 1, the Auto Scaling Group keeps exactly one NAT instance running and launches a replacement if the current one fails. The NAT instance needs to run a shell script at startup to repair the route table automatically.
The script updates route tables in the same VPC that are tagged with network=private. Use --tag-key and --tag-value to override the tag key and tag value, respectively, if you want to use a different tag. I recommend setting up a cron job on the NAT instance to run the shell script periodically so that any newly tagged route tables get the default route set automatically. The latest version of the shell script can be found at https://github.com/schen1628/aws-auto-healing-nat/blob/master/nat.sh.
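To make the repair step concrete, here is a minimal sketch of the idea; it is not the nat.sh from the repository. It assumes the AWS CLI is installed with a region configured, and the function names (find_tagged_route_tables, set_default_route, repair_routes) are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical, not from nat.sh.

# Print the IDs of route tables in a VPC that carry the given tag.
find_tagged_route_tables() {
  local vpc_id="$1" tag_key="$2" tag_value="$3"
  aws ec2 describe-route-tables \
    --filters "Name=vpc-id,Values=${vpc_id}" "Name=tag:${tag_key},Values=${tag_value}" \
    --query 'RouteTables[].RouteTableId' \
    --output text
}

# Point the default route of one route table at the given instance.
# replace-route fails when no 0.0.0.0/0 route exists yet, so fall back
# to create-route in that case.
set_default_route() {
  local route_table_id="$1" instance_id="$2"
  aws ec2 replace-route --route-table-id "$route_table_id" \
      --destination-cidr-block 0.0.0.0/0 --instance-id "$instance_id" 2>/dev/null \
    || aws ec2 create-route --route-table-id "$route_table_id" \
      --destination-cidr-block 0.0.0.0/0 --instance-id "$instance_id"
}

# Repair every tagged route table in the VPC (tag defaults to network=private).
repair_routes() {
  local vpc_id="$1" instance_id="$2" tag_key="${3:-network}" tag_value="${4:-private}"
  local rtb
  for rtb in $(find_tagged_route_tables "$vpc_id" "$tag_key" "$tag_value"); do
    set_default_route "$rtb" "$instance_id"
  done
}
```

On the NAT instance itself, the VPC ID and instance ID would typically come from the instance metadata service before calling something like repair_routes "$VPC_ID" "$INSTANCE_ID".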
The NAT instance needs the following permissions to update route tables:
The CloudFormation template can be found at https://github.com/schen1628/aws-auto-healing-nat/blob/master/nat.json. It creates the following resources, along with a cron job that runs the shell script every 5 minutes to adjust route tables if needed:
- Auto Scaling Group
- Launch Configuration
- Security Group
- IAM Instance Policy
- IAM Role
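For readers setting up the 5-minute cron job by hand rather than through the template, a sketch along these lines may help. The function name install_nat_cron, the script path, the log path, and the /etc/cron.d/nat file name are all assumptions for illustration and are not taken from the template.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names and paths are hypothetical.

# Write a cron.d entry that runs the given script every 5 minutes as root
# and appends its output to a log file.
install_nat_cron() {
  local script_path="$1" cron_file="${2:-/etc/cron.d/nat}"
  printf '*/5 * * * * root %s >> /var/log/nat.log 2>&1\n' "$script_path" > "$cron_file"
  chmod 644 "$cron_file"
}
```

For example, install_nat_cron /usr/local/bin/nat.sh would write the entry to the default location, assuming the script was copied to that path.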
NOTE: You may need to replace the NAT AMI IDs with the latest ones. On the Choose an Amazon Machine Image (AMI) page, select the Community AMIs category and search for amzn-ami-vpc-nat to find the most recent AMIs.
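The same lookup can also be done from the command line. The sketch below assumes the AWS CLI is installed and that the NAT AMIs of interest are the ones published by Amazon under the amzn-ami-vpc-nat name prefix; the function name latest_nat_ami is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the function name is hypothetical.

# Print the ID of the most recently created Amazon-owned NAT AMI in a region.
latest_nat_ami() {
  local region="$1"
  aws ec2 describe-images \
    --region "$region" \
    --owners amazon \
    --filters "Name=name,Values=amzn-ami-vpc-nat*" \
    --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
    --output text
}
```

Calling latest_nat_ami us-east-1, for instance, would print one AMI ID for that region, which could then replace the corresponding entry in the template.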