AWS with OpenTofu: Load Balancing with ALB for Zero Downtime

2021/04/17

Categories: tutorial Tags: aws opentofu terraform alb load-balancer high-availability

Purpose

In the previous tutorial, we made our infrastructure self-healing with Auto Scaling Groups. If the webserver crashed, the ASG replaced it — but there was still a brief downtime while the new instance booted. In production, even a few minutes of downtime is unacceptable.

In this tutorial, we solve this by adding an Application Load Balancer (ALB) in front of two webservers. The ALB distributes requests between them, and if one webserver fails, the other keeps serving traffic immediately — zero downtime. The ASG then replaces the failed instance in the background.

We also make an important architectural change: the webservers move from the public subnet to the private subnet. Since users now access the application through the ALB (which lives in the public subnet), the webservers no longer need to be directly exposed to the internet.

The full source code is available on my GitHub repository.

Architecture overview

  graph TB
    Internet((Internet))
    You[Your IP]

    subgraph VPC[VPC 10.0.0.0/16]
        IGW[Internet Gateway]

        subgraph LBSubs["Public Subnets ALB - 3 AZs"]
            LBCIDRs["10.0.11.0/24 | 10.0.12.0/24 | 10.0.13.0/24"]
            ALB["Application Load Balancer :80"]
        end

        subgraph NATSubs["Public Subnets NAT - 3 AZs"]
            NATCIDRs["10.0.21.0/24 | 10.0.22.0/24 | 10.0.23.0/24"]
            NAT["3x NAT Gateways"]
        end

        subgraph BastionSubs["Public Subnets Bastion - 3 AZs"]
            BastionCIDRs["10.0.31.0/24 | 10.0.32.0/24 | 10.0.33.0/24"]
            BASTION["ASG min:1 max:1 --> Bastion EC2"]
        end

        subgraph WebSubs["Private Subnets Web - 3 AZs"]
            WebCIDRs["10.0.41.0/24 | 10.0.42.0/24 | 10.0.43.0/24"]
            ASGWEB["ASG min:2 max:2 --> 2x Webserver EC2 :8000"]
        end

        subgraph RedisSubs["Private Subnets Redis - 3 AZs"]
            RedisCIDRs["10.0.51.0/24 | 10.0.52.0/24 | 10.0.53.0/24"]
            REDIS["ElastiCache Redis :6379"]
        end
    end

    Internet -- "HTTP :80" --> IGW
    IGW -- "HTTP :80" --> ALB
    ALB -- "HTTP :8000" --> ASGWEB
    You -- "SSH :22" --> IGW
    IGW -- "SSH :22" --> BASTION
    BASTION -. "SSH :22" .-> ASGWEB
    ASGWEB -- "Redis :6379" --> REDIS
    ASGWEB -- "HTTP/S outbound" --> NAT
    NAT --> IGW

    style LBCIDRs fill:#ffd,stroke:#cc0,color:#333
    style NATCIDRs fill:#ffd,stroke:#cc0,color:#333
    style BastionCIDRs fill:#ffd,stroke:#cc0,color:#333
    style WebCIDRs fill:#ffd,stroke:#cc0,color:#333
    style RedisCIDRs fill:#ffd,stroke:#cc0,color:#333

What changed from tutorial 06

Webservers moved to private subnets

In the previous tutorial, the webservers were in public subnets and each had its own Elastic IP. Now that the ALB handles all inbound traffic, the webservers don’t need public IPs anymore. They are moved to private subnets where they are unreachable from the internet — only the ALB can forward requests to them.

This is a significant security improvement. The webservers can still reach the internet for package updates via the NAT Gateway, but no one from outside can connect to them directly.
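For reference, this is roughly how such a private subnet can be declared. The resource and variable names mirror the naming pattern used elsewhere in this tutorial, but treat this as a sketch rather than the exact repository code:

```hcl
resource "aws_subnet" "private_web" {
  count             = length(var.subnet_private_web)
  vpc_id            = aws_vpc.my_vpc.id
  cidr_block        = var.subnet_private_web[count.index]
  availability_zone = var.az[count.index]

  # map_public_ip_on_launch is left at its default (false):
  # instances launched here never receive a public IP.
  tags = {
    Name = "subnet_private_web-${var.env}-${count.index}"
  }
}
```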

One NAT Gateway per Availability Zone

Instead of a single NAT Gateway, we now create one per AZ:

resource "aws_nat_gateway" "nat_gw" {
  count         = length(var.subnet_public_nat)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public_nat[count.index].id

  tags = {
    Name = "nat_gw-${var.env}-${count.index}"
  }
}

Each private subnet’s route table points to the NAT Gateway in its own AZ. This way, if AZ-A goes down, the private instances in AZ-B and AZ-C are not affected — they use their own NAT Gateways. Each NAT route table is associated with the private web subnet in the same AZ:

resource "aws_route_table" "route_nat" {
  count  = length(var.subnet_public_nat)
  vpc_id = aws_vpc.my_vpc.id

  route {
    cidr_block = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_gw[count.index].id
  }
}

resource "aws_route_table_association" "private_web" {
  count          = length(var.subnet_private_web)
  subnet_id      = aws_subnet.private_web[count.index].id
  route_table_id = aws_route_table.route_nat[count.index].id
}

The count.index ensures AZ-0’s web subnet routes through NAT-0, AZ-1’s web subnet through NAT-1, and so on.

Five subnet groups

The network now has 5 groups of subnets, each spanning 3 AZs — for a total of 15 subnets:

| Subnet group | Type    | CIDR blocks     | Purpose                  |
|--------------|---------|-----------------|--------------------------|
| ALB          | Public  | 10.0.11-13.0/24 | Load balancer endpoints  |
| NAT          | Public  | 10.0.21-23.0/24 | NAT Gateways (one per AZ)|
| Bastion      | Public  | 10.0.31-33.0/24 | SSH jump server          |
| Web          | Private | 10.0.41-43.0/24 | Webserver instances      |
| Redis        | Private | 10.0.51-53.0/24 | ElastiCache Redis        |

Public subnets route through the Internet Gateway. Private subnets route through the NAT Gateway in their respective AZ.
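For comparison, the public subnets share a route table whose default route points to the Internet Gateway. A minimal sketch, assuming the gateway resource is named igw as in the earlier tutorials:

```hcl
resource "aws_route_table" "route_igw" {
  vpc_id = aws_vpc.my_vpc.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}
```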

The Application Load Balancer

The ALB is the central piece of this tutorial. It is defined in modules/network/alb.tf and consists of three resources.

The load balancer itself

resource "aws_lb" "web" {
  name               = "alb-web-${var.env}"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb_web.id]
  subnets            = aws_subnet.public_lb[*].id
}

Setting internal = false makes it internet-facing. It is deployed across all 3 ALB public subnets and protected by its own security group.

The target group

resource "aws_lb_target_group" "web" {
  port     = local.web_port
  protocol = "HTTP"
  vpc_id   = aws_vpc.my_vpc.id

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    interval            = 30
    path                = "/cgi-bin/ping.py"
  }
}

The target group defines where the ALB forwards traffic (port 8000) and how it checks if the webservers are healthy. The ALB calls /cgi-bin/ping.py on each instance every 30 seconds. If an instance fails 2 consecutive checks (unhealthy_threshold = 2), the ALB stops sending it traffic. When the ASG launches a replacement and it passes 2 consecutive checks (healthy_threshold = 2), the ALB starts routing to it again.

The listener

resource "aws_lb_listener" "web" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    target_group_arn = aws_lb_target_group.web.arn
    type             = "forward"
  }
}

The listener accepts HTTP traffic on port 80 and forwards it to the target group. The flow is:

  graph LR
    User[User :80] --> ALB[ALB Listener :80]
    ALB --> TG[Target Group :8000]
    TG --> WEB1[Web Server 1 :8000]
    TG --> WEB2[Web Server 2 :8000]
    TG -. "Health check every 30s" .-> PING["/cgi-bin/ping.py"]

ALB security group

The ALB has its own security group that allows HTTP inbound on port 80 from anywhere and allows outbound traffic only to the webserver security group on port 8000:

resource "aws_security_group_rule" "alb_web_from_any_http" {
  type              = "ingress"
  from_port         = local.http_port
  to_port           = local.http_port
  protocol          = "tcp"
  cidr_blocks       = local.anywhere
  security_group_id = aws_security_group.alb_web.id
}

resource "aws_security_group_rule" "alb_web_to_web_http" {
  type                     = "egress"
  from_port                = local.web_port
  to_port                  = local.web_port
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.web.id
  security_group_id        = aws_security_group.alb_web.id
}

The webserver security group mirrors this — it accepts HTTP on port 8000 only from the ALB security group:

resource "aws_security_group_rule" "web_from_alb_web_http" {
  type                     = "ingress"
  from_port                = local.web_port
  to_port                  = local.web_port
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.alb_web.id
  security_group_id        = aws_security_group.web.id
}

The webserver ASG

The webserver ASG now maintains 2 instances instead of 1, and is attached to the ALB target group:

resource "aws_autoscaling_group" "web" {
  name                = "asg_web-${var.env}"
  vpc_zone_identifier = data.terraform_remote_state.network.outputs.subnet_private_web_id[*]
  target_group_arns   = [data.terraform_remote_state.network.outputs.alb_target_group_web_arn]
  health_check_type   = "ELB"
  min_size            = 2
  max_size            = 2

  launch_template {
    id = aws_launch_template.web.id
  }
}

Two important changes compared to tutorial 06:

First, the ASG is attached to the ALB target group via target_group_arns, and health_check_type = "ELB" tells the ASG to replace instances that fail the ALB health checks, not only those failing the basic EC2 status checks.

Second, since the webservers are now in private subnets, the launch template sets associate_public_ip_address = false: no public IP is needed.
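The relevant part of the launch template could look like this. The AMI, instance type, and security group variables are placeholders for whatever the repository defines:

```hcl
resource "aws_launch_template" "web" {
  name_prefix   = "web-${var.env}-"
  image_id      = var.ami_id
  instance_type = var.instance_type

  network_interfaces {
    # Private subnet: no public IP; outbound traffic
    # goes through the per-AZ NAT Gateway.
    associate_public_ip_address = false
    security_groups             = [var.sg_web_id]
  }
}
```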

The health check endpoint

The user-data script creates a simple /cgi-bin/ping.py page that returns “ok”:

#!/usr/bin/env python3

print("Content-type: text/html")
print("")
print("<html><body>")
print("<p>ok</p>")
print("</body></html>")

This is separate from hello.py because the health check should be lightweight — it doesn’t need to connect to Redis. The ALB calls this endpoint every 30 seconds on each instance to verify the webserver process is running.

The hello.py page now also displays the instance ID, so you can see which server is responding. $INSTANCE_ID is a shell variable expanded by the user-data script when it generates the page at boot, so each server prints its own ID:

print("Id: $INSTANCE_ID")

When you curl the ALB multiple times, you’ll see the instance ID alternating between the two servers — proving the load balancer is distributing requests.

Project structure

aws-terraform-tuto07/
├── modules/
│   ├── network/
│   │   ├── main.tf           # VPC, 15 subnets, IGW, 3 NAT GWs, routes
│   │   ├── sg.tf             # Security groups for bastion, ALB, web, database
│   │   ├── alb.tf            # Application Load Balancer, target group, listener
│   │   ├── iam.tf            # IAM role for EIP association
│   │   ├── outputs.tf
│   │   ├── providers.tf
│   │   └── variables.tf
│   ├── bastion/              # ASG min:1 max:1, EIP re-association
│   ├── database/             # ElastiCache Redis
│   └── web/                  # ASG min:2 max:2, attached to ALB target group
└── envs/
    └── dev/
        ├── 01-network/
        ├── 02-bastion/
        ├── 03-database/
        └── 04-web/

The network module now includes alb.tf for the load balancer configuration.

Deploy the infrastructure

Prepare your variables

Create a file at ~/terraform/aws-terraform-tuto07/terraform_vars_dev_secrets:

export TF_VAR_aws_profile="dev"
export TF_VAR_region="eu-west-3"
export TF_VAR_bucket="XXXX-tofu-state"
export TF_VAR_key_network="tuto-07/dev/network/terraform.tfstate"
export TF_VAR_key_bastion="tuto-07/dev/bastion/terraform.tfstate"
export TF_VAR_key_database="tuto-07/dev/database/terraform.tfstate"
export TF_VAR_key_web="tuto-07/dev/web/terraform.tfstate"
export TF_VAR_ssh_public_key="ssh-ed25519 XXXX"
MY_IP=$(curl -s ifconfig.co/)
export TF_VAR_my_ip_address="$MY_IP/32"
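Before running make, load these variables into your shell. This assumes the Makefiles pick the TF_VAR_* values up from the environment, as in the previous tutorials:

```shell
$ source ~/terraform/aws-terraform-tuto07/terraform_vars_dev_secrets
```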

Build

Deploy the four stacks in order:

$ cd envs/dev/01-network
$ make apply
$ cd ../02-bastion
$ make apply
$ cd ../03-database
$ make apply
$ cd ../04-web
$ make apply

Test the application

Get the DNS name of the ALB:

$ aws --profile dev elbv2 describe-load-balancers --names alb-web-dev \
    --query 'LoadBalancers[*].DNSName' \
    --output text

Test the application by issuing several requests:

$ curl http://<load_balancer_dns>/cgi-bin/hello.py

Each request increments the counter (stored in ElastiCache Redis). You should also notice the instance ID alternating between two values — that’s the ALB distributing traffic across both webservers.

Test the high availability

This is where the ALB shines. Let’s kill one webserver and verify there is zero downtime.

First, connect to one of the webservers through the bastion and kill the Python process:

$ ssh -J ec2-user@<bastion_eip> ec2-user@<web_private_ip>
$ sudo pkill python3

Now keep making requests:

$ curl http://<load_balancer_dns>/cgi-bin/hello.py

The ALB detects the unhealthy instance after 2 failed health checks (about 60 seconds) and stops routing to it. Throughout, requests keep succeeding: the remaining healthy server answers them all. You'll notice the instance ID no longer alternates; only the healthy server's ID appears.

After a few minutes, the ASG launches a replacement instance. Once it boots, runs user-data, and passes 2 consecutive health checks, the ALB starts routing to it again. The instance ID will start alternating again, this time with the new instance’s ID.

The key difference from tutorial 06: at no point was the service unavailable. One server was always handling requests while the other was being replaced.

Clean up

Destroy in reverse order:

$ cd envs/dev/04-web
$ make destroy
$ cd ../03-database
$ make destroy
$ cd ../02-bastion
$ make destroy
$ cd ../01-network
$ make destroy

Summary

In this tutorial, we added an Application Load Balancer to distribute traffic across two webservers, achieving zero-downtime failover. When one webserver fails, the ALB routes all traffic to the remaining healthy server while the ASG replaces the failed one in the background.

We also moved the webservers from public to private subnets — since users access the application through the ALB, the webservers no longer need to be directly exposed to the internet. And we deployed one NAT Gateway per Availability Zone to ensure private instances maintain internet access even if an AZ fails.

In the next tutorial, I will show you how to auto-scale your infrastructure when the servers are overloaded — dynamically adding or removing webservers based on CPU utilization.
