AWS EKS: migrate from Amazon Linux 2 to AL2023 with Terraform

AWS EKS 1.32 will be the last release to support the Amazon Linux 2 (AL2) optimized AMI; 1.33 will only support the AL2023 optimized AMI.

If you are using Terraform to manage the EKS cluster, here is a simplified version of the code with worker nodes running on AL2:

resource "aws_eks_cluster" "my" {
  name     = "test"
  ...
}
data "aws_ami" "al2" {
  filter {
    name   = "name"
    values = ["amazon-eks-node-${aws_eks_cluster.my.version}-v*"]
  }
  most_recent = true
  owners      = ["602401143452"]
}
locals {
  user_data = <<USERDATA
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh --apiserver-endpoint '${aws_eks_cluster.my.endpoint}' --b64-cluster-ca '${aws_eks_cluster.my.certificate_authority[0].data}' ${aws_eks_cluster.my.name}
USERDATA
}
resource "aws_launch_template" "worker" {
  name_prefix = "test-worker"
  image_id    = data.aws_ami.al2.id
  user_data = base64encode(local.user_data)
  ...
}
resource "aws_autoscaling_group" "my" {
  launch_template {
    id      = aws_launch_template.worker.id
    version = "$Latest"
  }
  ...
}

In order to migrate to AL2023, you need to change the aws_ami data source to match the new AMI name and switch the user_data from the custom bootstrap script to the nodeadm NodeConfig method:

resource "aws_eks_cluster" "my" {
  name     = "test"
  ...
}
data "aws_ami" "al2023" {
  filter {
    name   = "name"
    values = ["amazon-eks-node-al2023-x86_64-standard-${aws_eks_cluster.my.version}-v*"]
  }
  most_recent = true
  owners      = ["602401143452"]
}
locals {
  user_data = <<USERDATA
---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: ${aws_eks_cluster.my.name}
    apiServerEndpoint: ${aws_eks_cluster.my.endpoint}
    certificateAuthority: ${aws_eks_cluster.my.certificate_authority[0].data}
    cidr: ${aws_eks_cluster.my.kubernetes_network_config[0].service_ipv4_cidr}
USERDATA
}
resource "aws_launch_template" "worker" {
  name_prefix = "test-worker"
  image_id    = data.aws_ami.al2023.id
  user_data = base64encode(local.user_data)
  ...
}
resource "aws_autoscaling_group" "my" {
  launch_template {
    id      = aws_launch_template.worker.id
    version = "$Latest"
  }
  ...
}

Run a “terraform apply” and destroy a node: after a few seconds a new node will spawn, and you will notice that the “AMI Name” of the EC2 instance is something like “amazon-eks-node-al2023-x86_64-standard-1.31-v20250203”. Execute “kubectl get nodes” and you should see the new node joined to the cluster.
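
If you want to double-check from the command line, here is a rough verification sketch (the instance ID is a placeholder, and it assumes the AWS CLI and kubectl are already configured for this account and cluster):

terraform apply

# Look up the AMI name of a freshly launched worker
# (i-0123456789abcdef0 is a placeholder instance ID).
AMI_ID=$(aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].ImageId' --output text)
aws ec2 describe-images --image-ids "$AMI_ID" --query 'Images[0].Name' --output text
# expected: something like amazon-eks-node-al2023-x86_64-standard-1.31-v20250203

# Confirm the node joined the cluster; the OS-IMAGE column should report Amazon Linux 2023.
kubectl get nodes -o wide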

An AL2023 Kubernetes node works very similarly to an AL2 one; here are some hints that may be useful to you:

  • AL2 included the “crictl” package for quick-and-dirty container debugging on the node; AL2023 doesn’t include this package, but it ships “nerdctl”, which works very similarly (see the sketch after this list).
  • AL2023 is bundled with SELinux, but it is configured in “permissive” mode by default and does not officially support “enforcing” mode. If you would like to use enforcing mode, follow the relevant GitHub issues.
  • AL2023 doesn’t support IMDSv1 anymore; switch to IMDSv2 by adding this block to the aws_launch_template:
  metadata_options {
    http_tokens                 = "required"
    http_put_response_hop_limit = 1
  }
  • If you need to pass some “kubelet-extra-args” options to AL2023, follow this example:
locals {
  user_data = <<USERDATA
---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: ${aws_eks_cluster.my.name}
    apiServerEndpoint: ${aws_eks_cluster.my.endpoint}
    certificateAuthority: ${aws_eks_cluster.my.certificate_authority[0].data}
    cidr: ${aws_eks_cluster.my.kubernetes_network_config[0].service_ipv4_cidr}
  kubelet:
    flags:
      - --node-labels=mynodegroup=ondemand
USERDATA
}
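
As a rough illustration of the crictl-to-nerdctl switch mentioned in the first hint (commands run as root on the worker node; container IDs are placeholders):

# On AL2 you would have used crictl:
#   crictl ps
#   crictl logs <container-id>

# On AL2023, nerdctl covers the same quick checks; Kubernetes containers
# live in containerd's "k8s.io" namespace, so pass it with -n:
nerdctl -n k8s.io ps
nerdctl -n k8s.io inspect <container-id>
nerdctl -n k8s.io exec -it <container-id> sh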

AWS AppConfig agent error “connection refused”

The AWS AppConfig service is useful for feature flag functionality. You can access it directly via its API, but that is not the suggested method: for production workloads it is a best practice to use the provided agent. If you are using AppConfig on Kubernetes or EKS, you should add the appconfig-agent container to your deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-application-label
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-application-label
  template:
    metadata:
      labels:
        app: my-application-label
    spec:
      containers:
      - name: my-app
        image: my-repo/my-image
        imagePullPolicy: IfNotPresent
      - name: appconfig-agent
        image: public.ecr.aws/aws-appconfig/aws-appconfig-agent:2.x
        ports:
        - name: http
          containerPort: 2772
          protocol: TCP
        env:
        - name: SERVICE_REGION
          value: region
        imagePullPolicy: IfNotPresent
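
Your application then reads the configuration from the agent over localhost instead of calling the AppConfig API directly. A quick sanity check from inside the application container could look like this (application, environment and configuration names are placeholders, and it assumes curl is available in the image):

kubectl -n my-namespace exec deploy/my-app -c my-app -- \
  curl -s "http://localhost:2772/applications/APPLICATION_NAME/environments/ENVIRONMENT_NAME/configurations/CONFIGURATION_NAME"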

This method works, but in some edge cases you could “randomly” get an exception like this:

cURL error 7: Failed to connect to localhost port 2772 after 0 ms: Connection refused (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:2772/applications/APPLICATION_NAME/environments/ENVIRONMENT_NAME/configurations/CONFIGURATION_NAME

If you take a look at the logs, you will notice that the AppConfig agent has been explicitly shut down:

[appconfig agent] INFO shutdown complete (actual duration: 50ms)
[appconfig agent] INFO received terminated signal, shutting down
[appconfig agent] INFO shutting down in 50ms
[appconfig agent] INFO stopping server on localhost:2772

Digging further into the logs, you will notice that the primary container keeps working for some seconds after the appconfig-agent has been shut down, and that’s the problem: the appconfig-agent is very fast to shut down, so if your primary container is still serving requests after the agent has stopped, it will no longer be able to connect to it and you will get the error above.

How do you make sure that the appconfig-agent is always available in a deployment? The Sidecar Containers feature, enabled by default since the Kubernetes 1.29 release, is a perfect fit: the sidecar container (appconfig-agent) will be the first to start and the last to stop, so your primary container will always find the agent ready.

Modify the deployment this way:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-application-label
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-application-label
  template:
    metadata:
      labels:
        app: my-application-label
    spec:
      containers:
      - name: my-app
        image: my-repo/my-image
        imagePullPolicy: IfNotPresent
      initContainers: 
      - name: appconfig-agent
        image: public.ecr.aws/aws-appconfig/aws-appconfig-agent:2.x
        restartPolicy: Always
        ports:
        - name: http
          containerPort: 2772
          protocol: TCP
        env:
        - name: SERVICE_REGION
          value: region
        imagePullPolicy: IfNotPresent
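
To roughly verify the new layout, check that the pod reports both containers ready, that appconfig-agent now shows up among the init containers, and that the agent keeps logging while the application runs (names match the manifest above; exact output will vary):

kubectl -n my-namespace get pods -l app=my-application-label
kubectl -n my-namespace describe pod -l app=my-application-label
kubectl -n my-namespace logs deploy/my-app -c appconfig-agent -f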