Automate AKS Deployment and Chaos Engineering with Terraform and GitHub Actions
Published May 17 2024 09:03 AM



Azure Chaos Studio is a fully managed chaos engineering platform that helps you identify and mitigate potential issues in your applications before they impact customers. It enables you to intentionally introduce faults and disruptions to test the resilience and robustness of your systems. By using Chaos Studio, you can uncover hard-to-find problems in your applications, from late-stage development through production, and plan mitigations to improve overall system reliability.


The GitHub Actions workflows in this post automate the deployment and management of an Azure Kubernetes Service (AKS) cluster with Terraform, and deploy Chaos Mesh experiments and the Azure Vote service into that cluster. Because the workflows live alongside the code in GitHub, infrastructure updates and deployments run on code changes or manual triggers. Together, GitHub Actions, Azure, and Kubernetes form an automated pipeline for maintaining AKS-hosted applications and testing their resilience.


Automating AKS with Terraform


To automate the deployment and management of an Azure Kubernetes Service (AKS) cluster, I utilized Terraform with the AKS module provided by Azure. This module simplifies the process by abstracting many of the complex configurations needed to set up and manage an AKS cluster.


In the Terraform configuration, I specified the AKS module with the latest version at the time, ensuring compatibility with the latest features and updates. The configuration began by defining essential parameters, such as the resource group name, Kubernetes version, and admin username. Automatic patch upgrades were enabled to ensure the cluster remains updated with the latest patches.

The cluster was configured to use virtual machine scale sets for agent nodes, with a specific node size and a range of nodes to accommodate varying workloads. Custom Linux OS configurations were applied to the agent nodes, enhancing their performance and security settings.


To enhance security, the API server was restricted to authorized IP ranges, including both public and private IP addresses of a bastion host and additional CIDR ranges. Integration with Azure Container Registry (ACR) was facilitated by attaching the ACR ID to the AKS cluster, enabling seamless container management.


Advanced features such as Azure Policy, auto-scaling, and HTTP application routing were enabled to improve cluster governance, scalability, and traffic management. User-assigned managed identities were employed for secure access control, and key management services (KMS) were enabled to secure sensitive data using Azure Key Vault.


Network settings were carefully configured, including DNS service IP, service CIDR, network plugin, and policy settings, ensuring robust network management and security. Role-based access control (RBAC) was enabled and managed through Azure Active Directory (AAD) to streamline user and group management.


Additional features such as log analytics, maintenance windows, and secret rotation were configured to enhance cluster monitoring, maintenance, and security. Tags and labels were added to agent nodes for better organization and resource management.

By defining these configurations in Terraform, the AKS deployment process was automated, making it reproducible and manageable through code. This approach not only reduced manual intervention but also ensured consistency and reliability in the AKS infrastructure.


Note: The code below is for illustration only and may be outdated by the time you read it. It was used solely in a demo environment to show how an Azure Kubernetes Service (AKS) cluster and Chaos Mesh can be automated with the AKS module in Terraform. Although the configuration covers security, scalability, and management features, review and update it against current Azure and Terraform best practices and versions before using it in production. It is an educational example and may need modification to fit current standards and your specific use case.



module "aks" {
  source                                   = "Azure/aks/azurerm"
  version                                  = "7.4.0"
  prefix                                   = random_id.aks.hex
  resource_group_name                      =
  kubernetes_version                       = "1.27" # don't specify the patch version!
  admin_username                           = "azureuser"
  automatic_channel_upgrade                = "patch"
  agents_availability_zones                = ["1"]
  agents_count                             = null
  agents_max_count                         = var.agents_max_count
  agents_max_pods                          = 75
  agents_min_count                         = var.agents_min_count
  agents_size                              = "Standard_D2s_v3"
  agents_pool_name                         = "testnodepool"
  agents_type                              = "VirtualMachineScaleSets"
  agents_pool_linux_os_configs             = [
    {
      transparent_huge_page_enabled        = "always"
      sysctl_configs                       = [
        {
          fs_aio_max_nr                    = 65536
          fs_file_max                      = 100000
          fs_inotify_max_user_watches      = 1000000
        }
      ]
    }
  ]
  api_server_authorized_ip_ranges          = concat(["${azurerm_linux_virtual_machine.bastion.public_ip_address}/32", "${azurerm_linux_virtual_machine.bastion.private_ip_address}/32", "REDACTED"],var.chaos_studio_cidr_ranges)
  attached_acr_id_map                      = {
    example                                =
  }
  azure_policy_enabled                     = true
  auto_scaler_profile_enabled              = true
  auto_scaler_profile_expander             = "least-waste"
  enable_auto_scaling                      = true
  http_application_routing_enabled         = true
  identity_ids                             = []
  identity_type                            = "UserAssigned"
  ingress_application_gateway_enabled      = false
  #ingress_application_gateway_id           =
  #ingress_application_gateway_subnet_cidr = ""
  key_vault_secrets_provider_enabled       = true
  kms_enabled                              = true
  kms_key_vault_key_id                     = "https://${}${}/${azurerm_key_vault_key.aks_key.version}"
  local_account_disabled                   = false
  log_analytics_workspace_enabled          = true
  cluster_log_analytics_workspace_name     = random_id.aks.hex
  microsoft_defender_enabled               = false
  maintenance_window                       = {
    allowed                                = [
      {
        day                                = "Sunday",
        hours                              = [22,23]
      }
    ]
    not_allowed                            = [
      {
        start                              = "2024-01-01T20:00:00Z",
        end                                = "2024-01-01T21:00:00Z"
      }
    ]
  }
  net_profile_dns_service_ip               = ""
  net_profile_service_cidr                 = ""
  network_plugin                           = "azure"
  network_policy                           = "azure"
  os_disk_size_gb                          = 60
  private_cluster_enabled                  = false
  public_network_access_enabled            = true
  rbac_aad                                 = true
  rbac_aad_managed                         = true
  role_based_access_control_enabled        = true
  secret_rotation_enabled                  = true
  sku_tier                                 = "Standard"
  storage_profile_blob_driver_enabled      = true
  storage_profile_enabled                  = true
  temporary_name_for_rotation              = "a${random_string.aks_temporary_name_for_rotation.result}"
  vnet_subnet_id                           =
  rbac_aad_admin_group_object_ids          = [azuread_group.aks_admins.object_id]

  agents_labels                            = {
    "Agent"                                : "agentLabel"
  }

  agents_tags                              = {
    "Agent"                                : "agentTag"
  }
  depends_on                               = [
  ]
}



Automating AKS with GitHub Actions


The provided GitHub Action workflow automates the deployment of an Azure Kubernetes Service (AKS) cluster using Terraform. This workflow is triggered on two conditions: when changes are pushed to the main branch within the terraform directory, or manually through a workflow dispatch event. The manual trigger allows users to specify the desired Terraform operation (plan, apply, or destroy) through an input parameter. This flexibility enables users to review changes, apply the infrastructure configuration, or tear it down as needed.


The workflow defines a single job named 'Terraform' that runs on the latest Ubuntu runner. Azure authentication values are supplied through environment variables populated from repository secrets. The steps check out the repository, install the pinned Terraform version, and initialize Terraform with backend configuration sourced from an environment file. The workflow then validates the configuration and runs one Terraform command: a push runs plan only, while a manual dispatch runs whichever of plan, apply, or destroy was selected. Apply and destroy therefore never run without an explicit manual trigger, which keeps deployments of the AKS cluster consistent and reproducible.
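The trigger-to-command mapping described above can be sketched as a small Bash function. This is only an illustration of the decision logic, not part of the workflow itself: the function name `select_operation` and its arguments are hypothetical, and the real workflow expresses the same rules through `if:` conditions on its steps.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original workflow): maps a trigger to the
# Terraform command that would run.
#   $1 = event name (push | workflow_dispatch)
#   $2 = value of the terraform_operation input (empty for push events)
select_operation() {
  local event_name="$1" operation="${2:-}"
  if [ "$event_name" = "push" ]; then
    echo "plan"   # a push only ever produces a plan
  elif [ "$event_name" = "workflow_dispatch" ]; then
    case "$operation" in
      plan|apply|destroy) echo "$operation" ;;
      *) echo "unsupported operation: $operation" >&2; return 1 ;;
    esac
  else
    echo "unsupported event: $event_name" >&2
    return 1
  fi
}

select_operation push ""                   # prints: plan
select_operation workflow_dispatch apply   # prints: apply
```

Keeping apply and destroy behind the manual trigger means an ordinary merge to main can never change or remove infrastructure on its own.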



on:
  push:
    branches: [main]
    paths:
      - 'terraform/**'
  workflow_dispatch:
    inputs:
      terraform_operation:
        description: "Terraform operation: plan, apply, destroy"
        required: true
        default: "plan"
        type: choice
        options:
          - plan
          - apply
          - destroy

name: Deploy AKS Cluster

jobs:
  terraform:
    name: 'Terraform'
    runs-on: ubuntu-latest
    env:
      ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
      ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
      GITHUB_TOKEN: ${{ secrets.GH_TOKEN }}
      TF_VERSION: 1.6.1

    defaults:
      run:
        shell: bash
        working-directory: ./terraform

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        id: init
        run: |
          set -a 
          source ../.env.backend
          terraform init \
            -backend-config="resource_group_name=$TF_VAR_state_resource_group_name" \

      - name: Terraform Validate
        id: validate
        run: terraform validate -no-color

      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color
        if: "${{ github.event_name == 'workflow_dispatch' && github.event.inputs.terraform_operation == 'plan' || github.event_name == 'push' }}"

      - name: Terraform Apply
        id: apply
        run: terraform apply -auto-approve
        if: "${{ github.event_name == 'workflow_dispatch' && github.event.inputs.terraform_operation == 'apply' }}"

      - name: Terraform Destroy
        id: destroy
        run: terraform destroy --auto-approve
        if: "${{ github.event.inputs.terraform_operation == 'destroy' }}"



Automating Chaos Studio with Terraform


The provided Terraform code defines resources for deploying Chaos Mesh. First, it creates a new Kubernetes namespace named "chaos-testing" using the kubernetes_namespace resource. This namespace isolates the Chaos Mesh components from other workloads in the cluster, enhancing organization and security by confining the chaos engineering experiments to a dedicated area.


Next, the code uses the helm_release resource to install Chaos Mesh via Helm, a package manager for Kubernetes. The Helm chart for Chaos Mesh is specified from its official repository, with version 2.6 explicitly chosen. The installation occurs within the previously defined "chaos-testing" namespace. The set blocks within the helm_release resource customize the installation by configuring the chaosDaemon to use containerd as the runtime and specifying the socket path for the container runtime. This setup ensures that Chaos Mesh integrates correctly with the underlying container runtime, enabling effective chaos engineering experiments to test the resilience and robustness of applications running in the Kubernetes cluster.



resource "kubernetes_namespace" "chaos_testing" {
  metadata {
    name     = "chaos-testing"
  }
}

resource "helm_release" "chaos_mesh" {
  name       = "chaos-mesh"
  repository = ""
  chart      = "chaos-mesh"
  namespace  = kubernetes_namespace.chaos_testing.metadata[0].name
  version    = "2.6"  # specify the version of the Chaos Mesh chart you want to deploy

  set {
    name     = "chaosDaemon.runtime"
    value    = "containerd"
  }

  set {
    name     = "chaosDaemon.socketPath"
    value    = "/run/containerd/containerd.sock"
  }
}



Automating Chaos Studio with GitHub Actions


The GitHub Action workflow provided facilitates the deployment and management of Chaos Mesh experiments and the Azure Vote service within an AKS (Azure Kubernetes Service) cluster. This workflow can be triggered by three types of events: a push to the main branch, a published release, and a manual trigger via workflow_dispatch. The manual trigger allows users to choose between three operations: deploying the vote service, uninstalling the vote service, or deploying chaos experiments.


The workflow defines three separate jobs corresponding to these operations, each running on a self-hosted runner. The deploy_vote_service job checks out the repository, logs into Azure using provided credentials, and sets up the Kubernetes configuration to interact with the AKS cluster. It then creates a namespace and deploys the Azure Vote service. The uninstall_vote_service job follows similar steps but focuses on removing the Azure Vote service from the cluster. The deploy_chaos_experiments job is more complex, involving the setup of the AKS configuration, deployment of chaos experiments, and management of necessary role assignments in Azure AD. It iterates over a set of predefined chaos experiment configurations, applies them, and ensures appropriate permissions are set for the experiments to interact with the AKS cluster. This structured approach ensures a consistent and automated deployment process for both the Azure Vote service and Chaos Mesh experiments.
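In the chaos experiments job, one step collects the principal ID of each experiment's identity and a later step loops over those IDs to grant access. A comma-separated string is a convenient way to pass such a list between steps. The sketch below shows that join-and-split pattern in isolation; the function names `join_ids` and `split_ids` and the sample IDs are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating how a list of IDs can be handed from one
# workflow step to another as a single comma-separated string.

# Join all arguments with commas.
join_ids() {
  local joined="" id
  for id in "$@"; do
    joined+="${id},"
  done
  echo "${joined%,}"   # strip the trailing comma
}

# Split a comma-separated string back into one ID per line.
split_ids() {
  local -a ids
  IFS=',' read -ra ids <<< "$1"
  printf '%s\n' "${ids[@]}"
}

ids="$(join_ids "id-a" "id-b" "id-c")"
echo "$ids"        # prints: id-a,id-b,id-c
split_ids "$ids"   # prints id-a, id-b and id-c on separate lines
```

Setting `IFS` only for the `read` command keeps the change local, so the rest of the script still splits on whitespace as usual.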



on:
  push:
    branches:
      - main
  release:
    types: [published]
  workflow_dispatch:
    inputs:
      chaos_experiments_operation:
        description: 'Operation: Deploy Experiments for Chaos Mesh'
        required: true
        default: 'deploy_vote_service'
        type: choice
        options:
          - deploy_vote_service
          - uninstall_vote_service
          - deploy_chaos_experiments
name: Deploy Chaos Mesh Experiments & Vote Service

jobs:
    deploy_vote_service:
      runs-on: self-hosted
      if: ${{ github.event.inputs.chaos_experiments_operation == 'deploy_vote_service' }}
      steps:
        - name: Checkout
          uses: actions/checkout@v4

        - name: Azure Login
          uses: azure/login@v1
          with:
            creds: '{"clientId":"${{ secrets.ARM_CLIENT_ID }}","clientSecret":"${{ secrets.ARM_CLIENT_SECRET }}","subscriptionId":"${{ secrets.ARM_SUBSCRIPTION_ID }}","tenantId":"${{ secrets.ARM_TENANT_ID }}"}'

        - name: kubeconfig
          run: |
              az aks get-credentials --resource-group ${{ secrets.AKS_RESOURCE_GROUP }} --name ${{ secrets.AKS_NAME }} --overwrite-existing
              kubelogin convert-kubeconfig -l azurecli

        - name: Create Namespace
          run: |
              kubectl get namespace azure-vote || kubectl create namespace azure-vote

        - name: Install Azure Vote Service
          run: |
              kubectl apply -f ./app/azure-vote.yaml -n azure-vote
              kubectl get service azure-vote-front -n azure-vote

    uninstall_vote_service:
      runs-on: self-hosted
      if: ${{ github.event.inputs.chaos_experiments_operation == 'uninstall_vote_service' }}
      steps:
        - name: Checkout
          uses: actions/checkout@v4

        - name: Azure Login
          uses: azure/login@v1
          with:
            creds: '{"clientId":"${{ secrets.ARM_CLIENT_ID }}","clientSecret":"${{ secrets.ARM_CLIENT_SECRET }}","subscriptionId":"${{ secrets.ARM_SUBSCRIPTION_ID }}","tenantId":"${{ secrets.ARM_TENANT_ID }}"}'

        - name: kubeconfig
          run: |
            az aks get-credentials --resource-group ${{ secrets.AKS_RESOURCE_GROUP }} --name ${{ secrets.AKS_NAME }} --overwrite-existing
            kubelogin convert-kubeconfig -l azurecli

        - name: Uninstall Azure Vote Service
          run: |
            kubectl delete -f ./app/azure-vote.yaml -n azure-vote

    deploy_chaos_experiments:
      runs-on: self-hosted
      if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && github.event.inputs.chaos_experiments_operation == 'deploy_chaos_experiments') }}
      steps:
        - name: Checkout
          uses: actions/checkout@v4

        - name: Azure Login
          uses: azure/login@v1
          with:
            creds: '{"clientId":"${{ secrets.ARM_CLIENT_ID }}","clientSecret":"${{ secrets.ARM_CLIENT_SECRET }}","subscriptionId":"${{ secrets.ARM_SUBSCRIPTION_ID }}","tenantId":"${{ secrets.ARM_TENANT_ID }}"}'

        - name: Deploy Chaos Experiment AKS Targets
          run: |
            # Fill in the placeholders in every experiment template
            for file in ${{ github.workspace }}/json/*.json; do
              sed -i 's/SUBSCRIPTION_ID_PLACEHOLDER/${{ secrets.ARM_SUBSCRIPTION_ID }}/g' "$file"
              sed -i 's/RESOURCE_GROUP_PLACEHOLDER/${{ secrets.AKS_RESOURCE_GROUP }}/g' "$file"
              sed -i 's/AKS_NAME_PLACEHOLDER/${{ secrets.AKS_NAME }}/g' "$file"
            done

            # Create the chaos target
            az rest --method put --uri "${{ secrets.AKS_RESOURCE_ID }}/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh?api-version=${{ secrets.API_VERSION }}" --headers 'Content-Type=application/json' --body "{\"properties\":{}}"

            # $headers is not defined in this excerpt; assumed to match the header used for the target request above
            headers='Content-Type=application/json'

            # Create the capabilities and the chaos experiments
            experimentNames=("PodChaos-2.1" "DNSChaos-2.1" "HTTPChaos-2.1" "KernelChaos-2.1" "TimeChaos-2.1" "IOChaos-2.1" "StressChaos-2.1" "NetworkChaos-2.1")
            for experimentName in "${experimentNames[@]}"; do
              echo "Creating capability ${experimentName}"
              az rest --method put --uri "${{ secrets.AKS_RESOURCE_ID }}/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh/capabilities/${experimentName}?api-version=${{ secrets.API_VERSION }}" --headers "$headers" --body "{\"properties\":{}}"
              echo "Creating experiment ${experimentName}"
              # The URI is written as an ARM resource ID (leading /subscriptions/), which az rest resolves against the ARM endpoint
              response=$(az rest --method put --uri "/subscriptions/${{ secrets.ARM_SUBSCRIPTION_ID }}/resourceGroups/${{ secrets.AKS_RESOURCE_GROUP }}/providers/Microsoft.Chaos/experiments/${experimentName}?api-version=${{ secrets.API_VERSION }}" --headers "$headers" --body @"${{ github.workspace }}/json/${experimentName}.json")
              echo "Response: $response"
            done

        - name: Get Principal IDs
          id: get_principal_ids
          run: |
            # Define the experiment names
            experimentNames=("PODCHAOS-2.1" "DNSCHAOS-2.1" "HTTPCHAOS-2.1" "KERNELCHAOS-2.1" "TIMECHAOS-2.1" "IOCHAOS-2.1" "STRESSCHAOS-2.1" "NETWORKCHAOS-2.1")
            principal_ids=""
            for experiment_name in "${experimentNames[@]}"; do
              echo "Processing experiment: $experiment_name"
              # The URI is written as an ARM resource ID (leading /subscriptions/), which az rest resolves against the ARM endpoint
              api_url="/subscriptions/${{ secrets.ARM_SUBSCRIPTION_ID }}/resourceGroups/${{ secrets.AKS_RESOURCE_GROUP }}/providers/Microsoft.Chaos/experiments/$experiment_name?api-version=2024-01-01"
              echo "API URL: $api_url"
              experiment_response=$(az rest --method get --uri "$api_url")
              echo "Response for $experiment_name: $experiment_response"
              principal_id=$(echo "$experiment_response" | jq -r '.identity.principalId')
              echo "Principal ID for $experiment_name: $principal_id"
              # Collect the IDs as a comma-separated list
              principal_ids+="$principal_id,"
            done
            principal_ids="${principal_ids%,}" # Remove trailing comma
            echo "principal_ids=$principal_ids" >> "$GITHUB_ENV"
            # Expose the list as a step output (replaces the deprecated ::set-output command)
            echo "principal_ids=$principal_ids" >> "$GITHUB_OUTPUT"

        - name: Add Principals to AD Group and Assign AKS Cluster Admin Role
          run: |
            IFS=',' read -ra IDS <<< "${{ steps.get_principal_ids.outputs.principal_ids }}"
            for id in "${IDS[@]}"; do
              # Check if the principal is already a member of the AD group
              group_member_check=$(az ad group member check --group "${{ secrets.AKS_AD_GROUP }}" --member-id "$id" --query 'value' -o tsv)
              if [ "$group_member_check" == "false" ]; then
                az ad group member add --group "${{ secrets.AKS_AD_GROUP }}" --member-id "$id"
              else
                echo "Principal $id is already a member of the AD group."
              fi

              # Check if the principal already has the AKS Cluster Admin role
              role_assignment_check=$(az role assignment list --assignee "$id" --role "Azure Kubernetes Service Cluster Admin Role" --scope "/subscriptions/${{ secrets.ARM_SUBSCRIPTION_ID }}/resourceGroups/${{ secrets.AKS_RESOURCE_GROUP }}/providers/Microsoft.ContainerService/managedClusters/${{ secrets.AKS_NAME }}" --query 'length(@)' -o tsv)
              if [ "$role_assignment_check" -eq 0 ]; then
                # Assign AKS Cluster Admin role
                az role assignment create \
                  --assignee-object-id "$id" \
                  --role "Azure Kubernetes Service Cluster Admin Role" \
                  --scope "/subscriptions/${{ secrets.ARM_SUBSCRIPTION_ID }}/resourceGroups/${{ secrets.AKS_RESOURCE_GROUP }}/providers/Microsoft.ContainerService/managedClusters/${{ secrets.AKS_NAME }}"
              else
                echo "Principal $id already has the AKS Cluster Admin role assigned."
              fi
            done



Automating Chaos Studio JSON Templates with GitHub Actions and Terraform


The JSON configuration provided (also see Azure Chaos Studio fault and action library) defines a detailed chaos experiment setup intended for deployment within an AKS (Azure Kubernetes Service) cluster. This configuration, which is stored in a separate root GitHub folder named json, is utilized by the GitHub Action workflows to orchestrate chaos engineering experiments using Chaos Mesh. By keeping these JSON configurations organized in a dedicated folder, the workflows can easily reference and apply them during deployment, ensuring a structured and maintainable approach to chaos testing.


The JSON file specifies the location of the experiment (eastus) and sets up a system-assigned identity for the resources. Within the properties section, the experiment steps are outlined, beginning with "Step 1." This step includes a single branch ("Branch 1") that defines a continuous action targeting all pods within the "azure-vote" namespace. The action is configured to simulate pod failures for a duration of five minutes, utilizing a specific Chaos Mesh capability (podChaos/2.1). The JSON configuration also defines a selector ("Selector1") that identifies the specific AKS cluster targeted by the experiment. This setup ensures that the chaos experiment is precisely targeted and executed within the intended cluster, helping to test the resilience and fault tolerance of the applications running in the "azure-vote" namespace.


By integrating these JSON configurations into the GitHub Action workflows, the automation process becomes seamless. The workflows dynamically replace placeholder values (SUBSCRIPTION_ID_PLACEHOLDER, RESOURCE_GROUP_PLACEHOLDER, and AKS_NAME_PLACEHOLDER) with actual values during execution. This dynamic replacement allows for flexibility and reusability of the JSON configurations across different environments and clusters. The structured approach of keeping these configurations in a dedicated folder and calling them within the GitHub Action workflows ensures a streamlined and efficient process for deploying and managing chaos experiments, ultimately contributing to the robustness and reliability of the AKS-deployed applications.
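The placeholder replacement described above can be tried locally with a few lines of Bash. The sketch below is an assumption-laden illustration rather than the workflow's own code: the function name `render_template` and the sample values are made up, and it prints the rendered JSON instead of editing the file in place as the workflow's `sed -i` does.

```shell
#!/usr/bin/env bash
set -euo pipefail

# render_template TEMPLATE SUBSCRIPTION_ID RESOURCE_GROUP AKS_NAME
# Hypothetical helper: prints TEMPLATE with the three placeholders replaced.
# The template file itself is left unchanged so it can be reused.
# Note: the values must not contain "/" or "&", which are special to sed here.
render_template() {
  local template="$1" subscription_id="$2" resource_group="$3" aks_name="$4"
  sed -e "s/SUBSCRIPTION_ID_PLACEHOLDER/${subscription_id}/g" \
      -e "s/RESOURCE_GROUP_PLACEHOLDER/${resource_group}/g" \
      -e "s/AKS_NAME_PLACEHOLDER/${aks_name}/g" \
      "$template"
}

# Demo with made-up values; in the workflow these come from GitHub secrets.
demo_template="$(mktemp)"
echo '{"id": "/subscriptions/SUBSCRIPTION_ID_PLACEHOLDER/resourceGroups/RESOURCE_GROUP_PLACEHOLDER/providers/Microsoft.ContainerService/managedClusters/AKS_NAME_PLACEHOLDER"}' > "$demo_template"
render_template "$demo_template" "00000000-0000-0000-0000-000000000000" "demo-rg" "demo-aks"
```

Rendering to standard output rather than in place makes it possible to run the same template against several clusters in one job, at the cost of having to redirect the result to a file before passing it to `az rest`.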



{
  "location": "eastus",
  "identity": {
    "type": "SystemAssigned"
  },
  "properties": {
    "steps": [
      {
        "name": "Step 1",
        "branches": [
          {
            "name": "Branch 1",
            "actions": [
              {
                "type": "continuous",
                "selectorId": "Selector1",
                "duration": "PT5M",
                "parameters": [
                  {
                    "key": "jsonSpec",
                    "value": "{\"action\":\"pod-failure\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"azure-vote\"]}}"
                  }
                ],
                "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.1"
              }
            ]
          }
        ]
      }
    ],
    "selectors": [
      {
        "id": "Selector1",
        "type": "List",
        "targets": [
          {
            "type": "ChaosTarget",
            "id": "/subscriptions/SUBSCRIPTION_ID_PLACEHOLDER/resourceGroups/RESOURCE_GROUP_PLACEHOLDER/providers/Microsoft.ContainerService/managedClusters/AKS_NAME_PLACEHOLDER/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh"
          }
        ]
      }
    ]
  }
}





We covered several aspects of automating and managing AKS (Azure Kubernetes Service) clusters and chaos engineering experiments using Terraform and GitHub Actions. We started by detailing the Terraform code used to deploy an AKS cluster, highlighting the configuration of various components such as agent nodes, network settings, security policies, and integrations with Azure services. This automation not only ensures a consistent deployment process but also leverages the power of infrastructure as code to manage complex cloud resources efficiently.


We then explored a GitHub Action workflow designed to automate the deployment and management of Chaos Mesh experiments and the Azure Vote service. This workflow uses triggers based on code changes and manual inputs to execute specific tasks, such as deploying, uninstalling, or running chaos experiments within the AKS cluster. By integrating Azure credentials and Kubernetes configurations, the workflow streamlines the process of setting up and managing these experiments, ensuring that they are applied accurately and securely.


Additionally, we delved into the JSON configurations used for chaos experiments, stored in a dedicated GitHub folder and referenced within the GitHub Action workflows. These configurations define detailed chaos experiment steps and selectors, targeting specific resources within the AKS cluster to simulate various fault scenarios. By organizing these configurations and automating their deployment, we enhance the resilience and fault tolerance of applications running in the cloud.


Together, these discussions illustrate a robust approach to managing cloud infrastructure and testing application resilience through automation and chaos engineering. Utilizing Terraform for infrastructure deployment and GitHub Actions for orchestration and management allows for a streamlined, efficient, and consistent process, ultimately contributing to more reliable and resilient cloud-native applications.


Here are some helpful links from Microsoft Learn that relate to the topics we discussed today:

Last update: May 17 2024 09:51 AM