Azure Infrastructure Blog articles

Golden Image Refresh for Virtual Machines and VM Scale Sets: Driving Consistency at Scale

ranjsharma — Wed, 20 May 2026 02:56:39 GMT

Overview:

A golden image is a prebuilt, approved system template that represents the ideal baseline for deployment. It includes:

Hardened operating system configuration (e.g., RHEL)
Preinstalled software and dependencies
Security patches and updates
Organizational compliance standards

Architecture:

Golden Image Refresh for VM Scale Sets (VMSS):

Instead of updating instances individually:

A new image version is published
The VMSS is updated to reference the new image
Instances are gradually replaced through a controlled rollout
New instances (based on updated image) are introduced
Traffic is gradually shifted to these new instances
Old instances are decommissioned in phases
Minimizes service disruption
Enables safe rollout of updated environments
Allows real-time validation of new image versions

Virtual Machine Scale Set (VMSS) deployments use a custom image that is baked on top of a Golden Image.
The Golden Image version is pinned in the environment-specific Packer variables (pkrvariables) files.
Refreshing a VMSS Golden Image involves baking a new custom image using an updated Golden Image version and deploying it via the VMSS pipelines.

Image Dependency Flow

Golden Image
- Published and versioned by the Golden Image Team.
- Source OS image, pinned in pkrvariables per environment.
Custom Image
- Created by the custom image pipeline.
- Built on top of the pinned Golden Image.
- Used by VMSS deployments.
VMSS Deployment
- Deploys or updates scale sets using the selected custom image version.

Golden Image Version Management (VMSS)

Each environment pins the Golden Image version in its respective pkrvariables file.
Golden Image versions are selected from the same Golden Image Galleries:
- Dev
- PPR
- Prod
No automatic upgrades occur; changes are explicit and controlled via Git.

VMSS Golden Image Refresh Procedure

1.Select Golden Image Version

Navigate to the appropriate Golden Image Gallery for the target environment.
Identify the Golden Image version to be used for the refresh.

2. Update Packer Variables

Create a feature branch.
Update the pinned Golden Image version in the environment-specific pkrvariables file.

3.Merge Changes

Raise a Merge Request (MR) for the updated version.
After approval, merge the MR into the target branch.

Custom Image Creation

Trigger the custom image pipeline.
This pipeline:
- Uses the updated Golden Image version
- Bakes a new custom image
- Publishes a new custom image version for VMSS consumption

VMSS Deployment

Once the custom image is successfully created, deploy it using one of the following approaches:

Option 1: Operational Pipeline

Use the operational pipeline to deploy the newly created custom image to the VMSS. Operational Pipeline is separate pipeline which will refresh the image.

option 2: Infrastructure Pipeline Update

Update the infrastructure (Terraform) pipeline code with the new custom image version.
Run:
- terraform plan to review VMSS updates
- terraform apply to roll out the new image

Terraform Behavior

VMSS instances are updated to use the newly created custom image.
The same remote Terraform backend is used to preserve state consistency.

Validation and Verification

After deployment:

Validate VMSS instance health
Confirm successful instance provisioning
Verify application and service functionality
Monitor scale set upgrade status and error metrics

Image Team will provide the golden image and then we need to create custom image.

After retrieval of Custom image used in Infra code.

The Golden image refresh in infra code, requires a activity which is called upgrade and there are 2 kinds of upgrade in VMSS :

Automatic upgrade - VMSS instances will upgrade automatically, and this requires downtime.

All VMSS instances will start upgrading simultaneously and application will be down till VMSS instances is up and running.

Manual upgrade - VMSS instances need to be manually upgraded, and this requires 10 - 15 minutes of degradation.

As part of this Upgrade - we need to manually upgrade VMSS instance one by one and so other instances will be up. There will be no downtime for the application.

Bydefault VMSS will consider automatic upgrade which requires downtime. If we do not require Automatic upgrade then we need to change the setting in provider like below.

provider "azurerm" {
features {
virtual_machine_scale_set {
reimage_on_manual_upgrade = false
roll_instances_when_required = false
}
}

After updating above code in provider.tf as part of manual upgrade then update the terraform code for new golden image.

Create a New Image: Start by creating a new golden image with the latest updates and configurations using YAML pipeline.
Update Terraform Configuration: Modify your Terraform configuration to reference the new image. This involves updating the source_image_id or image_reference in your azurerm_virtual_machine_scale_set resource to point to the new image version.

Example:

source_image_id = "/subscriptions/subscriptionid/resourceGroups/rgname/providers/Microsoft.Compute/images/confluence-prd-v-24052450"

data_disks = [
{
storage_account_type = "Premium_LRS"
caching = "ReadWrite"
create_option = "FromImage"
lun = 0
disk_size_gb = "500"
disk_encryption_set_id = null
ultra_ssd_disk_iops_read_write = null
ultra_ssd_disk_mbps_read_write = null
}
]
instances = 3
automatic_instance_repair = [{
enabled = false
grace_period = "PT30M"
}]
computer_name_prefix = "Appname-prd"
overprovision = false
edge_zone = null
health_probe_id = null
upgrade_mode = "Manual"
single_placement_group = true
secure_boot_enabled = false

Apply Terraform Configuration: Run terraform apply to apply the updated configuration. This will update the scale set to use the new image.

After the Apply - Upgrade type is Manual then upgrade the VMSS instances one by one to make the service up and running.

Golden Image Refresh for VM Scale Sets (VM)

Scope

Linux VMs:
- VMs use RHEL 7.9
- VMs use RHEL 8.10

For RHEL 7.9 , there is no golden image hence needs to create custom image. To refresh the image, change the image from (example from 1.0 to 1.1)

Resource Changes (VMTRF):

VM: will be replaced (source_image_id changed)
OS disk: azapi_update_resource.disk ⇒ replaced
Data disk attachments: will be replaced
Network interface: updated in-place
Disk encryption set: updated in-place
Role assignments: will be replaced
VM extension (Custom Script Extension): will be replaced

RHEL 8.10

To refresh the image, change the image from (example from 1.0 to 1.1)

Resource Changes (VM STD):

VM: will be replaced (source_image_id changed)
OS disk: azapi_update_resource.disk ⇒ replaced
Data disk attachments: will be replaced
Network interface: updated in-place
Disk encryption set: updated in-place
Role assignments: will be replaced
VM extension (Custom Script): will be replaced

Building AI Guardian Extension: AI Detection and Enterprise AI Security

ranjsharma — Tue, 19 May 2026 11:41:31 GMT

Introduction

Generative AI tools such as ChatGPT, GitHub Copilot, and Google Gemini are rapidly becoming part of everyday enterprise workflows. Teams use them for code generation, documentation, analysis, support automation, and productivity enhancement.

However, this accelerated adoption has also created a significant governance and security challenge: Shadow AI.

Shadow AI refers to the unauthorized, unmanaged, or unmonitored use of AI tools inside an organization. Employees may unknowingly paste sensitive enterprise information into external AI platforms, exposing:

API keys
Source code
Customer data
Credentials
Internal business documents

At the same time, enterprise AI usage is increasingly exposed to:

Prompt injection attacks
Malicious API manipulation
Unsafe model outputs
Compliance violations

Security and compliance teams currently lack centralized visibility and governance over enterprise AI usage.

Existing tools are fragmented and do not provide unified protection across the complete AI attack surface including Data, Prompt/API, and Model layers.

Organizations require an intelligent, autonomous platform capable of

detecting Shadow AI usage,
preventing sensitive data leakage,
securing AI interactions,
enforcing governance policies,
and maintaining compliance in real time.

Proposed Solution:

AI Guardian is an intelligent security and governance platform designed to secure enterprise AI adoption and mitigate Shadow AI risks.

1) The platform continuously monitors AI interactions across enterprise environments and provides autonomous protection across multiple AI attack surfaces.

Core Capabilities
Shadow AI Detection
Detects unauthorized AI tool usage
Monitors risky AI interactions
Identifies sensitive data exposure
Multi-Layer AI Security
Data Layer Protection
PII detection
API key and credential scanning
Confidential data leakage prevention
Prompt/API Layer Protection
Prompt injection detection
Malicious prompt analysis
API abuse detection
Model Layer Protection
Unsafe response monitoring
AI Compliance Copilot
Generates governance reports
Recommends remediation actions

2.AI Guardian Extension Automatically performs:
prompt blocking,
redaction,
SOC alerting,
incident creation,
and compliance logging.

3.AI Guardian Extension Business Value:

This AI Guardian delivers measurable business value by enabling secure and governed enterprise AI adoption.
4.Key Business Benefits
1. Prevents Sensitive Data Leakage
2. Enables Safe Enterprise AI Adoption
3. Reduces Compliance Risks and helps align enterprise AI usage with:
SOC2
ISO27001
GDPR
Internal security policies

5. Improves Security Visibility and provides centralized visibility into:

AI usage patterns
risky prompts
Shadow AI activity
policy violations

6. Strengthens Enterprise AI Security Posture and protects multiple AI attack surfaces including:

Data Layer
Prompt/API Layer
Model Layer

Customer Involvement

No Customer Involvement added.

Scenario: Employee pastes confidential code into ChatGPT

User opens ChatGPT in the browser
AI Guardian Extension detects site access
User pastes source code containing an API key
Content script captures prompt text
Sensitive data detector finds the API key
Policy engine classifies the action as high risk
Extension redacts the key and blocks original submission
User sees a notification explaining the action
Event is logged to AI Guardian backend
SOC alert and compliance log are generated if required

The rise of Shadow AI means enterprises can no longer rely solely on backend monitoring or post-event analysis. Security controls must move closer to the user interaction point, where prompts are created and data is shared.

Building an AI Guardian Browser Extension provides that control plane.

It enables organizations to:

detect unauthorized AI usage
inspect prompts in real time
prevent sensitive data leakage
block malicious interactions
enforce governance policies
generate audit-ready logs

In a world where AI adoption is accelerating faster than governance, the AI Guardian Extension becomes a practical and scalable way to make enterprise AI usage secure, visible, and compliant.

Rundeck – AWS Enterprise Rundeck Integration with Azure Runner

ranjsharma — Tue, 19 May 2026 10:08:59 GMT

Architecture Overview

Rundeck Server (AWS) → https://dev.rundeck.xyz.com
Rundeck Runner (Azure Linux VM)
Secure Communication over HTTPS (Port 4432)
Optional Proxy for enterprise networks

1.Ensure that network connectivity is established between the Rundeck endpoint (dev.rundeck.xyz.com) and the Azure subnet over port 443.

2.Request the customer to create a Runner within a new or existing project in the Rundeck portal by providing a suitable name and tags and then proceed to download the corresponding Runner JAR file.

3.Provision an Azure Linux Virtual Machine and deploy the downloaded Runner JAR using a VM extension or through a CI/CD pipeline integrated with BAMS/Artifactory. Below all prerequisite steps are mentioned which needs to be run on Azure VM to ensure Rundeck runner is ready.

4.Additionally, make sure all required prerequisites are installed on the Azure Linux VM that will host the Rundeck Runner.

5.After installation it will establish the connectivity between azure Rundeck runner and Aws Enterprise Rundeck.

6.We can trigger any Rundeck jobs on Azure virtual machines.

By deploying the Rundeck Runner on Azure, enterprises can seamlessly bridge AWS-hosted orchestration with Azure-based execution environments. This setup enables robust, scalable, and secure automation across hybrid cloud ecosystems.

All prerequiste needs to be run on the Azure vm.

Step 1: Network Connectivity Prerequisites:

Ensure proper network connectivity between Rundeck and Azure:

Allow outbound/inbound access: Source: Azure Subnet Destination: dev.rundeck.xyz.com Port: 443 Protocol: HTTP

Step 2: Create Rundeck Runner

Navigate to a project → Runner Management

Create a new runner:

Provide Name

Add Tags

Download the Runner JAR file

Step 3: Azure VM setup

Create a Linux VM (RHEL/CentOS) in Azure.

Deploy the runner jar via:

CI/CD pipeline (BAMS/Artifactory)

VM Extension for automation

Step 4: Install Prerequisites on Azure VM

Install Java 11

sudo bash

yum install java-11-openjdk.x86_64

rpm -qa | grep java-11

java --version

Step 5: Start the Runner -

useradd rundeck

Step 6: Create Rundeck User-

mkdir /opt/rundeck/

chown -R rundeck:rundeck /opt/rundeck/

cp runner-*.jar /opt/rundeck/

ll /opt/rundeck/

Step 7: Start the Runner -

without proxy

/bin/java -jar /opt/rundeck/runner-<id>.jar

With proxy:

/bin/java \

-Dmicronaut.http.client.proxy-type=http \

-Dmicronaut.http.client.proxy-address=webproxy.lo5.mgmt.services:80 \

-jar /opt/rundeck/runner-<id>.jar

Step 8: Configure Runner as a Service

vi /etc/systemd/system/runner.service

Add Configuration

[Unit]

Description=Process Automation Runner

[Service]

WorkingDirectory=/opt/rundeck/

Type=simple

User=rundeck

Group=rundeck

ExecStart=/bin/java -jar /opt/rundeck/runner-<id>.jar

Restart=on-failure

[Install]

WantedBy=multi-user.target

Enable and start service:

chmod 0640 /etc/systemd/system/runner.service

systemctl daemon-reload

systemctl enable runner.service

systemctl start runner.service

systemctl status runner.service

Step 9: Verify Runner Process

ps aux | grep rundeck

systemctl status runner.service

Step 10: Configure Log Rotation

create configuration:

vi /etc/logrotate.d/rundeck_runnerd

/opt/rundeck/runner/logs/operations.log

{

daily

missingok

rotate 8

compress

copytruncate

maxsize 150M

create 644 root root

}

Step 11: Runner Upgrade Process

Regenerate credentials:

curl -k -X POST https://dev.rundeck.xyz.com/api/42/runnerManagement/runner/<runner-id>/regenerateCreds \

--header "Content-Type: application/json" \

--header "X-Rundeck-Auth-Token: <token>"

Step 12: Download Updated runner

curl -k -X GET https://dev.rundeck.xyz.com/api/41/runnerManagement/download/<downloadTk> \

--header "X-Rundeck-Auth-Token: <token>" \

--output runner-${runner_id}.jar

Building a Terraform Drift Validator for Azure with Live Portal Verification

ranjsharma — Tue, 19 May 2026 06:18:13 GMT

Architecture:

This blog describes how to build a practical Terraform Drift Validator for Azure that compares three sources of truth:

Excel sheet or design document containing expected Azure configuration
Terraform state file representing IaC-managed deployed intent
Live Azure configuration, verified both programmatically and through Azure Portal step-by-step checks

The solution can be exposed as a lightweight validation application. Below is the link of agent created for drift validator for infra.

https://ca-aiv-agent.livelyhill-f6d3be20.eastus.azurecontainerapps.io/

Key Takeaways

Terraform drift detection is valuable but Terraform alone is not enough when enterprises also rely on design documents, migration inventories, and operational portal validations.
Azure attributes such as SKU, tags, accelerated networking, managed disk type, and zone placement can all be validated using a mix of Terraform state parsing, Azure APIs/Resource Graph, and portal verification.
Azure Resource Graph is especially useful for fast, large-scale live validation because it can query resource properties across subscriptions without calling every resource provider one by one.
Managed Identity is the preferred enterprise authentication model for a read-only validator because it removes credential handling and supports token-based access to downstream Azure services.
Manual Azure Portal verification with documented steps and screenshot placeholders makes the solution feel audit-ready, migration-ready, and operationally trustworthy.

1. Introduction

Terraform drift in Azure happens when what is actually deployed no longer matches what Terraform thinks exists or what the original design intended. In practice, drift appears after portal edits, partial manual changes, external scripts, policy remediation, or emergency operational actions that never get folded back into code. HashiCorp recommends automated drift detection because unmanaged divergence can create operational inconsistency, security exposure, and compliance risk.

Azure environments make this especially important because infrastructure is often governed by multiple teams: architects define the target design, DevOps teams deploy through Terraform, and operations teams may validate or troubleshoot through Azure Portal. Azure Resource Graph itself exists to support large-scale governance and current-state visibility across subscriptions, which makes it a strong foundation for live verification.

That is why an enterprise-grade drift validator should not compare only Terraform vs Azure. It should compare:

Design intent from Excel or architecture documentation
Terraform state as the IaC-managed representation
Azure live configuration as the real runtime state

2. Problem Statement

Many organizations still maintain infrastructure specifications in Excel, migration trackers, or design documents. Those documents often contain the details that matter most to governance and operations: VM sizes, storage SKUs, disk performance expectations, tags, zones, and network settings. Terraform state, on the other hand, reflects what Terraform knows about deployed resources. Azure live state reflects what is actually running.

These three sources diverge for common reasons:

A VM was resized manually in Azure Portal
Accelerated networking was expected in design, defined in Terraform, but the actual NIC does not reflect it
Tags were defined in a workbook but never applied consistently
Storage or disk settings changed during production troubleshooting
Zone placement differs from the architecture baseline

3. Solution Overview

The proposed solution is a Terraform Drift Validator for Azure with Live Portal Verification. At a high level, it works like this:

The user uploads an Excel sheet or design document
The solution extracts expected resource configuration
It ingests the Terraform state file
It queries Azure live resources
It normalizes values across all three sources
It compares attributes and generates a structured drift report
It optionally presents portal validation steps for manual verification

This can run as a web application, API service, or agent-based validator, exposed through:

https://ca-aiv-agent.livelyhill-f6d3be20.eastus.azurecontainerapps.io/

4. Supported Validation Parameters

A useful validator should support the following checks:

SKU
Validate expected SKU from design against Terraform and live Azure. This is critical for cost and performance.
IOPS
Validate disk performance expectations. Azure documents that some disk types—particularly Ultra Disk and Premium SSD v2—allow direct performance tuning, making IOPS a meaningful validation parameter. [
Accelerated Networking
Azure states that accelerated networking improves VM network performance by reducing latency and CPU utilization via SR-IOV on supported VM sizes. This is exactly the kind of feature that is frequently missed or changed during deployments.
Tags
Azure tags are key-value metadata used for governance, organization, and cost management. Azure also warns that tags are stored as plain text and should not contain sensitive data.
Availability Zones
Validate whether resources are placed in the expected zone(s) or zone-resilient configuration.
Region
VM Size
Disk Type
Resource Group
NIC and network configuration
Optional controls such as NSG expectations, public/private exposure, backup status, or monitoring configuration

a { text-decoration: none; color: #464feb; } tr th, tr td { border: 1px solid #e6e6e6; } tr th { background-color: #f5f5f5; }

7. Workflow

User uploads an Excel sheet or design document
System extracts expected fields such as resource type, name, SKU, IOPS, tags, zones, region, VM size, and network settings
Terraform state file is parsed into resource objects
Azure live resources are queried using Resource Graph and targeted API calls
Attribute names and values are normalized
Comparison engine calculates:
- Match
- Design vs Terraform drift
- Terraform vs Azure drift
- Design vs Azure drift
Final report is generated with severity and remediation recommendations

8. Drift Report Output

A report should look like this:

Resource	Attribute	Expected	Terraform	Azure Live	Status	Severity
vm-prod-01	VM Size	Standard_D8s_v5	Standard_D8s_v5	Standard_D4s_v5	Drift	High
nic-prod-01	Accelerated Networking	Enabled	Enabled	Disabled	Drift	High
disk-prod-01	Disk SKU	Premium SSD	Premium_LRS	StandardSSD_LRS	Drift	High
vm-prod-01	Tags.Environment	Production	Production	Prod	Drift	Medium
vm-prod-01	Zone	1	1

9. Creating agents for same and uploading entire python code on the chrysalis

10. Future Enhancements

Next-step enhancements could include:

Scheduled drift monitoring
CI/CD integration after deployment
ServiceNow or Teams notifications
AI-generated remediation summaries
Policy-aware scoring
Compliance dashboards
FinOps insights tied to SKU variance
Historical trend tracking of drift across subscriptions

Modernizing TCP Applications with Azure Application Gateway Layer 4 TCP/TLS Proxy

rbhatia — Thu, 14 May 2026 18:51:33 GMT

Why TCP/TLS Proxy Matters

Modern cloud architectures commonly focus on HTTP/HTTPS traffic management, but many enterprise systems still rely on:

Proprietary TCP protocols
Financial transaction systems
Messaging platforms
Legacy middleware applications
Secure client-server communication

Traditionally, these workloads often required:

Network Virtual Appliances (NVAs)
Hardware load balancers
Custom reverse proxy solutions
Dedicated TCP ingress infrastructure

Managing these components across large environments can increase operational complexity and infrastructure maintenance overhead.

The Layer 4 proxy capability in Azure Application Gateway helps organizations standardize ingress management for both HTTP and non-HTTP workloads using Azure-native services.

Understanding Layer 4 TCP/TLS Proxy

Layer 7 vs Layer 4 Traffic Management

Layer 7 (HTTP/HTTPS)

Layer 7 routing focuses on application-aware traffic handling such as:

URL-based routing
Header inspection
Cookie affinity
Web Application Firewall policies

Layer 4 (TCP/TLS)

Layer 4 proxy focuses on connection-level traffic handling including:

TCP traffic forwarding
TLS traffic pass-through
Port-based routing
Backend load distribution

This approach is useful for applications that do not use HTTP protocols but still require centralized ingress architecture.

Key Feature Capabilities

TCP and TLS Traffic Support

The Layer 4 proxy capability supports:

TCP listeners
TLS listeners
Secure traffic forwarding
Backend connection management

This enables organizations to expose non-HTTP workloads through a centralized ingress layer.

TLS Pass-Through Support

In TLS pass-through scenarios, encrypted traffic remains encrypted between the client and backend application.

Potential advantages include:

End-to-end encryption support
Backend-managed certificate ownership
Reduced application-layer processing at ingress

This model can be useful for applications with strict encryption or compliance requirements.

Proxy Protocol v1 Support

One important capability available in TCP/TLS backend settings is support for Proxy Protocol v1.

Proxy Protocol v1 helps pass original client connection information to backend applications, including:

Source IP address
Destination IP address
Source port
Destination port

This capability can be valuable for:

Backend logging and auditing
Security analysis
Connection tracing
Applications requiring client IP visibility

Without Proxy Protocol support, backend applications may only see the Application Gateway frontend IP rather than the original client source.

When enabling Proxy Protocol v1, backend applications must also support parsing the Proxy Protocol header. Organizations should validate application compatibility before enabling this setting in production environments.

Backend Pool Integration

Layer 4 proxy supports backend pool integration with:

Virtual machines
Virtual machine scale sets
IP-based backends
Kubernetes workloads hosted on Azure Kubernetes Service

This flexibility allows organizations to standardize ingress architecture across different workload types.

Common Enterprise Use Cases

Legacy Application Modernization

Organizations migrating traditional applications to Azure may need TCP ingress without redesigning application communication protocols.

Kubernetes TCP Workloads

Applications running on Azure Kubernetes Service frequently expose TCP services such as:

Messaging brokers
Database endpoints
Streaming services
Proprietary application protocols

Layer 4 proxy can help centralize ingress management for these workloads.

Secure TLS Pass-Through

Some enterprise applications require end-to-end encryption where TLS termination remains on backend services rather than the ingress layer.

Hybrid Connectivity Patterns

Enterprises integrating on-premises applications with Azure workloads may also benefit from centralized TCP traffic management.

Architecture Pattern

A typical architecture pattern includes:

Client application initiates TCP/TLS connection
Azure Application Gateway receives inbound traffic
Layer 4 listener forwards traffic to backend pool
Backend applications process TCP traffic
Traffic routing is managed based on backend availability

Core Azure services commonly involved:

Azure Application Gateway
Azure Kubernetes Service
Azure Virtual Network

Benefits of Azure-Native TCP Ingress

Potential advantages of using Azure-native Layer 4 ingress include:

Area	Potential Benefit
Operations	Reduced infrastructure management overhead
Scalability	Managed platform scaling capabilities
Architecture	Centralized ingress management
Integration	Native Azure networking compatibility
Availability	Support for resilient deployment patterns

Key Recommendations

When implementing Layer 4 TCP/TLS proxy:

Validate backend application compatibility with Proxy Protocol v1 if enabled
Monitor long-lived TCP connections
Test backend scaling scenarios
Validate TLS handling requirements before deployment
Align ingress architecture with application connectivity requirements

For enterprise deployments, organizations should also evaluate:

Disaster recovery requirements
Capacity planning
Operational support models

From Pipelines to Agents: Self-Healing CI/CD Workflow

RavinderGupta — Wed, 13 May 2026 20:21:28 GMT

The Brain of the Operation: Azure OpenAI.

When building a DevOps agent, following are the points which can be considered to select Azure OpenAI as the ideal choice for logical engine:

Native Tool Use: It is specifically optimized for function calling, allowing the agent to interact with Azure DevOps APIs and Github seamlessly.
Cost Efficiency: As a first-party service, Azure OpenAI is the most cost-effective way to run production-grade agents.
Speed and Context: GPT-4o processes complex logs in seconds, identifying the error much faster.

The Architecture: A Self-Healing Loop

A self-healing workflow is an agentic loop consisting of three phases: Observe, Analyze, and Act.

1. Observe (The Trigger)

The process begins with an event-driven trigger. When an Azure DevOps pipeline fails, a webhook sends the telemetry and build logs to an Azure Function.

2. Analyze (The Reasoning)

The logs are passed to GPT-4o via the Microsoft AI Foundry endpoint. The model doesn't just look for error codes; it understands the infrastructure context.

The Prompt:

"You are a DevOps Engineer. Analyze this build log from our Azure Internal Load Balancer deployment. Determine if the failure is a logic error in Terraform or a connectivity issue in the VNET. Suggest the exact code fix in JSON format."

3. Act (The Execution)

This is where the agent becomes "autonomous." Using function calling, the agent can take the following action as an example:

The Action: If GPT-4o identifies a missing Health Probe in your ILB config, it invokes a tool to checkout the code branch, apply the fix, and open a Pull Request (PR) for your approval.

Technical Implementation: Unified Inference

Microsoft AI Foundry provides a standardized way to call Azure OpenAI. This makes the agent code clean and portable:

from azure.ai.inference import ChatCompletionsClient from azure.core.credentials import AzureKeyCredential # Initialize the Foundry Client for GPT-4o client = ChatCompletionsClient( endpoint="https://your-gpt4o-deployment.eastus2.models.ai.azure.com", credential=AzureKeyCredential("YOUR_FOUNDRY_API_KEY") ) def self_heal_pipeline(error_logs): response = client.complete( messages=[ {"role": "system", "content": "You are an autonomous DevOps assistant."}, {"role": "user", "content": f"Analyze and propose a fix for this log: {error_logs}"} ], model="gpt-4o" ) # Logic to trigger a GitHub PR or an Azure DevOps Update return response.choices[0].message.content

Practical Example: The Migration Headache

In our example we had a task of mapping legacy load balancer settings (like fastest-app-response or source-address persistence) to Azure ILB rules.

One small typo in a backend pool member IPs can tank a deployment. We have tested our agent to now scan these configs, flags mismatches, and suggests the correct Azure-native equivalent before the pipeline even runs. It’s saved us days of "trial and error" debugging.

Final Thoughts: Stability on Autopilot

We’ve spent years trying to build the "perfect" pipeline, but the reality is that infrastructure is messy and code is human. By shifting the burden of initial troubleshooting to automated agents, we aren't just saving time; we’re increasing the reliability of our entire stack. Microsoft AI Foundry provides the secure sandbox we need to let these agents work safely.

Azure Arc AKS Explained: Run Kubernetes Beyond Azure Cloud

mohit-kanojia — Tue, 12 May 2026 02:57:30 GMT

Modern enterprises are no longer running workloads only inside a centralized cloud environment. Applications today operate across:

On-premises datacenters
Remote branch offices
Manufacturing plants
Retail stores
Edge locations
Hybrid infrastructure

While Kubernetes has become the standard for container orchestration, managing Kubernetes consistently across distributed environments introduces operational complexity.

This is where Azure Arc and Azure Kubernetes Service extend the Azure control plane beyond traditional Azure cloud boundaries.

Azure Arc enables organizations to deploy, govern, monitor, and manage Kubernetes clusters running:

On-premises
At the edge
In multicloud environments
On virtualization platforms
On physical infrastructure

In this guide, we will explore:

What Azure Arc AKS is
How the architecture works
Core infrastructure components
Step-by-step deployment flow
Networking considerations
Operational insights
Common challenges and troubleshooting approaches

Understanding the Problem Azure Arc Solves

Traditionally, Kubernetes management becomes fragmented when infrastructure exists outside public cloud environments.

Organizations often face:

Separate tooling for on-prem and cloud clusters
Inconsistent governance
Manual onboarding of clusters
Complex identity management
Disconnected monitoring and policy enforcement
Operational overhead at edge locations

Azure Arc addresses this by extending Azure management capabilities to infrastructure running anywhere.

Instead of moving all infrastructure into Azure, Azure Arc brings Azure’s operational model to your existing infrastructure.

What is Azure Arc AKS?

Azure Arc-enabled Kubernetes allows Kubernetes clusters running outside Azure to become manageable resources inside Azure.

This means:

Clusters appear inside Azure Portal
Azure RBAC can be applied
Policies can be enforced centrally
Monitoring and governance become standardized
GitOps and extensions can be deployed consistently

AKS Arc extends this further by enabling an AKS-like Kubernetes deployment and lifecycle management experience on local or edge infrastructure.

High-Level Architecture

The deployment architecture typically follows this flow:

At a high level:

Layer	Purpose
Infrastructure Layer	Physical server or virtual machine
Connectivity Layer	Azure Arc agents and registration
Kubernetes Layer	Kubernetes runtime and orchestration
Azure Integration Layer	Governance, monitoring, policies
Operations Layer	Cluster lifecycle and workload management

Core Components of AKS Arc

Before deployment, it is important to understand the major components involved.

1. Azure Arc

Azure Arc acts as the bridge between Azure and external infrastructure.

It enables:

Resource registration
Hybrid governance
Policy enforcement
Monitoring
Extension deployment
Inventory management

2. Arc-Enabled Machines

These are:

Physical servers
Virtual machines
Edge devices

Once connected to Azure Arc, they become manageable Azure resources.

3. Kubernetes Cluster

The Kubernetes layer provides:

Container orchestration
Scheduling
Networking
Scaling
Workload lifecycle management

4. Custom Location

Custom Locations create a logical mapping between Azure resources and edge infrastructure.

They allow Azure services to target workloads to specific on-prem or edge environments.

5. Device Pool

A device pool groups machines participating in a cluster deployment.

This becomes especially important in multi-node environments.

6. Logical Network (LNET)

The Logical Network defines:

Cluster networking
IP allocation
Gateway configuration
DNS behavior

Networking is one of the most critical parts of any AKS Arc deployment.

Infrastructure Planning Before Deployment

Before starting deployment, infrastructure readiness is essential.

Hardware Recommendations

For a lab or proof-of-concept deployment:

Component	Recommended
CPU	4+ vCPUs
RAM	16 GB minimum
Disk	256 GB SSD
Network	Stable internet connectivity

For production deployments:

Redundant networking
High-performance storage
Multi-node clustering
Power redundancy
Secure network segmentation
Monitoring infrastructure

should all be considered.

Physical vs Virtual Infrastructure

AKS Arc supports both:

Physical hardware
Virtualized environments

Many engineers begin using:

Hyper-V
VMware
Other virtualization platforms

for lab simulation and testing.

Virtual Machine Advantages

Benefit	Explanation
Faster setup	Easier experimentation
Lower cost	No dedicated hardware needed
Flexible snapshots	Quick rollback capability
Easier automation	Infrastructure reproducibility

Physical Hardware Advantages

Benefit	Explanation
Realistic edge testing	Accurate network behavior
Hardware validation	BIOS, TPM, drivers
Production readiness	Real deployment conditions

Step-by-Step AKS Arc Deployment Flow

Now let us walk through the deployment lifecycle.

Step 1 – Prepare Infrastructure

Create or identify:

Physical servers
Edge devices
Virtual machines

Ensure:

Internet connectivity exists
Static IP planning is completed
DNS resolution works correctly
Firewall rules allow Azure communication

Step 2 – Configure Virtualization Environment (Optional)

If using virtualization:

Enable:

Hypervisor platform
Virtual networking
NAT or bridged networking

Create:

Internal virtual switch
DHCP-enabled network
Internet routing

A stable network configuration is critical because cluster deployment depends heavily on:

API communication
Agent registration
Extension downloads
Kubernetes node communication

Step 3 – Install Operating Environment

Install the operating system image on the target machine.

Typical requirements include:

Linux-based edge operating system
Container runtime support
Kubernetes prerequisites
Secure boot considerations
TPM enablement

Recommended VM sizing:

16 GB RAM
4 processors minimum
256 GB storage

Get Image Reference to know where to Download Azure Local OS (ROE) and Azure Local Configurator App -

Step 4 – Connect Infrastructure to Azure Arc

Once the machine is operational:

Install Arc connectivity components
Register the machine with Azure
Verify successful onboarding

After successful registration:

The machine becomes visible inside Azure Portal
Azure governance capabilities become available

At this stage, the machine transitions from:

“Standalone infrastructure”

to:

“Azure-managed hybrid resource”

Login to Azure and Set Subscription az login az account set \ --subscription "<subscription-id>" Install Connected Machine Agent #Install Azure Arc agent on Linux machine. wget https://aka.ms/azcmagent -O ~/install_linux_azcmagent.sh bash ~/install_linux_azcmagent.sh Connect Machine to Azure Arc #Register machine as Arc-enabled server. sudo azcmagent connect \ --resource-group "<resource-group>" \ --tenant-id "<tenant-id>" \ --location "<azure-region>" \ --subscription-id "<subscription-id>" Verify Arc Agent Status #Confirm successful Arc onboarding. azcmagent show Agent Status : Connected Verify Arc Machine in Azure #List Arc-enabled servers. az connectedmachine list \ --resource-group "<resource-group>"

Step 5 – Create the Azure Arc Site

The Arc Site acts as the logical container for edge infrastructure.

During setup:

Select subscription
Choose resource group
Define region
Register machines into the site

This enables Azure to organize:

Provisioned devices
Clusters
Networking resources
Operational metadata

Create Resource Group #Logical container for Arc resources. az group create \ --name "<resource-group>" \ --location "<azure-region>" Register Required Providers #Enable Arc and AKS Arc services. az provider register --namespace Microsoft.HybridCompute az provider register --namespace Microsoft.Kubernetes az provider register --namespace Microsoft.KubernetesConfiguration az provider register --namespace Microsoft.ExtendedLocation az provider register --namespace Microsoft.ResourceConnector az provider register --namespace Microsoft.ContainerService Verify Provider Registration #Ensure providers are fully available. az provider show \ --namespace Microsoft.Kubernetes \ --query registrationState Install Arc Extensions #Enable AKS Arc management capabilities. az extension add --name connectedk8s az extension add --name customlocation az extension add --name k8s-extension az extension add --name aksarc Create Custom Location #Map Azure services to edge infrastructure. az customlocation create \ --name "<custom-location-name>" \ --resource-group "<resource-group>" \ --host-resource-id "<connected-cluster-resource-id>" \ --namespace "<namespace>" \ --cluster-extension-ids "<extension-id>"

Step 6 – Verify Machine Readiness

After onboarding:

Machines undergo provisioning
Agents initialize
Connectivity validation occurs
Extensions are deployed

Eventually the machine reaches a healthy operational state.

Typical indicators:

Connected
Ready
Cluster-capable

Provisioning time may vary significantly depending on:

Network quality
Hardware performance
Extension installation time
Azure synchronization delays

Check Arc Machine Connectivity #Verify machine connection status. az connectedmachine show \ --name "<machine-name>" \ --resource-group "<resource-group>" #Look for: #status : Connected

Step 7 – Deploy the AKS Arc Cluster

Once infrastructure is ready:
begin cluster deployment.

Deployment configuration usually includes:

Cluster name
Node selection
Networking configuration
IP assignment
DNS configuration
Gateway definition

Create Logical Network (LNET) #Define networking for AKS Arc cluster. az aksarc network create \ --name "<lnet-name>" \ --resource-group "<resource-group>" Create AKS Arc Cluster #Deploy Kubernetes cluster on Arc infrastructure. az aksarc create \ --name "<cluster-name>" \ --resource-group "<resource-group>" \ --custom-location "<custom-location-id>" \ --vnet-ids "<logical-network-id>" Verify Kubernetes Connectivity #Check Arc-enabled Kubernetes status. az connectedk8s list \ --resource-group "<resource-group>" Check Installed Extensions #Validate required Arc extensions. az k8s-extension list \ --cluster-name "<cluster-name>" \ --resource-group "<resource-group>" \ --cluster-type connectedClusters Check Node Readiness #Validate Kubernetes node health. kubectl get nodes

Understanding Networking Parameters

Networking is often the most misunderstood area in AKS Arc deployments.

Let us simplify the important parameters.

Subnet

Defines the IP range used by:

Kubernetes nodes
Cluster services
Internal communication

Example:

192.168.1.0/24

DNS Server

Used for:

Name resolution
Azure connectivity
Package downloads
Kubernetes service discovery

Public DNS examples:

8.8.8.8
1.1.1.1

Production environments typically use internal enterprise DNS.

Default Gateway

The gateway routes traffic outside the local subnet.

Without correct gateway configuration:

Azure connectivity fails
Agent communication breaks
Cluster provisioning may stall

Host IP

Each machine requires a unique static IP.

This IP identifies:

Kubernetes nodes
Cluster hosts
Edge infrastructure endpoints

Control Plane IP

The Kubernetes API server requires a stable endpoint.

This becomes the cluster management address used by:

kubectl
automation tools
CI/CD systems

Step 8 – Cluster Provisioning by Azure

Once deployment begins, Azure creates:

Device pools
Custom locations
Logical networks
Kubernetes control plane resources
Cluster integration resources

Provisioning can take:

1 to 2 hours depending on environment

This duration surprises many first-time users.

Unlike cloud-native AKS:
AKS Arc deployments involve:

Hybrid coordination
Infrastructure validation
Edge synchronization
Agent deployment
Local networking configuration

Step 9 – Verify Cluster Connectivity

After deployment:
verify cluster health.

Common validation steps:

az aksarc get-credentials \ --name <cluster-name> \ --resource-group <resource-group>

This retrieves Kubernetes credentials locally.

Then verify nodes:

kubectl get nodes

Healthy output typically shows:

STATUS = Ready

for all participating nodes.

Operational Benefits of AKS Arc

Once operational, AKS Arc provides several major advantages.

Centralized Governance

Using Azure Policy:

Security baselines
Compliance rules
Tagging standards
Resource controls

can be enforced consistently.

Unified Monitoring

Integration with:

Azure Monitor
Container Insights
Log Analytics

provides operational visibility across distributed infrastructure.

az k8s-extension create \ --name azuremonitor-containers \ --cluster-name "<cluster-name>" \ --resource-group "<resource-group>" \ --cluster-type connectedClusters \ --extension-type Microsoft.AzureMonitor.Containers

GitOps-Based Deployments

AKS Arc supports GitOps workflows where:

Kubernetes manifests
Helm charts
Configuration updates

can be synchronized automatically from Git repositories.

az k8s-configuration flux create \ --cluster-name "<cluster-name>" \ --resource-group "<resource-group>" \ --name "<gitops-config>"

Hybrid Consistency

Teams can operate Kubernetes similarly across:

Azure cloud
On-premises
Edge environments

This reduces operational fragmentation.

Common Challenges in AKS Arc Deployments

Real-world deployments are rarely frictionless.

Here are some practical issues engineers often encounter.

1. Networking Misconfiguration

Symptoms:

Provisioning stuck
Cluster not connecting
Agents unhealthy

Root causes:

Incorrect subnet
Invalid gateway
DNS failures
Firewall restrictions

2. Slow Provisioning

Provisioning delays are common.

Reasons include:

Extension deployment time
Image downloads
Edge connectivity latency
Infrastructure initialization

Patience becomes important during initial deployment.

3. Resource Constraints

Insufficient:

RAM
CPU
Storage

can destabilize the Kubernetes environment.

Edge clusters still require enterprise-grade resource planning.

4. Hybrid Debugging Complexity

Troubleshooting spans:

Azure
Kubernetes
Networking
Local infrastructure
Arc agents

This requires multidisciplinary operational knowledge.

Best Practices for AKS Arc Deployments

Plan Networking Early

Most deployment issues originate from poor IP planning.

Document:

Subnets
DNS
Gateways
Static IP allocations

before deployment begins.

Start with Single-Node Labs

Begin small:

Validate architecture
Learn deployment flow
Test operational processes

Then scale toward production-grade clusters.

Monitor Everything

Collect:

System logs
Kubernetes events
Arc agent logs
Network diagnostics

Hybrid environments require strong observability.

Treat Edge Like Production Infrastructure

Even lab environments should implement:

Security controls
Identity management
Backup planning
Access governance

Real-World Use Cases

AKS Arc is especially valuable in environments where low latency or disconnected operations matter.

Examples include:

Industry	Use Case
Manufacturing	Factory automation
Retail	Store analytics
Energy	Remote substations
Healthcare	Local processing
Logistics	Warehouse orchestration
Telecom	Edge compute platforms

Final Thoughts

Azure Arc fundamentally changes how organizations think about infrastructure management.

Instead of forcing workloads entirely into the cloud, Azure Arc extends Azure’s operational capabilities to wherever infrastructure already exists.

Combined with Azure Kubernetes Service, organizations gain:

Kubernetes consistency
Hybrid governance
Centralized operations
Edge deployment capability
Cloud-native management beyond cloud boundaries

As edge computing adoption grows, AKS Arc is becoming an increasingly important platform for modern hybrid infrastructure architectures.

Azure Arc extends Azure management capabilities beyond Azure cloud boundaries, while AKS Arc enables Kubernetes clusters to run consistently across edge, on-premises, and hybrid environments.

Tags:

Azure Arc
Azure Kubernetes Service
Azure Arc AKS
Hybrid Cloud
Edge Computing
Azure Local

Scaling GitHub Advanced Security in Azure DevOps with a single reusable YAML template

Paulams732 — Mon, 11 May 2026 10:47:04 GMT

Scaling GitHub Advanced Security in Azure DevOps with a single reusable YAML template

Managing security scanning across dozens of repositories can quickly become complex—especially when each repository uses different languages, frameworks, and infrastructure patterns.

In our environment, we needed a scalable way to apply GitHub Advanced Security (GHAS) consistently across:

Application code (Python, C#, Java, JavaScript)
Infrastructure as Code (Terraform, ARM, Bicep)
Mixed (polyglot) repositories

Instead of maintaining multiple pipelines, we built a single reusable Azure DevOps YAML template that dynamically adapts to any repository.

The problem

Most teams struggle with:

Multiple pipelines for different tech stacks
Inconsistent security coverage
Maintenance overhead across repositories
Unnecessary scans increasing build time

We needed a solution that:

Detects repository content automatically
Runs only relevant scans
Standardizes security across all repos
Minimizes duplication

Solution overview

The solution is a single-stage pipeline template with three key jobs:

Detect repository content
Run CodeQL for application code
Run IaC security scanning

Scanning behavior is driven entirely by detection outputs.

Architecture

🟦 High-level flow

Key design patterns

1. Detection-driven execution

Instead of hardcoding logic, the pipeline first detects repository content.

✅ Runs only when code is present
✅ Avoids unnecessary execution

2. Single template for all repositories

A single YAML template works for:

Backend services
Frontend apps
IaC repositories
Mixed projects

No duplication. No branching logic across repos.

3. Dynamic CodeQL configuration

The pipeline generates a runtime CodeQL config file:

✅ Keeps configuration centralized
✅ Avoids scan failures due to irrelevant directories

4. Language-aware setup

The pipeline dynamically prepares environments:

✅ No need for separate pipelines
✅ Works across polyglot repos

5. Correct CodeQL build strategy

For compiled languages like C#, the pipeline performs build tracing:

✅ Ensures proper CodeQL extraction
✅ Avoids empty-database failures

6. Integrated IaC security scanning

Infrastructure scanning is handled in the same pipeline:

✅ Covers Terraform, ARM, Bicep
✅ Unified reporting across code and infrastructure

7. Centralized reporting

Artifacts are published for traceability:

Code scanning results → CodeScanningReports
IaC results → IaCSecurityReports

✅ Easy audit and troubleshooting
✅ Retains SARIF outputs

Benefits

This approach delivers:

✔ One pipeline for all repositories
✔ Reduced maintenance overhead
✔ Consistent security enforcement
✔ Faster pipeline execution
✔ Scalable DevSecOps model

Lessons learned

Detection-first pipelines are critical for scale
Config-driven CodeQL execution prevents failures
Build tracing must be handled explicitly for compiled languages
IaC scanning should not be a separate workflow

Conclusion

Scaling GitHub Advanced Security across Azure DevOps doesn’t require multiple pipelines—it requires the right architecture.

By combining:

Detection-driven execution
Dynamic configuration
Conditional setup
Unified scanning

You can operationalize security at scale with a single reusable YAML template.

Understanding the deployment quota limitation (800) Error in Azure Bicep and ARM Deployments

ranjan_ashish — Mon, 11 May 2026 05:12:56 GMT

Understanding deployment quota limitation (800) Error in Azure Deployments (Bicep/ARM)

Introduction

When working with Infrastructure as Code (IaC) using Azure Bicep or ARM templates, deployment failures are a common part of day-to-day operations—especially in large-scale enterprise environments.

One such frequently encountered but often misunderstood issue is the quota limitation (800) error, which typically occurs during repeated or automated deployments.

Figure: Azure Bicep deployment failure showing DeploymentQuotaExceeded error after reaching the 800 deployment history limit, with reference to aka.ms/800 for remediation.

{ "code": "DeploymentFailed", "target": "/subscriptions/cxxx8e00-0add-4f8a-8709-xxxxxxxxxxxx/resourceGroups/pxs-azure-connectivity-d-gwc-dnszone-rg/providers/Microsoft.Resources/deployments/ppiwwpkn7kkla-pdns-zone-deployment", "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.", "details": [ { "code": "DeploymentQuotaExceeded", "message": "Creating the deployment '46d3xbcp.res.network-privatednszone.0-6-0.rysq' would exceed the quota of '800'. The current deployment count is '800'. Please delete some deployments before creating a new one, or see https://aka.ms/800LimitFix for information on managing deployment limits." } ] }

This blog explains:

What this error means
A practical Bicep deployment use case
Root causes
Resolution approaches
Preventive best practices

What is the quota limitation (800) Error?

The quota limitation (800) reference is commonly associated with a deployment quota limitation in Azure Resource Manager (ARM).

In simple terms:
Azure limits the number of deployment records that can be stored per resource group.

Key detail:

Maximum allowed deployment history entries per resource group: 800

Once this limit is exceeded:

New deployments fail
Error messages such as the following are observed:

“DeploymentQuotaExceeded”

The current deployment count is '800'. Please delete some deployments before creating a new one.

This happens because Azure maintains deployment history for auditing, tracking, and troubleshooting purposes. [aka.ms/800]

Use Case: Bicep Deployment Failure in CI/CD

Scenario

An organization is deploying infrastructure using a Bicep template through an automated pipeline.

Example command:

az deployment group create \ --resource-group prod-rg \ --template-file main.bicep \ --parameters @params.bicepparam

Environment Characteristics

Continuous deployment using pipelines (Azure DevOps / GitHub Actions)
Multiple deployments triggered daily
Incremental deployment mode enabled
A shared resource group used across multiple deployments

Issue Encountered

After repeated deployments over time, the following failure occurs:

Error: DeploymentQuotaExceeded

The current deployment count is '800'

See aka.ms/800 for more information

At this point, no further deployments succeed in that resource group.

Root Cause Analysis

Deployment History Limit

Azure stores every deployment execution as a record under:

Resource Group → Deployments

These records accumulate over time, and once the count reaches 800, new deployments are blocked.

Important clarification:

This is not a resource limit (VMs, VNets, etc.)
This is a metadata limit related to deployment history

High Frequency CI/CD Deployments

In enterprise environments, pipelines may run frequently due to:

Minor configuration updates
Validation runs
Automated releases

Each run contributes to the deployment count.

Absence of Cleanup Mechanism

Although Azure manages some cleanup automatically, it is not always sufficient in high-frequency environments. Manual or automated cleanup is often required.

Resolution Approaches

Option 1: Manual Cleanup from Azure Portal

Navigate to:

Azure Portal
Resource Group
Deployments

Delete older deployment entries manually to free up space.

Option 2: Cleanup Using Azure CLI

#List deployments: az deployment group list \ --resource-group prod-rg \ --query "[].name" -o tsv #Delete a deployment: az deployment group delete \ --resource-group prod-rg \ --name <deployment-name>

Option 3: Automated Cleanup (Recommended)

Example PowerShell approach:

$deployments = Get-AzResourceGroupDeployment -ResourceGroupName "prod-rg" if ($deployments.Count -gt 700) { $deployments | Sort-Object Timestamp | Select-Object -First 100 | ForEach-Object { Remove-AzResourceGroupDeployment -ResourceGroupName "prod-rg" -Name $_.DeploymentName } }

This approach ensures that older deployments are periodically removed, preventing quota exhaustion.

Option 4: Use Multiple Resource Groups

Instead of using a single resource group for all deployments:

Separate environments (Dev, Test, Prod)
Temporary or experimental deployments

This helps distribute deployment records across multiple scopes.

Best Practices

Implement Deployment Retention Policy

Maintain only recent deployments (for example, last 100–200)
Automate deletion of older entries

Control Deployment Frequency

Avoid unnecessary pipeline triggers
Batch multiple changes into a single deployment

Use Predictable Deployment Naming

Example:

name: 'deploy-${utcNow()}'

This improves traceability and cleanup management.

Monitor Deployment Count

Example:

az deployment group list \ --resource-group prod-rg \ --query "length(@)"

Set alerts or monitoring thresholds if required.

Understand Deployment Mode Behavior

Incremental deployments prevent unwanted deletions of resources but still increase the deployment history count.

Common Misconceptions

Misconception	Reality
Resource quota exceeded	The issue is related to deployment history
Template is invalid	The template can be valid but blocked by quota
Permission issue	Not related to RBAC
Regional limitation	Independent of region

Related Deployment Errors

While troubleshooting deployments, other common errors may appear, such as:

Authorization failures (insufficient permissions) [learn.microsoft.com]
Invalid template errors (syntax or parameter mismatch)
Concurrent deployment conflicts

It is important to analyze deployment logs to identify the exact failure reason.

Summary

Area	Key Insight
Error Type	Deployment quota limitation
Limit	800 deployments per resource group
Primary Cause	Frequent CI/CD executions
Resolution	Delete older deployment history
Prevention	Automate cleanup and monitor usage

Closing Thoughts

For teams operating at scale with Azure Bicep and automated pipelines, this issue is common but preventable.

The key takeaway is to treat deployment history as an actively managed component of your environment. Without proper governance, it can become a blocking factor for ongoing automation efforts.

CHERIoT-Ibex: Closing the door on memory safety vulnerabilities with hardware-enforced protection

kunyanliu — Sat, 09 May 2026 05:08:11 GMT

Memory safety vulnerabilities—largely arising from widely used programming languages such as C and C++—remain a leading cause of exploitable software defects across systems, from embedded devices to cloud-scale infrastructure. In simple terms, memory safety ensures that software accesses only the data it is intended to use; when this protection fails, attackers can exploit these defects to gain control of devices or disrupt critical services. 

Industry data shows that about 70 percent of the vulnerabilities Microsoft assigns as Common Vulnerabilities and Exposures (CVE) each year are memory safety issues, highlighting how frequently these software defects translate into real-world security risk (CISA – The Urgent Need for Memory Safety in Software Products). Hardware-enforced protections such as CHERIoT-Ibex can help eliminate these vulnerabilities at their source, reducing the likelihood that low-level software flaws can be exploited to compromise devices or disrupt workloads, supporting more trustworthy infrastructure by design. 

An open and certified foundation for memory-safe embedded systems

CHERIoT-Ibex is the first open-source production-quality implementation of the CHERIoT instruction set architecture and among the first cores certified by the CHERI Alliance (CHERI Alliance – CHERIoT). CHERIoT is an extension of the CHERI (Capability Hardware Enhanced RISC Instructions) instruction set, with a focus on embedded and Internet of Things (IoT) applications. Ibex is an open‑source 32‑bit RISC‑V core developed by LowRISC. CHERIoT‑Ibex builds on Ibex by including CHERIoT capability extensions to provide hardware‑enforced memory safety and fine‑grained compartmentalization. It is the result of a close partnership between Microsoft Research and Azure Hardware Systems & Infrastructure, combining advanced research innovation with industry-leading silicon IP development expertise. 

In 2023, Microsoft open-sourced the CHERIoT Platform to bring hardware-enforced memory safety to embedded systems, including an instruction set architecture, toolchain, real-time operating system, and the RTL implementation of the CHERIoT-Ibex core. The CHERI Alliance certification recognizes its ability to provide spatial and temporal memory safety, fine-grained compartmentalization, and compatibility with the broader CHERI ecosystem. Critically, CHERIoT-Ibex achieves these security guarantees with power and area efficiency comparable to low-cost microcontrollers, demonstrating that security doesn’t have to come at a premium. 

Why memory safety remains a foundational security challenge

Traditional embedded and microcontroller-class designs rely on software hardening and coarse-grained hardware protections that struggle to prevent attacks such as buffer overflows and use-after-free vulnerabilities, often adding complexity while still leaving gaps in protection. 

Consider a controller that runs privileged firmware responsible for device initialization, telemetry, and system health monitoring, while also hosting networking functionality exposed to external inputs. A memory-safe vulnerability in the networking stack could allow attackers to execute unauthorized code within the firmware environment, potentially affecting other critical services on the device. In tightly integrated systems, these failures can propagate beyond a single component, increasing overall risk.

Constraining failures with hardware-enforced isolation

CHERIoT-Ibex enables hardware-enforced isolation between these components, helping ensure that even if the networking stack is compromised, its ability to impact system initialization or telemetry functions remains constrained. By limiting the blast radius of software failures, CHERIoT-Ibex supports a system-level approach to security rather than relying on individual components to defend themselves in isolation.

Advancing memory-safe infrastructure by design

CHERIoT-Ibex’s certification by the CHERI Alliance marks an important milestone for open-source memory-safe solutions. It validates that strong security guarantees can coexist with efficiency and transparency, reflecting Microsoft’s broader silicon-to-systems strategy of embedding security into the foundational hardware infrastructure.

Explore and engage with the open-source CHERIoT ecosystem by visiting the CHERIoT Platform and the CHERIoT-Ibex GitHub repository (microsoft/cheriot-ibex). The repositories enable developers and researchers to experiment with, contribute to, and build on memory-safe hardware and software foundations. 

Safely Migrating Terraform Managed Disks on Azure Using Stable Keys and Copilot

shwetayadav — Fri, 08 May 2026 16:02:01 GMT

The Root Cause: Index-Based "for_each" Keys:

Many Terraform modules flatten VM and disk definitions into a list and use the list index as the for_each key:

for_each = { for index, sp in local.managed_disks : index => sp }

This pattern looks harmless, but the index is not stable:

Adding a disk to one VM shifts downstream indices
Reordering environment JSON changes flatten order
Terraform treats shifted indices as new resources

The result: Terraform plans to destroy and recreate all affected managed disks—even though nothing changed in Azure.

Why This Is Especially Risky on Azure:

Azure managed disks are often:

Attached to stateful application tiers
Used for databases, middleware, or batch workloads
Deployed across zones for resiliency

A forced disk replacement can mean:

Data loss
Extended outages
Failed change windows

This makes state stability a first-class design concern—not an implementation detail.

The Stable Key Pattern:

The fix is conceptually simple: use a domain-stable identifier for each disk.

A proven pattern is:

"${sp.vm}-${sp.data_disk.lun}"

This key is:

Deterministic
Independent of ordering
Human-readable
Stable across environments

Example:

VM	LUN	Stable Key
vm1	0	vm1-0
vm1	1	vm1-1
vm2	0	vm2-0

Once applied, adding a new disk results in exactly one new resource, with zero churn.

The Migration Challenge: Terraform State:

Changing for_each keys alone is not enough.

Terraform tracks resources by their state address, not by Azure resource ID. When keys change, Terraform believes:

“The old disks were deleted, and new ones must be created.”

To prevent this, we must move the state, not recreate the resource.

That is where terraform state mv comes in.

Automating the Migration with GitHub Copilot Skills:

To remove risk and human error, the team created a reusable Copilot skill for managed disk key migration.

What the Skill Does:

Inspects Terraform modules for index-based for_each
Reads environment JSON files (such as ALZ variable abstractions)
Reconstructs the exact flatten order used by Terraform
Generates precise terraform state mv commands

This ensures:

No guessing
No manual address mapping
No production surprises

The skill is stored directly inside the repository under .github/skills, making it:

Discoverable
Versioned
Shareable across teams

Example: Generating State Move Commands:

Based on environment JSON, Copilot can generate commands like:

terraform state mv \

'module.managed_disk_windowsvm_app["0"]' \

'module.managed_disk_windowsvm_app["vm1-0"]'

This is repeated deterministically for every existing disk—before any plan or apply.

Recommended Migration Workflow:

Confirm clean state
- terraform plan shows no pending changes
Update the module
- Replace index-based keys with stable keys
Back up the state
- Especially critical with remote backends (Azure Storage)
Run terraform state mv
- Only after terraform init is connected to the correct backend
Re-run plan
- Existing disks should show no changes
Add new disks safely
- Terraform creates only the new disk

CI/CD and Remote Backend Considerations:

A critical finding from this migration:

terraform state mv always modifies the currently initialized backend.

In pipeline-driven environments:

Ensure the correct environment is initialized
Run migrations once per environment
Never merge stable-key code before migrating all environments

Failing to align code and state can cause disk destruction in production.

Key Takeaways:

Index-based for_each keys are unsafe for long-lived Azure disks
Stable keys such as vm-lun eliminate accidental resource churn
State migration is mandatory—not optional
Copilot skills are powerful for institutionalizing safe patterns
Small Terraform design choices can have enterprise-scale impact

Closing Thoughts:

This pattern is broadly applicable beyond disks—to NICs, extensions, and any resource where identity must outlive ordering.

By combining:

Stable Terraform design
State-aware migrations
GitHub Copilot automation

Teams can make infrastructure changes boring again—and that is the ultimate reliability goal.

Building Secure AI Platforms in Banking Using Azure Enterprise Architecture

divyanshi_varshney — Thu, 07 May 2026 16:03:01 GMT

1. Introduction: AI in Banking Is Not Just a Model Problem

Modern banking institutions are no longer asking “Can we use AI?”
The real question is:
“Can we use AI without violating regulatory, security, and data residency constraints?”

Unlike public AI applications, banking systems must ensure:

No public internet exposure
Strict identity-based access control
End-to-end auditability
Data residency compliance
Fully controlled inference pipelines

👉 In enterprise environments, AI success is driven by secure infrastructure—not just model accuracy.

2. Core Design Principle: Controlled Intelligence System

Every AI request must follow a security-enforced execution pipeline:

User Request ↓ Secure Edge (Application Gateway + WAF) ↓ API Governance Layer (API Management - Internal Mode) ↓ AI Orchestration Layer (AKS / App Services) ↓ Retrieval + Policy Layer (RAG + Guardrails) ↓ Private AI Services (Azure OpenAI) ↓ Observability Layer (AMPLS) ↓ Final Response

Key Insight:
This is not just an architecture—it is a controlled and auditable execution model.

3. Azure Enterprise AI Architecture (Production-Ready Pattern)

A real-world architecture used in banking environments:

4. Private Connectivity Model (Critical for Compliance)

Key components:

Private Endpoints → Secure PaaS isolation
Private DNS Zones → Controlled name resolution
VNet Integration → Internal service communication
Azure Firewall → Traffic inspection and control

⚠️ Common Production Failure:

AKS pods fail to resolve Azure OpenAI private endpoint
Root cause:
- Missing Private DNS links
- Incorrect VNet configuration

👉 This is one of the most frequent failures in enterprise AI deployments.

“Debugging Private Endpoint Failures”

Include:

nslookup behavior in AKS
DNS zone linking check
VNet integration validation
UDR / Firewall inspection

5. Identity-First Security Model (No Secrets Architecture)

Modern banking architectures eliminate static credentials entirely.

Authentication Flow:

AKS Workload → Managed Identity → Azure AD → Azure Services

Key Principle:
👉 Identity is the new security perimeter.

Benefits:

No API keys or secrets
Simplified access management
RBAC-based governance
Fully auditable access

6. Secure AI Inference Pipeline

A production AI request flow:

def process_request(user_request): # 1. Authenticate user via Azure AD identity = authenticate_aad(user_request.token) if not identity or not identity.is_valid: return "ACCESS_DENIED" # 2. Enforce rate limiting per identity if not rate_limit(identity): return "RATE_LIMIT_EXCEEDED" # 3. Apply prompt security guardrails (injection protection) safe_prompt = apply_prompt_guardrails(user_request.prompt) # 4. Content safety filtering (PII / harmful content detection) if not content_filter(safe_prompt): return "CONTENT_BLOCKED" # 5. Retrieve secure RAG context context = retrieve_rag_context( query=safe_prompt, secure_mode=True ) # 6. Build final prompt final_prompt = merge_prompt_and_context(safe_prompt, context) # 7. Call Azure OpenAI with circuit breaker protection response = circuit_breaker( lambda: call_openai( prompt=final_prompt, identity=ManagedIdentity() ) ) # 8. Validate and sanitize model output validated_output = sanitize(response) # 9. Log everything for audit + compliance (AMPLS / SIEM) log_to_ampls( identity=identity, request=user_request, response=validated_output ) return validated_output

Security controls include:

Prompt injection filtering
Context grounding (RAG)
Output sanitization
Full audit logging

7. RAG Architecture: Enterprise AI Backbone

User Query → Embedding Model → Azure AI Search (Vector Store) → Context Retrieval → Azure OpenAI → Final Response

Why RAG is preferred in banking:

No model retraining required
Controlled data exposure
Easier compliance validation
Real-time knowledge updates

In banking systems, retrieval is not just about relevance—it is about controlled disclosure of sensitive context

8. Observability with AMPLS (A Critical Yet Overlooked Layer)

AI telemetry flows through:

Azure Services → Private Link → AMPLS → Log Analytics / App Insights

Why this matters: Logs may contain:

Sensitive financial data
PII
Prompt inputs

👉 AMPLS ensures telemetry remains private and compliant.

9. Regulatory Mapping: Banking Requirements to Azure Capabilities

Requirement	Azure Implementation
No public exposure	Private Endpoints
Identity-based security	Azure AD + Managed Identity
Audit compliance	Log Analytics + AMPLS
Data protection	Customer-Managed Keys (CMK)
Network isolation	VNet + Firewall
Access governance	RBAC + PIM

10. Real-World Production Challenges

Common failure points in enterprise AI systems:

DNS Misconfiguration – Private endpoints fail resolution
Latency Chains – Excessive service hops
OpenAI Rate Limits – High enterprise load
Identity Propagation Issues – Cross-subscription failures
Observability Gaps – Missing distributed tracing

11. Enterprise Architecture Best Practices

Design with zero-trust principles
Treat AI as a distributed system, not a single component
Centralize governance using API Management
Never expose AI services publicly
Use identity everywhere—no secrets
Separate:
- Control Plane (governance)
- Data Plane (inference execution)

12. Azure Service Mapping (Quick Reference)

Layer	Azure Services
Edge Security	Application Gateway (WAF)
API Layer	API Management
Compute	AKS / App Services
AI Services	Azure OpenAI
Retrieval	Azure AI Search
Data	Azure Storage / SQL
Identity	Azure AD + Managed Identity
Networking	Private Link + VNet
Observability	AMPLS + Log Analytics

13. Common Failure Patterns

Issue	Root Cause
AI endpoint unreachable	DNS / Private endpoint misconfig
Data leakage risk	Missing prompt filtering
High latency	Over-layered architecture
Unauthorized access	Identity misconfiguration
Poor response quality	Weak RAG implementation

14. Final Thought

In enterprise banking AI systems:

Models are replaceable. Architecture is not.

The real challenge is designing a system where AI is:

Secure
Controlled
Observable
Fully compliant

How Validation‑Driven Terraform Made Our Azure Function Deployments Predictable

AkshitaBajpai — Thu, 07 May 2026 05:12:06 GMT

When Terraform deploys Azure Functions, the most expensive failures are rarely “syntax” problems. They’re environmental mismatches discovered too late—during terraform apply, after approvals, after a change window starts, and often after multiple teams are already watching the pipeline.

After a few painful production-grade rollouts, we shifted to a validation-driven approach: instead of letting Azure reject misconfigurations at apply time, we fail fast at PR/plan time with clear messages that engineers can fix immediately.

What we mean by “validation‑driven”

Validation-driven Terraform uses three guardrails together:

PR checks: formatting, linting, security scanning, module contract tests
Pre-flight checks: quick Azure sanity checks (provider registration, storage prerequisites, RBAC basics)
Terraform-native validations: input validations + preconditions that stop invalid configurations before they reach Azure

The idea is simple: apply should be boring. If something is going to fail, it should fail earlier with better errors.

Azure Functions as the example: where failures actually happen

Azure Functions bring a few recurring “gotchas” that tend to show up late:

mismatch between plan/SKU and features (e.g., VNET integration expectations)
missing or inaccessible storage account settings
unsupported/incorrect runtime stack/version for a chosen hosting model/region/policy
inconsistent app settings required by your org platform standards

We converted these into guardrails with minimal code and clearer pipeline signals.

Case study 1: Wrong plan SKU causing runtime capability failures

Problem

A team deployed a Function App expecting network integration behavior, but the selected plan/SKU didn’t align with what the workload required. The pipeline failed late, after approvals, and the rollback discussion took longer than the fix.

What we changed

We added a small validation/precondition rule: if a team enables a capability that requires a certain class of plan, Terraform fails early with a targeted message.

Outcome

Failure moved from apply-time → plan-time
2+ hours saved per failed deployment cycle
Zero repeat incidents for that class of issue

Case study 2: Missing storage configuration blocking deployments

Problem

Function Apps depend heavily on storage configuration. We saw intermittent failures when storage settings pointed to deleted/incorrect resources, or when access expectations didn’t match reality.

What we changed

We introduced a pre-flight check step in Azure DevOps: verify storage existence/access and fail fast before plan/apply.

Outcome

Deployments stopped failing mid-run
Fewer “investigation loops” across teams
Reduced incident noise (the pipeline became self-explanatory)

Case study 3: Unsupported runtime version (region/org guardrails mismatch)

Problem

Engineers selected a runtime stack/version that was valid in isolation, but not aligned with platform support or readiness in the target environment. Failures appeared in apply or after release.

What we changed

We centralized an “allowed runtime list” (per org standards) and validated runtime inputs at plan time.

Outcome

Plan failed fast with clear explanation
No redeploy cycles
Better compliance posture (standards became enforceable code)

Why this matters (beyond “fewer red pipelines”)

Validation-driven Terraform improved more than deployment success rate:

Developer experience: errors became precise and actionable
Operational safety: fewer emergency approvals and late-night fixes
Standardization: platform rules stopped living in tribal knowledge and wikis
Manager-visible impact: less delivery friction, fewer escalations, faster releases

The best part: this wasn’t achieved by writing massive frameworks. It was achieved by adding small, high-leverage validations in the right places.

Minimal code approach (what we actually used)

We intentionally kept Terraform code small:

A few input validations (environment, runtime, naming contracts)
A few preconditions (must-have settings and plan constraints)
A light Azure DevOps pre-flight step for checks Terraform can’t reliably infer (like “does this dependency exist and is it accessible?”)

This way, the module stays readable, and the pipeline remains fast.

Conclusion

In one of our Azure Function platforms, repeated deployment failures were not caused by bugs in application code or gaps in Terraform itself—they were caused by discovering platform constraints too late. Each failed terraform apply triggered rework, additional approvals, and unnecessary operational noise across teams.

By introducing a validation‑driven approach—combining Terraform input validations, targeted preconditions, and lightweight Azure DevOps pre‑flight checks—we moved failure discovery to the right place: pull requests and plan stages. Azure Function‑specific issues such as incorrect plan capabilities, unsupported runtimes, and missing storage prerequisites were surfaced early, with clear, actionable messages.

If your Azure DevOps pipelines still use terraform apply as a discovery mechanism, validation is not an optimization—it’s a foundational platform capability.

Operationalizing Responsible AI in Microsoft Foundry within Enterprise Network Boundaries

Shruti9162 — Wed, 06 May 2026 05:23:44 GMT

Strategic Overview

Deploying Microsoft Foundry within a VNet-integrated landing zone requires a thoughtful balance between innovation and enterprise-grade security, especially in highly regulated industries like banking. This architecture enforces Responsible AI (RAI) principles and robust content safety controls while aligning with stringent security and compliance requirements.

By adopting a dual-stream design - comprising an AI Platform Layer and a Data Integration Layer, you can decouple model orchestration from data ingestion, enabling flexibility and scalability. Leveraging private networking constructs such as VNets, subnets, NSGs, and controlled routing ensures that all AI workloads operate within secure boundaries, while seamless integration with services like Azure AI Search, Azure Cosmos DB, Azure SQL Database, and Azure Document Intelligence enhances data accessibility and intelligence.

Event-driven ingestion patterns powered by Azure Data Factory and Azure Event Grid further enable real-time responsiveness. At the same time, real-world constraints - such as IP range allowlisting for Microsoft Foundry and private networking limitations—must be carefully accounted for. Ultimately, this approach ensures a secure, compliant, and scalable foundation for enterprise AI adoption.

Below are the pointers that this blog focuses:

Deploy Azure Microsoft Foundry in a VNet-integrated landing zone
Enforce Responsible AI (RAI) policies and content safety controls
Align AI architecture with enterprise (banking) security requirements
Implement a dual-stream architecture:
- AI Platform Layer
- Data Integration Layer
Use private networking with VNet, subnets, NSGs, and routing
Integrate with Azure AI Search, Cosmos DB, SQL, and Document Intelligence
Enable event-driven ingestion using Azure Data Factory and Event Grid
Account for real-world constraints:
- IP range allowlisting for Microsoft Foundry
- Private networking limitations
Design for secure, compliant, and scalable AI consumption

Problem Statement

Operationalizing Responsible AI in enterprise environments requires more than defining policies.

Key challenges include:

Translating Responsible AI principles into enforceable platform controls
Deploying AI services within private, enterprise-grade networks
Managing network constraints and service limitations
Ensuring consistent integration across:
- AI services
- Data pipelines
- Application layers

Without a structured approach, AI platforms risk being non-compliant, insecure, or difficult to scale.

Goals

Design an AI landing zone that:

Enforces Responsible AI at the platform level
Enables secure model deployment and access
Operates fully within private network boundaries
Integrates AI and Data services seamlessly
Provides governed and controlled AI consumption

Architecture Overview

Structure the platform into four layers:

Network Layer → VNet, subnets, NSGs, routing
AI Platform Layer → Microsoft Foundry, models, RAI policies
Data Layer → ADF, SHIR, Event Grid
Application Layer → Function Apps, Web Apps

Microsoft Foundry Setup in Enterprise Context

Set up Azure Microsoft Foundry as the core AI platform layer.

Key steps:

Create Microsoft Foundry projects to isolate use cases
Deploy models within controlled project boundaries
Restrict access using:
- VNet integration
- Private endpoints
Disable public access wherever possible
Integrate with supporting services:
- Azure AI Search (retrieval)
- Cosmos DB / SQL (data storage)

Design Principle

Treat AI services as governed platform components, not standalone resources.

Responsible AI Implementation

1. RAI Policies

Define policies at the model interaction layer
Configure controls for:
- Output moderation
- Prompt handling
Align policies with organizational compliance requirements

2. Content Safety

Integrate content safety as a mandatory runtime layer
Ensure all model outputs pass through filtering before reaching applications

Content Safety Flow

3. Model Governance

Control model deployment via Microsoft Foundry
Restrict direct access to models
Route all interactions through:
- Function Apps
- API layers

Handling VNet-Integrated Deployment Challenges

Enterprise deployments introduce constraints that must be addressed early.

Challenge 1: Microsoft Foundry VNet Integration

Microsoft Foundry requires careful network planning
Standard enterprise patterns may not work without validation

Challenge 2: IP Range Constraints

When designing the VNet:

10.x.x.x range
→ Not GA for all Azure regions by default
Requires:
→ Allowlisting via Microsoft Product Group
Supported ranges (commonly observed):
- 172.x.x.x
- 192.x.x.x

Recommended Approach

Validate supported IP ranges before finalizing network design
Avoid assuming default enterprise CIDR blocks will work
Plan subnets specifically for AI workloads

Challenge 3: Platform Constraints

AI services may behave differently compared to traditional PaaS services
Validate:
- Private endpoint compatibility
- Service integration within VNet

Challenge 4: Security vs Accessibility

Private deployments improve security but add complexity
Address this by:
- Providing controlled access paths
- Using jump hosts or secure access mechanisms

Key Design Considerations

Treat networking as a core dependency for AI platforms
Enforce Responsible AI across:
- Model layer
- Platform layer
- Runtime layer
Use layered security architecture:
- Network isolation
- Policy enforcement
- Content filtering
Validate constraints early to avoid redesign

Best Practices

Plan IP addressing specifically for AI workloads
Use private endpoints and VNet integration by default
Centralize model access through application layers
Apply Responsible AI controls as mandatory, not optional
Design AI platforms with governance built-in from the start

Conclusion

Operationalizing Responsible AI in Microsoft Azure goes beyond defining policies—it demands tight alignment across AI services, infrastructure, networking, and governance controls. A well-architected AI landing zone provides the foundation for securely deploying models, enforcing content filtering on outputs, and ensuring that access remains strictly within enterprise-defined boundaries.

AI services
Infrastructure
Networking
Governance controls

A well-designed AI landing zone ensures that:

Models are deployed securely
Outputs are governed and filtered
Access is controlled within enterprise boundaries

Responsible AI is not just a policy—it is an architectural outcome driven by platform design, network constraints, and enforcement mechanisms.

This holistic approach transforms Responsible AI from a conceptual guideline into a practical, enforceable outcome of system design. Crucially, early architectural decisions—particularly those related to networking, private access, and service compatibility—have a lasting impact on how effectively Responsible AI can be scaled across the organization.

Deploying Azure Resources with Managed HSM Keys Using Bicep

Roslin_Nivetha — Tue, 05 May 2026 09:30:46 GMT

Architecture Overview

The deployment includes:

Managed HSM instance
Key creation inside HSM
User-assigned managed identity / service principal
Role assignments for key access
Azure resource (e.g., Storage / Databricks / Disk) using CMK

Flow:

Create Managed HSM
Create encryption key
Assign permissions
Deploy resource with CMK reference

Prerequisites

Before starting, ensure:

Azure subscription with proper permissions
Access to create Managed HSM
Knowledge of RBAC vs access policies
Bicep CLI installed

Step 1: Deploy Managed HSM

Managed HSM is different from regular Key Vault:

Uses RBAC only (no access policies)
Requires security domain initialization

Bicep snippet:

resource managedHsm 'Microsoft.KeyVault/managedHSMs@2023-02-01' = {

name: hsmName

location: location

sku: {

name: 'Standard_B1'

family: 'B'

}

properties: {

tenantId: tenant().tenantId

initialAdminObjectIds: [

adminObjectId

]

}

Step 2: Create Key in Managed HSM

resource key 'Microsoft.KeyVault/managedHSMs/keys@2023-02-01' = {

name: '${managedHsm.name}/cmk-key'

properties: {

kty: 'RSA-HSM'

keySize: 2048

}

Step 3: Assign Permissions

Since Managed HSM uses RBAC, assign roles like:

Managed HSM Crypto User
Managed HSM Crypto Officer

resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {

name: guid(resourceGroup().id, principalId, roleDefinitionId)

properties: {

principalId: principalId

roleDefinitionId: roleDefinitionId

scope: managedHsm

}

Step 4: Configure Resource with CMK

Example: Storage Account encryption

resource storage 'Microsoft.Storage/storageAccounts@2023-01-01' = {

name: storageName

location: location

kind: 'StorageV2'

sku: {

name: 'Standard_LRS'

}

properties: {

encryption: {

keySource: 'Microsoft.Keyvault'

keyvaultproperties: {

keyname: key.name

keyvaulturi: managedHsm.properties.hsmUri

}

Common Challenges

1. Permission Issues

Resource identity must have access to HSM key
Missing role → deployment failure

2. Key Rotation Impact

When keys are rotated:

Resource may not automatically pick latest version
You may need to redeploy or update configuration

3. Deployment Errors

Typical issue:

Storage/Databricks cannot access HSM key

Fix:

Ensure correct RBAC role assignment
Validate principal ID used during deployment

Key Rotation Strategy

Managed HSM supports:

Manual rotation
Rotation policies

Best practice:

Use version-less key URI if supported
Automate redeployment pipeline

When to Use Managed HSM vs Key Vault

Feature	Managed HSM	Key Vault
FIPS Level	Level 3	Level 2
Multi-tenant isolation	No (dedicated)	Yes
RBAC only	Yes	Optional
Cost	Higher	Lower

Conclusion

Using Managed HSM with Bicep enables:

Stronger security with hardware-backed keys
Full automation via Infrastructure as Code
Enterprise-grade compliance

However, it requires careful handling of:

RBAC permissions
Key rotation
Resource integration

Building an AI Agent for Azure Infrastructure Validation

ranjsharma — Tue, 05 May 2026 06:05:23 GMT

1. Introduction

Infrastructure consistency is critical in large-scale Azure environments, especially in migration programs and DevOps-driven deployments. While Infrastructure as Code (IaC) using Terraform improves reproducibility, it does not fully eliminate:

Manual errors in design specifications
Drift between Terraform and deployed resources
Misalignment between approved design (Excel/architecture docs) and deployed state

To address this, we propose building an AI-powered Infrastructure Validation Agent that continuously validates and reconciles:

Excel (Source of Truth)
Terraform (.tf files)
Azure Deployed Resources

This blog explains the architecture, implementation, validation logic, and real-world applicability of such an agent.

2. Problem Statement

In enterprise environments, infrastructure data flows through multiple stages:

Source	Purpose
Excel / Design Sheets	Approved architecture specifications
Terraform	Infrastructure as Code implementation
Azure Portal	Actual deployed infrastructure

3.Common Challenges

Configuration mismatches across stages
Drift due to manual portal changes
Incorrect SKU, region, or configuration deployment
Lack of automated validation before and after deployment

The absence of unified validation leads to compliance risks, deployment errors, and operational inefficiencies.

4. Solution Overview

The proposed solution is an AI-powered validation agent that:

Ingests Excel as configuration input
Parses Terraform configurations
Fetches deployed resource details from Azure

5. Architecture Overview

High-Level Architecture Components

1. Input Layer
  - Excel file (configuration source)
2. Processing Layer
  - Terraform Parser
  - Azure Resource Fetcher
  - AI-based Validator (optional reasoning layer)
3. Comparison Engine
  - Schema-based comparison
  - Drift detection logic
4. Output Layer
  - Validation report (JSON / Excel / HTML)
5. Hosting
  - Azure Function App
6. Optional Enhancements
  - Azure AI Search for semantic matching and reasoning

6. Agent Design (Modular Components)

Module	Description
Excel Reader	Reads and standardizes input
Terraform Parser	Extracts resource configuration
Azure Fetcher	Queries deployed resources
Comparator Engine	Identifies mismatches
AI Validator	Enhances validation and recommendations
Report Generator	Produces actionable outputs

7. Agent Design
Step 1: Read Excel Input

import pandas as pd

ef read_excel(file_path):

df = pd.read_excel(file_path)

df.columns = df.columns.str.strip()

return df

excel_df = read_excel("infra_config.xlsx")

print(excel_df.head())

Step 2:Parse Terraform Files

import hcl2

def parse_terraform(file_path):

with open(file_path, 'r') as file:

data = hcl2.load(file)

resources = []

for resource_type in data.get('resource', []):

for rtype, instances in resource_type.items():

for name, config in instances.items():

resource = {

"resource_type": rtype,

"resource_name": name,

"config": config

}

resources.append(resource)

return resources

tf_resources = parse_terraform("main.tf")

print(tf_resources)

Step 3:Parse Terraform Files

from azure.identity import DefaultAzureCredential

from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()

subscription_id = "your-subscription-id"

resource_client = ResourceManagementClient(credential, subscription_id)

def fetch_azure_resources():

resources = []

for resource in resource_client.resources.list():

res = {

"name": resource.name,

"type": resource.type,

"location": resource.location,

"id": resource.id

}

resources.append(res)

return resources

azure_resources = fetch_azure_resources()

print(azure_resources)

Step 4:Normalize Data

def normalize_excel(df):

return df.to_dict(orient='records')

def normalize_tf(tf_resources):

normalized = []

for res in tf_resources:

normalized.append({

"resource_name": res["resource_name"],

"resource_type": res["resource_type"],

"config": res["config"]

})

return normalized

def normalize_azure(azure_resources):

normalized = []

for res in azure_resources:

normalized.append({

"resource_name": res["name"],

"resource_type": res["type"],

"location": res["location"]

})

return normalized

Step 5: Validation Logic (Drift Detection)

def compare_resources(excel_data, tf_data, azure_data):

issues = []

for excel_res in excel_data:

name = excel_res['resource_name']

tf_match = next((r for r in tf_data if r['resource_name'] == name), None)

az_match = next((r for r in azure_data if r['resource_name'] == name), None)

if not tf_match:

issues.append({

"resource": name,

"issue": "Missing in Terraform",

"severity": "High"

})

if not az_match:

issues.append({

"resource": name,

"issue": "Missing in Azure",

"severity": "Critical"

})

if tf_match and az_match:

if excel_res['region'] != az_match.get('location'):

issues.append({

"resource": name,

"issue": "Region mismatch",

"expected": excel_res['region'],

"actual": az_match.get('location')

})

return issues

drift_report = compare_resources(

normalize_excel(excel_df),

normalize_tf(tf_resources),

normalize_azure(azure_resources)

)

print(drift_report)

Step 6: Export Report to Excel

Sample validation

Resource	Issue	Expected	Actual	Severity
func-app-01	Missing in Terraform	-	-	High
search-01	SKU mismatch	Standard	Basic	Medium
webapp-01	Region mismatch	East US	West Europe	High

Building Multi-File Refactoring Agents with GitHub Copilot Workspace

Devi_Priya — Tue, 05 May 2026 03:49:40 GMT

Introduction

AI-assisted development continues to evolve beyond inline code suggestions toward end-to-end engineering workflows. While tools such as GitHub Copilot have significantly improved developer productivity at the function and file level, modern applications demand capabilities that operate across the entire repository.

Refactoring, modernization, and architectural changes rarely occur in isolation. They require coordinated updates across multiple files, services, and layers.

With GitHub Copilot Workspace, developers can now move from incremental edits to intent-driven, multi-file transformations powered by AI.

This article walks through:

The role of Copilot Workspace in modern development workflows
How to access and use the Workspace experience
A practical, end-to-end refactoring scenario
Key benefits and considerations for enterprise adoption

From Code Assistance to Code Orchestration

Traditional AI-assisted development focuses on generating code snippets in response to local context. While effective, this approach is limited when tasks require:

Cross-file consistency
Architectural alignment
Large-scale refactoring

Copilot Workspace introduces a different model:

Developers define intent, and AI orchestrates repository-wide execution.

This enables a shift toward:

Task-oriented development
Structured planning before execution
Coordinated multi-file updates

Getting Started with Copilot Workspace

Access to GitHub Copilot Workspace depends on feature availability and organizational enablement. The following entry points are commonly used:

Access from a Repository

Navigate to a repository in GitHub
Select the Copilot option or Open in Workspace

Direct Workspace Access

You can also navigate to:

https://github.com/copilot/workspace

If enabled, this opens the Workspace interface.

Understanding the Workspace Experience

Copilot Workspace provides a structured interface designed for task execution:

Intent Panel – Define the desired outcome
Planning View – Review AI-generated steps
Multi-file Editor – Inspect and refine changes
Execution Controls – Apply updates and create pull requests

This workflow emphasizes transparency and control, ensuring developers remain in the loop.

Key Benefits

Repository-Aware Intelligence

Copilot Workspace analyzes relationships across files, enabling more accurate and consistent transformations.

Intent-Driven Workflows

Developers focus on what needs to be done, while the system determines how to execute it.

Consistent Multi-File Updates

Changes are applied uniformly across controllers, services, and supporting components.

Accelerated Refactoring and Modernization

Large-scale changes can be executed efficiently, reducing manual effort and risk.

Practical Scenario: Modernizing Authentication

To illustrate the capabilities of Copilot Workspace, consider a common enterprise scenario:

An application currently uses password-based authentication, implemented across multiple layers. The goal is to migrate to a token-based authentication model using a centralized service.

Initial State

Authentication logic is distributed across the application:

ValidateUser(username, password)

This pattern appears in:

Controllers
Services
Middleware

Defining the Intent

Within GitHub Copilot Workspace, the developer provides a structured instruction:

Replace all password-based authentication with token-based authentication using AuthService. Update all references, introduce dependency injection, and ensure consistency across the application.

AI-Generated Plan

Copilot Workspace analyzes the repository and produces a plan that includes:

Identifying all usages of ValidateUser
Introducing a centralized authentication service
Updating controllers to return tokens
Refactoring middleware for token validation
Configuring dependency injection

This plan provides a transparent view of the proposed transformation.

Refactored State

Centralized Authentication Service

public interface IAuthService { string GenerateToken(string username); bool ValidateToken(string token); }

public class AuthService : IAuthService { public string GenerateToken(string username) { return Convert.ToBase64String(Encoding.UTF8.GetBytes(username)); } public bool ValidateToken(string token) { return !string.IsNullOrEmpty(token); } }

Updated Controller

public class UserController : Controller { private readonly IAuthService _authService; public UserController(IAuthService authService) { _authService = authService; } public IActionResult Login(string username, string password) { var token = _authService.GenerateToken(username); return Ok(new { Token = token }); }

Updated Middleware

public class AuthMiddleware { private readonly RequestDelegate _next; private readonly IAuthService _authService; public AuthMiddleware(RequestDelegate next, IAuthService authService) { _next = next; _authService = authService; } public async Task Invoke(HttpContext context) { var token = context.Request.Headers["Authorization"]; if (!_authService.ValidateToken(token)) { context.Response.StatusCode = 401; return; } await _next(context); } }

Dependency Injection

services.AddScoped<IAuthService, AuthService>();

Outcome

The transformation delivers measurable improvements:

Centralized authentication logic
Improved security model
Consistent implementation across layers
Reduced technical debt

Best Practices for Adoption

To maximize effectiveness:

Provide clear, structured intent
Review generated plans before execution
Validate changes through testing and code review
Start with targeted scenarios before scaling

Conclusion

GitHub Copilot Workspace represents a meaningful advancement in AI-assisted development. By enabling developers to define intent and delegate execution, it supports repository-wide transformations with greater consistency and efficiency.

As development workflows continue to evolve, tools that combine context awareness, planning, and execution will play a central role in modern engineering practices.

Build and Deploy Logic App Workflows Using Visual Studio Code and CI/CD Pipeline

Devi_Priya — Tue, 05 May 2026 03:39:26 GMT

Throughout this guide, you'll create a Standard logic app workspace and project, build your workflow, and deploy it as a Standard logic app resource in Azure. This enables your workflow to run in a single-tenant Azure Logic Apps environment or within an App Service Environment v3 (restricted to Windows-based App Service plans).

Key advantages of Standard logic apps include:

You can locally develop, debug, run, and test workflows within the Visual Studio Code environment. Although both the Azure portal and Visual Studio Code support building, running, and deploying Standard logic app resources and workflows, Visual Studio Code allows you to perform all these actions locally, offering greater flexibility during development.

Prerequisites

Visual Studio Code
Azure Account extension for Visual Studio Code
Download and install the following Visual Studio Code dependencies for your specific operating system using either method:

Install all dependencies manually --> For manual installation
Install all dependencies automatically.

Starting with version 2.81.5, the Azure Logic Apps (Standard) extension for Visual Studio Code includes a dependency installer that automatically installs all the required dependencies in a new binary folder and leaves any existing dependencies unchanged.

For more information, see Get started more easily with the Azure Logic Apps (Standard) extension for Visual Studio Code.

This extension includes the following dependencies:

Dependency	Description
C# for Visual Studio Code	Enables F5 functionality to run your workflow.
Azurite for Visual Studio Code	Provides a local data store and emulator to use with Visual Studio Code so that you can work on your logic app project and run your workflows in your local development environment. If you don't want Azurite to automatically start, you can disable this option: 1. On the File menu, select Preferences > Settings. 2. On the User tab, select Extensions > Azure Logic Apps (Standard). 3. Find the setting named Azure Logic Apps Standard: Auto Start Azurite, and clear the selected checkbox.
.NET SDK 6.x.x	Includes the .NET Runtime 6.x.x, a prerequisite for the Azure Logic Apps (Standard) runtime.
Azure Functions Core Tools - 4.x version	Installs the version based on your operating system (Windows, macOS, or Linux). These tools include a version of the same runtime that powers the Azure Functions runtime, which the Azure Logic Apps (Standard) extension uses in Visual Studio Code.
Node.js version 16.x.x unless a newer version is already installed	Required to enable the Inline Code Operations action that runs JavaScript.

Set up Visual Studio code

To make sure that all the extensions are correctly installed, reload or restart Visual Studio Code.

Confirm that the Azure Logic Apps Standard: Project Runtime setting for the Azure Logic Apps (Standard) extension is set to version ~4:
On the File menu, go to Preferences > Settings.
On the User tab, go to > Extensions > Azure Logic Apps (Standard).
You can find the Azure Logic Apps Standard: Project Runtime setting here or use the search box to find other settings:

Connect to your Azure account

On the Visual Studio Code Activity Bar, select the Azure icon.

In the Azure window, on the Workspace section toolbar, from the Azure Logic Apps menu, select Create New Project.

From the templates list that appears, select either Stateful Workflowor Stateless Workflow.

Provide a name for your workflow and press Enter.

If Visual Studio Code prompts you to open your project in the current Visual Studio Code or in a new Visual Studio Code window, select Open in current window.

Visual Studio Code finishes creating your project.

The Explorer pane shows your project, which now includes automatically generated project files. For example, the project has a folder that shows your workflow's name. Inside this folder, the workflow.json file contains your workflow's underlying JSON definition.

Open the workflow.json file's shortcut menu, and select Open Designer.

If it asks for Enable connectors in Azure, select Use connectors from Azure

After the Select subscription list opens, select the Azure subscription to use for your logic app project.
After the resource groups list opens, select RG to use for your logic app project.
After you perform this step, Visual Studio Code opens the workflow designer.

After you open a blank workflow in the designer, the Add a trigger prompt appears on the designer. You can now start creating your workflow by adding a trigger and actions and save it.

Run, test, and debug locally

Make sure to start the emulator before you run your workflow:
In Visual Studio Code, from the View menu, select Command Palette.
After the command palette appears, enter Azurite: Start.
On the Visual Studio Code Activity Bar, open the Run menu, and select Start Debugging (F5).

The Terminal window opens so that you can review the debugging session.

Now, find the callback URL for the endpoint on the Request trigger.

Reopen the Explorer pane so that you can view your project.
From the jsonfile's shortcut menu, select Overview.

Click on Run trigger

If it is stateful workflow, you’ll be able to see the status as shown below.

To view it, click on identifier.

It will open a new window with the results.

Note: Incase while using storage account in your workflow if you get any forbidden error then whitelist your IP in that storage account and rerun the workflow by choosing Run and debug in VS code.

Upon completion stop the debug by choosing the stop button and push the code to azure repo using git commands to push the code.

Use a Pipeline to Deploy the Created Workflow

Build.yaml

jobs: - job: logic_app_build displayName: "Build and publish Logic App" steps: - script: sudo apt-get update && sudo apt-get install -y zip displayName: 'Install zip utility' - task: CopyFiles@2 displayName: 'Create project folder' inputs: sourceFolder: '$(System.DefaultWorkingDirectory)' contents: | azure_logicapps/** targetFolder: 'project_output' - task: ArchiveFiles@2 displayName: 'Create project Zip' inputs: rootFolderOrFile: '$(System.DefaultWorkingDirectory)/project_output/azure_logicapps' includeRootFolder: false archiveType: 'zip' archiveFile: '$(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip' replaceExistingArchive: true - task: PublishPipelineArtifact@1 displayName: 'Publish project zip artifact' inputs: targetPath: '$(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip' artifact: 'logicAppCIArtifact' publishLocation: 'pipeline'

Deploy.yaml

jobs: - deployment: deploy_logicapp_resources displayName: Deploy Logic App environment: ${{ parameters.environmentToDeploy }} strategy: runOnce: deploy: steps: - download: current artifact: logicAppCIArtifact - task: AzureFunctionApp@1 displayName: 'Deploy Logic App workflows' inputs: azureSubscription: ${{ parameters.azureServiceConnection }} appType: 'functionApp' appName: ${{ parameters.vars.LogicAppName }} package: '$(Pipeline.Workspace)/logicAppCIArtifact/$(Build.BuildId).zip' deploymentMethod: 'zipDeploy'

Running GitHub Actions Runners on Azure Container Apps with KEDA Autoscaling

shubhijain — Mon, 04 May 2026 13:03:30 GMT

GitHub-hosted runners work well for most scenarios. But as workloads grow, teams often need:

Better cost optimization — pay only when jobs run
More control over execution environments and installed tools
Scalable parallel execution — run 10, 20, or 50 jobs simultaneously
Network access to private resources (databases, internal APIs)

Traditionally, this is solved using self-hosted runners on Virtual Machines. But VMs come with challenges — always-on cost, manual scaling, patching, and maintenance overhead.

In this guide, we'll walk through a modern, serverless alternative:

👉 Running self-hosted GitHub runners on Azure Container Apps Jobs with KEDA autoscaling

What You Will Build

By the end of this guide:

Capability	What You Get
Runner type	Self-hosted, ephemeral (one job = one container)
Scaling	Automatic via KEDA — scales to zero when idle
Cost	Zero cost when no jobs are running
Infrastructure	Fully managed by Azure Container Apps
Security	Secrets stored in Azure Key Vault, Managed Identity for auth

Architecture Overview

Here's how the system works at a high level:

How It Works (Step by Step):

A developer pushes code or triggers a GitHub Actions workflow
GitHub queues the job and looks for a runner with matching labels
KEDA (built into Container Apps) polls the GitHub Actions API for pending jobs
When a pending job is detected, KEDA triggers the Container App Job to start a new execution
A fresh container starts, registers itself as a self-hosted runner with GitHub
The runner picks up the job, executes it, and reports results back to GitHub
The container shuts down and is destroyed — fully ephemeral
When no jobs are pending, KEDA scales back to zero — no cost

Runtime Flow

Pre-requisites

Before you begin, make sure you have:

Requirement	Details
GitHub account	With a repository or organization where you want to run workflows
Azure subscription	With permissions to create resources (Contributor role or higher)
Azure CLI	Installed locally, OR use Azure Cloud Shell (no install needed)
Basic knowledge	Familiarity with GitHub Actions and Azure Portal

Where You'll Run Commands

Throughout this guide, you'll need a terminal to create files and run CLI commands. You have three options:

Option	When to Use	How to Open
VS Code Terminal (Recommended)	You have VS Code installed locally	Open VS Code → Ctrl + `` (backtick) → Terminal opens at the bottom
Azure Cloud Shell	No local tools installed, or restricted machine	Go to portal.azure.com → click the >_ icon in the top toolbar
Any terminal	PowerShell, CMD, Bash — whatever you prefer	Just ensure Azure CLI (az) is installed

💡recommended to use VS Code because you'll create files (Dockerfile, start.sh) AND run commands — VS Code lets you do both in one place.

Azure Resources We Will Create

Resource	Purpose
Resource Group	Logical container for all resources
Azure Container Registry (ACR)	Stores the runner Docker image
Azure Container Apps Environment	Hosting environment for container jobs
Azure Container App Job	The actual runner job definition
Azure Key Vault	Securely stores the GitHub PAT token
Managed Identity	Allows the container job to access ACR and Key Vault without passwords

💡 Note: This guide covers organization-level runners. For repository-level runners, the only difference is the GitHub API endpoint used for registration. We'll call out the differences where applicable.

Step 1: Create a GitHub Personal Access Token (PAT)

Before we touch Azure, we need a token that allows our runner to register with GitHub. You can use either a Fine-grained token (recommended) or a Classic token.

Option A: Fine-Grained PAT (Recommended)

Fine-grained tokens let you scope access to specific repositories only. This is critical for two reasons:

Avoid GitHub API rate limits: KEDA continuously polls the GitHub API for pending jobs. If your token has access to your entire org (potentially hundreds of repos), KEDA scans all of them on every polling cycle. GitHub allows only 5,000 API requests/hour — with broad access, you'll hit this limit quickly and KEDA will stop detecting jobs.
Security: Least-privilege access — the token only works on the repos you explicitly select.

🔑 This is why we recommend fine-grained tokens over classic tokens. By selecting only the repos that need runners, KEDA polls fewer repos and stays well within API limits.

Go to github.com → Settings → Developer settings
Click Personal access tokens → Fine-grained tokens
Click Generate new token
Fill in:

Field	Value
Token name	container-app-runner
Expiration	Choose based on your needs (e.g., 90 days)
Resource owner	Your GitHub username or org
Repository access	Only select repositories → pick the repos where you want runners

⚠️ IMPORTANT — Remember these repo names! The repos you select here are the ONLY repos this token can access. Later in Step 8, when you configure the KEDA scale rule, you must list these exact same repos in the repos metadata field. KEDA uses this token to poll GitHub for pending jobs — if a repo isn't included in the token, KEDA can't see its jobs and your runners won't scale for it.

Example: If you select my-app and my-api here, your KEDA config must have repos: my-app,my-api.

Under Permissions, set the following:

Repository permissions (required):

Permission	Access Level	Why It's Needed
Actions	Read and write	Manage workflow runs and artifacts
Administration	Read and write	Register and manage self-hosted runners
Metadata	Read-only	(Auto-selected, required)
Workflows	Read and write	Update GitHub Action workflow files

Organization permissions (only if using org-level runners):

Permission	Access Level	Why It's Needed
Self-hosted runners	Read and write	Register runners at the org level

📝 For personal accounts (no org): You only need the Repository permissions above. Skip the Organization permissions.

Click Generate token
⚠️ IMPORTANT: Copy the token NOW — you won't be able to see it again!

Option B: Classic PAT

If you prefer a classic token (simpler but broader access):

Go to Settings → Developer settings → Personal access tokens → Tokens (classic)
Click Generate new token (classic)
Select these scopes:

Scope	Why It's Needed
✅ repo	Full access to repositories
✅ workflow	Manage workflows
✅ admin:org	Required for org-level runners (skip for personal repos)

Click Generate token and copy it immediately

⚠️ Classic tokens give access to ALL repositories in your account. For better security and to avoid API rate limits, prefer fine-grained tokens scoped to specific repos.

Save the token somewhere safe temporarily. We'll store it in Azure Key Vault in the next steps.

Where Will Runners Appear on GitHub?

Once runners are deployed and register with GitHub, you can see them here:

For repository-level runners:

Go to your repo → Settings → Actions → Runners

For organization-level runners:

Go to your org → Settings → Actions → Runners

Runners automatically register themselves when the container starts. You'll see them appear as "Idle" or "Active" in the Runners list. You do NOT need to manually create individual runners on GitHub.

Setting Up Runner Groups (Organization-Level) — Recommended

Since this guide is for organization-level runners, it's recommended to create a Runner Group on GitHub. Runner Groups let you control which repositories in your org can use these runners.

Steps to Create a Runner Group:

Go to your GitHub Organization page (e.g., https://github.com/your-org)
Click Settings (top menu bar)
In the left sidebar, expand Actions → click Runner groups
Click New runner group
Fill in:

Field	Value
Name	container-app-runners (or any descriptive name)
Repository access	Choose one:
	• All repositories — any repo in the org can use these runners
	• Selected repositories — pick specific repos (recommended for control)
Allow public repositories	Uncheck this for security (unless needed)
Workflow access	Leave default (all workflows)

⚠️ CRITICAL: If ANY of the repositories using these runners are PUBLIC, you MUST check "Allow public repositories". If this is unchecked, GitHub will silently refuse to dispatch jobs from public repos to runners in this group — the runner will register and show as "Idle", but workflows will stay stuck in "Queued" forever. This is the most common and hardest-to-debug issue with runner groups.

Click Create group

Why Create a Runner Group?

Without Runner Group	With Runner Group
Any repo in the org can use your runners	Only selected repos can use them
Harder to track which teams use runners	Clear visibility and access control
Potential security risk for public repos	Can block public repos from using runners

📝 For personal GitHub accounts: Runner groups are not available. Runners will automatically appear under your repo's Settings → Actions → Runners. No extra setup needed.

Step 2: Create Azure Resources (Portal)

We'll create all the Azure infrastructure through the Azure Portal. If you prefer CLI, see the CLI alternative at the end.

⚠️ IMPORTANT: All Azure resources (Resource Group, ACR, Key Vault, Container Apps Environment, Container App Job) must be in the SAME region. Pick one region (e.g., Central US or West US 2) and use it for everything. Mixing regions can cause connectivity issues and increased latency.

2.1: Create a Resource Group

Go to Azure Portal
Search for "Resource groups" in the top search bar
Click + Create
Fill in:

Field	Value
Subscription	Select your Azure subscription
Resource group	rg-github-runners (or your preferred name)
Region	West US 2 (or any region that supports Container Apps)

Click Review + create → Create

2.2: Create an Azure Container Registry (ACR)

This is where we'll store our runner Docker image.

Search for "Container registries" in the Azure Portal
Click + Create
Fill in:

Field	Value
Subscription	Your subscription
Resource group	rg-github-runners
Registry name	yourregistryname (must be globally unique, lowercase, letters and numbers only)
Location	Same as your resource group (e.g., West US 2)
Pricing plan	Basic (sufficient for this guide; use Standard/Premium for production)

Click Review + create → Create
Once created, note down the Login server (e.g., yourregistryname.azurecr.io) — you'll need this later

2.3: Create an Azure Key Vault

Search for "Key vaults" in the Azure Portal
Click + Create
Fill in:

Field	Value
Subscription	Your subscription
Resource group	rg-github-runners
Key vault name	kv-github-runners (must be globally unique)
Region	Same region
Pricing tier	Standard

Go to the Access configuration tab:
- Select Azure role-based access control (RBAC) as the permission model
Click Review + create → Create

2.4: Store the GitHub PAT in Key Vault

Open your newly created Key Vault
First, give yourself permission:
- Go to Access Control (IAM) → + Add role assignment
- Role: Key Vault Secrets Officer
- Assign to: Your own Azure account
- Click Review + assign
Now go to Objects → Secrets → + Generate/Import
Fill in:

Field	Value
Upload options	Manual
Name	github-pat
Secret value	Paste your GitHub PAT from Step 1

Click Create

Step 3: Create the Runner Docker Image

This is the Docker image that will run as your GitHub Actions runner. We need two files: a Dockerfile and a start.sh script.

3.1: Set Up Your Working Directory

Open VS Code on your local machine
Open the integrated terminal: Ctrl + ` (backtick) or Terminal → New Terminal
Create a new folder and navigate into it:

mkdir github-runner-image cd github-runner-image

You'll create two files in this folder: Dockerfile and start.sh
In VS Code, click File → Open Folder and open the github-runner-image folder (so you can edit files easily)

📌 All commands in Step 3 and Step 4 should be run from inside this github-runner-image folder.

3.2: Choose Your Approach

There are two approaches to creating the runner image:

Approach	Best For	Docker Required Locally?
Option A: Build locally with Docker	Development/testing	✅ Yes
Option B: Build remotely with ACR Tasks	Production / no Docker access	❌ No

We'll create the same files for both approaches. The only difference is the build command.

3.3: Create the start.sh Script

This script runs when the container starts. It registers the runner with GitHub, executes the job, and then the container shuts down.

Create a file named start.sh:

#!/bin/bash set -e # ──────────────────────────────────────────── # CONFIGURATION # ──────────────────────────────────────────── # These values are passed as environment variables # GITHUB_PAT → Your GitHub Personal Access Token # GITHUB_OWNER → Your GitHub org or username # GITHUB_REPO → (Optional) Repository name for repo-level runners # RUNNER_SCOPE → "org" or "repo" # RUNNER_LABELS → Comma-separated labels (e.g., "container-app,linux") # RUNNER_GROUP → Runner group name (org-level only, default: "Default") # Set this to the runner group you created in Step 1 (e.g., "container-app-runners") # If not set, runners register in GitHub's "Default" group — which means # ANY repo in the org can use them and you lose access control. RUNNER_SCOPE="${RUNNER_SCOPE:-org}" RUNNER_LABELS="${RUNNER_LABELS:-container-app}" RUNNER_GROUP="${RUNNER_GROUP:-Default}" # ⚠️ IMPORTANT: Always set the RUNNER_GROUP environment variable on your Container App Job # to match the runner group you created on GitHub (e.g., "container-app-runners"). # The "Default" fallback above is only a safety net — do NOT rely on it. # ──────────────────────────────────────────── # GET REGISTRATION TOKEN # ──────────────────────────────────────────── if [ "$RUNNER_SCOPE" == "org" ]; then echo "🔑 Requesting registration token for organization: $GITHUB_OWNER" REG_TOKEN=$(curl -s -X POST \ -H "Authorization: token $GITHUB_PAT" \ -H "Accept: application/vnd.github+json" \ "https://api.github.com/orgs/${GITHUB_OWNER}/actions/runners/registration-token" \ | jq -r .token) RUNNER_URL="https://github.com/${GITHUB_OWNER}" else echo "🔑 Requesting registration token for repository: $GITHUB_OWNER/$GITHUB_REPO" REG_TOKEN=$(curl -s -X POST \ -H "Authorization: token $GITHUB_PAT" \ -H "Accept: application/vnd.github+json" \ "https://api.github.com/repos/${GITHUB_OWNER}/${GITHUB_REPO}/actions/runners/registration-token" \ | jq -r .token) RUNNER_URL="https://github.com/${GITHUB_OWNER}/${GITHUB_REPO}" fi if [ -z "$REG_TOKEN" ] || [ "$REG_TOKEN" == "null" ]; then echo "❌ Failed to get registration token. Check your GITHUB_PAT and permissions." exit 1 fi echo "✅ Registration token obtained successfully" # ──────────────────────────────────────────── # CONFIGURE RUNNER # ──────────────────────────────────────────── echo "⚙️ Configuring runner..." ./config.sh --unattended \ --name "runner-$(hostname)" \ --url "$RUNNER_URL" \ --token "$REG_TOKEN" \ --runnergroup "$RUNNER_GROUP" \ --ephemeral \ --labels "$RUNNER_LABELS" \ --replace echo "🚀 Starting runner..." ./run.sh

Key flags explained:

--ephemeral: Runner processes one job then exits (container stops)
--runnergroup: Registers the runner in a specific runner group (org-level only)
--replace: Replaces any existing runner with the same name
--unattended: No interactive prompts

⚠️ Do NOT use --disableupdate! In newer GitHub versions, this flag prevents GitHub from dispatching jobs to the runner. The runner will appear as "Idle" but never pick up work.

3.4: Create the Dockerfile

Create a file named Dockerfile:

FROM ubuntu:22.04 # Prevent interactive prompts during package installation ENV DEBIAN_FRONTEND=noninteractive # Install required dependencies RUN apt-get update && apt-get install -y \ curl \ git \ jq \ ca-certificates \ unzip \ wget \ apt-transport-https \ software-properties-common \ && rm -rf /var/lib/apt/lists/* # Create a non-root user for the runner (GitHub requires this) RUN useradd -m runner # Set up the runner directory WORKDIR /home/runner/actions-runner # Download the latest GitHub Actions Runner # Check latest version: curl -s https://api.github.com/repos/actions/runner/releases/latest | jq -r '.tag_name' ARG RUNNER_VERSION=2.334.0 RUN curl -L -o actions-runner.tar.gz \ "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz" \ && tar xzf actions-runner.tar.gz \ && rm actions-runner.tar.gz # Install runner dependencies RUN ./bin/installdependencies.sh # Copy the startup script COPY start.sh . RUN chmod +x start.sh # Set ownership to the runner user RUN chown -R runner:runner /home/runner # Switch to non-root user USER runner ENTRYPOINT ["./start.sh"]

💡 Tip: Check the latest runner version by running:

curl -s https://api.github.com/repos/actions/runner/releases/latest | jq -r '.tag_name'

At the time of writing, 2.334.0 is the latest. Update the ARG RUNNER_VERSION value if a newer version is available.

⚠️ Using a deprecated runner version will cause runners to connect but refuse to pick up jobs. You'll see the error: "Runner version vX.X.X is deprecated and cannot receive messages." Always use a recent version.

Step 4: Build and Push the Docker Image

Now we need to build this image and push it to your Azure Container Registry (ACR).

Option A: Build Locally with Docker (Development)

Use this if you have Docker Desktop installed on your machine.

Where to run: In the VS Code terminal (or any terminal), make sure you're inside the github-runner-image folder where your Dockerfile and start.sh are located.

# 0. First, log in to Azure (this opens a browser window for authentication) az login # 1. Log in to your ACR (replace yourregistryname with your actual ACR name) az acr login --name yourregistryname # 2. Build the image docker build -t yourregistryname.azurecr.io/github-runner:v1 . # 3. Push the image to ACR docker push yourregistryname.azurecr.io/github-runner:v1

📌 Make sure Docker Desktop is running before you execute these commands. If you see "Cannot connect to the Docker daemon", start Docker Desktop first.

🔒 Don't have az login or Docker on your machine? Use Azure Cloud Shell instead — it's a browser-based terminal at shell.azure.com that comes pre-authenticated with Azure CLI (no az login needed) and has Docker available. See Option B below if you can't use Docker at all.

Option B: Build Remotely with ACR Tasks (Production — No Docker Needed) ⭐

This is the recommended approach for production environments where:

You don't have Docker installed
Your production environment is fully private / locked down
You want to build images directly in Azure without any local tooling

Where to run:

VS Code terminal → Run az login first, then the build command
Azure Cloud Shell (shell.azure.com) → No az login needed, you're already authenticated. Upload your Dockerfile and start.sh files using the Upload button in Cloud Shell, then run the build command

You must be inside the folder where your Dockerfile and start.sh are located.

# If running from VS Code terminal (skip this line if using Azure Cloud Shell) az login # Build directly in ACR — no Docker required! az acr build --registry yourregistryname --image github-runner:v1 --file Dockerfile .

☝️ That's it — one single command. No Docker install, no Docker daemon, nothing.

Using Azure Cloud Shell? To upload files:

Open shell.azure.com
Click the Upload/Download button (📁 icon) in the toolbar
Upload Dockerfile and start.sh
They'll land in your home directory (~/). Run az acr build from there.

This command:

Uploads your source code to ACR
Builds the Docker image in Azure (not on your machine)
Tags and stores it in your registry
No Docker daemon needed at all!

🔒 Production Note: In locked-down environments where even az acr build isn't possible from your machine, you can:

Use Azure Cloud Shell (browser-based, always has Azure CLI)
Set up an ACR Task with a Git trigger — ACR automatically builds when you push to a repo
Use Azure DevOps / GitHub Actions pipeline to build and push the image

Example: Auto-build from a GitHub repo (single line for Cloud Shell)

az acr task create --registry yourregistryname --name build-runner-image --image "github-runner:{{.Run.ID}}" --context https://github.com/your-org/your-runner-repo.git --file Dockerfile --git-access-token YOUR_GITHUB_PAT

Verify the Image

After building, confirm the image exists in your registry:

Portal: Go to your ACR → Repositories → you should see github-runner listed

CLI:

az acr repository list --name yourregistryname --output table

Step 5: Create the Container Apps Environment

The Container Apps Environment is the hosting platform for your container jobs. Think of it as the "cluster" where your runners will live.

Steps (Azure Portal):

Search for "Container Apps Environment" in the Azure Portal
Click + Create
Fill in:

Tab	Field	Value
Basics	Subscription	Your subscription
	Resource group	rg-github-runners
	Environment name	cae-github-runners
	Region	West US 2 (same as other resources)
	Environment type	Consumption only (or Consumption + Dedicated if you need workload profiles)
Monitoring	Log Analytics workspace	Create new or select existing

Leave Networking as defaults for now (we'll discuss production networking later)
Click Review + create → Create

⏳ This takes 1-2 minutes to create.

Step 6: Create the Container App Job

This is the core resource — the Container App Job that will run your GitHub runners.

⚠️ IMPORTANT: Complete Step 4 (Build & Push Image) BEFORE this step. The Container App Job creation form requires you to select a container image from your ACR. If your ACR is empty (no images pushed), the portal will show an error: "The ACR does not have any images. Please push an image to the ACR and try again." So make sure your image is pushed first!

Steps (Azure Portal):

Search for "Container App Jobs" in the Azure Portal search bar
Click + Create Container App Job

Tab 1: Basics

Field	Value	Notes
Subscription	Your subscription
Resource group	rg-github-runners	Same RG as other resources
Container app job name	github-runner-job	Lowercase, letters, numbers, hyphens
Region	West US 2	Must match your CAE region
Container Apps Environment	cae-github-runners	Select the environment created in Step 5

Click Next: Container >

Tab 2: Container

Field	Value	Notes
Name	github-runner	Name of the container within the job
Image source	Azure Container Registry	Select this radio button
Registry	yourregistryname.azurecr.io	Select your ACR from the dropdown
Image	github-runner	Select the image you pushed
Image tag	v1	The tag you used during build
Registry authentication	Managed identity	Leave as default
Managed identity	System assigned Identity (environment)	Leave as default — Azure will auto-assign ACR Pull role
Command override	(Leave empty)	Our Dockerfile already has an ENTRYPOINT
Arguments override	(Leave empty)	Not needed
Workload profile	Consumption	Default is fine
CPU and memory	0.5 CPU cores, 1 Gi memory	Increase if your jobs need more resources

💡 CPU/Memory guidance:

0.5 CPU / 1 Gi — Light jobs (linting, simple tests)
1 CPU / 2 Gi — Medium jobs (building apps, running test suites)
2 CPU / 4 Gi — Heavy jobs (compiling large projects, ML workloads)

Environment Variables

Click + Add for each variable:

Name	Source	Value
GITHUB_OWNER	Manual	Your GitHub org name (e.g., Quality-Framework)
RUNNER_SCOPE	Manual	org (or repo for repo-level runners)
RUNNER_LABELS	Manual	container-app
GITHUB_REPO	Manual	Comma-separated repo names that this runner will serve (e.g., my-app,my-api,my-infra). These should match the repos selected in your PAT (Step 1).
RUNNER_GROUP	Manual	The GitHub runner group name (e.g., container-app-runners). Must match the group created in Step 1. If not set, defaults to Default.
GITHUB_PAT	(We'll configure this as a secret reference in Step 7)

⚠️ Do NOT put the PAT directly as an environment variable value. We will securely reference it from Key Vault in Step 7.

For now, add only GITHUB_OWNER, RUNNER_SCOPE, RUNNER_LABELS, GITHUB_REPO, and RUNNER_GROUP. We'll add GITHUB_PAT after setting up the identity and secret.

⚠️ Make sure RUNNER_SCOPE matches runnerScope in the scale rule below. If you're using org-level runners, both should be org.

Scale Rule Settings (same page, scroll down)

Below the environment variables, you'll see Scale rule settings:

Field	Value	Notes
Min executions	0	Scale to zero when no jobs are pending
Max executions	5	Maximum parallel runners (adjust as needed). Do NOT set to 0 — this means unlimited and can cause hundreds of runner containers to spawn!
Polling interval	30	How often (in seconds) KEDA checks for pending jobs

Scale Rules — Click + Add

A side panel opens with the "Add scale rule" form. Fill in:

Field	Value
Rule name	github-runner-rule
Custom rule type	github-runner

Scale parameters (key-value pairs — click + Add for each):

Name	Value	Notes
githubAPIURL	https://api.github.com	Pre-filled by the portal, leave as-is
owner	your-github-org	Your GitHub org or username (e.g., Quality-Framework)
runnerScope	org	org for org-level, repo for repo-level
repos	your-repo-name	Must match the repos you selected in your PAT (Step 1). Comma-separated, no spaces.
targetWorkflowQueueLength	1	Number of pending jobs needed to trigger one runner
labels	container-app	Must match RUNNER_LABELS env var and runs-on in your workflow

⚠️ CRITICAL: Do NOT skip the labels parameter! Without it, KEDA cannot match pending jobs to your runner and will always show MetricValue: 0.00 — meaning no containers will ever start. This is the most common setup mistake.

Authentication:

Secret reference	Trigger parameter
(Leave empty for now — we'll configure this after creating the job and setting up Key Vault secrets in Step 7)	personalAccessToken

💡 The portal may let you save the scale rule without a secret reference. If it blocks you, delete the Authentication row entirely, click Add scale rule, and we'll add the authentication after the job is created.

Click Add scale rule → Then click Review + create → Create

⏳ Creation takes about 1 minute.

⚠️ First-time creation may fail with an image pull error if you're creating a new Container Apps Environment at the same time. This happens because the environment's managed identity gets the AcrPull role during deployment, but the image pull happens before the role propagates. If this occurs, simply Redeploy or create the Container App Job again — the role is already assigned and the second attempt will succeed.

Step 7: Configure Managed Identity and Secrets

Now we need to:

Enable Managed Identity on the Container App Job
Grant it access to Key Vault and ACR
Reference the GitHub PAT as a secret

7.1: Enable System-Assigned Managed Identity

Open your Container App Job (github-runner-job)
In the left menu, go to Settings → Identity
Under System assigned, toggle Status to On
Click Save → Click Yes to confirm
Note the Object ID that appears — you'll need this

7.2: Grant Key Vault Access

Go to your Key Vault (kv-github-runners)
Go to Access Control (IAM) → + Add role assignment
Fill in:

Field	Value
Role	Key Vault Secrets User
Assign access to	Managed identity
Members	Search for github-runner-job and select it

Click Review + assign

7.3: Grant ACR Pull Access

Go to your Container Registry (yourregistryname)
Go to Access Control (IAM) → + Add role assignment
Fill in:

Field	Value
Role	AcrPull
Assign access to	Managed identity
Members	Search for github-runner-job and select it

Click Review + assign

7.4: Add GitHub PAT as a Secret Reference

Go back to your Container App Job (github-runner-job)
In the left menu, go to Settings → Secrets
Click + Add
Fill in:

Field	Value
Type	Key Vault reference
Key	github-pat
Key Vault secret URL	Select your Key Vault and the github-pat secret
Managed Identity	System assigned

Click Add

7.5: Map the Secret to an Environment Variable

Go to Settings → Containers
Click on your container → Edit
Go to the Environment variables tab
Click + Add:

Name	Source	Value
GITHUB_PAT	Reference a secret	github-pat

Click Save

💡 If the Save button is greyed out, try making a small edit to another field first (e.g., click into a value and click out) to trigger the save state. Alternatively, re-create the container with the correct env vars.

Step 8: Configure KEDA Scale Rule (GitHub Runner Scaler)

This is where the magic happens. KEDA's GitHub Runner scaler monitors the GitHub Actions API for pending workflow jobs and scales your Container App Job accordingly.

Steps:

Open your Container App Job (github-runner-job)
Go to Settings → Scale (or Scale and replicas)
Under Scale rule, click + Add
Fill in the scale rule:

Field	Value	Notes
Name	github-runner-rule	Any descriptive name
Type	Custom	Select Custom
Custom rule type	github-runner	This is the KEDA scaler type

Under Metadata, add these key-value pairs:

Key	Value	Notes
owner	your-github-org	Your GitHub org or username (e.g., Quality-Framework)
repos	repo1,repo2	Must match the repos you selected in your PAT token (Step 1). Comma-separated, no spaces. E.g., qualityframework-demo,qualityframework-bicep. Do NOT leave empty — if left empty, KEDA scans ALL org repos and you'll hit GitHub API rate limits.
runnerScope	org	org for org-level runners, repo for repo-level
labels	container-app	Must match the RUNNER_LABELS env var and the runs-on labels in your workflow YAML
targetWorkflowQueueLength	1	Number of pending jobs needed to trigger one new runner instance

Under Authentication, add:

Key	Value
Secret reference	github-pat
Trigger parameter	personalAccessToken

Click Add → Save

Understanding the Scale Rule

Scenario	KEDA Action
0 pending jobs with container-app label	0 runners (scale to zero)
1 pending job	Starts 1 container
3 pending jobs	Starts 3 containers (up to max)
Jobs complete	Containers stop, scale back to zero

Step 9: Test the Setup

9.1: Create a Test Workflow

In any repository within your GitHub organization, create a workflow file:

File: .github/workflows/test-container-runner.yml

name: Test Container App Runner on: workflow_dispatch: # Allows manual trigger from GitHub UI push: branches: [main] jobs: test-runner: runs-on: [self-hosted, container-app] # Must match your RUNNER_LABELS steps: - name: Checkout code uses: actions/checkout@v4 - name: Print runner info run: | echo "✅ Hello from Azure Container App Runner!" echo "Runner Name: $RUNNER_NAME" echo "Runner OS: $RUNNER_OS" echo "Workspace: $GITHUB_WORKSPACE" - name: Run a simple test run: | echo "Current directory: $(pwd)" echo "Files in repo:" ls -la echo "System info:" uname -a echo "Memory:" free -h

9.2: Trigger the Workflow

Go to your repository on GitHub
Click Actions tab
Select "Test Container App Runner" from the left sidebar
Click Run workflow → Run workflow

9.3: Watch It Work

In the GitHub Actions tab, you'll see the job show as "Queued"
In Azure Portal, go to your Container App Job → Execution history
Within ~30 seconds (your polling interval), you should see a new execution start
The job will:
- Container starts → Runner registers → Job executes → Container stops
Back in GitHub, the workflow run should show as ✅ completed

Troubleshooting

Problem	Likely Cause	Fix
Job stays "Queued" forever	KEDA not detecting jobs	Check scale rule metadata — ensure labels, owner, repos, runnerScope are all filled correctly
Container starts but workflow doesn't complete	Wrong secret reference or empty env vars	Verify GITHUB_PAT env var points to correct secret, and all env vars have values
"Permission denied" errors	PAT missing required scopes	Edit PAT and add missing permissions (Step 1)
Image pull errors on first deploy	ACR access timing issue	Redeploy — the AcrPull role was assigned but hadn't propagated yet
Runner registers but job doesn't run	Label mismatch	Ensure runs-on labels in workflow match RUNNER_LABELS env var AND labels in KEDA scale rule
Runner shows "Idle" but job stays queued	--disableupdate flag or wrong runner group	Remove --disableupdate from start.sh and rebuild the image. Also verify RUNNER_GROUP env var matches the runner group name on GitHub, and the group has the correct repos assigned
Runner version deprecated error	Outdated runner binary	Update RUNNER_VERSION in Dockerfile to the latest version and rebuild. Run curl -s https://api.github.com/repos/actions/runner/releases/latest \| jq -r '.tag_name' to check
Hundreds of offline runners spawning	maxExecutions set to 0 (unlimited)	Set maxExecutions to a reasonable limit (e.g., 5 or 10) in the scale rule settings
Runner idle + job queued forever (public repo)	Runner group blocks public repos	Go to Org Settings → Actions → Runner groups → your group and check "Allow public repositories". Without this, GitHub silently refuses to dispatch jobs from public repos to runners in the group

To check container logs:

Go to your Container App Job → Monitoring → Logs and run:

ContainerAppConsoleLogs_CL | where ContainerGroupName_s startswith "github-runner-job" | where TimeGenerated > ago(30m) | order by TimeGenerated desc | take 20

Production Considerations

🔒 Networking: Private Environments

In production, your Container Apps Environment may be deployed inside a VNet with no public internet access. Here's how to handle that:

Private ACR Access

Container App Job ──(private endpoint)──► ACR

Create a Private Endpoint for your ACR
Disable public access on ACR
Ensure your Container Apps Environment is in the same (or peered) VNet

Private Key Vault Access

Same pattern — create a Private Endpoint for Key Vault and disable public access.

GitHub API Access

Your runner containers need outbound access to:

github.com (runner registration)
api.github.com (KEDA polling)
*.actions.githubusercontent.com (downloading actions)

If using a firewall or NSG, ensure these are allowed.

🔐 Using GitHub App Instead of PAT (Recommended for Production)

PATs are tied to individual users and have broad scopes. For production, consider using a GitHub App:

Create a GitHub App in your organization
Grant it Organization Self-hosted runners: Read & Write permissions
Install the app in your organization
Use the App ID and Private Key instead of PAT

The KEDA GitHub Runner scaler supports GitHub App authentication natively.

📊 Monitoring and Alerting

Set up monitoring for your runners:

Container App Job Metrics (Azure Monitor):
- Execution count
- Execution duration
- Failed executions
Alerts to set up:
- Alert when executions fail repeatedly
- Alert when execution queue is growing (KEDA can't keep up)
- Alert when runner registration fails (PAT expired?)
Log Analytics queries:

// Find failed executions ContainerAppConsoleLogs_CL | where ContainerGroupName_s startswith "github-runner-job" | where Log_s contains "error" or Log_s contains "failed" | order by TimeGenerated desc

💰 Cost Optimization

Strategy	Impact
Scale to zero (min: 0)	No cost when idle
Right-size CPU/memory	Don't over-provision
Set reasonable max executions	Prevent runaway costs
Use Consumption plan	Pay per-second billing
Set replica timeout	Kill stuck jobs

🔄 Keeping the Runner Image Updated

Runner versions get outdated. Set up automated rebuilds:

# Create a scheduled ACR Task to rebuild weekly (single line for Cloud Shell) az acr task create --registry yourregistryname --name rebuild-runner-weekly --image github-runner:latest --context https://github.com/your-org/runner-image-repo.git --file Dockerfile --schedule "0 0 * * 0" --git-access-token YOUR_PAT

Appendix A: CLI Commands for All Steps

If you prefer CLI over Portal, here are all the commands. Run these in Azure Cloud Shell or any terminal with Azure CLI.

All commands are written as single lines so they work directly in Azure Cloud Shell without formatting issues.

# ─── Set your variables (update these values) ─── RESOURCE_GROUP="rg-github-runners" LOCATION="westus2" ACR_NAME="yourregistryname" KV_NAME="kvgithubrunners" CAE_NAME="cae-github-runners" JOB_NAME="github-runner-job" IMAGE_NAME="github-runner" IMAGE_TAG="v1" GITHUB_ORG="your-github-org" # ─── Step 1: Create Resource Group ─── az group create --name $RESOURCE_GROUP --location $LOCATION # ─── Step 2: Create ACR ─── az acr create --name $ACR_NAME --resource-group $RESOURCE_GROUP --sku Basic # ─── Step 3: Create Key Vault ─── az keyvault create --name $KV_NAME --resource-group $RESOURCE_GROUP --location $LOCATION --enable-rbac-authorization # ─── Step 4: Store PAT in Key Vault ─── az keyvault secret set --vault-name $KV_NAME --name "github-pat" --value "YOUR_PAT_HERE" # ─── Step 5: Build Image with ACR Tasks (run from the folder with Dockerfile) ─── az acr build --registry $ACR_NAME --image $IMAGE_NAME:$IMAGE_TAG . # ─── Step 6: Create Container Apps Environment ─── az containerapp env create --name $CAE_NAME --resource-group $RESOURCE_GROUP --location $LOCATION # ─── Step 7: Create Container App Job (single command) ─── az containerapp job create --name $JOB_NAME --resource-group $RESOURCE_GROUP --environment $CAE_NAME --trigger-type Event --replica-timeout 1800 --replica-retry-limit 1 --replica-completion-count 1 --parallelism 1 --image "$ACR_NAME.azurecr.io/$IMAGE_NAME:$IMAGE_TAG" --cpu "0.5" --memory "1Gi" --min-executions 0 --max-executions 5 --polling-interval 30 --scale-rule-name "github-runner-rule" --scale-rule-type "github-runner" --scale-rule-metadata "owner=$GITHUB_ORG" "runnerScope=org" "labels=container-app" "targetWorkflowQueueLength=1" --scale-rule-auth "personalAccessToken=github-pat" --secrets "github-pat=keyvaultref:https://$KV_NAME.vault.azure.net/secrets/github-pat,identityref:system" --env-vars "GITHUB_PAT=secretref:github-pat" "GITHUB_OWNER=$GITHUB_ORG" "RUNNER_SCOPE=org" "RUNNER_LABELS=container-app" "RUNNER_GROUP=container-app-runners" --registry-server "$ACR_NAME.azurecr.io" --registry-identity "system"

Appendix B: Repo-Level Runner Changes

If you want runners at the repository level instead of organization level:

Setting	Org-Level	Repo-Level
PAT scope	admin:org	repo only
RUNNER_SCOPE env var	org	repo
GITHUB_REPO env var	Not needed	Required (e.g., my-repo)
KEDA runnerScope	org	repo
KEDA repos	Optional	Required

Summary

Here's what we built:

Component	What It Does
Dockerfile + start.sh	Creates an ephemeral GitHub Actions runner image
ACR	Stores the runner image securely
Key Vault	Stores the GitHub PAT securely
Container Apps Environment	Provides the hosting platform
Container App Job	Runs the runner containers on demand
KEDA Scale Rule	Automatically scales runners based on pending GitHub jobs
Managed Identity	Connects everything securely without passwords

The Result:

✅ Zero cost when idle — no VMs running 24/7 ✅ Automatic scaling — KEDA handles it ✅ Ephemeral runners — clean environment every time ✅ Secure — secrets in Key Vault, Managed Identity for auth ✅ No Docker required — ACR Tasks builds images in the cloud ✅ Production ready — private networking, monitoring, automated image updates

-------------------------------------------------------------------------------------------

Have questions or feedback? Drop a comment below!

Tags: Azure, Container Apps, GitHub Actions, KEDA, Self-hosted Runners, DevOps, Serverless

Modernizing Terraform Pipelines on Azure: OIDC Federation for GitHub Actions and Azure DevOps

ssinghkalra — Sat, 02 May 2026 20:34:19 GMT

The secret nobody wants to rotate

Most Terraform-on-Azure pipelines we see still authenticate the same way they did three years ago. A long-lived ARM_CLIENT_SECRET sitting in GitHub Actions or Azure DevOps, set once, copied around, and rotated only when something breaks.

It's the most ignored credential in the cloud, and statistically the most likely one to leak. A developer screenshots a variable group. A pipeline log echoes a value. A fork inherits a secret. Or the secret simply expires on a Friday evening and takes production deployments with it.

Workload Identity Federation (WIF) makes this whole class of problem go away. The pipeline mints a short-lived token at runtime, exchanges it for an Azure access token via Microsoft Entra, and never touches a secret. GitHub Actions has supported it since 2021. Azure DevOps service connections went GA with WIF in February 2024. The azurerm Terraform provider has supported it since v3.7.

This post walks through the pattern end-to-end, for both GitHub Actions and Azure DevOps, the way I've rolled it out across multiple customer estates.

How the exchange actually works

Before any YAML, it helps to picture what's happening:

The CI system (GitHub or ADO) signs a short-lived JWT describing exactly what's running- which repo, which branch, which environment, which service connection.
The pipeline sends that JWT to Microsoft Entra ID.
Entra checks it against a federated identity credential you've configured on a managed identity or app registration. The iss, sub, and aud claims must match case-sensitively.
If it matches, Entra returns an Azure access token valid for the duration of the job.
Terraform uses it. The job ends. The token expires. Nothing persists.

The token is bound to a specific subject like repo:contoso/platform:environment:prod or sc://contoso/platform/azure-prod. It can't be reused from another repo, branch, or pipeline.

Recommended Architecture

A few choices that usually hold up in production:

Decision	Choice
Identity type	User-assigned managed identity (UAMI), not app registration
Identity granularity	One UAMI per environment (not per pipeline)
Trust scope	Pinned to the environment claim, not the branch
RBAC scope	Resource group, not subscription
Remote state	OIDC + use_azuread_auth = true, shared key access disabled

Why UAMIs? They live in your subscription, don't need Application Administrator rights to manage, and follow the lifecycle of the resource group they belong to. Why one per environment? Pipeline-per-identity explodes into hundreds of identities. Environment-per-identity maps cleanly to deployment scopes.

Part 1 - GitHub Actions

Step 1: Create the identity and federate it

Two commands per environment. That's it.

az identity create -g rg-platform-identity -n id-tf-prod -l eastus az identity federated-credential create \ --name github-prod \ --identity-name id-tf-prod \ --resource-group rg-platform-identity \ --issuer https://token.actions.githubusercontent.com \ --subject repo:contoso/platform:environment:prod \ --audiences api://AzureADTokenExchange

Repeat for nonprod. No secret is created anywhere.

Step 2: Wire it up in GitHub

In repo Settings → Environments, create nonprod and prod. On prod, add required reviewers and a branch rule restricting deployments to main. Then add three environment variables (not secrets - these aren't sensitive): AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID.

The workflow itself stays small:

permissions: id-token: write contents: read jobs: apply: runs-on: ubuntu-latest environment: prod env: ARM_USE_OIDC: "true" ARM_CLIENT_ID: ${{ vars.AZURE_CLIENT_ID }} ARM_TENANT_ID: ${{ vars.AZURE_TENANT_ID }} ARM_SUBSCRIPTION_ID: ${{ vars.AZURE_SUBSCRIPTION_ID }} steps: - uses: actions/checkout@v4 - uses: hashicorp/setup-terraform@v3 - run: terraform init && terraform apply -auto-approve

Three things make this secure:

id-token: write is the only elevated permission, and it doesn't grant write access to anything in GitHub, it just lets the runner mint a JWT.
The environment: line picks the right AZURE_CLIENT_ID and drives the sub claim. The federation refuses anything else.
No azure/login step is needed for Terraform. The azurerm provider reads GitHub's OIDC environment variables automatically.

Part 2 - Azure DevOps

The model is identical. The mechanics are different.

ADO offers two creation paths for a WIF service connection: automatic (it creates an app registration for you) and manual (you bring your own UAMI). For platform teams, manual + UAMI is almost always the better choice to ensure identity lives where governance lives.

The flow is a small dance between the two portals:

In Azure DevOps, create a new ARM service connection → choose Workload Identity Federation (manual) → fill in your UAMI's client ID, tenant ID, and subscription. Save as draft. ADO shows you an issuer URL and a subject identifier.
In Azure, on the UAMI, add a federated credential using the values ADO showed you. The subject looks like sc://contoso/platform/azure-prod.
Back in ADO, click Verify and save.

In the pipeline, the service connection only "activates" if a task in the job loads it. The simplest way is the AzureCLI@2 task:

- task: AzureCLI@2 inputs: azureSubscription: azure-prod # the WIF service connection scriptType: bash scriptLocation: inlineScript inlineScript: | terraform init && terraform apply -auto-approve env: ARM_USE_OIDC: "true" ARM_CLIENT_ID: $(AZURE_CLIENT_ID) ARM_TENANT_ID: $(AZURE_TENANT_ID) ARM_SUBSCRIPTION_ID: $(AZURE_SUBSCRIPTION_ID) ARM_ADO_PIPELINE_SERVICE_CONNECTION_ID: $(SERVICE_CONNECTION_ID) SYSTEM_ACCESSTOKEN: $(System.AccessToken) SYSTEM_OIDCREQUESTURI: $(System.OidcRequestUri)

For teams converting dozens of legacy connections, the Azure DevOps team published a PowerShell helper that walks every ARM service connection in a project and converts them in place. There's a 7-day rollback window on each connection, which makes the migration genuinely low-risk.

Don't forget the state file

The Terraform state is your real blast radius. With OIDC, it's almost free to lock it down too. The same UAMI can read and write blob data without the storage account key:

backend "azurerm" { resource_group_name = "rg-tfstate" storage_account_name = "sttfstateprodeastus" container_name = "platform-prod" key = "platform.tfstate" use_oidc = true use_azuread_auth = true }

Grant the UAMI Storage Blob Data Contributor on the container (not the account), disable shared key access on the storage account, and you've removed the last secret in the pipeline.

RBAC and break-glass

Federation removes a credential, not a privilege. A few habits worth keeping:

Scope role assignments to resource groups, not subscriptions. The whole point of federation is that scoping is now trivially easy.
Use Role Based Access Control Administrator instead of User Access Administrator if your Terraform creates role assignments. It's a more recent, narrower role.
Have a documented break-glass. If GitHub or ADO has a token-service incident, you still need a path to ship a hotfix. A single hardware-key-protected emergency app registration in a separate identity boundary works well, audited monthly.
Monitor sign-ins. Every federated exchange shows up in Entra sign-in logs as a service principal sign-in. Pipe these to Sentinel and alert on anomalies like sign-ins outside expected hours, or from IPs outside GitHub's published ranges.

The errors you will hit (and what they really mean)

Symptom	What it actually is
AADSTS70021: No matching federated identity record found	Case-sensitive mismatch in iss, sub, or aud. Almost always a trailing slash or a capitalised character
AADSTS700016: Application not found in directory	Wrong client ID or tenant. Not a federation problem
403 on a resource even though token exchange worked	Federation is fine. Your RBAC isn't. Check the exact scope
Unable to determine OIDC token (ADO)	No task in the job loaded the service connection. Add an AzureCLI@2 step
Works on main, fails on tags	You pinned sub to a branch ref. Add a second federated credential for tags, or move to environment-based scoping

Migrating without a maintenance window

You almost never get to do this on a greenfield repo. The order that has worked for me on legacy estates:

Create the new UAMI alongside the old service principal, with the same role assignments.
Federate one canary pipeline. Verify it deploys equivalently.
Cut over pipelines in waves, lowest-risk environment first.
Once a full release cycle passes cleanly, disable the old SP's secret.
Wait another cycle. Then delete the SP entirely.
Add a CI gate that fails any new pipeline introducing ARM_CLIENT_SECRET.

The old and new auth methods coexist on the same subscription throughout. There's no hard cutover and no maintenance window, just a steady drift toward zero secrets.

Wrapping up

If you do nothing else after reading this, do one thing: search your CI variable groups for ARM_CLIENT_SECRET. Every result is an outage or a breach waiting to happen.

Federation is one of those rare changes that's both more secure and less work to operate. Once you've set it up, you stop thinking about credential rotation, secret expiry, and quarterly access reviews for service principals. The pipeline simply runs, and the audit trail is in Entra where it belongs.

That's a good trade.