Apps on Azure Blog

Event-Driven IaC Operations with Azure SRE Agent: Terraform Drift Detection via HTTP Triggers

Vineela-Suri
Microsoft
Apr 16, 2026

Drift detection is a solved problem. terraform plan tells you what changed. But what happens next — who changed it, why, and whether it's safe to revert — that's still a manual investigation every single time. Until now.

What Happens After terraform plan Finds Drift?

If your team is like most, the answer looks something like this:

  1. A nightly terraform plan runs and finds 3 drifted resources
  2. A notification lands in Slack or Teams
  3. Someone files a ticket
  4. During the next sprint, an engineer opens 4 browser tabs — Terraform state, Azure Portal, Activity Log, Application Insights — and spends 30 minutes piecing together what happened
  5. They discover the drift was caused by an on-call engineer who scaled up the App Service during a latency incident at 2 AM
  6. They revert the drift with terraform apply
  7. The app goes down because they just scaled it back down while the bug that caused the incident is still deployed

Step 7 is the one nobody talks about. Drift detection tooling has gotten remarkably good — scheduled plans, speculative runs, drift alerts — but the output is always the same: a list of differences. What changed. Not why. Not whether it's safe to fix.

The gap isn't detection. It's everything that happens after detection.

HTTP Triggers in Azure SRE Agent close that gap. They turn the structured output that drift detection already produces — webhook payloads, plan summaries, run notifications — into the starting point of an autonomous investigation. Detection feeds the agent. The agent does the rest: correlates with incidents, reads source code, classifies severity, recommends context-aware remediation, notifies the team, and even ships a fix.

Here's what that looks like end to end.

What you'll see in this blog:

  • An agent that classifies drift as Benign, Risky, or Critical — not just "changed"
  • Incident correlation that links a SKU change to a latency spike in Application Insights
  • A remediation recommendation that says "Do NOT revert" — and why reverting would cause an outage
  • A Teams notification with the full investigation summary
  • An agent that reviews its own performance, finds gaps, and improves its own skill file
  • A pull request the agent created on its own to fix the root cause

The Pipeline: Detection to Resolution in One Webhook

The architecture is straightforward. Terraform Cloud (or any drift detection tool) sends a webhook when it finds drift. An Azure Logic App adds authentication. The SRE Agent's HTTP Trigger receives it and starts an autonomous investigation.

The end-to-end pipeline: Terraform Cloud detects drift and sends a webhook to the Azure Logic App auth bridge, which adds Azure AD authentication via Managed Identity and forwards the request to the SRE Agent's HTTP Trigger. The trigger fires and the agent autonomously investigates across seven dimensions: drift detection, incident correlation, source code analysis, smart remediation, Teams notification, self-improving skill updates, and automatic PR creation.

Setting Up the Pipeline

Step 1: Deploy the Infrastructure with Terraform

We start with a simple Azure App Service running a Node.js application, deployed via Terraform. The Terraform configuration defines the desired state:

  • App Service Plan: B1 (Basic) — single vCPU, ~$13/mo
  • App Service: Node 20-lts with TLS 1.2
  • Tags: environment: demo, managed_by: terraform, project: sre-agent-iac-blog

resource "azurerm_service_plan" "demo" {
  name                = "iacdemo-plan"
  resource_group_name = azurerm_resource_group.demo.name
  location            = azurerm_resource_group.demo.location
  os_type             = "Linux"
  sku_name            = "B1"
}

A Logic App is also deployed to act as the authentication bridge between Terraform Cloud webhooks and the SRE Agent's HTTP Trigger endpoint, using Managed Identity to acquire Azure AD tokens. Learn more about HTTP Triggers here.
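Expressed as plain code, the bridge's job is small: take the incoming webhook body, attach an Azure AD bearer token acquired via Managed Identity, and forward it to the trigger endpoint. A sketch in JavaScript (in practice this is a declarative Logic App workflow, not code; the function name and URL below are illustrative assumptions):

```javascript
// Sketch of what the Logic App auth bridge does, expressed as plain
// JavaScript. In practice this logic lives in a declarative Logic App
// workflow; the function name and trigger URL here are assumptions.
function buildForwardedRequest(incomingPayload, aadToken, triggerUrl) {
  return {
    method: "POST",
    url: triggerUrl,
    headers: {
      "Content-Type": "application/json",
      // Managed Identity acquires this token; the Logic App attaches it
      // so the SRE Agent endpoint only accepts authenticated calls.
      Authorization: `Bearer ${aadToken}`,
    },
    body: JSON.stringify(incomingPayload),
  };
}

const req = buildForwardedRequest(
  { run_id: "run-abc123", workspace_name: "iacdemo" },
  "<token-from-managed-identity>",
  "https://example.invalid/sre-agent/http-trigger"
);
```

The point of the pattern is that Terraform Cloud never needs to know about Azure AD: it posts to the Logic App URL, and authentication is added in transit.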

Step 2: Create the Drift Analysis Skill

Skills are domain knowledge files that teach the agent how to approach a problem. We create a terraform-drift-analysis skill with an 8-step workflow:

  1. Identify Scope — Which resource group and resources to check
  2. Detect Drift — Compare Terraform config against Azure reality
  3. Correlate with Incidents — Check Activity Log and App Insights
  4. Classify Severity — Benign, Risky, or Critical
  5. Investigate Root Cause — Read source code from the connected repository
  6. Generate Drift Report — Structured summary with severity-coded table
  7. Recommend Smart Remediation — Context-aware: don't blindly revert
  8. Notify Team — Post findings to Microsoft Teams

The key insight in the skill: "NEVER revert critical drift that is actively mitigating an incident." This teaches the agent to think like an experienced SRE, not just a diff tool.

Step 3: Create the HTTP Trigger

In the SRE Agent UI, we create an HTTP Trigger named tfc-drift-handler with a 7-step agent prompt:

A Terraform Cloud run has completed and detected infrastructure drift.

Workspace: {payload.workspace_name}
Organization: {payload.organization_name}
Run ID: {payload.run_id}
Run Message: {payload.run_message}

STEP 1 — DETECT DRIFT: Compare Terraform configuration against actual Azure state...
STEP 2 — CORRELATE WITH INCIDENTS: Check Azure Activity Log and App Insights...
STEP 3 — CLASSIFY SEVERITY: Rate each drift item as Benign, Risky, or Critical...
STEP 4 — INVESTIGATE ROOT CAUSE: Read the application source code...
STEP 5 — GENERATE DRIFT REPORT: Produce a structured summary...
STEP 6 — RECOMMEND SMART REMEDIATION: Context-aware recommendations...
STEP 7 — NOTIFY TEAM: Post a summary to Microsoft Teams...
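The {payload.X} placeholders are filled from the webhook body before the prompt reaches the agent. SRE Agent handles this templating internally; a minimal sketch of the substitution it implies:

```javascript
// Minimal sketch of {payload.X} placeholder substitution. SRE Agent
// performs this internally; this only illustrates the mapping from
// webhook body to prompt text.
function renderPrompt(template, payload) {
  return template.replace(/\{payload\.([a-zA-Z_]+)\}/g, (match, key) =>
    key in payload ? String(payload[key]) : match // leave unknown keys as-is
  );
}

const prompt = renderPrompt(
  "Workspace: {payload.workspace_name}\nRun ID: {payload.run_id}",
  { workspace_name: "iacdemo", run_id: "run-abc123" }
);
// prompt === "Workspace: iacdemo\nRun ID: run-abc123"
```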
The HTTP Trigger dashboard showing tfc-drift-handler active (status "On") with 3 completed runs.

Step 4: Connect GitHub and Teams

We connect two integrations in the SRE Agent Connectors settings:

  • Code Repository: GitHub — so the agent can read application source code during investigations
  • Notification: Microsoft Teams — so the agent can post drift reports to the team channel
Both connectors show "Connected" status — the agent can read source code and notify the team.

The Incident Story

Act 1: The Latency Bug

Our demo app has a subtle but devastating bug. The /api/data endpoint calls processLargeDatasetSync() — a function that sorts an array on every iteration, creating an O(n² log n) blocking operation.

On a B1 App Service Plan (single vCPU), this blocks the Node.js event loop entirely. Under load, response times spike from milliseconds to 25-58 seconds, with 502 Bad Gateway errors from the Azure load balancer.
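The exact function isn't shown in the post, but the pattern described, a full sort inside every loop iteration, looks roughly like this reconstruction (not the actual server.js):

```javascript
// Reconstruction of the blocking pattern described above, not the
// actual server.js. Sorting (O(n log n)) inside a loop over n items
// yields O(n^2 log n), and because every step is synchronous it holds
// the Node.js event loop for the entire computation.
function processLargeDatasetSync(items) {
  const results = [];
  for (const item of items) {
    // Re-sorting the entire array on every iteration is the bug.
    const sorted = [...items].sort((a, b) => a - b);
    results.push(sorted[0] + item);
  }
  return results;
}
```

While this function runs, the single-threaded server can't accept or answer any other request, which is exactly the 25-58 second stall the telemetry shows.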

Act 2: The On-Call Response

An on-call engineer sees the latency alerts and responds — not through Terraform, but directly through the Azure Portal and CLI. They:

  1. Add diagnostic tags — manual_update=True, changed_by=portal_user (benign)
  2. Downgrade TLS from 1.2 to 1.0 while troubleshooting (risky — security regression)
  3. Scale the App Service Plan from B1 to S1 to throw more compute at the problem (critical — cost increase from ~$13/mo to ~$73/mo)
The Azure Portal tells the story: a TLS security warning banner across the top, unauthorized manual_update and changed_by tags, and the App Service Plan upgraded to S1. Three types of drift, all from a single incident response.

The incident is partially mitigated — S1 has more compute, so latency drops from catastrophic to merely bad. Everyone goes back to sleep. Nobody updates Terraform.

Act 3: The Drift Check Fires

The next morning, a nightly speculative Terraform plan runs and detects 3 drifted attributes. The notification webhook fires, flowing through the Logic App auth bridge to the SRE Agent HTTP Trigger.

The agent wakes up and begins its investigation.

What the Agent Found

Layer 1: Drift Detection

The agent compares Terraform configuration against Azure reality and produces a severity-classified drift report:

The agent's drift report — organized by severity. The "Incident Correlation" column (partially visible) is what makes this more than a terraform plan wrapper.

Three drift items detected:

  • Critical: App Service Plan SKU changed from B1 (~$13/mo) to S1 (~$73/mo) — a +462% cost increase
  • Risky: Minimum TLS version downgraded from 1.2 to 1.0 — a security regression vulnerable to BEAST and POODLE attacks
  • Benign: Additional tags (changed_by: portal_user, manual_update: True) — cosmetic, no functional impact
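The three-way split can be pictured as a small rule table. The agent reasons over live context rather than hard-coded rules, so this sketch is purely illustrative:

```javascript
// Illustrative rule table for the three drift items above. The agent
// reasons from context rather than hard-coded rules; this just makes
// the Benign / Risky / Critical split concrete.
function classifyDrift(item) {
  if (item.attribute === "sku_name" && item.costDeltaPerMonth > 0) {
    return "Critical"; // unreviewed cost increase
  }
  if (item.attribute === "minimum_tls_version" && item.actual < item.expected) {
    return "Risky"; // security regression (e.g. 1.2 downgraded to 1.0)
  }
  if (item.attribute === "tags") {
    return "Benign"; // cosmetic, no functional impact
  }
  return "Risky"; // default to caution for unknown attributes
}

classifyDrift({ attribute: "sku_name", expected: "B1", actual: "S1", costDeltaPerMonth: 60 });
// → "Critical"
```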

Layer 2: Incident Correlation

Here's where the agent goes beyond simple drift detection. It queries Application Insights and discovers a performance incident correlated with the SKU change:

The agent found that GET /api/data is averaging 25,919ms with a P95 of 57,697ms — affecting 97.6% of all requests. It also discovered that the /api/data endpoint exists in production but not in the repository source code.

Key findings from the incident correlation:

  • 97.6% of requests (40 of 41) were impacted by high latency
  • The /api/data endpoint does not exist in the repository source code — the deployed application has diverged from the codebase
  • The endpoint likely contains a blocking synchronous pattern — Node.js runs on a single event loop, and any synchronous blocking call would explain 26-58s response times
  • The SKU scale-up from B1→S1 was an attempt to mitigate latency by adding more compute, but scaling cannot fix application-level blocking code on a single-threaded Node.js server
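That last finding points at the fix: stop doing synchronous per-request work. One plausible non-blocking rewrite, offered as an assumption since the deployed endpoint code isn't in the repository:

```javascript
// One plausible non-blocking fix (the deployed endpoint code is not in
// the repo, so this is an assumption): sort once instead of once per
// iteration, and yield to the event loop between chunks of work.
async function processLargeDataset(items, chunkSize = 1000) {
  const sorted = [...items].sort((a, b) => a - b); // sort once: O(n log n)
  const results = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      results.push(sorted[0] + item);
    }
    // Yield to the event loop so other requests are served between chunks.
    await new Promise((resolve) => setImmediate(resolve));
  }
  return results;
}
```

With a rewrite along these lines, the B1 plan would likely have been sufficient all along, which is why the agent treats the SKU change as a symptom rather than the problem.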

Layer 3: Smart Remediation

This is the insight that separates an autonomous agent from a reporting tool. Instead of blindly recommending "revert all drift," the agent produces context-aware remediation recommendations:

Three different recommendations based on context: safe to revert (tags), revert immediately for security (TLS), and critically — do NOT revert the SKU until the code is fixed.

The agent's remediation logic:

  1. Tags (Benign) → Safe to revert anytime via terraform apply -target
  2. TLS 1.0 (Risky) → Revert immediately — the TLS downgrade is a security risk unrelated to the incident
  3. SKU S1 (Critical) → DO NOT revert until the /api/data performance root cause is fixed
The agent explains why the SKU shouldn't be reverted: "Reverting to B1 while the /api/data blocking code is still deployed would worsen the performance incident." It then provides a 5-step action sequence and a suggested async code pattern.

This is the logic an experienced SRE would apply. Blindly running terraform apply to revert all drift would scale the app back down to B1 while the blocking code is still deployed — turning a mitigated incident into an active outage.
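The skill's core rule, never revert critical drift that is actively mitigating an incident, reduces to a small decision function. Again illustrative, not the agent's actual implementation:

```javascript
// Illustrative decision function for the skill's core rule: never
// revert critical drift that is actively mitigating an incident.
// The agent applies this as judgment, not as literal code.
function remediationFor(item, { activeIncident, mitigatesIncident }) {
  if (item.severity === "Critical" && activeIncident && mitigatesIncident) {
    return "HOLD: fix the root cause first, then revert";
  }
  if (item.severity === "Risky" && item.kind === "security") {
    return "REVERT IMMEDIATELY"; // security regressions don't wait
  }
  return "REVERT: safe anytime (terraform apply -target)";
}

remediationFor(
  { severity: "Critical", kind: "sku" },
  { activeIncident: true, mitigatesIncident: true }
);
// → "HOLD: fix the root cause first, then revert"
```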

Layer 4: Investigation Summary

The agent produces a complete summary tying everything together:

The final summary includes: who made the changes (identified via Activity Log), the performance incident details, the code-infrastructure mismatch finding, and three actions taken — Teams notification, skill improvement, and PR creation.

Key findings in the summary:

  • Actor: surivineela@microsoft.com made all changes via Azure Portal at ~23:19 UTC
  • Performance incident: /api/data averaging 25-57s latency, affecting 97.6% of requests
  • Code-infrastructure mismatch: /api/data exists in production but not in the repository source code
  • Root cause: SKU scale-up was emergency incident response, not unauthorized drift

Layer 5: Teams Notification

The agent posts a structured drift report to the team's Microsoft Teams channel:

The Teams notification includes the severity-coded drift table, the performance incident context, the root cause explanation, and the 5-step recommended action sequence — all posted automatically, with a link back to the full SRE Agent investigation thread.

The on-call engineer opens Teams in the morning and sees everything they need: what drifted, why it drifted, and exactly what to do about it — without logging into any dashboard.

The Payoff: A Self-Improving Agent

Here's where the demo surprised us. After completing the investigation, the agent did two things we didn't explicitly ask for.

The Agent Improved Its Own Skill

The agent performed an Execution Review — analyzing what worked and what didn't during its investigation — and found 5 gaps in its own terraform-drift-analysis.md skill file:

The agent identified gaps including "No incident correlation guidance," "No smart remediation logic," and "No Activity Log integration" — then updated its own skill file with these learnings for next time.

What worked well:

  • Drift detection via az CLI comparison against Terraform HCL was straightforward
  • Activity Log correlation identified the actor and timing
  • Application Insights telemetry revealed the performance incident driving the SKU change

Gaps it found and fixed:

  1. No incident correlation guidance — the skill didn't instruct checking App Insights
  2. No code-infrastructure mismatch detection — no guidance to verify deployed code matches the repository
  3. No smart remediation logic — didn't warn against reverting critical drift during active incidents
  4. Report template missing incident correlation column
  5. No Activity Log integration guidance — didn't instruct checking who made changes and when

The agent then edited its own skill file to incorporate these learnings. Next time it runs a drift analysis, it will include incident correlation, code-infra mismatch checks, and smart remediation logic by default.

This is a learning loop — every investigation makes the agent better at future investigations.

The Agent Created a PR

Without being asked, the agent identified the root cause code issue and proactively created a pull request to fix it:

PR #1 — the agent modified both server.js (adding safety constants and capping delay values) and terraform-drift-analysis.md (incorporating the learnings from the investigation). Two commits, two files changed, +103/-10 lines.

The PR includes:

  • App safety fixes: Adding MAX_DELAY_MS and SERVER_TIMEOUT_MS constants to prevent unbounded latency
  • Skill improvements: Incorporating incident correlation, code-infra mismatch detection, and smart remediation logic
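The app-side change might look like the sketch below. MAX_DELAY_MS and SERVER_TIMEOUT_MS are named in the PR summary; the values and the clampDelay helper are assumptions, not the PR's actual diff:

```javascript
// Sketch of the kind of safety cap the PR describes. MAX_DELAY_MS and
// SERVER_TIMEOUT_MS are named in the PR summary; the values and the
// clampDelay helper are assumptions, not the PR's actual diff.
const MAX_DELAY_MS = 5_000;       // no single operation may delay longer
const SERVER_TIMEOUT_MS = 30_000; // would be applied as the server timeout

function clampDelay(requestedMs) {
  const n = Number(requestedMs);
  if (!Number.isFinite(n) || n < 0) return 0; // reject garbage input
  return Math.min(n, MAX_DELAY_MS);           // never exceed the cap
}
```

Caps like these turn "unbounded latency" bugs into bounded, observable slowdowns, which is a cheap guardrail even before the root-cause rewrite lands.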

From a single webhook: drift detected → incident correlated → root cause found → team notified → skill improved → fix shipped.

Key Takeaways

  1. Drift detection is not enough. Knowing that B1 changed to S1 is table stakes. Knowing it changed because of a latency incident, and that reverting it would cause an outage — that's the insight that matters.
  2. Context-aware remediation prevents outages. Blindly running terraform apply after drift would have scaled the app back to B1 while blocking code was still deployed. The agent's "DO NOT revert SKU" recommendation is the difference between fixing drift and causing a P1.
  3. Skills create a learning loop. The agent's self-review and skill improvement means every investigation makes the next one better — without human intervention.
  4. HTTP Triggers connect any platform. The auth bridge pattern (Logic App + Managed Identity) works for Terraform Cloud, but the same architecture applies to any webhook source: GitHub Actions, Jenkins, Datadog, PagerDuty, custom internal tools.
  5. The agent acts, not just reports. From a single webhook: drift detected, incident correlated, root cause identified, team notified via Teams, skill improved, and PR created. End-to-end in one autonomous session.

Getting Started

HTTP Triggers are available now in Azure SRE Agent:

  1. Create a Skill — Teach the agent your operational runbook (in this case, drift analysis with severity classification and smart remediation)
  2. Create an HTTP Trigger — Define your agent prompt with {payload.X} placeholders and connect it to a skill
  3. Set Up an Auth Bridge — Deploy a Logic App with Managed Identity to handle Azure AD token acquisition
  4. Connect Your Source — Point Terraform Cloud (or any webhook-capable platform) at the Logic App URL
  5. Connect GitHub + Teams — Give the agent access to source code and team notifications

Within minutes, you'll have an autonomous pipeline that turns infrastructure drift events into fully contextualized investigations — with incident correlation, root cause analysis, and smart remediation recommendations.

The full implementation guide, Terraform files, skill definitions, and demo scripts are available in this repository.

 

Updated Apr 16, 2026
Version 2.0