
Azure Infrastructure Blog

From ServiceNow to Self-Healing Infrastructure: A Multi-Repo Azure AI Platform

Valini_Sunthwal
Apr 29, 2026

A question before we begin

Quick — what version of infrastructure is running in your production subscription right now?

Not what your pipeline last deployed. Not what your tracking spreadsheet says. What's actually running — right now, this second — and can you prove it?

If you had to think about it, you're in good company. So did we. That's basically why we ended up building everything I'm about to describe.


The Friday night that started all this

So here's what happened. Production had a broken NSG rule. One of our engineers woke up, logged into the Azure Portal, edited the rule, hit Save. Done. Went back to bed.

Fair enough — production was healthy again.

The problem? Nobody noticed until Monday that someone had changed live infrastructure directly in the portal. No pipeline involved. No Terraform. No tracking. Our deployment registry still showed the old config. For three days, our "source of truth" was just... wrong.

And that's the thing people don't talk about enough with cloud infrastructure. It's not just Terraform drift. It's anyone making a change through the Portal, CLI, PowerShell, REST API — basically any path that isn't your pipeline. Our system catches all of it now, but I'll get to that.


Why this is harder than it sounds

Let me paint the picture. A single Azure environment has easily 20+ resources. VNets with subnets and NSGs. Key Vaults. Storage Accounts with private endpoints. Container App Environments. AI services — OpenAI, AI Foundry. DNS zones, route tables, firewalls, managed identities, RBAC bindings tying everything together.

Each resource has dozens of properties. And each property can be changed through the Portal, CLI, PowerShell, ARM, Bicep, Terraform, or raw REST calls. Any of those changes can happen without your pipeline having a clue.

Now multiply that across lab, non-live, and live. Across multiple subscriptions. Across multiple markets. Yeah.

 


What we mean by "market"

We use the word market to mean a country or regional business unit. Each one runs in its own Azure region with its own subscription. UK deploys to UK South, Czech to Germany West Central, and so on. Same platform blueprint, different geography, different network space, different secrets, different deployment lifecycle.

Think of it as completely isolated copies of the same thing.


The actual problem we had

We run Azure AI infrastructure across multiple markets. Each market has three environments — lab, non-live, live. Each environment has 20+ resources. So we're talking hundreds of resources across dozens of subscriptions, all deployed through Terraform, all expected to stay in sync.

They don't. Pipelines fail halfway through. Someone fixes something manually in the portal because it's 2 AM and they just want it to work. A registry update fails silently. And slowly, what you think is deployed and what's actually deployed start to diverge. One quiet failure at a time.

We needed something different. Not just a deployment pipeline — we needed a system that knows what it deployed, can check whether that's still true, and notices when someone changes something outside the pipeline.


Can it actually detect portal changes?

Yes. But there's a nuance.

When someone edits a resource through the Azure Portal — modifies an NSG rule, changes a Key Vault policy, scales a Container App — the actual Azure resource changes. But Terraform's state file doesn't update because Terraform wasn't involved. The state serial stays the same. The version tracker stays the same. The registry stays the same.

So the daily reconciliation pipeline, which compares serials and versions, would not catch this. All three tracking files still agree — they're just all wrong.

The portal change gets caught the next time the pipeline runs terraform plan. Terraform talks to Azure, compares what's in state versus what's actually there, and shows the diff:

~ resource "azurerm_network_security_rule" "example" {
    ~ access = "Allow" -> "Deny"   # changed outside of Terraform
  }

Then terraform apply reverts the change — brings Azure back in line with the code. That's the self-healing bit.

So in short:

  • The reconciliation pipeline catches pipeline-level drift — someone ran terraform outside the pipeline, or a registry write failed.
  • The next terraform plan/apply catches resource-level drift — portal changes, CLI changes, anything outside Terraform.

Between the two, nothing stays hidden for long.


Three repos, one platform

We split things into three repositories. Each does one thing well.

 

1. Platform Repo

Deploys the foundation — VNets, DNS, Key Vaults, Container App Environments, AI Foundry, managed identities, firewall rules. Everything a market needs before any application team can deploy anything. Triggered by ServiceNow tickets or code merges.

2. Modules Repo

37 reusable Terraform modules, all built on top of Azure Verified Modules. Both the Platform Repo and Use-Case Repo pull from this. The whole point is that if the same Key Vault definition lives in two places, it will diverge within three months. This makes that impossible.

Everything is version-pinned. No silent updates:

source = "git::https://...//avm_modules/keyvault?ref=avm-res-keyvault-vault/v0.10.2"

3. Use-Case Repo

This is where application teams deploy their stuff — storage, databases, function apps, AI Search, container apps — on top of what the Platform Repo already set up. The repo has pre-written Terraform for 20+ resources. Teams don't write Terraform. They uncomment what they need. Push to a feature branch, it deploys to lab. Merge to staging, it goes to non-live. Merge to main, it goes to live. Approval gates at each step.
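The branch-to-environment promotion can be sketched as a simple mapping. This is an illustrative simplification, not the platform's actual pipeline code; the branch names and environment names come from the workflow described above, the function itself is hypothetical:

```python
# Sketch of the Use-Case Repo's branch-to-environment promotion.
# Branch names (feature/*, staging, main) and environments (lab, non-live,
# live) are from the post; this function is a hypothetical simplification.

def target_environment(branch: str) -> str:
    """Map a Git branch to the environment a push or merge deploys to."""
    if branch == "main":
        return "live"      # approval gate applies before this deploy
    if branch == "staging":
        return "non-live"
    return "lab"           # any feature branch deploys to the lab sandbox

print(target_environment("feature/add-ai-search"))  # lab
```

The useful property is that the mapping is total: there is no branch that deploys nowhere, and no way to reach live except through main.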


The five pipelines

Release pipeline. Fires on merge to main or from a ServiceNow trigger. Runs semantic-release, creates a version tag (like v7.0.0), then deploys to lab → non-live → live with an approval gate before live.

Feature pipeline. Lab-only sandbox. Test your changes without touching anything real. Creates soft tags like lab-v7.0.0.1 so you can track what was deployed without cluttering the main version history.

Rollback pipeline. Pick any previous tag (say v6.1.0) and it restores it. The version tracker marks this as a deliberate rollback so reconciliation doesn't flag it as drift.

Reconciliation pipeline. Runs at 6 AM every day. Compares what the registry says is deployed against what's actually in the Terraform state. Catches pipeline-level drift — someone ran terraform apply outside the pipeline, or a registry write failed. Portal changes get caught on the next plan/apply run instead.

Commitlint pipeline. Enforces conventional commit messages. That's what makes semantic versioning work — fix: bumps patch, feat: bumps minor, BREAKING CHANGE: bumps major.


The three tracking files

This is where most platforms stop too early. We keep three independent records for every deployment.

 

1. Terraform State File

Terraform's own record of what it manages. Stored in Azure Blob Storage. Has a serial number that increments every time Terraform runs. If someone runs terraform apply outside the pipeline, this serial changes but nothing else does.

2. Version Tracker

A JSON file sitting next to the state file. Written by the pipeline after every successful deploy. Records the version tag, commit SHA, run ID, timestamp. If someone deploys outside the pipeline, this file doesn't update — that's how we spot the mismatch.
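A tracker entry might look something like this. The recorded fields (version tag, commit SHA, run ID, timestamp) are from the description above; the exact key names, file name, and layout are my assumptions for illustration, not the platform's actual schema:

```python
import datetime
import json

# Hypothetical version-tracker payload. The fields recorded (version tag,
# commit SHA, run ID, timestamp) come from the post; key names and file
# layout are illustrative assumptions.

def write_tracker(path: str, version: str, commit_sha: str, run_id: str) -> dict:
    entry = {
        "version": version,
        "commit_sha": commit_sha,
        "run_id": run_id,
        "deployed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    with open(path, "w") as f:
        json.dump(entry, f, indent=2)
    return entry

entry = write_tracker("version-tracker.json", "v7.0.0", "a1b2c3d", "12345")
```

Because only the pipeline ever writes this file, an out-of-band deploy leaves it stale by construction.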

3. Subscription Registry

A central blob container where each market gets a subscription.json. Records what the platform thinks is deployed in each environment. But if the blob write fails (network blip), it can be wrong. That's why we cross-check against the other two.

When all three agree, we're good. When they don't, we know exactly what drifted, where, and why.


How drift detection works in practice

Every morning at 6, the reconciliation pipeline checks every market, subscription, and environment.

It uses three tiers:

Tier 1 (best case): Version tracker exists. Compare its serial against the actual state serial and the registry. If anything's off, flag it.

Tier 2 (no tracker): Older environments without a version tracker. Fall back to comparing the registry's recorded serial against the actual state serial.

Tier 3 (no serials at all): Compare timestamps. If the state was modified more than five minutes after the last registry update, flag it.
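The three tiers can be sketched as a single fall-through check. The tier logic and the five-minute window are from the description above; the function and field names are hypothetical, and tier 1 is simplified to the serial comparison only:

```python
from datetime import datetime, timedelta
from typing import Optional

# Illustrative sketch of the three-tier reconciliation check. Tier order
# and the five-minute window come from the post; names are hypothetical.

def check_drift(state_serial: int,
                state_modified: datetime,
                registry_serial: Optional[int],
                registry_updated: datetime,
                tracker_serial: Optional[int]) -> str:
    if tracker_serial is not None:       # Tier 1: version tracker exists
        return "DRIFT" if tracker_serial != state_serial else "OK"
    if registry_serial is not None:      # Tier 2: fall back to registry serial
        return "DRIFT" if registry_serial != state_serial else "OK"
    # Tier 3: timestamps only - state touched >5 min after last registry write
    if state_modified > registry_updated + timedelta(minutes=5):
        return "DRIFT"
    return "OK"
```

Note what this can and cannot see: a manual terraform apply bumps the state serial and trips tier 1 or 2, but a portal edit changes none of these inputs, which is exactly why it only surfaces at the next plan.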

When it finds drift, it does not auto-fix. It generates a report:

Customer     | Environment | Registry  | Tracker   | State Serial | Status
customer-cz  | non-live    | v6.1.0    | v7.0.0    | 45           | DRIFT
customer-cz  | live        | v7.0.0    | v7.0.0    | 41           | OK
customer-uk  | lab         | v7.0.0    | v7.0.0    | 22           | OK

A human reviews and approves before anything gets corrected. No silent overwrites.


Did it work?

Within the first week, the reconciliation pipeline caught something real. Someone had run terraform apply manually — outside the pipeline. The state serial changed, the version tracker didn't. Flagged at 6 AM the next morning.

Separately, the next scheduled deployment caught a portal change. An NSG rule had been modified directly in the Azure Portal. terraform plan showed the unexpected diff and terraform apply reverted it.

Between the two mechanisms, nothing went undetected for more than a day.


Security

No secrets stored anywhere. Everything is OIDC. GitHub App creds, state backend config — all fetched at runtime from Key Vault. Nothing in repo secrets or pipeline variables.

Private endpoints on everything. Storage, Key Vault, AI Search, SQL, PostgreSQL, Container Registry — all deployed with public access disabled.

Quality gates on every PR. Terraform fmt, validate, Checkov, tfsec, commitlint. Doesn't pass, doesn't merge.

Rollback is a first-class thing. Restore any previous version. The tracker records it as intentional so reconciliation doesn't freak out.

Here's the pattern every pipeline job follows:

# Every job:
- name: "Azure Login"
  uses: ./.github/actions/azure-login

- name: "Fetch Secrets from Key Vault"
  id: kv-secrets
  uses: ./.github/actions/keyvault-secrets
  with:
    keyvault_name: ${{ env.KEYVAULT_NAME }}

- name: "Generate GitHub App Token"
  uses: ./.github/actions/github-app-token
  with:
    app_id: ${{ steps.kv-secrets.outputs.app_id }}
    private_key: ${{ steps.kv-secrets.outputs.private_key }}

What we'd tell you

Separate platform from application on day one. The repo boundary is the governance boundary. Shared repo means shared blast radius.

Pin everything. Modules, providers, Terraform versions. If today's deploy can produce a different result tomorrow because something upstream changed, you don't have reproducible infrastructure.

Assume every component will fail on its own. The pipeline will succeed but the registry write will fail. Build verification loops before you need them, not after.

Make the right path the easy path. If uncommenting a Terraform block is easier than writing one from scratch, people will uncomment. If deploying through a form is easier than asking the platform team, they'll use the form. Design for that.

Understand the two types of drift. Pipeline drift (someone ran terraform outside the pipeline) gets caught by metadata comparison. Resource drift (portal changes) gets caught by the next terraform plan. You need both.


The bottom line

We started with: can you prove what's deployed in production right now?

Now we can. And the platform checks it every morning at 6 AM.

Three repos. Five pipelines. 37 modules. SemVer on every deployment. Three tracking files that cross-check each other. One daily alarm clock asking: is everything still true?

It usually is. And when it isn't, we know within hours.


If you're managing multi-subscription Azure infrastructure and dealing with drift or visibility problems — we'd genuinely like to hear how you're handling it.

Updated Apr 29, 2026
Version 1.0