# CI/CD as a Platform: Shipping Microservices and AI Agents with Reusable GitHub Actions Workflows
## The First Shift — Treating CI/CD as a Platform

The first insight is straightforward but underused: your CI/CD logic is infrastructure. It deserves the same design discipline as your application code. That means centralizing it. Versioning it. Exposing it as reusable, callable workflows — not copy-pasted YAML scattered across dozens of repos.

In Part 1 of this series, we build exactly that: a platform repository that defines reusable GitHub Actions workflows for testing, building, and deploying containerized services to Azure. Application repos stay thin — they simply call the platform, like invoking an API. Build once. Deploy anywhere. Fix once. Every team benefits.

## The Second Shift — Governing AI Behavior

But software is changing. We are no longer just shipping APIs and microservices. We are shipping AI agents — systems that reason, respond, and make decisions. And these systems break the assumptions that traditional CI/CD was built on.

A unit test can tell you whether your code is correct. It cannot tell you whether your AI agent is trustworthy. Prompts behave like code but drift differently. Model outputs are probabilistic. Quality degrades silently, without a failed test to catch it. This creates a new engineering challenge: how do you build a delivery pipeline for something that does not have a deterministic right answer?

In Part 2, we extend the platform to answer that question. We introduce evaluation as a deployment gate — a reusable workflow that scores agent behavior before any deployment is allowed. We integrate with Microsoft Foundry for agent runtime and observability. And we show how the same platform thinking from Part 1 applies directly to AI systems.

## What This Series Is Really About

This is not a tutorial on GitHub Actions syntax. It is about maturity — the difference between a team that writes pipelines and a team that designs delivery systems. Between an organization that ships code and one that governs behavior.
By the end of both parts, you will have:

- A reusable CI/CD platform that scales across any number of services
- An evaluation-driven delivery pipeline for AI agents
- A mental model for treating both code and AI as governed, versioned artifacts

The tools are GitHub Actions and Azure. The principle is platform thinking. Let's build it.

## The Problem — Why CI/CD Pipelines Don't Scale

Every pipeline starts simple. You create a repository, add a workflow file, and within minutes your code is building and deploying automatically. It feels like a solved problem. It isn't.

### The Reality of Growth

The first pipeline is straightforward. The second is a copy of the first. The third is a copy of the second — with one small adjustment. By the time you have ten services, you have ten slightly different pipelines, each one drifting quietly away from the others. This is pipeline sprawl — and it is far more costly than it appears.

Consider what happens in practice:

- One team upgrades their Python version. Others don't.
- A security fix gets applied to three pipelines. The other seven are missed.
- A new compliance requirement means updating every workflow file — manually, one repo at a time.
- A new engineer onboards using an old workflow and ships a pattern that was deprecated months ago.

None of this feels critical in the moment. But over time, your CI/CD layer becomes the most inconsistent, unmaintainable, and ungoverned part of your infrastructure — even though it controls everything that ships to production.

### The Deeper Problem — No Separation of Concerns

The root cause is not a tooling limitation. It is a design problem. Most teams treat CI/CD as something that lives inside an application repo — a secondary concern, not a first-class system. That model works at small scale. It breaks at org scale.
When CI/CD logic is distributed across every application repo:

- There is no single source of truth for how deployments work
- Platform teams cannot enforce standards without touching every repo individually
- Security and compliance teams have no centralized control plane
- Onboarding a new service means rebuilding from scratch — or copying from an outdated reference

### The Cost You Don't See

The real cost of this pattern is not the duplicated YAML. It is the compounding overhead:

| Problem | Visible Cost | Hidden Cost |
|---|---|---|
| Duplicated pipelines | Time to replicate | Drift and inconsistency over time |
| No centralized logic | Minor friction | Security gaps across repos |
| Manual updates | One-time effort per change | Multiplied across every service |
| No versioning | Manageable today | Breaking changes with no rollback path |

### What the Solution Looks Like

The answer is not a better YAML template. It is a platform. Specifically — a centralized repository that owns CI/CD logic, exposes it as reusable versioned workflows, and lets every application team consume it without duplicating a single line of pipeline code. This is the same principle that drives every mature engineering organization: don't repeat infrastructure. Abstract it. Version it. Share it. That is exactly what we are going to build.

## The Architecture — What You're Building

Before writing a single line of code, it is worth understanding the system as a whole. The architecture is intentionally simple. Two repositories. One cloud infrastructure. One clear separation of responsibilities.

### The Two-Repo Model

This separation is the core design decision. Everything else follows from it. The platform repo is not an application. It does not ship features. It ships workflow infrastructure — reusable, versioned, callable by any application team in your organization. The application repo is deliberately thin on CI/CD. It contains a single workflow file that calls the platform. Nothing more.
### How They Connect

The connection happens through GitHub's workflow_call trigger — a mechanism that allows one workflow to invoke another across repositories. The application repo does not care how the build works. It only cares about the contract — inputs it needs to provide, outputs it can expect back. This is the same mental model as an API: the caller knows the interface; the platform owns the implementation.

### The Deployment Flow

Once triggered, the pipeline moves through four clearly defined stages. A few things to note about this flow:

- The image is built exactly once. The same artifact moves through every environment — no rebuilds, no drift.
- The Git SHA is the image tag. Every deployment is fully traceable back to a specific commit.
- GitHub Environments control approvals. Staging and production are separate environments with configurable protection rules — no custom approval logic needed.

### The Azure Infrastructure

On the cloud side, the system uses two Azure services:

| Service | Role |
|---|---|
| Azure Container Registry (ACR) | Stores Docker images |
| Azure Container Apps | Runs the application in staging and production |

Both are provisioned using Bicep — Azure's infrastructure-as-code language — so the infrastructure is versioned and repeatable alongside the workflows.

### Responsibility Map

Here is how responsibilities are distributed across the system:

| Layer | Owns | Does Not Own |
|---|---|---|
| Platform Repo | Test logic, build logic, deploy logic | Application code |
| Application Repo | Business logic, Dockerfile, requirements | Pipeline implementation |
| Azure | Runtime, registry, networking | Deployment decisions |

This clean separation means:

- Platform teams can update CI/CD logic without touching application code
- Application teams can ship features without understanding pipeline internals
- Infrastructure changes are isolated to the Bicep layer

### Why This Scales

The real power of this architecture becomes clear at scale.
With fifty microservices, one change to deploy.yml in the platform repo propagates to every service on the next run. No manual updates. No drift. No inconsistency. This is what CI/CD as a platform means in practice.

## Platform Repo — Structure and Reusable Workflows

The platform repo is the heart of this system. Everything it contains is designed to be reusable, versioned, and consumed by any application team in your organization. Let's walk through it in full.

### Repository Structure

Three workflows. One infrastructure file. That is the entire platform. Each workflow has a single, well-defined responsibility:

| Workflow | Responsibility |
|---|---|
| test-python.yml | Install dependencies and run tests |
| build.yml | Build Docker image and push to ACR |
| deploy.yml | Deploy a specific image to a specific environment |

### Workflow 1 — test-python.yml

This workflow handles dependency installation and test execution for any Python-based service.

```yaml
name: test-python
on:
  workflow_call:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11.9"
      - run: pip install -r requirements.txt
      - run: pytest
```

What to note:

- The `on: workflow_call` trigger is what makes this reusable. It cannot be triggered directly — it must be called by another workflow.
- The Python version is pinned to 3.11.9 — not a floating version like 3.11. This ensures every service tests against the exact same runtime, eliminating environment-specific failures.

Any application repo that calls this workflow gets consistent, centrally maintained test execution — without defining any of this logic themselves.

### Workflow 2 — build.yml

This workflow builds the Docker image, tags it with the Git SHA, and pushes it to Azure Container Registry.
```yaml
name: build
on:
  workflow_call:
    outputs:
      image_tag:
        value: ${{ jobs.build.outputs.image_tag }}

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.meta.outputs.tag }}
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - id: meta
        run: echo "tag=${GITHUB_SHA}" >> $GITHUB_OUTPUT
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az acr login --name ${{ secrets.ACR_NAME }}
      - run: |
          docker build -t ${{ secrets.ACR_LOGIN_SERVER }}/app:${{ github.sha }} .
          docker push ${{ secrets.ACR_LOGIN_SERVER }}/app:${{ github.sha }}
```

What to note:

- `outputs` — This workflow exposes image_tag as an output. The calling workflow captures this value and passes it downstream to the deploy workflow. This is how the same image tag flows from build → staging → production without being hardcoded anywhere.
- `id-token: write` — This permission enables OIDC-based authentication with Azure. No long-lived credentials are stored as secrets. GitHub generates a short-lived token at runtime, which Azure trusts via a federated identity configuration. This is the recommended authentication pattern for production workloads.
- `${GITHUB_SHA}` — Using the commit SHA as the image tag makes every build fully traceable. Given any running container, you can identify the exact commit it was built from.

### Workflow 3 — deploy.yml

This workflow deploys a given image to a given environment in Azure Container Apps.
```yaml
name: deploy
on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
      image_tag:
        required: true
        type: string
      app_name:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: |
          az containerapp update \
            --name ${{ inputs.app_name }} \
            --resource-group ${{ secrets.AZURE_RESOURCE_GROUP }} \
            --image ${{ secrets.ACR_LOGIN_SERVER }}/app:${{ inputs.image_tag }}
```

What to note:

- Three inputs — environment, image_tag, and app_name. This single workflow handles every environment. The caller decides where to deploy by passing inputs — the workflow itself has no hardcoded environment logic.
- `environment: ${{ inputs.environment }}` — This line is deceptively powerful. By mapping the job's environment to the input value, GitHub automatically applies whatever protection rules are configured for that environment — required reviewers, wait timers, deployment policies. Approval gates come for free.
- `secrets: inherit` — When the calling workflow passes secrets: inherit, Azure credentials flow through automatically without being re-declared. Secrets are managed once, at the org or repo level.

### The Versioning Contract

One detail that makes this system production-ready is workflow versioning. When an application repo calls a platform workflow, it references a specific version. The @v1 tag means:

- Application teams are insulated from breaking changes in the platform
- Platform teams can ship improvements without forcing immediate upgrades
- You can run @v1 and @v2 side by side during migrations
- Every deployment is traceable to a specific platform version

This versioning model is what separates a platform from a shared folder of YAML files.
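For illustration, here is a hedged sketch of what a pinned reference looks like in a calling workflow. The org/repo path mirrors the release workflow used elsewhere in this article; the surrounding job context is an assumption:

```yaml
jobs:
  deploy-staging:
    # Pinned to the platform's v1 tag, so platform upgrades are an
    # explicit, reviewable change rather than a silent one.
    uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1
    with:
      environment: staging
      image_tag: ${{ needs.build.outputs.image_tag }}
      app_name: my-app-staging
    secrets: inherit
```

Moving from @v1 to @v2 is then a one-line change in each caller, which teams can adopt on their own schedule.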
### What Application Teams See

From an application team's perspective, the entire platform surface is three `uses` statements. That is the entire CI/CD surface an application team needs to understand. Everything else — authentication, image tagging, registry login, container update commands — is abstracted away inside the platform.

## Azure Infrastructure

The platform workflows handle CI/CD logic. The Azure infrastructure handles the runtime — where your containers live, how they are stored, and how they are served to the outside world. All infrastructure is defined in Bicep — Azure's native infrastructure-as-code language. This means your infrastructure is versioned, repeatable, and deployable from a single command.

### Why Bicep

Before diving into the code, it is worth briefly explaining the choice. Bicep compiles down to ARM templates but is significantly more readable. It integrates natively with Azure's resource model, requires no external state management, and fits naturally alongside GitHub Actions workflows. For teams already working within the Azure ecosystem, it is the most straightforward path to infrastructure-as-code without introducing additional tooling dependencies.

### Infrastructure Structure

The entire infrastructure is defined in a single file.
For this architecture, you need two resources:

| Resource | Purpose |
|---|---|
| Azure Container Registry (ACR) | Stores and serves Docker images |
| Azure Container Apps | Runs containers in a managed serverless environment |

main.bicep:

```bicep
param location string = resourceGroup().location

// Azure Container Registry
resource acr 'Microsoft.ContainerRegistry/registries@2023-01-01-preview' = {
  name: 'myregistry'
  location: location
  sku: {
    name: 'Basic'
  }
}

// Azure Container App (Staging + Production)
resource containerApp 'Microsoft.App/containerApps@2023-05-01' = {
  name: 'my-app'
  location: location
  properties: {
    configuration: {
      ingress: {
        external: true
        targetPort: 8000
      }
    }
  }
}
```

### Breaking It Down

Container Registry — the ACR is the central image store for your entire platform. Every image built by build.yml is pushed here, tagged with its Git SHA. Both staging and production pull from this registry — ensuring the exact same artifact runs in both environments. The Basic SKU is sufficient for most team-scale workloads. For larger organizations with higher throughput requirements, Standard or Premium SKUs offer geo-replication and increased storage limits.

Container App — Azure Container Apps provides a fully managed serverless container runtime. You define what runs — it handles scaling, networking, and availability. Two things to note here:

- `external: true` — Makes the application publicly accessible over HTTPS. Azure Container Apps automatically provisions a fully qualified domain name and TLS certificate.
- `targetPort: 8000` — Maps to the port exposed by the FastAPI application inside the container. This must match the --port argument in your CMD instruction in the Dockerfile.
### Staging vs. Production

You will deploy this infrastructure twice — once for staging, once for production — with different resource names:

```bash
# Deploy staging
az deployment group create \
  --resource-group rg-ciplatform-staging \
  --template-file infra/main.bicep

# Deploy production
az deployment group create \
  --resource-group rg-ciplatform-production \
  --template-file infra/main.bicep
```

The deploy.yml workflow then targets the correct app by name via the app_name input. This keeps staging and production fully isolated at the infrastructure level while sharing the same workflow logic.

### GitHub Environments and Approval Gates

On the GitHub side, you configure two Environments — staging and production — inside your repository settings. For production, add a required reviewer protection rule. When the pipeline reaches the deploy-prod job, GitHub will pause and wait for a designated reviewer to approve before proceeding. This approval gate costs nothing extra — it is built into GitHub's environment model and wired automatically through the environment: field in deploy.yml.

### Setting Up Azure Authentication

The workflows authenticate to Azure using OpenID Connect (OIDC) — a keyless authentication method that eliminates the need for long-lived service principal secrets.
Set up the federated identity once:

```bash
# Create a service principal
az ad app create --display-name "github-actions-platform"

# Add federated credential for your repo
az ad app federated-credential create \
  --id <app-id> \
  --parameters '{
    "name": "github-actions",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:your-org/fastapi-app:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'
```

Then add these secrets to your GitHub repository:

| Secret | Value |
|---|---|
| AZURE_CLIENT_ID | Application (client) ID |
| AZURE_TENANT_ID | Directory (tenant) ID |
| AZURE_SUBSCRIPTION_ID | Azure subscription ID |
| AZURE_RESOURCE_GROUP | Target resource group name |
| ACR_NAME | Container registry name |
| ACR_LOGIN_SERVER | Registry login server (e.g. myregistry.azurecr.io) |

With these in place, every workflow that calls azure/login@v2 authenticates automatically — no passwords, no rotation, no expiry management.

## Application Repo — Structure, Code, and Release Workflow

With the platform repo in place, the application repo becomes remarkably simple. Its only CI/CD responsibility is to call the platform — everything else is focused purely on application logic. This is the goal: application teams ship features, not pipelines.

### Repository Structure

This is the entire CI/CD footprint of the application repo.

### The Application — src/main.py

The application is a minimal FastAPI service with a single endpoint that returns the current deployed version and environment.

```python
from fastapi import FastAPI
import os

app = FastAPI()

@app.get("/version")
def version():
    return {
        "version": os.getenv("GITHUB_SHA", "dev"),
        "environment": os.getenv("APP_ENV", "local")
    }
```

This endpoint serves a practical purpose beyond demonstration.
In a real system, a /version or /health endpoint like this allows you to:

- Verify which commit is running in each environment
- Confirm a deployment succeeded without inspecting container logs
- Detect environment mismatches between staging and production

### requirements.txt

All dependencies are pinned to exact versions. This ensures the same packages install in every environment — local development, CI, staging, and production — eliminating version drift as a source of failures.

### Dockerfile

```dockerfile
FROM python:3.11.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY src ./src
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

What to note:

- `python:3.11.9-slim` — The base image uses the same Python version as the platform's test-python.yml workflow. Consistency between the test environment and the container runtime eliminates an entire class of environment-specific bugs.
- Dependency layer first — requirements.txt is copied and installed before application source code. This is a deliberate layer ordering decision — Docker caches the dependency layer independently, so subsequent builds only reinstall packages when requirements.txt changes, not on every code change.
- `0.0.0.0` — Binds the server to all network interfaces inside the container, making it reachable from outside. Combined with targetPort: 8000 in the Bicep configuration, this completes the network path from Azure Container Apps to the application.

### The Release Workflow — release.yml

This is the most important file in the application repo. It is also the simplest.
```yaml
name: release
on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  test:
    uses: ns-github-design/ci-platform/.github/workflows/test-python.yml@v1

  build:
    needs: test
    uses: ns-github-design/ci-platform/.github/workflows/build.yml@v1
    secrets: inherit

  deploy-staging:
    needs: build
    uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1
    with:
      environment: staging
      image_tag: ${{ needs.build.outputs.image_tag }}
      app_name: my-app-staging
    secrets: inherit

  deploy-prod:
    needs: [build, deploy-staging]
    uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1
    with:
      environment: production
      image_tag: ${{ needs.build.outputs.image_tag }}
      app_name: my-app-prod
    secrets: inherit
```

### Walking Through the Pipeline

Trigger — Every merge to main triggers a full release. This reflects a trunk-based delivery model — main is always releasable, and every commit to it initiates the path to production.

Test job — The first job calls the platform's test workflow. No configuration required — the platform handles Python setup, dependency installation, and test execution. The application team owns the test files; the platform owns the execution environment.

Build job — The build job runs only after tests pass. It calls the platform's build workflow and inherits all secrets automatically — Azure credentials, ACR login server, registry name — without re-declaring them. The critical output here is image_tag — the Git SHA of the current commit. This value is captured and passed downstream to both deploy jobs.

Deploy to staging — The staging deployment runs immediately after a successful build. It passes three inputs to the deploy workflow:

- environment: staging — triggers GitHub's staging environment rules
- image_tag — the exact SHA built in the previous job
- app_name: my-app-staging — the target Container App in Azure

Deploy to production — Production deployment runs only after staging succeeds.
It uses the same image_tag — the identical image that just ran successfully in staging is what gets promoted to production. No rebuild. No repackaging. The artifact is immutable. If a required reviewer is configured on the production GitHub Environment, the pipeline pauses here until approval is granted.

### The Complete Pipeline at a Glance

### What the Application Team Never Has to Think About

It is worth being explicit about what this model abstracts away from application engineers:

| Concern | Handled By |
|---|---|
| Azure authentication | Platform (build.yml, deploy.yml) |
| Docker build and push | Platform (build.yml) |
| Image tagging strategy | Platform (build.yml) |
| Container App update command | Platform (deploy.yml) |
| Approval gate mechanics | GitHub Environments |
| Python version consistency | Platform (test-python.yml) |

The application team's CI/CD knowledge requirement is reduced to understanding three uses statements and two with input blocks. Everything else is the platform's responsibility.

## Demo — Proving It Works

Your pipeline is now live and connected across three layers:

- GitHub Actions (reusable workflows) — powering CI/CD logic
- FastAPI application repo — consuming those workflows
- Azure Container Apps — running staging and production

### Step 1 – Trigger the CI/CD Pipeline

Push any commit to the main branch, then open the repository's Actions tab. You'll see the release workflow start automatically.

### Step 2 – Observe the Pipeline Run

The jobs execute in sequence:

| Stage | Description |
|---|---|
| test | Runs pytest inside GitHub Actions using the reusable workflow test-python.yml |
| build | Builds and tags a Docker image with the current Git SHA, then pushes to ACR |
| deploy-staging | Deploys that same image to your Container App my-app-staging |
| approval gate | Waits for approval of the production environment |
| deploy-prod | On approval, promotes the identical image to my-app-prod |

The needs: [build, deploy-staging] dependency on deploy-prod ensures the correct ordering.
### Step 3 – Review the Logs

Every job's output is visible inside GitHub Actions:

- test — confirms tests collected successfully
- build — shows docker push ... to ACR
- deploy-staging — displays Azure CLI output updating the Container App
- deploy-prod — mirrors those steps after manual approval

This transparency is part of what makes reusable workflows auditable and supportive of enterprise compliance.

### Step 4 – Verify Running Apps

After both deployments succeed, confirm each environment is live by calling the /version endpoint on the staging and production apps. Expected response: the JSON payload from /version, with the exact commit SHA in place of "abc1234". This proves:

- The same container image was promoted unchanged.
- Both environments are consistent.
- The platform's reusable workflows handled the full delivery flow.

## The Bridge: Why AI Changes Everything

Your CI/CD platform now runs like a product: build once, test once, deploy anywhere. But software itself is shifting. The next generation of systems doesn't just serve requests — it reasons. We are no longer only shipping code. We are shipping AI agents that evolve, learn, and behave based on prompts, data, and context. And that introduces a new set of engineering realities.

### The Old Contract

Traditional CI/CD pipelines assume:

- Code is deterministic
- Tests define correctness
- Deployments promote immutable artifacts

Those assumptions hold for APIs and microservices.

### The New Reality with AI Systems

AI systems violate the core idea of "deterministic correctness."

| Characteristic | Traditional Software | AI / Agent Systems |
|---|---|---|
| Behavior | Deterministic | Probabilistic |
| Definition of success | Binary pass/fail | Continuous score |
| Changes | Source code edits | Prompt/model/data changes |
| Validation method | Unit tests | Semantic evaluation |
| Risks | Bugs | Hallucination / drift / bias |

Prompts, fine-tuned models, retraining data, and external tool integrations become active code paths — yet they can't be meaningfully validated with unit tests alone.
## Why This Breaks Standard CI/CD

Your current CI/CD system answers only one question: "Did the code pass its tests?" But for an AI agent, that's not enough. You also need to know: "Did the model behave acceptably across metrics that matter?" Without that gate, an AI update that produces worse responses could still deploy perfectly — because the pipeline has no concept of semantic quality.

### The Missing Layer — Evaluation

What testing is to code, evaluation is to AI. It separates experimental prompts from production-ready agents. This leads to the next maturity step: extend your CI/CD platform into an AI Delivery Platform — one that can evaluate, score, and gate agent behavior before deployment.

### What Changes Technically

You don't replace the CI/CD you built. You add a new reusable workflow to the same platform. This new workflow introduces a stage that:

- Runs offline or dataset-based evaluation scripts
- Computes a confidence / quality score
- Blocks deployment if performance falls below threshold

### What This Means Philosophically

- Build pipelines become governance systems
- Platform teams now own evaluation as much as deployment
- Reusable workflows become policies for AI reliability

The same architecture — reusable calls, versioned workflows, staged promotions — continues serving you, but with a new function: safeguarding machine behavior.

## Evaluation as a Gate

Your reusable CI/CD system already enforces two things: code quality, through tests, and deployment consistency, through shared workflows. The next maturity layer is enforcing behavioral quality — ensuring an AI agent performs to a defined standard before it goes live. That's where evaluation pipelines come in.

### The Big Shift

In conventional systems, deployment is gated by pass/fail assertions. For AI systems, you instead gate deployments on scores — accuracy, relevance, factuality, safety, or any quantitative prompt-response metric.
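To make the shift concrete, here is a minimal Python sketch of such a score-based gate. This is a hedged proof of concept: the 0.8 threshold and the eval_result.json artifact name echo this article, while the function names and the random stand-in scorer are purely illustrative.

```python
# Minimal sketch of an evaluation gate (proof of concept).
# score_agent() returns a random value here; in real use it could
# compute accuracy against a dataset, compare responses to a gold
# standard, or call an LLM-as-judge service.
import json
import random

THRESHOLD = 0.8  # minimum acceptable quality score (assumption)

def score_agent() -> float:
    """Placeholder scorer: a random value in [0, 1]."""
    return random.random()

def gate(score: float, threshold: float = THRESHOLD,
         result_path: str = "eval_result.json") -> int:
    """Record the score and return a process exit code for the CI job."""
    # Persist the score so the workflow can upload it as an artifact.
    with open(result_path, "w") as f:
        json.dump({"score": score, "threshold": threshold}, f)
    print(f"evaluation score: {score:.3f} (threshold {threshold})")
    # A non-zero exit code fails the GitHub Actions job, blocking deploy.
    return 0 if score >= threshold else 1

# In CI, the script would end with: raise SystemExit(gate(score_agent()))
```

Because the gate communicates through the process exit code, the workflow needs no special logic: a failing evaluation simply fails the job, and every downstream deploy job is skipped.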
### Reusable Workflow — evaluate-agent.yml

Add this new file to your platform repository.

### Example Evaluation Script — eval.py

This script executes semantic evaluation logic for your agent. As a proof of concept, it produces a random score. In real use, it could compute accuracy against a dataset, compare responses to a gold standard, or call an LLM-based judge service.

### Integrating the New Stage

In your AI app repo (for example, agent-app, or fastapi-app once it evolves into an agent), call the evaluation workflow between build and deploy. This creates a simple but powerful control flow: if eval.py writes a score below 0.8, the pipeline stops immediately — deployment blocked, logs recorded, everything traceable.

### Key Takeaways

| Concept | Description |
|---|---|
| Reusable | Same evaluate-agent workflow can gate hundreds of models |
| Configurable | Each use can override thresholds or metrics |
| Auditable | Evaluation scores logged as build artifacts |
| Safe | Prevents low-performing or biased agents from promotion |

### Beyond Thresholds

Later, you can evolve this into:

- Adaptive thresholds per metric
- Human-in-the-loop approvals for borderline scores
- Trend tracking — scores over time via GitHub Checks or dashboards
- Integration with observability platforms (Azure App Insights, Foundry evaluations, etc.)

## AI Delivery Pipeline + Foundry Integration

So far, you have:

- A unified CI/CD platform powered by reusable GitHub Actions
- Evaluation pipelines that gate AI deployments

Now we expand that architecture into a complete AI Delivery Platform by integrating with Microsoft Foundry.

### The Goal

Combine:

- GitHub Actions ↔ Foundry for seamless build-evaluate-deploy cycles
- Reusable workflows for policies and governance
- Foundry runtime for execution, scaling, and observability of agents

This transforms your CI/CD system into a behavior-driven deployment layer for AI.
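As a reference point before wiring in Foundry, here is a hedged sketch of what the evaluate-agent.yml reusable workflow from the previous section might contain. The eval.py script name comes from this article; the threshold input, job layout, and artifact-upload step are assumptions, not a canonical implementation:

```yaml
name: evaluate-agent
on:
  workflow_call:
    inputs:
      threshold:
        required: false
        type: string
        default: "0.8"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11.9"
      - run: pip install -r requirements.txt
      # eval.py exits non-zero when the score falls below the threshold,
      # which fails this job and blocks any dependent deploy job.
      - run: python eval.py
        env:
          EVAL_THRESHOLD: ${{ inputs.threshold }}
      # Keep the score as an auditable artifact even when the gate fails.
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-result
          path: eval_result.json
```

Exposing the threshold as a workflow_call input is one way to realize the "configurable" property from the Key Takeaways table: each calling repo can tighten or relax its own gate without forking the workflow.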
### Conceptual Flow — Reusable CI/CD Workflows + Foundry Runtime

Your existing ci-platform repo now gains a fourth reusable workflow. Each workflow maps to a Foundry capability:

| Workflow | Foundry Capability | Role |
|---|---|---|
| build.yml | Model packaging & versioning | Creates deployable image |
| evaluate-agent.yml | Evaluation service | Runs offline or dataset-based checks |
| deploy.yml | Agent deployment | Publishes agent to Foundry runtime |
| monitor.yml (additional) | Telemetry | Pulls evaluation metrics post-deploy |

### Example Foundry-Aware Pipeline

In an AI repository (e.g., agent-app), this sequence guarantees that only successfully evaluated agent versions are deployed to Foundry.

### How Foundry Fits In

Microsoft Foundry provides:

- Agent runtime — scalable, managed environment for composable agents
- Evaluation tools — integrate LLM-as-judge, dataset scoring, or automatic benchmarks
- Observability layers — performance metrics, feedback loops, and telemetry
- Orchestration frameworks — connect multiple tools or sub-agents into an ecosystem

GitHub Actions handles delivery logic. Foundry handles AI execution and lifecycle. Together, they form a modular operations stack for AI systems.

### Benefits of Integration

| Benefit | Description |
|---|---|
| Governed deployments | Only evaluated and approved agent versions reach Foundry |
| Traceability | Every deployed agent is linked to a Git commit and eval score |
| Reproducibility | Re-running the pipeline with the same commit reproduces identical behavior |
| Observability | Foundry telemetry pushes real-world feedback back into the platform repo |

### Architecture View

### Governance in Practice

Every deployment is evaluated before release. Every evaluation is logged as metadata in the Actions run. Foundry stores live metrics that can trigger automated re-evaluation workflows downstream. This unifies the DevOps and MLOps worlds under one pipeline.

## Advanced Practices

Integrating evaluation and Foundry is the foundation. True enterprise reliability comes from how you operate and evolve those pipelines over time.
Below are the main practices that transform this setup from “it works” to “it scales safely.”

### 1. Prompt Versioning

In AI systems, prompts are code. A single word change in a prompt can shift an agent’s behavior as much as a logic rewrite does in software. Treat them accordingly:

- Store prompts and configurations in git (/prompts/prompt_v1.txt, prompt_v2.txt).
- Use clear change history — commits = versions.
- Reference prompt versions explicitly in deployment metadata.

Re-runs of an old version must reproduce identical responses; versioned prompts make that possible.

### 2. Experiment Tracking

Track every experiment like you track every deployment.

| Item | Example Format |
| --- | --- |
| Commit SHA | f9a3c2a |
| Prompt version | prompt_v3 |
| Model checkpoint | gpt‑35‑turbo 2024‑06‑01 |
| Dataset revision | dataset_v2 |
| Evaluation score | 0.87 |

Implementation tips:

- Write a short artifact file (experiment.json) in each pipeline run.
- Store it as a workflow artifact or upload it to an experiment tracker (MLflow, Azure ML Experiments, Foundry History).
- You can later analyze how prompt or model changes affect score trends.

This enables data‑driven improvement cycles: evaluate → compare → promote → monitor.

### 3. Rollback Strategies

For deterministic software, rollback = redeploy the previous container. For AI systems you may need to roll back three dimensions:

| Dimension | Example Rollback |
| --- | --- |
| Code | Checkout previous commit |
| Prompt | Revert to earlier prompt file |
| Model | Reuse prior checkpoint or model ID |

Best practice: treat each version triple (code, prompt, model) as one immutable release unit in the pipeline. GitHub tags + evaluation artifacts = auditable rollback point.

### 4. Continuous Evaluation

Evaluation shouldn’t stop at deployment. Integrate post‑deployment monitoring jobs to detect drift. Benefits:

- Detects silent performance drops caused by new data or model API changes.
- Keeps models aligned with their initial standards.
- Creates long‑term confidence for compliance audits.

### 5. Fail Fast, Fail Safe

Configure pipelines such that failure to evaluate = failure to deploy. When in doubt, err on the side of protection. Failures should be logged, retriable, and transparent — never silent. This approach builds institutional trust in AI releases the same way software regression testing built trust in traditional CI/CD.

### 6. Governance by Design

Use GitHub’s native features (branch protections, required reviews, environment rules) as declarative governance. Combine them with Foundry’s policy hooks:

- restrict which teams can promote evaluated agents;
- enforce minimum score thresholds;
- auto‑disable underperforming models.

Governance embedded in code scales better than manual review boards.

### 7. Platform Observability

Push run data into dashboards. Correlate:

- GitHub Actions runs
- Evaluation scores
- Production telemetry from Foundry

Visualization options: Azure Monitor, Power BI, Grafana. Aim for a CI/CD + AI Ops Console view — one pane to observe quality, reliability, and speed.

### Outcome of These Practices

Your organization achieves:

- Consistency across microservices and AI systems
- Accountability through versioned artifacts
- Safety via evaluation gates and drift monitors
- Agility, because updates remain fast but protected

## Enterprise Scenarios

By this point, you’ve built an end‑to‑end platform: standardized CI/CD for apps and agents, reusable GitHub Actions workflows, Azure runtime for reliable deployments, and Foundry‑integrated evaluation gates. Now let’s see how this architecture performs in the wild.

### Scenario 1 — Fifty Microservices, One Consistent Pipeline

Problem Statement: At scale, each microservice team usually maintains a slightly different workflow — fragmented test tools, drift in Python or Node versions, duplicated YAML.

What Goes Wrong:

- Compliance updates require 50 PRs.
- Each team solves build problems differently.
- Security teams can’t easily prove consistency.

Platform Solution: The ci-platform repo defines all workflows once (test‑python.yml, build.yml, deploy.yml).
Every service just calls them through `uses:`. Upgrading the base image or CI version happens once and propagates to all services.

Result:

- Full organization upgrade from Python 3.10 → 3.11 in minutes.
- Consistent quality gates, policies, and artifact naming.
- Reduced cycle time, increased deployment confidence.

### Scenario 2 — Regulated Enterprises (Compliance + Audit)

Problem Statement: Financial, healthcare, and government projects require strict controls:

- Auditable promotion paths
- Approval workflows
- Traceability of versions and changes

What Goes Wrong:

- Manual change reviews are error‑prone.
- Different CI/CD definitions per team produce inconsistent logs.
- Compliance reports take weeks.

Platform Solution:

- GitHub Environments provide built‑in approvals and reviewer rules.
- The same reusable workflows ensure identical build signatures.
- Foundry integration logs evaluation scores and deployment metadata automatically.

Result:

- Reviewers approve through GitHub’s Environment gate — zero custom UI needed.
- Each release carries an immutable commit ID + evaluation score + approvers record.
- Audit reports generate directly from pipeline history.

### Scenario 3 — AI‑Driven Customer Support Platform

Problem Statement: A company running GPT‑powered customer support agents wants to continuously improve responses without risking live quality drops.

What Goes Wrong:

- Prompt changes can silently worsen accuracy.
- Model updates impact intent coverage.
- It is hard to correlate user feedback with deployment versions.

Platform Solution:

- Add evaluate-agent.yml into the same CI/CD chain.
- Feed evaluation datasets that cover FAQs and tone guidelines.
- Require a minimum score ≥ 0.85 for promotion.
- Deploy via Foundry to production clusters once the threshold is met.
- Stream Foundry telemetry → GitHub → Power BI for quality dashboards.

Result:

- Continuous prompt experimentation without sacrificing quality.
- Regressed builds automatically blocked.
- Business stakeholders track AI accuracy as a live metric.
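The thin-consumer pattern these scenarios rely on, every service calling the platform through `uses:`, can be sketched as a caller workflow. The repository path reuses the ns-github-design/ci-platform name from this series, but the workflow inputs (`python-version`, `image-name`, `environment`) and the `@v1.0` tag are assumptions; adapt them to your platform's actual contract.

```yaml
# Hypothetical caller workflow in a service repo (.github/workflows/ci.yml).
# All delivery logic lives in the ci-platform repo; this file only wires inputs.
name: service-ci
on:
  push:
    branches: [main]

jobs:
  test:
    uses: ns-github-design/ci-platform/.github/workflows/test-python.yml@v1.0
    with:
      python-version: "3.11"

  build:
    needs: test
    uses: ns-github-design/ci-platform/.github/workflows/build.yml@v1.0
    with:
      image-name: my-service
    secrets: inherit

  deploy:
    needs: build
    uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1.0
    with:
      environment: production
    secrets: inherit
```

Because the caller pins a tag (`@v1.0`), the platform team can ship fixes behind a moving tag or release new majors without breaking consumers.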
## Bonus Scenario — Enterprise AI R&D Platform

Multiple research teams train models on‑prem or in Azure ML. The central engineering platform exposes build, evaluate, and deploy steps as reusable workflows.

- Data scientists → run “evaluate‑agent” without touching infra.
- Platform engineers → control policies, thresholds, approvals.
- Leadership → gets consistent reporting on AI performance and cost.

This creates a single standard for AI lifecycle governance across business units.

## Summary

Your platform now supports:

| Area | Traditional Dev | AI Adaptation |
| --- | --- | --- |
| Build & Test | Reusable workflows (Services) | Evaluation gate (Agents) |
| Deploy | Container Apps / GitHub Environments | Foundry + Telemetry Feedback |
| Governance | Environment approval rules | Evaluation threshold + human review |
| Scaling | One repo per service | One platform per organization |

Across these cases, the core pattern holds: centralize workflow logic, decentralize application logic, unify governance.

## 14 — Conclusion

What began as a simple effort to clean up a few duplicated YAML files evolved into a complete delivery platform architecture — one that treats pipelines as first‑class products and extends their usefulness into the era of AI‑driven systems.

### From Pipelines to Platforms

At first, you built reusable workflows in a shared repository. That small structural change produced an outsized effect:

- Reduced maintenance and drift
- Consistent security and compliance
- One‑click upgrades across every service

You proved that pipeline logic belongs in its own product — a CI/CD platform.

### From Deterministic to Intelligent Delivery

Then the domain changed. Deterministic services gave way to AI agents. You responded by extending the same reusable platform into the AI dimension:

- Added evaluate-agent.yml for semantic scoring
- Introduced Foundry as the runtime for intelligent components
- Unified evaluation, governance, and deployment under the same contracts

The underlying philosophy remained identical: don’t duplicate delivery logic — standardize it.
### The Broader Pattern

This architecture expresses a clear maturity pathway:

| Stage | What Changes | Technical Lever |
| --- | --- | --- |
| CI/CD as Automation | Build pipelines per project | YAML and Actions |
| CI/CD as Product | Reusable workflows, shared logic | Platform Repo |
| CI/CD as Governance | Environments, approvals, tracking | GitHub Environments + Azure |
| AI Delivery Platform | Evaluation + behavioral policy | Foundry Integration |

Every step adds structure, traceability, and scale, without sacrificing developer velocity.

### Cultural Impact

Moving to a platform model does more than streamline releases. It elevates DevOps to a product discipline:

- Platform engineers design contracts, not scripts.
- Application teams consume delivery APIs, not ad‑hoc builds.
- AI teams get reliable evaluation and rollback mechanisms.

In short: velocity meets governance.

### The Next Frontier

As this pattern matures, two frontiers are emerging:

- Autonomous Evaluation — agents that assess other agents in continuous feedback loops.
- Dynamic Policy Enforcement — pipelines that adjust deployment thresholds and configurations in real time based on observed performance.

The foundations you’ve built — centralized workflows, evaluation gates, and Foundry integration — already support that trajectory. CI/CD maturity is not about writing workflows; it’s about designing reusable systems of workflows. What you’ve built is more than CI/CD. It’s a platform that defines how modern software and AI move from idea to production safely.

## 15 — What’s Next

You’ve gone from writing pipelines to designing platforms. The CI/CD model you created now governs the lifecycle of both microservices and AI agents — and it’s only the beginning.
### Step 1 — Publish Your Platform

Make both repositories public (read‑only) so others can learn from the pattern:

- ns-github-design/ci-platform — your reusable workflow product
- ns-github-design/fastapi-app — your minimal consumer example

Tag the current stable version as v1.0 in both repos, and add concise READMEs explaining purpose, usage, and version policy. This turns your repos into live documentation — a working reference architecture.

### Step 2 — Add Automated Docs and Visuals

Export your Draw.io architecture to SVG and embed it in each README. Use GitHub Pages or Docsify to render a small site explaining:

- platform repo overview;
- how workflow_call works;
- how to set up Azure auth;
- example runs and outputs.

Readers love code + architecture in one place.

### Step 3 — Extend to AI Agents

Add a third demo: agent-evaluator — a lightweight agent that runs eval.py and demonstrates the evaluation gate. In that repo:

- Call evaluate-agent.yml from your platform.
- Push commits that sometimes fail thresholds.
- Show screenshots of blocked vs. approved runs.

You’ll have a fully working AI evaluation demo powered by your platform.

### Step 4 — Instrument Foundry Feedback

Use Foundry’s APIs to stream live evaluation results or observability data back into GitHub Actions artifacts:

```yaml
- name: Collect Foundry feedback
  run: foundry metrics export --project my-ai-agent --output metrics.json
```

That feedback loop will let you build dashboards of quality trends alongside the deployment timeline.

### Step 5 — Prepare Part 3 (Next Blog)

You now have a natural foundation for the next article: “Autonomous Delivery Loops: Continuous Evaluation and Guardrails for AI Agents.” Outline:

- Continuous evaluation with scheduled runs
- Self‑healing approval flows
- Dynamic policy adjustment based on metrics
- Cross‑team Governance as Code

That installment makes your series visionary and future‑ready.
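The "continuous evaluation with scheduled runs" item in that outline can build directly on the existing evaluate-agent.yml. Below is a sketch, assuming the platform workflow accepts a `threshold` input and that a `v1.0` tag exists; the cron schedule and input names are illustrative.

```yaml
# Hypothetical scheduled workflow: re-runs the evaluation gate nightly
# against the currently deployed agent to detect silent drift.
name: continuous-evaluation
on:
  schedule:
    - cron: "0 2 * * *"   # every night at 02:00 UTC
  workflow_dispatch: {}    # allow manual re-evaluation on demand

jobs:
  evaluate:
    uses: ns-github-design/ci-platform/.github/workflows/evaluate-agent.yml@v1.0
    with:
      threshold: 0.8
    secrets: inherit
```

A failing nightly run does not roll anything back by itself, but it creates the failed check and logged score that downstream re-evaluation or alerting workflows can act on.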
## Quick Recap

| Phase | Achievement |
| --- | --- |
| 1–4 | Built CI/CD Platform + App Repo |
| 5 | Configured Azure + OIDC |
| 6 | Verified Pipeline End‑to‑End |
| 8–15 | Documented Demo → AI Integration → Enterprise Practices → Vision |

You now have a complete blog series that is technically deep, architecturally distinctive, and demonstrably real. Every diagram, YAML file, and code sample came from a working, reproducible system — the hallmark of strong engineering writing.

## Final Thought

Software delivery used to end at deployment. AI delivery begins there. The future of platforms is not just to ship software faster — but to ensure that every agent behaves as designed.

# Migrating On-prem Windows & Linux VMs to Azure Confidential Virtual Machines via Azure Migrate
## 1. Executive Summary

Enterprise cloud adoption increasingly prioritizes trust boundaries that extend beyond traditional infrastructure isolation. While encryption at rest and in transit are foundational, modern organizations must also ensure that data in use (data actively processed in CPU or system memory) remains protected. Azure Confidential Computing (ACC) mitigates emerging threats by enabling hardware-backed Trusted Execution Environments (TEEs). These environments isolate VM memory, CPU state, and I/O paths from Azure’s hypervisor, host operating system, and even privileged Azure administrators.

Azure Confidential Virtual Machines (CVMs) bring ACC to general-purpose workloads without requiring application modification, providing:

- Memory encryption (per-VM keys)
- Isolation from the hypervisor and cloud fabric
- Secure VM boot with platform attestation
- Cryptographically enforced key release from Azure Managed HSM
- Lift-and-shift compatibility using Azure Migrate

This whitepaper offers a complete lifecycle framework for secure migration, including governance models, deep technical implementation guidance, and operational readiness.

## 2. Business Drivers & Compliance Alignment

### 2.1 Risk & Threat Landscape

| Threat Category | Scenario | Traditional VM Protection | CVM Protection |
| --- | --- | --- | --- |
| Hypervisor compromise | Host OS breach | ❌ | ✔ Isolated TEE |
| Privileged insider | Cloud admin access to guest memory | ❌ | ✔ SEV-SNP/TDX isolation |
| DMA attacks | PCIe-level memory scraping | ❌ | ✔ Memory encrypted in hardware |
| Supply-chain compromise | Pre-boot firmware tampering | ⚠️ | ✔ Attestation-gated boot |
| Side-channel attacks | Spectre-like memory leakage | ⚠️ | ✔ Strong hardware isolation |

### 2.2 Business Outcomes

- Strongest possible protection for mission-critical workloads
- Accelerates regulated workload migration
- Supports Zero Trust goals: assume breach, verify explicitly
- Reduces privileged-access risk and insider threat profiles
## 3. Solution Architecture Overview

### 3.1 End-to-End Architecture Diagram

The diagram represents an end-to-end architecture for migrating workloads from an on-premises environment to Azure using Azure Migrate, with a strong focus on security and confidentiality. Each section is explained below.

On-Premises Environment:

- Windows Servers
- Linux Servers

These are your existing workloads that need to be migrated.

Azure Migrate Appliance:

- Acts as a bridge between on-premises servers and Azure.
- Uses a private connection for secure data transfer.

Azure Landing Zone — the target environment in Azure where migrated workloads will reside. It includes:

- Private endpoints for:
  - Azure Migrate — migration orchestration.
  - Cache Storage Account (Blob) — temporary storage for replication data.
  - Managed HSM (Hardware Security Module) — cryptographic key management.
- Private DNS Zones:
  - privatelink.blob.core.windows.net
  - privatelink.managedhsm.azure.net

These ensure name resolution for private endpoints without exposing them publicly.

Migration Workflow:

1. Azure Migrate project: discover on-premises servers and replicate workloads to Azure.
2. Cached replication data → private Blob storage: replication data is stored securely in a private blob before cutover.
3. Test migration: performed in an isolated VNet to validate functionality before production cutover.
4. Production cutover: migrated workloads run as Confidential VMs in Azure.

Security Enhancements:

- SEV-SNP or TDX TEE: hardware-based Trusted Execution Environments for isolation.
- Confidential OS + data disk via a DES HSM key: ensures encryption and integrity.
- Attestation-gated boot via Managed HSM: verifies VM integrity before booting.
## 4. Azure Components

| Category | Component | Purpose |
| --- | --- | --- |
| Migration | Azure Migrate Appliance | Discovery, replication, orchestration |
| Compute | Confidential VM (SEV-SNP/TDX) | Secure execution environment |
| Security | Managed HSM | CMK storage & attestation-gated key release |
| Storage | Cache Storage Account | Replication staging via private endpoint |
| Encryption | Disk Encryption Sets | CMK-bound OS/data disk encryption |
| Networking | Private Endpoints & Private DNS | Fully private transport |
| Identity | Confidential VM Orchestrator | Validates attestation to enable boot |

## 5. Confidential VM Requirements

### 5.1 Hardware Requirements

AMD SEV-SNP (DCasv6, ECasv6):

- Memory encryption with per-VM keys
- Nested page table protection
- RMP validation preventing host tampering
- Guest attestation report with measurement register integrity

Intel TDX (DCesv6, ECesv6):

- Encryption + integrity-protected guest memory
- Hardware-isolated module to validate TEE launch
- Boot measurement and module verification

### 5.2 VM Configuration Requirements

- Generation 2 (Gen2) virtual machine
- UEFI + Secure Boot
- vTPM enabled
- Confidential VM security type enabled via Azure Migrate or ARM templates

### 5.3 Disk Requirements

- The OS disk must be a confidential disk.
- Data disks encrypted via a Disk Encryption Set (DES)
- DES bound to RSA-HSM keys
- Managed HSM with purge protection
- Key Release Policy requiring attestation
- All Confidential VM disks must be Premium, which is required for performance and compatibility with confidential disk encryption.

## 6. End-to-End Migration Framework

A nine-phase sequential model aligned with CAF, Azure architecture best practices, and enterprise migration standards.

### Phase 1: Azure Migrate — Connectivity, Private Endpoints & DNS

Azure Migrate requirements and setup — prerequisites:

- Azure subscription with Contributor/Owner access
- Resource group for the Azure Migrate project and resources

Replication appliance prerequisites: deploy Windows Server 2022 as the replication appliance.
| Component | Requirement |
| --- | --- |
| CPU cores | 16 |
| RAM | 32 GB |
| Number of disks | 2, including the OS disk (80 GB) and a data disk (620 GB) |

Setup Steps:

1. Deploy the Azure Migrate appliance on-premises.
2. Register the appliance with the Azure Migrate project.
3. Discover on-premises VMs (Windows/Linux). Click Discover → choose a discovery method:
   - Agent-based: install the Azure Migrate agent on the source VMs.
   - Agentless (vSphere/Hyper-V): use credentials to discover VMs.
4. Ensure all VMs to be migrated are discovered.
5. Click Assess → configure the assessment:
   - Target VM size: choose Confidential VM-compatible sizes for CVMs.
   - Target Azure region.
   - Disk recommendations: Premium SSD or Premium SSD v2 for CVMs.
6. Validate connectivity to private endpoints, including cache storage accounts and Managed HSM.
7. Cache storage account:
   - Cache storage accounts can use ZRS for redundancy.
   - If ASR replication is required, use a separate LRS cache storage account.
   - All storage must be private endpoint-enabled and encrypted with CMKs from Azure Managed HSM.
8. Verify the VMs appearing in the Azure Migrate project are ready for replication.

Required Private Endpoints:

| Service | Endpoint Requirement |
| --- | --- |
| Azure Migrate | Yes |
| Cache Storage Account | Yes (Blob PE only) |
| Managed HSM | Yes |

Private DNS Zones:

- privatelink.blob.core.windows.net
- privatelink.managedhsm.azure.net
- privatelink.azurewebsites.net

Connectivity Requirements:

- ExpressRoute or Site-to-Site VPN
- No public endpoints allowed
- The Azure Migrate appliance must resolve all private FQDNs

### Phase 2: OS Readiness Assessment

Windows Workloads — MBR to GPT validation:

```
C:\Windows\System32>MBR2GPT.exe /validate /allowFullOS
```

Requirements:

- No dynamic disks
- VSS and WinRM operational
- Drivers must support Gen2 migration
- OS disk ≤ 128 GB

Validation commands:

```powershell
Get-Volume
Get-PhysicalDisk
Confirm-SecureBootUEFI
```

(Note: the Secure Boot check uses Confirm-SecureBootUEFI; Secure Boot is not a Windows optional feature.)

Linux Workloads — requirements:

- UUIDs used in /etc/fstab
- Avoid multi-PV LVM expansion across disks
- Ensure the kernel supports SEV-SNP or TDX
- Ensure UEFI bootloader integrity

Validation commands:

```bash
lsblk
blkid
cat /etc/fstab
dmesg | grep -i sev
```

### Phase 3: Network Security & Firewall Matrix

| Source | Destination | Port(s) | Direction | Purpose |
| --- | --- | --- | --- | --- |
| On-prem servers | Migrate appliance | 443, 9443 | Outbound | Discovery & agentless replication |
| Appliance | Windows VMs | 5985 | Outbound | WinRM |
| Appliance | Linux VMs | 22 | Outbound | SSH |
| Appliance | Cache storage | 443 | Outbound | Replication writes |
| Appliance | Azure Migrate | 443 | Outbound | Control-plane operations |

All connections route via private endpoints.

### Phase 4: CMK Encryption & Managed HSM Governance

Managed HSM creation:

- Enable purge protection
- Configure RBAC-only access
- Disable all public access

Key creation:

```shell
az keyvault key create --exportable true --hsm-name <HSM> --kty RSA-HSM --name cvmKey --policy "./public_SKR_policy.json"
```

Disk Encryption Set (DES) creation:

```shell
az disk-encryption-set create --name <DES> --resource-group <RG> --key-url <HSM Key URL> --identity-type SystemAssigned
```

Role assignment to DES:

- Managed HSM Crypto Service Encryption User
- Key Release Policy requiring attestation

### Phase 5: Confidential VM Orchestrator (CVO)

The Confidential VM Orchestrator is a built-in Azure service principal used by Azure Compute to securely manage disk encryption keys for Confidential VMs (CVMs). During boot, it validates the VM’s attestation evidence (SEV-SNP or TDX) and requests that the Managed HSM release the disk encryption key only to a verified CVM. It requires only Managed HSM Crypto Service Encryption User permissions. This ensures that customer-managed keys (CMKs) are released exclusively to attested CVMs and never to the hypervisor or platform operators.

Responsibilities:

- Validate the Trusted Execution Environment (TEE) measurement.
- Approve or deny key release based on attestation.
- Enforce cryptographic linkage between the VM and the HSM key, ensuring keys are only accessible to legitimate CVMs.
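The `public_SKR_policy.json` referenced in the Phase 4 key-creation command is a Secure Key Release policy that gates release on attestation claims. The sketch below follows the common Azure-compliant-CVM pattern, but the attestation authority URL and claim values are illustrative assumptions; use the shared attestation endpoint for your region.

```json
{
  "version": "1.0.0",
  "anyOf": [
    {
      "authority": "https://sharedeus.eus.attest.azure.net",
      "allOf": [
        {
          "claim": "x-ms-compliance-status",
          "equals": "azure-compliant-cvm"
        }
      ]
    }
  ]
}
```

In effect, the HSM will only release `cvmKey` to a caller that presents an attestation token from the listed authority asserting the VM is a compliant Confidential VM.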
Identity setup:

```powershell
New-MgServicePrincipal -AppId bf7b6499-ff71-4aa2-97a4-f372087be7f0
```

Role assignment:

```shell
az keyvault role assignment create --hsm-name <HSM> --assignee <CVO ID> --role "Managed HSM Crypto Service Release User" --scope /keys
```

### Phase 6: Replication Enablement (Credential-Less)

Configuration steps:

1. Go to the Azure portal → search for Azure Migrate.
2. Select your Azure Migrate project.
3. Navigate to Replicate.
4. Select credential-less replication.
5. Choose the target subscription and resource group.
6. Select a Confidential VM-compatible size for the VMs.
7. Assign Disk Encryption Sets (DES) for each disk.
8. Validate private endpoint connectivity to ensure replication can access the target subnet securely.
9. Begin initial sync + delta replication.

All OS/data disks for CVMs must be Premium SSD or Premium SSD v2.

### Phase 7: Test Migration (Isolated Validation)

Validation checklist:

- VM boots successfully without intervention
- CVM security type = Confidential
- CMK encryption applied on all disks
- Attestation logs verified on first boot
- Applications tested and functional
- No unexpected public endpoints
- NIC, routing, NSGs, UDRs verified

### Phase 8: Production Cutover

Cutover sequence:

1. Announce downtime.
2. Freeze transactions.
3. Run Planned Failover.
4. Validate immediately: boot integrity, disk encryption, Guest Attestation Extension, and security type is Confidential.
5. Switch application traffic.
6. Decommission source systems.

### Phase 9: Post-Migration Hardening & Governance

Azure Policy enforcement:

- Allowed VM SKUs → CVM only
- Enforce CMK-only disk encryption
- Deny public IP creation
- Require private endpoints
- Restrict Managed HSM access

Logging and monitoring:

- Managed HSM logs
- Attestation logs
- Azure Monitor
- Defender for Cloud (CVM coverage)
- Microsoft Sentinel (optional)

Operational governance:

- HSM key rotation schedule
- Quarterly attestation validation
- DES lifecycle management
- Zero-trust identity auditing
- “Break glass” procedure definition
## 7. Confidential VM Limitations & Workarounds

OS disk size limit:

- Confidential disk encryption is supported only for OS disks at this stage; data disks are not supported.
- Confidential disk encryption with CMK is not supported for disks larger than 128 GB.

Workaround:

1. Perform the migration using server-side encryption (SSE) with Platform-Managed Keys (PMK).
2. Stop and deallocate the VM post-migration.
3. Update the encryption settings of the OS disk to use SSE with a Disk Encryption Set (DES) using CMK.

Operating system support:

- Windows Server 2019 and later supported
- RHEL 9.4 and later supported
- Ubuntu 22.04+ supported (depending on SKU)

For the full list, check the CVM OS Support Matrix. For additional details on limitations, please refer to the CVM Limitations documentation.

## 8. Conclusion

Azure Confidential Virtual Machines represent a generational shift in cloud security, providing encryption, isolation, and attestation at the hardware boundary. Combined with Azure Migrate, DES/CMK encryption, Managed HSM, private networking, and robust governance, enterprises can securely modernize mission-critical workloads without application rewrites.

# Building Reusable Custom Images for Azure Confidential VMs Using Azure Compute Gallery
## Overview

Azure Confidential Virtual Machines (CVMs) provide hardware-enforced protection for sensitive workloads by encrypting data in use using AMD SEV-SNP technology. In enterprise environments, organizations typically need to:

- Create hardened golden images
- Standardize baseline configurations
- Support both Platform Managed Keys (PMK) and Customer Managed Keys (CMK)
- Version and replicate images across regions

This guide walks through the correct and production-supported approach for building reusable custom images for Confidential VMs using:

- PowerShell (Az module)
- Azure Portal
- Disk Encryption Sets (CMK)
- Azure Compute Gallery

## Key Design Principles

Before diving into implementation steps, it is important to clarify two architectural truths that become clear during real-world implementations:

### ✅ 1️⃣ The Same Image Supports PMK and CMK

The encryption model (PMK vs CMK) is not embedded in the image. Encryption is applied:

- At VM deployment time
- Through disk configuration (default PMK, or a Disk Encryption Set for CMK)

This means you build one golden image and deploy it using PMK or CMK depending on compliance requirements, which simplifies lifecycle management significantly.

### ✅ 2️⃣ Confidential VM Image Versions Must Use a Source VHD

When publishing to Azure Compute Gallery, Confidential VMs require a source VHD (a mandatory platform requirement for the Confidential security type). Therefore, the correct workflow is:

1. Deploy a base Confidential VM
2. Harden and configure
3. Generalize
4. Export the OS disk as a VHD
5. Upload to storage
6. Publish to Azure Compute Gallery
7. Deploy using PMK or CMK

## Security Stack Breakdown

| Protection Area | Technology |
| --- | --- |
| Data in Use | AMD SEV-SNP |
| Boot Integrity | Secure Boot + vTPM |
| Image Lifecycle | Azure Compute Gallery |
| Disk Encryption | PMK or CMK |
| Compliance Control | Disk Encryption Set (CMK) |

## Implementation Steps

### 🖥️ Step 1 — Deploy a Base Windows Confidential VM

This VM will serve as the image builder.
Key Requirements:

- Gen2 image
- Confidential SKUs (such as the DCasv5 or ECasv5 series)
- SecurityType = ConfidentialVM
- Secure Boot enabled
- vTPM enabled
- Confidential OS encryption enabled

Reference code snippet (PowerShell):

```powershell
$rg = "rg-cvm-gi-pr-sbx-01"
$location = "NorthEurope"
$vmName = "cvmwingiprsbx01"

New-AzResourceGroup -Name $rg -Location $location
$cred = Get-Credential

$vmConfig = New-AzVMConfig `
    -VMName $vmName `
    -VMSize "Standard_DC2as_v5" `
    -SecurityType "ConfidentialVM"

$vmConfig = Set-AzVMOperatingSystem `
    -VM $vmConfig `
    -Windows `
    -ComputerName $vmName `
    -Credential $cred

$vmConfig = Set-AzVMSourceImage `
    -VM $vmConfig `
    -PublisherName "MicrosoftWindowsServer" `
    -Offer "WindowsServer" `
    -Skus "2022-datacenter-azure-edition" `
    -Version "latest"

$vmConfig = Set-AzVMOSDisk `
    -VM $vmConfig `
    -CreateOption FromImage `
    -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithPlatformKey"

New-AzVM -ResourceGroupName $rg -Location $location -VM $vmConfig
```

📸 Reference Screenshots

### 🔧 Step 2 — Harden and Customize the OS

This is where you:

- Install monitoring agents
- Install Defender for Endpoint
- Apply the CIS baseline
- Install security agents
- Remove unwanted services
- Install application dependencies

This is your enterprise golden baseline, depending on the individual organization's requirements.

### 🔄 Step 3 — Generalize the Windows Confidential VM (Production-Ready Method)

Confidential VMs often enable BitLocker automatically, and improper Sysprep handling can cause failures. Generalizing a Windows Confidential VM properly is critical to avoid:

- Sysprep failures
- BitLocker conflicts
- Image corruption
- Deployment errors later

Follow these steps carefully inside the VM and later through Azure PowerShell.

#### 1. Remove the Panther Folder

The Panther folder stores logs from previous Sysprep operations. If leftover logs exist, Sysprep can fail. This safely removes old Sysprep metadata:

```
rd /s /q C:\Windows\Panther
```

✔ This step prevents common “Sysprep was not able to validate your Windows installation” errors.

#### 2. Run Sysprep

Navigate to the Sysprep directory and run the sysprep command:

```
cd %windir%\system32\sysprep
sysprep.exe /generalize /shutdown
```

Parameters explained:

| Parameter | Purpose |
| --- | --- |
| /generalize | Removes machine-specific info (SID, drivers) |
| /shutdown | Powers off the VM after completion |

⚠️ Handling BitLocker issues (common in Confidential VMs): Confidential VMs may automatically enable BitLocker. If Sysprep fails due to encryption, follow the next steps to resolve the issue and execute Sysprep again.

#### 3. Check BitLocker Status & Turn It Off

```
manage-bde -status
```

If Protection Status is “Protection On”:

```
manage-bde -off C:
```

Wait for decryption to complete fully. ⚠️ Do not run Sysprep again until decryption reaches 100%.

#### 4. Reboot and Run Sysprep Again

After decryption completes:

1. Reboot the VM.
2. Open Command Prompt as Administrator.
3. Navigate to the Sysprep folder and run the sysprep command:

```
cd %windir%\system32\sysprep
sysprep.exe /generalize /shutdown
```

✔ The VM will shut down automatically.

#### 5. Mark the VM as Generalized in Azure

Now switch to Azure PowerShell:

```powershell
Stop-AzVM -Name $vmName -ResourceGroupName $rg -Force
Set-AzVM -Name $vmName -ResourceGroupName $rg -Generalized
```

✔ This marks the VM as ready for image capture.

#### 🧠 Why These Extra Steps Matter in Confidential VMs

Confidential VMs differ from standard VMs because they use a vTPM, may auto-enable BitLocker, enforce Secure Boot, and use Gen2 images. Improper handling can cause:

- Sysprep failures
- Image capture errors
- “VM provisioning failed” issues when deploying from the image

These cleanup steps dramatically increase the success rate.

### 💾 Step 4 — Export the OS Disk as a VHD

Gallery image definitions with security type 'TrustedLaunchAndConfidentialVmSupported' require a source VHD, as support for a source image VM is not available.

- Generate a SAS URL for the OS disk of the virtual machine.
- Copy it to the storage account as a .vhd file.
- Use Get-AzStorageBlobCopyState to validate the copy status and wait for completion.

```powershell
$vm = Get-AzVM -Name $vmName -ResourceGroupName $rg
$osDiskName = $vm.StorageProfile.OsDisk.Name

$sas = Grant-AzDiskAccess `
    -ResourceGroupName $rg `
    -DiskName $osDiskName `
    -Access Read `
    -DurationInSecond 3600

$storageAccountName = "stcvmgiprsbx01"
$storageContainerName = "images"
$destinationVHDFileName = "cvmwingiprsbx01-OsDisk-VHD.vhd"

$destinationContext = New-AzStorageContext -StorageAccountName $storageAccountName
Start-AzStorageBlobCopy -AbsoluteUri $sas.AccessSAS -DestContainer $storageContainerName -DestContext $destinationContext -DestBlob $destinationVHDFileName
Get-AzStorageBlobCopyState -Blob $destinationVHDFileName -Container $storageContainerName -Context $destinationContext
```

(Note: the final command uses the `$destinationContext` defined above.)

### 🏢 Step 5 — Create the Azure Compute Gallery & Image Version

Instead of creating a standalone managed image, we will:

- Create an Azure Compute Gallery
- Create an image definition
- Publish a gallery image version from the generalized Confidential VM

This enables:

- Versioning
- Regional replication
- Staged rollouts
- Enterprise image lifecycle management

#### 1. Create the Azure Compute Gallery

```powershell
$galleryName = "cvmImageGallery"

New-AzGallery `
    -GalleryName $galleryName `
    -ResourceGroupName $rg `
    -Location $location `
    -Description "Confidential VM Image Gallery"
```
Create Image Definition for Windows Confidential VM Important settings: OS State = Generalized OS Type = Windows HyperV Generation = V2 Security Type = TrustedLaunchAndConfidentialVmSupported $imageDefName = "img-win-cvm-gi-pr-sbx-01" $ConfidentialVMSupported = @{Name='SecurityType';Value='TrustedLaunchAndConfidentialVmSupported'} $Features = @($ConfidentialVMSupported) New-AzGalleryImageDefinition ` -GalleryName $galleryName ` -ResourceGroupName $rg ` -Location $location ` -Name $imageDefName ` -OsState Generalized ` -OsType Windows ` -Publisher "prImages" ` -Offer "WindowsServerCVM" ` -Sku "2022-dc-azure-edition" ` -HyperVGeneration V2 ` -Feature $features ✔ HyperVGeneration must be V2 for Confidential VMs. 📸 Reference Screenshot 3. Create Gallery Image Version from Generalized VM Now publish version 1.0.0 from the generalized VM OS Disk VHD to the Image Definition: There is no support for performing this step using Azure PowerShell, hence the Azure Portal needs to be used Ensure the right network and RBAC access on the storage account is in place Replication can be enabled on the Image Version to multiple regions for enterprises ✅ Why Azure Compute Gallery is the Right Choice Feature Managed Image Azure Compute Gallery Versioning ❌ ✅ Cross-region replication ❌ ✅ Enterprise lifecycle Limited Full Recommended for production ❌ ✅ For enterprise confidential workloads, Azure Compute Gallery is strongly recommended. 🚀 Step 6 – Deploy Confidential VM from Gallery Image 🔹 Using PMK (Default) If you do not specify a Disk Encryption Set, Azure uses Platform Managed Keys automatically. 
```powershell
$imageId = (Get-AzGalleryImageVersion `
    -GalleryName $galleryName `
    -GalleryImageDefinitionName $imageDefName `
    -ResourceGroupName $rg `
    -Name "1.0.0").Id

$vmConfig = New-AzVMConfig `
    -VMName "cvmwingiprsbx02" `
    -VMSize "Standard_DC2as_v5" `
    -SecurityType "ConfidentialVM"

$vmConfig = Set-AzVMOSDisk `
    -VM $vmConfig `
    -CreateOption FromImage `
    -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithPlatformKey"

$vmConfig = Set-AzVMSourceImage -VM $vmConfig -Id $imageId
$vmConfig = Set-AzVMOperatingSystem -VM $vmConfig -Windows -ComputerName "cvmwingiprsbx02" -Credential (Get-Credential)

New-AzVM -ResourceGroupName $rg -Location $location -VM $vmConfig
```

🔹 Using CMK (Same Image!)

If compliance requires CMK:

- Create a Disk Encryption Set
- Associate it with Key Vault or Managed HSM
- Attach the DES during deployment

```powershell
$vmConfig = Set-AzVMOSDisk `
    -VM $vmConfig `
    -CreateOption FromImage `
    -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithCustomerKey" `
    -DiskEncryptionSetId $des.Id
```

✔ Same image
✔ Different encryption model
✔ Encryption applied at deployment

🔎 Validation

Check Confidential VM security:

```powershell
Get-AzVM -Name "cvmwingiprsbx02" -ResourceGroupName $rg | Select SecurityProfile
```

Check disk encryption:

```powershell
Get-AzDisk -ResourceGroupName $rg
```

Architectural Summary

- Confidential VM security is independent of the disk encryption model
- The encryption choice is applied at deployment
- One image supports multiple compliance models
- A source VHD is required for Confidential VM gallery publishing
- Azure Compute Gallery enables enterprise lifecycle management

PMK vs CMK Decision Matrix

| Scenario | Recommended Model |
|---|---|
| Standard enterprise workloads | PMK |
| Financial services / regulated | CMK |
| BYOK requirement | CMK |
| Simplicity prioritized | PMK |

🏢 Enterprise Recommendations

✔ Always use Azure Compute Gallery
✔ Use semantic versioning (1.0.0, 1.0.1)
✔ Automate using Azure Image Builder
✔ Enforce Confidential VM via Azure Policy
✔ Enable Guest Attestation
✔ Monitor with Defender for Cloud

Final Thoughts

Creating custom images for Azure Confidential VMs allows organizations to combine the security benefits of Confidential Computing with the operational efficiency of standardized deployments. By baking security baselines, monitoring agents, and required configurations directly into a golden image, every new VM starts from a consistent and trusted foundation.

A key advantage of this approach is flexibility. The custom image itself is independent of the disk encryption model, meaning the same image can be deployed using Platform Managed Keys (PMK) for simplicity or Customer Managed Keys (CMK) to meet stricter compliance requirements. This allows platform teams to maintain a single image pipeline while supporting multiple security scenarios.

By publishing images through Azure Compute Gallery, organizations can version, replicate, and manage their Confidential VM images more effectively. Combined with proper VM generalization and hardening practices, custom images become a reliable way to ensure secure, consistent, and scalable deployments of confidential workloads in Azure.

As Confidential Computing continues to gain adoption across industries handling sensitive data, investing in a well-designed custom image pipeline will enable organizations to scale securely while maintaining consistency, compliance, and operational efficiency across their cloud environments.

Proactive Resiliency in Azure for Specialized Workload i.e. Citrix VDI on Azure Design Framework
In this post, I’ll share my perspective on designing cloud architectures for near-zero downtime. We’ll explore how adopting multi-region strategies and other best practices can dramatically improve reliability. The discussion is technically and architecturally driven, covering key decisions around network architecture, data replication, user experience continuity, and cost management, while also touching on the business angle of why this matters. The goal is to inform and inspire you to strengthen your own systems, and to guide you toward concrete actions such as engaging with Microsoft Cloud Solution Architects (CSAs), submitting workloads for resiliency reviews, and embracing multi-region design patterns.

Resilience as a Shared Responsibility

One fundamental truth in cloud architecture is that ensuring uptime is a shared responsibility between the cloud provider and you, the customer. Microsoft is responsible for the reliability of the cloud; in other words, we build and operate Azure’s core infrastructure to be highly available. This includes the physical datacenters, network backbone, power/cooling, and built-in platform features for redundancy. We also provide a rich toolkit of resiliency features (think availability sets, Availability Zones, geo-redundant storage, service failover capabilities, backup services, etc.) that you can leverage to increase the reliability of your workloads.

However, reliability in the cloud of your specific applications and data is up to you. You control your application architecture, deployment topology, data replication, and failover strategies. If you run everything in a single region with no backups or fallbacks, even Azure’s rock-solid foundation can’t save you from an outage. On the other hand, if you architect smartly (using multiple regions, zones, and Azure resiliency features properly), you can achieve end-to-end high availability even through major platform incidents.
In short: Microsoft ensures the cloud itself is resilient, but you must design resilience into your workload. It’s a true partnership, one where both sides play a critical role in delivering robust, continuous services to end users. I emphasize this because it sets the mindset: proactive resiliency is something we do with our customers. As you’ll see, Microsoft has programs and people (like CSAs) dedicated to helping you succeed in this shared model.

Six Layers of Resilient Cloud Architecture for Citrix VDI Workloads

To systematically approach multi-region resiliency, it helps to break the problem down into layers. In my work, I arrived at a six-layer decision framework for designing resilient architectures. This was originally developed for a global Citrix DaaS deployment on Azure (hence some VDI flavor in the examples), but the principles apply broadly to cloud solutions. The layers ensure we cover everything from the ground-up network connectivity to the operational model for failover.

1. Network Fabric (the global backbone)

Establish high-performance, low-latency links between regions. Preferred: Use Global VNet Peering for simplified any-to-any connectivity with minimal latency over Microsoft’s backbone (ideal for point-to-point replication traffic), rather than a more complex Azure Virtual WAN, unless your topology demands it.

2. Storage Foundation (the bedrock)

In any distributed computing environment, storage is the “heaviest” component. Moving compute (VDAs) is instantaneous; moving data (profiles, user layers) is governed by bandwidth and the speed of light. The success of a multi-region DaaS deployment hinges on the performance and synchronization of the underlying storage subsystem. Use storage that can handle cross-region workload needs, especially for user data or state. In the case of Citrix DaaS, the preferred approach is Azure NetApp Files (ANF) for consistent sub-millisecond latency and high throughput.
ANF provides enterprise-grade performance (critical during “login storms” or peak I/O) and features like Cool Access tiering to optimize cost, outperforming standard Azure Files for this scenario.

3. User Profile & State (solving data gravity)

Enable active-active availability of user data or application state across regions. Solution: FSLogix Cloud Cache (in a VDI context) or similar distributed caching/replication technology, which allows simultaneous read/write of profile data in multiple regions. In our case, Cloud Cache insulates the user session from WAN latency by writing to a local cache and asynchronously replicating to the secondary region, overcoming the challenge of traditional file locking. The principle extends to databases or state stores: use geo-replication or distributed databases to avoid any single-region state.

4. Access & Ingress (the intelligent front door)

Ensure users/customers connect to the right region and can fail over seamlessly. Preferred: Deploy a global traffic management solution under your control, e.g. a customer-managed NetScaler (Citrix ADC) with Global Server Load Balancing (GSLB), to direct users to the nearest available datacenter. In our design, NetScaler’s GSLB uses DNS-based geo-routing and supports Local Host Cache for Citrix, meaning that even if the cloud control plane (Citrix Cloud) is unreachable, users can still connect to their desktops and apps. The general point: use Azure Front Door, Traffic Manager, or third-party equivalents to steer traffic, and avoid any solution that introduces a new single point of failure in the authentication or gateway path.

5. Master Image (ensuring global consistency)

If you rely on VM images or similar artifacts, replicate them globally. Use: Azure Compute Gallery (ACG) to manage and distribute images across regions.
In our case, we maintain a single “golden” image for virtual desktops: it’s built once, then the Compute Gallery replicates it from West Europe to East US (and any other region) automatically. This ensures that when we scale out or recover in Region B, we’re launching the exact same app versions and OS as in Region A. Consistency here prevents failover from causing functionality regressions.

6. Operations & Cost (smart economics at scale)

Run an efficient DR strategy: you want readiness without paying 2x all the time. Approach: Warm standby with autoscaling. That means the secondary region isn’t serving full traffic during normal operations (some resources can be scaled down or even deallocated), but it can scale up rapidly when needed. For our scenario, we leverage Citrix Autoscale to keep the DR site in a minimal state: only a small buffer of machines is powered on, just enough to handle a sudden failover until load-based scaling brings up the rest. This “active/passive” model (or hot-warm rather than hot-hot) strikes a balance: you pay only for what you use, yet you can meet your RTO (Recovery Time Objective) because resources spin up automatically on trigger. In cloud-native terms, you might use Azure Automation or scale sets to similar effect. The key is to avoid having an idle full duplicate environment incurring full costs 24/7, while still being prepared.

Each of these layers corresponds to critical architectural choices that determine your overall resiliency. Neglect any one layer, and that’s where Murphy’s Law will strike next. For example, you might perfectly replicate your data across regions, but if you forgot about network connectivity, a regional hub outage could still cut off access. Or you have every system duplicated, but if users can’t be rerouted to the backup region in time, the benefit is lost. The six-layer framework helps make sure we cover all bases.
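The warm-standby sizing idea in layer 6 can be sketched in a few lines. This is an illustrative model only, not Citrix Autoscale’s actual algorithm; the sessions-per-machine capacity and the powered-on buffer size are assumptions chosen for the example.

```python
import math

def machines_to_power_on(active_sessions: int,
                         sessions_per_machine: int,
                         buffer_machines: int) -> int:
    """Illustrative warm-standby sizing: power on enough machines for the
    current load, plus a small fixed buffer that can absorb a sudden
    failover until load-based scaling brings up the rest."""
    needed_for_load = math.ceil(active_sessions / sessions_per_machine)
    return needed_for_load + buffer_machines

# Normal operations in the DR region: no sessions, only the small buffer runs.
print(machines_to_power_on(0, 16, 2))    # 2
# Failover begins: 400 sessions land, and the target grows with the load.
print(machines_to_power_on(400, 16, 2))  # 27
```

The point of the model is the cost asymmetry: during steady state you pay for the buffer only, and the scale-out cost appears exactly when the sessions (and the business justification) arrive.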
Notably, these design best practices align very closely with Azure’s Well-Architected Framework (especially the Reliability pillar), and they’re exactly the kind of prescriptive guidance we provide through programs like the Proactive Resiliency Initiative. In fact, the PRI playbook essentially prioritizes these same steps for customers:

- First, harden the network foundation, e.g. ensure ExpressRoute gateways are zone-redundant and circuits are “multi-homed” in at least two locations (so no single datacenter failure breaks connectivity).
- Next, address in-region resiliency: make sure critical workloads are distributed across Availability Zones and not vulnerable to a single zone outage. (As an aside: Microsoft’s internal data shows a huge payoff here; when we configured our top Azure services for zonal resilience, we saw a 68% reduction in platform outages that lead to support incidents!)
- Then, enable multi-region continuity (BCDR): for those tier-0 and tier-1 workloads, set up cross-regional failover so even a region-wide disruption won’t take you down. Multi-region is the complement to (not a substitute for) zonal design: it’s about surviving the “black swan” of a region-level event, and also about supporting geo-distributed users and future growth.

In other words, if you follow the six-layer approach, you’re doing exactly what our structured resiliency programs recommend.

Announcing Cobalt 200: Azure’s next cloud-native CPU
By Selim Bilgin, Corporate Vice President, Silicon Engineering, and Pat Stemen, Vice President, Azure Cobalt

Today, we’re thrilled to announce Azure Cobalt 200, our next-generation Arm-based CPU designed for cloud-native workloads. Cobalt 200 is a milestone in our continued approach to optimizing every layer of the cloud stack, from silicon to software. Our design goals were to deliver full compatibility for workloads using our existing Azure Cobalt CPUs, deliver up to 50% performance improvement over Cobalt 100, and integrate the latest Microsoft security, networking, and storage technologies. Like its predecessor, Cobalt 200 is optimized for common customer workloads and delivers unique capabilities for our own Microsoft cloud products. Our first production Cobalt 200 servers are now live in our datacenters, with wider rollout and customer availability coming in 2026.

Azure Cobalt 200 SoC and platform

Building on Cobalt 100: Leading Price-Performance

Our Azure Cobalt journey began with Cobalt 100, our first custom-built processor for cloud-native workloads. Cobalt 100 VMs have been generally available (GA) since October 2024, and availability has expanded rapidly to 32 Azure datacenter regions around the world. In just one year, we have been blown away by the pace at which customers have adopted the new platform and migrated their most critical workloads to Cobalt 100 for its performance, efficiency, and price-performance benefits.

Cloud analytics leaders like Databricks and Snowflake are adopting Cobalt 100 to optimize their cloud footprint. The compute performance and energy-efficiency balance of Cobalt 100-based virtual machines and containers has proven ideal for large-scale data processing workloads. Microsoft’s own cloud services have also rapidly adopted Azure Cobalt for similar benefits. Microsoft Teams achieved up to 45% better performance using Cobalt 100 than their previous compute platform.
This increased performance means fewer servers are needed for the same task; for instance, Microsoft Teams media processing uses 35% fewer compute cores with Cobalt 100.

Designing Compute Infrastructure for Real Workloads

With this solid foundation, we set out to design a worthy successor: Cobalt 200. We faced a key challenge: traditional compute benchmarks do not represent the diversity of our customer workloads. Our telemetry from the wide range of workloads running in Azure (from small microservices to globally available SaaS products) did not match common hardware performance benchmarks. Existing benchmarks tend to skew toward CPU-core-focused compute patterns, leaving gaps in how real-world cloud applications behave at scale when using network and storage resources. Optimizing Azure Cobalt for customer workloads required us to expand beyond these CPU core benchmarks to truly understand and model the diversity of customer workloads in Azure.

As a result, we created a portfolio of benchmarks drawn directly from the usage patterns we see in Azure, including databases, web servers, storage caches, network transactions, and data analytics. Each of our benchmark workloads includes multiple variants for performance evaluation based on the ways our customers may use the underlying database, storage, or web-serving technology. In total, we built and refined over 140 individual benchmark variants as part of our internal evaluation suite.

With the help of our software teams, we created a complete digital-twin simulation from the silicon up: beginning with the CPU core microarchitecture, fabric, and memory IP blocks in Cobalt 200, all the way through the server design and rack topology. Then we used AI, statistical modelling, and the power of Azure to model the performance and power consumption of the 140 benchmarks against 2,800 combinations of SoC and system design parameters: core count, cache size, memory speed, server topology, SoC power, and rack configuration.
This resulted in the evaluation of over 350,000 configuration candidates of the Cobalt 200 system as part of our design process. This extensive modelling and simulation helped us iterate quickly to find the optimal design point for Cobalt 200, delivering over 50% more performance than Cobalt 100 while continuing to deliver our most power-efficient platform in Azure.

Cobalt 200: Delivering Performance and Efficiency

At the heart of every Cobalt 200 server is the most advanced compute silicon in Azure: the Cobalt 200 System-on-Chip (SoC). The Cobalt 200 SoC is built around the Arm Neoverse Compute Subsystems V3 (CSS V3), the latest performance-optimized core and fabric from Arm. Each Cobalt 200 SoC includes 132 active cores with 3 MB of L2 cache per core and 192 MB of L3 system cache to deliver exceptional performance for customer workloads.

Power efficiency is just as important as raw performance. Energy consumption represents a significant portion of the lifetime operating cost of a cloud server. One of the unique innovations in our Azure Cobalt CPUs is individual per-core Dynamic Voltage and Frequency Scaling (DVFS). In Cobalt 200, this allows each of the 132 cores to run at a different performance level, delivering optimal power consumption no matter the workload. We are also taking advantage of the latest TSMC 3nm process, further improving power efficiency.

Security is top of mind for all of our customers and a key part of the unique innovation in Cobalt 200. We designed and built a custom memory controller for Cobalt 200, so memory encryption is on by default with negligible performance impact. Cobalt 200 also implements Arm’s Confidential Compute Architecture (CCA), which supports hardware-based isolation of VM memory from the hypervisor and host OS.
When designing Cobalt 200, our benchmark workloads and design simulations revealed an interesting trend: several universal compute patterns emerged, namely compression, decompression, and encryption. Over 30% of cloud workloads made significant use of one of these common operations. Optimizing for them required a different approach than cache sizing and CPU core selection alone. We designed custom compression and cryptography accelerators, dedicated blocks of silicon on each Cobalt 200 SoC, solely for the purpose of accelerating these operations without sacrificing CPU cycles. These accelerators help reduce workload CPU consumption and overall costs. For example, by offloading compression and encryption tasks to the Cobalt 200 accelerator, Azure SQL is able to reduce its use of critical compute resources, prioritizing them for customer workloads.

Leading Infrastructure Innovation with Cobalt 200

Azure Cobalt is more than just an SoC, and we are constantly optimizing and accelerating every layer of the infrastructure. The latest Azure Boost capabilities are built into the new Cobalt 200 system, significantly improving networking and remote-storage performance. Azure Boost delivers increased network bandwidth and offloads remote storage and networking tasks to custom hardware, improving overall workload performance and reducing latency. Cobalt 200 systems also embed the Azure Integrated HSM (Hardware Security Module), providing customers with top-tier cryptographic key protection within Azure’s infrastructure and ensuring sensitive data stays secure. The Azure Integrated HSM works with Azure Key Vault for simplified management of encryption keys, offering high availability and scalability as well as meeting FIPS 140-3 Level 3 compliance.
An Azure Cobalt 200 server in a validation lab

Looking Forward to 2026

We are excited about the innovation and advanced technology in Cobalt 200 and look forward to seeing how our customers create breakthrough products and services. We’re busy racking and stacking Cobalt 200 servers around the world and look forward to sharing more as we get closer to wider availability next year.

- Check out the Microsoft Ignite opening keynote
- Read more on what’s new in Azure at Ignite
- Learn more about Microsoft’s global infrastructure

Azure VNet Flow Logs with Terraform: The Complete Migration and Traffic Analytics Guide
Migrating from NSG Flow Logs to VNet Flow Logs in Azure: Implementation with Terraform

Author: Ibrahim Baig (Consultant)

Executive Summary

Microsoft is retiring Network Security Group (NSG) flow logs and recommends migrating to Virtual Network (VNet) flow logs. After June 30, 2025, new NSG flow logs cannot be created, and all NSG flow logs will be retired by September 30, 2027. Migrating to VNet flow logs ensures continued support and provides broader, simpler network visibility.

What Changed & Key Dates

- June 30, 2025: Creation of new NSG flow logs is blocked.
- September 30, 2027: NSG flow logs are retired (resources deleted; historical blobs remain per retention policy).
- Microsoft provides migration scripts and policy guidance for the NSG-to-VNet flow log migration.

Why Migrate? (Benefits)

Operational Simplicity & Coverage
- Enable logging at the VNet, subnet, or NIC scope, with no dependency on NSGs.
- Broader visibility across all workloads inside a VNet, not just NSG-governed traffic.

Security & Analytics
- Native integration with Traffic Analytics for enriched insights.
- Monitor Azure Virtual Network Manager (AVNM) security admin rules.

Continuity & Cost Parity
- VNet flow logs are priced the same as NSG flow logs (with 5 GB/month free).

What’s New in VNet Flow Logs

- Scopes: Enable at the VNet, subnet, or NIC level.
- Storage: JSON logs to Azure Storage.
- At-scale enablement: Built-in Azure Policy for auditing and auto-deployment.
- Analytics: Traffic Analytics add-on for deep insights.
- AVNM awareness: Observe centrally managed security admin rules.

Traffic Analytics: Capabilities & Value

Traffic Analytics (TA) is a powerful add-on for VNet flow logs, providing:
- Automated Traffic Insights: Visualize traffic flows, identify top talkers, and detect anomalous patterns.
- Threat Detection: Surface suspicious flows, lateral movement, and communication with malicious IPs.
- Network Segmentation Validation: Confirm that segmentation policies are effective and spot unintended access.
- Performance Monitoring: Analyze bandwidth usage, latency, and flow volumes for troubleshooting.
- Customizable Dashboards: Drill down by subnet, region, or workload for targeted investigations.
- Integration: Seamless with Azure Monitor and Log Analytics for alerting and automation.

For practical recipes and advanced use cases, see https://blog.cloudtrooper.net/2024/05/08/vnet-flow-logs-recipes/.

GAP: The Terraform Registry page for azurerm_network_watcher_flow_log does not yet provide an explicit VNet flow logs example. In practice, you use the same resource and set target_resource_id to the ID of the VNet (or subnet/NIC). Registry page (latest): https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/network_watcher_flow_log

Important notes:
- Same resource block: azurerm_network_watcher_flow_log
- Use target_resource_id = <resource ID of VNet/Subnet/NIC> (instead of the legacy network_security_group_id)
- As of June 30, 2025, creating new NSG flow logs is no longer possible (per provider notes); migrate to VNet/subnet/NIC targets.
- Keep your azurerm provider up to date; earlier builds had validation gaps for subnet/NIC IDs, which were tracked and addressed in provider issues.

Implementation Guide

Option A — Terraform (Recommended for IaC)

Note: Use a dedicated storage account for flow logs, as lifecycle rules may be overwritten.
```hcl
terraform {
  required_version = ">= 1.5"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.110.0" # or latest
    }
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_network_watcher" "this" {
  name                = "NetworkWatcher_${var.region}"
  resource_group_name = "NetworkWatcherRG"
}

resource "azurerm_network_watcher_flow_log" "vnet_flow_log" {
  name                 = "${var.vnet_name}-flowlog"
  network_watcher_name = data.azurerm_network_watcher.this.name
  resource_group_name  = data.azurerm_network_watcher.this.resource_group_name

  target_resource_id = azurerm_virtual_network.vnet.id
  storage_account_id = azurerm_storage_account.flowlogs_sa.id
  enabled            = true

  retention_policy {
    enabled = true
    days    = 30
  }

  traffic_analytics {
    enabled               = true
    workspace_id          = azurerm_log_analytics_workspace.law.workspace_id
    workspace_region      = azurerm_log_analytics_workspace.law.location
    workspace_resource_id = azurerm_log_analytics_workspace.law.id
    interval_in_minutes   = 60
  }

  tags = {
    owner       = "network-platform"
    environment = var.env
  }
}
```

Option B — Azure CLI

```shell
az network watcher flow-log create \
  --location westus \
  --resource-group MyResourceGroup \
  --name myVNetFlowLog \
  --vnet MyVNetName \
  --storage-account mystorageaccount \
  --workspace "/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<LAWName>" \
  --traffic-analytics true \
  --interval 60
```

Option C — Azure Portal

- Go to Network Watcher → Flow logs → + Create.
- Choose Flow log type = Virtual network; select the VNet/subnet/NIC, a storage account, and optionally enable Traffic Analytics.

Option D — At Scale via Azure Policy

- Use built-in policies to audit and auto-deploy VNet flow logs (DeployIfNotExists).

Migration Approach (NSG → VNet Flow Logs)

1. Inventory existing NSG flow logs.
2. Choose a migration method: Microsoft script or Azure Policy.
3. Run both in parallel temporarily to validate.
4. Disable NSG flow logs before retirement.
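Before migrating at scale, it can help to rough out the monthly cost at the rates cited later in this guide ($0.50/GB flow-log ingestion after 5 GB free per month; Traffic Analytics processing at $2.30/GB on the 60-minute interval or $3.50/GB on the 10-minute interval). The sketch below hard-codes those published rates as assumptions; always confirm against the Azure pricing page for your region.

```python
def monthly_flow_log_cost(gb_ingested, ta_interval_minutes=60):
    """Rough monthly cost estimate for VNet flow logs.

    Assumed rates (from this guide's Cost Considerations section):
      - ingestion: $0.50/GB after the first 5 GB free each month
      - Traffic Analytics: $2.30/GB (60-min) or $3.50/GB (10-min)
    Pass ta_interval_minutes=None if Traffic Analytics is disabled."""
    ingestion = max(0.0, gb_ingested - 5.0) * 0.50
    if ta_interval_minutes is None:
        ta = 0.0
    elif ta_interval_minutes == 10:
        ta = gb_ingested * 3.50
    else:
        ta = gb_ingested * 2.30
    return round(ingestion + ta, 2)

print(monthly_flow_log_cost(100))      # 277.5  (60-min Traffic Analytics)
print(monthly_flow_log_cost(100, 10))  # 397.5  (10-min Traffic Analytics)
print(monthly_flow_log_cost(4, None))  # 0.0    (within the free tier, no TA)
```

Running the parallel validation period (step 3 above) effectively doubles ingestion for its duration, which a quick model like this makes easy to budget for.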
Challenges & Mitigations

- Permissions: Ensure the required roles on the Log Analytics workspace.
- Terraform lifecycle: Use a dedicated storage account.
- Tooling compatibility: Verify SIEM/NDR support.
- Provider/API maturity: Use a current azurerm provider.

Validation Checklist

- Storage: New blobs appear in the configured storage account.
- Traffic Analytics: Data is visible in the Log Analytics workspace.
- AVNM: Confirm traffic allowed/denied states appear in the logs.

Cost Considerations

- VNet flow log ingestion: $0.50/GB after 5 GB free/month.
- Traffic Analytics processing: $2.30/GB (60-min interval) or $3.50/GB (10-min interval).

Traffic Analytics Deep Dive

VNet flow logs are stored in Azure Blob Storage. Optionally, you can enable Traffic Analytics, which does two things: it enriches the flow logs with additional information, and it sends everything to a Log Analytics workspace for easy querying. This “enrich and forward to Log Analytics” operation happens in intervals, either every 10 minutes or every hour.

Table Structure: NTAIpDetails

This table contains enrichment data about public IP addresses, including whether they belong to Azure services (and their region) and geolocation information for other public IPs. Here is a sample query against that table:

```kusto
NTAIpDetails
| distinct FlowType, PublicIpDetails, Location
```

Table Structure: NTATopologyDetails

This table contains information about different elements of your topology, including VNets, subnets, route tables, routes, NSGs, Application Gateways, and much more.

Table Structure: NTANetAnalytics

Now we come to more interesting things: this table is the one containing the flows we are looking for. Records in this table contain the usual attributes you would expect, such as source and destination IP, protocol, and destination port.
Additionally, the data is enriched with information such as:

- Source and destination VM
- Source and destination NIC
- Source and destination subnet
- Source and destination load balancer
- Flow encryption (yes/no)
- Whether the flow is going over ExpressRoute
- And many more

Below are scenarios with detailed queries showing example ways to extract information from VNet flow logs and Traffic Analytics. Of course, these are just some of the scenarios that came to mind on my topology; the idea is that you can take inspiration from these queries to support your individual use case.

Example Scenario

Imagine you want to see which IP addresses a given virtual machine has been talking to in the last few days:

```kusto
NTANetAnalytics
| where TimeGenerated > ago(10d)
| where SrcIp == "10.10.1.4" and strlen(DestIp) > 0
| summarize TotalBytes = sum(BytesDestToSrc + BytesSrcToDest) by SrcIp, DestIp
```

Similarly, you can play around with such KQL queries in the workspace to dive deeper into the flow logs.

References & Further Reading

- https://learn.microsoft.com/en-us/azure/network-watcher/nsg-flow-logs-overview
- https://learn.microsoft.com/en-us/azure/network-watcher/nsg-flow-logs-migrate
- https://learn.microsoft.com/en-us/azure/network-watcher/vnet-flow-logs-overview
- https://learn.microsoft.com/en-us/azure/network-watcher/vnet-flow-logs-manage
- https://learn.microsoft.com/en-us/cli/azure/network/watcher/flow-log?view=azure-cli-latest
- https://learn.microsoft.com/en-us/azure/network-watcher/vnet-flow-logs-policy
- https://azure.microsoft.com/en-us/pricing/details/network-watcher/
- https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/network_watcher_flow_log
- https://blog.cloudtrooper.net/2024/05/08/vnet-flow-logs-recipes/

Resiliency Best Practices You Need For your Blob Storage Data
Maintaining Resiliency in Azure Blob Storage: A Guide to Best Practices Azure Blob Storage is a cornerstone of modern cloud storage, offering scalable and secure solutions for unstructured data. However, maintaining resiliency in Blob Storage requires careful planning and adherence to best practices. In this blog, I’ll share practical strategies to ensure your data remains available, secure, and recoverable under all circumstances. 1. Enable Soft Delete for Accidental Recovery (Most Important) Mistakes happen, but soft delete can be your safety net and. It allows you to recover deleted blobs within a specified retention period: Configure a soft delete retention period in Azure Storage. Regularly monitor your blob storage to ensure that critical data is not permanently removed by mistake. Enabling soft delete in Azure Blob Storage does not come with any additional cost for simply enabling the feature itself. However, it can potentially impact your storage costs because the deleted data is retained for the configured retention period, which means: The retained data contributes to the total storage consumption during the retention period. You will be charged according to the pricing tier of the data (Hot, Cool, or Archive) for the duration of retention 2. Utilize Geo-Redundant Storage (GRS) Geo-redundancy ensures your data is replicated across regions to protect against regional failures: Choose RA-GRS (Read-Access Geo-Redundant Storage) for read access to secondary replicas in the event of a primary region outage. Assess your workload’s RPO (Recovery Point Objective) and RTO (Recovery Time Objective) needs to select the appropriate redundancy. 3. Implement Lifecycle Management Policies Efficient storage management reduces costs and ensures long-term data availability: Set up lifecycle policies to transition data between hot, cool, and archive tiers based on usage. Automatically delete expired blobs to save on costs while keeping your storage organized. 4. 
Secure Your Data with Encryption and Access Controls

Resiliency is incomplete without robust security. Protect your blobs using:

- Encryption at rest: Azure automatically encrypts data using server-side encryption (SSE). Consider enabling customer-managed keys for additional control.
- Access policies: Implement Shared Access Signatures (SAS) and stored access policies to restrict access and enforce expiration dates.

5. Monitor and Alert for Anomalies

Stay proactive by leveraging Azure's monitoring capabilities:

- Use Azure Monitor and Log Analytics to track storage performance and usage patterns.
- Set up alerts for unusual activities, such as sudden spikes in access or deletions, to detect potential issues early.

6. Plan for Disaster Recovery

Ensure your data remains accessible even during critical failures:

- Create snapshots of critical blobs for point-in-time recovery.
- Enable Azure Backup for blobs and turn on the immutability feature.
- Test your recovery process regularly to ensure it meets your operational requirements.

7. Apply Resource Locks

Adding Azure locks to your Blob Storage account provides an additional layer of protection by preventing accidental deletion or modification of critical resources.

8. Educate and Train Your Team

Operational resilience often hinges on user awareness:

- Conduct regular training sessions on Blob Storage best practices.
- Document and share a clear data recovery and management protocol with all stakeholders.

9. Critical Tip: Do Not Create New Containers with Deleted Names During Recovery

If a container or blob is deleted for any reason and recovery is being attempted, it is crucial not to immediately create a new container with the same name. Doing so can significantly hinder the recovery process by overwriting the backend pointers that are essential for restoring the deleted data. Always ensure that no new containers are created with the same name during the recovery attempt to maximize the chances of successful restoration.
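As a concrete illustration of the lifecycle management policies described above, the sketch below builds a policy document in the JSON shape that Azure Blob Storage lifecycle management expects. The tier thresholds, the rule name, and the "logs/" prefix are illustrative assumptions, not recommendations; tune them to your own access patterns.

```python
import json

# Sketch of an Azure Blob Storage lifecycle management policy. The day
# thresholds and prefix below are illustrative placeholders.
def build_lifecycle_policy(cool_after_days=30, archive_after_days=90,
                           delete_after_days=365, prefix="logs/"):
    """Build a tier-down-then-delete policy for block blobs under a prefix."""
    return {
        "rules": [
            {
                "name": "tier-down-and-expire",  # hypothetical rule name
                "enabled": True,
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [prefix],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after_days
                            },
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": archive_after_days
                            },
                            "delete": {
                                "daysAfterModificationGreaterThan": delete_after_days
                            },
                        }
                    },
                },
            }
        ]
    }

if __name__ == "__main__":
    # Write the policy out so it can be applied to a storage account,
    # for example with the Azure CLI:
    #   az storage account management-policy create \
    #     --account-name <account> --resource-group <rg> --policy @policy.json
    print(json.dumps(build_lifecycle_policy(), indent=2))
```

Generating the policy in code rather than hand-editing JSON makes it easy to keep the same tiering rules consistent across many storage accounts.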
Wrapping It Up

Azure Blob Storage offers an exceptional platform for scalable and secure storage, but its resiliency depends on following best practices. By enabling features like soft delete, implementing redundancy, securing data, and proactively monitoring your storage environment, you can ensure that your data is resilient to failures and recoverable in any scenario.

- Protect your Azure resources with a lock - Azure Resource Manager | Microsoft Learn
- Data redundancy - Azure Storage | Microsoft Learn
- Overview of Azure Blobs backup - Azure Backup | Microsoft Learn

Operational Excellence In AI Infrastructure Fleets: Standardized Node Lifecycle Management
Co-authors: Choudary Maddukuri and Bhushan Mehendale

AI infrastructure is scaling at an unprecedented pace, and the complexity of managing it is growing just as quickly. Onboarding new hardware into hyperscale fleets can take months, slowed by fragmented tools, vendor-specific firmware, and inconsistent diagnostics. As hyperscalers expand with diverse accelerators and CPU architectures, operational friction has become a critical bottleneck.

Microsoft, in collaboration with the Open Compute Project (OCP) and leading silicon partners, is addressing this challenge. By standardizing lifecycle management across heterogeneous fleets, we have dramatically reduced onboarding effort, improved reliability, and achieved more than 95% nodes in service across extremely large fleets. This blog explores how we are contributing to and leveraging open standards to transform fragmented infrastructure into scalable, vendor-neutral AI platforms.

Industry Context & Problem

The rapid growth of generative AI has accelerated the adoption of GPUs and accelerators from multiple vendors, alongside diverse CPU architectures such as Arm and x86. Each new hardware SKU introduces its own ecosystem of proprietary tools, firmware update processes, management interfaces, reliability mechanisms, and diagnostic workflows. This hardware diversity leads to engineering toil, delayed deployments, and inconsistent customer experiences. Without a unified approach to lifecycle management, hyperscalers face escalating operational costs, slower innovation, and reduced efficiency.

Node Lifecycle Standardization: Enabling Scalable, Reliable AI Infrastructure

Microsoft, through the Open Compute Project (OCP) and in collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, is leading an industry-wide initiative to standardize AI infrastructure lifecycle management across the GPU and CPU hardware management workstreams.
Historically, onboarding each new SKU was a highly resource-intensive effort, because custom implementations and vendor-specific behaviors required extensive Azure integration. This slowed scalability, increased engineering overhead, and limited innovation. With standardized node lifecycle processes and compliance tooling, hyperscalers can now onboard new SKUs much faster, achieving an over 70% reduction in effort while enhancing overall fleet operational excellence. These efforts also enable silicon vendors to ensure interoperability across multiple cloud providers.

Figure: How standardization benefits both hyperscalers and suppliers.

Key Benefits and Capabilities

- Firmware updates: Firmware update mechanisms aligned with DMTF standards minimize downtime and streamline secure, fleet-wide deployments.
- Unified manageability interfaces: Standardized Redfish APIs and PLDM protocols create a consistent framework for out-of-band management, reducing integration overhead and ensuring predictable behavior across hardware vendors.
- RAS (Reliability, Availability and Serviceability) features: Standardization enforces minimum RAS requirements across all IP blocks, including CPER (Common Platform Error Record) based error logging, crash dumps, and error recovery flows, to enhance system uptime.
- Debug & diagnostics: Unified APIs and standardized crash and debug dump formats reduce issue resolution time from months to days. Streamlined diagnostic workflows enable precise FRU isolation and clear service actions.
- Compliance tooling: Tool contributions such as CTAM (Compliance Tool for Accelerator Manageability) and CPACT (Cloud Processor Accessibility Compliance Tool) automate compliance and acceptance testing, ensuring suppliers meet hyperscaler requirements for seamless onboarding.
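To make the manageability interfaces above concrete, here is a minimal sketch of consuming a DMTF Redfish collection, such as the firmware inventory involved in standardized update flows. The collection-parsing helper follows the standard Redfish response shape; the fetch helper is deliberately simplified and unauthenticated, whereas real fleet tooling would use session tokens (X-Auth-Token) and verified TLS.

```python
import json
import urllib.request

def collection_members(payload: dict) -> list:
    """Return the @odata.id URI of every member of a Redfish collection."""
    return [m["@odata.id"] for m in payload.get("Members", [])]

def fetch_json(base_url: str, path: str) -> dict:
    # Plain GET for illustration only; production code must add auth
    # headers and verify certificates before talking to a BMC.
    with urllib.request.urlopen(base_url + path) as resp:
        return json.load(resp)

# Example of the shape of a FirmwareInventory collection response.
# (/redfish/v1/UpdateService/FirmwareInventory is a standard Redfish path;
# the member names below are hypothetical.)
sample = {
    "Members": [
        {"@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/BMC"},
        {"@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/GPU0"},
    ]
}
print(collection_members(sample))
# -> ['/redfish/v1/UpdateService/FirmwareInventory/BMC',
#     '/redfish/v1/UpdateService/FirmwareInventory/GPU0']
```

Because every compliant vendor exposes the same collection shape, a single helper like this works unchanged across heterogeneous GPU and CPU nodes, which is exactly the integration saving the standardization effort targets.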
Technical Specifications & Contributions

Through deep collaboration within the Open Compute Project (OCP) community, Microsoft and its partners have published multiple specifications that streamline SKU development, validation, and fleet operations.

Summary of Key Contributions

- GPU Firmware Update Requirements (Firmware Updates): Enables consistent firmware update processes across vendors.
- GPU Management Interfaces (Manageability): Standardizes telemetry and control via Redfish/PLDM.
- GPU RAS Requirements (Reliability and Availability): Reduces AI job interruptions caused by hardware errors.
- CPU Debug and RAS Requirements (Debug and Diagnostics): Achieves >95% node serviceability through unified diagnostics and debug.
- CPU Impactless Updates Requirements (Impactless Updates): Enables impactless firmware updates that address security and quality issues without workload interruptions.
- Compliance Tools (Validation): Automates specification compliance testing for faster hardware onboarding.

Embracing Open Standards: A Collaborative Shift in AI Infrastructure Management

This standardized approach to lifecycle management represents a foundational shift in how AI infrastructure is maintained. By embracing open standards and collaborative innovation, the industry can scale AI deployments faster, with greater reliability and lower operational cost. Microsoft's leadership within the OCP community, together with its deep partnerships with other hyperscalers and silicon vendors, is paving the way for scalable, interoperable, and vendor-neutral AI infrastructure across the global cloud ecosystem.

To learn more about Microsoft's datacenter innovations, check out the virtual datacenter tour at datacenters.microsoft.com.