Demystifying On-Demand Capacity Reservations
About On-Demand Capacity Reservations Introducing the “parking garage” metaphor There are dozens of VM types available in Azure which span multiple generations of CPU across vendors and architectures. Within each Azure region are datacenters hosting pools of hardware which runs Azure services, such as virtual machines, of those types. As VMs are started and stopped by customers there is a constant ebb and flow of available capacity to run each type of VM within the region. Available capacity is driven by the rhythms of the business day, which creates variations in utilization on an hour-to-hour and even minute-to-minute basis. Longer cycles of demand such as holiday seasons, school calendars and other real-world events are also a factor. When you command an Azure Virtual Machine (VM) to start, the Azure Resource Manager (ARM) – the “engine” that manages resources in the Microsoft cloud -- needs to do a few things to make it happen. The most important of these is that it needs to identify hardware within the target region with sufficient capacity to bring the desired type and size of VM online at that moment in time. If ARM finds space for the desired VM size, the VM starts normally. However, if there is no room to start the desired VM, you will see an error similar to this one: This process of finding a place to start up an Azure VM has a lot of similarities to finding a place to park a vehicle. Parking facilities are built to handle typical demand for their location. If something is going on nearby, such as a large sporting event, which causes the need for parking to be much higher than normal then you might be out of luck when you try to find a spot because the garage is simply full. During periods of high demand in Azure this can result in VMs failing to start simply because there is nowhere to run them at that particular moment. 
If this happens to a VM which needed to be stopped for a configuration change or other reasons this can cause impact to your environment which you certainly want to avoid. On-Demand Capacity Reservations Azure has a resource called an On-Demand Capacity Reservation, or ODCR, which allows you to reserve a spot for a VM in the appropriate hardware within a region for a specific VM size. This is similar to “owning" a parking space: It’s a reserved place exclusively for the use of a specific VM. At a high level, the way this works is that you create an ODCR which matches the Azure region, availability zone and specific VM type, such as for a VM of type D16s_v6 in availability zone 2 of the Canada Central Azure region. Once the reservation is created, an Azure VM that matches that configuration can be associated to it so the VM now “owns” that “parking space”. This gives that VM priority over others of the same type when it needs to start because it already has a “parking space” assigned to it that can't be used by another one. More detail about VM startup Before we get further into what ODCRs are and how they work, it’s important to know a few more things about starting up a VM. Azure does not provide an explicit SLA for VM startup for virtual machines without an ODCR. The process of finding a hypervisor slot to boot up a VM is purely a “best effort” action on Azure’s part. Having quota headroom does not help with VM startup. Quota in Azure is your "credit limit" for creating VMs. Quota grants permission to create up to a certain number of cores’ worth of Virtual Machines from a particular family (like Ds_v6) but has no effect on whether you can actually start the machine once it’s created. Similarly, having a Reserved Instance purchase or a Savings Plan for a particular number of cores of a given VM family does not have any impact on the ability to start a VM either. 
They are discount mechanisms only: the customer commits to paying for a certain number of VM cores running 24x7 in exchange for a discounted rate.

Assigning an ODCR to a virtual machine applies a formal SLA on startup for it. VMs with ODCRs get priority over ones that don’t, so the likelihood of a successful startup is much higher for VMs that have one, especially when Azure is experiencing a period of high demand for that particular VM type. The actual language of the ODCR SLA can be found in Microsoft's Service Level Agreements for Online Services document, which can be downloaded from the linked site.

Cost Implications of ODCRs

These are the key points that you need to know about how billing works for ODCRs:
- The compute cost of the capacity reservation for a VM is exactly the same as that of a running VM of the same size.
- There is no “double billing” for a VM to have an ODCR associated with it.
- Billing for the ODCR starts immediately if the quantity of reserved "parking spaces" is greater than zero.
- Stopping a VM that has an ODCR associated with it does not impact cost. This is because the ODCR is holding the reserved hypervisor slot even if the VM is not running.
- Having a Reserved Instance purchase or Savings Plan which covers the same scope as the ODCR means that the VM will be billed at the discounted rate.

Are there any cases where using ODCRs results in paying more for a VM?

There are two cases that I’ve identified where you effectively pay twice for the same VM. First, if you are using Azure Site Recovery to protect a VM in Azure by replicating it to another location, you have the option to associate the remote replica of the VM with a capacity reservation. This helps ensure that the replica will start when it’s called upon because it has a pre-allocated spot reserved for it.
In this situation, if the original VM also is associated with an ODCR you are paying for both the original (running) VM and also for the reservation being held for its replica. Second, and similarly, when setting up replication for a VM that is preparing for migration into Azure via Azure Migrate, you can associate a capacity reservation with the replica for similar reasons to the above ASR example -- to ensure that the VM will start when its migrated replica is activated. If the source machine is also in Azure then you are again paying twice for the same machine. When should I use them? Capacity Reservations are an important element when designing for resiliency. They help ensure that VMs will be online when needed, even if they have to be shut down for some reason. For example, there was an incident where a customer had to shut down a VM that was serving as a firewall appliance to make an adjustment to its configuration and it failed to start up afterwards because of a capacity-related failure. This resulted in significant impact due to the loss of connectivity for systems dependent on the firewall for connectivity until they were able to bring it back online. Based on field experience and resiliency assessments, applying ODCRs to VMs that must be available 24x7 is strongly recommended. Examples of this include key functions like AD domain controllers, application servers and database servers. Also, any VM-based appliances that may be running as firewalls, load balancers or other infrastructure-support services should be considered as well. Microsoft offers assessments which review a workload for gaps that impact resiliency in many dimensions including outages in Azure. These assessments include checks for the presence of capacity reservations and will report any VM’s that do not have them as a high-risk finding. Not all VM stops in Azure are voluntary Even if you are careful to never stop a VM yourself it can sometimes happen. 
Not every shutdown of a VM in Azure is user-initiated. Involuntary shutdowns are rare, but they can occur due to predicted hardware failures or other events which ARM will respond to by stopping the VM in order to move it out of harm's way.

Creating On-Demand Capacity Reservations

This section covers the components of an ODCR, the process of creating them and why creating them can fail.

Components of an ODCR:

An ODCR has two components. The first is a Capacity Reservation Group (CRG), which is simply a "bucket" for any number of capacity reservations. To create a CRG you only need to provide its name, the region that it will be used for and which availability zones within that region it will have access to.

The second -- and more important -- component is the actual Capacity Reservation, which is created within a CRG. The capacity reservation requires:
- The name of the reservation. Including the VM size and other details in the name is useful to reduce ambiguity. An example could be “Zone1_D16s_v5”.
- The specific VM size the reservation is for, such as “D16s_v5”.
- The availability zone of the reservation. You can also create a regional reservation, where the VM is “zoneless”.
- The number of instances that the reservation holds.

ODCRs can be created via the Azure portal, from the command line using PowerShell or the Azure CLI, or deployed through IaC tools such as Bicep or Terraform. CRGs can also be shared across subscriptions, which allows a CRG created and managed in one subscription to be utilized by VMs in a different subscription.

When the ODCR is created, if the number of instances it contains is higher than zero then ARM will attempt to allocate the desired number of instances of the specified VM type in the target region/zone. If there is capacity available for this then the creation succeeds and you can move on to associating machines with it to give them the protection of the ODCR.
If creating the ODCR is unsuccessful, the cause can be a variety of things, including: No open hypervisor slots for the desired VM in the target location – the “parking lot” was full at the moment the request was submitted. This can result from outages within Azure that reduce capacity as well as demand pressure. There is insufficient quota in the subscription to claim the necessary number of VM cores for the reservation in the region. The VM type is simply not available in the target region or AZ. Since not all Azure regions are provisioned with identical hardware this can be the cause, especially for VM types other than the popular D, E and F series machines. A restriction is applied to the subscription, zone or region that blocks creation of the reservation for some reason. What you can do if creating an ODCR fails Some things that may help if creating a capacity reservation fails and you know that quota or other restrictions are not a factor are below. Not coincidentally, these are the same recommendations that you should try when a VM fails to start because the same ARM action – finding and allocating hardware with free capacity to start the VM – is taking place. IN GENERAL, creating an ODCR outside of business hours has a higher probability of success. Demand for Azure services typically drops off at the end of the business day where the region is located. Consider using a different VM type, availability zone or a different Azure region. A script or other automation that retries at intervals until the reservation succeeds in claiming the desired number of spots can help, though it can take an unknown amount of time before this works. It may need to run for days or even weeks before it succeeds. Submitting a support ticket will create visibility to your situation from Microsoft. If the root cause is something other than capacity, support can identify that cause and provide guidance on how to resolve it. 
If the issue truly is a capacity squeeze, the ability of support to help get the reservation created is extremely limited because the support folks, while helpful, are not able to create space where none exists. In this case the support teams will often suggest the three options above.

Protecting a VM with an ODCR

Once you have the ODCR created, applying it to a VM is straightforward. To do this from the portal, open the configuration tab on the VM’s screen. Then scroll to the bottom of the panel that appears to find the “Capacity reservations” section. Select “Capacity reservation group” from the list. The list of capacity reservation groups that match the VM will appear in a drop-down menu below. Select the CRG that the VM should use and click “Apply”.

If you are using an Infrastructure-as-Code approach such as Bicep or Terraform, an Azure VM is linked to a CRG by specifying the resource ID of the CRG on the VM definition (in ARM templates and Bicep, the capacityReservation.capacityReservationGroup.id property of the VM).

Impact of associating a virtual machine with an ODCR:
- If the VM is not running then the change takes effect immediately.
- If the VM is running and has no zone assignment (a “regional” VM) then it must be stopped and restarted for the protection of the ODCR to apply.
- If the VM is running and has a zone assignment then the change is immediate and there is no disruption to the VM.

Where an ODCR is not the right answer

ODCRs are most effective when they are used to protect VMs that need to always be running because they are providing essential services. Examples include AD domain controllers, firewall or load balancer appliances, database servers, integration servers that support workflows and the like. The primary thing to keep in mind is the cost impact of the ODCRs and whether they are necessary for the service to be functioning. Environments where machines come and go frequently, such as scale in/out setups used to minimize cost, are not ideal for ODCRs.
For example, if you have a pool of app servers configured for scale-out, using ODCRs to cover the entire size of the pool means you would be paying for all machines, whether they are actually online or not. A possible approach in a scale-out environment is to determine the minimum number of VMs necessary for the service to be available -- even in a degraded state -- and use an ODCR to protect that number of instances. This way you can have confidence that at least that number of machines in the pool will always be running even if an attempt to scale out fails. Working with On-Demand Capacity Reservations (and three interesting behaviors that you should know about) This section discusses some ins and outs of working with ODCRs in your environment, especially if you need to apply them to existing machines. This is a common scenario when you are attempting to improve the resiliency of a set of VMs against impacts from maintenance, outages or other situations that may cause VMs to restart. “Associated” vs “Allocated” A capacity reservation group will always have ownership of some number of "parking spots" within a region. The number that it holds is referred to as the reservation's capacity which is expressed as a number of allocated instances. When you link a VM to a CRG, the VM becomes associated with the CRG and can take advantage of the protection that it offers from matching reservations that it contains. It is possible to associate more VMs to a CRG than it has allocated capacity for. This is called overallocation. When a CRG is overallocated, the VMs associated with it are protected on a first-come-first-served basis based on when they were started. If, for example, there are four VMs associated with a CRG but the CRG only has an allocated capacity of two, the first two associated machines which were started will receive protection but the others will not. 
“Interesting” On-Demand Capacity Reservation behavior #1

Here is the first of three interesting behaviors that you can use to your advantage when working with ODCRs. You can add a running VM to a capacity reservation group. As mentioned previously, if the VM is zonal then the change is immediate and nondisruptive. If the VM is regional then the VM must be stopped and restarted for the change to take effect. This is conceptually different from other Azure mechanisms used for resiliency such as Availability Sets. You can only add a VM to an availability set at the time the VM is created, but you can add or remove a VM from a Capacity Reservation Group at any time, whether the VM is running or not.

“Interesting” On-Demand Capacity Reservation behavior #2

Interesting behavior #2 is deceptively simple. When creating a reservation, you can specify a capacity (number of allocated instances) of zero. This should always succeed because Azure needs to take no action to fulfill it -- this is just a metadata adjustment for the reservation within the CRG. This seems to not be terribly useful at first glance but keep reading.

“Interesting” On-Demand Capacity Reservation behavior #3

If the number of associated VMs is higher than the allocated capacity of the reservation, you can increase the capacity of the reservation to cover the running VMs. Why does this work? Because running VMs, by definition, already have a hypervisor allocation, Azure doesn’t need to find one for them; it can simply link the capacity reservation to the hypervisor slot that each running VM is using.

The payoff! Or, using these three behaviors to your advantage

Because ODCRs are relatively new and have not yet been adopted widely, a common finding to emerge from field resiliency assessments of running workloads is that the VMs that support the workload need to have ODCRs applied to them. In large environments there may be dozens or even hundreds of VMs that need to be protected.
The process for doing this can seem daunting to a technical team that is not familiar with ODCRs. Thankfully, these three behaviors make it possible to easily protect any number of running machines with a very high probability of success -- and zero disruption if they are zonal VMs -- by proceeding in this order: Create a CRG with a reservation for the region, AZ and VM type for the machine(s) that need to be covered with a quantity of zero. (Interesting behavior #2) Associate the VMs to the capacity reservation group. At this point the CRG is overallocated so the machines are not yet protected. Remember that if the VMs are regional, a restart is required to finalize the ODCR assignment. (Interesting behavior #1) Update the reservation within the CRG to increase the number of allocated instances to match the number of running VMs. (Interesting behavior #3) When the number of instances on the reservation is equal to or higher than the number of VMs associated with it, all of the associated VMs are protected and you’re done! Final thoughts This leads to a final piece of advice about working with ODCRs, especially when you know that capacity is a challenge in the target region: As a field CSA, I recommend that you bring VMs online first, then apply a capacity reservation to them. Why? If you already have a set of running VMs that need to be protected then following what seems like the obvious process: Creating a CRG, creating reservations within it for the correct number of instances and then associating the VMs with the reservation – has a risk of failure at the step of creating the ODCR because Azure needs to find and allocate additional hypervisor slots for the reservation to own. This can be challenging when there is a lot of demand for the VM type. 
As the example in the previous section showed, it’s much easier to protect VMs that are already online by associating them with an existing capacity reservation, even if it doesn’t have enough instances allocated to it, and then increasing the capacity of the ODCR to cover the running machines.

References:
- On-Demand Capacity Reservations Overview. Monitor the list of restrictions on VM eligibility because it changes frequently.
- SLA Details for On-Demand Capacity Reservations. Legal fine print is in the consolidated SLA for Online Services (.docx).
- Some details about Overallocating capacity reservations.
- Information on creating a Capacity Reservation Group via Bicep, Terraform or ARM template.

CI/CD as a Platform: Shipping Microservices and AI Agents with Reusable GitHub Actions Workflows
The First Shift — Treating CI/CD as a Platform The first insight is straightforward but underused: Your CI/CD logic is infrastructure. It deserves the same design discipline as your application code. That means centralizing it. Versioning it. Exposing it as reusable, callable workflows — not copy-pasted YAML scattered across dozens of repos. In Part 1 of this series, we build exactly that. A platform repository that defines reusable GitHub Actions workflows for testing, building, and deploying containerized services to Azure. Application repos stay thin — they simply call the platform, like invoking an API. Build once. Deploy anywhere. Fix once. Every team benefits. The Second Shift — Governing AI Behavior But software is changing. We are no longer just shipping APIs and microservices. We are shipping AI agents — systems that reason, respond, and make decisions. And these systems break the assumptions that traditional CI/CD was built on. A unit test can tell you whether your code is correct. It cannot tell you whether your AI agent is trustworthy. Prompts behave like code but drift differently. Model outputs are probabilistic. Quality degrades silently, without a failed test to catch it. This creates a new engineering challenge: How do you build a delivery pipeline for something that does not have a deterministic right answer? In Part 2, we extend the platform to answer that question. We introduce evaluation as a deployment gate — a reusable workflow that scores agent behavior before any deployment is allowed. We integrate with Microsoft Foundry for agent runtime and observability. And we show how the same platform-thinking from Part 1 applies directly to AI systems. What This Series Is Really About This is not a tutorial on GitHub Actions syntax. It is about maturity — the difference between a team that writes pipelines and a team that designs delivery systems. Between an organization that ships code and one that governs behavior. 
By the end of both parts, you will have: A reusable CI/CD platform that scales across any number of services An evaluation-driven delivery pipeline for AI agents A mental model for treating both code and AI as governed, versioned artifacts The tools are GitHub Actions and Azure. The principle is platform thinking. Let's build it. The Problem — Why CI/CD Pipelines Don't Scale Every pipeline starts simple. You create a repository, add a workflow file, and within minutes your code is building and deploying automatically. It feels like a solved problem. It isn't. The Reality of Growth The first pipeline is straightforward. The second is a copy of the first. The third is a copy of the second — with one small adjustment. By the time you have ten services, you have ten slightly different pipelines, each one drifting quietly away from the others. This is pipeline sprawl — and it is far more costly than it appears. Consider what happens in practice: One team upgrades their Python version. Others don't. A security fix gets applied to three pipelines. The other seven are missed. A new compliance requirement means updating every workflow file — manually, one repo at a time. A new engineer onboards using an old workflow and ships a pattern that was deprecated months ago. None of this feels critical in the moment. But over time, your CI/CD layer becomes the most inconsistent, unmaintainable, and ungoverned part of your infrastructure — even though it controls everything that ships to production. The Deeper Problem — No Separation of Concerns The root cause is not a tooling limitation. It is a design problem. Most teams treat CI/CD as something that lives inside an application repo — a secondary concern, not a first-class system. That model works at small scale. It breaks at org scale. 
When CI/CD logic is distributed across every application repo:
- There is no single source of truth for how deployments work
- Platform teams cannot enforce standards without touching every repo individually
- Security and compliance teams have no centralized control plane
- Onboarding a new service means rebuilding from scratch, or copying from an outdated reference

The Cost You Don't See

The real cost of this pattern is not the duplicated YAML. It is the compounding overhead:

| Problem | Visible Cost | Hidden Cost |
| --- | --- | --- |
| Duplicated pipelines | Time to replicate | Drift and inconsistency over time |
| No centralized logic | Minor friction | Security gaps across repos |
| Manual updates | One-time effort per change | Multiplied across every service |
| No versioning | Manageable today | Breaking changes with no rollback path |

What the Solution Looks Like

The answer is not a better YAML template. It is a platform. Specifically, it is a centralized repository that owns CI/CD logic, exposes it as reusable versioned workflows, and lets every application team consume it without duplicating a single line of pipeline code. This is the same principle that drives every mature engineering organization: Don't repeat infrastructure. Abstract it. Version it. Share it. That is exactly what we are going to build.

The Architecture: What You're Building

Before writing a single line of code, it is worth understanding the system as a whole. The architecture is intentionally simple. Two repositories. One cloud infrastructure. One clear separation of responsibilities.

The Two-Repo Model

This separation is the core design decision. Everything else follows from it.

The platform repo is not an application. It does not ship features. It ships workflow infrastructure: reusable, versioned, callable by any application team in your organization.

The application repo is deliberately thin on CI/CD. It contains a single workflow file that calls the platform. Nothing more.
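As a rough sketch of what that single workflow file might look like, the caller below chains the three reusable workflows that the platform repo exposes (test, build, deploy). The `my-org/platform` repository path, the file name, the branch trigger and the Container App names are assumptions made for illustration; only the workflow file names and the `environment`, `image_tag` and `app_name` inputs come from the platform workflows described in this article.

```yaml
# .github/workflows/ci.yml in the application repo (hypothetical file name)
name: ci

on:
  push:
    branches: [main]          # assumed trigger

# A called workflow cannot receive more token permissions than its caller
# grants, so the OIDC permission used by azure/login is granted here.
permissions:
  id-token: write
  contents: read

jobs:
  test:
    uses: my-org/platform/.github/workflows/test-python.yml@v1

  build:
    needs: test
    uses: my-org/platform/.github/workflows/build.yml@v1
    secrets: inherit

  deploy-staging:
    needs: build
    uses: my-org/platform/.github/workflows/deploy.yml@v1
    with:
      environment: staging
      image_tag: ${{ needs.build.outputs.image_tag }}
      app_name: my-app-staging        # hypothetical Container App name
    secrets: inherit

  deploy-production:
    # build is listed again so that its outputs are readable from this job
    needs: [build, deploy-staging]
    uses: my-org/platform/.github/workflows/deploy.yml@v1
    with:
      environment: production
      image_tag: ${{ needs.build.outputs.image_tag }}
      app_name: my-app-production     # hypothetical Container App name
    secrets: inherit
```

Note that `secrets: inherit` only works when the caller and the platform repo belong to the same organization or enterprise; otherwise each secret would have to be passed explicitly.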
How They Connect

The connection happens through GitHub's workflow_call trigger, a mechanism that allows one workflow to invoke another across repositories. The application repo does not care how the build works. It only cares about the contract: the inputs it needs to provide and the outputs it can expect back. This is the same mental model as an API: The caller knows the interface. The platform owns the implementation.

The Deployment Flow

Once triggered, the pipeline moves through four clearly defined stages. A few things to note about this flow:
- The image is built exactly once. The same artifact moves through every environment, with no rebuilds and no drift.
- The Git SHA is the image tag. Every deployment is fully traceable back to a specific commit.
- GitHub Environments control approvals. Staging and production are separate environments with configurable protection rules, so no custom approval logic is needed.

The Azure Infrastructure

On the cloud side, the system uses two Azure services:

| Service | Role |
| --- | --- |
| Azure Container Registry (ACR) | Stores Docker images |
| Azure Container Apps | Runs the application in staging and production |

Both are provisioned using Bicep, Azure's infrastructure-as-code language, so the infrastructure is versioned and repeatable alongside the workflows.

Responsibility Map

Here is how responsibilities are distributed across the system:

| Layer | Owns | Does Not Own |
| --- | --- | --- |
| Platform Repo | Test logic, build logic, deploy logic | Application code |
| Application Repo | Business logic, Dockerfile, requirements | Pipeline implementation |
| Azure | Runtime, registry, networking | Deployment decisions |

This clean separation means:
- Platform teams can update CI/CD logic without touching application code
- Application teams can ship features without understanding pipeline internals
- Infrastructure changes are isolated to the Bicep layer

Why This Scales

The real power of this architecture becomes clear at scale.
With fifty microservices: One change to deploy.yml in the platform repo propagates to every service on the next run. No manual updates. No drift. No inconsistency. This is what CI/CD as a platform means in practice.

Platform Repo: Structure and Reusable Workflows

The platform repo is the heart of this system. Everything it contains is designed to be reusable, versioned, and consumed by any application team in your organization. Let's walk through it in full.

Repository Structure

Three workflows. One infrastructure file. That is the entire platform. Each workflow has a single, well-defined responsibility:

| Workflow | Responsibility |
| --- | --- |
| test-python.yml | Install dependencies and run tests |
| build.yml | Build Docker image and push to ACR |
| deploy.yml | Deploy a specific image to a specific environment |

Workflow 1: test-python.yml

This workflow handles dependency installation and test execution for any Python-based service.

```yaml
name: test-python

on:
  workflow_call:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # Checks out the calling (application) repository
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11.9"
      - run: pip install -r requirements.txt
      - run: pytest
```

What to note:
- The on: workflow_call trigger is what makes this reusable. It cannot be triggered directly; it must be called by another workflow.
- The Python version is pinned to 3.11.9, not a floating version like 3.11. This ensures every service tests against the exact same runtime, eliminating environment-specific failures.

Any application repo that calls this workflow gets consistent, centrally maintained test execution without defining any of this logic themselves.

Workflow 2: build.yml

This workflow builds the Docker image, tags it with the Git SHA, and pushes it to Azure Container Registry.
```yaml
name: build

on:
  workflow_call:
    outputs:
      image_tag:
        value: ${{ jobs.build.outputs.image_tag }}

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.meta.outputs.tag }}
    permissions:
      id-token: write   # required for OIDC login to Azure
      contents: read
    steps:
      - uses: actions/checkout@v4
      # Expose the commit SHA as the image tag for downstream jobs
      - id: meta
        run: echo "tag=${GITHUB_SHA}" >> "$GITHUB_OUTPUT"
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az acr login --name ${{ secrets.ACR_NAME }}
      # Build once and push; the same image is later deployed to every environment
      - run: |
          docker build -t ${{ secrets.ACR_LOGIN_SERVER }}/app:${{ github.sha }} .
          docker push ${{ secrets.ACR_LOGIN_SERVER }}/app:${{ github.sha }}
```

What to note:
- outputs: This workflow exposes image_tag as an output. The calling workflow captures this value and passes it downstream to the deploy workflow. This is how the same image tag flows from build → staging → production without being hardcoded anywhere.
- id-token: write: This permission enables OIDC-based authentication with Azure. No long-lived credentials are stored as secrets. GitHub generates a short-lived token at runtime, which Azure trusts via a federated identity configuration. This is the recommended authentication pattern for production workloads.
- ${GITHUB_SHA}: Using the commit SHA as the image tag makes every build fully traceable. Given any running container, you can identify the exact commit it was built from.

Workflow 3: deploy.yml

This workflow deploys a given image to a given environment in Azure Container Apps.
```yaml
name: deploy

on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
      image_tag:
        required: true
        type: string
      app_name:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Binding the job to a GitHub Environment applies that environment's protection rules
    environment: ${{ inputs.environment }}
    # OIDC login needs id-token: write, as in build.yml; the caller must grant it too
    permissions:
      id-token: write
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      # Point the Container App at the image built earlier in the pipeline
      - run: |
          az containerapp update \
            --name ${{ inputs.app_name }} \
            --resource-group ${{ secrets.AZURE_RESOURCE_GROUP }} \
            --image ${{ secrets.ACR_LOGIN_SERVER }}/app:${{ inputs.image_tag }}
```

What to note:
- Three inputs: environment, image_tag, and app_name. This single workflow handles every environment. The caller decides where to deploy by passing inputs; the workflow itself has no hardcoded environment logic.
- environment: ${{ inputs.environment }}: This line is deceptively powerful. By mapping the job's environment to the input value, GitHub automatically applies whatever protection rules are configured for that environment, such as required reviewers, wait timers and deployment policies. Approval gates come for free.
- secrets: inherit: When the calling workflow passes secrets: inherit, Azure credentials flow through automatically without being re-declared. Secrets are managed once, at the org or repo level.

The Versioning Contract

One detail that makes this system production-ready is workflow versioning. When an application repo calls a platform workflow, it references a specific version through a Git ref such as the @v1 tag. The @v1 tag means:
- Application teams are insulated from breaking changes in the platform
- Platform teams can ship improvements without forcing immediate upgrades
- You can run @v1 and @v2 side by side during migrations
- Every deployment is traceable to a specific platform version

This versioning model is what separates a platform from a shared folder of YAML files.
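From the caller's side, the versioning contract could be sketched as follows. This is only the `jobs` section of a hypothetical caller workflow: the `my-org/platform` path, the existence of a `v2` tag, the app name and the commit SHA shown are placeholders, not values from the original setup.

```yaml
jobs:
  # Most jobs stay on the stable v1 contract.
  test:
    uses: my-org/platform/.github/workflows/test-python.yml@v1

  # During a migration, a single job can opt in to v2 while the rest stay on v1.
  build:
    needs: test
    uses: my-org/platform/.github/workflows/build.yml@v2
    secrets: inherit

  # For the strictest traceability, a full commit SHA can replace the tag.
  deploy-staging:
    needs: build
    uses: my-org/platform/.github/workflows/deploy.yml@0123456789abcdef0123456789abcdef01234567  # placeholder SHA
    with:
      environment: staging
      image_tag: ${{ needs.build.outputs.image_tag }}
      app_name: my-app-staging      # hypothetical Container App name
    secrets: inherit
```

The trade-off is worth keeping in mind: a tag such as `v1` can be moved by the platform team, so fixes reach every caller on the next run, while a commit SHA never moves and has to be bumped deliberately in each application repo.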
What Application Teams See From an application team's perspective, the entire platform surface looks like this: Three uses statements. That is the entire CI/CD surface an application team needs to understand. Everything else — authentication, image tagging, registry login, container update commands — is abstracted away inside the platform. Azure Infrastructure The platform workflows handle CI/CD logic. The Azure infrastructure handles the runtime — where your containers live, how they are stored, and how they are served to the outside world. All infrastructure is defined in Bicep — Azure's native infrastructure-as-code language. This means your infrastructure is versioned, repeatable, and deployable from a single command. Why Bicep Before diving into the code, it is worth briefly explaining the choice. Bicep compiles down to ARM templates but is significantly more readable. It integrates natively with Azure's resource model, requires no external state management, and fits naturally alongside GitHub Actions workflows. For teams already working within the Azure ecosystem, it is the most straightforward path to infrastructure-as-code without introducing additional tooling dependencies. Infrastructure Structure The entire infrastructure is defined in a single file. 
For this architecture, you need two resources: Resource Purpose Azure Container Registry (ACR) Stores and serves Docker images Azure Container Apps Runs containers in a managed serverless environment main.bicep param location string = resourceGroup().location // Azure Container Registry resource acr 'Microsoft.ContainerRegistry/registries@2023-01-01-preview' = { name: 'myregistry' location: location sku: { name: 'Basic' } } // Azure Container App (Staging + Production) resource containerApp 'Microsoft.App/containerApps@2023-05-01' = { name: 'my-app' location: location properties: { configuration: { ingress: { external: true targetPort: 8000 } } } } Breaking It Down Container Registry resource acr 'Microsoft.ContainerRegistry/registries@2023-01-01-preview' = { name: 'myregistry' location: location sku: { name: 'Basic' } } The ACR is the central image store for your entire platform. Every image built by build.yml is pushed here, tagged with its Git SHA. Both staging and production pull from this registry — ensuring the exact same artifact runs in both environments. The Basic SKU is sufficient for most team-scale workloads. For larger organizations with higher throughput requirements, Standard or Premium SKUs offer geo-replication and increased storage limits. Container App resource containerApp 'Microsoft.App/containerApps@2023-05-01' = { name: 'my-app' location: location properties: { configuration: { ingress: { external: true targetPort: 8000 } } } } Azure Container Apps provides a fully managed serverless container runtime. You define what runs — it handles scaling, networking, and availability. Two things to note here: external: true — Makes the application publicly accessible over HTTPS. Azure Container Apps automatically provisions a fully qualified domain name and TLS certificate. targetPort: 8000 — Maps to the port exposed by the FastAPI application inside the container. This must match the --port argument in your CMD instruction in the Dockerfile. 
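Because targetPort and the container's --port argument must agree, a small pre-deploy check can catch a mismatch early. This is a sketch under stated assumptions: the Dockerfile uses the exec (JSON array) form of CMD, the port value is numeric, and the function names port_from_dockerfile and ports_match are hypothetical.

```python
import json
from typing import Optional


def port_from_dockerfile(dockerfile_text: str) -> Optional[int]:
    """Return the value following "--port" in the last exec-form CMD, if any."""
    port = None
    for line in dockerfile_text.splitlines():
        stripped = line.strip()
        if not stripped.upper().startswith("CMD "):
            continue
        try:
            args = json.loads(stripped[4:])
        except json.JSONDecodeError:
            continue  # shell-form CMD is not handled in this sketch
        if not isinstance(args, list):
            continue
        port = None  # a later exec-form CMD replaces an earlier one
        if "--port" in args:
            index = args.index("--port")
            if index + 1 < len(args):
                port = int(args[index + 1])
    return port


def ports_match(dockerfile_text: str, target_port: int) -> bool:
    """Compare the Dockerfile port with the ingress targetPort value."""
    return port_from_dockerfile(dockerfile_text) == target_port
```

Passing the Dockerfile text and the targetPort value from main.bicep (8000 in this article) gives a single boolean that a CI step could assert on.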
Staging vs. Production
You will deploy this infrastructure twice, once for staging and once for production, with different resource names:

# Deploy staging
az deployment group create \
  --resource-group rg-ciplatform-staging \
  --template-file infra/main.bicep

# Deploy production
az deployment group create \
  --resource-group rg-ciplatform-production \
  --template-file infra/main.bicep

Note that main.bicep as shown hardcodes the names my-app and myregistry, so the template needs a name parameter (or a per-environment edit) to produce distinct resources such as my-app-staging and my-app-prod.

The deploy.yml workflow then targets the correct app by name via the app_name input (my-app-staging or my-app-prod in release.yml). This keeps staging and production fully isolated at the infrastructure level while sharing the same workflow logic.

GitHub Environments and Approval Gates
On the GitHub side, you configure two Environments, staging and production, inside your repository settings. For production, add a required reviewer protection rule. When the pipeline reaches the deploy-prod job, GitHub will pause and wait for a designated reviewer to approve before proceeding. This approval gate needs no extra workflow code: it is built into GitHub's environment model and wired automatically through the environment: field in deploy.yml.

Setting Up Azure Authentication
The workflows authenticate to Azure using OpenID Connect (OIDC), a keyless authentication method that eliminates the need for long-lived service principal secrets.
Set up the federated identity once:

# Create an app registration
az ad app create --display-name "github-actions-platform"

# Add federated credential for your repo
az ad app federated-credential create \
  --id <app-id> \
  --parameters '{
    "name": "github-actions",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:your-org/fastapi-app:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'

Jobs that declare environment: (as the deploy job does) present a subject of the form repo:your-org/fastapi-app:environment:<name> instead of the branch ref, so they need a matching federated credential for each environment.

Then add these secrets to your GitHub repository:
Secret | Value
AZURE_CLIENT_ID | Application (client) ID
AZURE_TENANT_ID | Directory (tenant) ID
AZURE_SUBSCRIPTION_ID | Azure subscription ID
AZURE_RESOURCE_GROUP | Target resource group name
ACR_NAME | Container registry name
ACR_LOGIN_SERVER | Registry login server (e.g. myregistry.azurecr.io)

With these in place, every workflow that calls azure/login@v2 authenticates automatically — no passwords, no rotation, no expiry management.

Application Repo — Structure, Code, and Release Workflow
With the platform repo in place, the application repo becomes remarkably simple. Its only CI/CD responsibility is to call the platform — everything else is focused purely on application logic. This is the goal: application teams ship features, not pipelines.

Repository Structure
This is the entire CI/CD footprint of the application repo.

The Application — src/main.py
The application is a minimal FastAPI service with a single endpoint that returns the current deployed version and environment.

from fastapi import FastAPI
import os

app = FastAPI()

@app.get("/version")
def version():
    return {
        "version": os.getenv("GITHUB_SHA", "dev"),
        "environment": os.getenv("APP_ENV", "local")
    }

This endpoint serves a practical purpose beyond demonstration.
In a real system, a /version or /health endpoint like this allows you to: Verify which commit is running in each environment Confirm a deployment succeeded without inspecting container logs Detect environment mismatches between staging and production requirements.txt All dependencies are pinned to exact versions. This ensures the same packages install in every environment — local development, CI, staging, and production — eliminating version drift as a source of failures. Dockerfile FROM python:3.11.9-slim WORKDIR /app COPY requirements.txt . RUN pip install -r requirements.txt COPY src ./src CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"] What to note: python:3.11.9-slim — The base image uses the same Python version as the platform's test-python.yml workflow. Consistency between the test environment and the container runtime eliminates an entire class of environment-specific bugs. Dependency layer first — requirements.txt is copied and installed before application source code. This is a deliberate layer ordering decision — Docker caches the dependency layer independently, so subsequent builds only reinstall packages when requirements.txt changes, not on every code change. 0.0.0.0 — Binds the server to all network interfaces inside the container, making it reachable from outside. Combined with targetPort: 8000 in the Bicep configuration, this completes the network path from Azure Container Apps to the application. The Release Workflow — release.yml This is the most important file in the application repo. It is also the simplest. 
name: release on: push: branches: [main] permissions: id-token: write contents: read jobs: test: uses: ns-github-design/ci-platform/.github/workflows/test-python.yml@v1 build: needs: test uses: ns-github-design/ci-platform/.github/workflows/build.yml@v1 secrets: inherit deploy-staging: needs: build uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1 with: environment: staging image_tag: ${{ needs.build.outputs.image_tag }} app_name: my-app-staging secrets: inherit deploy-prod: needs: [build, deploy-staging] uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1 with: environment: production image_tag: ${{ needs.build.outputs.image_tag }} app_name: my-app-prod secrets: inherit Walking Through the Pipeline Trigger Every merge to main triggers a full release. This reflects a trunk-based delivery model — main is always releasable, and every commit to it initiates the path to production. Test Job The first job calls the platform's test workflow. No configuration required — the platform handles Python setup, dependency installation, and test execution. The application team owns the test files; the platform owns the execution environment. Build Job The build job runs only after tests pass. It calls the platform's build workflow and inherits all secrets automatically — Azure credentials, ACR login server, registry name — without re-declaring them. The critical output here is image_tag — the Git SHA of the current commit. This value is captured and passed downstream to both deploy jobs. Deploy to Staging The staging deployment runs immediately after a successful build. It passes three inputs to the deploy workflow: environment: staging — triggers GitHub's staging environment rules image_tag — the exact SHA built in the previous job app_name: my-app-staging — the target Container App in Azure Deploy to Production Production deployment runs only after staging succeeds. 
It uses the same image_tag — the identical image that just ran successfully in staging is what gets promoted to production. No rebuild. No repackaging. The artifact is immutable. If a required reviewer is configured on the production GitHub Environment, the pipeline pauses here until approval is granted.

The Complete Pipeline at a Glance

What the Application Team Never Has to Think About
It is worth being explicit about what this model abstracts away from application engineers:
Concern | Handled By
Azure authentication | Platform (build.yml, deploy.yml)
Docker build and push | Platform (build.yml)
Image tagging strategy | Platform (build.yml)
Container App update command | Platform (deploy.yml)
Approval gate mechanics | GitHub Environments
Python version consistency | Platform (test-python.yml)

The application team's CI/CD knowledge requirement is reduced to understanding three uses statements and two with input blocks. Everything else is the platform's responsibility.

Demo — Proving It Works
Your pipeline is now live and connected across three layers:
GitHub Actions (Reusable Workflows) – powering CI/CD logic
FastAPI Application Repo – consuming those workflows
Azure Container Apps – running staging and production

Step 1 – Trigger the CI/CD Pipeline
Push any commit to the main branch, then open the repository's Actions tab. You’ll see the release workflow start automatically.

Step 2 – Observe the Pipeline Run
The jobs execute in sequence:
Stage | Description
test | Runs pytest inside GitHub Actions using the reusable workflow test-python.yml
build | Builds and tags a Docker image with the current Git SHA, then pushes to ACR
deploy-staging | Deploys that same image to your Container App my-app-staging
approval gate | Waits for approval of the production environment
deploy-prod | On approval, promotes the identical image to my-app-prod

Your final dependency chain is test → build → deploy-staging → deploy-prod. The deploy-prod job declares needs: [build, deploy-staging], which enforces this ordering.
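The job ordering in Step 2 follows directly from the needs: declarations in release.yml. As a rough illustration, the same ordering can be derived with Python's standard graphlib module. The NEEDS mapping mirrors release.yml, while execution_order is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Each job maps to the jobs it depends on, mirroring needs: in release.yml.
NEEDS = {
    "test": [],
    "build": ["test"],
    "deploy-staging": ["build"],
    "deploy-prod": ["build", "deploy-staging"],
}


def execution_order(needs):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(needs).static_order())


print(execution_order(NEEDS))
# ['test', 'build', 'deploy-staging', 'deploy-prod']
```

Because every job here depends on the one before it, only one valid order exists. Listing build in deploy-prod's needs is what makes needs.build.outputs.image_tag available to that job.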
Step 3 – Review the Logs Every job’s output is visible inside GitHub Actions: test – confirms tests collected successfully build – shows docker push ... to ACR deploy‑staging – displays Azure CLI output updating the Container App deploy‑prod – mirrors those steps after manual approval This transparency is part of what makes reusable workflows auditable and support enterprise compliance. Step 4 – Verify Running Apps After both deployments succeed, confirm each environment is live. Staging Production Expected response: (The exact commit SHA replaces "abc1234".) This proves: The same container image was promoted unchanged. Both environments are consistent. The platform’s reusable workflows handled the full delivery flow. The Bridge: Why AI Changes Everything Your CI/CD platform now runs like a product: build once, test once, deploy anywhere. But software itself is shifting. The next generation of systems doesn’t just serve requests — it reasons. We are no longer only shipping code. We are shipping AI agents that evolve, learn, and behave based on prompts, data, and context. And that introduces a new set of engineering realities. The Old Contract Traditional CI/CD pipelines assume: Code is deterministic Tests define correctness Deployments promote immutable artifacts Those assumptions hold for APIs and microservices. The New Reality with AI Systems AI systems violate the core idea of “deterministic correctness.” Characteristic Traditional Software AI / Agent Systems Behavior Deterministic Probabilistic Definition of success Binary pass/fail Continuous score Changes Source code edits Prompt/model/data changes Validation method Unit tests Semantic evaluation Risks Bugs Hallucination / drift / bias Prompts, fine‑tuned models, retraining data, and external tool integrations become active code paths — yet they can’t be meaningfully validated with unit tests alone. 
Why This Breaks Standard CI/CD Your current CI/CD system answers only one question: “Did the code pass its tests?” But for an AI agent, that’s not enough. You also need to know: “Did the model behave acceptably across metrics that matter?” Without that gate, an AI update that produces worse responses could still deploy perfectly — because the pipeline has no concept of semantic quality. The Missing Layer — Evaluation What testing is to code, evaluation is to AI. It separates experimental prompts from production‑ready agents. This leads to the next maturity step: Extend your CI/CD platform into an AI Delivery Platform — one that can evaluate, score, and gate agent behavior before deployment. What Changes Technically You don’t replace the CI/CD you built. You add a new reusable workflow to the same platform: This new workflow introduces a stage that: Runs offline or dataset‑based evaluation scripts Computes a confidence / quality score Blocks deployment if performance falls below threshold What This Means Philosophically Build pipelines become governance systems Platform teams now own evaluation as much as deployment Reusable workflows become policies for AI reliability The same architecture — reusable calls, versioned workflows, staged promotions — continues serving you, but with a new function: safeguarding machine behavior. Evaluation as a Gate Your reusable CI/CD system already enforces two things: Code quality → through tests Deployment consistency → through shared workflows The next maturity layer is enforcing behavioral quality — ensuring an AI agent performs to a defined standard before it goes live. That’s where evaluation pipelines come in. The Big Shift In conventional systems: For AI systems: Instead of pass/fail assertions, you now gate deployments on scores — accuracy, relevance, factuality, safety, or any quantitative prompt‑response metric. 
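Gating on scores rather than assertions can be sketched in a few lines. The example below is illustrative only: the metric names, threshold values, and function names are assumptions chosen for the sketch, not the article's evaluate-agent.yml or eval.py.

```python
from typing import Dict, List


def failing_metrics(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the metrics that are missing or fall below their minimum score."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        # A metric that was never computed counts as a failure (fail safe).
        if value is None or value < minimum:
            failures.append(metric)
    return sorted(failures)


def gate_exit_code(scores: Dict[str, float], thresholds: Dict[str, float]) -> int:
    """Map the gate result to a process exit code: 0 lets the pipeline continue."""
    return 1 if failing_metrics(scores, thresholds) else 0


# Hypothetical scores and thresholds for illustration.
example_scores = {"accuracy": 0.91, "relevance": 0.84, "safety": 0.99}
example_thresholds = {"accuracy": 0.80, "relevance": 0.85, "safety": 0.95}
print(failing_metrics(example_scores, example_thresholds))  # ['relevance']
```

A workflow step that exits with gate_exit_code(...) would stop the run whenever any metric falls short, which is the behavior a score-based gate needs.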
Reusable Workflow — evaluate-agent.yml Add this new file to your platform repository: File content: Example Evaluation Script — eval.py This script executes semantic evaluation logic for your agent. As a proof‑of‑concept, this produces a random score. In real use, this could compute accuracy against a dataset, compare responses to a gold standard, or call an LLM‑based judge service. Integrating the New Stage In your AI app repo (for example, agent-app or fastapi-app once it evolves into an agent): This creates a simple but powerful control flow: If eval.py writes a score below 0.8, the pipeline stops immediately — deployment blocked, logs recorded, everything traceable. Key Takeaways Concept Description Reusable Same evaluate-agent workflow can gate hundreds of models Configurable Each use can override thresholds or metrics Auditable Evaluation scores logged as build artifacts Safe Prevents low-performing or biased agents from promotion Beyond Thresholds Later, you can evolve this into: Adaptive thresholds per metric Human‑in‑the‑loop approvals for borderline scores Trend tracking – scores over time via GitHub Checks or dashboards Integration with observability platforms (Azure App Insights, Foundry evaluations, etc.) AI Delivery Pipeline + Foundry Integration So far, you have: A unified CI/CD platform powered by reusable GitHub Actions Evaluation pipelines that gate AI deployments Now we expand that architecture into a complete AI Delivery Platform by integrating with Microsoft Foundry. The Goal Combine: GitHub Actions ↔ Foundry for seamless build‑evaluate‑deploy cycles Reusable workflows for policies + governance Foundry runtime for execution, scaling, and observability of agents This transforms your CI/CD system into a behavior‑driven deployment layer for AI. 
Conceptual Flow Reusable CI/CD Workflows + Foundry Runtime Your existing ci-platform repo now gains a fourth reusable workflow: Each of these maps to a Foundry capability: Workflow Foundry Capability Role build.yml Model packaging & versioning Creates deployable image evaluate-agent.yml Evaluation service Runs offline or dataset‑based checks deploy.yml Agent deployment Publishes agent to Foundry runtime (Additional) monitor.yml Telemetry Pulls evaluation metrics post‑deploy Example Foundry‑Aware Pipeline In an AI repository (e.g., agent-app): This sequence guarantees that only successfully evaluated agent versions are deployed to Foundry. How Foundry Fits In Microsoft Foundry provides: Agent runtime — scalable, managed environment for composable agents Evaluation tools — integrate LLM‑as‑judge, dataset scoring, or automatic benchmarks Observability layers — performance metrics, feedback loops, and telemetry Orchestration frameworks — connect multiple tools or sub‑agents into an ecosystem GitHub Actions handles delivery logic. Foundry handles AI execution and lifecycle. Together, they form a modular operations stack for AI systems. Benefits of Integration Benefit Description Governed Deployments Only evaluated and approved agent versions reach Foundry Traceability Every deployed agent is linked to a Git commit and eval score Reproducibility Re‑running pipeline with the same commit reproduces identical behavior Observability Foundry telemetry pushes real‑world feedback back into the platform repo Architecture View Governance in Practice Every deployment is evaluated before release. Every evaluation is logged as metadata in the Actions run. Foundry stores live metrics that can trigger automated re‑evaluation workflows downstream. This unifies the DevOps and MLOps worlds under one pipeline. Advanced Practices Integrating evaluation and Foundry is the foundation. True enterprise reliability comes from how you operate and evolve those pipelines over time. 
Below are the main practices that transform this setup from “it works” to “it scales safely.” 1. Prompt Versioning In AI systems, prompts are code. A single word change in a prompt can shift an agent’s behavior as much as a logic rewrite does in software. Treat them accordingly: Store prompts and configurations in git (/prompts/prompt_v1.txt, prompt_v2.txt). Use clear change history — commits = versions. Reference prompt versions explicitly in deployment metadata: Re-runs of an old version must reproduce identical responses; versioned prompts make that possible. 2. Experiment Tracking Track every experiment like you track every deployment. Item Example Format Commit SHA f9a3c2a Prompt version prompt_v3 Model checkpoint gpt‑35‑turbo 2024‑06‑01 Dataset revision dataset_v2 Evaluation score 0.87 Implementation tips: Write a short artifact file (experiment.json) in each pipeline run. Store it as a workflow artifact or upload it to an experiment tracker (MLflow, Azure ML Experiments, Foundry History). You can later analyze how prompt or model changes affect score trends. This allows data‑driven improvement cycles: evaluate → compare → promote → monitor. 3. Rollback Strategies For deterministic software: Rollback = redeploy previous container. For AI systems you may need to rollback three dimensions: Dimension Example Rollback Code Checkout previous commit Prompt Revert to earlier prompt file Model Reuse prior checkpoint or model ID Best practice: treat each version triple (code, prompt, model) as one immutable release unit in the pipeline. GitHub tags + evaluation artifacts = auditable rollback point. 4. Continuous Evaluation Evaluation shouldn’t stop at deployment. Integrate post‑deployment monitoring jobs to detect drift: Benefits: Detects silent performance drops caused by new data or model API changes. Keeps models aligned with their initial standards. Creates long‑term confidence for compliance audits. 5. 
Fail Fast, Fail Safe Configure pipelines such that failure to evaluate = failure to deploy. When in doubt, err on protection. Failures should be logged, retriable, and transparent — never silent. This approach builds institutional trust in AI releases the same way software regression testing built trust in traditional CI/CD. 6. Governance by Design Use GitHub’s native features (branch protections, required reviews, environment rules) as declarative governance. Combine them with Foundry’s policy hooks: restrict which teams can promote evaluated agents; enforce minimum score thresholds; auto‑disable underperforming models. Governance embedded in code scales better than manual review boards. 7. Platform Observability Push run data into dashboards. Correlate: GitHub Actions runs Evaluation scores Production telemetry from Foundry Visualization options: Azure Monitor, Power BI, Grafana. Aim for a CI/CD + AI Ops Console view — one pane to observe quality, reliability, and speed. Outcome of These Practices Your organization achieves: Consistency across microservices and AI systems Accountability through versioned artifacts Safety via evaluation gates and drift monitors Agility because updates remain fast, but protected Enterprise Scenarios By this point, you’ve built an end‑to‑end platform: standardized CI/CD for apps and agents, reusable GitHub Actions workflows, Azure runtime for reliable deployments, Foundry‑integrated evaluation gates. Now let’s see how this architecture performs in the wild. Scenario 1 — Fifty Microservices, One Consistent Pipeline Problem Statement At scale, each microservice team usually maintains a slightly different workflow — fragmented test tools, drift in Python or Node versions, duplicated YAML. What Goes Wrong Compliance updates require 50 PRs. Each team solves build problems differently. Security teams can’t easily prove consistency. Platform Solution The ci-platform repo defines all workflows once (test‑python.yml, build.yml, deploy.yml). 
Every service just calls them through uses:. Upgrading the base image or CI version happens once and propagates to all services. Result Full organization upgrade from Python 3.10→3.11 in minutes. Consistent quality gates, policies, and artifact naming. Reduced cycle time, increased deployment confidence. Scenario 2 — Regulated Enterprises (Compliance + Audit) Problem Statement Financial, healthcare, and government projects require strict controls: Auditable promotion paths Approval workflows Traceability of versions and changes What Goes Wrong Manual change reviews are error‑prone. Different CI/CD definitions per team produce inconsistent logs. Compliance reports take weeks. Platform Solution GitHub Environments provide built‑in approvals and reviewer rules. The same reusable workflows ensure identical build signatures. Foundry integration logs evaluation scores and deployment metadata automatically. Result Reviewers approve through GitHub’s Environment gate — zero custom UI needed. Each release carries an immutable commit ID + evaluation score + approvers record. Audit reports generate directly from pipeline history. Scenario 3 — AI‑Driven Customer Support Platform Problem Statement A company running customer support agents (GPT‑powered) wants to continuously improve responses but without risking live quality drops. What Goes Wrong Prompt changes can silently worsen accuracy. Model updates impact intent coverage. Hard to correlate user feedback with deployment versions. Platform Solution Add evaluate-agent.yml into the same CI/CD chain. Feed evaluation datasets that cover FAQs and tone guidelines. Require minimum score ≥ 0.85 for promotion. Deploy via Foundry to production clusters once threshold met. Stream Foundry telemetry → GitHub → Power BI for quality dashboards. Result Continuous prompt experimentation without sacrificing quality. Regressed builds automatically blocked. Business stakeholders track AI accuracy as a live metric. 
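The promotion rule in Scenario 3 (a minimum score of 0.85, with regressed builds blocked) could look roughly like the following. Comparing against a baseline score from the previous release is an assumption of this sketch, and Decision and decide_promotion are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    promote: bool
    reason: str


def decide_promotion(
    candidate: float,
    baseline: Optional[float] = None,
    min_score: float = 0.85,
    max_regression: float = 0.0,
) -> Decision:
    """Block a candidate that is below the minimum or worse than the baseline."""
    if candidate < min_score:
        return Decision(False, "below minimum score")
    # Assumption: baseline is the evaluation score of the currently deployed version.
    if baseline is not None and candidate < baseline - max_regression:
        return Decision(False, "regressed against baseline")
    return Decision(True, "meets threshold")
```

Setting max_regression above zero would tolerate small score fluctuations between runs, which may matter when the evaluation itself is noisy.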
Bonus Scenario — Enterprise AI R&D Platform Multiple research teams train models on‑prem or in Azure ML. The central engineering platform exposes build, evaluate, deploy steps as reusable workflows. Data scientists → run “evaluate‑agent” without touching infra. Platform engineers → control policies, thresholds, approvals. Leadership → gets consistent reporting on AI performance and cost. This creates a single standard for AI lifecycle governance across business units. Summary Your platform now supports: Area Traditional Dev AI Adaptation Build & Test Reusable workflows (Services) Evaluation gate (Agents) Deploy Container Apps / GitHub Environments Foundry + Telemetry Feedback Governance Environment approval rules Evaluation threshold + human review Scaling One repo per service One platform per organization Across these cases, the core pattern holds: Centralize workflow logic, decentralize application logic, unify governance. 14 — Conclusion What began as a simple effort to clean up a few duplicated YAML files evolved into a complete delivery platform architecture — one that treats pipelines as first‑class products and extends their usefulness into the era of AI‑driven systems. From Pipelines to Platforms At first, you built reusable workflows in a shared repository. That small structural change produced an outsized effect: Reduced maintenance and drift Consistent security and compliance One‑click upgrades across every service You proved that pipeline logic belongs in its own product — a CI/CD platform. From Deterministic to Intelligent Delivery Then the domain changed. Deterministic services gave way to AI agents. You responded by extending the same reusable platform into the AI dimension: Added evaluate-agent.yml for semantic scoring Introduced Foundry as the runtime for intelligent components Unified evaluation, governance, and deployment under the same contracts The underlying philosophy remained identical: don’t duplicate delivery logic — standardize it. 
The Broader Pattern This architecture expresses a clear maturity pathway: Stage What Changes Technical Lever CI/CD as Automation Build pipelines per project YAML and Actions CI/CD as Product Reusable workflows, shared logic Platform Repo CI/CD as Governance Environments, approvals, tracking GitHub Environments + Azure AI Delivery Platform Evaluation + behavioral policy Foundry Integration Every step adds structure, traceability, and scale, without sacrificing developer velocity. Cultural Impact Moving to a platform model does more than streamline releases. It elevates DevOps to a product discipline: Platform engineers design contracts, not scripts. Application teams consume delivery APIs, not ad‑hoc builds. AI teams get reliable evaluation and rollback mechanisms. In short: velocity meets governance. The Next Frontier As this pattern matures, two frontiers are emerging: Autonomous Evaluation — Agents that assess other agents in continuous feedback loops. Dynamic Policy Enforcement — Pipelines that adjust deployment thresholds and configurations in real time based on observed performance. The foundations you’ve built — centralized workflows, evaluation gates, and Foundry integration — already support that trajectory. CI/CD maturity is not about writing workflows; it’s about designing reusable systems of workflows. What you’ve built is more than CI/CD. It’s a platform that defines how modern software and AI move from idea to production safely. Next we’ll close the series with a brief “What’s Next” section — outlining concrete next steps for building upon this foundation. 15 — What’s Next You’ve gone from writing pipelines to designing platforms. The CI/CD model you created now governs the lifecycle of both microservices and AI agents — and it’s only the beginning.
Step 1 — Publish Your Platform Make both repositories public (read‑only) so others can learn from the pattern: ns-github-design/ci-platform – your reusable workflow product ns-github-design/fastapi-app – your minimal consumer example Tag the current stable version as v1.0 in both repos. Add concise READMEs explaining purpose, usage, and version policy. This turns your repos into live documentation — a working reference architecture. Step 2 — Add Automated Docs and Visuals Export your Draw.io architecture to SVG and embed it in each README. Use GitHub Pages or Docsify to render a small site explaining: platform repo overview; how workflow_call works; how to set up Azure auth; example runs and outputs. Readers love code + architecture in one place. Step 3 — Extend to AI Agents Add a third demo: agent-evaluator — a lightweight agent that runs eval.py and demonstrates the evaluation gate. In that repo: Call evaluate-agent.yml from your platform. Push commits that sometimes fail thresholds. Show screenshots of blocked vs. approved runs. You’ll have a fully working AI evaluation demo powered by your platform. Step 4 — Instrument Foundry Feedback Use Foundry’s APIs to stream live evaluation results or observability data back into GitHub Actions artifacts: yaml - name: Collect Foundry feedback run: foundry metrics export --project my-ai-agent --output metrics.json That feedback loop will let you build dashboards of quality trends alongside deployment timeline. Step 5 — Prepare Part 3 (Next Blog) You now have a natural foundation for the next article: “Autonomous Delivery Loops: Continuous Evaluation and Guardrails for AI Agents.” Outline: Continuous evaluation with scheduled runs Self‑healing approval flows Dynamic policy adjustment based on metrics Cross‑team Governance as Code That installment makes your series visionary and future‑ready. 
Quick Recap Phase Achievement 1 – 4 Built CI/CD Platform + App Repo 5 Configured Azure + OIDC 6 Verified Pipeline End‑to‑End 8 – 15 Documented Demo → AI Integration → Enterprise Practices → Vision You now have a complete blog series that is: technically deep, architecturally unique, demonstrably real. Every diagram, YAML, and code sample came from a working, reproducible system — the hallmark of strong engineering writing. Final Thought Software delivery used to end at deployment. AI delivery begins there. The future of platforms is not just to ship software faster — but to ensure that every agent behaves as designed.
Migrating On-prem Windows & Linux VMs to Azure Confidential Virtual Machines via Azure Migrate
1. Executive Summary Enterprise cloud adoption increasingly prioritizes trust boundaries that extend beyond traditional infrastructure isolation. While encryption at rest and in transit are foundational, modern organizations must also ensure that data in use (data actively processed in CPU or system memory) remains protected. Azure Confidential Computing (ACC) mitigates emerging threats by enabling hardware-backed Trusted Execution Environments (TEEs). These environments isolate VM memory, CPU state, and I/O paths from Azure’s hypervisor, host operating system, and even privileged Azure administrators. Azure Confidential Virtual Machines (CVMs) bring ACC to general-purpose workloads without requiring application modification, providing: Memory encryption (per-VM keys) Isolation from the hypervisor and cloud fabric Secure VM boot with platform attestation Cryptographically enforced key release from Azure Managed HSM Lift-and-shift compatibility using Azure Migrate This whitepaper offers a complete lifecycle framework for secure migration, including governance models, deep technical implementation guidance, and operational readiness. 2. Business Drivers & Compliance Alignment 2.1 Risk & Threat Landscape Threat Category Scenario Traditional VM Protection CVM Protection Hypervisor compromise Host OS breach ❌ ✔ Isolated TEE Privileged insider Cloud admin access to guest memory ❌ ✔ SEV-SNP/TDX isolation DMA attacks PCIe-level memory scraping ❌ ✔ Memory encrypted in hardware Supply-chain compromise Pre-boot firmware tampering ⚠️ ✔ Attestation-gated boot Side-channel attacks Spectre-like memory leakage ⚠️ ✔ Strong hardware isolation 2.2 Business Outcomes Strongest possible protection for mission-critical workloads Accelerates regulated workload migration Supports Zero Trust goals: assume breach, verify explicitly Reduces privileged-access risk and insider threat profiles 3. 
Solution Architecture Overview
3.1 End-to-End Architecture Diagram
The diagram represents an end-to-end architecture for migrating workloads from an on-premises environment to Azure using Azure Migrate, with a strong focus on security and confidentiality. Here’s a detailed explanation of each section:
On-Premises Environment:
- Components: Windows Servers and Linux Servers. These are your existing workloads that need to be migrated.
Azure Migrate Appliance:
- Acts as a bridge between on-premises servers and Azure.
- Uses a private connection for secure data transfer.
Azure Landing Zone:
This is the target environment in Azure where migrated workloads will reside. It includes:
- Private Endpoints
  - Azure Migrate: for migration orchestration.
  - Cache Storage Account (Blob): temporary storage for replication data.
  - Managed HSM (Hardware Security Module): for cryptographic key management.
- Private DNS Zones
  - privatelink.blob.core.windows.net
  - privatelink.managedhsm.azure.net
  These ensure name resolution for private endpoints without exposing them publicly.
Migration Workflow:
- Azure Migrate Project: Discover on-premises servers. Replicate workloads to Azure.
- Cached Replication Data → Private Blob Storage: Replication data is stored securely in a private blob before cutover.
- Test Migration: Performed in an isolated VNet to validate functionality before production cutover.
- Production Cutover: Migrated workloads run as Confidential VMs in Azure.
Security Enhancements:
- SEV-SNP or TDX TEE: Hardware-based Trusted Execution Environments for isolation.
- Confidential OS + Data Disk via DES HSM Key: Ensures encryption and integrity.
- Attestation-Gated Boot via Managed HSM: Verifies VM integrity before booting.
4.
Azure Components
Category | Component | Purpose
Migration | Azure Migrate Appliance | Discovery, replication, orchestration
Compute | Confidential VM (SEV-SNP/TDX) | Secure execution environment
Security | Managed HSM | CMK storage & attestation-gated key release
Storage | Cache Storage Account | Replication staging via private endpoint
Encryption | Disk Encryption Sets | CMK-bound OS/data disk encryption
Networking | Private Endpoints & Private DNS | Fully private transport
Identity | Confidential VM Orchestrator | Validates attestation to enable boot
5. Confidential VM Requirements
5.1 Hardware Requirements
AMD SEV-SNP (DCasv6, ECasv6)
- Memory encryption with per-VM keys
- Nested page table protection
- RMP validation preventing host tampering
- Guest attestation report with measurement register integrity
Intel TDX (DCesv6, ECesv6)
- Encryption + integrity-protected guest memory
- Hardware-isolated module to validate TEE launch
- Boot measurement and module verification
5.2 VM Configuration Requirements
- Generation 2 (Gen2) virtual machine
- UEFI + Secure Boot
- vTPM enabled
- Confidential VM security type enabled via Azure Migrate or ARM templates
5.3 Disk Requirements
- The OS disk will be a Confidential Disk
- Data disks encrypted via Disk Encryption Set (DES)
- DES bound to RSA-HSM keys
- Managed HSM with purge protection
- Key Release Policy requiring attestation
- Disks should always be Premium for all Confidential VMs; this is required for performance and compatibility with confidential disk encryption
6. End-to-End Migration Framework
A nine-phase sequential model aligned with CAF, Azure architecture best practices, and enterprise migration standards.
Phase 1: Azure Migrate - Connectivity, Private Endpoints & DNS
Azure Migrate Requirements & Setup
Prerequisites:
- Azure subscription with contributor/owner access
- Resource Group for Azure Migrate project and resources
Replication Appliance prerequisites:
Deploy Windows Server 2022 as the replication appliance.
Component | Requirement
CPU cores | 16
RAM | 32 GB
Number of disks | 2, including the OS disk (80 GB) and a data disk (620 GB)
Setup Steps:
- Deploy Azure Migrate appliance on-premises
- Register appliance with Azure Migrate project
- Discover on-premises VMs (Windows/Linux)
  Click Discover → Choose a discovery method:
  - Agent-based: Install the Azure Migrate agent on the source VMs.
  - Agentless (vSphere/Hyper-V): Use credentials to discover VMs.
  Ensure all VMs to be migrated are discovered.
- Click Assess → Configure assessment:
  - Target VM size: Choose Confidential VM-compatible sizes for CVMs.
  - Target Azure region.
  - Disk recommendations: Premium SSD or Premium SSD v2 for CVMs.
- Validate connectivity to private endpoints, including:
  - Cache storage accounts
  - Managed HSM
- Cache Storage Account: Cache storage accounts can use ZRS for redundancy. If ASR replication is required, use a separate LRS cache storage account. All storage must be private endpoint-enabled and encrypted with CMKs from Azure Managed HSM.
- Verify that the VMs appearing in the Azure Migrate project are ready for replication
Required Private Endpoints:
Service | Endpoint Requirement
Azure Migrate | Yes
Cache Storage Account | Yes (Blob PE only)
Managed HSM | Yes
Private DNS Zones:
- privatelink.blob.core.windows.net
- privatelink.managedhsm.azure.net
- privatelink.azurewebsites.net
Connectivity Requirements:
- ExpressRoute or Site-to-Site VPN
- No public endpoints allowed
- Azure Migrate Appliance must resolve all private FQDNs
Phase 2: OS Readiness Assessment
Windows Workloads
MBR to GPT Validation:
    C:\Windows\System32> MBR2GPT.exe /validate /allowFullOS
Requirements:
- No dynamic disks
- VSS and WinRM operational
- Drivers must support Gen2 migration
- OS disk ≤128GB
Validation Commands:
    Get-Volume
    Get-PhysicalDisk
    Confirm-SecureBootUEFI
Linux Workloads
Requirements:
- UUIDs used in /etc/fstab
- Avoid multi-PV LVM expansion across disks
- Ensure kernel supports SEV-SNP or TDX
- Ensure UEFI bootloader integrity
Validation Commands:
    lsblk
    blkid
    cat /etc/fstab
    dmesg | grep -i sev
Phase 3: Network Security & Firewall Matrix
Source | Destination | Port(s) | Direction | Purpose
On-prem Servers | Migrate Appliance | 443, 9443 | Outbound | Discovery & agentless replication
Appliance | Windows VMs | 5985 | Outbound | WinRM
Appliance | Linux VMs | 22 | Outbound | SSH
Appliance | Cache Storage | 443 | Outbound | Replication writes
Appliance | Azure Migrate | 443 | Outbound | Control-plane operations
All connections route via private endpoints.
Phase 4: CMK Encryption & Managed HSM Governance
Managed HSM Creation:
- Enable purge protection
- Configure RBAC-only access
- Disable all public access
Key Creation:
    az keyvault key create --exportable true --hsm-name <HSM> --kty RSA-HSM --name cvmKey --policy "./public_SKR_policy.json"
Disk Encryption Set (DES) Creation:
    az disk-encryption-set create --name <DES> --resource-group <RG> --key-url <HSM Key URL> --identity-type SystemAssigned
Role Assignment to DES:
- Managed HSM Crypto Service Encryption User
- Key Release Policy requiring attestation
Phase 5: Confidential VM Orchestrator (CVO)
The Confidential VM Orchestrator is a built-in Azure service principal used by Azure Compute to securely manage disk encryption keys for Confidential VMs (CVMs). During boot, it validates the VM’s attestation evidence (SEV-SNP or TDX) and requests the Managed HSM to release the disk encryption key only to a verified CVM. It requires only the Managed HSM Crypto Service Release User role, which is the role assigned to it under Role Assignment in this phase. This ensures that customer-managed keys (CMKs) are released exclusively to attested CVMs and never to the hypervisor or platform operators.
Responsibilities:
- Validate the Trusted Execution Environment (TEE) measurement.
- Approve or deny key release based on attestation.
- Enforce cryptographic linkage between the VM and HSM key, ensuring keys are only accessible to legitimate CVMs.
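The role split across Phases 4 and 5 is easy to mix up, so here is a minimal dry-run sketch of it. It only prints commands and never calls Azure. It reuses the `az keyvault role assignment create` form and the two role names that appear in this guide's commands; the variable names and placeholder IDs (`HSM_NAME`, `DES_PRINCIPAL_ID`, `CVO_PRINCIPAL_ID`) and the helper `print_role_assignment` are hypothetical.

```shell
#!/usr/bin/env bash
# Dry run only: prints the az commands instead of executing them.
# HSM_NAME, DES_PRINCIPAL_ID and CVO_PRINCIPAL_ID are hypothetical placeholders.
HSM_NAME="${HSM_NAME:-<HSM>}"
DES_PRINCIPAL_ID="${DES_PRINCIPAL_ID:-<DES identity ID>}"
CVO_PRINCIPAL_ID="${CVO_PRINCIPAL_ID:-<CVO ID>}"

# Print one role assignment command for the given assignee and role.
print_role_assignment() {
  local assignee=$1 role=$2
  printf 'az keyvault role assignment create --hsm-name "%s" --assignee "%s" --role "%s" --scope /keys\n' \
    "$HSM_NAME" "$assignee" "$role"
}

# The Disk Encryption Set identity uses the key for disk encryption.
print_role_assignment "$DES_PRINCIPAL_ID" "Managed HSM Crypto Service Encryption User"
# The Confidential VM Orchestrator requests key release for attested CVMs.
print_role_assignment "$CVO_PRINCIPAL_ID" "Managed HSM Crypto Service Release User"
```

Reviewing the printed commands before running them makes it easier to confirm that each principal receives only the one role it needs.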
Identity Setup:
    New-MgServicePrincipal -AppId bf7b6499-ff71-4aa2-97a4-f372087be7f0
Role Assignment:
    az keyvault role assignment create --hsm-name <HSM> --assignee <CVO ID> --role "Managed HSM Crypto Service Release User" --scope /keys
Phase 6: Replication Enablement (Credential-Less)
Configuration Steps:
- Go to the Azure portal → Search for Azure Migrate.
- Select your Azure Migrate project.
- Navigate to Replicate.
- Select Credential-less replication.
- Choose the target subscription and resource group.
- Select a Confidential VM-compatible size for the VMs.
- Assign Disk Encryption Sets (DES) for each disk.
- Validate private endpoint connectivity to ensure replication can access the target subnet securely.
- Begin Initial Sync + Delta Replication.
All OS/data disks for CVMs must be Premium SSD or Premium SSD v2.
Phase 7: Test Migration (Isolated Validation)
Validation Checklist:
- VM boots successfully without intervention
- CVM security type = Confidential
- CMK encryption applied on all disks
- Attestation logs verified on first boot
- Applications tested and functional
- No unexpected public endpoints
- NIC, routing, NSGs, UDRs verified
Phase 8: Production Cutover
Cutover Sequence:
- Announce downtime
- Freeze transactions
- Run Planned Failover
- Validate immediately:
  - Boot integrity
  - Disk encryption
  - Guest Attestation Extension
  - Security type is Confidential
- Switch application traffic
- Decommission source systems
Phase 9: Post-Migration Hardening & Governance
Azure Policy Enforcement:
- Allowed VM SKUs → CVM only
- Enforce CMK-only disk encryption
- Deny public IP creation
- Require private endpoints
- Restrict Managed HSM access
Logging & Monitoring:
- Managed HSM logs
- Attestation logs
- Azure Monitor
- Defender for Cloud (CVM coverage)
- Microsoft Sentinel (optional)
Operational Governance:
- HSM key rotation schedule
- Quarterly attestation validation
- DES lifecycle management
- Zero-trust identity auditing
- “Break glass” procedure definition
7.
Confidential VM Limitations & Workarounds
OS Disk Size Limit:
- Confidential disk encryption is only supported for OS disks at this stage. Data disks are not supported.
- Confidential disk encryption with CMK is not supported for disks larger than 128 GB.
Workaround:
- Perform the migration using SSE (Server-Side Encryption) with Platform-Managed Keys (PMK).
- Stop and deallocate the VM post-migration.
- Update the encryption settings of the OS disk to use SSE with a Disk Encryption Set (DES) backed by a CMK.
Operating System Support:
- Windows Server 2019 and later supported
- RHEL 9.4 and later supported
- Ubuntu 22.04+ supported (depending on SKU)
For the full list, check the CVM OS Support Matrix. For additional details on limitations, refer to CVM Limitations.
8. Conclusion
Azure Confidential Virtual Machines represent a generational shift in cloud security, providing encryption, isolation, and attestation at the hardware boundary. Combined with Azure Migrate, DES/CMK encryption, Managed HSM, private networking, and robust governance, enterprises can securely modernize mission-critical workloads without application rewrites.
Building Reusable Custom Images for Azure Confidential VMs Using Azure Compute Gallery
Overview
Azure Confidential Virtual Machines (CVMs) provide hardware-enforced protection for sensitive workloads by encrypting data in use using AMD SEV-SNP technology. In enterprise environments, organizations typically need to:
- Create hardened golden images
- Standardize baseline configurations
- Support both Platform Managed Keys (PMK) and Customer Managed Keys (CMK)
- Version and replicate images across regions
This guide walks through the correct and production-supported approach for building reusable custom images for Confidential VMs using:
- PowerShell (Az module)
- Azure Portal
- Disk Encryption Sets (CMK)
- Azure Compute Gallery
Key Design Principles
Before diving into the implementation steps, note two architectural truths that become clear during real-world implementations:
✅1️⃣ The Same Image Supports PMK and CMK
The encryption model (PMK vs CMK) is not embedded in the image. Encryption is applied:
- At VM deployment time
- Through disk configuration (default PMK or Disk Encryption Set for CMK)
This means:
- You build one golden image.
- You deploy it using PMK or CMK depending on compliance requirements.
This simplifies lifecycle management significantly.
✅2️⃣ Confidential VM Image Versions Must Use Source VHD
When publishing to Azure Compute Gallery, Confidential VMs require a source VHD. This is a mandatory platform requirement for Confidential Security Type support. Therefore, the correct workflow is:
- Deploy base Confidential VM
- Harden and configure
- Generalize
- Export OS disk as VHD
- Upload to storage
- Publish to Azure Compute Gallery
- Deploy using PMK or CMK
Security Stack Breakdown
Protection Area | Technology
Data in Use | AMD SEV-SNP
Boot Integrity | Secure Boot + vTPM
Image Lifecycle | Azure Compute Gallery
Disk Encryption | PMK or CMK
Compliance Control | Disk Encryption Set (CMK)
Implementation Steps
🖥️ Step 1 – Deploy a Base Windows Confidential VM
This VM will serve as the image builder.
Key Requirements
- Gen2 image
- Confidential SKUs (such as the DCasv5 or ECasv5 series)
- SecurityType = ConfidentialVM
- Secure Boot enabled
- vTPM enabled
- Confidential OS Encryption enabled
Reference Code Snippets (PowerShell)
    # Resource group, region, and VM name for the image-builder VM
    $rg = "rg-cvm-gi-pr-sbx-01"
    $location = "NorthEurope"
    $vmName = "cvmwingiprsbx01"
    New-AzResourceGroup -Name $rg -Location $location
    $cred = Get-Credential
    # Confidential VM size and security type
    $vmConfig = New-AzVMConfig `
        -VMName $vmName `
        -VMSize "Standard_DC2as_v5" `
        -SecurityType "ConfidentialVM"
    $vmConfig = Set-AzVMOperatingSystem `
        -VM $vmConfig `
        -Windows `
        -ComputerName $vmName `
        -Credential $cred
    # Gen2 Windows Server 2022 marketplace image
    $vmConfig = Set-AzVMSourceImage `
        -VM $vmConfig `
        -PublisherName "MicrosoftWindowsServer" `
        -Offer "WindowsServer" `
        -Skus "2022-datacenter-azure-edition" `
        -Version "latest"
    # Confidential OS disk encryption with a platform-managed key
    $vmConfig = Set-AzVMOSDisk `
        -VM $vmConfig `
        -CreateOption FromImage `
        -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithPlatformKey"
    # Networking (NIC, VNet) is not shown in this snippet.
    New-AzVM -ResourceGroupName $rg -Location $location -VM $vmConfig
🔧 Step 2 – Harden and Customize the OS
This is where you:
- Install monitoring agents
- Install Defender for Endpoint
- Apply CIS baseline
- Install security agents
- Remove unwanted services
- Install application dependencies
This is your enterprise golden baseline; its exact contents depend on each organization's requirements.
🔄 Step 3 – Generalize the Windows Confidential VM (Production-Ready Method)
Confidential VMs often enable BitLocker automatically, and improper Sysprep handling can cause failures. Generalizing a Windows Confidential VM properly is critical to avoid:
- Sysprep failures
- BitLocker conflicts
- Image corruption
- Deployment errors later
Follow these steps carefully inside the VM and later through Azure PowerShell.
1. Remove Panther Folder
The Panther folder stores logs from previous Sysprep operations. If leftover logs exist, Sysprep can fail. The following command safely removes old Sysprep metadata:
    rd /s /q C:\Windows\Panther
✔ This step prevents common “Sysprep was not able to validate your Windows installation” errors.
2. Run Sysprep
Navigate to the Sysprep directory and run the sysprep command:
    cd %windir%\system32\sysprep
    sysprep.exe /generalize /shutdown
Parameters explained:
Parameter | Purpose
/generalize | Removes machine-specific info (SID, drivers)
/shutdown | Powers off VM after completion
⚠️ Handling BitLocker Issues (Common in Confidential VMs): Confidential VMs may automatically enable BitLocker. If Sysprep fails due to encryption, follow the next steps to resolve the issue and run Sysprep again.
3. Check BitLocker Status & Turn Off BitLocker
    manage-bde -status
If Protection Status is 'Protection On':
    manage-bde -off C:
Wait for decryption to complete fully.
⚠️ Do not run Sysprep again until decryption reaches 100%.
4. Reboot and Run Sysprep Again
After decryption completes:
- Reboot the VM
- Open Command Prompt as Administrator
- Navigate to the Sysprep folder and run the sysprep command:
    cd %windir%\system32\sysprep
    sysprep.exe /generalize /shutdown
✔ VM will shut down automatically.
5. Mark VM as Generalized in Azure
Now switch to Azure PowerShell:
    Stop-AzVM -Name $vmName -ResourceGroupName $rg -Force
    Set-AzVM -Name $vmName -ResourceGroupName $rg -Generalized
✔ This marks the VM as ready for image capture.
🧠 Why These Extra Steps Matter in Confidential VMs
Confidential VMs differ from standard VMs because:
- They use vTPM
- They may auto-enable BitLocker
- They enforce Secure Boot
- They use Gen2 images
Improper handling can cause:
- Sysprep failures
- Image capture errors
- Deployment failures from image
- “VM provisioning failed” issues
These cleanup steps dramatically increase the success rate.
💾 Step 4 – Export OS Disk as VHD
Azure Compute Gallery image definitions with Security Type 'TrustedLaunchAndConfidentialVmSupported' require a source VHD, because using a VM as the image source is not supported for this security type.
Generate the SAS URL for the OS disk of the virtual machine.
Copy it to a Storage Account as a .vhd file. Use Get-AzStorageBlobCopyState to validate the copy status and wait for completion.
    $vm = Get-AzVM -Name $vmName -ResourceGroupName $rg
    $osDiskName = $vm.StorageProfile.OsDisk.Name
    # Read-only SAS for the OS disk, valid for one hour
    $sas = Grant-AzDiskAccess `
        -ResourceGroupName $rg `
        -DiskName $osDiskName `
        -Access Read `
        -DurationInSecond 3600
    $storageAccountName = "stcvmgiprsbx01"
    $storageContainerName = "images"
    $destinationVHDFileName = "cvmwingiprsbx01-OsDisk-VHD.vhd"
    # Supply credentials for the context as appropriate (for example -StorageAccountKey or -UseConnectedAccount)
    $destinationContext = New-AzStorageContext -StorageAccountName $storageAccountName
    Start-AzStorageBlobCopy -AbsoluteUri $sas.AccessSAS -DestContainer $storageContainerName -DestContext $destinationContext -DestBlob $destinationVHDFileName
    # Block until the copy finishes, reusing the same context variable as above
    Get-AzStorageBlobCopyState -Blob $destinationVHDFileName -Container $storageContainerName -Context $destinationContext -WaitForComplete
🏢 Step 5 – Create Azure Compute Gallery & Image Version
Instead of creating a standalone managed image, we will:
- Create an Azure Compute Gallery
- Create an Image Definition
- Publish a Gallery Image Version from the generalized Confidential VM
This enables:
- Versioning
- Regional replication
- Staged rollouts
- Enterprise image lifecycle management
1. Create Azure Compute Gallery
    $galleryName = "cvmImageGallery"
    New-AzGallery `
        -GalleryName $galleryName `
        -ResourceGroupName $rg `
        -Location $location `
        -Description "Confidential VM Image Gallery"
2.
Create Image Definition for Windows Confidential VM
Important settings:
- OS State = Generalized
- OS Type = Windows
- HyperV Generation = V2
- Security Type = TrustedLaunchAndConfidentialVmSupported
    $imageDefName = "img-win-cvm-gi-pr-sbx-01"
    $ConfidentialVMSupported = @{Name='SecurityType';Value='TrustedLaunchAndConfidentialVmSupported'}
    $Features = @($ConfidentialVMSupported)
    New-AzGalleryImageDefinition `
        -GalleryName $galleryName `
        -ResourceGroupName $rg `
        -Location $location `
        -Name $imageDefName `
        -OsState Generalized `
        -OsType Windows `
        -Publisher "prImages" `
        -Offer "WindowsServerCVM" `
        -Sku "2022-dc-azure-edition" `
        -HyperVGeneration V2 `
        -Feature $Features
✔ HyperVGeneration must be V2 for Confidential VMs.
3. Create Gallery Image Version from Generalized VM
Now publish version 1.0.0 from the generalized VM OS disk VHD to the Image Definition:
- There is no support for performing this step using Azure PowerShell, so the Azure Portal must be used.
- Ensure the right network and RBAC access is in place on the storage account.
- Replication of the Image Version to multiple regions can be enabled for enterprises.
✅ Why Azure Compute Gallery is the Right Choice
Feature | Managed Image | Azure Compute Gallery
Versioning | ❌ | ✅
Cross-region replication | ❌ | ✅
Enterprise lifecycle | Limited | Full
Recommended for production | ❌ | ✅
For enterprise confidential workloads, Azure Compute Gallery is strongly recommended.
🚀 Step 6 – Deploy Confidential VM from Gallery Image
🔹 Using PMK (Default)
If you do not specify a Disk Encryption Set, Azure uses Platform Managed Keys automatically.
    # Resolve the resource ID of the published image version
    $imageId = (Get-AzGalleryImageVersion `
        -GalleryName $galleryName `
        -GalleryImageDefinitionName $imageDefName `
        -ResourceGroupName $rg `
        -Name "1.0.0").Id
    $vmConfig = New-AzVMConfig `
        -VMName "cvmwingiprsbx02" `
        -VMSize "Standard_DC2as_v5" `
        -SecurityType "ConfidentialVM"
    $vmConfig = Set-AzVMOSDisk `
        -VM $vmConfig `
        -CreateOption FromImage `
        -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithPlatformKey"
    $vmConfig = Set-AzVMSourceImage -VM $vmConfig -Id $imageId
    $vmConfig = Set-AzVMOperatingSystem -VM $vmConfig -Windows -ComputerName "cvmwingiprsbx02" -Credential (Get-Credential)
    New-AzVM -ResourceGroupName $rg -Location $location -VM $vmConfig
🔹 Using CMK (Same Image!)
If compliance requires CMK:
- Create a Disk Encryption Set
- Associate it with Key Vault or Managed HSM
- Attach the DES during deployment
    # $des is assumed to hold an existing Disk Encryption Set (for example from Get-AzDiskEncryptionSet)
    $vmConfig = Set-AzVMOSDisk `
        -VM $vmConfig `
        -CreateOption FromImage `
        -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithCustomerKey" `
        -DiskEncryptionSetId $des.Id
✔ Same image ✔ Different encryption model ✔ Encryption applied at deployment
🔎 Validation
Check Confidential Security:
    Get-AzVM -Name "cvmwingiprsbx02" -ResourceGroupName $rg | Select SecurityProfile
Check disk encryption:
    Get-AzDisk -ResourceGroupName $rg
Architectural Summary
- Confidential VM security is independent of the disk encryption model
- Encryption choice is applied at deployment
- One image supports multiple compliance models
- Source VHD is required for Confidential VM gallery publishing
- Azure Compute Gallery enables enterprise lifecycle
PMK vs CMK Decision Matrix
Scenario | Recommended Model
Standard enterprise workloads | PMK
Financial services / regulated | CMK
BYOK requirement | CMK
Simplicity prioritized | PMK
🏢 Enterprise Recommendations
✔ Always use Azure Compute Gallery
✔ Use semantic versioning (1.0.0, 1.0.1)
✔ Automate using Azure Image Builder
✔ Enforce Confidential VM via Azure Policy
✔ Enable Guest Attestation
✔ Monitor with Defender for Cloud
Final Thoughts
Creating custom images for Azure
Confidential VMs allows organizations to combine the security benefits of Confidential Computing with the operational efficiency of standardized deployments. By baking security baselines, monitoring agents, and required configurations directly into a golden image, every new VM starts from a consistent and trusted foundation.
A key advantage of this approach is flexibility. The custom image itself is independent of the disk encryption model, meaning the same image can be deployed using Platform Managed Keys (PMK) for simplicity or Customer Managed Keys (CMK) to meet stricter compliance requirements. This allows platform teams to maintain a single image pipeline while supporting multiple security scenarios.
By publishing images through Azure Compute Gallery, organizations can version, replicate, and manage their Confidential VM images more effectively. Combined with proper VM generalization and hardening practices, custom images become a reliable way to ensure secure, consistent, and scalable deployments of Confidential workloads in Azure.
As Confidential Computing continues to gain adoption across industries handling sensitive data, investing in a well-designed custom image pipeline will enable organizations to scale securely while maintaining consistency, compliance, and operational efficiency across their cloud environments.
Proactive Resiliency in Azure for Specialized Workload i.e. Citrix VDI on Azure Design Framework.
In this post, I’ll share my perspective on designing cloud architectures for near-zero downtime. We’ll explore how adopting multi-region strategies and other best practices can dramatically improve reliability. The discussion will be technically and architecturally driven, covering key decisions around network architecture, data replication, user experience continuity, and cost management, but it will also touch on the business angle of why this matters. The goal is to inform and inspire you to strengthen your own systems, and guide you toward concrete actions such as engaging with Microsoft Cloud Solution Architects (CSAs), submitting workloads for resiliency reviews, and embracing multi-region design patterns.
Resilience as a Shared Responsibility
One fundamental truth in cloud architecture is that ensuring uptime is a shared responsibility between the cloud provider and you, the customer. Microsoft is responsible for the reliability of the cloud: in other words, we build and operate Azure’s core infrastructure to be highly available. This includes the physical datacenters, network backbone, power/cooling, and built-in platform features for redundancy. We also provide a rich toolkit of resiliency features (think availability sets, Availability Zones, geo-redundant storage, service failover capabilities, backup services, etc.) that you can leverage to increase the reliability of your workloads.
However, the reliability in the cloud of your specific applications and data is up to you. You control your application architecture, deployment topology, data replication, and failover strategies. If you run everything in a single region with no backups or fallbacks, even Azure’s rock-solid foundation can’t save you from an outage. On the other hand, if you architect smartly (using multiple regions, zones, and Azure resiliency features properly), you can achieve end-to-end high availability even through major platform incidents.
In short: Microsoft ensures the cloud itself is resilient, but you must design resilience into your workload. It’s a true partnership, one where both sides play a critical role in delivering robust, continuous services to end-users. I emphasize this because it sets the mindset: proactive resiliency is something we do with our customers. As you’ll see, Microsoft has programs and people (like CSAs) dedicated to helping you succeed in this shared model.
Six Layers of Resilient Cloud Architecture for Citrix VDI workloads
To systematically approach multi-region resiliency, it helps to break the problem down into layers. In my work, I arrived at a six-layer decision framework for designing resilient architectures. This was originally developed for a global Citrix DaaS deployment on Azure (hence some VDI flavor in the examples), but the principles apply broadly to cloud solutions. The layers ensure we cover everything from the ground-up network connectivity to the operational model for failover.
1. Network Fabric (the global backbone)
Establish high-performance, low-latency links between regions. Preferred: Use Global VNet Peering for simplified any-to-any connectivity with minimal latency over Microsoft’s backbone (ideal for point-to-point replication traffic), rather than a more complex Azure Virtual WAN unless your topology demands it.
2. Storage Foundation (the bedrock)
In any distributed computing environment, storage is the "heaviest" component. Moving compute (VDAs) is instantaneous; moving data (profiles, user layers) is governed by bandwidth and the speed of light. The success of a multi-region DaaS deployment hinges on the performance and synchronization of the underlying storage subsystem. Use storage that can handle cross-region workload needs, especially for user data or state. For Citrix DaaS, the preferred approach is Azure NetApp Files (ANF) for consistent sub-millisecond latency and high throughput.
ANF provides enterprise-grade performance (critical during “login storms” or peak I/O) and features like Cool Access tiering to optimize cost, outperforming standard Azure Files for this scenario.
3. User Profile & State (solving data gravity)
Enable active-active availability of user data or application state across regions. Solution: FSLogix Cloud Cache (in a VDI context) or similar distributed caching/replication tech, which allows simultaneous read/write of profile data in multiple regions. In our case, Cloud Cache insulates the user session from WAN latency by writing to a local cache and asynchronously replicating to the secondary region, overcoming the challenge of traditional file locking. The principle extends to databases or state stores: use geo-replication or distributed databases to avoid any single-region state.
4. Access & Ingress (the intelligent front door)
Ensure users/customers connect to the right region and can fail over seamlessly. Preferred: Deploy a global traffic management solution under your control, e.g. customer-managed NetScaler (Citrix ADC) with Global Server Load Balancing (GSLB) to direct users to the nearest available datacenter. In our design, NetScaler’s GSLB uses DNS-based geo-routing and supports Local Host Cache for Citrix, meaning even if the cloud control plane (Citrix Cloud) is unreachable, users can still connect to their desktop apps. The general point: use Azure Front Door, Traffic Manager, or third-party equivalents to steer traffic, and avoid any solution that introduces a new single point of failure in the authentication or gateway path.
5. Master Image (ensuring global consistency)
If you rely on VM images or similar artifacts, replicate them globally. Use: Azure Compute Gallery (ACG) to manage and distribute images across regions.
In our case, we maintain a single “golden” image for virtual desktops: it’s built once, then the Compute Gallery replicates it from West Europe to East US (and any other region) automatically. This ensures that when we scale out or recover in Region B, we’re launching the exact same app versions and OS as Region A. Consistency here prevents failover from causing functionality regressions.
6. Operations & Cost (smart economics at scale)
Run an efficient DR strategy: you want readiness without paying 2x all the time. Approach: Warm Standby with autoscaling. That means the secondary region isn’t serving full traffic during normal operations (some resources can be scaled down or even deallocated), but it can scale up rapidly when needed. For our scenario, we leverage Citrix Autoscale to keep the DR site in a minimal state: only a small buffer of machines is powered on, just enough to handle a sudden failover until load-based scaling brings up the rest. This “active/passive” model (or hot-warm rather than hot-hot) strikes a balance: you pay only for what you use, yet you can meet your RTO (Recovery Time Objective) because resources spin up automatically on trigger. In cloud-native terms, you might use Azure Automation or scale sets to similar effect. The key is to avoid having an idle full duplicate environment incurring full costs 24/7, while still being prepared.
Each of these layers corresponds to critical architectural choices that determine your overall resiliency. Neglect any one layer, and that’s where Murphy’s Law will strike next. For example, you might perfectly replicate your data across regions, but if you forgot about network connectivity, a regional hub outage could still cut off access. Or you might have every system duplicated, but if users can’t be rerouted to the backup region in time, the benefit is lost. The six-layer framework helps make sure we cover all bases.
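To make the warm-standby idea in layer 6 more concrete, here is a minimal sizing sketch. It is an illustration under assumed numbers, not how Citrix Autoscale actually computes its buffer: the function name `standby_hosts`, the 15% buffer, and the session density per host are all hypothetical inputs you would replace with your own figures.

```shell
#!/usr/bin/env bash
# Hypothetical sizing helper for a warm-standby region (not a Citrix or Azure API).
# Usage: standby_hosts <peak_sessions> <sessions_per_host> <buffer_percent>
standby_hosts() {
  local peak=$1 per_host=$2 pct=$3
  # Sessions the always-on buffer should absorb, rounded up.
  local buffer_sessions=$(( (peak * pct + 99) / 100 ))
  # Hosts needed to serve those sessions, rounded up.
  echo $(( (buffer_sessions + per_host - 1) / per_host ))
}

# Assumed example: 2,000 peak sessions, 10 sessions per host, 15% kept warm.
standby_hosts 2000 10 15   # prints 30
```

With these assumed inputs the DR region keeps 30 hosts powered on instead of the 200 a full duplicate would need, and load-based scaling is expected to bring up the remainder after a failover is triggered.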
Notably, these design best practices align very closely with Azure’s Well-Architected Framework (especially the Reliability pillar), and they’re exactly the kind of prescriptive guidance we provide through programs like the Proactive Resiliency Initiative (PRI). In fact, the PRI playbook essentially prioritizes these same steps for customers:
- First, harden the network foundation: for example, ensure ExpressRoute gateways are zone-redundant and circuits are “multi-homed” in at least two locations (so no single datacenter failure breaks connectivity).
- Next, address in-region resiliency: make sure critical workloads are distributed across Availability Zones and not vulnerable to a single zone outage. (As an aside: Microsoft’s internal data shows a huge payoff here; when we configured our top Azure services for zonal resilience, we saw a 68% reduction in platform outages that lead to support incidents!)
- Then, enable multi-region continuity (BCDR): for those tier-0 and tier-1 workloads, set up cross-regional failover so even a region-wide disruption won’t take you down. Multi-region is described as the complement to (not a substitute for) zonal design: it’s about surviving the “black swan” of a region-level event, and also about supporting geo-distributed users and future growth.
In other words, if you follow the six-layer approach, you’re doing exactly what our structured resiliency programs recommend.
Announcing Cobalt 200: Azure’s next cloud-native CPU
By Selim Bilgin, Corporate Vice President, Silicon Engineering, and Pat Stemen, Vice President, Azure Cobalt
Today, we’re thrilled to announce Azure Cobalt 200, our next-generation Arm-based CPU designed for cloud-native workloads. Cobalt 200 is a milestone in our continued approach to optimize every layer of the cloud stack from silicon to software. Our design goals were to deliver full compatibility for workloads using our existing Azure Cobalt CPUs, deliver up to 50% performance improvement over Cobalt 100, and integrate with the latest Microsoft security, networking and storage technologies. Like its predecessor, Cobalt 200 is optimized for common customer workloads and delivers unique capabilities for our own Microsoft cloud products. Our first production Cobalt 200 servers are now live in our datacenters, with wider rollout and customer availability coming in 2026.
[Image: Azure Cobalt 200 SoC and platform]
Building on Cobalt 100: Leading Price-Performance
Our Azure Cobalt journey began with Cobalt 100, our first custom-built processor for cloud-native workloads. Cobalt 100 VMs have been Generally Available (GA) since October of 2024 and availability has expanded rapidly to 32 Azure datacenter regions around the world. In just one year, we have been blown away by the pace at which customers have adopted the new platform and migrated their most critical workloads to Cobalt 100 for the performance, efficiency, and price-performance benefits. Cloud analytics leaders like Databricks and Snowflake are adopting Cobalt 100 to optimize their cloud footprint. The compute performance and energy-efficiency balance of Cobalt 100-based virtual machines and containers has proven ideal for large-scale data processing workloads. Microsoft’s own cloud services have also rapidly adopted Azure Cobalt for similar benefits. Microsoft Teams achieved up to 45% better performance using Cobalt 100 than their previous compute platform.
This increased performance means fewer servers are needed for the same task; for instance, Microsoft Teams media processing uses 35% fewer compute cores with Cobalt 100.

Designing Compute Infrastructure for Real Workloads

With this solid foundation, we set out to design a worthy successor: Cobalt 200. We faced a key challenge: traditional compute benchmarks do not represent the diversity of our customer workloads. Our telemetry from the wide range of workloads running in Azure (small microservices to globally available SaaS products) did not match common hardware performance benchmarks. Existing benchmarks tend to skew toward CPU core-focused compute patterns, leaving gaps in how real-world cloud applications behave at scale when using network and storage resources. Optimizing Azure Cobalt for customer workloads requires us to expand beyond these CPU core benchmarks to truly understand and model the diversity of customer workloads in Azure.

As a result, we created a portfolio of benchmarks drawn directly from the usage patterns we see in Azure, including databases, web servers, storage caches, network transactions, and data analytics. Each of our benchmark workloads includes multiple variants for performance evaluation based on the ways our customers may use the underlying database, storage, or web serving technology. In total, we built and refined over 140 individual benchmark variants as part of our internal evaluation suite.

With the help of our software teams, we created a complete digital twin simulation from the silicon up: beginning with the CPU core microarchitecture, fabric, and memory IP blocks in Cobalt 200, all the way through the server design and rack topology. Then, we used AI, statistical modelling and the power of Azure to model the performance and power consumption of the 140 benchmarks against 2,800 combinations of SoC and system design parameters: core count, cache size, memory speed, server topology, SoC power, and rack configuration.
This resulted in the evaluation of over 350,000 configuration candidates of the Cobalt 200 system as part of our design process. This extensive modelling and simulation helped us to quickly iterate to find the optimal design point for Cobalt 200, delivering over 50% increased performance compared to Cobalt 100, all while continuing to deliver our most power-efficient platform in Azure.

Cobalt 200: Delivering Performance and Efficiency

At the heart of every Cobalt 200 server is the most advanced compute silicon in Azure: the Cobalt 200 System-on-Chip (SoC). The Cobalt 200 SoC is built around the Arm Neoverse Compute Subsystems V3 (CSS V3), the latest performance-optimized core and fabric from Arm. Each Cobalt 200 SoC includes 132 active cores with 3 MB of L2 cache per core and 192 MB of L3 system cache to deliver exceptional performance for customer workloads.

Power efficiency is just as important as raw performance. Energy consumption represents a significant portion of the lifetime operating cost of a cloud server. One of the unique innovations in our Azure Cobalt CPUs is individual per-core Dynamic Voltage and Frequency Scaling (DVFS). In Cobalt 200 this allows each of the 132 cores to run at a different performance level, delivering optimal power consumption no matter the workload. We are also taking advantage of the latest TSMC 3nm process, further improving power efficiency.

Security is top-of-mind for all of our customers and a key part of the unique innovation in Cobalt 200. We designed and built a custom memory controller for Cobalt 200, so that memory encryption is on by default with negligible performance impact. Cobalt 200 also implements Arm’s Confidential Compute Architecture (CCA), which supports hardware-based isolation of VM memory from the hypervisor and host OS.
When designing Cobalt 200, our benchmark workloads and design simulations revealed an interesting trend: several universal compute patterns emerged, namely compression, decompression, and encryption. Over 30% of cloud workloads had significant use of one of these common operations. Optimizing for these common operations required a different approach than just cache sizing and CPU core selection. We designed custom compression and cryptography accelerators (dedicated blocks of silicon on each Cobalt 200 SoC) solely for the purpose of accelerating these operations without sacrificing CPU cycles. These accelerators help reduce workload CPU consumption and overall costs. For example, by offloading compression and encryption tasks to the Cobalt 200 accelerator, Azure SQL is able to reduce use of critical compute resources, prioritizing them for customer workloads.

Leading Infrastructure Innovation with Cobalt 200

Azure Cobalt is more than just an SoC, and we are constantly optimizing and accelerating every layer in the infrastructure. The latest Azure Boost capabilities are built into the new Cobalt 200 system, which significantly improves networking and remote storage performance. Azure Boost delivers increased network bandwidth and offloads remote storage and networking tasks to custom hardware, improving overall workload performance and reducing latency.

Cobalt 200 systems also embed the Azure Integrated HSM (Hardware Security Module), providing customers with top-tier cryptographic key protection within Azure’s infrastructure, ensuring sensitive data stays secure. The Azure Integrated HSM works with Azure Key Vault for simplified management of encryption keys, offering high availability and scalability as well as meeting FIPS 140-3 Level 3 compliance.
Figure: An Azure Cobalt 200 server in a validation lab

Looking Forward to 2026

We are excited about the innovation and advanced technology in Cobalt 200 and look forward to seeing how our customers create breakthrough products and services. We’re busy racking and stacking Cobalt 200 servers around the world and look forward to sharing more as we get closer to wider availability next year.

- Check out the Microsoft Ignite opening keynote
- Read more on what's new in Azure at Ignite
- Learn more about Microsoft's global infrastructure

Azure VNet Flow Logs with Terraform: The Complete Migration and Traffic Analytics Guide
Migrating from NSG Flow Logs to VNet Flow Logs in Azure: Implementation with Terraform

Author: Ibrahim Baig (Consultant)

Executive Summary

Microsoft is retiring Network Security Group (NSG) flow logs and recommends migrating to Virtual Network (VNet) flow logs. After June 30, 2025, new NSG flow logs cannot be created, and all NSG flow logs will be retired by September 30, 2027. Migrating to VNet flow logs ensures continued support and provides broader, simpler network visibility.

What Changed & Key Dates

- June 30, 2025: Creation of new NSG flow logs is blocked.
- September 30, 2027: NSG flow logs are retired (resources deleted; historical blobs remain per retention policy).
- Microsoft provides migration scripts and policy guidance for NSG-to-VNet flow logs.

Why Migrate? (Benefits)

Operational Simplicity & Coverage
- Enable logging at the VNet, subnet, or NIC scope, with no dependency on NSGs.
- Broader visibility across all workloads inside a VNet, not just NSG-governed traffic.

Security & Analytics
- Native integration with Traffic Analytics for enriched insights.
- Monitor Azure Virtual Network Manager (AVNM) security admin rules.

Continuity & Cost Parity
- VNet flow logs are priced the same as NSG flow logs (with 5 GB/month free).

What’s New in VNet Flow Logs

- Scopes: Enable at VNet, subnet, or NIC level.
- Storage: JSON logs to Azure Storage.
- At-scale enablement: Built-in Azure Policy for auditing and auto-deployment.
- Analytics: Traffic Analytics add-on for deep insights.
- AVNM awareness: Observe centrally managed security admin rules.

Traffic Analytics: Capabilities & Value

Traffic Analytics (TA) is a powerful add-on for VNet flow logs, providing:
- Automated Traffic Insights: Visualize traffic flows, identify top talkers, and detect anomalous patterns.
- Threat Detection: Surface suspicious flows, lateral movement, and communication with malicious IPs.
- Network Segmentation Validation: Confirm that segmentation policies are effective and spot unintended access.
- Performance Monitoring: Analyze bandwidth usage, latency, and flow volumes for troubleshooting.
- Customizable Dashboards: Drill down by subnet, region, or workload for targeted investigations.
- Integration: Works with Azure Monitor and Log Analytics for alerting and automation.

For practical recipes and advanced use cases, see https://blog.cloudtrooper.net/2024/05/08/vnet-flow-logs-recipes/.

Gap: The Terraform Registry page for azurerm_network_watcher_flow_log does not yet provide an explicit VNet flow logs example. In practice, you use the same resource and set target_resource_id to the ID of the VNet (or Subnet/NIC).

Registry page (latest): https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/network_watcher_flow_log

Important notes:
- Same resource block: azurerm_network_watcher_flow_log
- Use target_resource_id = <resource ID of VNet/Subnet/NIC> (instead of the legacy network_security_group_id).
- As of 30 July 2025, creating new NSG flow logs is no longer possible (provider notes); migrate to VNet/Subnet/NIC targets.
- Keep your azurerm provider up to date; earlier builds had validation gaps for subnet/NIC IDs, which were tracked and addressed in provider issues.

Implementation Guide

Option A: Terraform (Recommended for IaC)

Note: Use a dedicated Storage account for flow logs, as lifecycle rules may be overwritten.
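The Terraform configuration that follows references a virtual network, a storage account, a Log Analytics workspace, and three input variables that it does not define. A minimal sketch of those supporting pieces might look like the block below. The resource labels (vnet, flowlogs_sa, law) and variable names are taken from those references; everything else (the resource group, the naming patterns, the 10.10.0.0/16 address space, the LRS replication choice, and the workspace SKU and retention) is an illustrative assumption rather than part of the original guide.

```hcl
# Input variables used by the flow log configuration.
variable "region" { type = string }    # Azure region name, e.g. "westeurope"
variable "vnet_name" { type = string } # name of the VNet to log
variable "env" { type = string }       # short lowercase tag such as "prod"

# Hypothetical resource group holding the network resources.
resource "azurerm_resource_group" "net" {
  name     = "rg-network-${var.env}"
  location = var.region
}

# The VNet whose traffic will be logged (address space is an assumption).
resource "azurerm_virtual_network" "vnet" {
  name                = var.vnet_name
  location            = azurerm_resource_group.net.location
  resource_group_name = azurerm_resource_group.net.name
  address_space       = ["10.10.0.0/16"]
}

# Dedicated storage account for flow logs, kept in the same region as the VNet.
# Storage account names must be globally unique, 3-24 lowercase letters/digits.
resource "azurerm_storage_account" "flowlogs_sa" {
  name                     = "stflowlogs${var.env}"
  resource_group_name      = azurerm_resource_group.net.name
  location                 = azurerm_resource_group.net.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

# Log Analytics workspace that Traffic Analytics writes to.
resource "azurerm_log_analytics_workspace" "law" {
  name                = "law-network-${var.env}"
  location            = azurerm_resource_group.net.location
  resource_group_name = azurerm_resource_group.net.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}
```

Keeping the storage account separate from other workloads follows the note above: the flow log resource manages retention on the account, so sharing it with unrelated lifecycle rules invites conflicts.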
    terraform {
      required_version = ">= 1.5"

      required_providers {
        azurerm = {
          source  = "hashicorp/azurerm"
          version = ">= 3.110.0" # or latest
        }
      }
    }

    provider "azurerm" {
      features {}
    }

    # Look up the existing regional Network Watcher (default Azure naming).
    data "azurerm_network_watcher" "this" {
      name                = "NetworkWatcher_${var.region}"
      resource_group_name = "NetworkWatcherRG"
    }

    # The VNet, storage account, and Log Analytics workspace referenced below
    # are assumed to be defined elsewhere in the configuration.
    resource "azurerm_network_watcher_flow_log" "vnet_flow_log" {
      name                 = "${var.vnet_name}-flowlog"
      network_watcher_name = data.azurerm_network_watcher.this.name
      resource_group_name  = data.azurerm_network_watcher.this.resource_group_name

      # Use a subnet or NIC ID here to narrow the logging scope.
      target_resource_id = azurerm_virtual_network.vnet.id
      storage_account_id = azurerm_storage_account.flowlogs_sa.id
      enabled            = true

      retention_policy {
        enabled = true
        days    = 30
      }

      traffic_analytics {
        enabled               = true
        workspace_id          = azurerm_log_analytics_workspace.law.workspace_id
        workspace_region      = azurerm_log_analytics_workspace.law.location
        workspace_resource_id = azurerm_log_analytics_workspace.law.id
        interval_in_minutes   = 60 # 10 or 60
      }

      tags = {
        owner       = "network-platform"
        environment = var.env
      }
    }

Option B: Azure CLI

    az network watcher flow-log create \
      --location westus \
      --resource-group MyResourceGroup \
      --name myVNetFlowLog \
      --vnet MyVNetName \
      --storage-account mystorageaccount \
      --workspace "/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<LAWName>" \
      --traffic-analytics true \
      --interval 60

Option C: Azure Portal

- Go to Network Watcher → Flow logs → + Create.
- Choose Flow log type = Virtual network; select VNet/Subnet/NIC, Storage account, and optionally enable Traffic Analytics.

Option D: At Scale via Azure Policy

- Use built-in policies to audit and auto-deploy VNet flow logs (DeployIfNotExists).

Migration Approach (NSG → VNet Flow Logs)

1. Inventory existing NSG flow logs.
2. Choose migration method: Microsoft script or Azure Policy.
3. Run both in parallel temporarily to validate.
4. Disable NSG flow logs before retirement.
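Option A logs the whole VNet. Since the same resource accepts a subnet or NIC ID in target_resource_id, a narrower scope only needs a different target. The sketch below assumes a hypothetical subnet named snet-app with the prefix 10.10.1.0/24 (which must fall inside your VNet's address space), and it reuses the Network Watcher data source, VNet, and storage account names from Option A.

```hcl
# Hypothetical subnet inside the VNet referenced in Option A.
resource "azurerm_subnet" "app" {
  name                 = "snet-app"
  resource_group_name  = azurerm_virtual_network.vnet.resource_group_name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.10.1.0/24"]
}

# Flow log scoped to a single subnet instead of the whole VNet.
resource "azurerm_network_watcher_flow_log" "subnet_flow_log" {
  name                 = "${azurerm_subnet.app.name}-flowlog"
  network_watcher_name = data.azurerm_network_watcher.this.name
  resource_group_name  = data.azurerm_network_watcher.this.resource_group_name

  # The only change in scope: a subnet ID rather than the VNet ID.
  target_resource_id = azurerm_subnet.app.id
  storage_account_id = azurerm_storage_account.flowlogs_sa.id
  enabled            = true

  retention_policy {
    enabled = true
    days    = 30
  }
}
```

Subnet-level scoping can be a reasonable way to pilot VNet flow logs on one workload during the parallel-run step of the migration before enabling them across the VNet.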
Challenges & Mitigations

- Permissions: Ensure required roles on Log Analytics workspace.
- Terraform lifecycle: Use a dedicated Storage account.
- Tooling compatibility: Verify SIEM/NDR support.
- Provider/API maturity: Use current azurerm provider.

Validation Checklist

- Storage: New blobs appear in the configured Storage account.
- Traffic Analytics: Data visible in Log Analytics workspace.
- AVNM: Confirm traffic allowed/denied states appear in logs.

Cost Considerations

- VNet flow logs ingestion: $0.50/GB after 5 GB free/month.
- Traffic Analytics processing: $2.30/GB (60-min) or $3.50/GB (10-min).

Traffic Analytics Deep Dive

VNet Flow Logs are stored in Azure Blob Storage. Optionally, you can enable Traffic Analytics, which will do two things: it will enrich the flow logs with additional information, and will send everything to a Log Analytics Workspace for easy querying. This “enrich and forward to Log Analytics” operation will happen in intervals, either every 10 minutes or every hour.

Table Structure: NTAIpDetails

This table will contain some enrichment data about public IP addresses, including whether they belong to Azure services and their region, and geolocation information for other public IPs. The following query gives a sample of what that table contains:

    NTAIpDetails
    | distinct FlowType, PublicIpDetails, Location

Table Structure: NTATopologyDetails

This table contains information about different elements of your topology, including VNets, subnets, route tables, routes, NSGs, Application Gateways and much more.

Table Structure: NTANetAnalytics

Alright, now we are coming to more interesting things: this table is the one containing the flows we are looking for. Records in this table will contain the usual attributes you would expect such as source and destination IP, protocol, and destination port.
Additionally, data will be enriched with information such as:
- Source and destination VM
- Source and destination NIC
- Source and destination subnet
- Source and destination load balancer
- Flow encryption (yes/no)
- Whether the flow is going over ExpressRoute
- And many more

Below is a scenario with a detailed query that shows one way to extract information from VNet Flow Logs and Traffic Analytics. Of course, this is just one scenario that came to mind on my topology; the idea is that you can take inspiration from it to support your individual use case.

Example Scenario

Imagine you want to see which IP addresses a given virtual machine has been talking to in the last 10 days:

    NTANetAnalytics
    | where TimeGenerated > ago(10d)
    | where SrcIp == "10.10.1.4" and strlen(DestIp) > 0
    | summarize TotalBytes = sum(BytesDestToSrc + BytesSrcToDest) by SrcIp, DestIp

Similarly, you can experiment with KQL queries like this one in the workspace to dig deeper into the flow logs.

References & Further Reading

- https://learn.microsoft.com/en-us/azure/network-watcher/nsg-flow-logs-overview
- https://learn.microsoft.com/en-us/azure/network-watcher/nsg-flow-logs-migrate
- https://learn.microsoft.com/en-us/azure/network-watcher/vnet-flow-logs-overview
- https://learn.microsoft.com/en-us/azure/network-watcher/vnet-flow-logs-manage
- https://learn.microsoft.com/en-us/cli/azure/network/watcher/flow-log?view=azure-cli-latest
- https://learn.microsoft.com/en-us/azure/network-watcher/vnet-flow-logs-policy
- https://azure.microsoft.com/en-us/pricing/details/network-watcher/
- https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/network_watcher_flow_log
- https://blog.cloudtrooper.net/2024/05/08/vnet-flow-logs-recipes/

Resiliency Best Practices You Need For your Blob Storage Data
Maintaining Resiliency in Azure Blob Storage: A Guide to Best Practices

Azure Blob Storage is a cornerstone of modern cloud storage, offering scalable and secure solutions for unstructured data. However, maintaining resiliency in Blob Storage requires careful planning and adherence to best practices. In this blog, I’ll share practical strategies to ensure your data remains available, secure, and recoverable under all circumstances.

1. Enable Soft Delete for Accidental Recovery (Most Important)

Mistakes happen, but soft delete can be your safety net. It allows you to recover deleted blobs within a specified retention period:
- Configure a soft delete retention period in Azure Storage.
- Regularly monitor your blob storage to ensure that critical data is not permanently removed by mistake.

Enabling soft delete in Azure Blob Storage carries no additional charge for the feature itself. However, it can increase your storage costs because the deleted data is retained for the configured retention period, which means:
- The retained data contributes to the total storage consumption during the retention period.
- You will be charged according to the pricing tier of the data (Hot, Cool, or Archive) for the duration of retention.

2. Utilize Geo-Redundant Storage (GRS)

Geo-redundancy ensures your data is replicated across regions to protect against regional failures:
- Choose RA-GRS (Read-Access Geo-Redundant Storage) for read access to secondary replicas in the event of a primary region outage.
- Assess your workload’s RPO (Recovery Point Objective) and RTO (Recovery Time Objective) needs to select the appropriate redundancy.

3. Implement Lifecycle Management Policies

Efficient storage management reduces costs and ensures long-term data availability:
- Set up lifecycle policies to transition data between hot, cool, and archive tiers based on usage.
- Automatically delete expired blobs to save on costs while keeping your storage organized.

4. Secure Your Data with Encryption and Access Controls

Resiliency is incomplete without robust security. Protect your blobs using:
- Encryption at Rest: Azure automatically encrypts data using server-side encryption (SSE). Consider enabling customer-managed keys for additional control.
- Access Policies: Implement Shared Access Signatures (SAS) and Stored Access Policies to restrict access and enforce expiration dates.

5. Monitor and Alert for Anomalies

Stay proactive by leveraging Azure’s monitoring capabilities:
- Use Azure Monitor and Log Analytics to track storage performance and usage patterns.
- Set up alerts for unusual activities, such as sudden spikes in access or deletions, to detect potential issues early.

6. Plan for Disaster Recovery

Ensure your data remains accessible even during critical failures:
- Create snapshots of critical blobs for point-in-time recovery.
- Enable backup for blobs and turn on the immutability feature.
- Test your recovery process regularly to ensure it meets your operational requirements.

7. Resource Lock

Adding Azure resource locks to your Blob Storage account provides an additional layer of protection by preventing accidental deletion or modification of critical resources.

8. Educate and Train Your Team

Operational resilience often hinges on user awareness:
- Conduct regular training sessions on Blob Storage best practices.
- Document and share a clear data recovery and management protocol with all stakeholders.

9. Critical Tip: Do Not Create New Containers with Deleted Names During Recovery

If a container or its blob data is deleted for any reason and recovery is being attempted, it’s crucial not to create a new container with the same name immediately. Doing so can significantly hinder the recovery process by overwriting backend pointers, which are essential for restoring the deleted data. Always ensure that no new containers are created using the same name during the recovery attempt to maximize the chances of successful restoration.
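Several of the practices above (soft delete, geo-redundancy, lifecycle management, and resource locks) can be expressed as configuration. The following is a minimal Terraform sketch using the azurerm provider, not a definitive setup: all names, the region, the retention periods, the logs/ prefix, and the tiering thresholds are illustrative assumptions to adapt to your own RPO, RTO, and cost targets.

```hcl
provider "azurerm" {
  features {}
}

# Hypothetical resource group and region.
resource "azurerm_resource_group" "data" {
  name     = "rg-blob-resiliency"
  location = "canadacentral"
}

resource "azurerm_storage_account" "critical" {
  name                     = "stcriticaldata001" # must be globally unique
  resource_group_name      = azurerm_resource_group.data.name
  location                 = azurerm_resource_group.data.location
  account_tier             = "Standard"
  account_replication_type = "RAGRS" # read-access geo-redundant storage

  blob_properties {
    # Blob soft delete: deleted blobs stay recoverable for 14 days.
    delete_retention_policy {
      days = 14
    }

    # Container soft delete: deleted containers stay recoverable for 14 days.
    container_delete_retention_policy {
      days = 14
    }
  }
}

# Lifecycle management: tier down ageing blobs, then expire them.
resource "azurerm_storage_management_policy" "tiering" {
  storage_account_id = azurerm_storage_account.critical.id

  rule {
    name    = "tier-and-expire-logs"
    enabled = true

    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["logs/"] # container "logs"; an assumed example
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = 30
        tier_to_archive_after_days_since_modification_greater_than = 90
        delete_after_days_since_modification_greater_than          = 365
      }
    }
  }
}

# Resource lock: blocks deletion of the storage account itself.
resource "azurerm_management_lock" "no_delete" {
  name       = "lock-storage-no-delete"
  scope      = azurerm_storage_account.critical.id
  lock_level = "CanNotDelete"
  notes      = "Protects the storage account from accidental deletion."
}
```

Note that a CanNotDelete lock guards management operations on the account; it does not stop individual blobs from being deleted through the data plane, which is why it complements soft delete rather than replacing it. Remember too that soft-deleted data is billed for the whole retention period, so longer retention windows trade cost for recoverability.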
Wrapping It Up

Azure Blob Storage offers an exceptional platform for scalable and secure storage, but its resiliency depends on following best practices. By enabling features like soft delete, implementing redundancy, securing data, and proactively monitoring your storage environment, you can ensure that your data is resilient to failures and recoverable in any scenario.

- Protect your Azure resources with a lock - Azure Resource Manager | Microsoft Learn
- Data redundancy - Azure Storage | Microsoft Learn
- Overview of Azure Blobs backup - Azure Backup | Microsoft Learn