Build Your AI Agent in 5 Minutes with AI Toolkit for VS Code
What if building an AI agent was as easy as filling out a form? No frameworks to install. No boilerplate to copy-paste from GitHub. No YAML to debug at midnight. Just VS Code, one extension, and an idea. AI Toolkit for VS Code turns agent development into something anyone can do — whether you're a seasoned developer who wants full code control, or someone who's never touched an AI framework and just wants to see something work. Let's build an agent. Then let's explore what else this toolkit can do.

Getting Set Up
You need two things:
- VS Code — download and install if you haven't already
- AI Toolkit extension — open VS Code, go to Extensions (Ctrl+Shift+X), search "AI Toolkit", and install it
That's it. No terminal commands. No dependencies to wrangle. When AI Toolkit installs, it brings everything it needs — including the Microsoft Foundry integration and GitHub Copilot skills for agent development. Once installed, you'll see a new AI Toolkit icon in the left sidebar. Click it. That's your home base for everything we're about to do.

Build an Agent — No Code Required
Open the Command Palette (Ctrl+Shift+P) and type "Create Agent". You'll see a clean panel with two options side by side:
- Design an Agent Without Code — visual builder, perfect for getting started
- Create in Code — full project scaffolding, for when you want complete control
Click "Design an Agent Without Code." Agent Builder opens up. Now fill in three things:
1. Give it a name. Something descriptive. For this example: "Azure Advisor".
2. Pick a model. Click the model dropdown. You'll see a list of available models — GPT-4.1, Claude Opus 4.6, and others. Foundry models appear at the top as recommended options. Pick one. Here's a nice detail: you don't need to know whether your model uses the Chat Completions API or the Responses API. AI Toolkit detects this automatically and handles the switch behind the scenes.
3. Write your instructions. This is where you tell the agent who it is and how to behave.
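The exact wording is up to you; as a purely illustrative sketch (this text is hypothetical, not a Toolkit default), instructions for the "Azure Advisor" example might look like:

```
You are Azure Advisor, an assistant that helps users design and troubleshoot Azure deployments.
- Recommend specific Azure services and explain the trade-offs briefly.
- Ask one clarifying question when the user's scenario is ambiguous.
- Keep answers concise; expand only when the user asks for detail.
```

A few concrete behavioral rules like these usually work better than a single vague sentence such as "be helpful."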
Think of it as a personality brief.

Hit Run
That's it. Click Run and start chatting with your agent in the built-in playground.

Want More Control? Build in Code
The no-code path is great for prototyping and prompt engineering. But when you need custom tools, business logic, or multi-agent workflows — switch to code. From the Create Agent view, choose "Create in Code with Full Control." You get two options:
- Scaffold from a template: Pick a pre-built project structure — single agent, multi-agent, or LangGraph workflow. AI Toolkit generates a complete project with proper folder structure, configuration files, and starter code. Open it, customize it, run it.
- Generate with GitHub Copilot: Describe your agent in plain English in Copilot Chat: "Create a customer support agent that can look up order status, process returns, and escalate to a human when the customer is upset." Copilot generates a full project — agent logic, tool definitions, system prompts, and evaluation tests. It uses the microsoft-foundry skill, the same open-source skill powering GitHub Copilot for Azure. AI Toolkit installs and keeps this skill updated automatically — you never configure it. The output is structured and production-ready. Real folder structure. Real separation of concerns. Not a single-file script.
Either way, you get a project you can version-control, test, and deploy.

Cool Features You Should Know About
Building the agent is just the beginning. Here's where AI Toolkit gets genuinely impressive.

🔧 Add Real Tools with MCP
Your agent can do more than just talk. Click Add Tool in Agent Builder to connect MCP (Model Context Protocol) servers — these give your agent real capabilities:
- Search the web
- Query a database
- Read files
- Call external APIs
- Interact with any service that has an MCP server
You control how much freedom your agent gets. Set tool approval to Auto (tool runs immediately) or Manual (you approve each call).
Perfect for when you trust a read-only search tool but want oversight on anything that takes action. You can also delete MCP servers directly from the Tool Catalog when you no longer need them — no config file editing required.

🧠 Prompt Optimizer
Not sure if your instructions are good enough? Click the Improve button in Agent Builder. The Foundry Prompt Optimizer analyzes your prompt and rewrites it to be clearer, more structured, and more effective. It's like having a prompt engineering expert review your work — except it takes seconds.

🕸️ Agent Inspector
When your agent runs, open Agent Inspector to see what's happening under the hood. It visualizes the entire workflow in real time — which tools are called, in what order, and how the agent makes decisions.

💬 Conversations View
Agent Builder includes a Conversations tab where you can review the full history of interactions with your agent. Scroll through past conversations, compare how your agent handled different scenarios, and spot patterns in where it succeeds or struggles.

📁 Everything in One Sidebar
AI Toolkit puts everything in a single My Resources panel:
- Recent Agents — one-click access to agents you've been working on
- Local Resources — your local models, agents, and tools
- Foundry Resources — remote agents and models (if connected)

Why AI Toolkit?
There are other ways to build agents. What makes this different?
- Everything is in VS Code. You don't context-switch between a web UI, a CLI, and an IDE. Discovery, building, testing, debugging, and deployment all happen in one place.
- No-code and code-first aren't separate products. They're two views of the same agent. Start in Agent Builder, click View Code, and you have a full project. Or go the other way — build in code and test in the visual playground.
- Copilot is deeply integrated. Not as a chatbot bolted on the side — as an actual development tool that understands agent architecture and generates production-quality scaffolding.
Wrapping Up
📥 Install: AI Toolkit on the VS Code Marketplace
📖 Learn: AI Toolkit Documentation
Open VS Code. Ctrl+Shift+P. Type "Create Agent." Five minutes from now, you'll have an agent running. 🚀

Announcing Microsoft Azure Network Adapter (MANA) support for Existing VM SKUs
As a leader in cloud infrastructure, Microsoft ensures that Azure's IaaS customers always have access to the latest hardware. Our goal is to consistently deliver technology to support business-critical workloads with world-class efficiency, reliability, and security. Customers benefit from cutting-edge performance enhancements and features, helping them future-proof their workloads while maintaining business continuity. Azure will be deploying the Microsoft Azure Network Adapter (MANA) for existing VM size families, with a deployment timeline to be announced by mid-to-late April. The intent is to provide the benefits of new server hardware to customers of existing VM SKUs as they work towards migrating to newer SKUs. The deployments will be based on capacity needs and won't be restricted by region. Once the hardware is available in a region, VMs can be deployed to it as needed. Workloads on operating systems which fully support MANA will benefit from sub-second Network Interface Card (NIC) firmware upgrades, higher throughput, lower latency, increased security, and Azure Boost-enabled data-path accelerations. If your workload doesn't support MANA today, you'll still be able to access Azure's network on MANA-enabled SKUs, but performance will be comparable to previous-generation (non-MANA) hardware. Check out the Azure Boost overview and the Microsoft Azure Network Adapter (MANA) overview for more detailed information and OS compatibility. To determine whether your VMs are impacted and what actions (if any) you should take, start with MANA support for existing VM SKUs. That article explains which VM sizes are eligible to be deployed on the new MANA-enabled hardware, what actions (if any) you should take, and how to determine whether a workload has been deployed on MANA-enabled hardware.

Guardrails for Generative AI: Securing Developer Workflows
Generative AI is revolutionizing software development, accelerating delivery but introducing compliance and security risks if left unchecked. Tools like GitHub Copilot empower developers to write code faster, automate repetitive tasks, and even generate tests and documentation. But speed without safeguards introduces risk. Unchecked AI-assisted development can lead to security vulnerabilities, data leakage, compliance violations, and ethical concerns. In regulated or enterprise environments, this risk multiplies rapidly as AI scales across teams. The solution? Guardrails — a structured approach to ensure AI-assisted development remains secure, responsible, and enterprise-ready.

In this blog, we explore how to embed responsible AI guardrails directly into developer workflows using:
- Azure AI Content Safety
- GitHub Copilot enterprise controls
- Copilot Studio governance
- Azure AI Foundry
- CI/CD and ALM integration
The goal: maximize developer productivity without compromising trust, security, or compliance.

Key Points:
- Why Guardrails Matter: AI-generated code may include insecure patterns or violate organizational policies.
- Azure AI Content Safety: Provides APIs to detect harmful or sensitive content in prompts and outputs, ensuring compliance with ethical and legal standards.
- Copilot Studio Governance: Enables environment strategies, Data Loss Prevention (DLP), and role-based access to control how AI agents interact with enterprise data.
- Azure AI Foundry: Acts as the control plane for Generative AI, turning Responsible AI from policy into operational reality.
- Integration with GitHub Workflows: Guardrails can be enforced in the IDE, Copilot Chat, and CI/CD pipelines using GitHub Actions for automated checks.
- Outcome: Developers maintain productivity while ensuring secure, compliant, and auditable AI-assisted development.
Why Guardrails Are Non-Negotiable
AI-generated code and prompts can unintentionally introduce:
- Security flaws — injection vulnerabilities, unsafe defaults, insecure patterns
- Compliance risks — exposure of PII, secrets, or regulated data
- Policy violations — copyrighted content, restricted logic, or non-compliant libraries
- Harmful or biased outputs — especially in user-facing or regulated scenarios
Without guardrails, organizations risk shipping insecure code, violating governance policies, and losing customer trust. Guardrails enable teams to move fast — without breaking trust.

The Three Pillars of AI Guardrails
Enterprise-grade AI guardrails operate across three core layers of the developer experience. These pillars are centrally governed and enforced through Azure AI Foundry, which provides lifecycle, evaluation, and observability controls across all three.

1. GitHub Copilot Controls (Developer-First Safety)
GitHub Copilot goes beyond autocomplete and includes built-in safety mechanisms designed for enterprise use:
- Duplicate Detection: Filters code that closely matches public repositories.
- Custom Instructions: Enhance coding standards via .github/copilot-instructions.md.
- Copilot Chat: Provides contextual help for debugging and secure coding practices.
Pro Tip: Use Copilot Enterprise controls to enforce consistent policies across repositories and teams.

2. Azure AI Content Safety (Prompt & Output Protection)
This service adds a critical protection layer across prompts and AI outputs:
- Prompt Injection Detection: Blocks malicious attempts to override instructions or manipulate model behavior.
- Groundedness Checks: Ensures outputs align with trusted sources and expected context.
- Protected Material Detection: Flags copyrighted or sensitive content.
- Custom Categories: Tailor filters for industry-specific or regulatory requirements.
Example: A financial services app can block outputs containing PII or regulatory violations using custom safety categories.

3.
Copilot Studio Governance (Enterprise-Scale Control)
For organizations building custom copilots, governance is non-negotiable. Copilot Studio enables:
- Data Loss Prevention (DLP): Prevent sensitive data from leaking through risky connectors or channels.
- Role-Based Access (RBAC): Control who can create, test, approve, deploy, and publish copilots.
- Environment Strategy: Separate dev, test, and production environments.
- Testing Kits: Validate prompts, responses, and behavior before production rollout.
Why it matters: Governance ensures copilots scale safely across teams and geographies without compromising compliance.

Azure AI Foundry: The Platform That Operationalizes the Three Pillars
While the three pillars define where guardrails are applied, Azure AI Foundry defines how they are governed, evaluated, and enforced at scale. Azure AI Foundry acts as the control plane for Generative AI — turning Responsible AI from policy into operational reality.

What Azure AI Foundry Adds
Centralized Guardrail Enforcement: Define guardrails once and apply them consistently across models, agents, tool calls, and outputs. Guardrails specify:
- Risk types (PII, prompt injection, protected material)
- Intervention points (input, tool call, tool response, output)
- Enforcement actions (annotate or block)

Built-In Evaluation & Red-Teaming: Azure AI Foundry embeds continuous evaluation into the GenAIOps lifecycle:
- Pre-deployment testing for safety, groundedness, and task adherence
- Adversarial testing to detect jailbreaks and misuse
- Post-deployment monitoring using built-in and custom evaluators
Guardrails are measured and validated, not assumed.

Observability & Auditability: Foundry integrates with Azure Monitor and Application Insights to provide:
- Token usage and cost visibility
- Latency and error tracking
- Safety and quality signals
- Trace-level debugging for agent actions
Every interaction is logged, traceable, and auditable — supporting compliance reviews and incident investigations.
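The latency and error-tracking bullet can be made concrete with a query. As an illustrative sketch (the `dependencies` table and its `timestamp`, `duration`, `success`, and `name` columns are standard Application Insights schema; the exact signals emitted by your Foundry deployment may differ), a starting point in KQL might be:

```kusto
dependencies
| where timestamp > ago(1h)
| summarize avgDurationMs = avg(duration), failures = countif(success == false) by name
| order by failures desc
```

Queries like this can back a dashboard or an alert rule, so guardrail regressions show up as measurable signals rather than anecdotes.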
Identity-First Security for AI Agents: Each AI agent operates as a first-class identity backed by Microsoft Entra ID:
- No secrets embedded in prompts or code
- Least-privilege access via Azure RBAC
- Full auditability and revocation

Policy-Driven Platform Governance: Azure AI Foundry aligns with the Azure Cloud Adoption Framework, enabling:
- Azure Policy enforcement for approved models and regions
- Cost and quota controls
- Integration with Microsoft Purview for compliance tracking

How to Implement Guardrails in Developer Workflows
- Shift-Left Security: Embed guardrails directly into the IDE using GitHub Copilot and Azure AI Content Safety APIs — catch issues early, when they're cheapest to fix.
- Automate Compliance in CI/CD: Integrate automated checks into GitHub Actions to enforce policies at pull-request and build stages.
- Monitor Continuously: Use Azure AI Foundry and governance dashboards to track usage, violations, and policy drift.
- Educate Developers: Conduct readiness sessions and share best practices so developers understand why guardrails exist — not just how they're enforced.

Implementing DLP Policies in Copilot Studio
1. Access the Power Platform Admin Center: Navigate to the Power Platform admin center and ensure you have the Tenant Admin or Environment Admin role.
2. Create a DLP Policy: Go to Data Policies → New Policy. Define data groups: Business (trusted connectors), Non-business, and Blocked (e.g., HTTP, social channels).
3. Configure Enforcement for Copilot Studio: Enable DLP enforcement for copilots using PowerShell:

Set-PowerVirtualAgentsDlpEnforcement `
    -TenantId <tenant-id> `
    -Mode Enabled

Modes: Disabled (default, no enforcement), SoftEnabled (blocks updates), Enabled (full enforcement).
4. Apply the Policy to Environments: Choose the scope: all environments, specific environments, or exclude certain environments. Block channels (e.g., Direct Line, Teams, Omnichannel) and connectors that pose risk.
5. Validate & Monitor: Use Microsoft Purview audit logs for compliance tracking.
Configure user-friendly DLP error messages with admin contact and "Learn More" links for makers.

Implementing ALM Workflows in Copilot Studio
- Environment Strategy: Use Managed Environments for structured development. Separate Dev, Test, and Prod clearly. Assign roles for makers and approvers.
- Application Lifecycle Management (ALM): Configure solution-aware agents for packaging and deployment. Use Power Platform pipelines for automated movement across environments.
- Govern Publishing: Require admin approval before publishing copilots to the organizational catalog. Enforce role-based access and connector governance.
- Integrate Compliance Controls: Apply Microsoft Purview sensitivity labels and enforce retention policies. Monitor telemetry and usage analytics for policy alignment.

Key Takeaways
- Guardrails are essential for safe, compliant AI-assisted development.
- Combine GitHub Copilot productivity with Azure AI Content Safety for robust protection.
- Govern agents and data using Copilot Studio.
- Azure AI Foundry operationalizes Responsible AI across the full GenAIOps lifecycle.
- Responsible AI is not a blocker — it's an enabler of scale, trust, and long-term innovation.

Building Reusable Custom Images for Azure Confidential VMs Using Azure Compute Gallery
Overview
Azure Confidential Virtual Machines (CVMs) provide hardware-enforced protection for sensitive workloads by encrypting data in use with AMD SEV-SNP technology. In enterprise environments, organizations typically need to:
- Create hardened golden images
- Standardize baseline configurations
- Support both Platform Managed Keys (PMK) and Customer Managed Keys (CMK)
- Version and replicate images across regions
This guide walks through the correct and production-supported approach for building reusable custom images for Confidential VMs using:
- PowerShell (Az module)
- Azure Portal
- Disk Encryption Sets (CMK)
- Azure Compute Gallery

Key Design Principles
Before diving into implementation steps, it is important to clarify two architectural truths that become clear during real-world implementations:

✅1️⃣ The Same Image Supports PMK and CMK
The encryption model (PMK vs CMK) is not embedded in the image. Encryption is applied:
- At VM deployment time
- Through disk configuration (default PMK, or a Disk Encryption Set for CMK)
This means you build one golden image and deploy it using PMK or CMK depending on compliance requirements. This simplifies lifecycle management significantly.

✅2️⃣ Confidential VM Image Versions Must Use a Source VHD
When publishing to Azure Compute Gallery, Confidential VMs require a source VHD (a mandatory platform requirement for the Confidential security type). Therefore, the correct workflow is:
1. Deploy a base Confidential VM
2. Harden and configure
3. Generalize
4. Export the OS disk as a VHD
5. Upload to storage
6. Publish to Azure Compute Gallery
7. Deploy using PMK or CMK

Security Stack Breakdown

Protection Area | Technology
Data in Use | AMD SEV-SNP
Boot Integrity | Secure Boot + vTPM
Image Lifecycle | Azure Compute Gallery
Disk Encryption | PMK or CMK
Compliance Control | Disk Encryption Set (CMK)

Implementation Steps

🖥️ Step 1 – Deploy a Base Windows Confidential VM
This VM will serve as the image builder.
Key Requirements
- Gen2 image
- Confidential SKUs (such as the DCasv5 or ECasv5 series)
- SecurityType = ConfidentialVM
- Secure Boot enabled
- vTPM enabled
- Confidential OS encryption enabled

Reference Code Snippets (PowerShell)

$rg = "rg-cvm-gi-pr-sbx-01"
$location = "NorthEurope"
$vmName = "cvmwingiprsbx01"

New-AzResourceGroup -Name $rg -Location $location
$cred = Get-Credential

$vmConfig = New-AzVMConfig `
    -VMName $vmName `
    -VMSize "Standard_DC2as_v5" `
    -SecurityType "ConfidentialVM"

$vmConfig = Set-AzVMOperatingSystem `
    -VM $vmConfig `
    -Windows `
    -ComputerName $vmName `
    -Credential $cred

$vmConfig = Set-AzVMSourceImage `
    -VM $vmConfig `
    -PublisherName "MicrosoftWindowsServer" `
    -Offer "WindowsServer" `
    -Skus "2022-datacenter-azure-edition" `
    -Version "latest"

$vmConfig = Set-AzVMOSDisk `
    -VM $vmConfig `
    -CreateOption FromImage `
    -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithPlatformKey"

New-AzVM -ResourceGroupName $rg -Location $location -VM $vmConfig

📸 Reference Screenshots

🔧 Step 2 – Harden and Customize the OS
This is where you:
- Install monitoring agents
- Install Defender for Endpoint
- Apply a CIS baseline
- Install security agents
- Remove unwanted services
- Install application dependencies
This is your enterprise golden baseline, tailored to your organization's requirements.

🔄 Step 3 – Generalize the Windows Confidential VM (Production-Ready Method)
Confidential VMs often enable BitLocker automatically, and improper Sysprep handling can cause failures. Generalizing a Windows Confidential VM properly is critical to avoid:
- Sysprep failures
- BitLocker conflicts
- Image corruption
- Deployment errors later
Follow these steps carefully inside the VM, and later through Azure PowerShell.

1. Remove the Panther Folder
The Panther folder stores logs from previous Sysprep operations, and leftover logs can make Sysprep fail. The following command safely removes the old Sysprep metadata.
rd /s /q C:\Windows\Panther

✔ This step prevents common "Sysprep was not able to validate your Windows installation" errors.

2. Run Sysprep
Navigate to the Sysprep directory and run the sysprep command:

cd %windir%\system32\sysprep
sysprep.exe /generalize /shutdown

Parameters explained:
- /generalize: Removes machine-specific info (SID, drivers)
- /shutdown: Powers off the VM after completion

⚠️ Handling BitLocker Issues (Common in Confidential VMs): Confidential VMs may automatically enable BitLocker. If Sysprep fails due to encryption, follow the next steps to resolve the issue and then execute Sysprep again.

3. Check BitLocker Status & Turn Off BitLocker

manage-bde -status

If Protection Status is 'Protection On':

manage-bde -off C:

Wait for decryption to complete fully. ⚠️ Do not run Sysprep again until decryption reaches 100%.

4. Reboot and Run Sysprep Again
After decryption completes:
- Reboot the VM
- Open Command Prompt as Administrator
- Navigate to the Sysprep folder and run the sysprep command:

cd %windir%\system32\sysprep
sysprep.exe /generalize /shutdown

✔ The VM will shut down automatically.

5. Mark the VM as Generalized in Azure
Now switch to Azure PowerShell:

Stop-AzVM -Name $vmName -ResourceGroupName $rg -Force
Set-AzVM -Name $vmName -ResourceGroupName $rg -Generalized

✔ This marks the VM as ready for image capture.

🧠 Why These Extra Steps Matter in Confidential VMs
Confidential VMs differ from standard VMs because:
- They use vTPM
- They may auto-enable BitLocker
- They enforce Secure Boot
- They use Gen2 images
Improper handling can cause:
- Sysprep failures
- Image capture errors
- "VM provisioning failed" deployment errors from the image
These cleanup steps dramatically increase the success rate.

💾 Step 4 – Export the OS Disk as a VHD
Gallery image definitions with security type 'TrustedLaunchAndConfidentialVmSupported' require a source VHD, because using a source image VM is not supported. Generate a SAS URL for the OS disk of the virtual machine.
Copy it to a storage account as a .vhd file, then use Get-AzStorageBlobCopyState to validate the copy status and wait for completion.

$vm = Get-AzVM -Name $vmName -ResourceGroupName $rg
$osDiskName = $vm.StorageProfile.OsDisk.Name

$sas = Grant-AzDiskAccess `
    -ResourceGroupName $rg `
    -DiskName $osDiskName `
    -Access Read `
    -DurationInSecond 3600

$storageAccountName = "stcvmgiprsbx01"
$storageContainerName = "images"
$destinationVHDFileName = "cvmwingiprsbx01-OsDisk-VHD.vhd"

$destinationContext = New-AzStorageContext -StorageAccountName $storageAccountName

Start-AzStorageBlobCopy -AbsoluteUri $sas.AccessSAS -DestContainer $storageContainerName -DestContext $destinationContext -DestBlob $destinationVHDFileName

Get-AzStorageBlobCopyState -Blob $destinationVHDFileName -Container $storageContainerName -Context $destinationContext

🏢 Step 5 – Create the Azure Compute Gallery & Image Version
Instead of creating a standalone managed image, we will:
- Create an Azure Compute Gallery
- Create an image definition
- Publish a gallery image version from the generalized Confidential VM
This enables:
- Versioning
- Regional replication
- Staged rollouts
- Enterprise image lifecycle management

1. Create the Azure Compute Gallery

$galleryName = "cvmImageGallery"

New-AzGallery `
    -GalleryName $galleryName `
    -ResourceGroupName $rg `
    -Location $location `
    -Description "Confidential VM Image Gallery"

2.
Create the Image Definition for the Windows Confidential VM
Important settings:
- OS State = Generalized
- OS Type = Windows
- Hyper-V Generation = V2
- Security Type = TrustedLaunchAndConfidentialVmSupported

$imageDefName = "img-win-cvm-gi-pr-sbx-01"
$ConfidentialVMSupported = @{Name='SecurityType';Value='TrustedLaunchAndConfidentialVmSupported'}
$features = @($ConfidentialVMSupported)

New-AzGalleryImageDefinition `
    -GalleryName $galleryName `
    -ResourceGroupName $rg `
    -Location $location `
    -Name $imageDefName `
    -OsState Generalized `
    -OsType Windows `
    -Publisher "prImages" `
    -Offer "WindowsServerCVM" `
    -Sku "2022-dc-azure-edition" `
    -HyperVGeneration V2 `
    -Feature $features

✔ HyperVGeneration must be V2 for Confidential VMs.

📸 Reference Screenshot

3. Create the Gallery Image Version from the Generalized VM
Now publish version 1.0.0 from the generalized VM's OS disk VHD to the image definition:
- There is no Azure PowerShell support for this step, so the Azure portal must be used
- Ensure the right network and RBAC access is in place on the storage account
- Replication to multiple regions can be enabled on the image version for enterprises

✅ Why Azure Compute Gallery is the Right Choice

Feature | Managed Image | Azure Compute Gallery
Versioning | ❌ | ✅
Cross-region replication | ❌ | ✅
Enterprise lifecycle | Limited | Full
Recommended for production | ❌ | ✅

For enterprise confidential workloads, Azure Compute Gallery is strongly recommended.

🚀 Step 6 – Deploy a Confidential VM from the Gallery Image

🔹 Using PMK (Default)
If you do not specify a Disk Encryption Set, Azure uses Platform Managed Keys automatically.
$imageId = (Get-AzGalleryImageVersion `
    -GalleryName $galleryName `
    -GalleryImageDefinitionName $imageDefName `
    -ResourceGroupName $rg `
    -Name "1.0.0").Id

$vmConfig = New-AzVMConfig `
    -VMName "cvmwingiprsbx02" `
    -VMSize "Standard_DC2as_v5" `
    -SecurityType "ConfidentialVM"

$vmConfig = Set-AzVMOSDisk `
    -VM $vmConfig `
    -CreateOption FromImage `
    -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithPlatformKey"

$vmConfig = Set-AzVMSourceImage -VM $vmConfig -Id $imageId
$vmConfig = Set-AzVMOperatingSystem -VM $vmConfig -Windows -ComputerName "cvmwingiprsbx02" -Credential (Get-Credential)

New-AzVM -ResourceGroupName $rg -Location $location -VM $vmConfig

🔹 Using CMK (Same Image!)
If compliance requires CMK:
- Create a Disk Encryption Set
- Associate it with a Key Vault or Managed HSM
- Attach the DES during deployment

$vmConfig = Set-AzVMOSDisk `
    -VM $vmConfig `
    -CreateOption FromImage `
    -SecurityEncryptionType "ConfidentialVM_DiskEncryptedWithCustomerKey" `
    -DiskEncryptionSetId $des.Id

✔ Same image ✔ Different encryption model ✔ Encryption applied at deployment

🔎 Validation
Check confidential security:

Get-AzVM -Name "cvmwingiprsbx02" -ResourceGroupName $rg | Select SecurityProfile

Check disk encryption:

Get-AzDisk -ResourceGroupName $rg

Architectural Summary
- Confidential VM security is independent of the disk encryption model
- The encryption choice is applied at deployment
- One image supports multiple compliance models
- A source VHD is required for Confidential VM gallery publishing
- Azure Compute Gallery enables enterprise lifecycle management

PMK vs CMK Decision Matrix

Scenario | Recommended Model
Standard enterprise workloads | PMK
Financial services / regulated | CMK
BYOK requirement | CMK
Simplicity prioritized | PMK

🏢 Enterprise Recommendations
✔ Always use Azure Compute Gallery
✔ Use semantic versioning (1.0.0, 1.0.1)
✔ Automate using Azure Image Builder
✔ Enforce Confidential VM via Azure Policy
✔ Enable Guest Attestation
✔ Monitor with Defender for Cloud

Final Thoughts
Creating custom images for Azure
Confidential VMs allows organizations to combine the security benefits of Confidential Computing with the operational efficiency of standardized deployments. By baking security baselines, monitoring agents, and required configurations directly into a golden image, every new VM starts from a consistent and trusted foundation. A key advantage of this approach is flexibility. The custom image itself is independent of the disk encryption model, meaning the same image can be deployed using Platform Managed Keys (PMK) for simplicity or Customer Managed Keys (CMK) to meet stricter compliance requirements. This allows platform teams to maintain a single image pipeline while supporting multiple security scenarios. By publishing images through Azure Compute Gallery, organizations can version, replicate, and manage their Confidential VM images more effectively. Combined with proper VM generalization and hardening practices, custom images become a reliable way to ensure secure, consistent, and scalable deployments of Confidential workloads in Azure. As Confidential Computing continues to gain adoption across industries handling sensitive data, investing in a well-designed custom image pipeline will enable organizations to scale securely while maintaining consistency, compliance, and operational efficiency across their cloud environments.

Mastering Azure Queries: Skip Token and Batching for Scale
Let's be honest. As a cloud engineer or DevOps professional managing a large Azure environment, running even a simple resource inventory query can feel like drinking from a firehose. You hit API limits, face slow performance, and struggle to get the complete picture of your estate — all because the data volume is overwhelming. But it doesn't have to be this way! This blog is your practical, hands-on guide to mastering two essential techniques for handling massive data volumes in Azure with PowerShell and Azure Resource Graph (ARG): Skip Token (for full data retrieval) and Batching (for blazing-fast performance).

📋 TABLE OF CONTENTS

🚀 GETTING STARTED
├─ Prerequisites: PowerShell 7+ & Az.ResourceGraph Module
└─ Introduction: Why Standard Queries Fail at Scale

📖 CORE CONCEPTS
├─ 📑 Skip Token: The Data Completeness Tool
│  ├─ What is a Skip Token?
│  ├─ The Bookmark Analogy
│  ├─ PowerShell Implementation
│  └─ 💻 Code Example: Pagination Loop
└─ ⚡ Batching: The Performance Booster
   ├─ What is Batching?
   ├─ Performance Benefits
   ├─ Batching vs. Pagination
   ├─ Parallel Processing in PowerShell
   └─ 💻 Code Example: Concurrent Queries

🔍 DEEP DIVE
├─ Skip Token: Generic vs. Azure-Specific
└─ Azure Resource Graph (ARG) at Scale
   ├─ ARG Overview
   ├─ Why ARG Needs These Techniques
   └─ 💻 Combined Example: Skip Token + Batching

✅ BEST PRACTICES
├─ When to Use Each Technique
└─ Quick Reference Guide

📚 RESOURCES
└─ Official Documentation & References

Prerequisites

Component | Requirement / Details | Command / Reference
PowerShell Version | The batching examples use ForEach-Object -Parallel, which requires PowerShell 7.0 or later. | Check version: $PSVersionTable.PSVersion. Install PowerShell 7+: Install PowerShell on Windows, Linux, and macOS
Azure PowerShell Module | The Az.ResourceGraph module must be installed. |
Install module: Install-Module -Name Az.ResourceGraph -Scope CurrentUser

Introduction: Why Standard Queries Don't Work at Scale
When you query a service designed for big environments, like Azure Resource Graph, you face two limits:
- Result Limits (Pagination): APIs won't send you millions of records at once. They cap the result size (often 1,000 items) and stop.
- Efficiency Limits (Throttling): Sending a huge number of individual requests is slow and can cause the API to temporarily block you (throttling).
Skip Token helps you solve the first limit by making sure you retrieve all results. Batching solves the second by grouping your requests to improve performance.

Understanding Skip Token: The Continuation Pointer

What is a Skip Token?
A Skip Token (or continuation token) is a unique string value returned by an Azure API when a query result exceeds the maximum limit for a single response. Think of the Skip Token as a "bookmark" that tells Azure where your last page ended — so you can pick up exactly where you left off in the next API call. Instead of getting cut off after 1,000 records, the API gives you the first 1,000 results plus the Skip Token. You use this token in the next request to get the next page of data. This process is called pagination.

Skip Token in Practice with PowerShell
To get the complete dataset, you must use a loop that repeatedly calls the API, providing the token each time until the token is no longer returned.

PowerShell Example: Using Skip Token to Loop Pages

# Define the query
$Query = "Resources | project name, type, location"
$PageSize = 1000
$AllResults = @()
$SkipToken = $null # Initialize the token

Write-Host "Starting ARG query..."
do {
    Write-Host "Fetching next page. (Token check: $($SkipToken -ne $null))"

    # 1. Execute the query, using the -SkipToken parameter
    $ResultPage = Search-AzGraph -Query $Query -First $PageSize -SkipToken $SkipToken

    # 2. Add the current page results to the main array
    $AllResults += $ResultPage

    # 3.
Get the token for the next page, if it exists $SkipToken = $ResultPage.SkipToken Write-Host " -> Items in this page: $($ResultPage.Count). Total retrieved: $($AllResults.Count)" } while ($SkipToken -ne $null) # Loop as long as a Skip Token is returned Write-Host "Query finished. Total resources found: $($AllResults.Count)" This do-while loop is the reliable way to ensure you retrieve every item in a large result set. Understanding Batching: Grouping Requests What is Batching? Batching means taking several independent requests and combining them into a single API call. Instead of making N separate network requests for N pieces of data, you make one request containing all N sub-requests. Batching is primarily used for performance. It improves efficiency by: Reducing Overhead: Fewer separate network connections are needed. Lowering Throttling Risk: Fewer overall API calls are made, which helps you stay under rate limits. Feature Batching Pagination (Skip Token) Goal Improve efficiency/speed. Retrieve all data completely. Input Multiple different queries. Single query, continuing from a marker. Result One response with results for all grouped queries. Partial results with a token for the next step. Note: While Azure Resource Graph's REST API supports batch requests, the PowerShell Search-AzGraph cmdlet does not expose a -Batch parameter. Instead, we achieve batching by using PowerShell's ForEach-Object -Parallel (PowerShell 7+) to run multiple queries simultaneously. Batching in Practice with PowerShell Using parallel processing in PowerShell, you can efficiently execute multiple distinct Kusto queries targeting different scopes (like subscriptions) simultaneously. 
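Whether you parallelize cmdlet calls or talk to the service directly, every worker still pages with a skip token under the hood. For the curious, here is a hedged sketch of the raw REST call that Search-AzGraph wraps. The endpoint path, api-version, and `$skipToken`/`$top` property names reflect the public ARG REST documentation as best I recall it, so verify them before relying on this; it assumes Az.Accounts is installed and you are signed in with Connect-AzAccount.

```powershell
# Hypothetical sketch of skip-token paging at the REST layer.
# The api-version shown may differ from the latest available one.
$Uri = "/providers/Microsoft.ResourceGraph/resources?api-version=2022-10-01"
$SkipToken = $null
$All = @()

do {
    $Body = @{
        query   = "Resources | project name, type"
        options = @{ '$top' = 1000 }
    }
    if ($SkipToken) { $Body.options['$skipToken'] = $SkipToken }

    # Invoke-AzRestMethod signs the request with your current Az context
    $Response = Invoke-AzRestMethod -Method POST -Path $Uri -Payload ($Body | ConvertTo-Json -Depth 5)
    $Parsed   = $Response.Content | ConvertFrom-Json

    $All      += $Parsed.data
    $SkipToken = $Parsed.'$skipToken'   # absent on the last page
} while ($SkipToken)
```

Seeing the raw shape makes it clear why the cmdlet loop works the way it does: the token is just a response property you echo back in the next request's options.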
| Method | 5 Subscriptions | 20 Subscriptions |
|---|---|---|
| Sequential | ~50 seconds | ~200 seconds |
| Parallel (ThrottleLimit 5) | ~15 seconds | ~45 seconds |

PowerShell Example: Running Multiple Queries in Parallel

```powershell
# Define multiple queries to run together
$BatchQueries = @(
    @{
        Query         = "Resources | where type =~ 'Microsoft.Compute/virtualMachines'"
        Subscriptions = @("SUB_A")          # Query 1 scope
    },
    @{
        Query         = "Resources | where type =~ 'Microsoft.Network/publicIPAddresses'"
        Subscriptions = @("SUB_B", "SUB_C") # Query 2 scope
    }
)

Write-Host "Executing batch of $($BatchQueries.Count) queries in parallel..."

# Execute queries in parallel (true batching)
$BatchResults = $BatchQueries | ForEach-Object -Parallel {
    $QueryConfig = $_
    $Query = $QueryConfig.Query
    $Subs  = $QueryConfig.Subscriptions

    Write-Host "[Batch Worker] Starting query: $($Query.Substring(0, [Math]::Min(50, $Query.Length)))..." -ForegroundColor Cyan

    $QueryResults = @()

    # Process each subscription in this query's scope
    foreach ($SubId in $Subs) {
        $SkipToken = $null
        do {
            $Params = @{
                Query        = $Query
                Subscription = $SubId
                First        = 1000
            }
            if ($SkipToken) { $Params['SkipToken'] = $SkipToken }

            $Result = Search-AzGraph @Params
            if ($Result) { $QueryResults += $Result }
            $SkipToken = $Result.SkipToken
        } while ($SkipToken)
    }

    Write-Host " [Batch Worker] ✅ Query completed - Retrieved $($QueryResults.Count) resources" -ForegroundColor Green

    # Return results with metadata
    [PSCustomObject]@{
        Query         = $Query
        Subscriptions = $Subs
        Data          = $QueryResults
        Count         = $QueryResults.Count
    }
} -ThrottleLimit 5

Write-Host "`nBatch complete. Reviewing results..."

# Note: ForEach-Object -Parallel emits results in completion order, not input
# order, so match each result by its metadata rather than by array index
$VMResult = $BatchResults | Where-Object { $_.Query -like "*virtualMachines*" }
$IPResult = $BatchResults | Where-Object { $_.Query -like "*publicIPAddresses*" }
Write-Host "Query 1 (VMs) returned: $($VMResult.Data.Count) results."
Write-Host "Query 2 (IPs) returned: $($IPResult.Data.Count) results."

# Optional: Display detailed results
Write-Host "`n--- Detailed Results ---"
for ($i = 0; $i -lt $BatchResults.Count; $i++) {
    $Result = $BatchResults[$i]
    Write-Host "`nQuery $($i + 1):"
    Write-Host "  Query: $($Result.Query)"
    Write-Host "  Subscriptions: $($Result.Subscriptions -join ', ')"
    Write-Host "  Total Resources: $($Result.Count)"
    if ($Result.Data.Count -gt 0) {
        Write-Host "  Sample (first 3):"
        $Result.Data | Select-Object -First 3 | Format-Table -AutoSize
    }
}
```

Azure Resource Graph (ARG) and Scale

Azure Resource Graph (ARG) is a service built for querying resource properties quickly across a large number of Azure subscriptions using the Kusto Query Language (KQL). Because ARG is designed for large scale, it fully supports both techniques:

- Skip Token: ARG automatically generates and returns the token when a query exceeds its result limit (e.g., 1,000 records).
- Batching: ARG's REST API provides a batch endpoint for sending up to ten queries in a single request. In PowerShell, we achieve similar performance benefits using ForEach-Object -Parallel to process multiple queries concurrently.

Combined Example: Batching and Skip Token Together

This script shows how to use Batching to start a query across multiple subscriptions and then use Skip Token within the loop to ensure every subscription's data is fully retrieved.

```powershell
$SubscriptionIDs = @("SUB_A")
$KQLQuery = "Resources | project id, name, type, subscriptionId"

Write-Host "Starting BATCHED query across $($SubscriptionIDs.Count) subscription(s)..."
Write-Host "Using parallel processing for true batching...`n"

# Process subscriptions in parallel (batching)
$AllResults = $SubscriptionIDs | ForEach-Object -Parallel {
    $SubId = $_
    $Query = $using:KQLQuery
    $SubResults = @()

    Write-Host "[Batch Worker] Processing Subscription: $SubId" -ForegroundColor Cyan

    $SkipToken = $null
    $PageCount = 0
    do {
        $PageCount++

        # Build parameters
        $Params = @{
            Query        = $Query
            Subscription = $SubId
            First        = 1000
        }
        if ($SkipToken) { $Params['SkipToken'] = $SkipToken }

        # Execute query
        $Result = Search-AzGraph @Params
        if ($Result) {
            $SubResults += $Result
            Write-Host " [Batch Worker] Sub: $SubId - Page $PageCount - Retrieved $($Result.Count) resources" -ForegroundColor Yellow
        }
        $SkipToken = $Result.SkipToken
    } while ($SkipToken)

    Write-Host " [Batch Worker] ✅ Completed $SubId - Total: $($SubResults.Count) resources" -ForegroundColor Green

    # Return results from this subscription
    $SubResults
} -ThrottleLimit 5  # Process up to 5 subscriptions simultaneously

Write-Host "`n--- Batch Processing Finished ---"
Write-Host "Final total resource count: $($AllResults.Count)"

# Optional: Display sample results
if ($AllResults.Count -gt 0) {
    Write-Host "`nFirst 5 resources:"
    $AllResults | Select-Object -First 5 | Format-Table -AutoSize
}
```

| Technique | Use When... | Common Mistake | Actionable Advice |
|---|---|---|---|
| Skip Token | You must retrieve all data items, expecting more than 1,000 results. | Forgetting to check for the token; you only get partial data. | Always use a do-while loop to guarantee you get the complete set. |
| Batching | You need to run several separate queries (max 10 in ARG) efficiently. | Putting too many queries in the batch, causing the request to fail. | Group up to 10 logical queries or subscriptions into one fast request. |

By combining Skip Token for data completeness and Batching for efficiency, you can confidently query massive Azure estates without hitting limits or missing data.
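The parallel examples require PowerShell 7 or later. On Windows PowerShell 5.1, where ForEach-Object -Parallel is unavailable, a rough equivalent is to fan the per-subscription work out to background jobs. The sketch below is a hypothetical starting point, not a drop-in replacement: background jobs do not inherit your session's modules, and depending on your Az context autosave settings each job may also need its own authentication (for example, a Connect-AzAccount inside the script block).

```powershell
# Hypothetical PowerShell 5.1 fallback: fan out with Start-Job instead of
# ForEach-Object -Parallel. Subscription IDs are placeholder values.
$SubscriptionIDs = @("SUB_A", "SUB_B")
$KQLQuery = "Resources | project id, name, type"

$Jobs = foreach ($SubId in $SubscriptionIDs) {
    Start-Job -ArgumentList $SubId, $KQLQuery -ScriptBlock {
        param($SubId, $Query)
        Import-Module Az.ResourceGraph   # jobs do not inherit loaded modules

        $Results = @()
        $SkipToken = $null
        do {
            $Params = @{ Query = $Query; Subscription = $SubId; First = 1000 }
            if ($SkipToken) { $Params['SkipToken'] = $SkipToken }
            $Page = Search-AzGraph @Params
            if ($Page) { $Results += $Page }
            $SkipToken = $Page.SkipToken
        } while ($SkipToken)
        $Results
    }
}

# Wait for every job, collect the output, then clean up
$AllResults = $Jobs | Wait-Job | Receive-Job
$Jobs | Remove-Job
```

Jobs carry more startup overhead than runspace-based parallelism, so expect smaller gains than the ForEach-Object -Parallel numbers above.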
These two techniques, when used together, turn Azure Resource Graph from a "good tool" into a scalable discovery engine for your entire cloud footprint.

Summary: Skip Token and Batching in Azure Resource Graph

Goal: Efficiently query massive Azure environments using PowerShell and Azure Resource Graph (ARG).

1. Skip Token (The Data Completeness Tool)

| Concept | What it Does | Why it Matters | PowerShell Use |
|---|---|---|---|
| Skip Token | A marker returned by Azure APIs when results hit the 1,000-item limit. It points to the next page of data. | Ensures you retrieve all records, avoiding incomplete data (pagination). | Use a do-while loop with the -SkipToken parameter in Search-AzGraph until the token is no longer returned. |

2. Batching (The Performance Booster)

| Concept | What it Does | Why it Matters | PowerShell Use |
|---|---|---|---|
| Batching | Processes multiple independent queries simultaneously using parallel execution. | Drastically improves query speed by reducing overall execution time and helps avoid API throttling. | Use ForEach-Object -Parallel (PowerShell 7+) with -ThrottleLimit to control concurrent queries. For PowerShell 5.1, use Start-Job with background jobs. |

3. Best Practice: Combine Them

For maximum efficiency, combine Batching and Skip Token. Use batching to run queries across multiple subscriptions simultaneously, and use the Skip Token logic within each loop to ensure every subscription's data is fully paginated and retrieved. Result: fast, complete, and reliable data collection across your large Azure estate.

AI/ML Use Cases Powered by ARG Optimizations

- Machine Learning Pipelines: Cost forecasting and capacity planning models require historical resource data. SkipToken pagination enables efficient extraction of millions of records for training without hitting API limits. Scenario: Cost forecasting models predict next month's Azure spend by analyzing historical resource usage.
For example, training a model to forecast VM costs requires pulling 12 months of resource data (CPU usage, memory, SKU changes) across all subscriptions. SkipToken pagination enables extracting millions of these records efficiently without hitting API limits; a single query might return 500K+ VM records that need to be fed into Azure ML for training.

- Anomaly Detection: Security teams use Azure ML to monitor suspicious configurations. Parallel queries across subscriptions enable real-time threat detection with low latency. Scenario: Security teams deploy models to catch unusual activity, like someone suddenly creating 50 VMs in an unusual region or changing network security rules at 3 AM. Parallel queries across all subscriptions let these detection systems scan your entire Azure environment in seconds rather than minutes, enabling real-time alerts when threats emerge.

- RAG Systems: Converting resource metadata to embeddings for semantic search requires processing entire inventories, so pagination and batching become mandatory for large-scale embedding generation. Scenario: Imagine asking "Which SQL databases are using premium storage but have low IOPS?" A RAG system converts all your Azure resources into searchable embeddings (vector representations), then uses AI to understand your question and find relevant resources semantically. Building this requires processing your entire Azure inventory, potentially a very large number of resources, where pagination and batching become mandatory to generate embeddings without overwhelming the OpenAI API.

References:

- Azure Resource Graph documentation
- Search-AzGraph PowerShell reference

Defending the cloud: Azure neutralized a record-breaking 15 Tbps DDoS attack
On October 24, 2025, Azure DDoS Protection automatically detected and mitigated a multi-vector DDoS attack measuring 15.72 Tbps and nearly 3.64 billion packets per second (pps). This was the largest DDoS attack ever observed in the cloud, and it targeted a single endpoint in Australia. Azure's globally distributed DDoS Protection infrastructure and continuous detection capabilities initiated mitigation immediately: malicious traffic was effectively filtered and redirected, maintaining uninterrupted service availability for customer workloads.

The attack originated from the Aisuru botnet. Aisuru is a Turbo Mirai-class IoT botnet that frequently causes record-breaking DDoS attacks by exploiting compromised home routers and cameras, mainly in residential ISPs in the United States and other countries. The attack involved extremely high-rate UDP floods targeting a specific public IP address, launched from over 500,000 source IPs across various regions. These sudden UDP bursts had minimal source spoofing and used random source ports, which helped simplify traceback and facilitated provider enforcement.

Attackers are scaling with the internet itself. As fiber-to-the-home speeds rise and IoT devices get more powerful, the baseline for attack size keeps climbing. As we approach the upcoming holiday season, it is essential to confirm that all internet-facing applications and workloads are adequately protected against DDoS attacks. Additionally, do not wait for an actual attack to assess your defensive capabilities or operational readiness; conduct regular simulations to identify and address potential issues proactively.

Learn more about Azure DDoS Protection at Azure DDoS Protection Overview | Microsoft Learn.

Announcing Kubernetes Center (Preview) On Azure Portal
Today, we're excited to introduce the Kubernetes Center in the Azure portal, a new experience that simplifies how customers manage, monitor, and optimize Azure Kubernetes Service environments at scale. The Kubernetes Center provides a unified view across all clusters, intelligent insights, and streamlined workflows that help platform teams stay in control while enabling developers to move fast.

As Kubernetes adoption accelerates, many teams face growing challenges in managing clusters and workloads at scale. Getting a quick snapshot of what needs attention across clusters and workloads can quickly become overwhelming. Kubernetes Center is designed to change that: it offers a streamlined, intuitive experience that brings the most critical Kubernetes capabilities together into a single pane of glass for unified visibility and control.

What is Kubernetes Center?

- Actionable insights from the start: Kubernetes Center surfaces key issues like security vulnerabilities, cluster alerts, compliance gaps, and upgrade recommendations in a single, unified view. This helps teams focus immediately on what matters most, leading to faster resolution times, improved security posture, and greater operational clarity.
- Streamlined management experience: By bringing together AKS, AKS Automatic, Fleet Manager, and Managed Namespaces into a single experience, we've reduced the need to jump between services. Everything you need to manage Kubernetes on Azure is now organized in one consistent interface.
- Centralized Quickstarts: Whether you're getting started or troubleshooting advanced scenarios, Kubernetes Center brings relevant documentation, learning resources, and in-context help into one place so you can spend less time searching and more time building.
In the Azure portal, the previously distinct landing experiences for AKS, Fleet Manager, and Managed Kubernetes Namespaces now come together in one streamlined management experience: get the big picture at a glance, then dive deeper with individual pages designed for effortless discovery, with centralized Quickstarts close at hand.

Next Steps

Build on your momentum by exploring Kubernetes Center. Create your first AKS cluster, or deploy your first application using the Deploy Your Application flow and track your progress in real time. Or simply check out the new experience and instantly see your existing clusters in a streamlined management view. Your feedback will help shape what comes next. Start building today with Kubernetes Center on the Azure portal!

Learn more: Create and Manage Kubernetes resources in the Azure portal with Kubernetes Center (preview) - Azure Kubernetes Service | Microsoft Learn

FAQ

What products from Azure are included in Kubernetes Center?
A. Kubernetes Center brings together all your Azure Kubernetes resources, such as AKS, AKS Automatic, Fleet Manager, and Managed Namespaces, into a single interface for simplified operations. Create new resources or view your existing resources in Kubernetes Center.

Does Kubernetes Center handle multi-cluster management?
A. Kubernetes Center provides a unified interface, a single pane of glass, to view and monitor all your Kubernetes resources in one place. For multi-cluster operations like upgrading the Kubernetes version, placing cluster resources on N clusters, policy management, and coordination across environments, Azure Kubernetes Fleet Manager is the solution designed to handle that complexity at scale. It enables teams to manage clusters at scale with automation, consistency, and operational control.

Does Kubernetes Center provide security and compliance insights?
A. Absolutely. When Microsoft Defender for Containers is enabled, Kubernetes Center surfaces critical security vulnerabilities and compliance gaps across your clusters.
Where can I find help and documentation?
A. All relevant documentation, Quickstarts, and learning resources are available directly within Kubernetes Center, making it easier to get support without leaving the platform. For more information: Create and Manage Kubernetes resources in the Azure portal with Kubernetes Center (preview) - Azure Kubernetes Service | Microsoft Learn

What is the status of this launch?
A. Kubernetes Center is currently in preview, offering core capabilities with more features planned for the general availability release.

What is the roadmap for GA?
A. Our roadmap includes adding new features and introducing tailored views designed for admins and developers. We also plan to enhance support for multi-cluster capabilities in Azure Fleet Manager, enabling smoother and more efficient operations within the Kubernetes Center.

Resiliency Best Practices You Need For your Blob Storage Data
Maintaining Resiliency in Azure Blob Storage: A Guide to Best Practices

Azure Blob Storage is a cornerstone of modern cloud storage, offering scalable and secure solutions for unstructured data. However, maintaining resiliency in Blob Storage requires careful planning and adherence to best practices. In this blog, I'll share practical strategies to ensure your data remains available, secure, and recoverable under all circumstances.

1. Enable Soft Delete for Accidental Recovery (Most Important)

Mistakes happen, but soft delete can be your safety net. It allows you to recover deleted blobs within a specified retention period:

- Configure a soft delete retention period in Azure Storage.
- Regularly monitor your blob storage to ensure that critical data is not permanently removed by mistake.

Enabling soft delete in Azure Blob Storage does not come with any additional cost for the feature itself. However, it can impact your storage costs, because deleted data is retained for the configured retention period. That means:

- The retained data contributes to the total storage consumption during the retention period.
- You will be charged according to the pricing tier of the data (Hot, Cool, or Archive) for the duration of retention.

2. Utilize Geo-Redundant Storage (GRS)

Geo-redundancy ensures your data is replicated across regions to protect against regional failures:

- Choose RA-GRS (Read-Access Geo-Redundant Storage) for read access to secondary replicas in the event of a primary region outage.
- Assess your workload's RPO (Recovery Point Objective) and RTO (Recovery Time Objective) needs to select the appropriate redundancy.

3. Implement Lifecycle Management Policies

Efficient storage management reduces costs and ensures long-term data availability:

- Set up lifecycle policies to transition data between hot, cool, and archive tiers based on usage.
- Automatically delete expired blobs to save on costs while keeping your storage organized.

4. Secure Your Data with Encryption and Access Controls

Resiliency is incomplete without robust security. Protect your blobs using:

- Encryption at Rest: Azure automatically encrypts data using server-side encryption (SSE). Consider enabling customer-managed keys for additional control.
- Access Policies: Implement Shared Access Signatures (SAS) and Stored Access Policies to restrict access and enforce expiration dates.

5. Monitor and Alert for Anomalies

Stay proactive by leveraging Azure's monitoring capabilities:

- Use Azure Monitor and Log Analytics to track storage performance and usage patterns.
- Set up alerts for unusual activities, such as sudden spikes in access or deletions, to detect potential issues early.

6. Plan for Disaster Recovery

Ensure your data remains accessible even during critical failures:

- Create snapshots of critical blobs for point-in-time recovery.
- Enable backup for blobs and turn on the immutability feature.
- Test your recovery process regularly to ensure it meets your operational requirements.

7. Apply Resource Locks

Adding Azure locks to your Blob Storage account provides an additional layer of protection by preventing accidental deletion or modification of critical resources.

8. Educate and Train Your Team

Operational resilience often hinges on user awareness:

- Conduct regular training sessions on Blob Storage best practices.
- Document and share a clear data recovery and management protocol with all stakeholders.

9. Critical Tip: Do Not Create New Containers with Deleted Names During Recovery

If a container or blob is deleted for any reason and recovery is being attempted, it's crucial not to create a new container with the same name immediately. Doing so can significantly hinder the recovery process by overwriting backend pointers, which are essential for restoring the deleted data. Always ensure that no new containers are created using the same name during the recovery attempt to maximize the chances of successful restoration.
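To make the soft delete and resource lock recommendations above concrete, here is a minimal, hedged PowerShell sketch. The resource group, account name, retention period, and lock name are placeholder values; it assumes the Az.Storage and Az.Resources modules are installed and you are signed in with Connect-AzAccount.

```powershell
# Hypothetical example values - replace with your own
$Rg      = "my-resource-group"
$Account = "mystorageaccount"

# Enable blob soft delete with a 14-day retention window
Enable-AzStorageBlobDeleteRetentionPolicy -ResourceGroupName $Rg `
    -StorageAccountName $Account -RetentionDays 14

# Add a CanNotDelete lock so the account cannot be removed accidentally
New-AzResourceLock -LockName "protect-blob-account" `
    -LockLevel CanNotDelete `
    -ResourceGroupName $Rg `
    -ResourceName $Account `
    -ResourceType "Microsoft.Storage/storageAccounts" `
    -Force
```

Note that a CanNotDelete lock blocks deletion but still allows configuration changes; use ReadOnly if you also want to prevent modification.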
Wrapping It Up

Azure Blob Storage offers an exceptional platform for scalable and secure storage, but its resiliency depends on following best practices. By enabling features like soft delete, implementing redundancy, securing data, and proactively monitoring your storage environment, you can ensure that your data is resilient to failures and recoverable in any scenario.

- Protect your Azure resources with a lock - Azure Resource Manager | Microsoft Learn
- Data redundancy - Azure Storage | Microsoft Learn
- Overview of Azure Blobs backup - Azure Backup | Microsoft Learn

Operational Excellence In AI Infrastructure Fleets: Standardized Node Lifecycle Management
Co-authors: Choudary Maddukuri and Bhushan Mehendale

AI infrastructure is scaling at an unprecedented pace, and the complexity of managing it is growing just as quickly. Onboarding new hardware into hyperscale fleets can take months, slowed by fragmented tools, vendor-specific firmware, and inconsistent diagnostics. As hyperscalers expand with diverse accelerators and CPU architectures, operational friction has become a critical bottleneck.

Microsoft, in collaboration with the Open Compute Project (OCP) and leading silicon partners, is addressing this challenge. By standardizing lifecycle management across heterogeneous fleets, we've dramatically reduced onboarding effort, improved reliability, and achieved more than 95% Nodes-in-Service across very large fleets. This blog explores how we are contributing to and leveraging open standards to transform fragmented infrastructure into scalable, vendor-neutral AI platforms.

Industry Context & Problem

The rapid growth of generative AI has accelerated the adoption of GPUs and accelerators from multiple vendors, alongside diverse CPU architectures such as Arm and x86. Each new hardware SKU introduces its own ecosystem of proprietary tools, firmware update processes, management interfaces, reliability mechanisms, and diagnostic workflows. This hardware diversity leads to engineering toil, delayed deployments, and inconsistent customer experiences. Without a unified approach to lifecycle management, hyperscalers face escalating operational costs, slower innovation, and reduced efficiency.

Node Lifecycle Standardization: Enabling Scalable, Reliable AI Infrastructure

Microsoft, through the Open Compute Project (OCP) in collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, is leading an industry-wide initiative to standardize AI infrastructure lifecycle management across GPU and CPU hardware management workstreams.
Historically, onboarding each new SKU was a highly resource-intensive effort due to custom implementations and vendor-specific behaviors that required extensive Azure integration. This slowed scalability, increased engineering overhead, and limited innovation. With standardized node lifecycle processes and compliance tooling, hyperscalers can now onboard new SKUs much faster, achieving over a 70% reduction in effort while enhancing overall fleet operational excellence. These efforts also enable silicon vendors to ensure interoperability across multiple cloud providers.

Figure: How standardization benefits both hyperscalers and suppliers.

Key Benefits and Capabilities

- Firmware Updates: Firmware update mechanisms aligned with DMTF standards minimize downtime and streamline fleet-wide secure deployments.
- Unified Manageability Interfaces: Standardized Redfish APIs and PLDM protocols create a consistent framework for out-of-band management, reducing integration overhead and ensuring predictable behavior across hardware vendors.
- RAS (Reliability, Availability and Serviceability) Features: Standardization enforces minimum RAS requirements across all IP blocks, including CPER (Common Platform Error Record) based error logging, crash dumps, and error recovery flows to enhance system uptime.
- Debug & Diagnostics: Unified APIs and standardized crash and debug dump formats reduce issue resolution time from months to days. Streamlined diagnostic workflows enable precise FRU isolation and clear service actions.
- Compliance Tooling: Tool contributions such as CTAM (Compliance Tool for Accelerator Manageability) and CPACT (Cloud Processor Accessibility Compliance Tool) automate compliance and acceptance testing, ensuring suppliers meet hyperscaler requirements for seamless onboarding.
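To give a flavor of what vendor-neutral out-of-band management looks like in practice, the sketch below reads node health from a BMC over the standard Redfish API. The `/redfish/v1/Systems` path and the `Status.Health` property come from the DMTF Redfish specification; the BMC address and credentials are placeholders, and a production fleet would use dedicated tooling rather than an ad-hoc script like this (it also requires PowerShell 7+ for -SkipCertificateCheck).

```powershell
# Hypothetical sketch: polling node health via the DMTF Redfish API.
# $Bmc and the credential are placeholders for a real BMC endpoint.
$Bmc  = "https://bmc.example.internal"
$Cred = Get-Credential   # BMC account

# Enumerate the systems exposed by this BMC
$Systems = Invoke-RestMethod -Uri "$Bmc/redfish/v1/Systems" `
    -Credential $Cred -SkipCertificateCheck

foreach ($Member in $Systems.Members) {
    # Each member is a link of the form /redfish/v1/Systems/<id>
    $System = Invoke-RestMethod -Uri "$Bmc$($Member.'@odata.id')" `
        -Credential $Cred -SkipCertificateCheck

    # Status.Health is a standard Redfish property: OK, Warning, or Critical
    "{0}: Power={1}, Health={2}" -f $System.Id, $System.PowerState, $System.Status.Health
}
```

Because these paths and properties are defined by the Redfish standard rather than any one vendor, the same script shape works against compliant BMCs from different suppliers, which is precisely the interoperability the specifications above aim for.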
Technical Specifications & Contributions

Through deep collaboration within the Open Compute Project (OCP) community, Microsoft and its partners have published multiple specifications that streamline SKU development, validation, and fleet operations.

Summary of Key Contributions

| Specification | Focus Area | Benefit |
|---|---|---|
| GPU Firmware Update Requirements | Firmware Updates | Enables consistent firmware update processes across vendors |
| GPU Management Interfaces | Manageability | Standardizes telemetry and control via Redfish/PLDM |
| GPU RAS Requirements | Reliability and Availability | Reduces AI job interruptions caused by hardware errors |
| CPU Debug and RAS Requirements | Debug and Diagnostics | Achieves >95% node serviceability through unified diagnostics and debug |
| CPU Impactless Update Requirements | Impactless Updates | Enables impactless firmware updates that address security and quality issues without workload interruptions |
| Compliance Tools | Validation | Automates specification compliance testing for faster hardware onboarding |

Embracing Open Standards: A Collaborative Shift in AI Infrastructure Management

This standardized approach to lifecycle management represents a foundational shift in how AI infrastructure is maintained. By embracing open standards and collaborative innovation, the industry can scale AI deployments faster, with greater reliability and lower operational cost. Microsoft's leadership within the OCP community, and its deep partnerships with other hyperscalers and silicon vendors, are paving the way for scalable, interoperable, and vendor-neutral AI infrastructure across the global cloud ecosystem.

To learn more about Microsoft's datacenter innovations, check out the virtual datacenter tour at datacenters.microsoft.com.