azure virtual machines
13 Topics
- Operational Excellence In AI Infrastructure Fleets: Standardized Node Lifecycle Management
Co-authors: Choudary Maddukuri and Bhushan Mehendale

AI infrastructure is scaling at an unprecedented pace, and the complexity of managing it is growing just as quickly. Onboarding new hardware into hyperscale fleets can take months, slowed by fragmented tools, vendor-specific firmware, and inconsistent diagnostics. As hyperscalers expand with diverse accelerators and CPU architectures, operational friction has become a critical bottleneck.

Microsoft, in collaboration with the Open Compute Project (OCP) and leading silicon partners, is addressing this challenge. By standardizing lifecycle management across heterogeneous fleets, we’ve dramatically reduced onboarding effort, improved reliability, and achieved >95% Nodes-in-Service on incredibly large fleet sizes. This blog explores how we are contributing to and leveraging open standards to transform fragmented infrastructure into scalable, vendor-neutral AI platforms.

Industry Context & Problem
The rapid growth of generative AI has accelerated the adoption of GPUs and accelerators from multiple vendors, alongside diverse CPU architectures such as Arm and x86. Each new hardware SKU introduces its own ecosystem of proprietary tools, firmware update processes, management interfaces, reliability mechanisms, and diagnostic workflows. This hardware diversity leads to engineering toil, delayed deployments, and inconsistent customer experiences. Without a unified approach to lifecycle management, hyperscalers face escalating operational costs, slower innovation, and reduced efficiency.

Node Lifecycle Standardization: Enabling Scalable, Reliable AI Infrastructure
Microsoft, through the Open Compute Project (OCP) in collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, is leading an industry-wide initiative to standardize AI infrastructure lifecycle management across GPU and CPU hardware management workstreams.
Historically, onboarding each new SKU was a highly resource-intensive effort due to custom implementations and vendor-specific behaviors that required extensive Azure integration. This slowed scalability, increased engineering overhead, and limited innovation. With standardized node lifecycle processes and compliance tooling, hyperscalers can now onboard new SKUs much faster, achieving over 70% reduction in effort while enhancing overall fleet operational excellence. These efforts also enable silicon vendors to ensure interoperability across multiple cloud providers.

Figure: How Standardization benefits both Hyperscalers & Suppliers.

Key Benefits and Capabilities
  - Firmware Updates: Firmware update mechanisms aligned with DMTF standards minimize downtime and streamline secure fleet-wide deployments.
  - Unified Manageability Interfaces: Standardized Redfish APIs and PLDM protocols create a consistent framework for out-of-band management, reducing integration overhead and ensuring predictable behavior across hardware vendors.
  - RAS (Reliability, Availability and Serviceability) Features: Standardization enforces minimum RAS requirements across all IP blocks, including CPER (Common Platform Error Record) based error logging, crash dumps, and error recovery flows to enhance system uptime.
  - Debug & Diagnostics: Unified APIs and standardized crash and debug dump formats reduce issue resolution time from months to days. Streamlined diagnostic workflows enable precise FRU isolation and clear service actions.
  - Compliance Tooling: Tool contributions such as CTAM (Compliance Tool for Accelerator Manageability) and CPACT (Cloud Processor Accessibility Compliance Tool) automate compliance and acceptance testing, ensuring suppliers meet hyperscaler requirements for seamless onboarding.
Technical Specifications & Contributions
Through deep collaboration within the Open Compute Project (OCP) community, Microsoft and its partners have published multiple specifications that streamline SKU development, validation, and fleet operations.

Summary of Key Contributions

Specification | Focus Area | Benefit
GPU Firmware Update requirements | Firmware Updates | Enables consistent firmware update processes across vendors
GPU Management Interfaces | Manageability | Standardizes telemetry and control via Redfish/PLDM
GPU RAS Requirements | Reliability and Availability | Reduces AI job interruptions caused by hardware errors
CPU Debug and RAS requirements | Debug and Diagnostics | Achieves >95% node serviceability through unified diagnostics and debug
CPU Impactless Updates requirements | Impactless Updates | Enables impactless firmware updates to address security and quality issues without workload interruptions
Compliance Tools | Validation | Automates specification compliance testing for faster hardware onboarding

Embracing Open Standards: A Collaborative Shift in AI Infrastructure Management
This standardized approach to lifecycle management represents a foundational shift in how AI infrastructure is maintained. By embracing open standards and collaborative innovation, the industry can scale AI deployments faster, with greater reliability and lower operational cost. Microsoft’s leadership within the OCP community, together with its deep partnerships with other hyperscalers and silicon vendors, is paving the way for scalable, interoperable, and vendor-neutral AI infrastructure across the global cloud ecosystem.

To learn more about Microsoft’s datacenter innovations, check out the virtual datacenter tour at datacenters.microsoft.com.
- Deploying a GitLab Runner on Azure: A Step-by-Step Guide
This guide walks you through the entire process, from VM setup to running your first successful job.

Step 1: Create an Azure VM
Log in to the Azure Portal.
Create a new VM with the following settings:
  - Image: Ubuntu 20.04 LTS (recommended)
  - Authentication: SSH Public Key (generate a .pem file for secure access)
Once created, note the public IP address.

Connect to the VM
From your terminal:
    ssh -i "/path/to/your/key.pem" admin_name@<YOUR_VM_PUBLIC_IP>
Note: Replace the key path and admin_name with the path to your .pem file and the admin username you chose during VM deployment.

Step 2: Install Docker on the Azure VM
Run the following commands to install Docker:
    sudo apt update && sudo apt upgrade -y
    sudo apt install -y docker.io
    sudo systemctl start docker
    sudo systemctl enable docker   # Enable Docker to start automatically on boot
    sudo usermod -aG docker $USER
Test Docker with:
    docker run hello-world
A success message should appear. If you see permission denied, run:
    newgrp docker
Note: Log out and log back in (or restart the VM) for group changes to apply.

Step 3: Install GitLab Runner
Download the GitLab Runner binary, assign execution permissions, then install and start the runner as a service:
    # Step 1: download the binary
    sudo curl -L --output /usr/local/bin/gitlab-runner \
      https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64
    # Step 2: make it executable
    sudo chmod +x /usr/local/bin/gitlab-runner
    # Step 3: install and start the service
    sudo gitlab-runner install --user=azureuser
    sudo gitlab-runner start
    sudo systemctl enable gitlab-runner   # Enable GitLab Runner to start automatically on boot
Note: Replace azureuser with the admin username you chose in Step 1 if it differs.

Step 4: Register the GitLab Runner
Navigate to the Runners section of your GitLab project to generate a registration token (GitLab -> Settings -> CI/CD -> Runners -> New Project Runner).
On your Azure VM, run:
    sudo gitlab-runner register \
      --url https://gitlab.com/ \
      --registration-token <YOUR_TOKEN> \
      --executor docker \
      --docker-image ubuntu:22.04 \
      --description "Azure VM Runner" \
      --tag-list "gitlab-runner-vm" \
      --non-interactive
Note: Replace the registration token, description, and tag list as required. Docker image names must be lowercase (ubuntu:22.04, not Ubuntu:22.04). Newer GitLab versions issue a runner authentication token (prefixed glrt-) from this page, which is passed with --token instead; in that flow the description and tags are set in the GitLab UI, so check the GitLab Runner documentation for your version.
After registration, restart the runner:
    sudo gitlab-runner restart
Verify the runner’s status with:
    sudo gitlab-runner list
Your runner should appear in the list. If the runner does not appear, repeat Step 4 and check each option.

Step 5: Add Runner Tags to Your Pipeline
In .gitlab-ci.yml:
    default:
      tags:
        - gitlab-runner-vm

Step 6: Verify Pipeline Execution
Create a simple job to test the runner:
    test-runner:
      tags:
        - gitlab-runner-vm
      script:
        - echo "Runner is working!"

Troubleshooting Common Issues

Permission Denied (Docker Error)
Error: docker: permission denied while trying to connect to the Docker daemon socket
Solution: Run newgrp docker. If unresolved, restart Docker:
    sudo systemctl restart docker

No Active Runners Online
Error: This job is stuck because there are no active runners online.
Solution: Check runner status:
    sudo gitlab-runner status
If inactive, restart the runner:
    sudo gitlab-runner restart
Ensure the runner tag in your pipeline matches the one you provided when creating the runner for the project.

Final Tips
Always restart the runner after making configuration changes:
    sudo gitlab-runner restart
Remember to periodically check the runner’s status and update its configuration as needed to keep it running smoothly. Happy coding and enjoy the enhanced capabilities of your new GitLab Runner setup!
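
The tag-matching advice in the troubleshooting section can be illustrated with a small check. This is a minimal sketch, not GitLab's own scheduling code: it assumes the documented rule that a tagged job is only picked up by a runner carrying every tag the job lists, and that untagged jobs only run on runners configured to accept them. The function name and the run_untagged parameter are hypothetical names chosen for this example.

```python
def runner_can_pick_job(runner_tags, job_tags, run_untagged=False):
    """Return True if a runner with runner_tags is eligible for a job with job_tags.

    Hypothetical helper for illustration; it models the matching rule only.
    """
    runner = set(runner_tags)
    job = set(job_tags)
    if not job:
        # Untagged jobs only go to runners that allow untagged jobs.
        return run_untagged
    # A tagged job needs a runner that carries every one of its tags.
    return job.issubset(runner)


if __name__ == "__main__":
    vm_runner = ["gitlab-runner-vm"]  # tag used in the guide's register command
    print(runner_can_pick_job(vm_runner, ["gitlab-runner-vm"]))
    print(runner_can_pick_job(vm_runner, ["gitlab-runner-vm", "gpu"]))
```

If a job in .gitlab-ci.yml lists a tag the runner does not have, the second case applies and the job stays pending until a matching runner is online.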
- Monitoring Time Drift in Azure Kubernetes Service for Regulated Industries
In this blog post, I will share how customers can monitor their Azure Kubernetes Service (AKS) clusters for time drift using a custom container image, Azure managed Prometheus and Grafana.

Understanding Time Sync in Cloud Environments
Azure’s underlying infrastructure uses Microsoft-managed Stratum 1 time servers connected to GPS-based atomic clocks to ensure a highly accurate reference time. Linux VMs in Azure can synchronize either with their Azure host via Precision Time Protocol (PTP) devices like /dev/ptp0, or with external NTP servers over the public internet. The Azure host, being physically closer and more stable, provides a lower-latency and more reliable time source.

On Azure, Linux VMs use chrony, a Linux time synchronization service. It provides superior performance under varying network conditions and includes advanced capabilities for handling drift and jitter. Terms like "Last offset" (difference between system and reference time), "Skew" (drift rate), and "Root dispersion" (uncertainty of the time measurement) help quantify how well a system's clock is aligned.

Solution Overview
At the time of writing this article, it is not possible to monitor clock errors on Azure Kubernetes Service nodes directly, since node images cannot be customized and are managed by Azure. Customers may ask, "How do we prove our AKS workloads are keeping time accurately?" To address this, I've developed a solution that continuously monitors time drift across Kubernetes nodes: a custom container image runs as a DaemonSet and generates Prometheus metrics that can be visualized on Grafana dashboards.

This solution deploys a containerized Prometheus exporter to every node in the AKS cluster. It exposes a metric representing the node's time drift, allowing Prometheus to scrape the data and Azure Managed Grafana to visualize it.
The design emphasizes security and simplicity: the container runs as a non-root user with minimal privileges, and it securely accesses the Chrony socket on the host to extract time synchronization metrics. As we walk through the solution, it is recommended that you follow along with code on GitHub.

Technical Deep Dive: From Image Build to Pod Execution
The custom container image is built around a Python script (chrony_exporter.py) that runs the chronyc tracking command, parses its output, and calculates a 'clock error' value. This value is calculated in the following way:
    clock_error = |last_offset| + root_dispersion + (0.5 × root_delay)
This script then exports the result via a Prometheus-compatible HTTP endpoint. The only dependency it requires is the prometheus_client library, defined in the requirements.txt file.

Secure Entrypoint with Limited Root Access
The container is designed to run as a non-root user. The entrypoint.sh script launches the Python exporter using sudo, which is the only command that this user is allowed to run with elevated privileges. This ensures that while root is required to query chronyc, the rest of the container operates with a strict least-privilege model:
    #!/bin/bash
    echo "Executing as non-root user: $(whoami)"
    sudo /app/venv/bin/python /app/chrony_exporter.py
By restricting the sudoers file to a single command, this approach allows safe execution of privileged operations without exposing the container to unnecessary risk.

DaemonSet with Pod Hardening and Host Socket Access
The deployment is defined as a Kubernetes DaemonSet (chrony-ds.yaml), ensuring one pod runs on each AKS node.
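
To make the clock-error formula from the deep dive concrete, here is a minimal sketch of the parsing and arithmetic only. It is not the code from the project repository: the function names are hypothetical, the sample text imitates the layout of chronyc tracking output with made-up values, and a real exporter would obtain the text by running chronyc and would publish the result through prometheus_client.

```python
import re

# Made-up sample in the layout printed by `chronyc tracking` (values are illustrative).
SAMPLE_TRACKING = """\
Reference ID    : 50484330 (PHC0)
Stratum         : 1
System time     : 0.000000123 seconds fast of NTP time
Last offset     : -0.000012000 seconds
RMS offset      : 0.000015000 seconds
Skew            : 0.010 ppm
Root delay      : 0.000010000 seconds
Root dispersion : 0.000020000 seconds
Leap status     : Normal
"""

# Only the three fields used by the formula are extracted.
FIELD_RE = re.compile(
    r"^(Last offset|Root delay|Root dispersion)\s*:\s*([-+]?\d+(?:\.\d+)?)\s+seconds",
    re.MULTILINE,
)


def parse_tracking(text):
    """Map field name -> value in seconds for the fields the formula needs."""
    return {name: float(value) for name, value in FIELD_RE.findall(text)}


def clock_error_ms(text):
    """clock_error = |last_offset| + root_dispersion + 0.5 * root_delay, in milliseconds."""
    fields = parse_tracking(text)
    error_seconds = (
        abs(fields["Last offset"])
        + fields["Root dispersion"]
        + 0.5 * fields["Root delay"]
    )
    return error_seconds * 1000.0


if __name__ == "__main__":
    print(f"clock error: {clock_error_ms(SAMPLE_TRACKING):.3f} ms")
```

Taking the absolute value of the last offset matters because chrony reports it as signed; the dispersion and half the round-trip delay are added as an upper bound on the measurement uncertainty.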
The pod has the following hardening and configuration settings:
  - Runs as non-root (runAsUser: 1001, runAsNonRoot: true)
  - Read-only root filesystem to minimize the risk of tampering with or altering scripts
  - HostPath volume mount for /run/chrony so it can query the Chrony daemon on the node
  - Prometheus annotations for automated metric scraping

Example DaemonSet snippet:
    securityContext:
      runAsUser: 1001
      runAsGroup: 1001
      runAsNonRoot: true
    containers:
      - name: chrony-monitor
        image: <chrony-image>
        command: ["/bin/sh", "-c", "/app/entrypoint.sh"]
        securityContext:
          readOnlyRootFilesystem: true
        volumeMounts:
          - name: chrony-socket
            mountPath: /run/chrony
    volumes:
      - name: chrony-socket
        hostPath:
          path: /run/chrony
          type: Directory
This setup gives the container controlled access to the Chrony Unix socket on the host while preventing any broader filesystem access.

Configuration: Using the Azure Host as a Time Source
The underlying AKS node's (Linux VM) chrony.conf file is configured to sync time from the Azure host through the PTP device (/dev/ptp0). This configuration is optimized for cloud environments and includes:
  - refclock PHC /dev/ptp0 for direct PTP sync
  - makestep 1.0 -1 to step the clock whenever the offset exceeds 1 second (the -1 removes the limit on how many clock updates this applies to, so it is not restricted to startup)
This ensures that time metrics reflect highly accurate local synchronization, avoiding public NTP network variability.

With these layers combined (secure container build, restricted execution model, and Kubernetes-native deployment), you gain a powerful yet minimalistic time accuracy monitoring solution tailored for financial and regulated environments.

Setup Instructions

Prerequisites
  - An existing AKS cluster
  - Azure Monitor with Managed Prometheus and Grafana enabled
  - An Azure Container Registry (ACR) to host your image

Steps
Clone the project repository:
    git clone https://github.com/Azure/chrony-tracker.git
Build the Docker image locally:
    docker build --platform=linux/amd64 -t chrony-tracker:1.0 .
Tag the image for your ACR:
    docker tag chrony-tracker:1.0 <youracr>.azurecr.io/chrony-tracker:1.0
Push the image to ACR:
    docker push <youracr>.azurecr.io/chrony-tracker:1.0
Update the DaemonSet YAML (chrony-ds.yaml) to use your ACR image:
    image: <youracr>.azurecr.io/chrony-tracker:1.0
Apply the DaemonSet:
    kubectl apply -f chrony-ds.yaml
Apply the Prometheus scrape config (ConfigMap):
    kubectl apply -f ama-metrics-prometheus-config-configmap.yaml
Delete the "ama-metrics-xxx" pods from the kube-system namespace to apply the new configuration.
After these steps, your AKS nodes will be monitored for clock drift.

Viewing the Metric in Managed Grafana
Once the DaemonSet and ConfigMap are deployed and metrics are being scraped by Managed Prometheus, you can visualize the chrony_clock_error_ms metric in Azure Managed Grafana by following these steps:
Open the Azure Portal and navigate to your Azure Managed Grafana resource.
Select the Grafana workspace and open the Endpoint by clicking the URL under Overview.
From the left-hand menu, select Metrics and then click + New metric exploration.
Enter the metric name "chrony_clock_error_ms" under Search metrics and click Select.
You should now be able to view the metric.
To customize it and view all sources, click the Open in explorer button.

Optional: Secure the Metrics Endpoint
To enhance the security of the /metrics endpoint exposed by each pod, you can enable basic authentication on the exporter. This requires configuring an HTTP server inside the container with basic authentication. You would also need to update your Prometheus ConfigMap to include authentication credentials. For detailed guidance on securing scrape targets, refer to the Prometheus documentation on authentication and TLS settings.
In addition, it is recommended to use Private Link for Kubernetes monitoring with Azure Monitor and Azure managed Prometheus.

Learn More
If you'd like to explore this solution further or integrate it into your production workloads, the following resources provide valuable guidance:
  - Microsoft Learn: Time sync in Linux VMs
  - chrony-tracker GitHub repo
  - Azure Monitor and Prometheus Integration

Author
Dotan Paz
Sr. Cloud Solutions Architect, Microsoft
- Resiliency Best Practices You Need For your Blob Storage Data
Maintaining Resiliency in Azure Blob Storage: A Guide to Best Practices
Azure Blob Storage is a cornerstone of modern cloud storage, offering scalable and secure solutions for unstructured data. However, maintaining resiliency in Blob Storage requires careful planning and adherence to best practices. In this blog, I’ll share practical strategies to ensure your data remains available, secure, and recoverable under all circumstances.

1. Enable Soft Delete for Accidental Recovery (Most Important)
Mistakes happen, but soft delete can be your safety net. It allows you to recover deleted blobs within a specified retention period:
  - Configure a soft delete retention period in Azure Storage.
  - Regularly monitor your blob storage to ensure that critical data is not permanently removed by mistake.
Enabling soft delete in Azure Blob Storage carries no charge for the feature itself. However, it can increase your storage costs because deleted data is retained for the configured retention period, which means:
  - The retained data contributes to the total storage consumption during the retention period.
  - You will be charged according to the pricing tier of the data (Hot, Cool, or Archive) for the duration of retention.

2. Utilize Geo-Redundant Storage (GRS)
Geo-redundancy ensures your data is replicated across regions to protect against regional failures:
  - Choose RA-GRS (Read-Access Geo-Redundant Storage) for read access to secondary replicas in the event of a primary region outage.
  - Assess your workload’s RPO (Recovery Point Objective) and RTO (Recovery Time Objective) needs to select the appropriate redundancy.

3. Implement Lifecycle Management Policies
Efficient storage management reduces costs and ensures long-term data availability:
  - Set up lifecycle policies to transition data between hot, cool, and archive tiers based on usage.
  - Automatically delete expired blobs to save on costs while keeping your storage organized.

4. Secure Your Data with Encryption and Access Controls
Resiliency is incomplete without robust security. Protect your blobs using:
  - Encryption at Rest: Azure automatically encrypts data using server-side encryption (SSE). Consider enabling customer-managed keys for additional control.
  - Access Policies: Implement Shared Access Signatures (SAS) and Stored Access Policies to restrict access and enforce expiration dates.

5. Monitor and Alert for Anomalies
Stay proactive by leveraging Azure’s monitoring capabilities:
  - Use Azure Monitor and Log Analytics to track storage performance and usage patterns.
  - Set up alerts for unusual activities, such as sudden spikes in access or deletions, to detect potential issues early.

6. Plan for Disaster Recovery
Ensure your data remains accessible even during critical failures:
  - Create snapshots of critical blobs for point-in-time recovery.
  - Enable backup for blobs and turn on the immutability feature.
  - Test your recovery process regularly to ensure it meets your operational requirements.

7. Resource Lock
Adding Azure locks to your Blob Storage account provides an additional layer of protection by preventing accidental deletion or modification of critical resources.

8. Educate and Train Your Team
Operational resilience often hinges on user awareness:
  - Conduct regular training sessions on Blob Storage best practices.
  - Document and share a clear data recovery and management protocol with all stakeholders.

9. Critical Tip: Do Not Create New Containers with Deleted Names During Recovery
If a container or blob storage is deleted for any reason and recovery is being attempted, it’s crucial not to create a new container with the same name immediately. Doing so can significantly hinder the recovery process by overwriting backend pointers, which are essential for restoring the deleted data.
Always ensure that no new containers are created using the same name during the recovery attempt to maximize the chances of successful restoration.

Wrapping It Up
Azure Blob Storage offers an exceptional platform for scalable and secure storage, but its resiliency depends on following best practices. By enabling features like soft delete, implementing redundancy, securing data, and proactively monitoring your storage environment, you can ensure that your data is resilient to failures and recoverable in any scenario.

  - Protect your Azure resources with a lock - Azure Resource Manager | Microsoft Learn
  - Data redundancy - Azure Storage | Microsoft Learn
  - Overview of Azure Blobs backup - Azure Backup | Microsoft Learn
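
The lifecycle management practice in section 3 can be sketched as a small policy generator. This is an illustrative example under stated assumptions, not a recommended policy: the helper name, the rule name, the logs/ prefix, and the 30/90/365-day thresholds are all hypothetical, while the JSON field names follow the Azure Storage lifecycle management policy schema (rules, definition, filters, actions.baseBlob).

```python
import json


def build_lifecycle_policy(prefix, cool_after_days, archive_after_days, delete_after_days):
    """Build a lifecycle policy dict that tiers block blobs down and then deletes them.

    Hypothetical helper; thresholds are days since last modification.
    """
    if not (0 <= cool_after_days < archive_after_days < delete_after_days):
        raise ValueError("expected cool < archive < delete, all non-negative")
    return {
        "rules": [
            {
                "enabled": True,
                "name": "tier-and-expire",  # illustrative rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [prefix],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                            "tierToArchive": {"daysAfterModificationGreaterThan": archive_after_days},
                            "delete": {"daysAfterModificationGreaterThan": delete_after_days},
                        }
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    policy = build_lifecycle_policy("logs/", 30, 90, 365)
    print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and applied with the Azure CLI command az storage account management-policy create, or pasted into the portal's lifecycle management code view. Keeping the delete threshold longer than your soft delete and backup retention assumptions is a design choice worth reviewing before applying any such rule.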
- Azure Extended Zones: Optimizing Performance, Compliance, and Accessibility
Azure Extended Zones are small-scale Azure extensions located in specific metros or jurisdictions to support low-latency and data residency workloads. They enable users to run latency-sensitive applications close to end users while maintaining compliance with data residency requirements, all within the Azure ecosystem.
- (Part-2) Leverage Bicep: Standard model to Automate Azure IaaS deployment
Intended audience: Those deeply interested in IaC using Azure. Those who understand the basics of Azure Resource Manager templates and want to work more deeply with Bicep. Those who know the names of the services and functions used in Azure IaaS and have experience building automation.
Agenda: About Bicep; Difference between ARM templates and Bicep; Basic functionality; Bicep development environment; Sample code and explanation; Traps and avoidance.
Notes: Azure services are evolving every day. This content is based on what we have confirmed as of April 2023.
- (Part-1) Leverage Bicep: Standard model to Automate Azure IaaS deployment
Intended audience: Those deeply interested in IaC using Azure. Those who understand the basics of Azure Resource Manager templates and want to work more deeply with Bicep. Those who know the names of the services and functions used in Azure IaaS and have experience building automation.
Agenda: About Bicep; Difference between ARM templates and Bicep; Basic functionality; Bicep development environment; Sample code and explanation; Traps and avoidance.
Notes: Azure services are evolving every day. This content is based on what we have confirmed as of April 2023.
- (Part-3) Leverage Bicep: Standard model to Automate Azure IaaS deployment
Intended audience: Those deeply interested in IaC using Azure. Those who understand the basics of Azure Resource Manager templates and want to work more deeply with Bicep. Those who know the names of the services and functions used in Azure IaaS and have experience building automation.
Agenda: About Bicep; Difference between ARM templates and Bicep; Basic functionality; Bicep development environment; Sample code and explanation; Traps and avoidance.
Notes: Azure services are evolving every day. This content is based on what we have confirmed as of April 2023.