Announcing Cobalt 200: Azure’s next cloud-native CPU
By Selim Bilgin, Corporate Vice President, Silicon Engineering, and Pat Stemen, Vice President, Azure Cobalt

Today, we’re thrilled to announce Azure Cobalt 200, our next-generation Arm-based CPU designed for cloud-native workloads. Cobalt 200 is a milestone in our continued approach of optimizing every layer of the cloud stack, from silicon to software. Our design goals were to deliver full compatibility for workloads using our existing Azure Cobalt CPUs, deliver up to 50% performance improvement over Cobalt 100, and integrate with the latest Microsoft security, networking, and storage technologies. Like its predecessor, Cobalt 200 is optimized for common customer workloads and delivers unique capabilities for our own Microsoft cloud products. Our first production Cobalt 200 servers are now live in our datacenters, with wider rollout and customer availability coming in 2026.

Azure Cobalt 200 SoC and platform

Building on Cobalt 100: Leading Price-Performance

Our Azure Cobalt journey began with Cobalt 100, our first custom-built processor for cloud-native workloads. Cobalt 100 VMs have been Generally Available (GA) since October 2024, and availability has expanded rapidly to 32 Azure datacenter regions around the world. In just one year, we have been blown away by the pace at which customers have adopted the new platform and migrated their most critical workloads to Cobalt 100 for its performance, efficiency, and price-performance benefits. Cloud analytics leaders like Databricks and Snowflake are adopting Cobalt 100 to optimize their cloud footprint. The compute performance and energy-efficiency balance of Cobalt 100-based virtual machines and containers has proven ideal for large-scale data processing workloads. Microsoft’s own cloud services have also rapidly adopted Azure Cobalt for similar benefits. Microsoft Teams achieved up to 45% better performance using Cobalt 100 than on its previous compute platform.
This increased performance means fewer servers are needed for the same task; for instance, Microsoft Teams media processing uses 35% fewer compute cores with Cobalt 100.

Designing Compute Infrastructure for Real Workloads

With this solid foundation, we set out to design a worthy successor – Cobalt 200. We faced a key challenge: traditional compute benchmarks do not represent the diversity of our customer workloads. Our telemetry from the wide range of workloads running in Azure (from small microservices to globally available SaaS products) did not match common hardware performance benchmarks. Existing benchmarks tend to skew toward CPU-core-focused compute patterns, leaving gaps in how real-world cloud applications behave at scale when using network and storage resources. Optimizing Azure Cobalt for customer workloads requires us to expand beyond these CPU core benchmarks to truly understand and model the diversity of customer workloads in Azure. As a result, we created a portfolio of benchmarks drawn directly from the usage patterns we see in Azure, including databases, web servers, storage caches, network transactions, and data analytics. Each of our benchmark workloads includes multiple variants for performance evaluation based on the ways our customers may use the underlying database, storage, or web-serving technology. In total, we built and refined over 140 individual benchmark variants as part of our internal evaluation suite. With the help of our software teams, we created a complete digital twin simulation from the silicon up: beginning with the CPU core microarchitecture, fabric, and memory IP blocks in Cobalt 200, all the way through the server design and rack topology. Then, we used AI, statistical modelling, and the power of Azure to model the performance and power consumption of the 140 benchmarks against 2,800 combinations of SoC and system design parameters: core count, cache size, memory speed, server topology, SoC power, and rack configuration.
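To make the idea of a design-space sweep concrete, here is a toy illustration. Everything below is invented for the sketch (the parameter names, candidate values, and the perf-per-watt scoring function); it is not Microsoft's actual simulation, which models full benchmarks against a digital twin. The shape of the search, however, is the same: enumerate every combination and keep the best-scoring design point.

```python
import itertools

# Hypothetical design parameters to sweep (illustrative values only).
design_space = {
    "core_count": [96, 112, 128, 132, 144],
    "l3_cache_mb": [96, 128, 192],
    "memory_speed_mtps": [5600, 6400],
    "soc_power_w": [250, 300, 350],
}

def score(core_count, l3_cache_mb, memory_speed_mtps, soc_power_w):
    """Toy perf-per-watt model standing in for the real benchmark simulation."""
    perf = core_count * 1.0 + l3_cache_mb * 0.4 + memory_speed_mtps * 0.01
    return perf / soc_power_w

# Exhaustively evaluate every combination and keep the best one.
best = max(
    itertools.product(*design_space.values()),
    key=lambda combo: score(*combo),
)
print(dict(zip(design_space, best)))
```

A real sweep replaces the one-line scoring function with per-benchmark performance and power models, which is why the actual effort involved hundreds of thousands of candidate evaluations rather than the few hundred shown here.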
This resulted in the evaluation of over 350,000 configuration candidates of the Cobalt 200 system as part of our design process. This extensive modelling and simulation helped us to quickly iterate to find the optimal design point for Cobalt 200, delivering over 50% increased performance compared to Cobalt 100, all while continuing to deliver our most power-efficient platform in Azure.

Cobalt 200: Delivering Performance and Efficiency

At the heart of every Cobalt 200 server is the most advanced compute silicon in Azure: the Cobalt 200 System-on-Chip (SoC). The Cobalt 200 SoC is built around the Arm Neoverse Compute Subsystems V3 (CSS V3), the latest performance-optimized core and fabric from Arm. Each Cobalt 200 SoC includes 132 active cores with 3MB of L2 cache per core and 192MB of L3 system cache to deliver exceptional performance for customer workloads. Power efficiency is just as important as raw performance. Energy consumption represents a significant portion of the lifetime operating cost of a cloud server. One of the unique innovations in our Azure Cobalt CPUs is individual per-core Dynamic Voltage and Frequency Scaling (DVFS). In Cobalt 200 this allows each of the 132 cores to run at a different performance level, delivering optimal power consumption no matter the workload. We are also taking advantage of the latest TSMC 3nm process, further improving power efficiency. Security is top-of-mind for all of our customers and a key part of the unique innovation in Cobalt 200. We designed and built a custom memory controller for Cobalt 200, so that memory encryption is on by default with negligible performance impact. Cobalt 200 also implements Arm’s Confidential Compute Architecture (CCA), which supports hardware-based isolation of VM memory from the hypervisor and host OS.
When designing Cobalt 200, our benchmark workloads and design simulations revealed an interesting trend: several universal compute patterns emerged – compression, decompression, and encryption. Over 30% of cloud workloads made significant use of one of these common operations. Optimizing for them required a different approach than just cache sizing and CPU core selection. We designed custom compression and cryptography accelerators – dedicated blocks of silicon on each Cobalt 200 SoC – solely for the purpose of accelerating these operations without sacrificing CPU cycles. These accelerators help reduce workload CPU consumption and overall costs. For example, by offloading compression and encryption tasks to the Cobalt 200 accelerator, Azure SQL is able to reduce use of critical compute resources, prioritizing them for customer workloads.

Leading Infrastructure Innovation with Cobalt 200

Azure Cobalt is more than just an SoC, and we are constantly optimizing and accelerating every layer in the infrastructure. The latest Azure Boost capabilities are built into the new Cobalt 200 system, significantly improving networking and remote storage performance. Azure Boost delivers increased network bandwidth and offloads remote storage and networking tasks to custom hardware, improving overall workload performance and reducing latency. Cobalt 200 systems also embed the Azure Integrated HSM (Hardware Security Module), providing customers with top-tier cryptographic key protection within Azure’s infrastructure, ensuring sensitive data stays secure. The Azure Integrated HSM works with Azure Key Vault for simplified management of encryption keys, offering high availability and scalability as well as meeting FIPS 140-3 Level 3 compliance.
An Azure Cobalt 200 server in a validation lab

Looking Forward to 2026

We are excited about the innovation and advanced technology in Cobalt 200 and look forward to seeing how our customers create breakthrough products and services. We’re busy racking and stacking Cobalt 200 servers around the world and look forward to sharing more as we get closer to wider availability next year. Check out the Microsoft Ignite opening keynote, read more on what's new in Azure at Ignite, and learn more about Microsoft's global infrastructure.

Azure VNet Flow Logs with Terraform: The Complete Migration and Traffic Analytics Guide
Migrating from NSG Flow Logs to VNet Flow Logs in Azure: Implementation with Terraform

Author: Ibrahim Baig (Consultant)

Executive Summary

Microsoft is retiring Network Security Group (NSG) flow logs and recommends migrating to Virtual Network (VNet) flow logs. After June 30, 2025, new NSG flow logs cannot be created, and all NSG flow logs will be retired by September 30, 2027. Migrating to VNet flow logs ensures continued support and provides broader, simpler network visibility.

What Changed & Key Dates
- June 30, 2025: Creation of new NSG flow logs is blocked.
- September 30, 2027: NSG flow logs are retired (resources deleted; historical blobs remain per retention policy).
- Microsoft provides migration scripts and policy guidance for NSG→VNet flow logs.

Why Migrate? (Benefits)

Operational Simplicity & Coverage
- Enable logging at the VNet, subnet, or NIC scope—no dependency on NSG.
- Broader visibility across all workloads inside a VNet, not just NSG-governed traffic.

Security & Analytics
- Native integration with Traffic Analytics for enriched insights.
- Monitor Azure Virtual Network Manager (AVNM) security admin rules.

Continuity & Cost Parity
- VNet flow logs are priced the same as NSG flow logs (with 5 GB/month free).

What’s New in VNet Flow Logs
- Scopes: Enable at VNet, subnet, or NIC level.
- Storage: JSON logs to Azure Storage.
- At-scale enablement: Built-in Azure Policy for auditing and auto-deployment.
- Analytics: Traffic Analytics add-on for deep insights.
- AVNM awareness: Observe centrally managed security admin rules.

Traffic Analytics: Capabilities & Value

Traffic Analytics (TA) is a powerful add-on for VNet flow logs, providing:
- Automated Traffic Insights: Visualize traffic flows, identify top talkers, and detect anomalous patterns.
- Threat Detection: Surface suspicious flows, lateral movement, and communication with malicious IPs.
- Network Segmentation Validation: Confirm that segmentation policies are effective and spot unintended access.
- Performance Monitoring: Analyze bandwidth usage, latency, and flow volumes for troubleshooting.
- Customizable Dashboards: Drill down by subnet, region, or workload for targeted investigations.
- Integration: Seamless with Azure Monitor and Log Analytics for alerting and automation.

For practical recipes and advanced use cases, see https://blog.cloudtrooper.net/2024/05/08/vnet-flow-logs-recipes/.

GAP: The Terraform Registry page for azurerm_network_watcher_flow_log does not yet provide an explicit VNet flow logs example. In practice, you use the same resource and set target_resource_id to the ID of the VNet (or Subnet/NIC). Registry page (latest): https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/network_watcher_flow_log

Important notes:
- Same resource block: azurerm_network_watcher_flow_log
- Use target_resource_id = <resource ID of VNet/Subnet/NIC> (instead of the legacy network_security_group_id)
- As of June 30, 2025, creating new NSG flow logs is no longer possible (provider notes); migrate to VNet/Subnet/NIC targets.
- Keep your azurerm provider up to date; earlier builds had validation gaps for subnet/NIC IDs, which were tracked and addressed in provider issues.

Implementation Guide

Option A — Terraform (Recommended for IaC)

Note: Use a dedicated Storage account for flow logs, as lifecycle rules may be overwritten.
terraform {
  required_version = ">= 1.5"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.110.0" # or latest
    }
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_network_watcher" "this" {
  name                = "NetworkWatcher_${var.region}"
  resource_group_name = "NetworkWatcherRG"
}

resource "azurerm_network_watcher_flow_log" "vnet_flow_log" {
  name                 = "${var.vnet_name}-flowlog"
  network_watcher_name = data.azurerm_network_watcher.this.name
  resource_group_name  = data.azurerm_network_watcher.this.resource_group_name
  target_resource_id   = azurerm_virtual_network.vnet.id
  storage_account_id   = azurerm_storage_account.flowlogs_sa.id
  enabled              = true

  retention_policy {
    enabled = true
    days    = 30
  }

  traffic_analytics {
    enabled               = true
    workspace_id          = azurerm_log_analytics_workspace.law.workspace_id
    workspace_region      = azurerm_log_analytics_workspace.law.location
    workspace_resource_id = azurerm_log_analytics_workspace.law.id
    interval_in_minutes   = 60
  }

  tags = {
    owner       = "network-platform"
    environment = var.env
  }
}

Option B — Azure CLI

az network watcher flow-log create \
  --location westus \
  --resource-group MyResourceGroup \
  --name myVNetFlowLog \
  --vnet MyVNetName \
  --storage-account mystorageaccount \
  --workspace "/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<LAWName>" \
  --traffic-analytics true \
  --interval 60

Option C — Azure Portal
- Go to Network Watcher → Flow logs → + Create.
- Choose Flow log type = Virtual network; select VNet/Subnet/NIC, Storage account, and optionally enable Traffic Analytics.

Option D — At Scale via Azure Policy
- Use built-in policies to audit and auto-deploy VNet flow logs (DeployIfNotExists).

Migration Approach (NSG → VNet Flow Logs)
1. Inventory existing NSG flow logs.
2. Choose a migration method: Microsoft script or Azure Policy.
3. Run both in parallel temporarily to validate.
4. Disable NSG flow logs before retirement.
Challenges & Mitigations
- Permissions: Ensure required roles on the Log Analytics workspace.
- Terraform lifecycle: Use a dedicated Storage account.
- Tooling compatibility: Verify SIEM/NDR support.
- Provider/API maturity: Use a current azurerm provider.

Validation Checklist
- Storage: New blobs appear in the configured Storage account.
- Traffic Analytics: Data visible in the Log Analytics workspace.
- AVNM: Confirm traffic allowed/denied states appear in logs.

Cost Considerations
- VNet flow logs ingestion: $0.50/GB after 5 GB free/month.
- Traffic Analytics processing: $2.30/GB (60-min) or $3.50/GB (10-min).

Traffic Analytics Deep Dive

VNet Flow Logs are stored in Azure Blob Storage. Optionally, you can enable Traffic Analytics, which will do two things: it will enrich the flow logs with additional information, and it will send everything to a Log Analytics workspace for easy querying. This “enrich and forward to Log Analytics” operation happens at intervals, either every 10 minutes or every hour.

Table Structure: NTAIPDetails

This table contains some enrichment data about public IP addresses, including whether they belong to Azure services and their region, and geolocation information for other public IPs. Here you can see a sample of what that table looks like:

NTAIpDetails | distinct FlowType, PublicIpDetails, Location

Table Structure: NTATopologyDetails

This table contains information about different elements of your topology, including VNets, subnets, route tables, routes, NSGs, Application Gateways, and much more. Here you can see what it looks like:

Table Structure: NTANetAnalytics

Alright, now we are coming to more interesting things: this table is the one containing the flows we are looking for. Records in this table will contain the usual attributes you would expect, such as source and destination IP, protocol, and destination port.
Additionally, data will be enriched with information such as:
- Source and destination VM
- Source and destination NIC
- Source and destination subnet
- Source and destination load balancer
- Flow encryption (yes/no)
- Whether the flow is going over ExpressRoute
- And many more

Further below you can read some scenarios with detailed queries that show examples of ways you can extract information from VNet Flow Logs and Traffic Analytics. Of course, these are just some of the scenarios that came to mind on my topology; the idea is that you can get inspiration from these queries to support your individual use case.

Example Scenario: Imagine you want to see which IP addresses a given virtual machine has been talking to in the last few days:

NTANetAnalytics
| where TimeGenerated > ago(10d)
| where SrcIp == "10.10.1.4" and strlen(DestIp)>0
| summarize TotalBytes=sum(BytesDestToSrc+BytesSrcToDest) by SrcIp, DestIp

Similarly, you can play around with such KQL queries in the workspace to deep-dive into the flow logs.

References & Further Reading
https://learn.microsoft.com/en-us/azure/network-watcher/nsg-flow-logs-overview
https://learn.microsoft.com/en-us/azure/network-watcher/nsg-flow-logs-migrate
https://learn.microsoft.com/en-us/azure/network-watcher/vnet-flow-logs-overview
https://learn.microsoft.com/en-us/azure/network-watcher/vnet-flow-logs-manage
https://learn.microsoft.com/en-us/cli/azure/network/watcher/flow-log?view=azure-cli-latest
https://learn.microsoft.com/en-us/azure/network-watcher/vnet-flow-logs-policy
https://azure.microsoft.com/en-us/pricing/details/network-watcher/
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/network_watcher_flow_log
https://blog.cloudtrooper.net/2024/05/08/vnet-flow-logs-recipes/

Resiliency Best Practices You Need For your Blob Storage Data
Maintaining Resiliency in Azure Blob Storage: A Guide to Best Practices

Azure Blob Storage is a cornerstone of modern cloud storage, offering scalable and secure solutions for unstructured data. However, maintaining resiliency in Blob Storage requires careful planning and adherence to best practices. In this blog, I’ll share practical strategies to ensure your data remains available, secure, and recoverable under all circumstances.

1. Enable Soft Delete for Accidental Recovery (Most Important)
Mistakes happen, but soft delete can be your safety net. It allows you to recover deleted blobs within a specified retention period: Configure a soft delete retention period in Azure Storage. Regularly monitor your blob storage to ensure that critical data is not permanently removed by mistake. Enabling soft delete in Azure Blob Storage does not come with any additional cost for the feature itself. However, it can potentially impact your storage costs, because the deleted data is retained for the configured retention period, which means: The retained data contributes to the total storage consumption during the retention period. You will be charged according to the pricing tier of the data (Hot, Cool, or Archive) for the duration of retention.

2. Utilize Geo-Redundant Storage (GRS)
Geo-redundancy ensures your data is replicated across regions to protect against regional failures: Choose RA-GRS (Read-Access Geo-Redundant Storage) for read access to secondary replicas in the event of a primary region outage. Assess your workload’s RPO (Recovery Point Objective) and RTO (Recovery Time Objective) needs to select the appropriate redundancy.

3. Implement Lifecycle Management Policies
Efficient storage management reduces costs and ensures long-term data availability: Set up lifecycle policies to transition data between hot, cool, and archive tiers based on usage. Automatically delete expired blobs to save on costs while keeping your storage organized.

4. Secure Your Data with Encryption and Access Controls
Resiliency is incomplete without robust security. Protect your blobs using: Encryption at Rest: Azure automatically encrypts data using server-side encryption (SSE). Consider enabling customer-managed keys for additional control. Access Policies: Implement Shared Access Signatures (SAS) and Stored Access Policies to restrict access and enforce expiration dates.

5. Monitor and Alert for Anomalies
Stay proactive by leveraging Azure’s monitoring capabilities: Use Azure Monitor and Log Analytics to track storage performance and usage patterns. Set up alerts for unusual activities, such as sudden spikes in access or deletions, to detect potential issues early.

6. Plan for Disaster Recovery
Ensure your data remains accessible even during critical failures: Create snapshots of critical blobs for point-in-time recovery. Enable backup for blobs and turn on the immutability feature. Test your recovery process regularly to ensure it meets your operational requirements.

7. Resource Lock
Adding Azure locks to your Blob Storage account provides an additional layer of protection by preventing accidental deletion or modification of critical resources.

8. Educate and Train Your Team
Operational resilience often hinges on user awareness: Conduct regular training sessions on Blob Storage best practices. Document and share a clear data recovery and management protocol with all stakeholders.

9. Critical Tip: Do Not Create New Containers with Deleted Names During Recovery
If a container or blob storage is deleted for any reason and recovery is being attempted, it’s crucial not to create a new container with the same name immediately. Doing so can significantly hinder the recovery process by overwriting backend pointers, which are essential for restoring the deleted data. Always ensure that no new containers are created using the same name during the recovery attempt to maximize the chances of successful restoration.
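The advice about alerting on sudden spikes in deletions can be prototyped without any Azure dependency. The sketch below is only an illustration: the window size, threshold factor, and sample counts are invented, and in practice you would drive this logic from Azure Monitor metrics or Log Analytics query results rather than a hard-coded list.

```python
from statistics import mean

def find_spikes(series, window=6, factor=3.0, min_baseline=1.0):
    """Return indexes where a value exceeds `factor` times the trailing average."""
    spikes = []
    for i in range(window, len(series)):
        # Baseline is the mean of the preceding `window` points,
        # floored so a quiet period doesn't make any activity look anomalous.
        baseline = max(mean(series[i - window:i]), min_baseline)
        if series[i] > factor * baseline:
            spikes.append(i)
    return spikes

# Hourly blob-delete counts (sample data); the burst at the end is the anomaly.
deletes_per_hour = [2, 3, 1, 0, 2, 4, 3, 2, 1, 3, 2, 48]
print(find_spikes(deletes_per_hour))
```

The same threshold-over-trailing-average pattern is what a simple Azure Monitor metric alert encodes declaratively; the code form is useful when you want custom logic, such as ignoring known maintenance windows.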
Wrapping It Up

Azure Blob Storage offers an exceptional platform for scalable and secure storage, but its resiliency depends on following best practices. By enabling features like soft delete, implementing redundancy, securing data, and proactively monitoring your storage environment, you can ensure that your data is resilient to failures and recoverable in any scenario.

Protect your Azure resources with a lock - Azure Resource Manager | Microsoft Learn
Data redundancy - Azure Storage | Microsoft Learn
Overview of Azure Blobs backup - Azure Backup | Microsoft Learn

Operational Excellence In AI Infrastructure Fleets: Standardized Node Lifecycle Management
Co-authors: Choudary Maddukuri and Bhushan Mehendale

AI infrastructure is scaling at an unprecedented pace, and the complexity of managing it is growing just as quickly. Onboarding new hardware into hyperscale fleets can take months, slowed by fragmented tools, vendor-specific firmware, and inconsistent diagnostics. As hyperscalers expand with diverse accelerators and CPU architectures, operational friction has become a critical bottleneck. Microsoft, in collaboration with the Open Compute Project (OCP) and leading silicon partners, is addressing this challenge. By standardizing lifecycle management across heterogeneous fleets, we’ve dramatically reduced onboarding effort, improved reliability, and achieved >95% Nodes-in-Service across very large fleets. This blog explores how we are contributing to and leveraging open standards to transform fragmented infrastructure into scalable, vendor-neutral AI platforms.

Industry Context & Problem

The rapid growth of generative AI has accelerated the adoption of GPUs and accelerators from multiple vendors, alongside diverse CPU architectures such as Arm and x86. Each new hardware SKU introduces its own ecosystem of proprietary tools, firmware update processes, management interfaces, reliability mechanisms, and diagnostic workflows. This hardware diversity leads to engineering toil, delayed deployments, and inconsistent customer experiences. Without a unified approach to lifecycle management, hyperscalers face escalating operational costs, slower innovation, and reduced efficiency.

Node Lifecycle Standardization: Enabling Scalable, Reliable AI Infrastructure

Microsoft, through the Open Compute Project (OCP) in collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, is leading an industry-wide initiative to standardize AI infrastructure lifecycle management across GPU and CPU hardware management workstreams.
Historically, onboarding each new SKU was a highly resource-intensive effort due to custom implementations and vendor-specific behaviors that required extensive Azure integration. This slowed scalability, increased engineering overhead, and limited innovation. With standardized node lifecycle processes and compliance tooling, hyperscalers can now onboard new SKUs much faster, achieving over 70% reduction in effort while enhancing overall fleet operational excellence. These efforts also enable silicon vendors to ensure interoperability across multiple cloud providers.

Figure: How Standardization benefits both Hyperscalers & Suppliers.

Key Benefits and Capabilities

- Firmware Updates: Firmware update mechanisms aligned with DMTF standards minimize downtime and streamline fleet-wide secure deployments.
- Unified Manageability Interfaces: Standardized Redfish APIs and PLDM protocols create a consistent framework for out-of-band management, reducing integration overhead and ensuring predictable behavior across hardware vendors.
- RAS (Reliability, Availability and Serviceability) Features: Standardization enforces minimum RAS requirements across all IP blocks, including CPER (Common Platform Error Record) based error logging, crash dumps, and error recovery flows to enhance system uptime.
- Debug & Diagnostics: Unified APIs and standardized crash & debug dump formats reduce issue resolution time from months to days. Streamlined diagnostic workflows enable precise FRU isolation and clear service actions.
- Compliance Tooling: Tool contributions such as CTAM (Compliance Tool for Accelerator Manageability) and CPACT (Cloud Processor Accessibility Compliance Tool) automate compliance and acceptance testing—ensuring suppliers meet hyperscaler requirements for seamless onboarding.
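Because the manageability work standardizes on Redfish resource schemas, a fleet client can interpret any compliant vendor's hardware with the same few lines of code. The sketch below is illustrative only: the payload is a trimmed, hand-written example shaped like a Redfish ComputerSystem resource (normally fetched from a BMC at a path such as /redfish/v1/Systems/<id>), not output from real hardware.

```python
# Trimmed, illustrative payload shaped like a Redfish ComputerSystem resource.
sample_system = {
    "@odata.id": "/redfish/v1/Systems/1",
    "PowerState": "On",
    "Status": {"State": "Enabled", "Health": "OK"},
    "ProcessorSummary": {"Count": 2, "Status": {"Health": "OK"}},
    "MemorySummary": {"TotalSystemMemoryGiB": 512, "Status": {"Health": "Warning"}},
}

def unhealthy_subsystems(system):
    """Collect any summary blocks whose reported Health is not OK."""
    findings = []
    for name in ("ProcessorSummary", "MemorySummary"):
        health = system.get(name, {}).get("Status", {}).get("Health")
        if health not in (None, "OK"):
            findings.append((name, health))
    return findings

print(unhealthy_subsystems(sample_system))
```

The value of the standard is exactly that this function does not care which vendor produced the node: the Status/Health convention is uniform, so one health-triage routine covers a heterogeneous fleet.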
Technical Specifications & Contributions

Through deep collaboration within the Open Compute Project (OCP) community, Microsoft and its partners have published multiple specifications that streamline SKU development, validation, and fleet operations.

Summary of Key Contributions

Specification | Focus Area | Benefit
GPU Firmware Update requirements | Firmware Updates | Enables consistent firmware update processes across vendors
GPU Management Interfaces | Manageability | Standardizes telemetry and control via Redfish/PLDM
GPU RAS Requirements | Reliability and Availability | Reduces AI job interruptions caused by hardware errors
CPU Debug and RAS requirements | Debug and Diagnostics | Achieves >95% node serviceability through unified diagnostics and debug
CPU Impactless Updates requirements | Impactless Updates | Enables impactless firmware updates to address security and quality issues without workload interruptions
Compliance Tools | Validation | Automates specification compliance testing for faster hardware onboarding

Embracing Open Standards: A Collaborative Shift in AI Infrastructure Management

This standardized approach to lifecycle management represents a foundational shift in how AI infrastructure is maintained. By embracing open standards and collaborative innovation, the industry can scale AI deployments faster, with greater reliability and lower operational cost. Microsoft’s leadership within the OCP community—and its deep partnerships with other hyperscalers and silicon vendors—is paving the way for scalable, interoperable, and vendor-neutral AI infrastructure across the global cloud ecosystem. To learn more about Microsoft’s datacenter innovations, check out the virtual datacenter tour at datacenters.microsoft.com.

Deploying a GitLab Runner on Azure: A Step-by-Step Guide
This guide walks you through the entire process — from VM setup to running your first successful job.

Step 1: Create an Azure VM

Log in to the Azure Portal and create a new VM with the following settings:
- Image: Ubuntu 20.04 LTS (recommended)
- Authentication: SSH Public Key (generate a .pem file for secure access)

Once created, note the public IP address.

Connect to the VM from your terminal:

ssh -i "/path/to/your/key.pem" admin_name@<YOUR_VM_PUBLIC_IP>

Note: Replace the path to the .pem file and the admin name with the values you set during VM deployment.

Step 2: Install Docker on the Azure VM

Run the following commands to install Docker:

sudo apt update && sudo apt upgrade -y
sudo apt install -y docker.io
sudo systemctl start docker
sudo systemctl enable docker #Enable Docker to start automatically on boot
sudo usermod -aG docker $USER

Test Docker with:

docker run hello-world

A success message should appear. If you see permission denied, run:

newgrp docker

Note: Log out and log back in (or restart the VM) for group changes to apply.
Step 3: Install GitLab Runner

Download the GitLab Runner binary, assign execution permissions, then install and start the runner as a service:

#Step1: Download the binary
sudo curl -L --output /usr/local/bin/gitlab-runner \
https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64

#Step2: Assign execution permissions
sudo chmod +x /usr/local/bin/gitlab-runner

#Step3: Install and start the runner as a service
sudo gitlab-runner install --user=azureuser
sudo gitlab-runner start
sudo systemctl enable gitlab-runner #Enable GitLab Runner to start automatically on boot

Step 4: Register the GitLab Runner

Navigate to the Runners section in GitLab to generate a registration token (GitLab -> Settings -> CI/CD -> Runners -> New Project Runner). On your Azure VM, run:

sudo gitlab-runner register \
  --url https://gitlab.com/ \
  --registration-token <YOUR_TOKEN> \
  --executor docker \
  --docker-image ubuntu:22.04 \
  --description "Azure VM Runner" \
  --tag-list "gitlab-runner-vm" \
  --non-interactive

Note: Replace the registration token, description, and tag list as required.

After registration, restart the runner:

sudo gitlab-runner restart

Verify the runner’s status with:

sudo gitlab-runner list

Your runner should appear in the list. If the runner does not appear, make sure to follow Step 4 as described.

Step 5: Add Runner Tags to Your Pipeline

In .gitlab-ci.yml:

default:
  tags:
    - gitlab-runner-vm

Step 6: Verify Pipeline Execution

Create a simple job to test the runner:

test-runner:
  tags:
    - gitlab-runner-vm
  script:
    - echo "Runner is working!"

Troubleshooting Common Issues

Permission Denied (Docker Error)
Error: docker: permission denied while trying to connect to the Docker daemon socket
Solution: Run newgrp docker. If unresolved, restart Docker: sudo systemctl restart docker

No Active Runners Online
Error: This job is stuck because there are no active runners online.
Solution: Check the runner status with sudo gitlab-runner status. If inactive, restart the runner with sudo gitlab-runner restart. Ensure the runner tag in your pipeline matches the one you provided when registering the runner for the project.

Final Tips

Always restart the runner after making configuration changes: sudo gitlab-runner restart. Remember to periodically check the runner’s status and update its configuration as needed to keep it running smoothly. Happy coding and enjoy the enhanced capabilities of your new GitLab Runner setup!

Monitoring Time Drift in Azure Kubernetes Service for Regulated Industries
In this blog post, I will share how customers can monitor their Azure Kubernetes Service (AKS) clusters for time drift using a custom container image, Azure Managed Prometheus, and Grafana.

Understanding Time Sync in Cloud Environments

Azure’s underlying infrastructure uses Microsoft-managed Stratum 1 time servers connected to GPS-based atomic clocks to ensure a highly accurate reference time. Linux VMs in Azure can synchronize either with their Azure host via Precision Time Protocol (PTP) devices like /dev/ptp0, or with external NTP servers over the public internet. The Azure host, being physically closer and more stable, provides a lower-latency and more reliable time source. On Azure, Linux VMs use chrony, a Linux time synchronization service. It provides superior performance under varying network conditions and includes advanced capabilities for handling drift and jitter. Terminology like "Last offset" (difference between system and reference time), "Skew" (drift rate), and "Root dispersion" (uncertainty of the time measurement) helps quantify how well a system's clock is aligned.

Solution Overview

At the time of writing this article, it is not possible to monitor clock errors on Azure Kubernetes Service nodes directly, since node images cannot be customized and are managed by Azure. Customers may ask, "How do we prove our AKS workloads are keeping time accurately?" To address this, I've developed a solution that continuously monitors time drift across Kubernetes nodes: a custom container image running as a DaemonSet, which generates Prometheus metrics that can be visualized on Grafana dashboards. This solution deploys a containerized Prometheus exporter to every node in the Azure Kubernetes Service (AKS) cluster. It exposes a metric representing the node's time drift, allowing Prometheus to scrape the data and Azure Managed Grafana to visualize it.
The design emphasizes security and simplicity: the container runs as a non-root user with minimal privileges, and it securely accesses the chrony socket on the host to extract time synchronization metrics. As we walk through the solution, it is recommended that you follow along with the code on GitHub.

Technical Deep Dive: From Image Build to Pod Execution

The custom container image is built around a Python script (chrony_exporter.py) that runs the chronyc tracking command, parses its output, and calculates a 'clock error' value as follows:

clock_error = |last_offset| + root_dispersion + (0.5 × root_delay)

The script then exports the result via a Prometheus-compatible HTTP endpoint. The only dependency it requires is the prometheus_client library, defined in the requirements.txt file.

Secure Entrypoint with Limited Root Access

The container is designed to run as a non-root user. The entrypoint.sh script launches the Python exporter using sudo, which is the only command this user is allowed to run with elevated privileges. This ensures that while root is required to query chronyc, the rest of the container operates under a strict least-privilege model:

#!/bin/bash
echo "Executing as non-root user: $(whoami)"
sudo /app/venv/bin/python /app/chrony_exporter.py

By restricting the sudoers file to a single command, this approach allows safe execution of privileged operations without exposing the container to unnecessary risk.

DaemonSet with Pod Hardening and Host Socket Access

The deployment is defined as a Kubernetes DaemonSet (chrony-ds.yaml), ensuring one pod runs on each AKS node.
The pod has the following hardening and configuration settings:

- Runs as non-root (runAsUser: 1001, runAsNonRoot: true)
- Read-only root filesystem to minimize the risk of tampering and script alteration
- HostPath volume mount for /run/chrony so it can query the Chrony daemon on the node
- Prometheus annotations for automated metric scraping

Example DaemonSet snippet:

securityContext:
  runAsUser: 1001
  runAsGroup: 1001
  runAsNonRoot: true
containers:
  - name: chrony-monitor
    image: <chrony-image>
    command: ["/bin/sh", "-c", "/app/entrypoint.sh"]
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: chrony-socket
        mountPath: /run/chrony
volumes:
  - name: chrony-socket
    hostPath:
      path: /run/chrony
      type: Directory

This setup gives the container controlled access to the Chrony Unix socket on the host while preventing any broader filesystem access.

Configuration: Using the Azure Host as a Time Source

The underlying AKS node's (Linux VM) chrony.conf file is configured to sync time from the Azure host through the PTP device (/dev/ptp0). This configuration is optimized for cloud environments and includes:

- refclock PHC /dev/ptp0 for direct PTP sync
- makestep 1.0 -1 to immediately correct large drifts on startup

This ensures that time metrics reflect highly accurate local synchronization, avoiding public NTP network variability. With these layers combined (secure container build, restricted execution model, and Kubernetes-native deployment), you gain a powerful yet minimalistic time accuracy monitoring solution tailored for financial and regulated environments.

Setup Instructions

Prerequisites

- An existing AKS cluster
- Azure Monitor with Managed Prometheus and Grafana enabled
- An Azure Container Registry (ACR) to host your image

Steps

1. Clone the project repository: git clone https://github.com/Azure/chrony-tracker.git
2. Build the Docker image locally: docker build --platform=linux/amd64 -t chrony-tracker:1.0 .
3. Tag the image for your ACR: docker tag chrony-tracker:1.0 <youracr>.azurecr.io/chrony-tracker:1.0
4. Push the image to ACR: docker push <youracr>.azurecr.io/chrony-tracker:1.0
5. Update the DaemonSet YAML (chrony-ds.yaml) to use your ACR image: image: <youracr>.azurecr.io/chrony-tracker:1.0
6. Apply the DaemonSet: kubectl apply -f chrony-ds.yaml
7. Apply the Prometheus scrape config (ConfigMap): kubectl apply -f ama-metrics-prometheus-config-configmap.yaml
8. Delete the "ama-metrics-xxx" pods from the kube-system namespace so the new configuration is picked up.

After these steps, your AKS nodes will be monitored for clock drift.

Viewing the Metric in Managed Grafana

Once the DaemonSet and ConfigMap are deployed and metrics are being scraped by Managed Prometheus, you can visualize the chrony_clock_error_ms metric in Azure Managed Grafana by following these steps:

1. Open the Azure Portal and navigate to your Azure Managed Grafana resource.
2. Select the Grafana workspace and navigate to the endpoint by clicking the URL under Overview.
3. From the left-hand menu, select Metrics, then click + New metric exploration.
4. Enter the metric name "chrony_clock_error_ms" under Search metrics and click Select.
5. You should now be able to view the metric.
6. To customize it and view all sources, click the Open in explorer button.

Optional: Secure the Metrics Endpoint

To enhance the security of the /metrics endpoint exposed by each pod, you can enable basic authentication on the exporter. This requires configuring an HTTP server inside the container with basic authentication. You would also need to update your Prometheus ConfigMap to include authentication credentials. For detailed guidance on securing scrape targets, refer to the Prometheus documentation on authentication and TLS settings.
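As a starting point, such a basic-auth wrapper could be sketched with only the Python standard library. Everything below is an illustrative assumption, not the repo's implementation: the credentials, port, and metric value are placeholders, and in practice you would inject the credentials from a Kubernetes Secret rather than hard-coding them.

```python
import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder credentials -- in a real deployment, read these from a Secret
USERNAME, PASSWORD = "metrics", "s3cret"
EXPECTED_AUTH = "Basic " + base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        if self.headers.get("Authorization") != EXPECTED_AUTH:
            # Missing or wrong credentials: challenge the client
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="metrics"')
            self.end_headers()
            return
        # Placeholder metric body; the real exporter would emit the live value
        body = b"chrony_clock_error_ms 0.475307\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

def make_server(port: int = 9100) -> HTTPServer:
    """Build the exporter's HTTP server; call .serve_forever() to start it."""
    return HTTPServer(("0.0.0.0", port), MetricsHandler)
```

The matching Prometheus scrape job would then need a basic_auth section carrying the same credentials so that scrapes are not rejected with 401.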
In addition, it is recommended to use Private Link for Kubernetes monitoring with Azure Monitor and Azure Managed Prometheus.

Learn More

If you'd like to explore this solution further or integrate it into your production workloads, the following resources provide valuable guidance:

- Microsoft Learn: Time sync in Linux VMs
- chrony-tracker GitHub repo
- Azure Monitor and Prometheus Integration

Author

Dotan Paz
Sr. Cloud Solutions Architect, Microsoft

Azure Extended Zones: Optimizing Performance, Compliance, and Accessibility
Azure Extended Zones are small-scale Azure extensions located in specific metros or jurisdictions to support low-latency and data residency workloads. They enable users to run latency-sensitive applications close to end users while maintaining compliance with data residency requirements, all within the Azure ecosystem.