Azure Hardware Infrastructure
Announcing Kubernetes Center (Preview) On Azure Portal
Today, we're excited to introduce the Kubernetes Center in the Azure portal, a new experience that simplifies how customers manage, monitor, and optimize Azure Kubernetes Service (AKS) environments at scale. Kubernetes Center provides a unified view across all clusters, intelligent insights, and streamlined workflows that help platform teams stay in control while enabling developers to move fast.

As Kubernetes adoption accelerates, many teams face growing challenges in managing clusters and workloads at scale. Getting a quick snapshot of what needs attention across clusters and workloads can quickly become overwhelming. Kubernetes Center is designed to change that, offering a streamlined and intuitive experience that brings the most critical Kubernetes capabilities into a single pane of glass for unified visibility and control.

What is Kubernetes Center?

Actionable insights from the start: Kubernetes Center surfaces key issues like security vulnerabilities, cluster alerts, compliance gaps, and upgrade recommendations in a single, unified view. This helps teams focus immediately on what matters most, leading to faster resolution times, improved security posture, and greater operational clarity.

Streamlined management experience: By bringing together AKS, AKS Automatic, Fleet Manager, and Managed Namespaces into a single experience, we've reduced the need to jump between services. Everything you need to manage Kubernetes on Azure is now organized in one consistent interface.

Centralized Quickstarts: Whether you're getting started or troubleshooting advanced scenarios, Kubernetes Center brings relevant documentation, learning resources, and in-context help into one place so you can spend less time searching and more time building.

In the Azure portal, the distinct landing experiences for AKS, Fleet Manager, and Managed Kubernetes Namespaces are replaced by a single streamlined management experience: get the big picture at a glance, then dive deeper with individual pages designed for effortless discovery.

Next Steps: Build on your momentum by exploring Kubernetes Center. Create your first AKS cluster or deploy your first application using the Deploy Your Application flow and track your progress in real time, or check out the new experience and instantly see your existing clusters in a streamlined management view (a small SDK sketch at the end of this post shows the first-cluster step in code). Your feedback will help shape what comes next. Start building today with Kubernetes Center on Azure Portal!

Learn more: Create and Manage Kubernetes resources in the Azure portal with Kubernetes Center (preview) - Azure Kubernetes Service | Microsoft Learn

FAQ:

What products from Azure are included in Kubernetes Center?
A. Kubernetes Center brings together all your Azure Kubernetes resources, such as AKS, AKS Automatic, Fleet Manager, and Managed Namespaces, into a single interface for simplified operations. Create new resources or view your existing resources in Kubernetes Center.

Does Kubernetes Center handle multi-cluster management?
A. Kubernetes Center provides a unified interface, a single pane of glass, to view and monitor all your Kubernetes resources in one place. For multi-cluster operations like upgrading Kubernetes versions, placing cluster resources on N clusters, policy management, and coordination across environments, Kubernetes Fleet Manager is the solution designed to handle that complexity at scale. It enables teams to manage clusters at scale with automation, consistency, and operational control.
Does Kubernetes Center provide security and compliance insights?
A. Absolutely. When Microsoft Defender for Containers is enabled, Kubernetes Center surfaces critical security vulnerabilities and compliance gaps across your clusters.

Where can I find help and documentation?
A. All relevant documentation, Quickstarts, and learning resources are available directly within Kubernetes Center, making it easier to get support without leaving the platform. For more information: Create and Manage Kubernetes resources in the Azure portal with Kubernetes Center (preview) - Azure Kubernetes Service | Microsoft Learn

What is the status of this launch?
A. Kubernetes Center is currently in preview, offering core capabilities with more features planned for the general availability release.

What is the roadmap for GA?
A. Our roadmap includes adding new features and introducing tailored views designed for admins and developers. We also plan to enhance support for multi-cluster capabilities in Azure Fleet Manager, enabling smoother and more efficient operations within Kubernetes Center.
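For teams that want to script the first-cluster step mentioned above rather than use the portal flow, the following is a minimal sketch using the Azure SDK for Python (azure-mgmt-containerservice). The subscription ID, resource group, region, and cluster name are placeholder assumptions, and the resource group is assumed to already exist.

```python
# Minimal sketch: create a small AKS cluster programmatically.
# Assumes: pip install azure-identity azure-mgmt-containerservice
# Subscription ID, resource group, region, and names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient
from azure.mgmt.containerservice.models import (
    ManagedCluster,
    ManagedClusterAgentPoolProfile,
    ManagedClusterIdentity,
)

client = ContainerServiceClient(DefaultAzureCredential(), "<subscription-id>")

cluster = ManagedCluster(
    location="eastus",
    dns_prefix="demo-aks",
    identity=ManagedClusterIdentity(type="SystemAssigned"),
    agent_pool_profiles=[
        ManagedClusterAgentPoolProfile(
            name="systempool",
            mode="System",            # system pool hosts cluster components
            count=2,                  # two nodes is enough for a starter cluster
            vm_size="Standard_D2s_v3",
        )
    ],
)

# begin_create_or_update returns a poller; result() blocks until provisioning finishes.
poller = client.managed_clusters.begin_create_or_update(
    resource_group_name="rg-k8s-demo",
    resource_name="demo-aks",
    parameters=cluster,
)
print(poller.result().provisioning_state)
```

Once the cluster exists, it appears alongside your other AKS, Fleet Manager, and Managed Namespaces resources in the Kubernetes Center views described above.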
Announcing Cobalt 200: Azure's next cloud-native CPU

By Selim Bilgin, Corporate Vice President, Silicon Engineering, and Pat Stemen, Vice President, Azure Cobalt

Today, we're thrilled to announce Azure Cobalt 200, our next-generation Arm-based CPU designed for cloud-native workloads. Cobalt 200 is a milestone in our continued approach of optimizing every layer of the cloud stack, from silicon to software. Our design goals were to deliver full compatibility for workloads using our existing Azure Cobalt CPUs, deliver up to 50% performance improvement over Cobalt 100, and integrate with the latest Microsoft security, networking, and storage technologies. Like its predecessor, Cobalt 200 is optimized for common customer workloads and delivers unique capabilities for our own Microsoft cloud products. Our first production Cobalt 200 servers are now live in our datacenters, with wider rollout and customer availability coming in 2026.

Azure Cobalt 200 SoC and platform

Building on Cobalt 100: Leading Price-Performance

Our Azure Cobalt journey began with Cobalt 100, our first custom-built processor for cloud-native workloads. Cobalt 100 VMs have been generally available (GA) since October 2024, and availability has expanded rapidly to 32 Azure datacenter regions around the world. In just one year, we have been blown away by the pace at which customers have adopted the new platform and migrated their most critical workloads to Cobalt 100 for its performance, efficiency, and price-performance benefits. Cloud analytics leaders like Databricks and Snowflake are adopting Cobalt 100 to optimize their cloud footprint. The balance of compute performance and energy efficiency in Cobalt 100-based virtual machines and containers has proven ideal for large-scale data processing workloads. Microsoft's own cloud services have also rapidly adopted Azure Cobalt for similar benefits. Microsoft Teams achieved up to 45% better performance using Cobalt 100 than on its previous compute platform. This increased performance means fewer servers are needed for the same task; for instance, Microsoft Teams media processing uses 35% fewer compute cores with Cobalt 100.

Designing Compute Infrastructure for Real Workloads

With this solid foundation, we set out to design a worthy successor: Cobalt 200. We faced a key challenge: traditional compute benchmarks do not represent the diversity of our customer workloads. Our telemetry from the wide range of workloads running in Azure (from small microservices to globally available SaaS products) did not match common hardware performance benchmarks. Existing benchmarks tend to skew toward CPU core-focused compute patterns, leaving gaps in how real-world cloud applications behave at scale when using network and storage resources. Optimizing Azure Cobalt for customer workloads requires us to expand beyond these CPU core benchmarks to truly understand and model the diversity of customer workloads in Azure. As a result, we created a portfolio of benchmarks drawn directly from the usage patterns we see in Azure, including databases, web servers, storage caches, network transactions, and data analytics. Each of our benchmark workloads includes multiple variants for performance evaluation based on the ways our customers may use the underlying database, storage, or web serving technology. In total, we built and refined over 140 individual benchmark variants as part of our internal evaluation suite.
With the help of our software teams, we created a complete digital twin simulation from the silicon up: beginning with the CPU core microarchitecture, fabric, and memory IP blocks in Cobalt 200, all the way through the server design and rack topology. Then, we used AI, statistical modelling, and the power of Azure to model the performance and power consumption of the 140 benchmarks against 2,800 combinations of SoC and system design parameters: core count, cache size, memory speed, server topology, SoC power, and rack configuration. This resulted in the evaluation of over 350,000 configuration candidates of the Cobalt 200 system as part of our design process. This extensive modelling and simulation helped us quickly iterate to find the optimal design point for Cobalt 200, delivering over 50% increased performance compared to Cobalt 100, all while continuing to deliver our most power-efficient platform in Azure.

Cobalt 200: Delivering Performance and Efficiency

At the heart of every Cobalt 200 server is the most advanced compute silicon in Azure: the Cobalt 200 System-on-Chip (SoC). The Cobalt 200 SoC is built around the Arm Neoverse Compute Subsystems V3 (CSS V3), the latest performance-optimized core and fabric from Arm. Each Cobalt 200 SoC includes 132 active cores with 3 MB of L2 cache per core and 192 MB of L3 system cache to deliver exceptional performance for customer workloads.

Power efficiency is just as important as raw performance. Energy consumption represents a significant portion of the lifetime operating cost of a cloud server. One of the unique innovations in our Azure Cobalt CPUs is individual per-core Dynamic Voltage and Frequency Scaling (DVFS). In Cobalt 200, this allows each of the 132 cores to run at a different performance level, delivering optimal power consumption no matter the workload. We are also taking advantage of the latest TSMC 3nm process, further improving power efficiency.

Security is top of mind for all of our customers and a key part of the unique innovation in Cobalt 200. We designed and built a custom memory controller for Cobalt 200, so that memory encryption is on by default with negligible performance impact. Cobalt 200 also implements Arm's Confidential Compute Architecture (CCA), which supports hardware-based isolation of VM memory from the hypervisor and host OS.

When designing Cobalt 200, our benchmark workloads and design simulations revealed an interesting trend: several universal compute patterns emerged, namely compression, decompression, and encryption. Over 30% of cloud workloads made significant use of one of these common operations. Optimizing for these common operations required a different approach than cache sizing and CPU core selection alone. We designed custom compression and cryptography accelerators, dedicated blocks of silicon on each Cobalt 200 SoC, solely for the purpose of accelerating these operations without sacrificing CPU cycles. These accelerators help reduce workload CPU consumption and overall costs. For example, by offloading compression and encryption tasks to the Cobalt 200 accelerator, Azure SQL is able to reduce its use of critical compute resources, prioritizing them for customer workloads.

Leading Infrastructure Innovation with Cobalt 200

Azure Cobalt is more than just an SoC, and we are constantly optimizing and accelerating every layer of the infrastructure. The latest Azure Boost capabilities are built into the new Cobalt 200 system, which significantly improves networking and remote storage performance.
Azure Boost delivers increased network bandwidth and offloads remote storage and networking tasks to custom hardware, improving overall workload performance and reducing latency. Cobalt 200 systems also embed the Azure Integrated HSM (Hardware Security Module), providing customers with top-tier cryptographic key protection within Azure's infrastructure, ensuring sensitive data stays secure. The Azure Integrated HSM works with Azure Key Vault for simplified management of encryption keys, offering high availability and scalability as well as meeting FIPS 140-3 Level 3 compliance.

An Azure Cobalt 200 server in a validation lab

Looking Forward to 2026

We are excited about the innovation and advanced technology in Cobalt 200 and look forward to seeing how our customers create breakthrough products and services. We're busy racking and stacking Cobalt 200 servers around the world and look forward to sharing more as we get closer to wider availability next year.

Check out the Microsoft Ignite opening keynote
Read more on what's new in Azure at Ignite
Learn more about Microsoft's global infrastructure
Resiliency Best Practices You Need for Your Blob Storage Data

Maintaining Resiliency in Azure Blob Storage: A Guide to Best Practices

Azure Blob Storage is a cornerstone of modern cloud storage, offering scalable and secure solutions for unstructured data. However, maintaining resiliency in Blob Storage requires careful planning and adherence to best practices. In this blog, I'll share practical strategies to ensure your data remains available, secure, and recoverable under all circumstances.

1. Enable Soft Delete for Accidental Recovery (Most Important)
Mistakes happen, but soft delete can be your safety net. It allows you to recover deleted blobs within a specified retention period:
Configure a soft delete retention period in Azure Storage (a minimal SDK sketch appears at the end of this article).
Regularly monitor your blob storage to ensure that critical data is not permanently removed by mistake.
Enabling soft delete in Azure Blob Storage does not come with any additional cost for simply enabling the feature itself. However, it can potentially impact your storage costs because the deleted data is retained for the configured retention period, which means:
The retained data contributes to the total storage consumption during the retention period.
You will be charged according to the pricing tier of the data (Hot, Cool, or Archive) for the duration of retention.

2. Utilize Geo-Redundant Storage (GRS)
Geo-redundancy ensures your data is replicated across regions to protect against regional failures:
Choose RA-GRS (Read-Access Geo-Redundant Storage) for read access to secondary replicas in the event of a primary region outage.
Assess your workload's RPO (Recovery Point Objective) and RTO (Recovery Time Objective) needs to select the appropriate redundancy.

3. Implement Lifecycle Management Policies
Efficient storage management reduces costs and ensures long-term data availability:
Set up lifecycle policies to transition data between hot, cool, and archive tiers based on usage.
Automatically delete expired blobs to save on costs while keeping your storage organized.

4. Secure Your Data with Encryption and Access Controls
Resiliency is incomplete without robust security. Protect your blobs using:
Encryption at Rest: Azure automatically encrypts data using server-side encryption (SSE). Consider enabling customer-managed keys for additional control.
Access Policies: Implement Shared Access Signatures (SAS) and Stored Access Policies to restrict access and enforce expiration dates.

5. Monitor and Alert for Anomalies
Stay proactive by leveraging Azure's monitoring capabilities:
Use Azure Monitor and Log Analytics to track storage performance and usage patterns.
Set up alerts for unusual activities, such as sudden spikes in access or deletions, to detect potential issues early.

6. Plan for Disaster Recovery
Ensure your data remains accessible even during critical failures:
Create snapshots of critical blobs for point-in-time recovery.
Enable backup for blobs and turn on the immutability feature.
Test your recovery process regularly to ensure it meets your operational requirements.

7. Apply Resource Locks
Adding Azure locks to your Blob Storage account provides an additional layer of protection by preventing accidental deletion or modification of critical resources.

8. Educate and Train Your Team
Operational resilience often hinges on user awareness:
Conduct regular training sessions on Blob Storage best practices.
Document and share a clear data recovery and management protocol with all stakeholders.

9. Critical Tip: Do Not Create New Containers with Deleted Names During Recovery
If a container or blob is deleted for any reason and recovery is being attempted, it's crucial not to create a new container with the same name immediately. Doing so can significantly hinder the recovery process by overwriting backend pointers, which are essential for restoring the deleted data. Always ensure that no new containers are created using the same name during the recovery attempt to maximize the chances of successful restoration.

Wrapping It Up

Azure Blob Storage offers an exceptional platform for scalable and secure storage, but its resiliency depends on following best practices. By enabling features like soft delete, implementing redundancy, securing data, and proactively monitoring your storage environment, you can ensure that your data is resilient to failures and recoverable in any scenario.

Protect your Azure resources with a lock - Azure Resource Manager | Microsoft Learn
Data redundancy - Azure Storage | Microsoft Learn
Overview of Azure Blobs backup - Azure Backup | Microsoft Learn
Reimagining AI at scale: NVIDIA GB300 NVL72 on Azure

By Gohar Waqar, CVP of Cloud Hardware Infrastructure Engineering, Microsoft

Microsoft was the first hyperscaler to deploy the NVIDIA GB300 NVL72 infrastructure at scale, with a fully integrated platform engineered to deliver unprecedented compute density in a single rack to meet the demands of agentic AI workloads. Each GB300 NVL72 rack packs 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace™ CPUs with up to ~136 kW of IT load, enabled by Microsoft's custom liquid cooling heat exchanger unit (HXU) system. Using a systems approach to architect GB300 clusters, Azure's new NDv6 GB300 VMs include robust infrastructure innovation across every layer of the stack, including smart rack management for fleet health, innovative cooling systems, and efficient deployment features that make scaling high-density AI clusters easier than ever. With purpose-built hardware engineered for a unified platform, from silicon to systems to software, Azure's deployment of NVIDIA GB300 NVL72 is a clear representation of Microsoft's commitment to raising the bar on accelerated computing, enabling training of multitrillion-parameter models and high throughput on inference workloads.

Unique features of the NVIDIA GB300 NVL72 system on Microsoft Azure

Ultra-dense AI rack - The GB300 rack integrates 72 NVIDIA Blackwell Ultra GPUs (with 288 GB of HBM3e each) and 36 Grace CPUs, effectively delivering supercomputer-class performance in a single rack.
Advanced liquid cooling - Each rack uses direct-to-chip liquid cooling. In air-cooled datacenters, external liquid cooling heat exchanger unit (HXU) radiator units in each rack dissipate ~136 kW to room air. In facilities with chilled water, the rack connects directly to facility water.
Smart rack management - The system is equipped with an embedded controller that monitors power, temperature, coolant flow, and leak sensors in real time. It can auto-throttle or shut down components if conditions go out of range and provides full telemetry for remote fleet diagnostics.
Fully integrated security and offload features - Our unique design also includes the Azure Integrated Hardware Security Module (HSM) chip and the Azure Boost offload accelerator for advanced I/O and security performance.
Scalable datacenter deployment - GB300 arrives as an integrated rack (compute trays, NVIDIA NVLink™ fabric, cooling, and power shelves pre-installed). Deployment is streamlined, requiring only connectivity, power, and cooling hookups plus initial checks; the rack then self-regulates its cooling and power distribution.

Purpose-built architecture designed for rapid deployment and scale

At its core, GB300 is built to maximize AI compute density within a standard datacenter footprint. It is a single-rack AI inference and training cluster with unprecedented component density. Compared to the previous generation (NVIDIA GB200 NVL72), it introduces higher-performance GPUs (from ~1.2 kW to ~1.4 kW each, with more HBM3e memory), a ~50% boost in NVFP4 throughput, and a revamped power/cooling design to handle ~20% greater thermal and power load. The liquid cooling system for the GPU module is enhanced with a new cold plate and an improved leak detection assembly for safe, high-density operation. Innovations in our purpose-built Azure Boost accelerator for I/O offload unlock higher bandwidth, while our custom Datacenter-secure Control Module (DC-SCM) introduces a secure, modular control plane built on a hardware root of trust, backed by the Azure Integrated Hardware Security Module (HSM).
Together, these advancements enable fleet-wide manageability, strengthening security and operational resilience at scale to meet the demands of hyperscale environments.

Cooling systems designed for deployability and global resiliency

To dissipate ~136 kW of heat per rack, GB300 relies on direct liquid cooling for all major components. To offer resiliency and wide deployability across Microsoft's datacenter footprint, our cooling designs support both facility-water and air-cooled environments. Both approaches use a closed coolant loop inside the rack with a treated water-glycol fluid. Leak detection cables line each tray, and the base of the rack is equipped with smart management protocols to address potential leaks. Using this method, liquid cooling is highly efficient and reliable: it allows GB300 to run with warmer coolant temperatures than traditional datacenter water, improving overall power usage effectiveness (PUE).

Smart management, fleet health & diagnostics

Each GB300 rack is a "smart IT rack" with an embedded management controller that oversees its operation. This controller is supported by a rack control module that serves as the brain of the rack, providing comprehensive monitoring and automation for power, cooling, and health diagnostics. By delivering an integrated "single pane of glass" view of each rack's health, GB300 makes management at scale feasible despite the complexity. The rack self-regulates its power and thermal environment once installed, adjusting fan or pump speeds automatically, and isolates faults, reducing the manual effort needed to keep the cluster running optimally so customers can focus on their workloads, confident that the infrastructure is continuously self-monitoring and safeguarding itself. In addition, the rack control module monitors and moderates GPU peak power consumption and other power management scenarios. These robust design choices reflect a fleet-first mindset, maximizing uptime and easing diagnostics in large deployments.

Efficient and streamlined deployment

As Microsoft scales thousands of GB300 racks for increased AI supercomputing capacity, fast and repeatable deployment is critical. GB300 introduces a new era of high-density AI infrastructure, tightly integrating cutting-edge hardware (Grace CPUs, Blackwell Ultra GPUs, and NVLink connectivity) with innovations in both power delivery and liquid cooling. Crucially, it does so with an eye toward operational excellence: built-in management, health diagnostics, and deployment-friendly design mean that scaling up AI clusters with GB300 can be done rapidly and reliably. With its unprecedented compute density, intelligent self-management, and flexible cooling options, the GB300 platform enables organizations to scale rapidly with the latest AI supercomputer hardware while maintaining the reliability and serviceability expected from Azure's promise to customers. GB300 unlocks next-level AI performance delivered in a package engineered for real-world efficiency and fleet-scale success.
Operational Excellence in AI Infrastructure Fleets: Standardized Node Lifecycle Management

Co-authors: Choudary Maddukuri and Bhushan Mehendale

AI infrastructure is scaling at an unprecedented pace, and the complexity of managing it is growing just as quickly. Onboarding new hardware into hyperscale fleets can take months, slowed by fragmented tools, vendor-specific firmware, and inconsistent diagnostics. As hyperscalers expand with diverse accelerators and CPU architectures, operational friction has become a critical bottleneck. Microsoft, in collaboration with the Open Compute Project (OCP) and leading silicon partners, is addressing this challenge. By standardizing lifecycle management across heterogeneous fleets, we've dramatically reduced onboarding effort, improved reliability, and achieved >95% Nodes-in-Service across very large fleets. This blog explores how we are contributing to and leveraging open standards to transform fragmented infrastructure into scalable, vendor-neutral AI platforms.

Industry Context & Problem

The rapid growth of generative AI has accelerated the adoption of GPUs and accelerators from multiple vendors, alongside diverse CPU architectures such as Arm and x86. Each new hardware SKU introduces its own ecosystem of proprietary tools, firmware update processes, management interfaces, reliability mechanisms, and diagnostic workflows. This hardware diversity leads to engineering toil, delayed deployments, and inconsistent customer experiences. Without a unified approach to lifecycle management, hyperscalers face escalating operational costs, slower innovation, and reduced efficiency.

Node Lifecycle Standardization: Enabling Scalable, Reliable AI Infrastructure

Microsoft, through the Open Compute Project (OCP) in collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, is leading an industry-wide initiative to standardize AI infrastructure lifecycle management across GPU and CPU hardware management workstreams. Historically, onboarding each new SKU was a highly resource-intensive effort due to custom implementations and vendor-specific behaviors that required extensive Azure integration. This slowed scalability, increased engineering overhead, and limited innovation. With standardized node lifecycle processes and compliance tooling, hyperscalers can now onboard new SKUs much faster, achieving over a 70% reduction in effort while enhancing overall fleet operational excellence. These efforts also enable silicon vendors to ensure interoperability across multiple cloud providers.

Figure: How standardization benefits both hyperscalers and suppliers.

Key Benefits and Capabilities

Firmware Updates: Firmware update mechanisms aligned with DMTF standards minimize downtime and streamline secure fleet-wide deployments.
Unified Manageability Interfaces: Standardized Redfish APIs and PLDM protocols create a consistent framework for out-of-band management, reducing integration overhead and ensuring predictable behavior across hardware vendors (see the Redfish sketch at the end of this article).
RAS (Reliability, Availability and Serviceability) Features: Standardization enforces minimum RAS requirements across all IP blocks, including CPER (Common Platform Error Record) based error logging, crash dumps, and error recovery flows to enhance system uptime.
Debug & Diagnostics: Unified APIs and standardized crash and debug dump formats reduce issue resolution time from months to days. Streamlined diagnostic workflows enable precise FRU isolation and clear service actions.
Compliance Tooling: Tool contributions such as CTAM (Compliance Tool for Accelerator Manageability) and CPACT (Cloud Processor Accessibility Compliance Tool) automate compliance and acceptance testing, ensuring suppliers meet hyperscaler requirements for seamless onboarding.

Technical Specifications & Contributions

Through deep collaboration within the Open Compute Project (OCP) community, Microsoft and its partners have published multiple specifications that streamline SKU development, validation, and fleet operations.

Summary of Key Contributions (specification, focus area, benefit):
GPU Firmware Update requirements (Firmware Updates): enables consistent firmware update processes across vendors.
GPU Management Interfaces (Manageability): standardizes telemetry and control via Redfish/PLDM.
GPU RAS Requirements (Reliability and Availability): reduces AI job interruptions caused by hardware errors.
CPU Debug and RAS requirements (Debug and Diagnostics): achieves >95% node serviceability through unified diagnostics and debug.
CPU Impactless Updates requirements (Impactless Updates): enables impactless firmware updates that address security and quality issues without workload interruptions.
Compliance Tools (Validation): automates specification compliance testing for faster hardware onboarding.

Embracing Open Standards: A Collaborative Shift in AI Infrastructure Management

This standardized approach to lifecycle management represents a foundational shift in how AI infrastructure is maintained. By embracing open standards and collaborative innovation, the industry can scale AI deployments faster, with greater reliability and lower operational cost. Microsoft's leadership within the OCP community, and its deep partnerships with other hyperscalers and silicon vendors, are paving the way for scalable, interoperable, and vendor-neutral AI infrastructure across the global cloud ecosystem. To learn more about Microsoft's datacenter innovations, check out the virtual datacenter tour at datacenters.microsoft.com.
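To illustrate what a standardized out-of-band interface looks like in practice, below is a small sketch that reads a node's firmware inventory over the DMTF Redfish API, the same interface family referenced in the manageability specifications above. The BMC address and credentials are placeholders, and the exact inventory entries returned vary by platform; the URIs themselves are defined by the Redfish standard.

```python
# Minimal sketch: list a node's firmware inventory over Redfish (DMTF standard).
# Assumes: pip install requests; a reachable BMC with Redfish enabled.
# The BMC address and credentials below are placeholders.
import requests

BMC = "https://<bmc-address>"

session = requests.Session()
session.auth = ("<username>", "<password>")
session.verify = False  # lab-only; use proper certificate validation in production

# FirmwareInventory is a standard Redfish collection under the UpdateService.
inventory = session.get(f"{BMC}/redfish/v1/UpdateService/FirmwareInventory").json()

for member in inventory.get("Members", []):
    item = session.get(f"{BMC}{member['@odata.id']}").json()
    # Each SoftwareInventory entry reports a component and its running firmware version.
    print(f"{item.get('Id')}: {item.get('Version')}")
```

Because the same URIs and schemas apply regardless of which vendor built the tray, tooling written once against Redfish and PLDM can drive firmware checks across heterogeneous nodes, which is precisely the onboarding effort the specifications above aim to remove.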
Next Generation HXU: Doubling Cooling Power for the AI Era

AI is rewriting the rules of computing, and datacenter cooling is no exception. As workloads grow hotter and more dense, traditional air cooling simply can't keep up. Enter our next-generation Heat Exchanger Unit (HXU), designed to deliver 2X the cooling capacity in the same footprint while boosting reliability for mission-critical AI deployments.

Enabling Liquid Cooling for AI Systems in Air-Cooled Datacenters

Hyperscale datacenters are at the heart of the AI revolution, but they face a critical challenge: power density. Modern AI accelerators can push racks beyond 200 kW, far exceeding the limits of legacy air-cooled infrastructure. Retrofitting entire facilities for liquid cooling is costly and disruptive, leaving operators searching for solutions that balance performance, efficiency, and compatibility.

Delivering a New Standard in Cooling and Reliability for AI Infrastructure

The next-generation HXU delivers twice the thermal performance of our first-generation HXU without increasing the physical footprint. This leap in efficiency keeps even the most demanding workloads adequately cooled. Reliability and availability are at the core of our next-generation HXU. With redundant pumps, fans, and power feeds, each unit is designed to offer >99.9% uptime, safeguarding critical AI operations against unexpected disruptions. Looking ahead, our next-generation HXU is built to scale with the future of AI. Supporting rack densities greater than 240 kW, it enables next-generation AI accelerators without requiring major facility overhauls. In short, our next-generation HXU combines performance, resilience, and scalability to power the AI era.

Compact, Powerful Design

Form Factor: Same two-tile width as our last-generation HXU for seamless deployment in existing aisles.
Cooling Capacity: Greater than 240 kW per rack in existing air-cooled datacenters.
Architecture: Modular design for pumps, fans, and filters. Quick-disconnect couplings and segmented manifolds for flexible configurations. Leak detection and drip pans for enhanced coolant management.

Reliability & Security

Redundancy: Multi-pump and dual-power configurations for high availability.
Cybersecurity: Secure boot, NIST SP 800-53, CIS controls, and ISO/IEC 27001 compliance for firmware integrity.
Telemetry: Predictive analytics for proactive maintenance.

Paving the Path Forward on High-Density Cooling

The next-generation HXU marks a major leap forward in sustainable, high-density cooling for AI datacenters. By doubling cooling capacity and improving reliability, all within the same footprint, Microsoft is enabling the next wave of AI innovation without compromising efficiency or uptime. Join the conversation: explore our open-source design contributions through the Open Compute Project Foundation and help shape the future of datacenter cooling.
Caliptra 2.1: An Open-Source Silicon Root of Trust with Enhanced Protection of Data at Rest

Introducing Caliptra 2.1: an open-source silicon root of trust subsystem providing enhanced protection of data at rest. Building upon Caliptra 1.0, which included capabilities for identity and measurement, Caliptra 2.1 represents a significant leap forward. It provides a complete RoT security subsystem, quantum-resilient cryptography, and extensions to hardware-based key management, delivering defense-in-depth capabilities. The Caliptra 2.1 subsystem is a foundational element for securing devices, anchoring in hardware a trusted chain for protection, detection, and recovery.
Behind the Azure AI Foundry: Essential Azure Infrastructure & Cost Insights

What is Azure AI Foundry?

Azure AI Studio has been renamed to Azure AI Foundry. Azure AI Foundry is a unified AI development platform where teams can efficiently manage AI projects, deploy and test generative AI models, integrate data for prompt engineering, define workflows, and implement content security filters. This powerful tool enhances AI solutions with a wide range of functionalities. It is a one-stop shop for everything you need for AI development.

Azure AI Hubs are collaborative workspaces for developing and managing AI solutions. To use AI Foundry's features effectively, you need at least one Azure AI Hub. An Azure AI Hub can host multiple projects. Each project includes the tools and resources needed to create a specific AI solution. For example, you can set up a project to help data scientists and developers work together on building a custom Copilot business application. You can use Azure AI Foundry to create an Azure AI Hub, or you can create a hub while creating a new project. This creates an AI Hub resource in your Azure subscription in the resource group you specify, providing a workspace for collaborative AI development. https://ai.azure.com/

Azure Infrastructure

The Azure AI Foundry environment utilizes Azure's robust AI infrastructure to facilitate the development, deployment, and management of AI models across various scenarios. Below is the list of Azure resources required to deploy the environment. Make sure the resource providers below are enabled for your subscription to deploy these Azure resources.

Azure AI Foundry (Microsoft.CognitiveServices/account, kind AIServices): enables intelligent and efficient interaction across Agents, Evaluations, Azure OpenAI, Speech, Vision, Language, and Content Understanding services, leveraging Azure's AI capabilities to deliver comprehensive solutions for multimodal understanding, decision-making, and automation.
Azure AI Foundry project (Microsoft.CognitiveServices/account/project, kind AIServices): subresource of the Azure AI Foundry account above.
Azure AI Foundry Hub (Microsoft.MachineLearningServices/workspace, kind Hub): associated with the Azure Machine Learning service workspace, this resource serves as a central hub for managing machine learning experiments, models, and data. It provides capabilities for creating, organizing, and collaborating on AI projects.
Azure AI Foundry Project (Microsoft.MachineLearningServices/workspace, kind Project): within an Azure AI Foundry hub, you can create projects. Projects allow you to organize your work, collaborate with others, and track experiments related to specific tasks or use cases, providing a structured environment for AI development.
Azure OpenAI Service (Microsoft.CognitiveServices/account, kind AI Services): an Azure AI Services resource serving as the model-as-a-service endpoint provider, including GPT-4/4o and Ada text embedding models.
Azure AI Search (Microsoft.Search/searchServices, kind Search Service): creates indexes on your data and provides search capabilities for your projects.
Azure Storage Account (Microsoft.Storage/storageAccounts, kind Storage): associated with the Azure AI Foundry workspace; stores artifacts for your projects (for example, flows and evaluations).
Azure Key Vault (Microsoft.KeyVault/vaults, kind Key Vault): associated with the Azure AI Foundry workspace; stores secrets such as connection strings for resource connections.
Azure Container Registry (optional) (Microsoft.ContainerRegistry/registries, kind Container Registry): stores Docker images created when using a custom runtime for prompt flow.
Azure Application Insights (Microsoft.Insights/components, kind Monitoring): an Application Insights instance associated with the Azure AI Foundry workspace; used for application-level logging in deployed prompts.
Log Analytics Workspace (optional) (Microsoft.OperationalInsights/workspaces, kind Monitoring): used for log storage and analysis.
Event Grid (Microsoft.Storage/storageaccounts/providers/extensiontopics, kind Event Grid System Topic): automates workflows by triggering actions in response to events across Azure services, ensuring dynamic and efficient operations in an Azure AI solution.

AI Foundry Environment: Azure Portal View

AI Foundry Portal View

All dependent resources are connected to the hub, and some resources (Azure OpenAI and Azure AI Search) can be shared across projects.

Pricing

Since Azure AI Foundry is assembled from multiple Azure services, pricing depends on architectural decisions and usage. When building your own Azure AI solution, it's essential to consider the costs it accrues in Azure AI Foundry. Below are the areas where costs are incurred:

1. Compute Hours and Tokens: Unlike fixed monthly costs, Azure AI hubs, Azure OpenAI, and projects are billed based on compute hours and tokens used. Be mindful of resource utilization to avoid unexpected charges.
2. Networking Costs: By default, the hub network configuration is public, but if you want to secure the Azure AI Hub, there are costs associated with data transfer.
3. Additional Resources: Beyond AI services, consider other Azure resources like Azure Key Vault, Storage, Application Insights, and Event Grid. These services charge based on transactions and data volume.

AI Foundry Cost Pane

In the Azure Pricing Calculator, you can now find the upfront monthly cost of the resources under the Example Scenarios tab, in the Azure AI Foundry scenario. This cost calculation feature is now generally available. You can also use Cost Management and Azure resource tags to help with a detailed resource-level cost breakdown (a small SDK sketch at the end of this article shows one way to inventory those resources). Please note that adding vector search in AI Search requires an Azure OpenAI embedding model; the embedding model text-embedding-ada-002 (version 2) will be deployed if it is not already. Adding vector embeddings will incur usage on your account. Vector search is available as part of all Azure AI Search tiers in all regions at no extra charge. If you need to group the costs of these different services together, it is recommended to create hubs in one or more dedicated resource groups and subscriptions in your Azure environment. You can navigate to the cost estimate for your resource group from the view cost of resources option in Azure AI Foundry.

Azure Pricing Calculator

To learn more about Azure AI Foundry pricing, see Azure AI Foundry - Pricing | Microsoft Azure.

Conclusion

Azure AI Foundry enables a path forward for enterprises serious about AI transformation: not just experiments, but scalable, governable, cost-predictable, and responsible AI systems that leverage the robust infrastructure of the Azure cloud. This integration helps maintain and cater to business goals while simultaneously providing a competitive edge in an AI-driven market.

Resources and getting started with Azure AI

Azure AI Portfolio: Explore Azure AI.
Azure AI Infrastructure: Microsoft AI at Scale. Azure AI Infrastructure.
Azure OpenAI Service: Azure OpenAI Service documentation.
Explore the playground and customization in the Azure AI Foundry portal.
Copilot Studio: Copilot Learning Hub (Step 1: Understand Copilot; Step 2: Adopt Copilot; Step 3: Extend Copilot; Step 4: Build Copilot). Stay up to date on Copilot - What's new in Copilot Studio.
GPT-4.5 model request MS form link. Please note this is currently limited to the US region only, as the underlying Azure AI infrastructure is undergoing significant advancements and continues to evolve to meet the demands of modern technology and innovation.
Copilot & AI Agents
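As a companion to the cost-tracking advice above, the sketch below lists every resource in a dedicated Foundry resource group along with its type and tags, which is one simple way to start a resource-level cost breakdown. It uses the azure-mgmt-resource Python package; the subscription ID and resource group name are placeholder assumptions, as is the idea that the hub and its dependent resources live in one dedicated resource group.

```python
# Minimal sketch: inventory the resources behind an AI Foundry hub for cost tracking.
# Assumes: pip install azure-identity azure-mgmt-resource
# The subscription ID and resource group name below are placeholders.
from collections import Counter

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

by_type = Counter()
for res in client.resources.list_by_resource_group("rg-ai-foundry"):
    # res.type is the resource provider type, e.g. Microsoft.CognitiveServices/accounts
    by_type[res.type] += 1
    tags = res.tags or {}
    print(f"{res.name:40s} {res.type:55s} tags={tags}")

# A per-type summary makes it easier to map resources to their pricing meters.
for resource_type, count in by_type.most_common():
    print(f"{count} x {resource_type}")
```

Pairing this inventory with consistent resource tags applied at deployment time (for example a project or cost-center tag) lets Microsoft Cost Management group the charges along the same lines.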
Mt Diablo - Disaggregated Power Fueling the Next Wave of AI Platforms

AI platforms have quickly shifted the industry from rack power levels near 20 kilowatts to a hundred kilowatts and beyond in the span of just a few years. To enable the largest accelerator pod size within a physical rack domain, and to enable scalability between platforms, we are moving to a disaggregated power rack architecture. Our disaggregated power rack is known as Mt Diablo and comes in both 48-volt and 400-volt flavors. This shift enables us to dedicate more of the server rack to AI accelerators and at the same time gives us the flexibility to scale power to meet the needs of today's platforms and the platforms of the future. This forward-thinking strategy enables us to move faster and foster collaboration to power the world's most complex AI systems.