Azure CycleCloud 8.8 and CCWS 1.2 at SC25 and Ignite
Azure CycleCloud 8.8: Advancing HPC & AI Workloads with Smarter Health Checks

Azure CycleCloud continues to evolve as the backbone for orchestrating high-performance computing (HPC) and AI workloads in the cloud. With the release of CycleCloud 8.8, users gain access to a suite of new features designed to streamline cluster management, enhance health monitoring, and future-proof their HPC environments.

Key Features in CycleCloud 8.8

1. ARM64 HPC Support
The platform expands its hardware compatibility with ARM64 HPC support, opening new possibilities for energy-efficient and cost-effective compute clusters. This includes access to the newer generation of GB200 VMs as well as general ARM64 support, enabling new AI workloads at a scale never possible before.

2. Slurm Topology-Aware Scheduling
The integration of topology-aware scheduling for Slurm clusters allows CycleCloud users to optimize job placement based on network and hardware topology. This leads to improved performance for tightly coupled HPC workloads and better utilization of available resources (see the first sketch after this list).

3. NVIDIA MNNVL and IMEX Support
With expanded support for NVIDIA MNNVL and IMEX, CycleCloud 8.8 ensures compatibility with the latest GPU technologies. This enables users to leverage cutting-edge hardware for AI training, inference, and scientific simulations.

4. HealthAgent: Event-Driven Health Monitoring and Alerting
A standout feature in this release is the enhanced HealthAgent, which delivers event-driven health monitoring and alerting. CycleCloud now proactively detects issues across clusters, nodes, and interconnects, providing real-time notifications and actionable insights. This improvement is a game-changer for maintaining uptime and reliability in large-scale HPC deployments. HealthAgent supports both impactful health checks, which can only run while nodes are idle, and non-impactful health checks, which can run throughout the lifecycle of a job. This allows CycleCloud to alert on issues that occur not only while nodes are starting, but also on failures that develop on long-running nodes (see the second sketch after this list). Later releases of CycleCloud will also include automatic remediation for common failures, so stay tuned!

5. Enterprise Linux 9 and Ubuntu 24 Support
One common request has been wider support for the various Enterprise Linux (EL) 9 variants, including RHEL 9, AlmaLinux 9, and Rocky Linux 9. CycleCloud 8.8 introduces support for those distributions as well as the latest Ubuntu HPC release.
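To illustrate what topology-aware scheduling consumes, here is a minimal sketch that generates a Slurm tree topology file (topology.conf) from a switch-to-node mapping. The two-tier layout and the switch and node names are assumptions for illustration, not CycleCloud output:

```python
# Minimal sketch: generate a Slurm tree topology.conf from a
# hypothetical switch-to-node mapping. Slurm's topology/tree plugin
# reads this file so the scheduler can place tightly coupled jobs on
# nodes that share a switch. Names below are illustrative assumptions.

leaf_switches = {
    "tor1": "node[001-018]",  # nodes under top-of-rack switch 1
    "tor2": "node[019-036]",  # nodes under top-of-rack switch 2
}

lines = [f"SwitchName={sw} Nodes={nodes}" for sw, nodes in leaf_switches.items()]
# A spine switch joins the leaf switches into one tree.
lines.append("SwitchName=spine Switches=" + ",".join(leaf_switches))

with open("topology.conf", "w") as f:
    f.write("\n".join(lines) + "\n")

print("\n".join(lines))
```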
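The split between impactful and non-impactful health checks can be pictured with a small sketch. The check names and the idle/busy gating below are hypothetical stand-ins, not the HealthAgent interface; the point is only that impactful checks are gated on idle nodes while non-impactful checks run throughout a job's lifecycle:

```python
# Toy sketch of the impactful vs. non-impactful health-check split.
# These check names and this gating model are hypothetical; they are
# not the HealthAgent API.
from dataclasses import dataclass

@dataclass
class HealthCheck:
    name: str
    impactful: bool  # True: may disturb running jobs, so only run when idle

CHECKS = [
    HealthCheck("gpu_bandwidth_stress", impactful=True),  # saturates the GPU
    HealthCheck("ecc_error_counter", impactful=False),    # passive read
    HealthCheck("ib_link_state", impactful=False),        # passive read
]

def runnable_checks(node_is_idle: bool) -> list[HealthCheck]:
    """Non-impactful checks always run; impactful ones need an idle node."""
    return [c for c in CHECKS if not c.impactful or node_is_idle]

for idle in (True, False):
    names = [c.name for c in runnable_checks(idle)]
    print(f"idle={idle}: {names}")
```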
Why These Features Matter

The CycleCloud 8.8 release marks a significant leap forward for organizations running HPC and AI workloads in Azure. The improved health check support, anchored by HealthAgent today and by the automated remediation planned for later releases, means less downtime, faster troubleshooting, and greater confidence in cloud-based research and innovation. Whether you're managing scientific simulations, AI model training, or enterprise analytics, CycleCloud's latest features help you build resilient, scalable, and future-ready HPC environments.

Key Features in CycleCloud Workspace for Slurm 1.2

Along with the release of CycleCloud 8.8 comes a new CycleCloud Workspace for Slurm (CCWS) release. This release includes the General Availability of features that were previously in preview, such as Open OnDemand, Cendio ThinLinc, and managed Grafana monitoring capabilities. In addition to previously announced features, CCWS 1.2 also includes support for a new Hub and Spoke deployment model. This allows customers to retain a central hub of shared resources that can be reused between cluster deployments, with "disposable" spoke clusters that branch from the hub. Hub and Spoke deployments help customers who need to redeploy clusters in order to upgrade their operating system, deploy new versions of software, or even reconfigure the overall architecture of their Slurm clusters. A minimal sketch of this pattern follows.
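Here is one way the lifecycle could look, assuming hypothetical hub.json and spoke.json ARM templates, parameter names, and resource group names (this is not the CCWS deployment code): the long-lived hub is deployed once, and each disposable spoke cluster deployment references the hub's shared resources.

```python
# Hypothetical sketch of a hub-and-spoke deployment loop using the
# Azure SDK for Python. Template files, parameter names, and resource
# group names are illustrative assumptions, not CCWS artifacts.
import json
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def deploy(resource_group: str, name: str, template_path: str, params: dict):
    """Deploy an ARM template and block until the deployment completes."""
    with open(template_path) as f:
        template = json.load(f)
    poller = client.deployments.begin_create_or_update(
        resource_group,
        name,
        {
            "properties": {
                "mode": "Incremental",
                "template": template,
                "parameters": {k: {"value": v} for k, v in params.items()},
            }
        },
    )
    return poller.result()

# Deploy the long-lived hub (shared network, storage, etc.) once.
hub = deploy("ccws-hub-rg", "hub", "hub.json", {})

# Spoke clusters are disposable: deploy, use, delete, then redeploy with
# a new OS image or Slurm configuration, all without touching the hub.
deploy("ccws-spoke-rg", "spoke-1", "spoke.json",
       {"hubVnetId": hub.properties.outputs["vnetId"]["value"]})
```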
Come visit us at SC25 and MS Ignite

To learn more about these features, come visit us at the Microsoft booth at #SC25 in St. Louis, MO, and at #Microsoft #Ignite in San Francisco this week!

Azure ND GB300 v6 now Generally Available - Hyper-optimized for Generative and Agentic AI workloads

We are pleased to announce the General Availability (GA) of ND GB300 v6 virtual machines, delivering the next leap in AI infrastructure. On October 9, we shared the delivery of the first at-scale production cluster with more than 4,600 GPUs built on NVIDIA GB300 NVL72, featuring NVIDIA Blackwell Ultra GPUs connected through the next-generation NVIDIA InfiniBand network. We have now deployed tens of thousands of GB300 GPUs for production customer workloads and expect to scale to hundreds of thousands. Built on NVIDIA GB300 NVL72 systems, these VMs redefine performance for frontier model training, large-scale inference, multimodal reasoning, and agentic AI. The ND GB300 v6 series enables customers to:

- Deploy trillion-parameter models with unprecedented throughput.
- Accelerate inference for long-context and multimodal workloads.
- Scale seamlessly at high bandwidth for large-scale training workloads.

In recent benchmarks, ND GB300 v6 achieved over 1.1 million tokens per second on Llama 2 70B inference workloads, a 27% uplift over ND GB200 v6. This performance breakthrough enables customers to serve long-context, multimodal, and agentic AI models with unmatched speed and efficiency. With the general availability of ND GB300 v6 VMs, Microsoft strengthens its long-standing collaboration with NVIDIA by leading the market in delivering the latest GPU innovations, reaffirming our commitment to world-class AI infrastructure.

The ND GB300 v6 systems are built in a rack-scale design, with each rack hosting 18 VMs for a total of 72 GPUs interconnected by high-speed NVLink. Each VM has 2 NVIDIA Grace CPUs and 4 Blackwell Ultra GPUs. Each NVLink-connected rack contains:

- 72 NVIDIA Blackwell Ultra GPUs (with 36 NVIDIA Grace CPUs).
- 800 gigabits per second (Gb/s) of cross-rack scale-out bandwidth per GPU via next-generation NVIDIA Quantum-X800 InfiniBand (2x ND GB200 v6).
- 130 terabytes (TB) per second of NVIDIA NVLink bandwidth within the rack.
- 37 TB of fast memory (~20 TB HBM3e + ~17 TB LPDDR).
- Up to 1,440 petaflops (PFLOPS) of FP4 Tensor Core performance (1.5x ND GB200 v6).
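As a sanity check, the short calculation below recomputes these rack-level totals from per-device figures. The per-GPU scale-out bandwidth, FP4 total, and benchmark numbers come from the text above; the per-device memory capacities (~288 GB HBM3e per Blackwell Ultra GPU, ~480 GB LPDDR per Grace CPU) are assumptions of this sketch.

```python
# Recompute the rack-level figures quoted above from per-device numbers.
# Per-device memory capacities are assumptions of this sketch.
GPUS_PER_RACK = 72    # 18 VMs x 4 Blackwell Ultra GPUs
GRACE_PER_RACK = 36   # 18 VMs x 2 Grace CPUs

hbm_tb = GPUS_PER_RACK * 288 / 1000     # ~20.7 TB HBM3e
lpddr_tb = GRACE_PER_RACK * 480 / 1000  # ~17.3 TB LPDDR
# Total lands near the quoted ~37 TB of fast memory after rounding.
print(f"fast memory per rack: ~{hbm_tb + lpddr_tb:.0f} TB "
      f"({hbm_tb:.1f} TB HBM3e + {lpddr_tb:.1f} TB LPDDR)")

# 72 GPUs x 800 Gb/s each, converted Gb/s -> TB/s.
scale_out_tbps = GPUS_PER_RACK * 800 / 8 / 1000
print(f"aggregate cross-rack scale-out: {scale_out_tbps:.1f} TB/s per rack")

fp4_per_gpu = 1440 / GPUS_PER_RACK
print(f"FP4 Tensor Core: {fp4_per_gpu:.0f} PFLOPS per GPU, 1,440 PFLOPS per rack")

# The quoted 27% inference uplift implies a GB200 baseline of roughly:
gb200_tokens = 1.1e6 / 1.27
print(f"implied ND GB200 v6 baseline: ~{gb200_tokens / 1e6:.2f}M tokens/s")
```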
Together, NVLink and XDR InfiniBand enable GB300 systems to behave as a unified compute and memory pool, minimizing latency, maximizing bandwidth, and dramatically improving scalability. Within a rack, NVLink enables coherent memory access and fast synchronization for tightly coupled workloads. Across racks, XDR InfiniBand ensures ultra-low-latency, high-throughput communication with SHARP offloading, maintaining sub-100 µs latency for cross-node collectives.

Azure provides an end-to-end AI platform that enables customers to build, deploy, and scale AI workloads efficiently on GB300 infrastructure. Services like Azure CycleCloud and Azure Batch simplify the setup and management of HPC and AI environments, allowing organizations to dynamically adjust resources, integrate leading schedulers, and run containerized workloads at massive scale. With tools such as CycleCloud Workspace for Slurm, users can create and configure clusters without prior expertise, while Azure Batch handles millions of parallel tasks, ensuring cost and resource efficiency for large-scale training. For cloud-native AI, Azure Kubernetes Service (AKS) offers rapid deployment and management of containerized workloads, complemented by platform-specific optimizations for observability and reliability. Whether using Kubernetes or custom stacks, Azure delivers a unified suite of services to maximize performance and scalability.

Learn More & Get Started

- Microsoft Azure delivers the first large-scale cluster with NVIDIA GB300 NVL72 for OpenAI workloads: https://azure.microsoft.com/en-us/blog/microsoft-azure-delivers-the-first-large-scale-cluster-with-nvidia-gb300-nvl72-for-openai-workloads/
- Breaking the million-token barrier: the technical achievement of Azure ND GB300 v6: https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/breaking-the-million-token-barrier-the-technical-achievement-of-azure-nd-gb300-v/4466080
- NVIDIA Blog: Azure's GB300 NVL72 Supercomputing Cluster
- Azure VM Sizes Overview