Performance at Scale: The Role of Interconnects in Azure HPC & AI Infrastructure
Could you please post the actual network bandwidth utilization of these workloads at scale? Network microbenchmarks are useful for validating scale-up and scale-out, and they serve as a marketing tool to justify the high bandwidth between accelerators. However, real distributed, scalable (strong-scaling) applications have far lower network bandwidth requirements, which is an opportunity to lean out the networking infrastructure at large scale. One eighth of the maximum bandwidth is sufficient for scale-out; that is, a single 400G NIC for every 8 GPUs is enough at very large scale, or perhaps two, one per group of CPU + 4 GPUs.
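To make that arithmetic concrete, here is a rough helper for sanity-checking per-node scale-out bandwidth; the figures in the example are illustrative assumptions on my part, not measured Azure numbers:

```python
# Helper to sanity-check scale-out bandwidth needs per node.
# The example figures at the bottom are illustrative assumptions only.

def required_node_gbps(cross_node_bytes_per_step: float, step_time_s: float) -> float:
    """Cross-fabric bandwidth one node needs, in Gbit/s, if it must move
    cross_node_bytes_per_step off-node every training step."""
    return cross_node_bytes_per_step * 8 / step_time_s / 1e9

# Example: a node that pushes 100 GB of already node-reduced gradients per
# 4-second step (hierarchical all-reduce, most traffic stays on NVLink inside
# the node) needs well under one 400G NIC of scale-out bandwidth.
print(required_node_gbps(100e9, 4.0))   # -> 200.0 Gbit/s
```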
Also, could you comment on the reliability challenges observed at large scale (8k GPUs), where the mean time between failures is a few hours? What processes are in place to recover quickly when a month-long distributed training run of a foundation model is disrupted every few hours by unavoidable hardware/software failures?
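For context, the kind of recovery loop I have in mind is periodic checkpoint-and-resume, sketched minimally below in PyTorch; the paths, intervals, and toy model are my own illustrative assumptions, not a description of how Azure or the post's authors actually handle failures:

```python
# Minimal sketch of periodic checkpointing and automatic resume after a failure.
# CKPT_DIR, SAVE_EVERY, and the toy model are illustrative assumptions.
import glob
import os
import torch
import torch.nn as nn

CKPT_DIR = "checkpoints"      # hypothetical checkpoint location
SAVE_EVERY = 500              # steps between checkpoints
TOTAL_STEPS = 10_000

model = nn.Linear(1024, 1024)                    # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def latest_checkpoint():
    paths = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")))
    return paths[-1] if paths else None

# Resume from the newest checkpoint if a previous run was interrupted.
start_step = 0
ckpt_path = latest_checkpoint()
if ckpt_path is not None:
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

os.makedirs(CKPT_DIR, exist_ok=True)
for step in range(start_step, TOTAL_STEPS):
    x = torch.randn(32, 1024)                    # synthetic batch
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % SAVE_EVERY == 0:
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            os.path.join(CKPT_DIR, f"step_{step:08d}.pt"),
        )
```

At 8k-GPU scale the interesting questions are how often you can afford to checkpoint, how fast a failed node is detected and replaced, and how much wall-clock time is lost replaying work since the last checkpoint.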
Reducing the number of network ports is not only cheaper but also proportionally more reliable (fewer NICs, transceivers, cables, and switches), without compromising the scalability of the applications.
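A quick series-reliability sketch shows what "proportionally more reliable" means here; it assumes independent failures with exponential lifetimes, and every component count and per-unit MTBF figure is a made-up illustrative number, not vendor data:

```python
# Series-reliability sketch: fewer fabric components -> longer fabric MTBF.
# All counts and per-unit MTBF figures are illustrative assumptions.

def fabric_mtbf_hours(components):
    """components: dict name -> (count, per-unit MTBF in hours).
    Failure rates add in a series system, so MTBF = 1 / sum(count / mtbf)."""
    return 1.0 / sum(count / mtbf for count, mtbf in components.values())

# Hypothetical 1000-node cluster: 8 NICs per node vs. 1 NIC per node.
full = {"nic": (8000, 2e5), "transceiver": (16000, 3e5), "cable": (8000, 1e6)}
lean = {"nic": (1000, 2e5), "transceiver": (2000, 3e5), "cable": (1000, 1e6)}

print(f"8 NICs/node fabric MTBF: {fabric_mtbf_hours(full):.1f} h")
print(f"1 NIC/node  fabric MTBF: {fabric_mtbf_hours(lean):.1f} h")  # ~8x longer
```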
Finally, please post your training throughput numbers with and without SHARP at scale, so that the value of SHARP, if any, can be quantified.