Our overall mission has been to democratize access to supercomputing. We’ve always believed that people could do great things with access to high-performance compute. So, while hosting the AI revolution was never the plan, its very existence has validated our strategy of sitting at the intersection of two movements in IT, cloud computing and Beowulf clusters, to tackle the biggest obstacles to accessing supercomputing.
Cloud Computing
When it comes to resource allocation, the cloud is about adapting to customer demand, on demand, as efficiently as possible.
It needs to satisfy a startup that needs to start small but scale up seamlessly as its business takes off, or a retailer that needs more capacity for the holiday rush but doesn’t want to pay for it year-round. Cloud providers manage these huge swings, across time as well as across customers, with virtualization.
Instead of installing new hardware for each deployment, the cloud runs hypervisor software that can take powerful servers and carve them up into smaller virtual machines on the fly. That lets it create these virtual machines, move them between servers, and destroy them in seconds.
It’s through these operations that the cloud can offer virtually unlimited flexibility to customers as a differentiating factor.
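If it helps to see the bookkeeping behind that carving, here’s a toy sketch in Python. It isn’t Azure’s scheduler and the names and numbers are made up; it just shows the basic idea of placing VM requests into whatever free capacity exists on a pool of hosts and handing it back when the customer is done.

```python
# Toy sketch of carving physical hosts into VMs (illustrative only, not Azure's scheduler).
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    total_cores: int
    used_cores: int = 0
    vms: dict = field(default_factory=dict)  # vm_id -> cores

    def free_cores(self) -> int:
        return self.total_cores - self.used_cores

class ToyHypervisorPool:
    """Places VM requests onto hosts (first fit) and frees them on delete."""

    def __init__(self, hosts):
        self.hosts = hosts

    def create_vm(self, vm_id: str, cores: int) -> str:
        for host in self.hosts:  # first host with enough free capacity wins
            if host.free_cores() >= cores:
                host.vms[vm_id] = cores
                host.used_cores += cores
                return host.name
        raise RuntimeError("no capacity left: time to install more servers")

    def delete_vm(self, vm_id: str) -> None:
        for host in self.hosts:
            if vm_id in host.vms:
                host.used_cores -= host.vms.pop(vm_id)
                return

# A retailer bursts for the holidays, then gives the cores back.
pool = ToyHypervisorPool([Host("server-1", 64), Host("server-2", 64)])
for i in range(6):
    pool.create_vm(f"holiday-web-{i}", cores=16)   # scale out in seconds
for i in range(6):
    pool.delete_vm(f"holiday-web-{i}")             # and back down after the rush
```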
Beowulf Clusters
Since 1994, when researchers at NASA built their Beowulf supercomputer by connecting a bunch of PCs together over Ethernet, clustering together commodity hardware has increasingly become the way supercomputers are built.
The main driver for this shift is economies of scale. Custom designs tend to make more efficient use of each processor, but if I can buy Raspberry Pis for half the cost per FLOPS, then even running at 60% efficiency makes them a win.
A system’s efficiency in this context is the ratio of how quickly it can do useful work on a real job to the sum of the peak FLOPS across all its parts. Beowulf clusters tend to perform worse on this metric, mainly because custom silicon is more optimized for targeted workloads and because network bottlenecks often leave Beowulf nodes stuck waiting for data.
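To put rough numbers on that trade-off (these figures are illustrative, not real benchmarks), what matters is the cost per FLOPS you actually get to use, i.e. cost per peak FLOPS divided by efficiency:

```python
# Illustrative cost math behind "half the cost per FLOPS at 60% efficiency is a win".
# All numbers are made up; only the ratios matter.

def cost_per_useful_flops(cost_per_peak_flops: float, efficiency: float) -> float:
    """Dollars per FLOPS of *delivered* work = dollars per peak FLOPS / efficiency."""
    return cost_per_peak_flops / efficiency

custom = cost_per_useful_flops(cost_per_peak_flops=1.0, efficiency=0.90)     # tuned custom design
commodity = cost_per_useful_flops(cost_per_peak_flops=0.5, efficiency=0.60)  # half-price commodity cluster

print(f"custom:    {custom:.2f} per useful FLOPS")     # ~1.11
print(f"commodity: {commodity:.2f} per useful FLOPS")  # ~0.83, still the cheaper way to get work done
```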
As Beowulf-style clusters gained traction, specialized accelerators and networking technologies popped up to boost their efficiency. Accelerators like GPUs targeted specific types of computation, while new network protocols like InfiniBand/RDMA optimized for the static, closed nature of these “backend” networks and the communication patterns they carry.
Azure HPC
Besides bringing costs down in general, Beowulf clusters are modular enough for us to run back the same virtualization playbook from cloud computing. Instead of using a hypervisor to carve out sets of cores, we use network partitions to carve out sets of nodes.
To a customer, this means creating a group of “RDMA-enabled” VMs with a specific setting that tells us to put those VMs on the same network partition.
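As a rough illustration of that placement logic (a toy model, not Azure’s actual control plane or API), think of the backend network as a set of partitions, each spanning some number of nodes; asking for a group of RDMA-enabled VMs amounts to reserving nodes that all live in the same partition:

```python
# Toy model of carving a cluster into network partitions (not Azure's real control plane).

class ToyBackendNetwork:
    """Tracks how many free nodes each network partition has."""

    def __init__(self, partitions: dict):
        # partition name -> number of free nodes on that partition
        self.free_nodes = dict(partitions)

    def place_rdma_group(self, group_id: str, num_vms: int) -> str:
        """Place all VMs of a group on one partition so they share the same fabric."""
        for partition, free in self.free_nodes.items():
            if free >= num_vms:
                self.free_nodes[partition] -= num_vms
                return f"group {group_id}: {num_vms} VMs on {partition}"
        raise RuntimeError("no single partition can hold the whole group")

network = ToyBackendNetwork({"partition-a": 16, "partition-b": 64})
print(network.place_rdma_group("cfd-run", num_vms=32))  # lands on partition-b, all nodes on one fabric
```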
And that's what lets us offer supercomputing in the cloud, which brings with it interesting opportunities for customers.
On their own cluster, if they want a job done in half the time, they might pay twice as much for a bigger cluster. In the cloud, that cost would stay flat because they’re only renting the bigger cluster for half as long. In a sense, it’s holiday rush bursting taken to the extreme, and it's made Azure HPC popular in industries where R&D relies heavily on supercomputing and time to market is everything.
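The arithmetic behind “the cost stays flat” is just node-hours, assuming linear per-node-hour pricing and a job that scales well (real jobs scale somewhat less than perfectly):

```python
# Why renting twice the cluster for half the time costs the same.
# Assumes linear node-hour pricing and near-perfect scaling; prices are made up.

def job_cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    return nodes * hours * price_per_node_hour

baseline = job_cost(nodes=100, hours=10, price_per_node_hour=3.0)  # 3,000
rushed   = job_cost(nodes=200, hours=5,  price_per_node_hour=3.0)  # 3,000, answer in half the time

assert baseline == rushed
```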
In 2022, that could have been the end of the story, but with the rise of AI, we now serve an industry that uses supercomputing not only in R&D (training) but also in production (inference), making us an attractive platform for this generation of startups.
And that, in a nutshell, is how Azure HPC works: serving customers small and large the latest server and accelerator technologies, tied together with a high-performance backend interconnect and offered through the Azure Cloud.
I felt compelled to write this because within Azure, we’re often referred to as “InfiniBand Team” or “Azure GPU” or “AI Platform”, and while I appreciate that those three are and likely will continue to be a wildly successful combination of network, accelerator, and use-case, I like to think you could swap any of them out, and we’d still be Azure HPC.