Hybrid, Distributed Machine-Learning Platform using Azure HPC Cache and Azure Kubernetes Service
Published Apr 28 2020

Traditional distributed machine-learning (ML) workloads have required that the underlying training and validation data be close to the compute, from a bandwidth and latency perspective, to ensure that I/O does not become a major bottleneck in the training process. Especially for ML training that takes advantage of accelerators such as GPUs and our recently announced Graphcore IPUs, a high-performance I/O layer is key to using these accelerators efficiently. 

 

At Microsoft Ignite in November 2019, we announced the general availability of Azure HPC Cache, a high-performance cache layer that provides low-latency, POSIX-compliant access to data in Azure Blob storage. HPC Cache also allows hydration of data from on-premises resources. Seamless hydration of the cache layer from on-premises storage makes it possible to access that data and run Azure-based ML compute against it efficiently. 

 

[Figure: HPC Cache instantiated from an on-premises NFS export path, serving Azure-based compute]

 

As illustrated above, the user can orchestrate the instantiation of the HPC Cache from a specific NFS export data path without exporting the entire NFS partition; only the path that needs to be read for the specific workload is exposed. Three performance tiers are currently available for HPC Cache, providing 2, 4, and 8 gibibytes per second (GiB/s) of throughput. Unusually for a scalable, performant storage solution, the HPC Cache can be stopped and started to minimize costs during idle periods. 
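For example, a simple cost-saving pattern is to stop the cache between training campaigns and start it again before the next run. As a minimal sketch, assuming the Azure CLI hpc-cache extension is installed, the commands would look like:

az hpc-cache stop --resource-group <resource_group> --name <cache_name>
az hpc-cache start --resource-group <resource_group> --name <cache_name>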

 

HPC Cache is fully supported in Azure Kubernetes Service (AKS) through the standard PersistentVolume/PersistentVolumeClaim mechanism. AKS, through the cluster autoscaler built on Azure Virtual Machine Scale Sets (VMSS), allows the compute to scale elastically to run the workload. 

 

Deployment 

Instantiating the high-performance cache layer from an on-premises NFS server requires a network topology that extends resolution of the on-premises resources into Azure. Technical details are outlined in deploying an ExpressRoute. Alternatively, for a quick proof of concept, you could set up a client VPN into an Azure Virtual Network (VNet). With a specific VNet assigned, search for HPC Cache in the resources tab and fill in the relevant details as illustrated below. 

 

[Figure: HPC Cache creation details in the Azure portal]

 

Next, choose the cache sizing parameters: 

[Figure: cache sizing parameters in the Azure portal]

 

Finally, review and deploy the HPC Cache: 

[Figure: review and deploy step for the HPC Cache]

 

You can download the example deployment files from the GitHub repo.

 

While the HPC Cache is being deployed, we can deploy the AKS cluster. The AKS cluster will be deployed in the same VNet as the HPC Cache. We can launch an example cluster of Standard_NC24s_v3 nodes, which use the NVIDIA Tesla V100 GPU. For this example workflow, the Kubeflow MPI-Operator will be used to run the standard ImageNet 2012 ResNet model. 

 

To instantiate an AKS cluster, execute the command below: 

 

az aks create --resource-group <resource_group> --name <cluster_name> --node-count 1 --enable-addons monitoring --generate-ssh-keys --node-vm-size Standard_NC24s_v3 --vm-set-type VirtualMachineScaleSets --enable-cluster-autoscaler --min-count 1 --max-count 4 --vnet-subnet-id <vnet_in_scope> --zones 1 

 

 

After the cluster is up, we will pull down the credentials so that the kubectl binary can authenticate against the cluster. 

 

az aks get-credentials --resource-group <resource_group> --name <cluster_name> 
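Once the credentials are merged into the local kubeconfig, a quick check confirms the cluster is reachable:

kubectl get nodes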

 

 

First, we will apply the NVIDIA Kubernetes device plugin so that we can recruit and schedule GPU resources across the cluster. 

 

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml 
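To verify that the device plugin has registered the GPUs, you can check the allocatable resources reported by the nodes; for example:

kubectl describe nodes | grep nvidia.com/gpu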

 

 

Next, we will apply the Kubeflow MPI-Operator, which is available from its GitHub repository: 

 

kubectl apply -f deploy/mpi-operator.yaml 
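The operator registers an MPIJob custom resource definition; its presence can be confirmed with:

kubectl get crd mpijobs.kubeflow.org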

 

 

Next, we will apply the YAML files that define the HPC Cache mounts as the PersistentVolume and PersistentVolumeClaim in our namespace. 

 

kubectl apply -f pv-hpccache-k8s.yaml 
kubectl apply -f pvc-hpccache-k8s.yaml
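The exact manifests are in the repo, but as a rough sketch, a statically bound PersistentVolume/PersistentVolumeClaim pair for an NFS mount of the cache might look like the following; the server address, path, and names are placeholders for your cache's mount address and namespace path:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-hpccache
spec:
  capacity:
    storage: 10Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.10    # placeholder: HPC Cache mount address
    path: /nfs1/data     # placeholder: namespace path on the cache
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-hpccache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: pv-hpccache
  resources:
    requests:
      storage: 10Ti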

 

 

With the deployment completed, we can now execute the main mpi-operator workflow. Here we define the MPI launcher spec as well as the worker spec; a sketch of such a definition follows below. Staging the data in the HPC Cache is the first step; after that, we can run the basic benchmarking routines from the mpi-operator GitHub repository. 
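As a sketch of the general shape of tensorflow-benchmarks-imagenet.yaml, adapted from the mpi-operator's tensorflow-benchmarks example (the image, replica counts, flags, and paths here are illustrative, and the claim name assumes the PVC defined above):

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 4                # GPUs per worker node
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow-benchmarks
            image: mpioperator/tensorflow-benchmarks:latest
            command:
            - mpirun
            - python
            - tf_cnn_benchmarks.py
            - --model=resnet50
            - --data_dir=/data/imagenet   # data staged in the HPC Cache
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: tensorflow-benchmarks
            image: mpioperator/tensorflow-benchmarks:latest
            resources:
              limits:
                nvidia.com/gpu: 4         # Standard_NC24s_v3 has 4 V100 GPUs
            volumeMounts:
            - name: hpccache
              mountPath: /data
          volumes:
          - name: hpccache
            persistentVolumeClaim:
              claimName: pvc-hpccache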

To execute the workflow:  

 

 

kubectl create -f tensorflow-benchmarks-imagenet.yaml 

 

 

In a few minutes, the output of kubectl get pods -o wide will show that the cluster autoscaler added nodes to the cluster and invoked the workflow. You can see the progress of the ML training by tailing the launcher pod, as shown after the listing below. 

 

 

NAME                                   READY   STATUS             RESTARTS   AGE   IP            NODE                                NOMINATED NODE   READINESS GATES 
attach-pvc                             0/1     CrashLoopBackOff   14         50m   10.244.0.16   aks-nodepool1-36563695-vmss000000   <none>           <none> 
tensorflow-benchmarks-launcher-74gc8   1/1     Running            0          11m   10.244.0.18   aks-nodepool1-36563695-vmss000000   <none>           <none> 
tensorflow-benchmarks-worker-0         1/1     Running            0          11m   10.244.0.17   aks-nodepool1-36563695-vmss000000   <none>           <none> 
tensorflow-benchmarks-worker-1         1/1     Running            0          11m   10.244.2.4    aks-nodepool1-36563695-vmss000002   <none>           <none> 
tensorflow-benchmarks-worker-2         1/1     Running            0          11m   10.244.3.4    aks-nodepool1-36563695-vmss000001   <none>           <none> 
tensorflow-benchmarks-worker-3         1/1     Running            0          11m   10.244.1.4    aks-nodepool1-36563695-vmss000003   <none>           <none> 
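To tail the launcher pod shown in the listing above and follow the training progress:

kubectl logs -f tensorflow-benchmarks-launcher-74gc8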

 

 

Resource cleanup 

We can quickly delete the cluster after executing the workflow by running: 

 

az aks delete --resource-group <resource_group> --name <cluster_name>
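If the HPC Cache is no longer needed either, it can be stopped or deleted as well; a sketch, again assuming the Azure CLI hpc-cache extension:

az hpc-cache delete --resource-group <resource_group> --name <cache_name>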

 

 

Conclusion 

By instantiating a hybrid, elastic compute infrastructure, it is possible to blur the lines between on-premises storage and managed cloud compute. In response to on-premises events, it is now possible to recruit additional scale and elasticity in the cloud as the need arises. Applications for which this architecture can be successfully deployed include ingesting video feeds from autonomous-driving data-collection campaigns and analyzing data from on-premises genomics sequencing. 
