Azure High Performance Computing (HPC) Blog

Running GPU accelerated workloads with NVIDIA GPU Operator on AKS

Feb 23, 2024

Dr. Wolfgang De Salvador - EMEA GBB HPC/AI Infrastructure Senior Specialist

Dr. Kai Neuffer - Principal Program Manager, Industry and Partner Sales - Energy Industry

 



As of today, several options are available to run GPU accelerated HPC/AI workloads on Azure, ranging from training to inferencing.

Looking specifically at AI workloads, the most direct and managed way to access GPU resources and the related orchestration capabilities is represented by Azure Machine Learning distributed training capabilities, together with the related deployment options for inferencing.

At the same time, specific HPC/AI workloads require a high degree of customization and granular control over the compute resource configuration, including the operating system, the system packages, the HPC/AI software stack and the drivers. This is the case, for example, in previous blog posts by our benchmarking team on training the NVIDIA NeMo Megatron model or on MLPerf Training v3.0.

In these scenarios, it is critical to be able to fine-tune the configuration of the host at the operating system level, to precisely match the ideal configuration for getting the most value out of the compute resources.

On Azure, HPC/AI workload orchestration on GPUs is supported by several Azure services, including Azure CycleCloud, Azure Batch and Azure Kubernetes Service (AKS).

 

Focus of the blog post

The focus of this article will be on getting NVIDIA GPUs managed and configured in the best way on Azure Kubernetes Service using the NVIDIA GPU Operator.

The guide is based on the documentation already available on Azure Learn for configuring GPU nodes or multi-instance GPU profile nodes, as well as on the NVIDIA GPU Operator documentation.

However, the main scope of the article is to present a methodology to fully manage the GPU configuration leveraging NVIDIA GPU Operator native features, including:

  • Driver versions and custom driver bundles
  • Time-slicing for GPU oversubscription
  • MIG profiles for supported GPUs, without the need to define the behavior exclusively at node pool creation time

 

Deploying a vanilla AKS cluster

The standard way of deploying a vanilla AKS cluster is to follow the procedure described in the Azure documentation.

Please be aware that this command will create an AKS cluster with:

  • Kubenet as network CNI
  • A public API server endpoint
  • Local accounts with Kubernetes RBAC

In general, for production workloads we strongly recommend reviewing the main security concepts for AKS clusters:

  • Use Azure CNI
  • Evaluate using Private AKS Cluster to limit API exposure to the public internet
  • Evaluate using Azure RBAC with Entra ID accounts or Kubernetes RBAC with Entra ID accounts

This is out of scope for the present demo, but please be aware that the cluster created here is meant for NVIDIA GPU Operator demo purposes only.
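
As a hedged sketch only, the hardening recommendations above could translate into additional az aks create flags along the following lines (flag availability may vary with the Azure CLI version, and a private cluster also requires network connectivity from your management VM):

```shell
# Illustrative hardened variant of the demo cluster creation (not used in this walkthrough):
# --network-plugin azure           -> use Azure CNI instead of kubenet
# --enable-private-cluster         -> private API server endpoint
# --enable-aad --enable-azure-rbac -> Entra ID authentication with Azure RBAC
# --disable-local-accounts         -> no local admin kubeconfig credentials
az aks create \
    --resource-group $RESOURCE_GROUP_NAME \
    --name $AKS_CLUSTER_NAME \
    --node-count 2 \
    --network-plugin azure \
    --enable-private-cluster \
    --enable-aad \
    --enable-azure-rbac \
    --disable-local-accounts \
    --generate-ssh-keys
```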

Using the Azure CLI, we can create an AKS cluster with the following procedure (replace the values in angle brackets with your preferred values):

 

export RESOURCE_GROUP_NAME=<YOUR_RG_NAME>
export AKS_CLUSTER_NAME=<YOUR_AKS_CLUSTER_NAME>
export LOCATION=<YOUR_LOCATION>

## Following line to be used only if Resource Group is not available
az group create --resource-group $RESOURCE_GROUP_NAME --location $LOCATION

az aks create --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME --node-count 2 --generate-ssh-keys 

 

 

Connecting to the cluster

To connect to the AKS cluster, several ways are documented in Azure documentation.

Our favorite approach is using a Linux Ubuntu VM with Azure CLI installed.

This allows us to run the following (be aware that in the login command you may need to add --tenant <TENANT_ID> in case you have access to multiple tenants, or --identity if the VM runs on Azure and relies on an Azure managed identity):

 

## Add --tenant <TENANT_ID> in case of multiple tenants
## Add --identity in case of using a managed identity on the VM
az login 
az aks install-cli
az aks get-credentials --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME

 

After this is completed, you should be able to perform standard kubectl commands like:

 

kubectl get nodes

root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-25743550-vmss000000   Ready    agent   2d19h   v1.27.7
aks-nodepool1-25743550-vmss000001   Ready    agent   2d19h   v1.27.7

 

The command line will be perfectly fine for all the operations in this blog post. However, if you would like a TUI experience, we suggest k9s, which can be easily installed on Linux following the installation instructions. For Ubuntu, you can install the current version at the time of writing with:

 

wget "https://github.com/derailed/k9s/releases/download/v0.31.9/k9s_linux_amd64.deb"
dpkg -i k9s_linux_amd64.deb

 

k9s allows you to easily interact with the different resources of the AKS cluster directly from a terminal user interface. It can be launched with the k9s command. Detailed documentation on how to navigate the different resources (Pods, DaemonSets, Nodes) can be found on the official k9s documentation page.

 

Attaching an Azure Container Registry to the Azure Kubernetes Service cluster (only required for MIG and NVIDIA GPU Driver CRD)

In case you will be using MIG or the NVIDIA GPU Driver CRD, it is necessary to create a private Azure Container Registry and attach it to the AKS cluster.

 

export ACR_NAME=<ACR_NAME_OF_YOUR_CHOICE>

az acr create --resource-group $RESOURCE_GROUP_NAME \
  --name $ACR_NAME --sku Basic

az aks update --name $AKS_CLUSTER_NAME --resource-group  $RESOURCE_GROUP_NAME --attach-acr $ACR_NAME

 

You will be able to perform pull and push operations against this Container Registry through Docker using the following command on a VM with a container engine installed, provided that the VM has a managed identity with AcrPull/AcrPush permissions:

 

az acr login --name $ACR_NAME
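
For example, a quick pull/push round trip could look like the following sketch (the image and tag names here are purely illustrative; the registry login server is the ACR name followed by .azurecr.io):

```shell
# Illustrative only: retag a public sample image and push it to the attached ACR
docker pull mcr.microsoft.com/hello-world
docker tag mcr.microsoft.com/hello-world $ACR_NAME.azurecr.io/hello-world:test
docker push $ACR_NAME.azurecr.io/hello-world:test
```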

 

 

About taints for AKS GPU nodes

It is important to deeply understand the concept of taints and tolerations for GPU nodes in AKS. This is critical for two reasons:

  • In case spot instances are used in the AKS cluster, the following taint will be applied to them:
    kubernetes.azure.com/scalesetpriority=spot:NoSchedule
  • In some cases, it may be useful to add a dedicated taint for GPU SKUs on the AKS cluster, like:
    sku=gpu:NoSchedule

    The utility of this taint is mainly related to the fact that, compared to on-premises and bare-metal Kubernetes clusters, AKS node pools are usually allowed to scale down to 0 instances. This means that when the AKS autoscaler has to take a scaling decision on the basis of an “nvidia.com/gpu” resource request, it may struggle to identify the right node pool to scale up.
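
Pods that should land on nodes carrying these taints must declare matching tolerations. A minimal Pod/Job template fragment covering both taints above could look like this (the spot toleration is only needed on spot node pools):

```yaml
# Fragment of a Pod/Job template spec tolerating the taints discussed above
tolerations:
- key: "sku"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"
```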

However, the latter point can also be addressed in a more elegant and specific way using an affinity declaration in the Job or Pod specs requesting GPUs, for example:

 

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - Standard_NC4as_T4_v3

 

 

Creating the first GPU pool

The AKS cluster created above contains by default only a node pool with 2 Standard_DS2_v2 VMs.

In order to test the NVIDIA GPU Operator and run some GPU accelerated workloads, we should add a GPU node pool.

In case the NVIDIA software stack is meant to be managed by the GPU Operator, it is critical that the node pool is created with the tag:

 

SkipGPUDriverInstall=true

 

This can be done using Azure Cloud Shell, for example using an NC4as_T4_v3 and enabling autoscaling from 0 up to 1 node:

 

az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nc4ast4 \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NC4as_T4_v3 \
    --enable-cluster-autoscaler \
    --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

 

In order to deploy in Spot mode, the following flags should be added to the Azure CLI command:

 

--priority Spot --eviction-policy Delete --spot-max-price -1

 

Recently, a preview feature has been released that allows skipping the creation of the tag:

 

# Register the aks-preview extension
az extension add --name aks-preview

# Update the aks-preview extension
az extension update --name aks-preview

az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nc4ast4 \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NC4as_T4_v3 \
    --enable-cluster-autoscaler \
    --min-count 0 --max-count 1 --node-count 0 --skip-gpu-driver-install

 

At the end of the process, you should see the new node pool in the portal, with status “Succeeded”:

 

az aks nodepool list --cluster-name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME -o table
Name       OsType    KubernetesVersion    VmSize                Count    MaxPods    ProvisioningState    Mode
---------  --------  -------------------  --------------------  -------  ---------  -------------------  ------
nodepool1  Linux     1.27.7               Standard_DS2_v2       2        110        Succeeded            System
nc4ast4    Linux     1.27.7               Standard_NC4as_T4_v3  0        110        Succeeded            User

 

 

Installing the NVIDIA GPU Operator

On the machine with kubectl configured for the AKS cluster as described above, run the following to install Helm:

 

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
    && chmod 700 get_helm.sh \
    && ./get_helm.sh

 

To fine-tune node feature recognition, we will install Node Feature Discovery separately from the NVIDIA GPU Operator. The GPU Operator requires the label feature.node.kubernetes.io/pci-10de.present=true to be applied to the nodes. Moreover, it is important to configure the Node Feature Discovery workers with tolerations so that they are scheduled even on Spot instances of the Kubernetes cluster and on nodes where the sku=gpu taint is applied:

 

helm install --wait --create-namespace -n gpu-operator node-feature-discovery node-feature-discovery \
    --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts \
    --set-json master.config.extraLabelNs='["nvidia.com"]' \
    --set-json worker.tolerations='[{"effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"}, {"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value": "spot", "operator": "Equal"}, {"effect": "NoSchedule", "key": "mig", "value": "notReady", "operator": "Equal"}]'

 

After enabling Node Feature Discovery, it is important to create a custom rule to precisely match NVIDIA GPUs on the nodes. This can be done by creating a file called nfd-gpu-rule.yaml with the following content:

 

apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nfd-gpu-rule
spec:
  rules:
  - name: "nfd-gpu-rule"
    labels:
      "feature.node.kubernetes.io/pci-10de.present": "true"
    matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}

 

After this file is created, we should apply it to the AKS cluster:

 

kubectl apply -n gpu-operator -f nfd-gpu-rule.yaml

 

After this step, it is necessary to add NVIDIA Helm repository:

 

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

 

The next step is installing the GPU Operator itself, remembering to apply the same tolerations to the GPU Operator DaemonSets, and to disable the deployment of Node Feature Discovery (nfd), which was already installed in the previous step:

 

helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator \
    --set-json daemonsets.tolerations='[{"effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"}, {"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value": "spot", "operator": "Equal"}, {"effect": "NoSchedule", "key": "mig", "value": "notReady", "operator": "Equal"}]' \
    --set nfd.enabled=false

 

 

Running the first GPU example

Once the configuration has been completed, it is time to check the functionality of the GPU Operator by submitting the first GPU accelerated Job on AKS. In this stage we will use as a reference the standard TensorFlow example that is also documented in the official AKS Azure Learn pages.

Create a file called gpu-accelerated.yaml with this content:

 

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo
  name: samples-tf-mnist-demo
spec:
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_NC4as_T4_v3
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
          - mountPath: /tmp
            name: scratch
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule" 
      volumes:
        - name: scratch
          hostPath:
            # directory location on host
            path: /mnt/tmp
            type: DirectoryOrCreate
            # this field is optional

 

This job can be submitted with the following command:

 

kubectl apply -f gpu-accelerated.yaml

 

After approximately one minute, the node should be automatically provisioned:

 

root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nc4ast4-81279986-vmss000003     Ready    agent   2m38s   v1.27.7
aks-nodepool1-25743550-vmss000000   Ready    agent   4d16h   v1.27.7
aks-nodepool1-25743550-vmss000001   Ready    agent   4d16h   v1.27.7

 

We can check that Node Feature Discovery has properly labeled the node:

 

root@aks-gpu-playground-rg-jumpbox:~# kubectl describe nodes  aks-nc4ast4-81279986-vmss000003 | grep pci-
feature.node.kubernetes.io/pci-0302_10de.present=true
feature.node.kubernetes.io/pci-10de.present=true

 

The NVIDIA GPU Operator DaemonSets will then start preparing the node, installing the driver and the NVIDIA Container Toolkit and running the related validation steps.

Once node preparation is completed, the GPU Operator will add an allocatable GPU resource to the node:

 

kubectl describe nodes aks-nc4ast4-81279986-vmss000003 
…
Allocatable:
  cpu:                3860m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             24487780Ki
  nvidia.com/gpu:     1
  pods:               110
…

 

We can follow the training process with the kubectl logs command:

 


root@aks-gpu-playground-rg-jumpbox:~# kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
samples-tf-mnist-demo-tmpr4   1/1     Running   0          11m

root@aks-gpu-playground-rg-jumpbox:~# kubectl logs samples-tf-mnist-demo-tmpr4 --follow
2024-02-18 11:51:31.479768: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2024-02-18 11:51:31.806125: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0001:00:00.0
totalMemory: 15.57GiB freeMemory: 15.47GiB
2024-02-18 11:51:31.806157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5)
2024-02-18 11:54:56.216820: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1201
Accuracy at step 10: 0.7364
…..
Accuracy at step 490: 0.9559
Adding run metadata for 499

 

 

Time-slicing configuration

An extremely useful feature of the NVIDIA GPU Operator is time-slicing. Time-slicing allows sharing a physical GPU available on a node among multiple Pods. Of course, this is a time-based scheduling partition and not a physical GPU partition: the GPU processes run by the different Pods each receive a proportional share of GPU compute time. However, if one Pod is particularly demanding in terms of GPU processing, it will significantly impact the other Pods sharing the GPU.

The official NVIDIA GPU Operator supports different ways to configure time-slicing. Here, considering that one of the benefits of a cloud environment is the possibility of having multiple node pools, each with a different GPU or configuration, we will focus on a fine-grained definition of time-slicing at the node pool level.

Enabling time-slicing involves three steps:

  • Labeling the nodes so they can be referenced in the time-slicing configuration
  • Creating the time-slicing ConfigMap
  • Enabling time-slicing based on the ConfigMap in the GPU Operator cluster policy

As a first step, the nodes should be labelled with the key “nvidia.com/device-plugin.config”.

For example, let’s label our node pool from the Azure CLI:

 

az aks nodepool update --cluster-name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --nodepool-name nc4ast4 --labels "nvidia.com/device-plugin.config=tesla-t4-ts2"

 

After this step, let’s create the ConfigMap required to enable a time-slicing factor of 2 on this node pool, in a file called time-slicing-config.yaml:

 

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  tesla-t4-ts2: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2

 

Let’s apply the configuration in the GPU operator namespace:

 

kubectl apply -f time-slicing-config.yaml -n gpu-operator

 

Finally, let’s update the cluster policy to enable the time-slicing configuration:

 

kubectl patch clusterpolicy/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config"}}}}'
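
A quick way to confirm that the device plugin has picked up the new configuration is to check the allocatable GPU count on a GPU node (the node name below is taken from the earlier examples; with a time-slicing factor of 2 on a single-GPU VM, the allocatable nvidia.com/gpu count should read 2):

```shell
# Replace the node name with one of your GPU nodes
kubectl describe node aks-nc4ast4-81279986-vmss000003 | grep "nvidia.com/gpu"
```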

 

Now, let’s resubmit the Job used in the first step, this time in two replicas, creating a file called gpu-accelerated-time-slicing.yaml:

 

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo-ts
  name: samples-tf-mnist-demo-ts
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo-ts
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_NC4as_T4_v3
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule" 

 

Let’s submit the job with the standard syntax:

 

kubectl apply -f gpu-accelerated-time-slicing.yaml

 

Now, after the node has been provisioned, we will find that it exposes two allocatable GPU resources and runs the two Pods concurrently:

 

kubectl describe node aks-nc4ast4-81279986-vmss000004
...
Allocatable:
  cpu:                3860m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             24487780Ki
  nvidia.com/gpu:     2
  pods:               110
.....
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  default                     samples-tf-mnist-demo-ts-0-4tdcf            0 (0%)        0 (0%)      0 (0%)           0 (0%)         29s
  default                     samples-tf-mnist-demo-ts-1-67hn4            0 (0%)        0 (0%)      0 (0%)           0 (0%)         29s
  gpu-operator                gpu-feature-discovery-lksj7                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         59s
  gpu-operator                node-feature-discovery-worker-wbbct         0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m11s
  gpu-operator                nvidia-container-toolkit-daemonset-8nmx7    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m24s
  gpu-operator                nvidia-dcgm-exporter-76rs8                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m24s
  gpu-operator                nvidia-device-plugin-daemonset-btwz7        0 (0%)        0 (0%)      0 (0%)           0 (0%)         55s
  gpu-operator                nvidia-driver-daemonset-8dkkh               0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m6s
  gpu-operator                nvidia-operator-validator-s7294             0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m24s
  kube-system                 azure-ip-masq-agent-fjm5d                   100m (2%)     500m (12%)  50Mi (0%)        250Mi (1%)     9m18s
  kube-system                 cloud-node-manager-9wpsm                    50m (1%)      0 (0%)      50Mi (0%)        512Mi (2%)     9m18s
  kube-system                 csi-azuredisk-node-ckqw6                    30m (0%)      0 (0%)      60Mi (0%)        400Mi (1%)     9m18s
  kube-system                 csi-azurefile-node-xmfbd                    30m (0%)      0 (0%)      60Mi (0%)        600Mi (2%)     9m18s
  kube-system                 kube-proxy-7l856                            100m (2%)     0 (0%)      0 (0%)           0 (0%)         9m18s

 

A few remarks about time-slicing:

  • It is critical, in this specific scenario, to benchmark and characterize your GPU workload. Time-slicing is a method to maximize resource utilization, not a solution to multiply the available resources. A careful benchmarking of GPU usage and GPU memory usage is suggested to identify whether time-slicing is a valid solution. For example, if the average load of a specific GPU process is around 30%, a time-slicing factor of 2 or 3 could be evaluated
  • Of course, CPU and RAM resources should also be considered in the equation
  • In AKS, it is extremely important to note that when the time-slicing configuration is changed for a node pool that has no resources allocated, the change is not immediately reflected in the next autoscaler operation.

Let’s imagine, for example, a node pool scaled down to zero that has no time-slicing applied, and let’s assume we configure it with a time-slicing factor of 2. Submitting a request for 2 GPU resources may still allocate 2 nodes.

This is because the autoscaler remembers that each node provides only 1 allocatable GPU. Once a node correctly exposes 2 allocatable GPUs for the first time, the AKS autoscaler will acknowledge that and act accordingly in all future autoscaling operations.

 

Multi-Instance GPU (MIG)

NVIDIA Multi-Instance GPU (MIG) allows for GPU partitioning on the Ampere and Hopper architectures. An available GPU can be partitioned at the hardware level (not at the time-slicing level), so that Pods get access to a dedicated portion of the GPU resources, delimited in hardware.

In Kubernetes, two strategies are available for MIG: single and mixed.

In the single strategy, the nodes expose a standard “nvidia.com/gpu” set of resources.

In the mixed strategy, the nodes expose the specific MIG profiles as resources, as in the example below:

 

Allocatable:
  nvidia.com/mig-1g.5gb:   1
  nvidia.com/mig-2g.10gb:  1
  nvidia.com/mig-3g.20gb:  1
 

In order to use MIG, you could follow the standard AKS documentation. However, we would like to propose here a method relying entirely on the NVIDIA GPU Operator.

As a first step, it is necessary to allow the reboot of nodes, which is required to enable the MIG configuration:

 

kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge -p '{"spec": {"migManager": {"env": [{"name": "WITH_REBOOT", "value": "true"}]}}}'

 

Let’s start by creating a node pool powered by a GPU supporting MIG on Azure, such as the Standard_NC24ads_A100_v4 SKU, and let’s label the node pool with one of the MIG profiles available for the A100 80 GiB GPU:

 

az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nc24a100v4 \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --enable-cluster-autoscaler \
    --min-count 0 --max-count 1 --node-count 0 --skip-gpu-driver-install --labels "nvidia.com/mig.config"="all-1g.10gb"

 

There is another important detail to consider at this stage with AKS: the auto-scaling of the nodes will bring up nodes with a standard GPU configuration, without MIG activated. This means that the NVIDIA GPU Operator will install the drivers, and then mig-manager will activate the proper MIG configuration profile and reboot the node. Between these two phases there is a small time window in which the GPU resources are exposed by the node, which could potentially trigger a job execution.

To support this scenario, it is important to deploy on AKS an additional DaemonSet that prevents any Pod from being scheduled during the MIG configuration. This is available in a dedicated repository.

To deploy the DaemonSet:

 

export NAMESPACE=gpu-operator
export ACR_NAME=<YOUR_ACR_NAME>
git clone https://github.com/wolfgang-desalvador/aks-mig-monitor.git
cd aks-mig-monitor
sed -i "s/<ACR_NAME>/$ACR_NAME/g" mig-monitor-daemonset.yaml
sed -i "s/<NAMESPACE>/$NAMESPACE/g" mig-monitor-roles.yaml
docker build . -t $ACR_NAME/aks-mig-monitor
docker push $ACR_NAME/aks-mig-monitor
kubectl apply -f mig-monitor-roles.yaml -n $NAMESPACE
kubectl apply -f mig-monitor-daemonset.yaml -n $NAMESPACE

 

We can now create a file called mig-accelerated-job.yaml with this content:

 

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo-mig
  name: samples-tf-mnist-demo-mig
spec:
  completions: 7
  parallelism: 7
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo-mig
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_NC24ads_A100_v4
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"

 

We can then submit the job with kubectl:

 

kubectl apply -f mig-accelerated-job.yaml

 

After the node starts up, it will initially carry the taint mig=notReady:NoSchedule, since the MIG configuration is not yet completed. The GPU Operator containers will be installed:

 

kubectl describe nodes aks-nc24a100v4-42670331-vmss00000a

Name:               aks-nc24a100v4-42670331-vmss00000a
...
                    nvidia.com/mig.config=all-1g.10gb
...
Taints:             kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                    mig=notReady:NoSchedule
                    sku=gpu:NoSchedule
...
Non-terminated Pods:          (13 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  gpu-operator                aks-mig-monitor-64zpl                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         16s
  gpu-operator                gpu-feature-discovery-wpd2j                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator                node-feature-discovery-worker-79h68         0 (0%)        0 (0%)      0 (0%)           0 (0%)         16s
  gpu-operator                nvidia-container-toolkit-daemonset-q5p9k    0 (0%)        0 (0%)      0 (0%)           0 (0%)         12s
  gpu-operator                nvidia-dcgm-exporter-9g5kg                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator                nvidia-device-plugin-daemonset-5wpzk        0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator                nvidia-driver-daemonset-kqkzb               0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator                nvidia-operator-validator-lx77m             0 (0%)        0 (0%)      0 (0%)           0 (0%)         12s
  kube-system                 azure-ip-masq-agent-7rd2x                   100m (0%)     500m (2%)   50Mi (0%)        250Mi (0%)     66s
  kube-system                 cloud-node-manager-dc756                    50m (0%)      0 (0%)      50Mi (0%)        512Mi (0%)     66s
  kube-system                 csi-azuredisk-node-5b4nk                    30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     66s
  kube-system                 csi-azurefile-node-vlwhv                    30m (0%)      0 (0%)      60Mi (0%)        600Mi (0%)     66s
  kube-system                 kube-proxy-4fkxh                            100m (0%)     0 (0%)      0 (0%)           0 (0%)         66s

 

After the GPU Operator configuration is completed, the mig-manager will be deployed. The MIG configuration will be applied and the node will then enter a rebooting state:

 

 kubectl describe nodes aks-nc24a100v4-42670331-vmss00000a

                    nvidia.com/mig.config=all-1g.10gb
                    nvidia.com/mig.strategy=single
                    nvidia.com/mig.config.state=rebooting
...
Taints:             kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                    mig=notReady:NoSchedule
                    sku=gpu:NoSchedule
...
Non-terminated Pods:          (14 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  gpu-operator                aks-mig-monitor-64zpl                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m6s
  gpu-operator                gpu-feature-discovery-6btwx                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator                node-feature-discovery-worker-79h68         0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m6s
  gpu-operator                nvidia-container-toolkit-daemonset-wplkb    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator                nvidia-dcgm-exporter-vnscq                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator                nvidia-device-plugin-daemonset-d86dn        0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator                nvidia-driver-daemonset-kqkzb               0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m3s
  gpu-operator                nvidia-mig-manager-t4bw9                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2s
  gpu-operator                nvidia-operator-validator-jrfkn             0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  kube-system                 azure-ip-masq-agent-7rd2x                   100m (0%)     500m (2%)   50Mi (0%)        250Mi (0%)     4m56s
  kube-system                 cloud-node-manager-dc756                    50m (0%)      0 (0%)      50Mi (0%)        512Mi (0%)     4m56s
  kube-system                 csi-azuredisk-node-5b4nk                    30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     4m56s
  kube-system                 csi-azurefile-node-vlwhv                    30m (0%)      0 (0%)      60Mi (0%)        600Mi (0%)     4m56s
  kube-system                 kube-proxy-4fkxh                            100m (0%)     0 (0%)      0 (0%)           0 (0%)         4m56s

 

After the reboot, the MIG configuration switches to the "success" state and the mig=notReady:NoSchedule taint is removed. Scheduling of the seven pods of our job then starts:

 

 kubectl describe nodes aks-nc24a100v4-42670331-vmss00000a
...
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.config=all-1g.10gb
                    nvidia.com/mig.config.state=success
                    nvidia.com/mig.strategy=single
...
Taints:             kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                    sku=gpu:NoSchedule
...
Allocatable:
  cpu:                23660m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             214295444Ki
  nvidia.com/gpu:     7
  pods:               110
...
Non-terminated Pods:          (21 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  default                     samples-tf-mnist-demo-ts-0-5bs64            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-1-2msdh            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-2-ck8c8            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-3-dlkfn            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-4-899fr            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-5-dmgpn            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-6-pvzm4            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  gpu-operator                aks-mig-monitor-64zpl                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         9m9s
  gpu-operator                gpu-feature-discovery-5t9gn                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  gpu-operator                node-feature-discovery-worker-79h68         0 (0%)        0 (0%)      0 (0%)           0 (0%)         9m9s
  gpu-operator                nvidia-container-toolkit-daemonset-82dgg    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m22s
  gpu-operator                nvidia-dcgm-exporter-xbxqf                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  gpu-operator                nvidia-device-plugin-daemonset-8gkzd        0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  gpu-operator                nvidia-driver-daemonset-kqkzb               0 (0%)        0 (0%)      0 (0%)           0 (0%)         9m6s
  gpu-operator                nvidia-mig-manager-jbqls                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m22s
  gpu-operator                nvidia-operator-validator-5rdbh             0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  kube-system                 azure-ip-masq-agent-7rd2x                   100m (0%)     500m (2%)   50Mi (0%)        250Mi (0%)     9m59s
  kube-system                 cloud-node-manager-dc756                    50m (0%)      0 (0%)      50Mi (0%)        512Mi (0%)     9m59s
  kube-system                 csi-azuredisk-node-5b4nk                    30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     9m59s
  kube-system                 csi-azurefile-node-vlwhv                    30m (0%)      0 (0%)      60Mi (0%)        600Mi (0%)     9m59s
  kube-system                 kube-proxy-4fkxh                            100m (0%)     0 (0%)      0 (0%)           0 (0%)         9m59s

 

Checking the MIG status on the node with nvidia-smi shows the seven GPU partitions:

 

nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                   On |
| N/A   27C    P0              71W / 300W |    726MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    7   0   0  |             102MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    8   0   1  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   11   0   4  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   12   0   5  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   13   0   6  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    7    0      28988      C   python                                       82MiB |
|    0    8    0      29140      C   python                                       84MiB |
|    0    9    0      29335      C   python                                       84MiB |
|    0   10    0      29090      C   python                                       84MiB |
|    0   11    0      29031      C   python                                       84MiB |
|    0   12    0      29190      C   python                                       84MiB |
|    0   13    0      29255      C   python                                       84MiB |
+---------------------------------------------------------------------------------------+

 

A few remarks about MIG to take into account:

  • MIG provides physical GPU partitioning, so the GPU slice associated with one Pod is fully reserved for that Pod
  • CPU and RAM resources should also be considered in the equation: they are not partitioned by MIG and follow the standard AKS requests and limits assignment
  • In AKS, it is important to note that a MIG configuration change on a nodepool that has no allocated nodes is not immediately visible to the next autoscaler operation. For example, requesting 7 GPUs on a nodepool scaled down to 0, right after the first activation of MIG as described above, may bring up 7 nodes instead of one
  • The daemonsets described above only prevent scheduling during the boot-up phase of a node provisioned by the autoscaler. If the MIG profile has to be changed afterwards by changing the MIG label, the node should first be cordoned. The label must be changed at the AKS node pool level (using az aks nodepool update) if it was originally set through the Azure CLI, or at the single-node level (using kubectl patch nodes) if it was set with kubectl

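To make the second point concrete, a Pod requesting one MIG slice should still declare its CPU and memory explicitly, since MIG does not partition them. A sketch of the resources stanza (the values are illustrative, not a recommendation):

```yaml
resources:
  requests:
    cpu: "3"             # CPU is NOT partitioned by MIG: size it explicitly
    memory: 24Gi         # same for memory
  limits:
    nvidia.com/gpu: 1    # with mig.strategy=single, this is one 1g.10gb slice
```

With the single MIG strategy shown above, each `nvidia.com/gpu` unit maps to one MIG partition rather than a full physical GPU.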
For example, in the case above, if we want to move to another profile, we first cordon the node:

 

kubectl get nodes

NAME                                 STATUS   ROLES   AGE     VERSION
aks-nc24a100v4-42670331-vmss00000c   Ready    agent   11m     v1.27.7
aks-nodepool1-25743550-vmss000000    Ready    agent   6d16h   v1.27.7
aks-nodepool1-25743550-vmss000001    Ready    agent   6d16h   v1.27.7

kubectl cordon aks-nc24a100v4-42670331-vmss00000c

node/aks-nc24a100v4-42670331-vmss00000c cordoned

 

Be aware that cordoning a node does not stop running Pods: verify that no GPU accelerated workload is running on the node before submitting the label change.
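One way to check could be listing the Pods scheduled on the node together with their GPU limits (a sketch; the node name is taken from the example above, and the field escaping follows kubectl's jsonpath syntax):

```shell
# List pods running on the example node and show, per pod,
# how many nvidia.com/gpu units its containers request (empty = none)
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=aks-nc24a100v4-42670331-vmss00000c \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}{end}'
```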

Since in our case we applied the label at the AKS node pool level, we need to change it through the Azure CLI:

 

az aks nodepool update --cluster-name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --nodepool-name nc24a100v4 --labels "nvidia.com/mig.config"="all-1g.20gb"
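Had the MIG label originally been set with kubectl rather than through the Azure CLI, the equivalent node-level change would be along these lines (a sketch, reusing the example node name):

```shell
# Node-level label change: use only if the label was originally set
# with kubectl, not through 'az aks nodepool update'
kubectl label nodes aks-nc24a100v4-42670331-vmss00000c \
  nvidia.com/mig.config=all-1g.20gb --overwrite
```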

 

 
This triggers a MIG reconfiguration with the new profile applied:

 

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                   On |
| N/A   42C    P0              77W / 300W |     50MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   2  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    6   0   3  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

 


We can then uncordon the node/nodes:

 

kubectl uncordon aks-nc24a100v4-42670331-vmss00000c

node/aks-nc24a100v4-42670331-vmss00000c uncordoned

 

 

Using NVIDIA GPU Driver CRD (preview)

The NVIDIA GPU Driver CRD allows defining, in a granular way, the driver version and the driver image for each of the nodepools in use in an AKS cluster. As documented in the NVIDIA GPU Operator documentation, this feature is in preview and is not recommended by NVIDIA for production systems.


To enable the NVIDIA GPU Driver CRD, run the following command (if you have already installed the NVIDIA GPU Operator, you will first need to run helm uninstall, taking care of any running workloads):

 

 helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator --set-json daemonsets.tolerations='[{"effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu" }, {"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value":"spot", "operator": "Equal"}]' --set nfd.enabled=false --set driver.nvidiaDriverCRD.deployDefaultCR=false --set driver.nvidiaDriverCRD.enabled=true
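As a quick sanity check, we can verify that the custom resource definition has been registered and that, with deployDefaultCR=false, no default driver object exists yet (the CRD name below matches the current GPU Operator chart; verify it against your installed version):

```shell
# The NVIDIADriver CRD should now be registered in the cluster
kubectl get crd nvidiadrivers.nvidia.com

# With driver.nvidiaDriverCRD.deployDefaultCR=false, no NVIDIADriver
# objects should be listed yet
kubectl get nvidiadriver
```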

 

After this step, create the nodepools with a label that will be used to select the driver version for their nodes (in this case "driver.config"):

 


az aks nodepool add \
   --resource-group $RESOURCE_GROUP_NAME \
   --cluster-name $AKS_CLUSTER_NAME \
   --name nc4latest \
   --node-taints sku=gpu:NoSchedule \
   --node-vm-size Standard_NC4as_T4_v3 \
   --enable-cluster-autoscaler \
   --labels "driver.config"="latest" \
   --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

az aks nodepool add \
   --resource-group $RESOURCE_GROUP_NAME \
   --cluster-name $AKS_CLUSTER_NAME \
   --name nc4stable \
   --node-taints sku=gpu:NoSchedule \
   --node-vm-size Standard_NC4as_T4_v3 \
   --enable-cluster-autoscaler \
   --labels "driver.config"="stable" \
   --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

 

After this step, the driver configuration (an NVIDIADriver custom object) should be created. This can be done with a file called driver-config.yaml with the following content:

 

apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: nc4-latest
spec:
  driverType: gpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    - key: "kubernetes.azure.com/scalesetpriority"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  nodeSelector:
    driver.config: "latest"
  repository: nvcr.io/nvidia
  version: "535.129.03"
---
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: nc4-stable
spec:
  driverType: gpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    - key: "kubernetes.azure.com/scalesetpriority"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  nodeSelector:
    driver.config: "stable"
  repository: nvcr.io/nvidia
  version: "535.104.12"

 


This can then be applied with kubectl:

 

kubectl apply -f driver-config.yaml -n gpu-operator

 

Now, scaling up the nodes (e.g. by submitting a GPU workload with a node selector targeting exactly the desired driver.config label), we can verify that the driver version on each node is the one requested. Running nvidia-smi by attaching a shell to the driver Daemonset container on each of the two nodes:

 

### On latest

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P8              15W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

### On stable

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P8              14W /  70W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
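As an illustration of such a workload, a Pod can be pinned to one of the two pools through a nodeSelector on the driver.config label (the Pod name and container image below are illustrative assumptions, not taken from the original walkthrough):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-latest-driver        # illustrative name
spec:
  nodeSelector:
    driver.config: "latest"       # schedules only on the nc4latest pool
  tolerations:
    - key: "sku"                  # tolerate the taint set on the GPU nodepools
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.2-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  restartPolicy: Never
```

With a min-count of 0 on the nodepool, submitting this Pod is enough to make the cluster autoscaler bring up a node carrying the matching driver configuration.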

 

The NVIDIA GPU Driver CRD also allows specifying a custom Docker image and registry for the NVIDIA driver installation on each node pool.

This becomes particularly useful when we need to install the Azure-specific Virtual GPU (vGPU) drivers on A10 GPUs.

On Azure, NVads_A10_v5 VMs rely on NVIDIA vGPU technology in the backend, so they require vGPU drivers. The vGPU drivers are included in the VM cost, so there is no need to purchase a vGPU license. The binaries available on the Azure driver download page can be used on the supported operating systems (including Ubuntu 22.04), but only on Azure VMs.

In this case, it is possible to build an ad-hoc NVIDIA driver container image for Azure and publish it to a dedicated container registry.


The procedure is as follows (assuming we have an ACR attached to AKS, referenced as <ACR_NAME>):

 

export ACR_NAME=<ACR_NAME>
az acr login -n $ACR_NAME
git clone https://gitlab.com/nvidia/container-images/driver
cd driver
cp -r ubuntu22.04 ubuntu22.04-aks
cd ubuntu22.04-aks
cd drivers
wget "https://download.microsoft.com/download/1/4/4/14450d0e-a3f2-4b0a-9bb4-a8e729e986c4/NVIDIA-Linux-x86_64-535.154.05-grid-azure.run"
mv NVIDIA-Linux-x86_64-535.154.05-grid-azure.run NVIDIA-Linux-x86_64-535.154.05.run
chmod +x NVIDIA-Linux-x86_64-535.154.05.run
cd ..
sed -i 's%/tmp/install.sh download_installer%echo "Skipping Driver Download"%g' Dockerfile
sed -i 's%sh NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run -x%sh NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run -x \&\& mv NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION-grid-azure NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION%g' nvidia-driver
docker build --build-arg DRIVER_VERSION=535.154.05 --build-arg DRIVER_BRANCH=535 --build-arg CUDA_VERSION=12.3.1 --build-arg TARGETARCH=amd64 -t $ACR_NAME/driver:535.154.05-ubuntu22.04 .
docker push $ACR_NAME/driver:535.154.05-ubuntu22.04
 

 

After this, let's create a specific NVIDIADriver object for Azure vGPU with a file named azure-vgpu.yaml and the following content (replace <ACR_NAME> with your ACR login server):

 

apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: azure-vgpu
spec:
  driverType: gpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    - key: "kubernetes.azure.com/scalesetpriority"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  nodeSelector:
    driver.config: "azurevgpu"
  repository: <ACR_NAME>
  version: "535.154.05"

 

Let's apply it with kubectl:

 

kubectl apply -f azure-vgpu.yaml -n gpu-operator

 

Now, let's create an A10 nodepool with Azure CLI:

 

az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME    \
    --cluster-name $AKS_CLUSTER_NAME    \
    --name nv36a10v5     \
    --node-taints sku=gpu:NoSchedule     \
    --node-vm-size Standard_NV36ads_A10_v5    \
    --enable-cluster-autoscaler     \
    --labels "driver.config"="azurevgpu"    \
    --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

 

Scaling up a node with a specific workload and waiting for the driver installation to complete, we can see that the NVIDIA driver image has been pulled from our registry:

 

root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-25743550-vmss000000   Ready    agent   6d23h   v1.27.7
aks-nodepool1-25743550-vmss000001   Ready    agent   6d23h   v1.27.7
aks-nv36a10v5-10653906-vmss000000   Ready    agent   9m24s   v1.27.7
root@aks-gpu-playground-rg-jumpbox:~# kubectl describe node aks-nv36a10v5-10653906-vmss000000| grep gpu-driver
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu-driver-upgrade-enabled: true
  gpu-operator                nvidia-gpu-driver-ubuntu22.04-56df89b87c-6w8tj    0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m29s

root@aks-gpu-playground-rg-jumpbox:~# kubectl describe pods -n gpu-operator nvidia-gpu-driver-ubuntu22.04-56df89b87c-6w8tj | grep -i Image
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:bb845160b32fd12eb3fae3e830d2e6a7780bc7405e0d8c5b816242d48be9daa8
    Image:         aksgpuplayground.azurecr.io/driver:535.154.05-ubuntu22.04
    Image ID:      aksgpuplayground.azurecr.io/driver@sha256:deb6e6311a174ca6a989f8338940bf3b1e6ae115ebf738042063f4c3c95c770f
  Normal   Pulled            4m26s  kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2" already present on machine
  Normal   Pulling           4m23s  kubelet            Pulling image "aksgpuplayground.azurecr.io/driver:535.154.05-ubuntu22.04"
  Normal   Pulled            4m16s  kubelet            Successfully pulled image "aksgpuplayground.azurecr.io/driver:535.154.05-ubuntu22.04" in 6.871887325s (6.871898205s including waiting)

 

Also, attaching to the device-plugin-daemonset Pod, we can see that the A10 vGPU profile is recognized successfully:

 

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-24Q                 On  | 00000002:00:00.0 Off |                    0 |
| N/A   N/A    P8              N/A /  N/A |      0MiB / 24512MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

 

 

Thank you

Thank you for reading our blog post. Feel free to leave any comment or feedback, ask for clarifications, or report any issues.

Updated Feb 27, 2024
Version 6.0

","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"InboxPage","type":"COMMUNITY","urlPath":"/inbox","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"HelpFAQPage","type":"COMMUNITY","urlPath":"/help","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"IdeaMessagePage","type":"IDEA_POST","urlPath":"/idea/:boardId/:messageSubject/:messageId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"IdeaViewAllIdeasPage","type":"IDEA","urlPath":"/category/:categoryId/ideas/:boardId/all-ideas/(/:after|/:before)?","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"LoginPage","type":"USER","urlPath":"/signin","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"BlogPostPage","type":"BLOG","urlPath":"/category/:categoryId/blogs/:boardId/create","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"UserBlogPermissions.Page","type":"COMMUNITY","urlPath":"/c/user-blog-permissions/page","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ThemeEditorPage","type":"COMMUNITY","urlPath":"/designer/themes","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"TkbViewAllArticlesPage","type":"TKB","urlPath":"/category/:categoryId/kb/:boardId/all-articles/(/:after|/:before)?","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1730142000000,"localOverride":null,"page":{"id":"AllEvents","type"
:"CUSTOM","urlPath":"/Events","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"OccasionEditPage","type":"EVENT","urlPath":"/event/:boardId/:messageSubject/:messageId/edit","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"OAuthAuthorizationAllowPage","type":"USER","urlPath":"/auth/authorize/allow","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"PageEditorPage","type":"COMMUNITY","urlPath":"/designer/pages","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"PostPage","type":"COMMUNITY","urlPath":"/category/:categoryId/:boardId/create","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ForumBoardPage","type":"FORUM","urlPath":"/category/:categoryId/discussions/:boardId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"TkbBoardPage","type":"TKB","urlPath":"/category/:categoryId/kb/:boardId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"EventPostPage","type":"EVENT","urlPath":"/category/:categoryId/events/:boardId/create","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"UserBadgesPage","type":"COMMUNITY","urlPath":"/users/:login/:userId/badges","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"GroupHubMembershipAction","type":"GROUP_HUB","urlPath":"/membership/join/:nodeId/:membershipType","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487
429080,"localOverride":null,"page":{"id":"MaintenancePage","type":"COMMUNITY","urlPath":"/maintenance","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"IdeaReplyPage","type":"IDEA_REPLY","urlPath":"/idea/:boardId/:messageSubject/:messageId/comments/:replyId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"UserSettingsPage","type":"USER","urlPath":"/mysettings/:userSettingsTab","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"GroupHubsPage","type":"GROUP_HUB","urlPath":"/groups","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ForumPostPage","type":"FORUM","urlPath":"/category/:categoryId/discussions/:boardId/create","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"OccasionRsvpActionPage","type":"OCCASION","urlPath":"/event/:boardId/:messageSubject/:messageId/rsvp/:responseType","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"VerifyUserEmailPage","type":"USER","urlPath":"/verifyemail/:userId/:verifyEmailToken","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"AllOccasionsPage","type":"OCCASION","urlPath":"/category/:categoryId/events/:boardId/all-events/(/:after|/:before)?","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"EventBoardPage","type":"EVENT","urlPath":"/category/:categoryId/events/:boardId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"TkbReplyPage","type
":"TKB_REPLY","urlPath":"/kb/:boardId/:messageSubject/:messageId/comments/:replyId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"IdeaBoardPage","type":"IDEA","urlPath":"/category/:categoryId/ideas/:boardId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"CommunityGuideLinesPage","type":"COMMUNITY","urlPath":"/communityguidelines","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"CaseCreatePage","type":"SALESFORCE_CASE_CREATION","urlPath":"/caseportal/create","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"TkbEditPage","type":"TKB","urlPath":"/kb/:boardId/:messageSubject/:messageId/edit","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ForgotPasswordPage","type":"USER","urlPath":"/forgotpassword","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"IdeaEditPage","type":"IDEA","urlPath":"/idea/:boardId/:messageSubject/:messageId/edit","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"TagPage","type":"COMMUNITY","urlPath":"/tag/:tagName","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"BlogBoardPage","type":"BLOG","urlPath":"/category/:categoryId/blog/:boardId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"OccasionMessagePage","type":"OCCASION_TOPIC","urlPath":"/event/:boardId/:messageSubject/:messageId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"las
tUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ManageContentPage","type":"COMMUNITY","urlPath":"/managecontent","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ClosedMembershipNodeNonMembersPage","type":"GROUP_HUB","urlPath":"/closedgroup/:groupHubId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"CommunityPage","type":"COMMUNITY","urlPath":"/","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ForumMessagePage","type":"FORUM_TOPIC","urlPath":"/discussions/:boardId/:messageSubject/:messageId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"IdeaPostPage","type":"IDEA","urlPath":"/category/:categoryId/ideas/:boardId/create","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1730142000000,"localOverride":null,"page":{"id":"CommunityHub.Page","type":"CUSTOM","urlPath":"/Directory","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"BlogMessagePage","type":"BLOG_ARTICLE","urlPath":"/blog/:boardId/:messageSubject/:messageId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"RegistrationPage","type":"USER","urlPath":"/register","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"EditGroupHubPage","type":"GROUP_HUB","urlPath":"/group/:groupHubId/edit","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ForumEditPage","type":"FORUM","urlPath":"/discussions/:boardId/:messageSubject/:messageId/edit","__typename":"PageDesc
riptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ResetPasswordPage","type":"USER","urlPath":"/resetpassword/:userId/:resetPasswordToken","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1730142000000,"localOverride":null,"page":{"id":"AllBlogs.Page","type":"CUSTOM","urlPath":"/blogs","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"TkbMessagePage","type":"TKB_ARTICLE","urlPath":"/kb/:boardId/:messageSubject/:messageId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"BlogEditPage","type":"BLOG","urlPath":"/blog/:boardId/:messageSubject/:messageId/edit","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ManageUsersPage","type":"USER","urlPath":"/users/manage/:tab?/:manageUsersTab?","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ForumReplyPage","type":"FORUM_REPLY","urlPath":"/discussions/:boardId/:messageSubject/:messageId/replies/:replyId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"PrivacyPolicyPage","type":"COMMUNITY","urlPath":"/privacypolicy","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"NotificationPage","type":"COMMUNITY","urlPath":"/notifications","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"UserPage","type":"USER","urlPath":"/users/:login/:userId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"OccasionReplyPage","type":"OCCASION_REPLY","ur
lPath":"/event/:boardId/:messageSubject/:messageId/comments/:replyId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ManageMembersPage","type":"GROUP_HUB","urlPath":"/group/:groupHubId/manage/:tab?","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"SearchResultsPage","type":"COMMUNITY","urlPath":"/search","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"BlogReplyPage","type":"BLOG_REPLY","urlPath":"/blog/:boardId/:messageSubject/:messageId/replies/:replyId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"GroupHubPage","type":"GROUP_HUB","urlPath":"/group/:groupHubId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"TermsOfServicePage","type":"COMMUNITY","urlPath":"/termsofservice","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"CategoryPage","type":"CATEGORY","urlPath":"/category/:categoryId","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"ForumViewAllTopicsPage","type":"FORUM","urlPath":"/category/:categoryId/discussions/:boardId/all-topics/(/:after|/:before)?","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"TkbPostPage","type":"TKB","urlPath":"/category/:categoryId/kbs/:boardId/create","__typename":"PageDescriptor"},"__typename":"PageResource"},{"lastUpdatedTime":1745487429080,"localOverride":null,"page":{"id":"GroupHubPostPage","type":"GROUP_HUB","urlPath":"/group/:groupHubId/:boardId/create","__typename":"PageDescriptor"},"__typename":"P
ageResource"}],"localOverride":false},"CachedAsset:text:en_US-components/context/AppContext/AppContextProvider-0":{"__typename":"CachedAsset","id":"text:en_US-components/context/AppContext/AppContextProvider-0","value":{"noCommunity":"Cannot find community","noUser":"Cannot find current user","noNode":"Cannot find node with id {nodeId}","noMessage":"Cannot find message with id {messageId}"},"localOverride":false},"CachedAsset:text:en_US-shared/client/components/common/Loading/LoadingDot-0":{"__typename":"CachedAsset","id":"text:en_US-shared/client/components/common/Loading/LoadingDot-0","value":{"title":"Loading..."},"localOverride":false},"User:user:-1":{"__typename":"User","id":"user:-1","uid":-1,"login":"Deleted","email":"","avatar":null,"rank":null,"kudosWeight":1,"registrationData":{"__typename":"RegistrationData","status":"ANONYMOUS","registrationTime":null,"confirmEmailStatus":false,"registrationAccessLevel":"VIEW","ssoRegistrationFields":[]},"ssoId":null,"profileSettings":{"__typename":"ProfileSettings","dateDisplayStyle":{"__typename":"InheritableStringSettingWithPossibleValues","key":"layout.friendly_dates_enabled","value":"false","localValue":"true","possibleValues":["true","false"]},"dateDisplayFormat":{"__typename":"InheritableStringSetting","key":"layout.format_pattern_date","value":"MMM dd 
yyyy","localValue":"MM-dd-yyyy"},"language":{"__typename":"InheritableStringSettingWithPossibleValues","key":"profile.language","value":"en-US","localValue":"en","possibleValues":["en-US"]}},"deleted":false},"Theme:customTheme1":{"__typename":"Theme","id":"customTheme1"},"Category:category:Azure":{"__typename":"Category","id":"category:Azure","entityType":"CATEGORY","displayId":"Azure","nodeType":"category","depth":3,"title":"Azure","shortTitle":"Azure","parent":{"__ref":"Category:category:products-services"},"categoryPolicies":{"__typename":"CategoryPolicies","canReadNode":{"__typename":"PolicyResult","failureReason":null}}},"Category:category:top":{"__typename":"Category","id":"category:top","displayId":"top","nodeType":"category","depth":0,"title":"Top","entityType":"CATEGORY","shortTitle":"Top"},"Category:category:communities":{"__typename":"Category","id":"category:communities","displayId":"communities","nodeType":"category","depth":1,"parent":{"__ref":"Category:category:top"},"title":"Communities","entityType":"CATEGORY","shortTitle":"Communities"},"Category:category:products-services":{"__typename":"Category","id":"category:products-services","displayId":"products-services","nodeType":"category","depth":2,"parent":{"__ref":"Category:category:communities"},"title":"Products","entityType":"CATEGORY","shortTitle":"Products"},"Blog:board:AzureHighPerformanceComputingBlog":{"__typename":"Blog","id":"board:AzureHighPerformanceComputingBlog","entityType":"BLOG","displayId":"AzureHighPerformanceComputingBlog","nodeType":"board","depth":4,"conversationStyle":"BLOG","title":"Azure High Performance Computing (HPC) 
Blog","description":"","avatar":null,"profileSettings":{"__typename":"ProfileSettings","language":null},"parent":{"__ref":"Category:category:Azure"},"ancestors":{"__typename":"CoreNodeConnection","edges":[{"__typename":"CoreNodeEdge","node":{"__ref":"Community:community:gxcuf89792"}},{"__typename":"CoreNodeEdge","node":{"__ref":"Category:category:communities"}},{"__typename":"CoreNodeEdge","node":{"__ref":"Category:category:products-services"}},{"__typename":"CoreNodeEdge","node":{"__ref":"Category:category:Azure"}}]},"userContext":{"__typename":"NodeUserContext","canAddAttachments":false,"canUpdateNode":false,"canPostMessages":false,"isSubscribed":false},"boardPolicies":{"__typename":"BoardPolicies","canPublishArticleOnCreate":{"__typename":"PolicyResult","failureReason":{"__typename":"FailureReason","message":"error.lithium.policies.forums.policy_can_publish_on_create_workflow_action.accessDenied","key":"error.lithium.policies.forums.policy_can_publish_on_create_workflow_action.accessDenied","args":[]}}},"shortTitle":"Azure High Performance Computing (HPC) 
Blog","repliesProperties":{"__typename":"RepliesProperties","sortOrder":"REVERSE_PUBLISH_TIME","repliesFormat":"threaded"},"eventPath":"category:Azure/category:products-services/category:communities/community:gxcuf89792board:AzureHighPerformanceComputingBlog/","tagProperties":{"__typename":"TagNodeProperties","tagsEnabled":{"__typename":"PolicyResult","failureReason":null}},"requireTags":true,"tagType":"PRESET_ONLY"},"AssociatedImage:{\"url\":\"https://techcommunity.microsoft.com/t5/s/gxcuf89792/images/cmstNC05WEo0blc\"}":{"__typename":"AssociatedImage","url":"https://techcommunity.microsoft.com/t5/s/gxcuf89792/images/cmstNC05WEo0blc","height":512,"width":512,"mimeType":"image/png"},"Rank:rank:4":{"__typename":"Rank","id":"rank:4","position":6,"name":"Microsoft","color":"333333","icon":{"__ref":"AssociatedImage:{\"url\":\"https://techcommunity.microsoft.com/t5/s/gxcuf89792/images/cmstNC05WEo0blc\"}"},"rankStyle":"OUTLINE"},"User:user:1563537":{"__typename":"User","id":"user:1563537","uid":1563537,"login":"wolfgangdesalvador","deleted":false,"avatar":{"__typename":"UserAvatar","url":"https://techcommunity.microsoft.com/t5/s/gxcuf89792/images/dS0xNTYzNTM3LTQxNzMwM2lEQkFDQ0Y2MjI1QThFMzFC"},"rank":{"__ref":"Rank:rank:4"},"email":"","messagesCount":5,"biography":null,"topicsCount":5,"kudosReceivedCount":12,"kudosGivenCount":0,"kudosWeight":1,"registrationData":{"__typename":"RegistrationData","status":null,"registrationTime":"2022-10-12T09:46:45.624-07:00","confirmEmailStatus":null},"followersCount":null,"solutionsCount":0},"BlogTopicMessage:message:4061318":{"__typename":"BlogTopicMessage","uid":4061318,"subject":"Running GPU accelerated workloads with NVIDIA GPU Operator on 
AKS","id":"message:4061318","revisionNum":32,"repliesCount":1,"author":{"__ref":"User:user:1563537"},"depth":0,"hasGivenKudo":false,"board":{"__ref":"Blog:board:AzureHighPerformanceComputingBlog"},"conversation":{"__ref":"Conversation:conversation:4061318"},"messagePolicies":{"__typename":"MessagePolicies","canPublishArticleOnEdit":{"__typename":"PolicyResult","failureReason":{"__typename":"FailureReason","message":"error.lithium.policies.forums.policy_can_publish_on_edit_workflow_action.accessDenied","key":"error.lithium.policies.forums.policy_can_publish_on_edit_workflow_action.accessDenied","args":[]}},"canModerateSpamMessage":{"__typename":"PolicyResult","failureReason":{"__typename":"FailureReason","message":"error.lithium.policies.feature.moderation_spam.action.moderate_entity.allowed.accessDenied","key":"error.lithium.policies.feature.moderation_spam.action.moderate_entity.allowed.accessDenied","args":[]}}},"contentWorkflow":{"__typename":"ContentWorkflow","state":"PUBLISH","scheduledPublishTime":null,"scheduledTimezone":null,"userContext":{"__typename":"MessageWorkflowContext","canSubmitForReview":null,"canEdit":false,"canRecall":null,"canSubmitForPublication":null,"canReturnToAuthor":null,"canPublish":null,"canReturnToReview":null,"canSchedule":false},"shortScheduledTimezone":null},"readOnly":false,"editFrozen":false,"moderationData":{"__ref":"ModerationData:moderation_data:4061318"},"teaser":"

\n

The focus of this article will be on getting NVIDIA GPUs managed and configured in the best way on Azure Kuberentes Services using NVIDIA GPU Operator for HPC/AI workloads requiring a high degree of customization and granular control over the compute-resources configuration

","body":"


 

Deploying a vanilla AKS cluster

The standard way of deploying a vanilla AKS cluster is to follow the procedure described in the Azure documentation.

\n

Please be aware that this command will create an AKS cluster with:

\n\n

In general, we strongly recommend for production workloads to look the main security concepts for AKS cluster.

\n\n

This will be out of scope for the present demo, but please be aware that this cluster is meant for NVIDIA GPU Operator demo purposes only.

\n

Using Azure CLI we can create an AKS cluster with this procedure (replace the values between arrows with your preferred values):

export RESOURCE_GROUP_NAME=<YOUR_RG_NAME>
export AKS_CLUSTER_NAME=<YOUR_AKS_CLUSTER_NAME>
export LOCATION=<YOUR_LOCATION>

## Following line to be used only if Resource Group is not available
az group create --resource-group $RESOURCE_GROUP_NAME --location $LOCATION

az aks create --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME --node-count 2 --generate-ssh-keys

 

Connecting to the cluster

Several ways to connect to the AKS cluster are documented in the Azure documentation.

Our favorite approach is using a Linux Ubuntu VM with the Azure CLI installed.

This allows us to run the following commands (in the login command, you may need to add --tenant <TENANT_ID> if you have access to multiple tenants, or --identity if the VM is on Azure and relies on an Azure Managed Identity):

## Add --tenant <TENANT_ID> in case of multiple tenants
## Add --identity in case of using a managed identity on the VM
az login
az aks install-cli
az aks get-credentials --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME

 

After this is completed, you should be able to perform standard kubectl commands like:

kubectl get nodes

root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-25743550-vmss000000   Ready    agent   2d19h   v1.27.7
aks-nodepool1-25743550-vmss000001   Ready    agent   2d19h   v1.27.7

 

The command line will be perfectly fine for all the operations in this blog post. However, if you would like a TUI experience, we suggest using k9s, which can be easily installed on Linux following its installation instructions. For Ubuntu, you can install the version current at the time of writing with:

wget "https://github.com/derailed/k9s/releases/download/v0.31.9/k9s_linux_amd64.deb"
dpkg -i k9s_linux_amd64.deb

 

k9s allows you to interact easily with the different resources of the AKS cluster directly from a terminal user interface. It can be launched with the k9s command. Detailed documentation on how to navigate the different resources (Pods, DaemonSets, Nodes) can be found on the official k9s documentation page.

Attaching an Azure Container Registry to the Azure Kubernetes Service cluster (only required for MIG and NVIDIA GPU Driver CRD)

In case you will be using MIG or the NVIDIA GPU Driver CRD, it is necessary to create a private Azure Container Registry and attach it to the AKS cluster.

export ACR_NAME=<ACR_NAME_OF_YOUR_CHOICE>

az acr create --resource-group $RESOURCE_GROUP_NAME \
  --name $ACR_NAME --sku Basic

az aks update --name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --attach-acr $ACR_NAME

 

You will be able to perform pull and push operations against this Container Registry through Docker, using the following command on a VM with a container engine installed, provided that the VM has a managed identity with AcrPull/AcrPush permissions:

az acr login --name $ACR_NAME
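For reference, images in an attached registry are addressed as <registry>.azurecr.io/<repository>:<tag>. The snippet below composes such a reference and shows, commented out, how a push could look; the registry and image names are hypothetical placeholders, for illustration only:

```shell
# Compose a fully qualified image reference for the attached registry.
# "myregistry" and "cuda-workload:v1" are hypothetical placeholders.
ACR_NAME=${ACR_NAME:-myregistry}
IMAGE="$ACR_NAME.azurecr.io/cuda-workload:v1"
echo "$IMAGE"

# After 'az acr login --name $ACR_NAME', the image could be pushed with:
#   docker tag cuda-workload:v1 "$IMAGE"
#   docker push "$IMAGE"
```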

 

About taints for AKS GPU nodes

It is important to deeply understand the concept of taints and tolerations for GPU nodes in AKS. This is critical for two reasons:

\n\n
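For instance, a Pod that should land on a tainted GPU node must both tolerate the taint and request a GPU resource. A minimal sketch, assuming the sku=gpu:NoSchedule taint used later in this post (the pod name and image tag are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test   # hypothetical name
spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04   # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```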

However, the latter point can also be addressed in a more elegant and specific way using an affinity declaration in the Job or Pod spec requesting GPUs, for example:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - Standard_NC4as_T4_v3

 

Creating the first GPU pool

The AKS cluster created above has, by default, only a node pool with 2 nodes of Standard_DS2_v2 VMs.

In order to test the NVIDIA GPU Operator and run some GPU accelerated workloads, we should add a GPU node pool.

In case the NVIDIA stack is meant to be managed by the GPU Operator, it is critical that the node pool is created with the tag:

SkipGPUDriverInstall=true
This can be done using Azure Cloud Shell, for example using an NC4as_T4_v3 VM size and setting autoscaling from 0 up to 1 node:

az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nc4ast4 \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NC4as_T4_v3 \
    --enable-cluster-autoscaler \
    --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

 

In order to deploy in Spot mode, the following flags should be added to the Azure CLI command:

--priority Spot --eviction-policy Delete --spot-max-price -1
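Be aware that AKS automatically taints Spot nodes with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so workloads targeting the Spot pool need a corresponding toleration, for example:

```yaml
tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"
```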

 

Recently, a preview feature has been released that allows skipping the creation of the tags:

# Register the aks-preview extension
az extension add --name aks-preview

# Update the aks-preview extension
az extension update --name aks-preview

az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nc4ast4 \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NC4as_T4_v3 \
    --enable-cluster-autoscaler \
    --min-count 0 --max-count 1 --node-count 0 --skip-gpu-driver-install

 

At the end of the process, you should see the node pool defined in the portal with status “Succeeded”:

az aks nodepool list --cluster-name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME -o table
Name       OsType    KubernetesVersion    VmSize                Count    MaxPods    ProvisioningState    Mode
---------  --------  -------------------  --------------------  -------  ---------  -------------------  ------
nodepool1  Linux     1.27.7               Standard_DS2_v2       2        110        Succeeded            System
nc4ast4    Linux     1.27.7               Standard_NC4as_T4_v3  0        110        Succeeded            User

 

\n

 

\n

Install NVIDIA GPU Operator

On the machine with kubectl configured and the context pointing to the AKS cluster as described above, run the following to install Helm:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
    && chmod 700 get_helm.sh \
    && ./get_helm.sh

To fine-tune node feature recognition, we will install Node Feature Discovery separately from the NVIDIA GPU Operator. The GPU Operator requires the label feature.node.kubernetes.io/pci-10de.present=true to be applied to the nodes. Moreover, it is important to tune Node Feature Discovery so that it is scheduled even on Spot instances of the Kubernetes cluster and on instances carrying the taint sku=gpu:

helm install --wait --create-namespace -n gpu-operator node-feature-discovery node-feature-discovery \
    --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts \
    --set-json master.config.extraLabelNs='["nvidia.com"]' \
    --set-json worker.tolerations='[{ "effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"},{"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value": "spot", "operator": "Equal"},{"effect": "NoSchedule", "key": "mig", "value": "notReady", "operator": "Equal"}]'

After enabling Node Feature Discovery, it is important to create a custom rule to precisely match NVIDIA GPUs on the nodes. This can be done by creating a file called nfd-gpu-rule.yaml with the following content:

apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nfd-gpu-rule
spec:
  rules:
  - name: "nfd-gpu-rule"
    labels:
      "feature.node.kubernetes.io/pci-10de.present": "true"
    matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}

After this file is created, we should apply it to the AKS cluster:

kubectl apply -n gpu-operator -f nfd-gpu-rule.yaml

After this step, it is necessary to add the NVIDIA Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

The next step is installing the GPU Operator, remembering to apply the same tolerations to the GPU Operator DaemonSets and to disable the deployment of Node Feature Discovery (nfd), which was already performed in the previous step:

helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator \
    --set-json daemonsets.tolerations='[{ "effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"},{"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value": "spot", "operator": "Equal"},{"effect": "NoSchedule", "key": "mig", "value": "notReady", "operator": "Equal"}]' \
    --set nfd.enabled=false

Running the first GPU example

Once the configuration has been completed, it is time to check the functionality of the GPU Operator by submitting a first GPU accelerated Job on AKS. In this stage we will use as a reference the standard TensorFlow example that is also documented in the official AKS Azure Learn pages.

Create a file called gpu-accelerated.yaml with this content:
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo
  name: samples-tf-mnist-demo
spec:
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_NC4as_T4_v3
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
          - mountPath: /tmp
            name: scratch
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      volumes:
        - name: scratch
          hostPath:
            # directory location on host
            path: /mnt/tmp
            # this field is optional
            type: DirectoryOrCreate

This job can be submitted with the following command:

kubectl apply -f gpu-accelerated.yaml

After approximately one minute the node should be automatically provisioned:

root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nc4ast4-81279986-vmss000003     Ready    agent   2m38s   v1.27.7
aks-nodepool1-25743550-vmss000000   Ready    agent   4d16h   v1.27.7
aks-nodepool1-25743550-vmss000001   Ready    agent   4d16h   v1.27.7

We can check that Node Feature Discovery has properly labeled the node:

root@aks-gpu-playground-rg-jumpbox:~# kubectl describe nodes aks-nc4ast4-81279986-vmss000003 | grep pci-
feature.node.kubernetes.io/pci-0302_10de.present=true
feature.node.kubernetes.io/pci-10de.present=true

The NVIDIA GPU Operator DaemonSets will then start preparing the node, installing the NVIDIA driver and the NVIDIA Container Toolkit and running the related validation.

Once node preparation is completed, the GPU Operator will add an allocatable GPU resource to the node:

kubectl describe nodes aks-nc4ast4-81279986-vmss000003
…
Allocatable:
  cpu:                3860m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             24487780Ki
  nvidia.com/gpu:     1
  pods:               110
…

We can follow the process with the kubectl logs command:

root@aks-gpu-playground-rg-jumpbox:~# kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
samples-tf-mnist-demo-tmpr4   1/1     Running   0          11m

root@aks-gpu-playground-rg-jumpbox:~# kubectl logs samples-tf-mnist-demo-tmpr4 --follow
2024-02-18 11:51:31.479768: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2024-02-18 11:51:31.806125: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0001:00:00.0
totalMemory: 15.57GiB freeMemory: 15.47GiB
2024-02-18 11:51:31.806157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5)
2024-02-18 11:54:56.216820: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1201
Accuracy at step 10: 0.7364
…..
Accuracy at step 490: 0.9559
Adding run metadata for 499

Time-slicing configuration

An extremely useful feature of the NVIDIA GPU Operator is time-slicing. Time-slicing allows a physical GPU available on a node to be shared by multiple Pods. Of course, this is a time-based scheduling partition and not a physical GPU partition: the GPU processes run by the different Pods each receive a proportional share of GPU compute time. However, if a Pod is particularly demanding in terms of GPU processing, it will significantly impact the other Pods sharing the GPU.

In the official NVIDIA GPU Operator documentation there are different ways to configure time-slicing. Here, considering that one of the benefits of a cloud environment is the possibility of having multiple node pools, each with a different GPU model or configuration, we will focus on a fine-grained definition of time-slicing at the node pool level.
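The proportional sharing described above can be sketched with a toy model (purely illustrative, not NVIDIA's actual scheduler; the function name is made up for this example):

```python
# Toy model of GPU time-slicing: under ideal round-robin scheduling,
# each of n_pods sharing one GPU sees roughly 1/n_pods of the device
# throughput. Illustration only, not NVIDIA's scheduling algorithm.
def effective_throughput(device_flops: float, n_pods: int) -> float:
    """Approximate per-pod throughput under ideal time-slicing."""
    if n_pods < 1:
        raise ValueError("at least one pod is required")
    return device_flops / n_pods

# Two pods sharing a T4 (~8.1 TFLOPS FP32) each see roughly half of it,
# and a single demanding pod degrades its neighbor accordingly.
print(effective_throughput(8.1e12, 2))  # ~4.05e12 FLOPS per pod
```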

The steps to enable time-slicing are three: labelling the nodes, creating a time-slicing ConfigMap, and patching the cluster policy.

As a first step, the nodes should be labelled with the key “nvidia.com/device-plugin.config”. For example, let’s label our node pool from the Azure CLI:
az aks nodepool update --cluster-name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --nodepool-name nc4ast4 --labels "nvidia.com/device-plugin.config=tesla-t4-ts2"

After this step, let’s create the ConfigMap object required to configure a time-slicing factor of 2 on this node pool, in a file called time-slicing-config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  tesla-t4-ts2: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
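Because the value of the nvidia.com/device-plugin.config label selects a key inside this ConfigMap, a single ConfigMap can hold one entry per node pool. A sketch of what a multi-pool configuration could look like (the tesla-t4-ts4 key and its replicas value are illustrative assumptions, not part of this walkthrough):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  # Node pools labelled nvidia.com/device-plugin.config=tesla-t4-ts2
  tesla-t4-ts2: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
  # Hypothetical second pool labelled ...config=tesla-t4-ts4
  tesla-t4-ts4: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```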

Let’s apply the configuration in the GPU Operator namespace:

kubectl apply -f time-slicing-config.yaml -n gpu-operator

Finally, let’s update the cluster policy to enable the time-slicing configuration:

kubectl patch clusterpolicy/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config"}}}}'

Now, let’s resubmit the job used in the first step, this time with two replicas, creating a file called gpu-accelerated-time-slicing.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo-ts
  name: samples-tf-mnist-demo-ts
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo-ts
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_NC4as_T4_v3
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"

Let’s submit the job with the standard syntax:

kubectl apply -f gpu-accelerated-time-slicing.yaml

Once the node has been provisioned, it will expose two allocatable GPU resources and run the two Pods concurrently:

kubectl describe node aks-nc4ast4-81279986-vmss000004
...
Allocatable:
  cpu:                3860m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             24487780Ki
  nvidia.com/gpu:     2
  pods:               110
.....
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  default                     samples-tf-mnist-demo-ts-0-4tdcf            0 (0%)        0 (0%)      0 (0%)           0 (0%)         29s
  default                     samples-tf-mnist-demo-ts-1-67hn4            0 (0%)        0 (0%)      0 (0%)           0 (0%)         29s
  gpu-operator                gpu-feature-discovery-lksj7                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         59s
  gpu-operator                node-feature-discovery-worker-wbbct         0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m11s
  gpu-operator                nvidia-container-toolkit-daemonset-8nmx7    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m24s
  gpu-operator                nvidia-dcgm-exporter-76rs8                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m24s
  gpu-operator                nvidia-device-plugin-daemonset-btwz7        0 (0%)        0 (0%)      0 (0%)           0 (0%)         55s
  gpu-operator                nvidia-driver-daemonset-8dkkh               0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m6s
  gpu-operator                nvidia-operator-validator-s7294             0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m24s
  kube-system                 azure-ip-masq-agent-fjm5d                   100m (2%)     500m (12%)  50Mi (0%)        250Mi (1%)     9m18s
  kube-system                 cloud-node-manager-9wpsm                    50m (1%)      0 (0%)      50Mi (0%)        512Mi (2%)     9m18s
  kube-system                 csi-azuredisk-node-ckqw6                    30m (0%)      0 (0%)      60Mi (0%)        400Mi (1%)     9m18s
  kube-system                 csi-azurefile-node-xmfbd                    30m (0%)      0 (0%)      60Mi (0%)        600Mi (2%)     9m18s
  kube-system                 kube-proxy-7l856                            100m (2%)     0 (0%)      0 (0%)           0 (0%)         9m18s

A few remarks about time-slicing:

Let’s imagine, for example, a node pool scaled down to zero that has never had time-slicing applied, and let’s assume we configure it with a time-slicing factor of 2. Submitting a request for 2 GPU resources may still allocate 2 nodes.

This is because the autoscaler remembers that each node provides only 1 allocatable GPU. Once a node correctly exposes 2 allocatable GPUs for the first time, the AKS autoscaler will acknowledge that and act accordingly in future autoscaling operations.
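The accounting behind this behavior can be sketched as follows (purely illustrative, not actual cluster-autoscaler code; the function name is made up for the example):

```python
import math

# Illustrative sketch of scale-up accounting: the autoscaler estimates
# how many nodes to add from the allocatable GPU count it last observed
# per node in the pool. Not the real cluster-autoscaler implementation.
def nodes_to_provision(requested_gpus: int, remembered_gpus_per_node: int) -> int:
    return math.ceil(requested_gpus / remembered_gpus_per_node)

# Before the pool has ever exposed time-sliced GPUs, the autoscaler
# remembers 1 GPU per node, so 2 requested GPUs trigger 2 nodes...
print(nodes_to_provision(2, 1))  # 2
# ...while after it has observed replicas=2, a single node suffices.
print(nodes_to_provision(2, 2))  # 1
```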

Multi-Instance GPU (MIG)

NVIDIA Multi-Instance GPU (MIG) allows GPU partitioning on the Ampere and Hopper architectures. An available GPU can be partitioned at the hardware level (rather than at the time-slicing level), meaning that Pods get access to a dedicated portion of the GPU resources that is delimited in hardware.

In Kubernetes, two strategies are available for MIG: single and mixed.

In the single strategy, the nodes expose a standard “nvidia.com/gpu” set of resources.

In the mixed strategy, the nodes expose the specific MIG profiles as resources, like in the example below:

Allocatable:
  nvidia.com/mig-1g.5gb:   1
  nvidia.com/mig-2g.10gb:  1
  nvidia.com/mig-3g.20gb:  1
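Under the mixed strategy, a Pod then requests a specific MIG profile instead of the generic GPU resource. A minimal sketch of such a request (the Pod name, image tag and chosen profile are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        # Request one 1g.5gb MIG slice instead of nvidia.com/gpu
        nvidia.com/mig-1g.5gb: 1
```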

In order to use MIG, you could follow the standard AKS documentation. However, we would like to propose here a method relying entirely on the NVIDIA GPU Operator.

As a first step, it is necessary to allow the reboot of nodes so that the MIG configuration can be applied:

kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge -p '{"spec": {"migManager": {"env": [{"name": "WITH_REBOOT", "value": "true"}]}}}'

Let’s start by creating a node pool powered by a GPU supporting MIG on Azure, like the SKU Standard_NC24ads_A100_v4, and let’s label the node with one of the MIG profiles available for the A100 80 GiB:

az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nc24a100v4 \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --enable-cluster-autoscaler \
    --min-count 0 --max-count 1 --node-count 0 --skip-gpu-driver-install --labels "nvidia.com/mig.config"="all-1g.10gb"
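As a rough mental model of the profile names (an illustrative sketch, not an NVIDIA API): a profile “<n>g.<m>gb” takes <n> of the A100's 7 compute slices and roughly <m> GB of memory, so the “all-1g.10gb” layout on an A100 80GB yields 7 identical instances:

```python
# Illustrative MIG profile arithmetic for an A100 (7 compute slices).
# Real MIG placement rules are more nuanced; this only shows the idea
# behind profile names like 1g.10gb or 3g.40gb.
def instances_for_profile(compute_slices_used: int, total_slices: int = 7) -> int:
    """How many identical instances of a profile fit on one GPU."""
    return total_slices // compute_slices_used

print(instances_for_profile(1))  # 7 instances of 1g.10gb
print(instances_for_profile(3))  # 2 instances of 3g.40gb
```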

There is another important detail to consider at this stage with AKS: node auto-scaling will bring up nodes with a standard GPU configuration, without MIG activated. The NVIDIA GPU Operator will install the drivers, and mig-manager will then activate the proper MIG configuration profile and reboot the node. Between these two phases there is a small time window where the GPU resources are exposed by the node, and this could potentially trigger a job execution.

To support this scenario, it is important to deploy on AKS an additional DaemonSet that prevents any Pod from being scheduled during the MIG configuration. This is available in a dedicated repository.

To deploy the DaemonSet:

export NAMESPACE=gpu-operator
export ACR_NAME=<YOUR_ACR_NAME>
git clone https://github.com/wolfgang-desalvador/aks-mig-monitor.git
cd aks-mig-monitor
sed -i "s/<ACR_NAME>/$ACR_NAME/g" mig-monitor-daemonset.yaml
sed -i "s/<NAMESPACE>/$NAMESPACE/g" mig-monitor-roles.yaml
docker build . -t $ACR_NAME/aks-mig-monitor
docker push $ACR_NAME/aks-mig-monitor
kubectl apply -f mig-monitor-roles.yaml -n $NAMESPACE
kubectl apply -f mig-monitor-daemonset.yaml -n $NAMESPACE

We can now try to submit a MIG job, defined in a file called mig-accelerated-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo-mig
  name: samples-tf-mnist-demo-mig
spec:
  completions: 7
  parallelism: 7
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo-mig
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_NC24ads_A100_v4
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"

Then we submit the job with kubectl:

kubectl apply -f mig-accelerated-job.yaml

After the node starts up, it will initially carry the taint mig=notReady:NoSchedule, since the MIG configuration is not yet completed. The GPU Operator containers will be installed:

kubectl describe nodes aks-nc24a100v4-42670331-vmss00000a

Name:               aks-nc24a100v4-42670331-vmss00000a
...
                    nvidia.com/mig.config=all-1g.10gb
...
Taints:             kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                    mig=notReady:NoSchedule
                    sku=gpu:NoSchedule
...
Non-terminated Pods:          (13 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  gpu-operator                aks-mig-monitor-64zpl                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         16s
  gpu-operator                gpu-feature-discovery-wpd2j                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator                node-feature-discovery-worker-79h68         0 (0%)        0 (0%)      0 (0%)           0 (0%)         16s
  gpu-operator                nvidia-container-toolkit-daemonset-q5p9k    0 (0%)        0 (0%)      0 (0%)           0 (0%)         12s
  gpu-operator                nvidia-dcgm-exporter-9g5kg                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator                nvidia-device-plugin-daemonset-5wpzk        0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator                nvidia-driver-daemonset-kqkzb               0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator                nvidia-operator-validator-lx77m             0 (0%)        0 (0%)      0 (0%)           0 (0%)         12s
  kube-system                 azure-ip-masq-agent-7rd2x                   100m (0%)     500m (2%)   50Mi (0%)        250Mi (0%)     66s
  kube-system                 cloud-node-manager-dc756                    50m (0%)      0 (0%)      50Mi (0%)        512Mi (0%)     66s
  kube-system                 csi-azuredisk-node-5b4nk                    30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     66s
  kube-system                 csi-azurefile-node-vlwhv                    30m (0%)      0 (0%)      60Mi (0%)        600Mi (0%)     66s
  kube-system                 kube-proxy-4fkxh                            100m (0%)     0 (0%)      0 (0%)           0 (0%)         66s

After the GPU Operator configuration is completed, mig-manager will be deployed. The MIG configuration will be applied and the node will then be set in a rebooting state:

kubectl describe nodes aks-nc24a100v4-42670331-vmss00000a

                    nvidia.com/mig.config=all-1g.10gb
                    nvidia.com/mig.strategy=single
                    nvidia.com/mig.config.state=rebooting
...
Taints:             kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                    mig=notReady:NoSchedule
                    sku=gpu:NoSchedule
...
Non-terminated Pods:          (14 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  gpu-operator                aks-mig-monitor-64zpl                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m6s
  gpu-operator                gpu-feature-discovery-6btwx                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator                node-feature-discovery-worker-79h68         0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m6s
  gpu-operator                nvidia-container-toolkit-daemonset-wplkb    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator                nvidia-dcgm-exporter-vnscq                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator                nvidia-device-plugin-daemonset-d86dn        0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator                nvidia-driver-daemonset-kqkzb               0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m3s
  gpu-operator                nvidia-mig-manager-t4bw9                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2s
  gpu-operator                nvidia-operator-validator-jrfkn             0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  kube-system                 azure-ip-masq-agent-7rd2x                   100m (0%)     500m (2%)   50Mi (0%)        250Mi (0%)     4m56s
  kube-system                 cloud-node-manager-dc756                    50m (0%)      0 (0%)      50Mi (0%)        512Mi (0%)     4m56s
  kube-system                 csi-azuredisk-node-5b4nk                    30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     4m56s
  kube-system                 csi-azurefile-node-vlwhv                    30m (0%)      0 (0%)      60Mi (0%)        600Mi (0%)     4m56s
  kube-system                 kube-proxy-4fkxh                            100m (0%)     0 (0%)      0 (0%)           0 (0%)         4m56s

After the reboot, the MIG configuration will switch to state "success" and the mig=notReady taint will be removed. Scheduling of the 7 Pods of our job will then start:

kubectl describe nodes aks-nc24a100v4-42670331-vmss00000a
...
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.config=all-1g.10gb
                    nvidia.com/mig.config.state=success
                    nvidia.com/mig.strategy=single
...
Taints:             kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                    sku=gpu:NoSchedule
...
Allocatable:
  cpu:                23660m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             214295444Ki
  nvidia.com/gpu:     7
  pods:               110
...
Non-terminated Pods:          (21 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  default                     samples-tf-mnist-demo-ts-0-5bs64            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-1-2msdh            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-2-ck8c8            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-3-dlkfn            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-4-899fr            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-5-dmgpn            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default                     samples-tf-mnist-demo-ts-6-pvzm4            0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  gpu-operator                aks-mig-monitor-64zpl                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         9m9s
  gpu-operator                gpu-feature-discovery-5t9gn                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  gpu-operator                node-feature-discovery-worker-79h68         0 (0%)        0 (0%)      0 (0%)           0 (0%)         9m9s
  gpu-operator                nvidia-container-toolkit-daemonset-82dgg    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m22s
  gpu-operator                nvidia-dcgm-exporter-xbxqf                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  gpu-operator                nvidia-device-plugin-daemonset-8gkzd        0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  gpu-operator                nvidia-driver-daemonset-kqkzb               0 (0%)        0 (0%)      0 (0%)           0 (0%)         9m6s
  gpu-operator                nvidia-mig-manager-jbqls                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m22s
  gpu-operator                nvidia-operator-validator-5rdbh             0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  kube-system                 azure-ip-masq-agent-7rd2x                   100m (0%)     500m (2%)   50Mi (0%)        250Mi (0%)     9m59s
  kube-system                 cloud-node-manager-dc756                    50m (0%)      0 (0%)      50Mi (0%)        512Mi (0%)     9m59s
  kube-system                 csi-azuredisk-node-5b4nk                    30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     9m59s
  kube-system                 csi-azurefile-node-vlwhv                    30m (0%)      0 (0%)      60Mi (0%)        600Mi (0%)     9m59s
  kube-system                 kube-proxy-4fkxh                            100m (0%)     0 (0%)      0 (0%)           0 (0%)         9m59s

Checking the MIG status on the node with nvidia-smi shows the 7 GPU partitions:

nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                   On |
| N/A   27C    P0              71W / 300W |    726MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    7   0   0  |             102MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    8   0   1  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   11   0   4  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   12   0   5  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   13   0   6  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    7    0      28988      C   python                                       82MiB |
|    0    8    0      29140      C   python                                       84MiB |
|    0    9    0      29335      C   python                                       84MiB |
|    0   10    0      29090      C   python                                       84MiB |
|    0   11    0      29031      C   python                                       84MiB |
|    0   12    0      29190      C   python                                       84MiB |
|    0   13    0      29255      C   python                                       84MiB |
+---------------------------------------------------------------------------------------+

There are a few remarks about MIG to take into account.
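To illustrate how these slices are consumed, here is a hedged sketch of a Pod requesting one MIG slice. Note that the resource name depends on the operator's mig.strategy setting: with the default single strategy each 1g.10gb instance above is exposed as a regular nvidia.com/gpu resource, while with the mixed strategy it would be exposed as, for example, nvidia.com/mig-1g.10gb.

```yaml
# Sketch: with mig.strategy=single, each MIG slice appears as nvidia.com/gpu;
# with mig.strategy=mixed it would instead be e.g. nvidia.com/mig-1g.10gb.
# Pod name and image are illustrative choices.
apiVersion: v1
kind: Pod
metadata:
  name: mig-consumer
spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1   # one MIG slice, not the whole A100
```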

For example, in the case above, if we want to move to another profile, it is important to cordon the node with the following commands:

kubectl get nodes

NAME                                 STATUS   ROLES   AGE     VERSION
aks-nc24a100v4-42670331-vmss00000c   Ready    agent   11m     v1.27.7
aks-nodepool1-25743550-vmss000000    Ready    agent   6d16h   v1.27.7
aks-nodepool1-25743550-vmss000001    Ready    agent   6d16h   v1.27.7

kubectl cordon aks-nc24a100v4-42670331-vmss00000c

node/aks-nc24a100v4-42670331-vmss00000c cordoned

Be aware that cordoning the nodes will not stop the Pods. You should verify that no GPU-accelerated workload is running before submitting the label change.
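A quick way to check this (shown here with the node name from the example above, and assuming a cluster context is configured) is to list the Pods scheduled on that node:

```shell
# List all Pods currently scheduled on the GPU node before changing the MIG label
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=aks-nc24a100v4-42670331-vmss00000c
```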

Since in our case we have applied the label at the AKS level, we will need to change the label from the Azure CLI:

az aks nodepool update --cluster-name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --nodepool-name nc24a100v4 --labels "nvidia.com/mig.config"="all-1g.20gb"

This will trigger a reconfiguration of MIG with the new profile applied:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                   On |
| N/A   42C    P0              77W / 300W |     50MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   2  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    6   0   3  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
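The progress of the reconfiguration can also be followed from kubectl. This is a hedged sketch, assuming the standard nvidia.com/mig.config.state label maintained by the operator's MIG manager:

```shell
# Read the MIG reconfiguration state label on the node; it typically
# transitions from "pending" to "success" once the new profile is applied
kubectl get node aks-nc24a100v4-42670331-vmss00000c \
  -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
```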

We can then uncordon the node/nodes:

kubectl uncordon aks-nc24a100v4-42670331-vmss00000c

node/aks-nc24a100v4-42670331-vmss00000c uncordoned
Using NVIDIA GPU Driver CRD (preview)


The NVIDIA GPU Driver CRD allows you to define, in a granular way, the driver version and driver image for each of the node pools in use in an AKS cluster. This feature is in preview, as documented in the NVIDIA GPU Operator documentation, and NVIDIA does not recommend it for production systems.


To enable the NVIDIA GPU Driver CRD, run the following command (if you have already installed the NVIDIA GPU Operator, you will first need to run helm uninstall, taking care of any running workloads):

helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator --set-json daemonsets.tolerations='[{"effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu" }, {"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value":"spot", "operator": "Equal"}]' --set nfd.enabled=false --set driver.nvidiaDriverCRD.deployDefaultCR=false --set driver.nvidiaDriverCRD.enabled=true

After this step, create the node pools with a label that will be used to select the driver version for each node (in this case "driver.config"):

az aks nodepool add \
   --resource-group $RESOURCE_GROUP_NAME \
   --cluster-name $AKS_CLUSTER_NAME \
   --name nc4latest \
   --node-taints sku=gpu:NoSchedule \
   --node-vm-size Standard_NC4as_T4_v3 \
   --enable-cluster-autoscaler \
   --labels "driver.config"="latest" \
   --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

az aks nodepool add \
   --resource-group $RESOURCE_GROUP_NAME \
   --cluster-name $AKS_CLUSTER_NAME \
   --name nc4stable \
   --node-taints sku=gpu:NoSchedule \
   --node-vm-size Standard_NC4as_T4_v3 \
   --enable-cluster-autoscaler \
   --labels "driver.config"="stable" \
   --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

After this step, the driver configuration (NVIDIADriver object in AKS) should be created. This can be done with a file called driver-config.yaml with the following content:

apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: nc4-latest
spec:
  driverType: gpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    - key: "kubernetes.azure.com/scalesetpriority"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  nodeSelector:
    driver.config: "latest"
  repository: nvcr.io/nvidia
  version: "535.129.03"
---
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: nc4-stable
spec:
  driverType: gpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    - key: "kubernetes.azure.com/scalesetpriority"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  nodeSelector:
    driver.config: "stable"
  repository: nvcr.io/nvidia
  version: "535.104.12"

This can then be applied with kubectl:

kubectl apply -f driver-config.yaml -n gpu-operator

Now, scaling up the nodes (e.g., by submitting a GPU workload whose affinity targets exactly the driver.config labels), we can verify that the driver versions are the ones requested. Running nvidia-smi from a shell attached to the driver DaemonSet container on each of the two nodes:

\n

 

\n
### On latest\n\n+---------------------------------------------------------------------------------------+\n| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |\n|-----------------------------------------+----------------------+----------------------+\n| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |\n|                                         |                      |               MIG M. |\n|=========================================+======================+======================|\n|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |\n| N/A   30C    P8              15W /  70W |      2MiB / 16384MiB |      0%      Default |\n|                                         |                      |                  N/A |\n+-----------------------------------------+----------------------+----------------------+\n\n### On stable\n\n+---------------------------------------------------------------------------------------+\n| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |\n|-----------------------------------------+----------------------+----------------------+\n| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |\n|                                         |                      |               MIG M. 
|\n|=========================================+======================+======================|\n|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |\n| N/A   30C    P8              14W /  70W |      0MiB / 16384MiB |      0%      Default |\n|                                         |                      |                  N/A |\n+-----------------------------------------+----------------------+----------------------+\n
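For completeness, here is a minimal sketch of how such a workload can pin itself to one of the pools through the driver.config label created above (for a single label, a nodeSelector is equivalent to the longer affinity form; the Pod name and image are illustrative choices):

```yaml
# Sketch: schedule a Pod onto the node pool running the "latest" driver
apiVersion: v1
kind: Pod
metadata:
  name: latest-driver-smoke-test
spec:
  nodeSelector:
    driver.config: "latest"
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```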

The NVIDIA GPU Driver CRD also allows you to specify a specific Docker image and registry for the NVIDIA driver installation on each node pool.

This becomes particularly useful when we need to install the Azure-specific virtual GPU (vGPU) drivers on A10 GPUs.

On Azure, NVads_A10_v5 VMs are backed by NVIDIA vGPU technology, so they require vGPU drivers. The vGPU drivers come included in the VM cost, so there is no need for a separate vGPU license. The binaries available on the Azure driver download page can be used, on the supported OSes (including Ubuntu 22), only on Azure VMs.


In this case, it is possible to bundle an ad-hoc NVIDIA driver container image for Azure and push it to a dedicated container registry.

This is the procedure to do that (assuming we have an ACR attached to AKS, with <ACR_NAME> being its login server, e.g. aksgpuplayground.azurecr.io):

export ACR_NAME=<ACR_NAME>
az acr login -n $ACR_NAME
git clone https://gitlab.com/nvidia/container-images/driver
cd driver
cp -r ubuntu22.04 ubuntu22.04-aks
cd ubuntu22.04-aks
cd drivers
wget "https://download.microsoft.com/download/1/4/4/14450d0e-a3f2-4b0a-9bb4-a8e729e986c4/NVIDIA-Linux-x86_64-535.154.05-grid-azure.run"
mv NVIDIA-Linux-x86_64-535.154.05-grid-azure.run NVIDIA-Linux-x86_64-535.154.05.run
chmod +x NVIDIA-Linux-x86_64-535.154.05.run
cd ..
sed -i 's%/tmp/install.sh download_installer%echo "Skipping Driver Download"%g' Dockerfile
sed -i 's%sh NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run -x%sh NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run -x \&\& mv NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION-grid-azure NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION%g' nvidia-driver
docker build --build-arg DRIVER_VERSION=535.154.05 --build-arg DRIVER_BRANCH=535 --build-arg CUDA_VERSION=12.3.1 --build-arg TARGETARCH=amd64 . -t $ACR_NAME/driver:535.154.05-ubuntu22.04
docker push $ACR_NAME/driver:535.154.05-ubuntu22.04

After this, let's create a specific NVIDIADriver object for Azure vGPU with a file named azure-vgpu.yaml and the following content (replace <ACR_NAME> with your ACR login server):

apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: azure-vgpu
spec:
  driverType: gpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    - key: "kubernetes.azure.com/scalesetpriority"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  nodeSelector:
    driver.config: "azurevgpu"
  repository: <ACR_NAME>
  version: "535.154.05"

Let's apply it with kubectl:

kubectl apply -f azure-vgpu.yaml -n gpu-operator

Now, let's create an A10 nodepool with the Azure CLI:

az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nv36a10v5 \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NV36ads_A10_v5 \
    --enable-cluster-autoscaler \
    --labels "driver.config"="azurevgpu" \
    --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

Scaling up a node with a specific workload and waiting for the driver installation to finalize, we can see that the NVIDIA driver image has been pulled from our registry:

root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-25743550-vmss000000   Ready    agent   6d23h   v1.27.7
aks-nodepool1-25743550-vmss000001   Ready    agent   6d23h   v1.27.7
aks-nv36a10v5-10653906-vmss000000   Ready    agent   9m24s   v1.27.7
root@aks-gpu-playground-rg-jumpbox:~# kubectl describe node aks-nv36a10v5-10653906-vmss000000 | grep gpu-driver
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu-driver-upgrade-enabled: true
  gpu-operator                nvidia-gpu-driver-ubuntu22.04-56df89b87c-6w8tj    0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m29s

root@aks-gpu-playground-rg-jumpbox:~# kubectl describe pods -n gpu-operator nvidia-gpu-driver-ubuntu22.04-56df89b87c-6w8tj | grep -i Image
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:bb845160b32fd12eb3fae3e830d2e6a7780bc7405e0d8c5b816242d48be9daa8
    Image:         aksgpuplayground.azurecr.io/driver:535.154.05-ubuntu22.04
    Image ID:      aksgpuplayground.azurecr.io/driver@sha256:deb6e6311a174ca6a989f8338940bf3b1e6ae115ebf738042063f4c3c95c770f
  Normal   Pulled            4m26s  kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2" already present on machine
  Normal   Pulling           4m23s  kubelet            Pulling image "aksgpuplayground.azurecr.io/driver:535.154.05-ubuntu22.04"
  Normal   Pulled            4m16s  kubelet            Successfully pulled image "aksgpuplayground.azurecr.io/driver:535.154.05-ubuntu22.04" in 6.871887325s (6.871898205s including waiting)

Also, attaching to the Pod of the device-plugin DaemonSet, we can see that the A10 vGPU profile is recognized successfully:

\n

 

\n
+---------------------------------------------------------------------------------------+\n| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |\n|-----------------------------------------+----------------------+----------------------+\n| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |\n|                                         |                      |               MIG M. |\n|=========================================+======================+======================|\n|   0  NVIDIA A10-24Q                 On  | 00000002:00:00.0 Off |                    0 |\n| N/A   N/A    P8              N/A /  N/A |      0MiB / 24512MiB |      0%      Default |\n|                                         |                      |             Disabled |\n+-----------------------------------------+----------------------+----------------------+
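For reference, one way to attach and run nvidia-smi inside the device-plugin Pod is sketched below; the exact Pod name (here a hypothetical placeholder) will differ on your cluster:

```shell
# Locate the device-plugin Pod scheduled on the A10 node, then run nvidia-smi in it
kubectl get pods -n gpu-operator -o wide | grep device-plugin
kubectl exec -n gpu-operator -it nvidia-device-plugin-daemonset-<POD_SUFFIX> -- nvidia-smi
```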

Thank you

Thank you for reading our blog post. Feel free to leave any comments or feedback, ask for clarifications, or report any issues.

Deploying a vanilla AKS cluster

The standard way of deploying a vanilla AKS cluster is to follow the procedure described in the Azure documentation.


Please be aware that this command will create an AKS cluster with:

In general, for production workloads, we strongly recommend reviewing the main security concepts for AKS clusters.

These are out of scope for the present demo, but please be aware that this cluster is meant for NVIDIA GPU Operator demo purposes only.

Using the Azure CLI, we can create an AKS cluster with this procedure (replace the values between angle brackets with your preferred values):

export RESOURCE_GROUP_NAME=<YOUR_RG_NAME>
export AKS_CLUSTER_NAME=<YOUR_AKS_CLUSTER_NAME>
export LOCATION=<YOUR_LOCATION>

## Following line to be used only if Resource Group is not available
az group create --resource-group $RESOURCE_GROUP_NAME --location $LOCATION

az aks create --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME --node-count 2 --generate-ssh-keys

 


Connecting to the cluster

Several ways to connect to the AKS cluster are documented in the Azure documentation.


Our favorite approach is using a Linux Ubuntu VM with the Azure CLI installed.

This allows us to run the following (in the login command, you may need to add --tenant <TENANT_ID> if you have access to multiple tenants, or --identity if the VM is on Azure and relies on an Azure managed identity):

## Add --tenant <TENANT_ID> in case of multiple tenants
## Add --identity in case of using a managed identity on the VM
az login
az aks install-cli
az aks get-credentials --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME

 


After this is completed, you should be able to perform standard kubectl commands like:

kubectl get nodes

root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-25743550-vmss000000   Ready    agent   2d19h   v1.27.7
aks-nodepool1-25743550-vmss000001   Ready    agent   2d19h   v1.27.7

 


The command line will be perfectly fine for all the operations in this blog post. However, if you would like a TUI experience, we suggest using k9s, which can be easily installed on Linux following its installation instructions. For Ubuntu, you can install the version current at the time of writing with:

wget "https://github.com/derailed/k9s/releases/download/v0.31.9/k9s_linux_amd64.deb"
dpkg -i k9s_linux_amd64.deb

 


k9s allows you to easily interact with the different resources of the AKS cluster directly from a terminal user interface. It can be launched with the k9s command. Detailed documentation on how to navigate the different resources (Pods, DaemonSets, Nodes) can be found on the official k9s documentation page.


Attaching an Azure Container Registry to the Azure Kubernetes cluster (only required for MIG and NVIDIA GPU Driver CRD)

In case you will be using MIG or the NVIDIA GPU Driver CRD, it is necessary to create a private Azure Container Registry and attach it to the AKS cluster.

export ACR_NAME=<ACR_NAME_OF_YOUR_CHOICE>

az acr create --resource-group $RESOURCE_GROUP_NAME \
  --name $ACR_NAME --sku Basic

az aks update --name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --attach-acr $ACR_NAME

 


You will be able to perform pull and push operations on this Container Registry through Docker using this command on a VM with the container engine installed, provided that the VM has a managed identity with AcrPull/AcrPush permissions:

az acr login --name $ACR_NAME

 


About taints for AKS GPU nodes

It is important to deeply understand the concept of taints and tolerations for GPU nodes in AKS. This is critical for two reasons:

However, the latter point can also be addressed in a more elegant and specific way using an affinity declaration in the Job or Pod specs requesting GPUs, like for example:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - Standard_NC4as_T4_v3

 


Creating the first GPU pool

The AKS cluster created above has, by default, only a node pool with 2 Standard_DS2_v2 nodes.

In order to test the NVIDIA GPU Operator and run a GPU-accelerated workload, we should add a GPU node pool.

If the management of the NVIDIA stack is meant to be handled by the GPU Operator, it is critical that the node pool is created with the tag:

SkipGPUDriverInstall=true

 


This can be done using Azure Cloud Shell, for example using an NC4as_T4_v3 and setting the autoscaling from 0 up to 1 node:

az aks nodepool add \
  --resource-group $RESOURCE_GROUP_NAME \
  --cluster-name $AKS_CLUSTER_NAME \
  --name nc4ast4 \
  --node-taints sku=gpu:NoSchedule \
  --node-vm-size Standard_NC4as_T4_v3 \
  --enable-cluster-autoscaler \
  --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

 


In order to deploy in Spot mode, the following flags should be added to the Azure CLI command:

--priority Spot --eviction-policy Delete --spot-max-price -1

 


Recently, a preview feature has been released that allows skipping the creation of the tags:

# Register the aks-preview extension
az extension add --name aks-preview

# Update the aks-preview extension
az extension update --name aks-preview

az aks nodepool add \
  --resource-group $RESOURCE_GROUP_NAME \
  --cluster-name $AKS_CLUSTER_NAME \
  --name nc4ast4 \
  --node-taints sku=gpu:NoSchedule \
  --node-vm-size Standard_NC4as_T4_v3 \
  --enable-cluster-autoscaler \
  --min-count 0 --max-count 1 --node-count 0 --skip-gpu-driver-install

 


At the end of the process you should get the appropriate node pool defined in the portal and in status “Succeeded”:

az aks nodepool list --cluster-name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME -o table
Name       OsType    KubernetesVersion    VmSize                Count    MaxPods    ProvisioningState    Mode
---------  --------  -------------------  --------------------  -------  ---------  -------------------  ------
nodepool1  Linux     1.27.7               Standard_DS2_v2       2        110        Succeeded            System
nc4ast4    Linux     1.27.7               Standard_NC4as_T4_v3  0        110        Succeeded            User

 


Install NVIDIA GPU Operator

On the machine with kubectl configured and with the context configured above for connection to the AKS cluster, run the following to install helm:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh

 


To fine tune the node feature recognition, we will install Node Feature Discovery separately from the NVIDIA GPU Operator. The operator requires that the label feature.node.kubernetes.io/pci-10de.present=true is applied to the nodes. Moreover, it is important to tune the node discovery plugin so that it is scheduled even on Spot instances of the Kubernetes cluster and on instances where the sku=gpu taint is applied:

helm install --wait --create-namespace -n gpu-operator node-feature-discovery node-feature-discovery --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts --set-json master.config.extraLabelNs='["nvidia.com"]' --set-json worker.tolerations='[{ "effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"},{"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value":"spot", "operator": "Equal"},{"effect": "NoSchedule", "key": "mig", "value":"notReady", "operator": "Equal"}]'

 


After enabling Node Feature Discovery, it is important to create a custom rule to precisely match NVIDIA GPUs on the nodes. This can be done by creating a file called nfd-gpu-rule.yaml containing the following:

apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nfd-gpu-rule
spec:
  rules:
  - name: "nfd-gpu-rule"
    labels:
      "feature.node.kubernetes.io/pci-10de.present": "true"
    matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}

 


After this file is created, we should apply it to the AKS cluster:

kubectl apply -n gpu-operator -f nfd-gpu-rule.yaml

 


After this step, it is necessary to add the NVIDIA Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

 


The next step is installing the GPU Operator, remembering to apply the tolerations to the GPU Operator DaemonSets as well, and to disable the deployment of Node Feature Discovery (nfd), which was already done in the previous step:

helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator --set-json daemonsets.tolerations='[{ "effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"},{"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value":"spot", "operator": "Equal"},{"effect": "NoSchedule", "key": "mig", "value":"notReady", "operator": "Equal"}]' --set nfd.enabled=false

 


Running the first GPU example

Once the configuration is completed, it is time to check the functionality of the GPU Operator by submitting the first GPU-accelerated Job on AKS. In this stage we will use as a reference the standard TensorFlow example, which is also documented in the official AKS Azure Learn pages.

Create a file called gpu-accelerated.yaml with this content:

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo
  name: samples-tf-mnist-demo
spec:
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_NC4as_T4_v3
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /tmp
          name: scratch
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      volumes:
      - name: scratch
        hostPath:
          # directory location on host
          path: /mnt/tmp
          type: DirectoryOrCreate

 


This job can be submitted with the following command:

kubectl apply -f gpu-accelerated.yaml

 


After approximately one minute the node should be automatically provisioned:

root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nc4ast4-81279986-vmss000003     Ready    agent   2m38s   v1.27.7
aks-nodepool1-25743550-vmss000000   Ready    agent   4d16h   v1.27.7
aks-nodepool1-25743550-vmss000001   Ready    agent   4d16h   v1.27.7

 


We can check that Node Feature Discovery has properly labeled the node:

root@aks-gpu-playground-rg-jumpbox:~# kubectl describe nodes aks-nc4ast4-81279986-vmss000003 | grep pci-
feature.node.kubernetes.io/pci-0302_10de.present=true
feature.node.kubernetes.io/pci-10de.present=true

 


The NVIDIA GPU Operator DaemonSets will start preparing the node, installing the driver and the NVIDIA Container Toolkit and running all the related validation.

Once node preparation is completed, the GPU Operator will add an allocatable GPU resource to the node:

kubectl describe nodes aks-nc4ast4-81279986-vmss000003
…
Allocatable:
  cpu:                3860m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             24487780Ki
  nvidia.com/gpu:     1
  pods:               110
…

 

\n

We can follow the job execution with the kubectl logs command:

```
root@aks-gpu-playground-rg-jumpbox:~# kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
samples-tf-mnist-demo-tmpr4   1/1     Running   0          11m

root@aks-gpu-playground-rg-jumpbox:~# kubectl logs samples-tf-mnist-demo-tmpr4 --follow
2024-02-18 11:51:31.479768: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2024-02-18 11:51:31.806125: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0001:00:00.0
totalMemory: 15.57GiB freeMemory: 15.47GiB
2024-02-18 11:51:31.806157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5)
2024-02-18 11:54:56.216820: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1201
Accuracy at step 10: 0.7364
…
Accuracy at step 490: 0.9559
Adding run metadata for 499
```

Time-slicing configuration

An extremely useful feature of the NVIDIA GPU Operator is time-slicing, which allows a physical GPU on a node to be shared among multiple Pods. This is a time-based scheduling partition, not a physical GPU partition: the GPU processes run by the different Pods each receive a proportional share of GPU compute time. Be aware, however, that a Pod that is particularly demanding in terms of GPU processing will significantly impact the other Pods sharing the GPU.

The official NVIDIA GPU Operator documentation describes different ways to configure time-slicing. Here, considering that one of the benefits of a cloud environment is the possibility of having multiple node pools, each with a different GPU model or configuration, we will focus on a fine-grained definition of time-slicing at the node pool level.

Enabling time-slicing requires three steps.

As a first step, the nodes should be labelled with the key “nvidia.com/device-plugin.config”.

For example, let’s label our node pool from the Azure CLI:

```bash
az aks nodepool update --cluster-name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --nodepool-name nc4ast4 --labels "nvidia.com/device-plugin.config=tesla-t4-ts2"
```

After this step, let’s create the ConfigMap object required to configure a time-slicing factor of 2 on this node pool, in a file called time-slicing-config.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  tesla-t4-ts2: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
```

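Since the device plugin picks the configuration entry matching the node's nvidia.com/device-plugin.config label value, a single ConfigMap can hold one entry per node pool. As a purely illustrative sketch (the tesla-t4-ts4 key and its replica count are hypothetical, not part of this walkthrough), a second pool could be mapped to a different time-slicing factor like this:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  # Pools labelled nvidia.com/device-plugin.config=tesla-t4-ts2 expose 2 slices per GPU
  tesla-t4-ts2: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
  # Hypothetical second entry: pools labelled nvidia.com/device-plugin.config=tesla-t4-ts4
  # would expose 4 slices per GPU
  tesla-t4-ts4: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

Each node pool then advertises the number of allocatable GPUs corresponding to the entry its label selects.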
Let’s apply the configuration in the GPU operator namespace:

```bash
kubectl apply -f time-slicing-config.yaml -n gpu-operator
```

Finally, let’s update the cluster policy to enable the time-slicing configuration:

```bash
kubectl patch clusterpolicy/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config"}}}}'
```

Now, let’s resubmit the job used in the first step, this time in two replicas, creating a file called gpu-accelerated-time-slicing.yaml:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo-ts
  name: samples-tf-mnist-demo-ts
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo-ts
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_NC4as_T4_v3
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
```

Let’s submit the job with the standard syntax:

```bash
kubectl apply -f gpu-accelerated-time-slicing.yaml
```

Now, after the node has been provisioned, we will find that it exposes two allocatable GPU resources and runs the two Pods concurrently:

```
kubectl describe node aks-nc4ast4-81279986-vmss000004
...
Allocatable:
  cpu:                3860m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             24487780Ki
  nvidia.com/gpu:     2
  pods:               110
...
  Namespace     Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------     ----                                       ------------  ----------  ---------------  -------------  ---
  default       samples-tf-mnist-demo-ts-0-4tdcf           0 (0%)        0 (0%)      0 (0%)           0 (0%)         29s
  default       samples-tf-mnist-demo-ts-1-67hn4           0 (0%)        0 (0%)      0 (0%)           0 (0%)         29s
  gpu-operator  gpu-feature-discovery-lksj7                0 (0%)        0 (0%)      0 (0%)           0 (0%)         59s
  gpu-operator  node-feature-discovery-worker-wbbct        0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m11s
  gpu-operator  nvidia-container-toolkit-daemonset-8nmx7   0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m24s
  gpu-operator  nvidia-dcgm-exporter-76rs8                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m24s
  gpu-operator  nvidia-device-plugin-daemonset-btwz7       0 (0%)        0 (0%)      0 (0%)           0 (0%)         55s
  gpu-operator  nvidia-driver-daemonset-8dkkh              0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m6s
  gpu-operator  nvidia-operator-validator-s7294            0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m24s
  kube-system   azure-ip-masq-agent-fjm5d                  100m (2%)     500m (12%)  50Mi (0%)        250Mi (1%)     9m18s
  kube-system   cloud-node-manager-9wpsm                   50m (1%)      0 (0%)      50Mi (0%)        512Mi (2%)     9m18s
  kube-system   csi-azuredisk-node-ckqw6                   30m (0%)      0 (0%)      60Mi (0%)        400Mi (1%)     9m18s
  kube-system   csi-azurefile-node-xmfbd                   30m (0%)      0 (0%)      60Mi (0%)        600Mi (2%)     9m18s
  kube-system   kube-proxy-7l856                           100m (2%)     0 (0%)      0 (0%)           0 (0%)         9m18s
```

A few remarks about time-slicing:

For example, imagine a node pool scaled down to zero that never had time-slicing applied, which we now configure with a time-slicing factor of 2. Submitting a request for 2 GPU resources may still allocate 2 nodes.

This is because the autoscaler remembers that each node of that pool provides only 1 allocatable GPU. Once a node correctly exposes 2 allocatable GPUs for the first time, the AKS autoscaler will acknowledge this and act accordingly in future autoscaling operations.

Multi-Instance GPU (MIG)

NVIDIA Multi-Instance GPU (MIG) allows GPU partitioning on the Ampere and Hopper architectures. In contrast to time-slicing, an available GPU is partitioned at the hardware level: each Pod gets access to a dedicated portion of the GPU resources that is isolated in hardware.

In Kubernetes, two MIG strategies are available: single and mixed.

In the single strategy, the nodes expose the MIG partitions as standard “nvidia.com/gpu” resources.

In the mixed strategy, the nodes expose the specific MIG profiles as resources, as in the example below:

```
Allocatable:
  nvidia.com/mig-1g.5gb:  1
  nvidia.com/mig-2g.10gb: 1
  nvidia.com/mig-3g.20gb: 1
```

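Under the mixed strategy, a workload consumes one of these partitions by naming the profile in its resource limits. The following is an illustrative sketch only (this walkthrough uses the single strategy; the Pod name and container are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-profile-demo   # hypothetical name, for illustration only
spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda-workload
    image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
    resources:
      limits:
        # Request one 1g.5gb MIG slice instead of a full GPU
        nvidia.com/mig-1g.5gb: 1
```

With the single strategy used below, the same request would simply be expressed as nvidia.com/gpu: 1.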
In order to use MIG, you could follow the standard AKS documentation. However, we would like to propose here a method relying entirely on the NVIDIA GPU Operator.

As a first step, it is necessary to allow the nodes to reboot so that the MIG configuration can be applied:

```bash
kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge -p '{"spec": {"migManager": {"env": [{"name": "WITH_REBOOT", "value": "true"}]}}}'
```

Let’s start by creating a node pool powered by a GPU supporting MIG on Azure, such as the Standard_NC24ads_A100_v4 SKU, and label the nodes with one of the MIG profiles available for the A100 80 GiB:

```bash
az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nc24a100v4 \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --enable-cluster-autoscaler \
    --min-count 0 --max-count 1 --node-count 0 --skip-gpu-driver-install --labels "nvidia.com/mig.config"="all-1g.10gb"
```

There is another important detail to consider at this stage with AKS: node auto-scaling will bring up nodes with a standard GPU configuration, without MIG activated. The NVIDIA GPU Operator will install the drivers, and mig-manager will then activate the proper MIG configuration profile and reboot the node. Between these two phases there is a small time window in which the GPU resources are exposed by the node, and this could potentially trigger a job execution.

To handle this scenario, AKS needs an additional DaemonSet that prevents any Pod from being scheduled while the MIG configuration is in progress. This is available in a dedicated repository.

To deploy the DaemonSet:

```bash
export NAMESPACE=gpu-operator
export ACR_NAME=<YOUR_ACR_NAME>
git clone https://github.com/wolfgang-desalvador/aks-mig-monitor.git
cd aks-mig-monitor
sed -i "s/<ACR_NAME>/$ACR_NAME/g" mig-monitor-daemonset.yaml
sed -i "s/<NAMESPACE>/$NAMESPACE/g" mig-monitor-roles.yaml
docker build . -t $ACR_NAME/aks-mig-monitor
docker push $ACR_NAME/aks-mig-monitor
kubectl apply -f mig-monitor-roles.yaml -n $NAMESPACE
kubectl apply -f mig-monitor-daemonset.yaml -n $NAMESPACE
```

We can now create a file called mig-accelerated-job.yaml:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo-mig
  name: samples-tf-mnist-demo-mig
spec:
  completions: 7
  parallelism: 7
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo-mig
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_NC24ads_A100_v4
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
```

Then we can submit the job with kubectl:

```bash
kubectl apply -f mig-accelerated-job.yaml
```

After the node starts up, it will initially carry a mig=notReady:NoSchedule taint, since the MIG configuration is not yet completed. The GPU Operator containers will be installed:

```
kubectl describe nodes aks-nc24a100v4-42670331-vmss00000a

Name:               aks-nc24a100v4-42670331-vmss00000a
...
                    nvidia.com/mig.config=all-1g.10gb
...
Taints:             kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                    mig=notReady:NoSchedule
                    sku=gpu:NoSchedule
...
Non-terminated Pods:          (13 in total)
  Namespace     Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------     ----                                       ------------  ----------  ---------------  -------------  ---
  gpu-operator  aks-mig-monitor-64zpl                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         16s
  gpu-operator  gpu-feature-discovery-wpd2j                0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator  node-feature-discovery-worker-79h68        0 (0%)        0 (0%)      0 (0%)           0 (0%)         16s
  gpu-operator  nvidia-container-toolkit-daemonset-q5p9k   0 (0%)        0 (0%)      0 (0%)           0 (0%)         12s
  gpu-operator  nvidia-dcgm-exporter-9g5kg                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator  nvidia-device-plugin-daemonset-5wpzk       0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator  nvidia-driver-daemonset-kqkzb              0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator  nvidia-operator-validator-lx77m            0 (0%)        0 (0%)      0 (0%)           0 (0%)         12s
  kube-system   azure-ip-masq-agent-7rd2x                  100m (0%)     500m (2%)   50Mi (0%)        250Mi (0%)     66s
  kube-system   cloud-node-manager-dc756                   50m (0%)      0 (0%)      50Mi (0%)        512Mi (0%)     66s
  kube-system   csi-azuredisk-node-5b4nk                   30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     66s
  kube-system   csi-azurefile-node-vlwhv                   30m (0%)      0 (0%)      60Mi (0%)        600Mi (0%)     66s
  kube-system   kube-proxy-4fkxh                           100m (0%)     0 (0%)      0 (0%)           0 (0%)         66s
```

After the GPU Operator configuration is completed, mig-manager will be deployed. The MIG configuration will be applied and the node will then be set in a rebooting state:

```
kubectl describe nodes aks-nc24a100v4-42670331-vmss00000a
...
                    nvidia.com/mig.config=all-1g.10gb
                    nvidia.com/mig.strategy=single
                    nvidia.com/mig.config.state=rebooting
...
Taints:             kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                    mig=notReady:NoSchedule
                    sku=gpu:NoSchedule
...
Non-terminated Pods:          (14 in total)
  Namespace     Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------     ----                                       ------------  ----------  ---------------  -------------  ---
  gpu-operator  aks-mig-monitor-64zpl                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m6s
  gpu-operator  gpu-feature-discovery-6btwx                0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator  node-feature-discovery-worker-79h68        0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m6s
  gpu-operator  nvidia-container-toolkit-daemonset-wplkb   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator  nvidia-dcgm-exporter-vnscq                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator  nvidia-device-plugin-daemonset-d86dn       0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  gpu-operator  nvidia-driver-daemonset-kqkzb              0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m3s
  gpu-operator  nvidia-mig-manager-t4bw9                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2s
  gpu-operator  nvidia-operator-validator-jrfkn            0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m33s
  kube-system   azure-ip-masq-agent-7rd2x                  100m (0%)     500m (2%)   50Mi (0%)        250Mi (0%)     4m56s
  kube-system   cloud-node-manager-dc756                   50m (0%)      0 (0%)      50Mi (0%)        512Mi (0%)     4m56s
  kube-system   csi-azuredisk-node-5b4nk                   30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     4m56s
  kube-system   csi-azurefile-node-vlwhv                   30m (0%)      0 (0%)      60Mi (0%)        600Mi (0%)     4m56s
  kube-system   kube-proxy-4fkxh                           100m (0%)     0 (0%)      0 (0%)           0 (0%)         4m56s
```

After the reboot, the MIG configuration will switch to the "success" state and the taint will be removed. Scheduling of the 7 pods of our job will then start:

```
kubectl describe nodes aks-nc24a100v4-42670331-vmss00000a
...
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.config=all-1g.10gb
                    nvidia.com/mig.config.state=success
                    nvidia.com/mig.strategy=single
...
Taints:             kubernetes.azure.com/scalesetpriority=spot:NoSchedule
                    sku=gpu:NoSchedule
...
Allocatable:
  cpu:                23660m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             214295444Ki
  nvidia.com/gpu:     7
  pods:               110
...
Non-terminated Pods:          (21 in total)
  Namespace     Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------     ----                                       ------------  ----------  ---------------  -------------  ---
  default       samples-tf-mnist-demo-ts-0-5bs64           0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default       samples-tf-mnist-demo-ts-1-2msdh           0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default       samples-tf-mnist-demo-ts-2-ck8c8           0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default       samples-tf-mnist-demo-ts-3-dlkfn           0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default       samples-tf-mnist-demo-ts-4-899fr           0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default       samples-tf-mnist-demo-ts-5-dmgpn           0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  default       samples-tf-mnist-demo-ts-6-pvzm4           0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  gpu-operator  aks-mig-monitor-64zpl                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         9m9s
  gpu-operator  gpu-feature-discovery-5t9gn                0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  gpu-operator  node-feature-discovery-worker-79h68        0 (0%)        0 (0%)      0 (0%)           0 (0%)         9m9s
  gpu-operator  nvidia-container-toolkit-daemonset-82dgg   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m22s
  gpu-operator  nvidia-dcgm-exporter-xbxqf                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  gpu-operator  nvidia-device-plugin-daemonset-8gkzd       0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  gpu-operator  nvidia-driver-daemonset-kqkzb              0 (0%)        0 (0%)      0 (0%)           0 (0%)         9m6s
  gpu-operator  nvidia-mig-manager-jbqls                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m22s
  gpu-operator  nvidia-operator-validator-5rdbh            0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  kube-system   azure-ip-masq-agent-7rd2x                  100m (0%)     500m (2%)   50Mi (0%)        250Mi (0%)     9m59s
  kube-system   cloud-node-manager-dc756                   50m (0%)      0 (0%)      50Mi (0%)        512Mi (0%)     9m59s
  kube-system   csi-azuredisk-node-5b4nk                   30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     9m59s
  kube-system   csi-azurefile-node-vlwhv                   30m (0%)      0 (0%)      60Mi (0%)        600Mi (0%)     9m59s
  kube-system   kube-proxy-4fkxh                           100m (0%)     0 (0%)      0 (0%)           0 (0%)         9m59s
```

Checking the MIG status on the node with nvidia-smi shows the 7 GPU partitions:

```
nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                   On |
| N/A   27C    P0              71W / 300W |    726MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    7   0   0  |             102MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    8   0   1  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   11   0   4  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   12   0   5  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   13   0   6  |             104MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    7    0      28988      C   python                                       82MiB |
|    0    8    0      29140      C   python                                       84MiB |
|    0    9    0      29335      C   python                                       84MiB |
|    0   10    0      29090      C   python                                       84MiB |
|    0   11    0      29031      C   python                                       84MiB |
|    0   12    0      29190      C   python                                       84MiB |
|    0   13    0      29255      C   python                                       84MiB |
+---------------------------------------------------------------------------------------+
```

A few remarks about MIG need to be taken into account, in particular when changing profiles.

For example, in the case above, if we want to move to another MIG profile, it is important to first cordon the node:

```
kubectl get nodes

NAME                                 STATUS   ROLES   AGE     VERSION
aks-nc24a100v4-42670331-vmss00000c   Ready    agent   11m     v1.27.7
aks-nodepool1-25743550-vmss000000    Ready    agent   6d16h   v1.27.7
aks-nodepool1-25743550-vmss000001    Ready    agent   6d16h   v1.27.7

kubectl cordon aks-nc24a100v4-42670331-vmss00000c

node/aks-nc24a100v4-42670331-vmss00000c cordoned
```

Be aware that cordoning the nodes will not stop the running Pods. You should verify that no GPU-accelerated workload is running before submitting the label change.

Since in our case we have applied the label at the AKS level, we will need to change the label from Azure CLI:

```bash
az aks nodepool update --cluster-name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --nodepool-name nc24a100v4 --labels "nvidia.com/mig.config"="all-1g.20gb"
```

This will trigger a reconfiguration of MIG with the new profile applied:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                   On |
| N/A   42C    P0              77W / 300W |     50MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   2  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    6   0   3  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```


We can then uncordon the node(s):

```
kubectl uncordon aks-nc24a100v4-42670331-vmss00000c

node/aks-nc24a100v4-42670331-vmss00000c uncordoned
```

Using NVIDIA GPU Driver CRD (preview)

The NVIDIA GPU Driver CRD allows defining in a granular way the driver version and the driver image used on each of the node pools of an AKS cluster. As documented in the NVIDIA GPU Operator documentation, this feature is in preview and is not recommended by NVIDIA for production systems.

To enable the NVIDIA GPU Driver CRD, run the following command (if you have already installed the NVIDIA GPU Operator, you will first need to perform a helm uninstall, of course taking care of any running workloads):

```bash
helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator \
    --set-json daemonsets.tolerations='[{"effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu" }, {"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value":"spot", "operator": "Equal"}]' \
    --set nfd.enabled=false \
    --set driver.nvidiaDriverCRD.deployDefaultCR=false \
    --set driver.nvidiaDriverCRD.enabled=true
```

After this step, it is important to create the node pools with a proper label (in this case "driver.config") that will be used to select the driver version for each node:

```bash
az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nc4latest \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NC4as_T4_v3 \
    --enable-cluster-autoscaler \
    --labels "driver.config"="latest" \
    --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True

az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nc4stable \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NC4as_T4_v3 \
    --enable-cluster-autoscaler \
    --labels "driver.config"="stable" \
    --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True
```

After this step, the driver configuration (an NVIDIADriver object) should be created. This can be done with a file called driver-config.yaml with the following content:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: nc4-latest
spec:
  driverType: gpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
  nodeSelector:
    driver.config: "latest"
  repository: nvcr.io/nvidia
  version: "535.129.03"
---
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: nc4-stable
spec:
  driverType: gpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
  nodeSelector:
    driver.config: "stable"
  repository: nvcr.io/nvidia
  version: "535.104.12"
```

This can then be applied with kubectl:

```bash
kubectl apply -f driver-config.yaml -n gpu-operator
```

Now, scaling up the nodes (e.g. by submitting a GPU workload with a node affinity targeting exactly the driver.config labels), we can verify that the driver versions are the requested ones by running nvidia-smi in a shell attached to the driver DaemonSet container of each of the two nodes:

```
### On latest

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P8              15W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

### On stable

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P8              14W /  70W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

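A workload can be pinned to one of the two driver configurations by selecting the driver.config node label directly. The manifest below is an illustrative sketch only (the Pod name and container are hypothetical; the label value "latest" and the sku taint come from this walkthrough):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: driver-latest-demo   # hypothetical name, for illustration only
spec:
  # Schedule only on nodes of the pool labelled driver.config=latest
  nodeSelector:
    driver.config: "latest"
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda-workload
    image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
    resources:
      limits:
        nvidia.com/gpu: 1
```
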
The NVIDIA GPU Driver CRD also allows specifying a custom Docker image and registry for the NVIDIA driver installation on each node pool. This becomes particularly useful when we need to install the Azure-specific virtual GPU (vGPU) drivers on A10 GPUs.

On Azure, NVads_A10_v5 VMs rely on NVIDIA vGPU technology in the backend, so they require vGPU drivers. On Azure, the vGPU drivers come included in the VM cost, so there is no need for a separate vGPU license. The binaries available on the Azure driver download page can be used on the supported operating systems (including Ubuntu 22.04), but only on Azure VMs.

In this case, it is possible to bundle an ad-hoc NVIDIA driver container image for Azure and make it available in a dedicated container registry.

The procedure is the following (assuming we have an ACR named <ACR_NAME> attached to AKS):

```bash
export ACR_NAME=<ACR_NAME>
az acr login -n $ACR_NAME
git clone https://gitlab.com/nvidia/container-images/driver
cd driver
cp -r ubuntu22.04 ubuntu22.04-aks
cd ubuntu22.04-aks
cd drivers
wget "https://download.microsoft.com/download/1/4/4/14450d0e-a3f2-4b0a-9bb4-a8e729e986c4/NVIDIA-Linux-x86_64-535.154.05-grid-azure.run"
mv NVIDIA-Linux-x86_64-535.154.05-grid-azure.run NVIDIA-Linux-x86_64-535.154.05.run
chmod +x NVIDIA-Linux-x86_64-535.154.05.run
cd ..
sed -i 's%/tmp/install.sh download_installer%echo "Skipping Driver Download"%g' Dockerfile
sed -i 's%sh NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run -x%sh NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run -x \&\& mv NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION-grid-azure NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION%g' nvidia-driver
docker build --build-arg DRIVER_VERSION=535.154.05 --build-arg DRIVER_BRANCH=535 --build-arg CUDA_VERSION=12.3.1 --build-arg TARGETARCH=amd64 . -t $ACR_NAME/driver:535.154.05-ubuntu22.04
docker push $ACR_NAME/driver:535.154.05-ubuntu22.04
```

After this, let's create a specific NVIDIADriver object for Azure VGPU with a file named azure-vgpu.yaml and the following content (replace <ACR_NAME> with your ACR name):

```yaml
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: azure-vgpu
spec:
  driverType: gpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
  nodeSelector:
    driver.config: "azurevgpu"
  repository: <ACR_NAME>
  version: "535.154.05"
```

Let's apply it with kubectl:

```bash
kubectl apply -f azure-vgpu.yaml -n gpu-operator
```

Now, let's create an A10 nodepool with Azure CLI:

```bash
az aks nodepool add \
    --resource-group $RESOURCE_GROUP_NAME \
    --cluster-name $AKS_CLUSTER_NAME \
    --name nv36a10v5 \
    --node-taints sku=gpu:NoSchedule \
    --node-vm-size Standard_NV36ads_A10_v5 \
    --enable-cluster-autoscaler \
    --labels "driver.config"="azurevgpu" \
    --min-count 0 --max-count 1 --node-count 0 --tags SkipGPUDriverInstall=True
```

Scaling up a node with a specific workload and waiting for the finalization of Driver installation, we will see that the image of the NVIDIA Driver installation has been pulled by our registry:

root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-25743550-vmss000000   Ready    agent   6d23h   v1.27.7
aks-nodepool1-25743550-vmss000001   Ready    agent   6d23h   v1.27.7
aks-nv36a10v5-10653906-vmss000000   Ready    agent   9m24s   v1.27.7

root@aks-gpu-playground-rg-jumpbox:~# kubectl describe node aks-nv36a10v5-10653906-vmss000000 | grep gpu-driver
  nvidia.com/gpu-driver-upgrade-state=upgrade-done
  nvidia.com/gpu-driver-upgrade-enabled: true
  gpu-operator   nvidia-gpu-driver-ubuntu22.04-56df89b87c-6w8tj   0 (0%)   0 (0%)   0 (0%)   0 (0%)   4m29s

root@aks-gpu-playground-rg-jumpbox:~# kubectl describe pods -n gpu-operator nvidia-gpu-driver-ubuntu22.04-56df89b87c-6w8tj | grep -i Image
    Image:          nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2
    Image ID:       nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:bb845160b32fd12eb3fae3e830d2e6a7780bc7405e0d8c5b816242d48be9daa8
    Image:          aksgpuplayground.azurecr.io/driver:535.154.05-ubuntu22.04
    Image ID:       aksgpuplayground.azurecr.io/driver@sha256:deb6e6311a174ca6a989f8338940bf3b1e6ae115ebf738042063f4c3c95c770f
  Normal  Pulled    4m26s  kubelet  Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2" already present on machine
  Normal  Pulling   4m23s  kubelet  Pulling image "aksgpuplayground.azurecr.io/driver:535.154.05-ubuntu22.04"
  Normal  Pulled    4m16s  kubelet  Successfully pulled image "aksgpuplayground.azurecr.io/driver:535.154.05-ubuntu22.04" in 6.871887325s (6.871898205s including waiting)

We can also see that the A10 vGPU profile is recognized correctly by attaching to the device-plugin-daemonset Pod (e.g. with kubectl exec) and running nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-24Q                 On  | 00000002:00:00.0 Off |                    0 |
| N/A   N/A    P8             N/A /  N/A  |      0MiB / 24512MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Thank you


Thank you for reading our blog posts. Feel free to leave a comment or feedback, ask for clarifications, or report any issue.

","kudosSumWeight":4,"postTime":"2024-02-23T10:36:10.872-08:00","images":{"__typename":"AssociatedImageConnection","edges":[{"__typename":"AssociatedImageEdge","cursor":"MjUuMXwyLjF8b3wyNXxfTlZffDE","node":{"__ref":"AssociatedImage:{\"url\":\"https://techcommunity.microsoft.com/t5/s/gxcuf89792/images/bS00MDYxMzE4LTU1NDYyNWk4NzEyNzdCOEMyOTUxRUMw?revision=32\"}"}},{"__typename":"AssociatedImageEdge","cursor":"MjUuMXwyLjF8b3wyNXxfTlZffDI","node":{"__ref":"AssociatedImage:{\"url\":\"https://techcommunity.microsoft.com/t5/s/gxcuf89792/images/bS00MDYxMzE4LTU1NTYyNmlEN0NCNjEwQjM1ODUyQ0RC?revision=32\"}"}}],"totalCount":2,"pageInfo":{"__typename":"PageInfo","hasNextPage":false,"endCursor":null,"hasPreviousPage":false,"startCursor":null}},"attachments":{"__typename":"AttachmentConnection","pageInfo":{"__typename":"PageInfo","hasNextPage":false,"endCursor":null,"hasPreviousPage":false,"startCursor":null},"edges":[]},"tags":{"__typename":"TagConnection","pageInfo":{"__typename":"PageInfo","hasNextPage":false,"endCursor":null,"hasPreviousPage":false,"startCursor":null},"edges":[{"__typename":"TagEdge","cursor":"MjUuMXwyLjF8b3wxMHxfTlZffDE","node":{"__typename":"Tag","id":"tag:ai infrastructure","text":"ai infrastructure","time":"2022-10-26T12:47:40.044-07:00","lastActivityTime":null,"messagesCount":null,"followersCount":null}}]},"timeToRead":29,"rawTeaser":"
arams":{"categoryId":"MicrosoftMechanics"},"routeName":"CategoryPage"},{"linkType":"INTERNAL","id":"public-sector","params":{"categoryId":"PublicSector"},"routeName":"CategoryPage"},{"linkType":"INTERNAL","id":"s-m-b","params":{"categoryId":"MicrosoftforNonprofits"},"routeName":"CategoryPage"},{"linkType":"INTERNAL","id":"io-t","params":{"categoryId":"IoT"},"routeName":"CategoryPage"},{"linkType":"INTERNAL","id":"startupsat-microsoft","params":{"categoryId":"StartupsatMicrosoft"},"routeName":"CategoryPage"},{"linkType":"INTERNAL","id":"driving-adoption","params":{"categoryId":"DrivingAdoption"},"routeName":"CategoryPage"},{"linkType":"EXTERNAL","id":"external-link-1","url":"/Directory","target":"SELF"}],"linkType":"EXTERNAL","id":"communities-1","url":"/","target":"SELF"},{"children":[],"linkType":"EXTERNAL","id":"external","url":"/Blogs","target":"SELF"},{"children":[],"linkType":"EXTERNAL","id":"external-1","url":"/Events","target":"SELF"},{"children":[{"linkType":"INTERNAL","id":"microsoft-learn-1","params":{"categoryId":"MicrosoftLearn"},"routeName":"CategoryPage"},{"linkType":"INTERNAL","id":"microsoft-learn-blog","params":{"boardId":"MicrosoftLearnBlog","categoryId":"MicrosoftLearn"},"routeName":"BlogBoardPage"},{"linkType":"EXTERNAL","id":"external-10","url":"https://learningroomdirectory.microsoft.com/","target":"BLANK"},{"linkType":"EXTERNAL","id":"external-3","url":"https://docs.microsoft.com/learn/dynamics365/?WT.mc_id=techcom_header-webpage-m365","target":"BLANK"},{"linkType":"EXTERNAL","id":"external-4","url":"https://docs.microsoft.com/learn/m365/?wt.mc_id=techcom_header-webpage-m365","target":"BLANK"},{"linkType":"EXTERNAL","id":"external-5","url":"https://docs.microsoft.com/learn/topics/sci/?wt.mc_id=techcom_header-webpage-m365","target":"BLANK"},{"linkType":"EXTERNAL","id":"external-6","url":"https://docs.microsoft.com/learn/powerplatform/?wt.mc_id=techcom_header-webpage-powerplatform","target":"BLANK"},{"linkType":"EXTERNAL","id":"external-7","url"
:"https://docs.microsoft.com/learn/github/?wt.mc_id=techcom_header-webpage-github","target":"BLANK"},{"linkType":"EXTERNAL","id":"external-8","url":"https://docs.microsoft.com/learn/teams/?wt.mc_id=techcom_header-webpage-teams","target":"BLANK"},{"linkType":"EXTERNAL","id":"external-9","url":"https://docs.microsoft.com/learn/dotnet/?wt.mc_id=techcom_header-webpage-dotnet","target":"BLANK"},{"linkType":"EXTERNAL","id":"external-2","url":"https://docs.microsoft.com/learn/azure/?WT.mc_id=techcom_header-webpage-m365","target":"BLANK"}],"linkType":"INTERNAL","id":"microsoft-learn","params":{"categoryId":"MicrosoftLearn"},"routeName":"CategoryPage"},{"children":[],"linkType":"INTERNAL","id":"community-info-center","params":{"categoryId":"Community-Info-Center"},"routeName":"CategoryPage"}]},"style":{"boxShadow":"var(--lia-bs-box-shadow-sm)","controllerHighlightColor":"hsla(30, 100%, 50%)","linkFontWeight":"400","dropdownDividerMarginBottom":"10px","hamburgerBorderHover":"none","linkBoxShadowHover":"none","linkFontSize":"14px","backgroundOpacity":0.8,"controllerBorderRadius":"var(--lia-border-radius-50)","hamburgerBgColor":"transparent","hamburgerColor":"var(--lia-nav-controller-icon-color)","linkTextBorderBottom":"none","brandLogoHeight":"30px","linkBgHoverColor":"transparent","linkLetterSpacing":"normal","collapseMenuDividerOpacity":0.16,"dropdownPaddingBottom":"15px","paddingBottom":"15px","dropdownMenuOffset":"2px","hamburgerBgHoverColor":"transparent","borderBottom":"1px solid 
var(--lia-bs-border-color)","hamburgerBorder":"none","dropdownPaddingX":"10px","brandMarginRightSm":"10px","linkBoxShadow":"none","collapseMenuDividerBg":"var(--lia-nav-link-color)","linkColor":"var(--lia-bs-body-color)","linkJustifyContent":"flex-start","dropdownPaddingTop":"10px","controllerHighlightTextColor":"var(--lia-yiq-dark)","controllerTextColor":"var(--lia-nav-controller-icon-color)","background":{"imageAssetName":"","color":"var(--lia-bs-white)","size":"COVER","repeat":"NO_REPEAT","position":"CENTER_CENTER","imageLastModified":""},"linkBorderRadius":"var(--lia-bs-border-radius-sm)","linkHoverColor":"var(--lia-bs-body-color)","position":"FIXED","linkBorder":"none","linkTextBorderBottomHover":"2px solid var(--lia-bs-body-color)","brandMarginRight":"30px","hamburgerHoverColor":"var(--lia-nav-controller-icon-color)","linkBorderHover":"none","collapseMenuMarginLeft":"20px","linkFontStyle":"NORMAL","controllerTextHoverColor":"var(--lia-nav-controller-icon-hover-color)","linkPaddingX":"10px","linkPaddingY":"5px","paddingTop":"15px","linkTextTransform":"NONE","dropdownBorderColor":"hsla(var(--lia-bs-black-h), var(--lia-bs-black-s), var(--lia-bs-black-l), 0.08)","controllerBgHoverColor":"hsla(var(--lia-bs-black-h), var(--lia-bs-black-s), var(--lia-bs-black-l), 
0.1)","linkBgColor":"transparent","linkDropdownPaddingX":"var(--lia-nav-link-px)","linkDropdownPaddingY":"9px","controllerIconColor":"var(--lia-bs-body-color)","dropdownDividerMarginTop":"10px","linkGap":"10px","controllerIconHoverColor":"var(--lia-bs-body-color)"},"showSearchIcon":false,"languagePickerStyle":"iconAndLabel"},"__typename":"QuiltComponent"},{"id":"community.widget.breadcrumbWidget","props":{"backgroundColor":"transparent","linkHighlightColor":"var(--lia-bs-primary)","visualEffects":{"showBottomBorder":true},"linkTextColor":"var(--lia-bs-gray-700)"},"__typename":"QuiltComponent"},{"id":"custom.widget.community_banner","props":{"widgetVisibility":"signedInOrAnonymous","useTitle":true,"usePageWidth":false,"useBackground":false,"title":"","lazyLoad":false},"__typename":"QuiltComponent"},{"id":"custom.widget.HeroBanner","props":{"widgetVisibility":"signedInOrAnonymous","usePageWidth":false,"useTitle":true,"cMax_items":3,"useBackground":false,"title":"","lazyLoad":false,"widgetChooser":"custom.widget.HeroBanner"},"__typename":"QuiltComponent"}],"__typename":"QuiltWrapperSection"},"footer":{"backgroundImageProps":{"assetName":null,"backgroundSize":"COVER","backgroundRepeat":"NO_REPEAT","backgroundPosition":"CENTER_CENTER","lastModified":null,"__typename":"BackgroundImageProps"},"backgroundColor":"transparent","items":[{"id":"custom.widget.MicrosoftFooter","props":{"widgetVisibility":"signedInOrAnonymous","useTitle":true,"useBackground":false,"title":"","lazyLoad":false},"__typename":"QuiltComponent"}],"__typename":"QuiltWrapperSection"},"__typename":"QuiltWrapper","localOverride":false},"localOverride":false},"CachedAsset:text:en_US-components/common/ActionFeedback-1745505310103":{"__typename":"CachedAsset","id":"text:en_US-components/common/ActionFeedback-1745505310103","value":{"joinedGroupHub.title":"Welcome","joinedGroupHub.message":"You are now a member of this group and are subscribed to updates.","groupHubInviteNotFound.title":"Invitation Not 
Found","groupHubInviteNotFound.message":"Sorry, we could not find your invitation to the group. The owner may have canceled the invite.","groupHubNotFound.title":"Group Not Found","groupHubNotFound.message":"The grouphub you tried to join does not exist. It may have been deleted.","existingGroupHubMember.title":"Already Joined","existingGroupHubMember.message":"You are already a member of this group.","accountLocked.title":"Account Locked","accountLocked.message":"Your account has been locked due to multiple failed attempts. Try again in {lockoutTime} minutes.","editedGroupHub.title":"Changes Saved","editedGroupHub.message":"Your group has been updated.","leftGroupHub.title":"Goodbye","leftGroupHub.message":"You are no longer a member of this group and will not receive future updates.","deletedGroupHub.title":"Deleted","deletedGroupHub.message":"The group has been deleted.","groupHubCreated.title":"Group Created","groupHubCreated.message":"{groupHubName} is ready to use","accountClosed.title":"Account Closed","accountClosed.message":"The account has been closed and you will now be redirected to the homepage","resetTokenExpired.title":"Reset Password Link has Expired","resetTokenExpired.message":"Try resetting your password again","invalidUrl.title":"Invalid URL","invalidUrl.message":"The URL you're using is not recognized. Verify your URL and try again.","accountClosedForUser.title":"Account Closed","accountClosedForUser.message":"{userName}'s account is closed","inviteTokenInvalid.title":"Invitation Invalid","inviteTokenInvalid.message":"Your invitation to the community has been canceled or expired.","inviteTokenError.title":"Invitation Verification Failed","inviteTokenError.message":"The url you are utilizing is not recognized. 
Verify your URL and try again","pageNotFound.title":"Access Denied","pageNotFound.message":"You do not have access to this area of the community or it doesn't exist","eventAttending.title":"Responded as Attending","eventAttending.message":"You'll be notified when there's new activity and reminded as the event approaches","eventInterested.title":"Responded as Interested","eventInterested.message":"You'll be notified when there's new activity and reminded as the event approaches","eventNotFound.title":"Event Not Found","eventNotFound.message":"The event you tried to respond to does not exist.","redirectToRelatedPage.title":"Showing Related Content","redirectToRelatedPageForBaseUsers.title":"Showing Related Content","redirectToRelatedPageForBaseUsers.message":"The content you are trying to access is archived","redirectToRelatedPage.message":"The content you are trying to access is archived","relatedUrl.archivalLink.flyoutMessage":"The content you are trying to access is archived View Archived Content"},"localOverride":false},"CachedAsset:component:custom.widget.community_banner-en-1744400827912":{"__typename":"CachedAsset","id":"component:custom.widget.community_banner-en-1744400827912","value":{"component":{"id":"custom.widget.community_banner","template":{"id":"community_banner","markupLanguage":"HANDLEBARS","style":".community-banner {\n a.top-bar.btn {\n top: 0px;\n width: 100%;\n z-index: 999;\n text-align: center;\n left: 0px;\n background: #0068b8;\n color: white;\n padding: 10px 0px;\n display: block;\n box-shadow: none !important;\n border: none !important;\n border-radius: none !important;\n margin: 0px !important;\n font-size: 14px;\n }\n}\n","texts":null,"defaults":{"config":{"applicablePages":[],"description":"community announcement 
text","fetchedContent":null,"__typename":"ComponentConfiguration"},"props":[],"__typename":"ComponentProperties"},"components":[{"id":"custom.widget.community_banner","form":null,"config":null,"props":[],"__typename":"Component"}],"grouping":"CUSTOM","__typename":"ComponentTemplate"},"properties":{"config":{"applicablePages":[],"description":"community announcement text","fetchedContent":null,"__typename":"ComponentConfiguration"},"props":[],"__typename":"ComponentProperties"},"form":null,"__typename":"Component","localOverride":false},"globalCss":{"css":".custom_widget_community_banner_community-banner_1x9u2_1 {\n a.custom_widget_community_banner_top-bar_1x9u2_2.custom_widget_community_banner_btn_1x9u2_2 {\n top: 0;\n width: 100%;\n z-index: 999;\n text-align: center;\n left: 0;\n background: #0068b8;\n color: white;\n padding: 0.625rem 0;\n display: block;\n box-shadow: none !important;\n border: none !important;\n border-radius: none !important;\n margin: 0 !important;\n font-size: 0.875rem;\n }\n}\n","tokens":{"community-banner":"custom_widget_community_banner_community-banner_1x9u2_1","top-bar":"custom_widget_community_banner_top-bar_1x9u2_2","btn":"custom_widget_community_banner_btn_1x9u2_2"}},"form":null},"localOverride":false},"CachedAsset:component:custom.widget.HeroBanner-en-1744400827912":{"__typename":"CachedAsset","id":"component:custom.widget.HeroBanner-en-1744400827912","value":{"component":{"id":"custom.widget.HeroBanner","template":{"id":"HeroBanner","markupLanguage":"REACT","style":null,"texts":{"searchPlaceholderText":"Search this community","followActionText":"Follow","unfollowActionText":"Following","searchOnHoverText":"Please enter your search term(s) and then press return key to complete a search.","blogs.sidebar.pagetitle":"Latest Blogs | Microsoft Tech Community","followThisNode":"Follow this node","unfollowThisNode":"Unfollow this 
node"},"defaults":{"config":{"applicablePages":[],"description":null,"fetchedContent":null,"__typename":"ComponentConfiguration"},"props":[{"id":"max_items","dataType":"NUMBER","list":false,"defaultValue":"3","label":"Max Items","description":"The maximum number of items to display in the carousel","possibleValues":null,"control":"INPUT","__typename":"PropDefinition"}],"__typename":"ComponentProperties"},"components":[{"id":"custom.widget.HeroBanner","form":{"fields":[{"id":"widgetChooser","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"title","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"useTitle","validation":null,"noValidation":null,"dataType":"BOOLEAN","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"useBackground","validation":null,"noValidation":null,"dataType":"BOOLEAN","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"widgetVisibility","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"moreOptions","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"cMax_items","validation":null,"noValidation":null,"dataType":"NUMBER","list":false,"control":"INPUT","defaultValue":"3","label":"Max Items","description":"The maximum number of items to display in the 
carousel","possibleValues":null,"__typename":"FormField"}],"layout":{"rows":[{"id":"widgetChooserGroup","type":"fieldset","as":null,"items":[{"id":"widgetChooser","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"titleGroup","type":"fieldset","as":null,"items":[{"id":"title","className":null,"__typename":"FormFieldRef"},{"id":"useTitle","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"useBackground","type":"fieldset","as":null,"items":[{"id":"useBackground","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"widgetVisibility","type":"fieldset","as":null,"items":[{"id":"widgetVisibility","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"moreOptionsGroup","type":"fieldset","as":null,"items":[{"id":"moreOptions","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"componentPropsGroup","type":"fieldset","as":null,"items":[{"id":"cMax_items","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"}],"actionButtons":null,"className":"custom_widget_HeroBanner_form","formGroupFieldSeparator":"divider","__typename":"FormLayout"},"__typename":"Form"},"config":null,"props":[],"__typename":"Component"}],"grouping":"CUSTOM","__typename":"ComponentTemplate"},"properties":{"config":{"applicablePages":[],"descri
ption":null,"fetchedContent":null,"__typename":"ComponentConfiguration"},"props":[{"id":"max_items","dataType":"NUMBER","list":false,"defaultValue":"3","label":"Max Items","description":"The maximum number of items to display in the carousel","possibleValues":null,"control":"INPUT","__typename":"PropDefinition"}],"__typename":"ComponentProperties"},"form":{"fields":[{"id":"widgetChooser","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"title","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"useTitle","validation":null,"noValidation":null,"dataType":"BOOLEAN","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"useBackground","validation":null,"noValidation":null,"dataType":"BOOLEAN","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"widgetVisibility","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"moreOptions","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"cMax_items","validation":null,"noValidation":null,"dataType":"NUMBER","list":false,"control":"INPUT","defaultValue":"3","label":"Max Items","description":"The maximum number of items to display in the 
carousel","possibleValues":null,"__typename":"FormField"}],"layout":{"rows":[{"id":"widgetChooserGroup","type":"fieldset","as":null,"items":[{"id":"widgetChooser","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"titleGroup","type":"fieldset","as":null,"items":[{"id":"title","className":null,"__typename":"FormFieldRef"},{"id":"useTitle","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"useBackground","type":"fieldset","as":null,"items":[{"id":"useBackground","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"widgetVisibility","type":"fieldset","as":null,"items":[{"id":"widgetVisibility","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"moreOptionsGroup","type":"fieldset","as":null,"items":[{"id":"moreOptions","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"componentPropsGroup","type":"fieldset","as":null,"items":[{"id":"cMax_items","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"}],"actionButtons":null,"className":"custom_widget_HeroBanner_form","formGroupFieldSeparator":"divider","__typename":"FormLayout"},"__typename":"Form"},"__typename":"Component","localOverride":false},"globalCss":null,"form":{"fields":[{"id":"widgetChooser","validation":null,"noValidation":null,"dataType":"STR
ING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"title","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"useTitle","validation":null,"noValidation":null,"dataType":"BOOLEAN","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"useBackground","validation":null,"noValidation":null,"dataType":"BOOLEAN","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"widgetVisibility","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"moreOptions","validation":null,"noValidation":null,"dataType":"STRING","list":null,"control":null,"defaultValue":null,"label":null,"description":null,"possibleValues":null,"__typename":"FormField"},{"id":"cMax_items","validation":null,"noValidation":null,"dataType":"NUMBER","list":false,"control":"INPUT","defaultValue":"3","label":"Max Items","description":"The maximum number of items to display in the 
carousel","possibleValues":null,"__typename":"FormField"}],"layout":{"rows":[{"id":"widgetChooserGroup","type":"fieldset","as":null,"items":[{"id":"widgetChooser","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"titleGroup","type":"fieldset","as":null,"items":[{"id":"title","className":null,"__typename":"FormFieldRef"},{"id":"useTitle","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"useBackground","type":"fieldset","as":null,"items":[{"id":"useBackground","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"widgetVisibility","type":"fieldset","as":null,"items":[{"id":"widgetVisibility","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"moreOptionsGroup","type":"fieldset","as":null,"items":[{"id":"moreOptions","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"},{"id":"componentPropsGroup","type":"fieldset","as":null,"items":[{"id":"cMax_items","className":null,"__typename":"FormFieldRef"}],"props":null,"legend":null,"description":null,"className":null,"viewVariant":null,"toggleState":null,"__typename":"FormFieldset"}],"actionButtons":null,"className":"custom_widget_HeroBanner_form","formGroupFieldSeparator":"divider","__typename":"FormLayout"},"__typename":"Form"}},"localOverride":false},"CachedAsset:component:custom.widget.MicrosoftFooter-en-1744400827912":{"__typename":"CachedAsset","id":"component:custom.widget.Micro
softFooter-en-1744400827912","value":{"component":{"id":"custom.widget.MicrosoftFooter","template":{"id":"MicrosoftFooter","markupLanguage":"HANDLEBARS","style":".context-uhf {\n min-width: 280px;\n font-size: 15px;\n box-sizing: border-box;\n -ms-text-size-adjust: 100%;\n -webkit-text-size-adjust: 100%;\n & *,\n & *:before,\n & *:after {\n box-sizing: inherit;\n }\n a.c-uhff-link {\n color: #616161;\n word-break: break-word;\n text-decoration: none;\n }\n &a:link,\n &a:focus,\n &a:hover,\n &a:active,\n &a:visited {\n text-decoration: none;\n color: inherit;\n }\n & div {\n font-family: 'Segoe UI', SegoeUI, 'Helvetica Neue', Helvetica, Arial, sans-serif;\n }\n}\n.c-uhff {\n background: #f2f2f2;\n margin: -1.5625;\n width: auto;\n height: auto;\n}\n.c-uhff-nav {\n margin: 0 auto;\n max-width: calc(1600px + 10%);\n padding: 0 5%;\n box-sizing: inherit;\n &:before,\n &:after {\n content: ' ';\n display: table;\n clear: left;\n }\n @media only screen and (max-width: 1083px) {\n padding-left: 12px;\n }\n .c-heading-4 {\n color: #616161;\n word-break: break-word;\n font-size: 15px;\n line-height: 20px;\n padding: 36px 0 4px;\n font-weight: 600;\n }\n .c-uhff-nav-row {\n .c-uhff-nav-group {\n display: block;\n float: left;\n min-height: 1px;\n vertical-align: text-top;\n padding: 0 12px;\n width: 100%;\n zoom: 1;\n &:first-child {\n padding-left: 0;\n @media only screen and (max-width: 1083px) {\n padding-left: 12px;\n }\n }\n @media only screen and (min-width: 540px) and (max-width: 1082px) {\n width: 33.33333%;\n }\n @media only screen and (min-width: 1083px) {\n width: 16.6666666667%;\n }\n ul.c-list.f-bare {\n font-size: 11px;\n line-height: 16px;\n margin-top: 0;\n margin-bottom: 0;\n padding-left: 0;\n list-style-type: none;\n li {\n word-break: break-word;\n padding: 8px 0;\n margin: 0;\n }\n }\n }\n }\n}\n.c-uhff-base {\n background: #f2f2f2;\n margin: 0 auto;\n max-width: calc(1600px + 10%);\n padding: 30px 5% 16px;\n &:before,\n &:after {\n content: ' ';\n 
display: table;\n }\n &:after {\n clear: both;\n }\n a.c-uhff-ccpa {\n font-size: 11px;\n line-height: 16px;\n float: left;\n margin: 3px 0;\n }\n a.c-uhff-ccpa:hover {\n text-decoration: underline;\n }\n ul.c-list {\n font-size: 11px;\n line-height: 16px;\n float: right;\n margin: 3px 0;\n color: #616161;\n li {\n padding: 0 24px 4px 0;\n display: inline-block;\n }\n }\n .c-list.f-bare {\n padding-left: 0;\n list-style-type: none;\n }\n @media only screen and (max-width: 1083px) {\n display: flex;\n flex-wrap: wrap;\n padding: 30px 24px 16px;\n }\n}\n\n.social-share {\n position: fixed;\n top: 60%;\n transform: translateY(-50%);\n left: 0;\n z-index: 1000;\n}\n\n.sharing-options {\n list-style: none;\n padding: 0;\n margin: 0;\n display: block;\n flex-direction: column;\n background-color: white;\n width: 43px;\n border-radius: 0px 7px 7px 0px;\n}\n.linkedin-icon {\n border-top-right-radius: 7px;\n}\n.linkedin-icon:hover {\n border-radius: 0;\n}\n.social-share-rss-image {\n border-bottom-right-radius: 7px;\n}\n.social-share-rss-image:hover {\n border-radius: 0;\n}\n\n.social-link-footer {\n position: relative;\n display: block;\n margin: -2px 0;\n transition: all 0.2s ease;\n}\n.social-link-footer:hover .linkedin-icon {\n border-radius: 0;\n}\n.social-link-footer:hover .social-share-rss-image {\n border-radius: 0;\n}\n\n.social-link-footer img {\n width: 40px;\n height: auto;\n transition: filter 0.3s ease;\n}\n\n.social-share-list {\n width: 40px;\n}\n.social-share-rss-image {\n width: 40px;\n}\n\n.share-icon {\n border: 2px solid transparent;\n display: inline-block;\n position: relative;\n}\n\n.share-icon:hover {\n opacity: 1;\n border: 2px solid white;\n box-sizing: border-box;\n}\n\n.share-icon:hover .label {\n opacity: 1;\n visibility: visible;\n border: 2px solid white;\n box-sizing: border-box;\n border-left: none;\n}\n\n.label {\n position: absolute;\n left: 100%;\n white-space: nowrap;\n opacity: 0;\n visibility: hidden;\n transition: all 0.2s ease;\n 
1 Comment

jsaelices
Mar 13, 2024

Great article folks. Very detailed and very useful for many customers trying to run this kind of workloads in AKS.