Azure AI Foundry Blog

Enterprise AKS Multi-Instance GPU (MIG) vLLM Deployment Guide

hieunhu (Microsoft)
Sep 02, 2025
Note: This document uses placeholder values like <your-region>, ai-gpu-aks-rg, and ai-h100-cluster. Replace these with your organization's naming conventions and preferred Azure regions.

Executive Summary

This comprehensive guide demonstrates how to deploy AI models using vLLM on Azure Kubernetes Service (AKS) with NVIDIA H100 GPUs and Multi-Instance GPU (MIG) technology. The solution enables running multiple AI models simultaneously on a single GPU with hardware isolation, optimizing cost and resource utilization.

🔒 Isolated AI Zone Architecture: This self-hosted model deployment operates within your own Azure tenant, providing a dedicated isolated zone for specialized workloads. Compatible with all new Azure regions including Indonesia Central, Malaysia West, and other emerging markets. The architecture integrates seamlessly with Azure API Management (APIM) AI Gateway for intelligent traffic routing across Azure AI Foundry Models, Azure GPU models, and on-premises deployments, creating a unified hybrid AI infrastructure.

  • Single H100 GPU serving 2 models simultaneously
  • Hardware isolation with guaranteed performance
  • Production-ready automated management
  • Dedicated isolated zone in your tenant
  • Cost savings through GPU sharing
  • APIM AI Gateway integration ready

🚀 Quick Start

TL;DR: Complete deployment in ~30 minutes with these essential commands

Prerequisites

  • Azure CLI installed and logged in
  • kubectl installed
  • Sufficient Azure quota for H100 GPUs

Essential Commands Only

# 1. Create infrastructure (5 minutes)
az group create --name ai-gpu-aks-rg --location <your-region>
az aks create --resource-group ai-gpu-aks-rg --name ai-h100-cluster --location <your-region> --node-count 1 --generate-ssh-keys
az aks nodepool add --resource-group ai-gpu-aks-rg --cluster-name ai-h100-cluster --name gpupool --node-count 1 --node-vm-size Standard_NC40ads_H100_v5
az aks get-credentials --resource-group ai-gpu-aks-rg --name ai-h100-cluster

# 2. Install GPU components (10 minutes)
helm install --wait --create-namespace -n gpu-operator node-feature-discovery node-feature-discovery --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts --set-json master.config.extraLabelNs='["nvidia.com"]'
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator -n gpu-operator nvidia/gpu-operator --set nfd.enabled=false --set driver.enabled=false

# 3. Configure MIG (2 minutes)
az aks nodepool update --cluster-name ai-h100-cluster --resource-group ai-gpu-aks-rg --nodepool-name gpupool --labels "nvidia.com/mig.config"="all-3g.47gb"

# 4. Deploy models (10 minutes)
kubectl create namespace vllm
# Apply the YAML manifests from Phase 4 sections
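
# 5. Verify (optional): watch both model pods until they report Running
kubectl get pods -n vllm -w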

What You Get

  • 2 AI Models running simultaneously on 1 GPU
  • 47.5GB memory per model instance
  • Hardware isolation between workloads
  • 50% cost savings vs separate GPUs
  • Persistent DNS names for easy API access:
    • GPT-OSS: gpt-oss.<location>.cloudapp.azure.com:8000
    • Phi-4: phi4.<location>.cloudapp.azure.com:8001
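
Once the LoadBalancer services are provisioned, a quick smoke test is to hit each endpoint's /health route, which vLLM's OpenAI-compatible server exposes (typically without requiring the API key). A minimal sketch using the DNS names above:

# Both endpoints should return "200" once the models have finished loading
curl -s -o /dev/null -w "%{http_code}\n" http://gpt-oss.<location>.cloudapp.azure.com:8000/health
curl -s -o /dev/null -w "%{http_code}\n" http://phi4.<location>.cloudapp.azure.com:8001/health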

Architecture Overview

High-Level Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                    Azure Kubernetes Service (AKS)                    │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                         vLLM Namespace                         │  │
│  │  ┌──────────────────┐    ┌──────────────────┐                  │  │
│  │  │ GPT-OSS Service  │    │  Phi-4 Service   │                  │  │
│  │  │ (Chat/Reasoning) │    │ (Chat/Analysis)  │                  │  │
│  │  │ Port: 8000       │    │ Port: 8001       │                  │  │
│  │  └──────────────────┘    └──────────────────┘                  │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                              │                                       │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                     GPU Operator Namespace                     │  │
│  │  ┌────────────┐ ┌─────────────┐ ┌────────────────────────┐     │  │
│  │  │    NFD     │ │Device Plugin│ │    MIG Manager         │     │  │
│  │  └────────────┘ └─────────────┘ └────────────────────────┘     │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                              │                                       │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                       H100 GPU Node Pool                       │  │
│  │  ┌──────────────────────────────────────────────────────────┐  │  │
│  │  │            NVIDIA H100 NVL (94GB VRAM)                   │  │  │
│  │  │  ┌──────────────────┐    ┌──────────────────┐            │  │  │
│  │  │  │ MIG Instance 1   │    │ MIG Instance 2   │            │  │  │
│  │  │  │ 47.5GB Memory    │    │ 47.5GB Memory    │            │  │  │
│  │  │  │ 60 SM Units      │    │ 60 SM Units      │            │  │  │
│  │  │  │                  │    │                  │            │  │  │
│  │  │  │ GPT-OSS Model    │    │  Phi-4 Model     │            │  │  │
│  │  │  │ (~20GB Used)     │    │ (~14GB Used)     │            │  │  │
│  │  │  └──────────────────┘    └──────────────────┘            │  │  │
│  │  └──────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Technology Stack

| Component               | Technology                     | Version          | Purpose                          |
|-------------------------|--------------------------------|------------------|----------------------------------|
| Container Orchestration | Azure Kubernetes Service (AKS) | 1.30             | Container orchestration platform |
| GPU Hardware            | NVIDIA H100 NVL                | -                | High-performance AI compute      |
| GPU Virtualization      | Multi-Instance GPU (MIG)       | 3g.47gb profiles | Hardware-level GPU partitioning  |
| GPU Management          | NVIDIA GPU Operator            | Latest           | Automated GPU software stack     |
| Model Serving           | vLLM                           | Latest           | High-performance LLM inference   |
| Models                  | GPT-OSS 20B, Phi-4 14B         | Latest           | Reasoning & chat/analysis models |

Prerequisites

Azure Resources Required

  • Active Azure subscription with sufficient quota
  • Resource group in preferred Azure region (with H100 availability)
  • Azure CLI installed and configured
  • kubectl installed and configured

GPU Quota Requirements

| Resource                 | Quota Needed | Purpose                                      |
|--------------------------|--------------|----------------------------------------------|
| Standard_NC40ads_H100_v5 | 40 vCPUs     | H100 GPU instance                            |
| Total Regional vCPUs     | 44+          | GPU node pool (40) plus system node pool (4) |
| Premium Managed Disks    | 200GB+       | Storage                                      |
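
You can confirm quota before creating the node pool with az vm list-usage. A quick sketch (the family name string in the output, filtered here with grep, may vary slightly by subscription):

# Check current usage vs. quota for the NCads H100 v5 VM family
az vm list-usage --location <your-region> --output table | grep -i "NCads"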

Phase 1: AKS Cluster Creation

Step 1.1: Create Resource Group

# Create resource group in your preferred region
az group create \
  --name ai-gpu-aks-rg \
  --location <your-region>

Step 1.2: Create AKS Cluster with System Node Pool

# Create AKS cluster with system node pool
az aks create \
  --resource-group ai-gpu-aks-rg \
  --name ai-h100-cluster \
  --location <your-region> \
  --node-count 1 \
  --node-vm-size Standard_D4s_v5 \
  --kubernetes-version 1.30 \
  --enable-managed-identity \
  --network-plugin azure \
  --network-policy azure \
  --node-osdisk-type Managed \
  --node-osdisk-size 100 \
  --generate-ssh-keys

Step 1.3: Add H100 GPU Node Pool

# Add GPU node pool with H100
az aks nodepool add \
  --resource-group ai-gpu-aks-rg \
  --cluster-name ai-h100-cluster \
  --name gpupool \
  --node-count 1 \
  --node-vm-size Standard_NC40ads_H100_v5 \
  --node-osdisk-type Managed \
  --node-osdisk-size 200 \
  --max-pods 110 \
  --kubernetes-version 1.30

Step 1.4: Configure kubectl Access

# Get cluster credentials
az aks get-credentials \
  --resource-group ai-gpu-aks-rg \
  --name ai-h100-cluster \
  --overwrite-existing

# Verify cluster access
kubectl get nodes
Expected Output:
NAME                                STATUS   ROLES    AGE   VERSION
aks-gpupool-xxxxxxxx-vmss000001     Ready    <none>   5m    v1.30.14
aks-nodepool1-xxxxxxxx-vmss000001   Ready    <none>   10m   v1.30.14

Phase 2: GPU Operator Installation

The NVIDIA GPU Operator automates the management of all NVIDIA software components needed for GPU workloads.

Step 2.1: Install Node Feature Discovery (NFD)

# Install NFD as prerequisite for GPU Operator
helm install --wait --create-namespace -n gpu-operator \
  node-feature-discovery node-feature-discovery \
  --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts \
  --set-json master.config.extraLabelNs='["nvidia.com"]' \
  --set-json worker.tolerations='[
    {
      "effect": "NoSchedule",
      "key": "sku",
      "operator": "Equal",
      "value": "gpu"
    },
    {
      "effect": "NoSchedule",
      "key": "mig",
      "value": "notReady",
      "operator": "Equal"
    }
  ]'

Step 2.2: Install GPU Operator

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install --wait gpu-operator -n gpu-operator nvidia/gpu-operator \
  --set-json daemonsets.tolerations='[
    {
      "effect": "NoSchedule",
      "key": "sku",
      "operator": "Equal",
      "value": "gpu"
    }
  ]' \
  --set nfd.enabled=false \
  --set driver.enabled=false \
  --set operator.runtimeClass=nvidia-container-runtime

Step 2.3: Verify GPU Operator Installation

# Check all GPU Operator components
kubectl get pods -n gpu-operator
Expected Components:
  • nvidia-device-plugin-daemonset-xxxxx
  • nvidia-mig-manager-xxxxx
  • nvidia-dcgm-exporter-xxxxx
  • gpu-feature-discovery-xxxxx
  • nvidia-container-toolkit-daemonset-xxxxx
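
The operator pods can take several minutes to become ready. To block until the critical components have rolled out, here is a minimal sketch assuming the GPU Operator's default daemonset names:

# Wait for the device plugin and MIG manager daemonsets to finish rolling out
kubectl rollout status daemonset/nvidia-device-plugin-daemonset -n gpu-operator --timeout=600s
kubectl rollout status daemonset/nvidia-mig-manager -n gpu-operator --timeout=600s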

Phase 3: MIG Configuration

Multi-Instance GPU (MIG) allows partitioning the H100 into multiple isolated GPU instances.

Step 3.1: Discover Available MIG Profiles

Important: Different GPU models have different MIG profile names. Always check what's available on your specific GPU.
# Check available MIG instance profiles on your GPU
kubectl exec -n gpu-operator \
  $(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') \
  -- nvidia-smi mig -lgip
Common MIG Profiles by GPU Model:
  • H100 NVL: Uses 3g.47gb (46.38 GiB per instance)
  • A100 80GB: Uses 3g.40gb (39.59 GiB per instance)
  • H100 SXM: May vary, check with the command above

Use the profile name exactly as shown in your output when configuring MIG.
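
Several later commands look up the MIG manager pod inline with $(kubectl get pods ...). Purely as a convenience, you can capture the pod name once in a shell variable and substitute it wherever the guide uses the inline expression:

# Optional: capture the MIG manager pod name for reuse
MIG_MGR=$(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}')

# Example: list the available GPU instance profiles
kubectl exec -n gpu-operator "$MIG_MGR" -- nvidia-smi mig -lgip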

Step 3.2: Enable MIG on GPU Node

# Label the GPU node with the CORRECT MIG configuration for your GPU
# For H100 NVL, use "3g.47gb". For A100, use "3g.40gb"
az aks nodepool update \
  --cluster-name ai-h100-cluster \
  --resource-group ai-gpu-aks-rg \
  --nodepool-name gpupool \
  --labels "nvidia.com/mig.config"="all-3g.47gb"

# Configure MIG Manager for reboot (H100 requirement)
kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge \
  -p '{"spec": {"migManager": {"env": [{"name": "WITH_REBOOT", "value": "true"}]}}}'

Step 3.3: Verify MIG Configuration

# Check MIG instances are created
kubectl exec -n gpu-operator $(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi mig -lgi

# Verify GPU resources are available
kubectl describe node -l agentpool=gpupool | grep nvidia.com/gpu
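
With the all-3g.47gb profile, a single H100 NVL is split into two instances, so the node should advertise nvidia.com/gpu: 2. A quick sketch for reading the allocatable resources directly (uses jq, which the test commands below also rely on):

# The allocatable list should show "nvidia.com/gpu": "2" (one per MIG instance)
kubectl get node -l agentpool=gpupool -o json | jq '.items[0].status.allocatable'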

Phase 4: Model Deployments

📦 View GPT-OSS Deployment YAML
# gpt-oss-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpt-oss-mig
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpt-oss-mig
  template:
    metadata:
      labels:
        app: gpt-oss-mig
    spec:
      nodeSelector:
        agentpool: gpupool
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "openai/gpt-oss-20b"
          - "--trust-remote-code"
          - "--max-model-len"
          - "4096"
          - "--gpu-memory-utilization"
          - "0.85"
          - "--dtype"
          - "auto"
          - "--api-key"
          - "token-abc123"
          - "--port"
          - "8000"
          - "--host"
          - "0.0.0.0"
          - "--enable-prefix-caching"
          - "--max-num-seqs"
          - "256"
        ports:
        - containerPort: 8000
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: VLLM_WORKER_MULTIPROC_METHOD
          value: "spawn"
        - name: OMP_NUM_THREADS
          value: "4"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "30Gi"
            cpu: "6"
          requests:
            nvidia.com/gpu: 1
            memory: "24Gi"
            cpu: "4"
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: gpt-oss-service
  namespace: vllm
  annotations:
    service.beta.kubernetes.io/azure-dns-label-name: "gpt-oss"
spec:
  type: LoadBalancer
  selector:
    app: gpt-oss-mig
  ports:
  - port: 8000
    targetPort: 8000
📦 View Phi-4 Deployment YAML
# phi4-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phi4-mig
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phi4-mig
  template:
    metadata:
      labels:
        app: phi4-mig
    spec:
      nodeSelector:
        agentpool: gpupool
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "microsoft/phi-4"
          - "--trust-remote-code"
          - "--max-model-len"
          - "8192"
          - "--gpu-memory-utilization"
          - "0.85"
          - "--dtype"
          - "auto"
          - "--api-key"
          - "token-abc123"
          - "--port"
          - "8001"
          - "--host"
          - "0.0.0.0"
          - "--enable-prefix-caching"
        ports:
        - containerPort: 8001
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: VLLM_WORKER_MULTIPROC_METHOD
          value: "spawn"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "30Gi"
            cpu: "6"
          requests:
            nvidia.com/gpu: 1
            memory: "24Gi"
            cpu: "4"
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: phi4-service
  namespace: vllm
  annotations:
    service.beta.kubernetes.io/azure-dns-label-name: "phi4"
spec:
  type: LoadBalancer
  selector:
    app: phi4-mig
  ports:
  - port: 8001
    targetPort: 8001

Deploy Models

# Create namespace
kubectl create namespace vllm

# Deploy GPT-OSS
kubectl apply -f gpt-oss-deployment.yaml

# Deploy Phi-4
kubectl apply -f phi4-deployment.yaml

# Check deployment status
kubectl get pods,svc -n vllm
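
Model download and weight loading can take several minutes on first start. To block until each deployment is fully rolled out and then list the public endpoints, here is a short sketch using the deployment and service names from the manifests above:

# Wait for both vLLM deployments to finish rolling out
kubectl rollout status deployment/gpt-oss-mig -n vllm --timeout=15m
kubectl rollout status deployment/phi4-mig -n vllm --timeout=15m

# Show the LoadBalancer external IPs and ports
kubectl get svc -n vllm -o wide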

Test Model Services

# Test GPT-OSS reasoning model (using DNS name)
curl -X POST http://gpt-oss.<location>.cloudapp.azure.com:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{
    "model": "microsoft/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain the benefits of MIG technology for enterprise AI workloads"}],
    "max_tokens": 200,
    "temperature": 0.7
  }' | jq .

# Test Phi-4 small language model (using DNS name)  
curl -X POST http://phi4.<location>.cloudapp.azure.com:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{
    "model": "microsoft/phi-4-14b",
    "messages": [{"role": "user", "content": "Analyze cost savings from GPU virtualization in cloud environments"}],
    "max_tokens": 150,
    "temperature": 0.5
  }' | jq .
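
Both servers also expose the OpenAI-compatible /v1/models endpoint, which is a lightweight way to confirm each instance is serving the expected model. A sketch using the same DNS names and API key as above:

# List the model registered on each endpoint
curl -s http://gpt-oss.<location>.cloudapp.azure.com:8000/v1/models \
  -H "Authorization: Bearer token-abc123" | jq .
curl -s http://phi4.<location>.cloudapp.azure.com:8001/v1/models \
  -H "Authorization: Bearer token-abc123" | jq .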

Monitoring and Operations

GPU Utilization Monitoring

# Check GPU utilization across MIG instances
kubectl exec -n gpu-operator \
  $(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') \
  -- nvidia-smi

Application Monitoring

# Monitor pod resource usage
kubectl top pods -n vllm

# Check application logs
kubectl logs -f deployment/gpt-oss-mig -n vllm
kubectl logs -f deployment/phi4-mig -n vllm

# Monitor service health
kubectl get endpoints -n vllm
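
The GPU Operator also deploys the DCGM exporter, which publishes per-MIG-instance GPU metrics in Prometheus format. To inspect them without a full monitoring stack, here is a sketch assuming the GPU Operator's default service name and port (9400):

# Port-forward the DCGM exporter and look at framebuffer usage per MIG instance
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
sleep 2
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_FB_USED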

Troubleshooting

🔧 MIG Instances Not Created
Symptoms:
  • nvidia.com/gpu: 1 instead of 2 in node allocatable
  • Models sharing same GPU without isolation
Solution:
# Check MIG configuration
kubectl describe node -l agentpool=gpupool | grep mig.config

# Restart MIG manager if needed
kubectl delete pod -l app=nvidia-mig-manager -n gpu-operator

# Verify GPU processes
kubectl exec -n gpu-operator <mig-manager-pod> -- nvidia-smi
🔧 Pod Pending with GPU Resource Issues
Symptoms:
  • Pods stuck in Pending state
  • Event: Insufficient nvidia.com/gpu
Solution:
# Check GPU resource availability
kubectl describe node -l agentpool=gpupool | grep -A10 Allocatable

# Verify device plugin is running
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
🔧 Model Loading Failures
Symptoms:
  • vLLM containers crashing during model load
  • OOM (Out of Memory) errors
Solution:
# Check available memory per MIG instance
kubectl exec -n gpu-operator <mig-manager-pod> -- nvidia-smi mig -lgi

# Adjust gpu-memory-utilization in deployment
# Reduce from 0.85 to 0.7 if needed
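
One way to apply the change is to lower the flag in the manifest and re-apply it so the deployment rolls the pod. A sketch against gpt-oss-deployment.yaml, assuming GNU sed on Linux:

# Drop vLLM's GPU memory fraction from 0.85 to 0.7, then re-apply
sed -i 's/"0.85"/"0.7"/' gpt-oss-deployment.yaml
kubectl apply -f gpt-oss-deployment.yaml
kubectl rollout status deployment/gpt-oss-mig -n vllm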

Cost Analysis

50% Cost Reduction with MIG

Without MIG: 2x H100 GPUs = ~$4,800/month

With MIG: 1x H100 GPU = ~$2,400/month

Monthly Savings: $2,400

Infrastructure Costs (Example: Switzerland North)

| Resource         | Type                     | Quantity | Monthly Cost (USD) |
|------------------|--------------------------|----------|--------------------|
| AKS Cluster      | Management               | 1        | Free               |
| System Node Pool | Standard_D4s_v5          | 1        | ~$120              |
| GPU Node Pool    | Standard_NC40ads_H100_v5 | 1        | ~$2,400            |
| Managed Disks    | Premium SSD              | 300GB    | ~$60               |
| Load Balancer    | Standard                 | 2        | ~$40               |
| Total            |                          |          | ~$2,620            |

Security Considerations

✅ Built-in Security Features

  • API Authentication: Bearer token required (token-abc123)
  • Hardware Isolation: MIG provides GPU-level isolation
  • Network Isolation: Kubernetes namespace separation
  • Data Residency: All processing within your chosen Azure region
  • No Data Persistence: Models don't store request/response data
  • Transport Security: The sample LoadBalancer services expose plain HTTP; terminate TLS in front of them (for example with an ingress controller or the APIM AI Gateway) before exposing the endpoints publicly
🔒 Advanced Security (Production Recommended)

Additional Security Measures

  1. Change API Keys: Replace token-abc123 with strong, unique tokens (see the sketch after this list)
  2. Rate Limiting: Add ingress controllers with rate limits
  3. Input Validation: Implement request/response validation
  4. Audit Logging: Enable Azure Monitor for comprehensive logging
  5. Private Endpoints: Use private AKS clusters for sensitive workloads
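
As a starting point for item 1, the key can be generated and stored in a Kubernetes Secret rather than hard-coded in the deployment args. A minimal sketch (the secret name vllm-api-key is illustrative; you would then reference it from the deployment, for example via an env entry with secretKeyRef, and recent vLLM releases can also read the key from a VLLM_API_KEY environment variable):

# Generate a random key and store it in a Kubernetes Secret
kubectl create secret generic vllm-api-key -n vllm \
  --from-literal=key="$(openssl rand -hex 32)"

# Read it back when updating clients
kubectl get secret vllm-api-key -n vllm -o jsonpath='{.data.key}' | base64 -d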

Best Practices

Resource Management

  1. Right-sizing: Monitor actual usage and adjust resource requests/limits
  2. Node Affinity: Use node selectors to ensure GPU workloads run on GPU nodes
  3. Horizontal Scaling: Plan for multiple replicas with additional GPU nodes
  4. Vertical Scaling: Adjust MIG profiles based on workload requirements

Operational Excellence

  1. GitOps: Store all configurations in version control
  2. CI/CD Integration: Automate deployments with proper testing
  3. Monitoring: Implement comprehensive monitoring and alerting
  4. Backup/Recovery: Regular backup of configuration and state

Conclusion

This implementation provides organizations with:

  • Cost-Effective AI Infrastructure: 50% cost reduction through GPU sharing
  • Production-Ready Platform: Automated management and monitoring
  • Scalable Architecture: Easy to extend with additional models/nodes
  • Enterprise Security: Comprehensive security and compliance features
  • Operational Excellence: Full observability and troubleshooting capabilities
  • Persistent DNS Access: Stable FQDNs that survive cluster restarts

The MIG-enabled AKS cluster successfully demonstrates how modern GPU virtualization can optimize AI workload deployments while maintaining strict isolation and performance guarantees.

Next Steps

  1. Production Readiness: Implement comprehensive monitoring and alerting
  2. Model Expansion: Add additional AI models as business requires
  3. Automation: Develop CI/CD pipelines for model deployment
  4. Optimization: Continuous performance tuning based on usage patterns
  5. Scaling: Plan for multi-node GPU clusters as demand grows
Updated Sep 02, 2025
Version 3.0