Note: This guide uses example values such as <your-region>, ai-gpu-aks-rg, and ai-h100-cluster. Replace these with your organization's naming conventions and preferred Azure regions.
Executive Summary
This comprehensive guide demonstrates how to deploy AI models using vLLM on Azure Kubernetes Service (AKS) with NVIDIA H100 GPUs and Multi-Instance GPU (MIG) technology. The solution enables running multiple AI models simultaneously on a single GPU with hardware isolation, optimizing cost and resource utilization.
Isolated AI Zone Architecture: This self-hosted model deployment operates within your own Azure tenant, providing a dedicated isolated zone for specialized workloads. It is compatible with all new Azure regions, including Indonesia Central, Malaysia West, and other emerging markets. The architecture integrates with Azure API Management (APIM) AI Gateway for intelligent traffic routing across Azure AI Foundry Models, Azure GPU models, and on-premises deployments, creating a unified hybrid AI infrastructure.
Quick Start
TL;DR: Complete deployment in ~30 minutes with these essential commands
Prerequisites
- Azure CLI installed and logged in
- kubectl installed
- Sufficient Azure quota for H100 GPUs
Essential Commands Only
# 1. Create infrastructure (5 minutes)
az group create --name ai-gpu-aks-rg --location eastus
az aks create --resource-group ai-gpu-aks-rg --name ai-h100-cluster --location eastus --node-count 1 --generate-ssh-keys
az aks nodepool add --resource-group ai-gpu-aks-rg --cluster-name ai-h100-cluster --name gpupool --node-count 1 --node-vm-size Standard_NC40ads_H100_v5
az aks get-credentials --resource-group ai-gpu-aks-rg --name ai-h100-cluster
# 2. Install GPU components (10 minutes)
helm install --wait --create-namespace -n gpu-operator node-feature-discovery node-feature-discovery --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts --set-json master.config.extraLabelNs='["nvidia.com"]'
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator -n gpu-operator nvidia/gpu-operator --set nfd.enabled=false --set driver.enabled=false
# 3. Configure MIG (2 minutes)
az aks nodepool update --cluster-name ai-h100-cluster --resource-group ai-gpu-aks-rg --nodepool-name gpupool --labels "nvidia.com/mig.config"="all-3g.47gb"
# 4. Deploy models (10 minutes)
kubectl create namespace vllm
# Apply the YAML manifests from Phase 4 sections
What You Get
- 2 AI Models running simultaneously on 1 GPU
- 47.5GB memory per model instance
- Hardware isolation between workloads
- 50% cost savings vs separate GPUs
- Persistent DNS names for easy API access:
  - GPT-OSS: gpt-oss.<location>.cloudapp.azure.com:8000
  - Phi-4: phi4.<location>.cloudapp.azure.com:8001
Architecture Overview
High-Level Architecture
+--------------------------------------------------------------+
| Azure Kubernetes Service (AKS)                               |
|                                                              |
|  vLLM Namespace                                              |
|  +----------------------+      +----------------------+      |
|  | GPT-OSS Service      |      | Phi-4 Service        |      |
|  | (Chat/Reasoning)     |      | (Chat/Analysis)      |      |
|  | Port: 8000           |      | Port: 8001           |      |
|  +----------------------+      +----------------------+      |
|                                                              |
|  GPU Operator Namespace                                      |
|  +---------+  +---------------+  +---------------------+     |
|  |   NFD   |  | Device Plugin |  |     MIG Manager     |     |
|  +---------+  +---------------+  +---------------------+     |
|                                                              |
|  H100 GPU Node Pool: NVIDIA H100 NVL (94GB VRAM)             |
|  +----------------------+      +----------------------+      |
|  | MIG Instance 1       |      | MIG Instance 2       |      |
|  | 47.5GB Memory        |      | 47.5GB Memory        |      |
|  | 60 SM Units          |      | 60 SM Units          |      |
|  |                      |      |                      |      |
|  | GPT-OSS Model        |      | Phi-4 Model          |      |
|  | (~20GB Used)         |      | (~14GB Used)         |      |
|  +----------------------+      +----------------------+      |
+--------------------------------------------------------------+
Technology Stack
Component | Technology | Version | Purpose |
---|---|---|---|
Container Orchestration | Azure Kubernetes Service (AKS) | 1.30 | Container orchestration platform |
GPU Hardware | NVIDIA H100 NVL | - | High-performance AI compute |
GPU Virtualization | Multi-Instance GPU (MIG) | 3g.47gb profiles | Hardware-level GPU partitioning |
GPU Management | NVIDIA GPU Operator | Latest | Automated GPU software stack |
Model Serving | vLLM | Latest | High-performance LLM inference |
Models | GPT-OSS 20B, Phi-4 14B | Latest | Reasoning & Chat/Analysis models |
Prerequisites
Azure Resources Required
- Active Azure subscription with sufficient quota
- Resource group in preferred Azure region (with H100 availability)
- Azure CLI installed and configured
- kubectl installed and configured
GPU Quota Requirements
Resource | Quota Needed | Purpose |
---|---|---|
Standard_NC40ads_H100_v5 | 40 vCPUs | H100 GPU instance |
Total Regional vCPUs | 40+ | Node capacity |
Premium Managed Disks | 200GB+ | Storage |
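Before creating the cluster, you can confirm the quota is actually available in your target region with the az CLI; the grep filters below are illustrative and may need adjusting to match the exact quota names shown for your subscription.
# Check current usage and limits for the H100 VM family in your region
az vm list-usage --location <your-region> -o table | grep -i "H100"
# Check total regional vCPU quota as well
az vm list-usage --location <your-region> -o table | grep -i "Total Regional"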
Phase 1: AKS Cluster Creation
Step 1.1: Create Resource Group
# Create resource group in your preferred region
az group create \
--name ai-gpu-aks-rg \
--location <your-region>
Step 1.2: Create AKS Cluster with System Node Pool
# Create AKS cluster with system node pool
az aks create \
--resource-group ai-gpu-aks-rg \
--name ai-h100-cluster \
--location <your-region> \
--node-count 1 \
--node-vm-size Standard_D4s_v5 \
--kubernetes-version 1.30 \
--enable-managed-identity \
--network-plugin azure \
--network-policy azure \
--node-osdisk-type Managed \
--node-osdisk-size 100 \
--generate-ssh-keys
Step 1.3: Add H100 GPU Node Pool
# Add GPU node pool with H100
az aks nodepool add \
--resource-group ai-gpu-aks-rg \
--cluster-name ai-h100-cluster \
--name gpupool \
--node-count 1 \
--node-vm-size Standard_NC40ads_H100_v5 \
--node-osdisk-type Managed \
--node-osdisk-size 200 \
--max-pods 110 \
--kubernetes-version 1.30
Step 1.4: Configure kubectl Access
# Get cluster credentials
az aks get-credentials \
--resource-group ai-gpu-aks-rg \
--name ai-h100-cluster \
--overwrite-existing
# Verify cluster access
kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-gpupool-xxxxxxxx-vmss000001 Ready <none> 5m v1.30.14
aks-nodepool1-xxxxxxxx-vmss000001 Ready <none> 10m v1.30.14
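The model deployments in Phase 4 target the GPU pool through an agentpool: gpupool node selector, so it is worth a quick optional check that the label is present and that the node reports the expected VM size.
# Confirm the GPU node exists and show its VM size via the standard instance-type label
kubectl get nodes -l agentpool=gpupool -L node.kubernetes.io/instance-type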
Phase 2: GPU Operator Installation
The NVIDIA GPU Operator automates the management of all NVIDIA software components needed for GPU workloads.
Step 2.1: Install Node Feature Discovery (NFD)
# Install NFD as prerequisite for GPU Operator
helm install --wait --create-namespace -n gpu-operator \
node-feature-discovery node-feature-discovery \
--repo https://kubernetes-sigs.github.io/node-feature-discovery/charts \
--set-json master.config.extraLabelNs='["nvidia.com"]' \
--set-json worker.tolerations='[
{
"effect": "NoSchedule",
"key": "sku",
"operator": "Equal",
"value": "gpu"
},
{
"effect": "NoSchedule",
"key": "mig",
"value": "notReady",
"operator": "Equal"
}
]'
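Before installing the GPU Operator, it can help to confirm that the NFD master and worker pods are running; the label selector below matches the chart's default labels and may need adjusting if you changed the release name.
# Verify NFD pods are running in the gpu-operator namespace
kubectl get pods -n gpu-operator -l app.kubernetes.io/name=node-feature-discovery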
Step 2.2: Install GPU Operator
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator
helm install --wait gpu-operator -n gpu-operator nvidia/gpu-operator \
--set-json daemonsets.tolerations='[
{
"effect": "NoSchedule",
"key": "sku",
"operator": "Equal",
"value": "gpu"
}
]' \
--set nfd.enabled=false \
--set driver.enabled=false \
--set operator.runtimeClass=nvidia-container-runtime
Step 2.3: Verify GPU Operator Installation
# Check all GPU Operator components
kubectl get pods -n gpu-operator
nvidia-device-plugin-daemonset-xxxxx
nvidia-mig-manager-xxxxx
nvidia-dcgm-exporter-xxxxx
gpu-feature-discovery-xxxxx
nvidia-container-toolkit-daemonset-xxxxx
Phase 3: MIG Configuration
Multi-Instance GPU (MIG) allows partitioning the H100 into multiple isolated GPU instances.
Step 3.1: Discover Available MIG Profiles
# Check available MIG instance profiles on your GPU
kubectl exec -n gpu-operator \
$(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') \
-- nvidia-smi mig -lgip
- H100 NVL: uses 3g.47gb (46.38 GiB per instance)
- A100 80GB: uses 3g.40gb (39.59 GiB per instance)
- H100 SXM: may vary; check with the command above
Use the profile name exactly as shown in your output when configuring MIG.
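If you are unsure which H100 variant your node pool received, the GPU product label added by GPU Feature Discovery (installed in Phase 2) is a quick way to check; the grep pattern below is illustrative.
# Identify the exact GPU model reported by GPU Feature Discovery
kubectl get nodes -l agentpool=gpupool --show-labels | tr ',' '\n' | grep -i 'gpu.product'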
Step 3.2: Enable MIG on GPU Node
# Label the GPU node with the CORRECT MIG configuration for your GPU
# For H100 NVL, use "3g.47gb". For A100, use "3g.40gb"
az aks nodepool update \
--cluster-name ai-h100-cluster \
--resource-group ai-gpu-aks-rg \
--nodepool-name gpupool \
--labels "nvidia.com/mig.config"="all-3g.47gb"
# Configure MIG Manager for reboot (H100 requirement)
kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge \
-p '{"spec": {"migManager": {"env": [{"name": "WITH_REBOOT", "value": "true"}]}}}'
Step 3.3: Verify MIG Configuration
# Check MIG instances are created
kubectl exec -n gpu-operator $(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi mig -lgi
# Verify GPU resources are available
kubectl describe node -l agentpool=gpupool | grep nvidia.com/gpu
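The MIG Manager also reports progress through a node label; a value of success indicates the requested layout was applied (on H100 the node may reboot first). This check assumes the GPU Operator's default labels.
# Check the MIG Manager status label until it reports "success"
kubectl get nodes -l agentpool=gpupool -L nvidia.com/mig.config.state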
Phase 4: Model Deployments
View GPT-OSS Deployment YAML
# gpt-oss-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpt-oss-mig
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpt-oss-mig
  template:
    metadata:
      labels:
        app: gpt-oss-mig
    spec:
      nodeSelector:
        agentpool: gpupool
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "openai/gpt-oss-20b"
            - "--trust-remote-code"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.85"
            # gpt-oss ships quantized (MXFP4/bf16) weights; let vLLM pick the dtype
            - "--dtype"
            - "auto"
            - "--api-key"
            - "token-abc123"
            - "--port"
            - "8000"
            - "--host"
            - "0.0.0.0"
            - "--enable-prefix-caching"
            - "--max-num-seqs"
            - "256"
          ports:
            - containerPort: 8000
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: "spawn"
            - name: OMP_NUM_THREADS
              value: "4"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "30Gi"
              cpu: "6"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
              cpu: "4"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: gpt-oss-service
  namespace: vllm
  annotations:
    service.beta.kubernetes.io/azure-dns-label-name: "gpt-oss"
spec:
  type: LoadBalancer
  selector:
    app: gpt-oss-mig
  ports:
    - port: 8000
      targetPort: 8000
View Phi-4 Deployment YAML
# phi4-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phi4-mig
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phi4-mig
  template:
    metadata:
      labels:
        app: phi4-mig
    spec:
      nodeSelector:
        agentpool: gpupool
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "microsoft/phi-4"
            - "--trust-remote-code"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.85"
            - "--dtype"
            - "auto"
            - "--api-key"
            - "token-abc123"
            - "--port"
            - "8001"
            - "--host"
            - "0.0.0.0"
            - "--enable-prefix-caching"
          ports:
            - containerPort: 8001
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: "spawn"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "30Gi"
              cpu: "6"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
              cpu: "4"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: phi4-service
  namespace: vllm
  annotations:
    service.beta.kubernetes.io/azure-dns-label-name: "phi4"
spec:
  type: LoadBalancer
  selector:
    app: phi4-mig
  ports:
    - port: 8001
      targetPort: 8001
Deploy Models
# Create namespace
kubectl create namespace vllm
# Deploy GPT-OSS
kubectl apply -f gpt-oss-deployment.yaml
# Deploy Phi-4
kubectl apply -f phi4-deployment.yaml
# Check deployment status
kubectl get pods,svc -n vllm
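Model images and weights are large, so the first rollout can take several minutes; waiting on the rollout status avoids testing endpoints before the servers are ready (deployment names assume the manifests above).
# Wait for both deployments to become ready (first start includes the model download)
kubectl rollout status deployment/gpt-oss-mig -n vllm --timeout=20m
kubectl rollout status deployment/phi4-mig -n vllm --timeout=20m
# Confirm each service has been assigned a public IP and DNS label
kubectl get svc -n vllm -o wide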
Test Model Services
# Test GPT-OSS reasoning model (using DNS name)
curl -X POST http://gpt-oss.<location>.cloudapp.azure.com:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer token-abc123" \
-d '{
"model": "microsoft/gpt-oss-20b",
"messages": [{"role": "user", "content": "Explain the benefits of MIG technology for enterprise AI workloads"}],
"max_tokens": 200,
"temperature": 0.7
}' | jq .
# Test Phi-4 small language model (using DNS name)
curl -X POST http://phi4.<location>.cloudapp.azure.com:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer token-abc123" \
-d '{
"model": "microsoft/phi-4-14b",
"messages": [{"role": "user", "content": "Analyze cost savings from GPU virtualization in cloud environments"}],
"max_tokens": 150,
"temperature": 0.5
}' | jq .
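Before sending chat requests, a lightweight smoke test against vLLM's OpenAI-compatible /v1/models endpoint confirms each server is up and shows the exact model IDs being served.
# List the models served by each endpoint (should return the model IDs used above)
curl -s http://gpt-oss.<location>.cloudapp.azure.com:8000/v1/models \
  -H "Authorization: Bearer token-abc123" | jq .
curl -s http://phi4.<location>.cloudapp.azure.com:8001/v1/models \
  -H "Authorization: Bearer token-abc123" | jq .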
Monitoring and Operations
GPU Utilization Monitoring
# Check GPU utilization across MIG instances
kubectl exec -n gpu-operator \
$(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') \
-- nvidia-smi
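The GPU Operator also deploys DCGM Exporter, whose Prometheus metrics can be scraped or inspected directly; the service name below is the operator's default and may differ in customized installs.
# Port-forward the DCGM exporter and inspect utilization and memory metrics
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED'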
Application Monitoring
# Monitor pod resource usage
kubectl top pods -n vllm
# Check application logs
kubectl logs -f deployment/gpt-oss-mig -n vllm
kubectl logs -f deployment/phi4-mig -n vllm
# Monitor service health
kubectl get endpoints -n vllm
Troubleshooting
MIG Instances Not Created
- Node allocatable shows nvidia.com/gpu: 1 instead of 2
- Models share the same GPU without isolation
# Check MIG configuration
kubectl describe node -l agentpool=gpupool | grep mig.config
# Restart MIG manager if needed
kubectl delete pod -l app=nvidia-mig-manager -n gpu-operator
# Verify GPU processes
kubectl exec -n gpu-operator <mig-manager-pod> -- nvidia-smi
Pod Pending with GPU Resource Issues
- Pods stuck in Pending state
- Event: Insufficient nvidia.com/gpu
# Check GPU resource availability
kubectl describe node -l agentpool=gpupool | grep -A10 Allocatable
# Verify device plugin is running
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
Model Loading Failures
- vLLM containers crashing during model load
- OOM (Out of Memory) errors
# Check available memory per MIG instance
kubectl exec -n gpu-operator <mig-manager-pod> -- nvidia-smi mig -lgi
# Adjust gpu-memory-utilization in deployment
# Reduce from 0.85 to 0.7 if needed
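A minimal remediation flow, assuming the manifests from Phase 4: lower --gpu-memory-utilization in the YAML, re-apply, then watch the rollout and logs.
# After lowering --gpu-memory-utilization (e.g. 0.85 -> 0.70) in the manifest:
kubectl apply -f gpt-oss-deployment.yaml
kubectl rollout status deployment/gpt-oss-mig -n vllm
# Watch the new pod load the model
kubectl logs -f deployment/gpt-oss-mig -n vllm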
Cost Analysis
50% Cost Reduction with MIG
Without MIG: 2x H100 GPUs = ~$4,800/month
With MIG: 1x H100 GPU = ~$2,400/month
Monthly Savings: $2,400
Infrastructure Costs (Example: Switzerland North)
Resource | Type | Quantity | Monthly Cost (USD) |
---|---|---|---|
AKS Cluster | Management | 1 | Free |
System Node Pool | Standard_D4s_v5 | 1 | ~$120 |
GPU Node Pool | Standard_NC40ads_H100_v5 | 1 | ~$2,400 |
Managed Disks | Premium SSD | 300GB | ~$60 |
Load Balancer | Standard | 2 | ~$40 |
Total | ~$2,620 |
Security Considerations
Built-in Security Features
- API Authentication: Bearer token required (token-abc123)
- Hardware Isolation: MIG provides GPU-level isolation
- Network Isolation: Kubernetes namespace separation
- Data Residency: All processing stays within your chosen Azure region
- No Data Persistence: Models don't store request/response data
- Transport Encryption: the sample LoadBalancer services expose plain HTTP; terminate TLS at an ingress controller or Azure Application Gateway before exposing the endpoints publicly
Advanced Security (Production Recommended)
Additional Security Measures
- Change API Keys: Replace token-abc123 with secure, randomly generated tokens (see the example after this list)
- Rate Limiting: Add ingress controllers with rate limits
- Input Validation: Implement request/response validation
- Audit Logging: Enable Azure Monitor for comprehensive logging
- Private Endpoints: Use private AKS clusters for sensitive workloads
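As a sketch of the "Change API Keys" item: generate a random token, store it in a Kubernetes Secret, and reference it from the deployment instead of hard-coding token-abc123. The secret and key names below are illustrative; remove the hard-coded --api-key argument when you switch.
# Generate a random API key and store it as a Secret in the vllm namespace
kubectl create secret generic vllm-api-key -n vllm \
  --from-literal=api-key="$(openssl rand -hex 32)"
# Reference it from the deployment via an environment variable (recent vLLM releases read
# VLLM_API_KEY as an alternative to --api-key):
# env:
#   - name: VLLM_API_KEY
#     valueFrom:
#       secretKeyRef:
#         name: vllm-api-key
#         key: api-key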
Best Practices
Resource Management
- Right-sizing: Monitor actual usage and adjust resource requests/limits
- Node Affinity: Use node selectors to ensure GPU workloads run on GPU nodes
- Horizontal Scaling: Plan for multiple replicas with additional GPU nodes
- Vertical Scaling: Adjust MIG profiles based on workload requirements
Operational Excellence
- GitOps: Store all configurations in version control
- CI/CD Integration: Automate deployments with proper testing
- Monitoring: Implement comprehensive monitoring and alerting
- Backup/Recovery: Regular backup of configuration and state
Conclusion
This implementation provides organizations with:
- Cost-Effective AI Infrastructure: 50% cost reduction through GPU sharing
- Production-Ready Platform: Automated management and monitoring
- Scalable Architecture: Easy to extend with additional models/nodes
- Enterprise Security: Comprehensive security and compliance features
- Operational Excellence: Full observability and troubleshooting capabilities
- Persistent DNS Access: Stable FQDNs that survive cluster restarts
The MIG-enabled AKS cluster successfully demonstrates how modern GPU virtualization can optimize AI workload deployments while maintaining strict isolation and performance guarantees.
Next Steps
- Production Readiness: Implement comprehensive monitoring and alerting
- Model Expansion: Add additional AI models as business requires
- Automation: Develop CI/CD pipelines for model deployment
- Optimization: Continuous performance tuning based on usage patterns
- Scaling: Plan for multi-node GPU clusters as demand grows