Note: This guide uses example values such as <your-region>, ai-gpu-aks-rg, and ai-h100-cluster. Replace these with your organization's naming conventions and preferred Azure regions.
Executive Summary
This comprehensive guide demonstrates how to deploy AI models using vLLM on Azure Kubernetes Service (AKS) with NVIDIA H100 GPUs and Multi-Instance GPU (MIG) technology. The solution enables running multiple AI models simultaneously on a single GPU with hardware isolation, optimizing cost and resource utilization.
Isolated AI Zone Architecture: This self-hosted model deployment operates within your own Azure tenant, providing a dedicated isolated zone for specialized workloads. It is compatible with all new Azure regions, including Indonesia Central, Malaysia West, and other emerging markets. The architecture integrates with Azure API Management (APIM) AI Gateway for intelligent traffic routing across Azure AI Foundry Models, Azure GPU models, and on-premises deployments, creating a unified hybrid AI infrastructure.
Quick Start
TL;DR: Complete deployment in ~30 minutes with these essential commands
Prerequisites
- Azure CLI installed and logged in
- kubectl installed
- Sufficient Azure quota for H100 GPUs
Essential Commands Only
# 1. Create infrastructure (5 minutes)
az group create --name ai-gpu-aks-rg --location eastus
az aks create --resource-group ai-gpu-aks-rg --name ai-h100-cluster --location eastus --node-count 1 --generate-ssh-keys
az aks nodepool add --resource-group ai-gpu-aks-rg --cluster-name ai-h100-cluster --name gpupool --node-count 1 --node-vm-size Standard_NC40ads_H100_v5
az aks get-credentials --resource-group ai-gpu-aks-rg --name ai-h100-cluster
# 2. Install GPU components (10 minutes)
helm install --wait --create-namespace -n gpu-operator node-feature-discovery node-feature-discovery --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts --set-json master.config.extraLabelNs='["nvidia.com"]'
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator -n gpu-operator nvidia/gpu-operator --set nfd.enabled=false --set driver.enabled=false
# 3. Configure MIG (2 minutes)
az aks nodepool update --cluster-name ai-h100-cluster --resource-group ai-gpu-aks-rg --nodepool-name gpupool --labels "nvidia.com/mig.config"="all-3g.47gb"
# 4. Deploy models (10 minutes)
kubectl create namespace vllm
# Apply the YAML manifests from Phase 4 sections
What You Get
- 2 AI Models running simultaneously on 1 GPU
- 47.5GB memory per model instance
- Hardware isolation between workloads
- 50% cost savings vs separate GPUs
- Persistent DNS names for easy API access:
  - GPT-OSS: gpt-oss.<location>.cloudapp.azure.com:8000
  - Phi-4: phi4.<location>.cloudapp.azure.com:8001
Architecture Overview
High-Level Architecture
+--------------------------------------------------------------+
| Azure Kubernetes Service (AKS)                               |
|                                                              |
|  vLLM Namespace                                              |
|  +----------------------+      +----------------------+      |
|  | GPT-OSS Service      |      | Phi-4 Service        |      |
|  | (Chat/Reasoning)     |      | (Chat/Analysis)      |      |
|  | Port: 8000           |      | Port: 8001           |      |
|  +----------------------+      +----------------------+      |
|                                                              |
|  GPU Operator Namespace                                      |
|  +---------+  +---------------+  +---------------------+     |
|  |   NFD   |  | Device Plugin |  |     MIG Manager     |     |
|  +---------+  +---------------+  +---------------------+     |
|                                                              |
|  H100 GPU Node Pool: NVIDIA H100 NVL (94GB VRAM)             |
|  +----------------------+      +----------------------+      |
|  | MIG Instance 1       |      | MIG Instance 2       |      |
|  | 47.5GB Memory        |      | 47.5GB Memory        |      |
|  | 60 SM Units          |      | 60 SM Units          |      |
|  |                      |      |                      |      |
|  | GPT-OSS Model        |      | Phi-4 Model          |      |
|  | (~20GB Used)         |      | (~14GB Used)         |      |
|  +----------------------+      +----------------------+      |
+--------------------------------------------------------------+
Technology Stack
Component | Technology | Version | Purpose |
---|---|---|---|
Container Orchestration | Azure Kubernetes Service (AKS) | 1.30 | Container orchestration platform |
GPU Hardware | NVIDIA H100 NVL | - | High-performance AI compute |
GPU Virtualization | Multi-Instance GPU (MIG) | 3g.47gb profiles | Hardware-level GPU partitioning |
GPU Management | NVIDIA GPU Operator | Latest | Automated GPU software stack |
Model Serving | vLLM | Latest | High-performance LLM inference |
Models | GPT-OSS 20B, Phi-4 14B | Latest | Reasoning & Chat/Analysis models |
Prerequisites
Azure Resources Required
- Active Azure subscription with sufficient quota
- Resource group in preferred Azure region (with H100 availability)
- Azure CLI installed and configured
- kubectl installed and configured
GPU Quota Requirements
Resource | Quota Needed | Purpose |
---|---|---|
Standard_NC40ads_H100_v5 | 40 vCPUs | H100 GPU instance |
Total Regional vCPUs | 40+ | Node capacity |
Premium Managed Disks | 200GB+ | Storage |
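Before creating the cluster, you can confirm the quota is actually available in your target region with the az CLI; the grep filters below are illustrative and may need adjusting to match the exact quota names shown for your subscription.
# Check current usage and limits for the H100 VM family in your region
az vm list-usage --location <your-region> -o table | grep -i "H100"
# Check total regional vCPU quota as well
az vm list-usage --location <your-region> -o table | grep -i "Total Regional"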
Phase 1: AKS Cluster Creation
Step 1.1: Create Resource Group
# Create resource group in your preferred region
az group create \
--name ai-gpu-aks-rg \
--location <your-region>
Step 1.2: Create AKS Cluster with System Node Pool
# Create AKS cluster with system node pool
az aks create \
--resource-group ai-gpu-aks-rg \
--name ai-h100-cluster \
--location <your-region> \
--node-count 1 \
--node-vm-size Standard_D4s_v5 \
--kubernetes-version 1.30 \
--enable-managed-identity \
--network-plugin azure \
--network-policy azure \
--node-osdisk-type Managed \
--node-osdisk-size 100 \
--generate-ssh-keys
Step 1.3: Add H100 GPU Node Pool
# Add GPU node pool with H100
az aks nodepool add \
--resource-group ai-gpu-aks-rg \
--cluster-name ai-h100-cluster \
--name gpupool \
--node-count 1 \
--node-vm-size Standard_NC40ads_H100_v5 \
--node-osdisk-type Managed \
--node-osdisk-size 200 \
--max-pods 110 \
--kubernetes-version 1.30
Step 1.4: Configure kubectl Access
# Get cluster credentials
az aks get-credentials \
--resource-group ai-gpu-aks-rg \
--name ai-h100-cluster \
--overwrite-existing
# Verify cluster access
kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-gpupool-xxxxxxxx-vmss000001 Ready <none> 5m v1.30.14
aks-nodepool1-xxxxxxxx-vmss000001 Ready <none> 10m v1.30.14
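The model deployments in Phase 4 target the GPU pool through an agentpool: gpupool node selector, so it is worth a quick optional check that the label is present and that the node reports the expected VM size.
# Confirm the GPU node exists and show its VM size via the standard instance-type label
kubectl get nodes -l agentpool=gpupool -L node.kubernetes.io/instance-type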
Phase 2: GPU Operator Installation
The NVIDIA GPU Operator automates the management of all NVIDIA software components needed for GPU workloads.
Step 2.1: Install Node Feature Discovery (NFD)
# Install NFD as prerequisite for GPU Operator
helm install --wait --create-namespace -n gpu-operator \
node-feature-discovery node-feature-discovery \
--repo https://kubernetes-sigs.github.io/node-feature-discovery/charts \
--set-json master.config.extraLabelNs='["nvidia.com"]' \
--set-json worker.tolerations='[
{
"effect": "NoSchedule",
"key": "sku",
"operator": "Equal",
"value": "gpu"
},
{
"effect": "NoSchedule",
"key": "mig",
"value": "notReady",
"operator": "Equal"
}
]'
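Before installing the GPU Operator, it can help to confirm that the NFD master and worker pods are running; the label selector below matches the chart's default labels and may need adjusting if you changed the release name.
# Verify NFD pods are running in the gpu-operator namespace
kubectl get pods -n gpu-operator -l app.kubernetes.io/name=node-feature-discovery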
Step 2.2: Install GPU Operator
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator
helm install --wait gpu-operator -n gpu-operator nvidia/gpu-operator \
--set-json daemonsets.tolerations='[
{
"effect": "NoSchedule",
"key": "sku",
"operator": "Equal",
"value": "gpu"
}
]' \
--set nfd.enabled=false \
--set driver.enabled=false \
--set operator.runtimeClass=nvidia-container-runtime
Step 2.3: Verify GPU Operator Installation
# Check all GPU Operator components
kubectl get pods -n gpu-operator
nvidia-device-plugin-daemonset-xxxxx
nvidia-mig-manager-xxxxx
nvidia-dcgm-exporter-xxxxx
gpu-feature-discovery-xxxxx
nvidia-container-toolkit-daemonset-xxxxx
Phase 3: MIG Configuration
Multi-Instance GPU (MIG) allows partitioning the H100 into multiple isolated GPU instances.
Step 3.1: Discover Available MIG Profiles
# Check available MIG instance profiles on your GPU
kubectl exec -n gpu-operator \
$(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') \
-- nvidia-smi mig -lgip
- H100 NVL: uses 3g.47gb (46.38 GiB per instance)
- A100 80GB: uses 3g.40gb (39.59 GiB per instance)
- H100 SXM: may vary; check with the command above
Use the profile name exactly as shown in your output when configuring MIG.
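If you are unsure which H100 variant your node pool received, the GPU product label added by GPU Feature Discovery (installed in Phase 2) is a quick way to check; the grep pattern below is illustrative.
# Identify the exact GPU model reported by GPU Feature Discovery
kubectl get nodes -l agentpool=gpupool --show-labels | tr ',' '\n' | grep -i 'gpu.product'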
Step 3.2: Enable MIG on GPU Node
# Label the GPU node with the CORRECT MIG configuration for your GPU
# For H100 NVL, use "3g.47gb". For A100, use "3g.40gb"
az aks nodepool update \
--cluster-name ai-h100-cluster \
--resource-group ai-gpu-aks-rg \
--nodepool-name gpupool \
--labels "nvidia.com/mig.config"="all-3g.47gb"
# Configure MIG Manager for reboot (H100 requirement)
kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge \
-p '{"spec": {"migManager": {"env": [{"name": "WITH_REBOOT", "value": "true"}]}}}'
Step 3.3: Verify MIG Configuration
# Check MIG instances are created
kubectl exec -n gpu-operator $(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi mig -lgi
# Verify GPU resources are available
kubectl describe node -l agentpool=gpupool | grep nvidia.com/gpu
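The MIG Manager also reports progress through a node label; a value of success indicates the requested layout was applied (on H100 the node may reboot first). This check assumes the GPU Operator's default labels.
# Check the MIG Manager status label until it reports "success"
kubectl get nodes -l agentpool=gpupool -L nvidia.com/mig.config.state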
Phase 4: Model Deployments
View GPT-OSS Deployment YAML
# gpt-oss-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpt-oss-mig
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpt-oss-mig
  template:
    metadata:
      labels:
        app: gpt-oss-mig
    spec:
      nodeSelector:
        agentpool: gpupool
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "openai/gpt-oss-20b"
            - "--trust-remote-code"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.85"
            # gpt-oss ships quantized (MXFP4/bf16) weights; let vLLM pick the dtype
            - "--dtype"
            - "auto"
            - "--api-key"
            - "token-abc123"
            - "--port"
            - "8000"
            - "--host"
            - "0.0.0.0"
            - "--enable-prefix-caching"
            - "--max-num-seqs"
            - "256"
          ports:
            - containerPort: 8000
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: "spawn"
            - name: OMP_NUM_THREADS
              value: "4"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "30Gi"
              cpu: "6"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
              cpu: "4"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: gpt-oss-service
  namespace: vllm
  annotations:
    service.beta.kubernetes.io/azure-dns-label-name: "gpt-oss"
spec:
  type: LoadBalancer
  selector:
    app: gpt-oss-mig
  ports:
    - port: 8000
      targetPort: 8000
View Phi-4 Deployment YAML
# phi4-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phi4-mig
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phi4-mig
  template:
    metadata:
      labels:
        app: phi4-mig
    spec:
      nodeSelector:
        agentpool: gpupool
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "microsoft/phi-4"
            - "--trust-remote-code"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.85"
            - "--dtype"
            - "auto"
            - "--api-key"
            - "token-abc123"
            - "--port"
            - "8001"
            - "--host"
            - "0.0.0.0"
            - "--enable-prefix-caching"
          ports:
            - containerPort: 8001
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: "spawn"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "30Gi"
              cpu: "6"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
              cpu: "4"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: phi4-service
  namespace: vllm
  annotations:
    service.beta.kubernetes.io/azure-dns-label-name: "phi4"
spec:
  type: LoadBalancer
  selector:
    app: phi4-mig
  ports:
    - port: 8001
      targetPort: 8001
Deploy Models
# Create namespace
kubectl create namespace vllm
# Deploy GPT-OSS
kubectl apply -f gpt-oss-deployment.yaml
# Deploy Phi-4
kubectl apply -f phi4-deployment.yaml
# Check deployment status
kubectl get pods,svc -n vllm
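Model images and weights are large, so the first rollout can take several minutes; waiting on the rollout status avoids testing endpoints before the servers are ready (deployment names assume the manifests above).
# Wait for both deployments to become ready (first start includes the model download)
kubectl rollout status deployment/gpt-oss-mig -n vllm --timeout=20m
kubectl rollout status deployment/phi4-mig -n vllm --timeout=20m
# Confirm each service has been assigned a public IP and DNS label
kubectl get svc -n vllm -o wide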
Test Model Services
# Test GPT-OSS reasoning model (using DNS name)
curl -X POST http://gpt-oss.<location>.cloudapp.azure.com:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer token-abc123" \
-d '{
"model": "microsoft/gpt-oss-20b",
"messages": [{"role": "user", "content": "Explain the benefits of MIG technology for enterprise AI workloads"}],
"max_tokens": 200,
"temperature": 0.7
}' | jq .
# Test Phi-4 small language model (using DNS name)
curl -X POST http://phi4.<location>.cloudapp.azure.com:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer token-abc123" \
-d '{
"model": "microsoft/phi-4-14b",
"messages": [{"role": "user", "content": "Analyze cost savings from GPU virtualization in cloud environments"}],
"max_tokens": 150,
"temperature": 0.5
}' | jq .
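Before sending chat requests, a lightweight smoke test against vLLM's OpenAI-compatible /v1/models endpoint confirms each server is up and shows the exact model IDs being served.
# List the models served by each endpoint (should return the model IDs used above)
curl -s http://gpt-oss.<location>.cloudapp.azure.com:8000/v1/models \
  -H "Authorization: Bearer token-abc123" | jq .
curl -s http://phi4.<location>.cloudapp.azure.com:8001/v1/models \
  -H "Authorization: Bearer token-abc123" | jq .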
Monitoring and Operations
GPU Utilization Monitoring
# Check GPU utilization across MIG instances
kubectl exec -n gpu-operator \
$(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') \
-- nvidia-smi
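The GPU Operator also deploys DCGM Exporter, whose Prometheus metrics can be scraped or inspected directly; the service name below is the operator's default and may differ in customized installs.
# Port-forward the DCGM exporter and inspect utilization and memory metrics
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED'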
Application Monitoring
# Monitor pod resource usage
kubectl top pods -n vllm
# Check application logs
kubectl logs -f deployment/gpt-oss-mig -n vllm
kubectl logs -f deployment/phi4-mig -n vllm
# Monitor service health
kubectl get endpoints -n vllm
Troubleshooting
MIG Instances Not Created
- Node allocatable shows nvidia.com/gpu: 1 instead of 2
- Models share the same GPU without isolation
# Check MIG configuration
kubectl describe node -l agentpool=gpupool | grep mig.config
# Restart MIG manager if needed
kubectl delete pod -l app=nvidia-mig-manager -n gpu-operator
# Verify GPU processes
kubectl exec -n gpu-operator <mig-manager-pod> -- nvidia-smi
Pod Pending with GPU Resource Issues
- Pods stuck in Pending state
- Event: Insufficient nvidia.com/gpu
# Check GPU resource availability
kubectl describe node -l agentpool=gpupool | grep -A10 Allocatable
# Verify device plugin is running
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
Model Loading Failures
- vLLM containers crashing during model load
- OOM (Out of Memory) errors
# Check available memory per MIG instance
kubectl exec -n gpu-operator <mig-manager-pod> -- nvidia-smi mig -lgi
# Adjust gpu-memory-utilization in deployment
# Reduce from 0.85 to 0.7 if needed
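A minimal remediation flow, assuming the manifests from Phase 4: lower --gpu-memory-utilization in the YAML, re-apply, then watch the rollout and logs.
# After lowering --gpu-memory-utilization (e.g. 0.85 -> 0.70) in the manifest:
kubectl apply -f gpt-oss-deployment.yaml
kubectl rollout status deployment/gpt-oss-mig -n vllm
# Watch the new pod load the model
kubectl logs -f deployment/gpt-oss-mig -n vllm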
Cost Analysis
50% Cost Reduction with MIG
Without MIG: 2x H100 GPUs = ~$4,800/month
With MIG: 1x H100 GPU = ~$2,400/month
Monthly Savings: $2,400
Infrastructure Costs (Example: Switzerland North)
Resource | Type | Quantity | Monthly Cost (USD) |
---|---|---|---|
AKS Cluster | Management | 1 | Free |
System Node Pool | Standard_D4s_v5 | 1 | ~$120 |
GPU Node Pool | Standard_NC40ads_H100_v5 | 1 | ~$2,400 |
Managed Disks | Premium SSD | 300GB | ~$60 |
Load Balancer | Standard | 2 | ~$40 |
Total | ~$2,620 |
Security Considerations
Built-in Security Features
- API Authentication: Bearer token required (token-abc123)
- Hardware Isolation: MIG provides GPU-level isolation
- Network Isolation: Kubernetes namespace separation
- Data Residency: All processing stays within your chosen Azure region
- No Data Persistence: Models don't store request/response data
- Transport Encryption: the sample LoadBalancer services expose plain HTTP; terminate TLS at an ingress controller or Azure Application Gateway before exposing the endpoints publicly
Advanced Security (Production Recommended)
Additional Security Measures
- Change API Keys: Replace token-abc123 with secure, randomly generated tokens (see the example after this list)
- Rate Limiting: Add ingress controllers with rate limits
- Input Validation: Implement request/response validation
- Audit Logging: Enable Azure Monitor for comprehensive logging
- Private Endpoints: Use private AKS clusters for sensitive workloads
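As a sketch of the "Change API Keys" item: generate a random token, store it in a Kubernetes Secret, and reference it from the deployment instead of hard-coding token-abc123. The secret and key names below are illustrative; remove the hard-coded --api-key argument when you switch.
# Generate a random API key and store it as a Secret in the vllm namespace
kubectl create secret generic vllm-api-key -n vllm \
  --from-literal=api-key="$(openssl rand -hex 32)"
# Reference it from the deployment via an environment variable (recent vLLM releases read
# VLLM_API_KEY as an alternative to --api-key):
# env:
#   - name: VLLM_API_KEY
#     valueFrom:
#       secretKeyRef:
#         name: vllm-api-key
#         key: api-key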
Best Practices
Resource Management
- Right-sizing: Monitor actual usage and adjust resource requests/limits
- Node Affinity: Use node selectors to ensure GPU workloads run on GPU nodes
- Horizontal Scaling: Plan for multiple replicas with additional GPU nodes
- Vertical Scaling: Adjust MIG profiles based on workload requirements
Operational Excellence
- GitOps: Store all configurations in version control
- CI/CD Integration: Automate deployments with proper testing
- Monitoring: Implement comprehensive monitoring and alerting
- Backup/Recovery: Regular backup of configuration and state
Conclusion
This implementation provides organizations with:
- Cost-Effective AI Infrastructure: 50% cost reduction through GPU sharing
- Production-Ready Platform: Automated management and monitoring
- Scalable Architecture: Easy to extend with additional models/nodes
- Enterprise Security: Comprehensive security and compliance features
- Operational Excellence: Full observability and troubleshooting capabilities
- Persistent DNS Access: Stable FQDNs that survive cluster restarts
The MIG-enabled AKS cluster successfully demonstrates how modern GPU virtualization can optimize AI workload deployments while maintaining strict isolation and performance guarantees.
Next Steps
- Production Readiness: Implement comprehensive monitoring and alerting
- Model Expansion: Add additional AI models as business requires
- Automation: Develop CI/CD pipelines for model deployment
- Optimization: Continuous performance tuning based on usage patterns
- Scaling: Plan for multi-node GPU clusters as demand grows