In this tutorial, we’ll walk through deploying the openai/gpt-oss-20B model on Azure’s Standard_NV36ads_A10_v5 GPU instances using vLLM for fast inference — all running on AKS via KAITO. By the end, you’ll have a public endpoint where you can send API requests in OpenAI’s format.
Special Thanks
Thanks to Andrew Thomas, Kurt Niebuhr, and Sachi Desai for their invaluable support, insightful discussions, and for providing the compute resources that made testing this deployment possible. Your contributions were essential in bringing this project to life.
Introduction
OpenAI recently released gpt-oss, its first open-weight language models since GPT-2, available in 20B and 120B variants. With high-performance GPUs now widely available in the cloud, running advanced AI inference workloads has become easier than ever.
Microsoft Azure’s AKS (Azure Kubernetes Service) paired with KAITO (Kubernetes AI Toolchain Operator) provides a powerful, scalable environment for deploying such models. KAITO simplifies provisioning GPU nodes, managing inference workloads, and integrating AI-optimized runtimes like vLLM.
Step-by-Step Deployment
Before you begin, ensure you have an active Azure subscription with permissions to create resource groups, virtual networks, and AKS clusters. You’ll need to request and be approved for NVIDIA GPU quotas in your target region — specifically for Standard_NVads_A10_v5 or similar GPU SKUs. Basic familiarity with Kubernetes (kubectl) and Azure CLI will help you follow the steps smoothly.
Please upgrade to Azure CLI v2.76.0 or above.
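Before provisioning anything, it can help to confirm your CLI version and check that your subscription actually has A10 quota in the target region. A minimal check; the grep filter is illustrative and the exact quota family name varies by SKU:
# Confirm the Azure CLI version is 2.76.0 or later
az version

# Check vCPU quota/usage in the target region and look for the A10 family
az vm list-usage --location swedencentral --output table | grep -i "a10"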
1. Set Up Environment Variables
We’ll define reusable variables for resource naming and region.
export RANDOM_ID="33000"
export REGION="swedencentral"
export AZURE_RESOURCE_GROUP="myKaitoResourceGroup$RANDOM_ID"
export CLUSTER_NAME="myClusterName$RANDOM_ID"
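The hard-coded RANDOM_ID works fine for a one-off walkthrough; if you rerun the tutorial, you may want a fresh suffix so resource names don't collide. One optional way to generate it:
# Generate a short random suffix instead of a fixed one
export RANDOM_ID="$(openssl rand -hex 3)"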
2. Create Resource Group
az group create \
--name $AZURE_RESOURCE_GROUP \
--location $REGION
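Optionally confirm the group is ready before moving on (this queries the standard az group show output and should print "Succeeded"):
az group show \
  --name $AZURE_RESOURCE_GROUP \
  --query properties.provisioningState \
  --output tsv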
3. Create AKS Cluster with AI Toolchain Operator
az aks create \
--location $REGION \
--resource-group $AZURE_RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count 1 \
--enable-ai-toolchain-operator \
--enable-oidc-issuer \
--generate-ssh-keys
4. Connect kubectl to the Cluster
az aks get-credentials \
--resource-group ${AZURE_RESOURCE_GROUP} \
--name ${CLUSTER_NAME}
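At this point you can sanity-check both the cluster connection and the managed KAITO add-on. The exact operator pod names depend on the KAITO version, so treat the grep below as a rough filter:
# Confirm the cluster is reachable
kubectl get nodes

# The AI toolchain operator components run in kube-system; names vary by version
kubectl get pods -n kube-system | grep -i kaito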
5. Create KAITO Workspace for GPT-OSS with vLLM
Save the following to workspace-gptoss.yaml:
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-gpt-oss-vllm-nv-a10
resource:
  instanceType: "Standard_NV36ads_A10_v5"
  count: 1
  labelSelector:
    matchLabels:
      app: gpt-oss-20b-vllm
inference:
  template:
    spec:
      containers:
        - name: vllm-openai
          image: vllm/vllm-openai:gptoss
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - openai/gpt-oss-20b
            - --swap-space
            - "4"
            - --gpu-memory-utilization
            - "0.85"
            - --port
            - "5000"
          ports:
            - name: http
              containerPort: 5000
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "36"
              memory: "440Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "18"
              memory: "220Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 600
            periodSeconds: 20
          env: # special configs for the A10 GPU
            - name: VLLM_ATTENTION_BACKEND
              value: "TRITON_ATTN_VLLM_V1"
            - name: VLLM_DISABLE_SINKS
              value: "1"
Apply it:
kubectl apply -f workspace-gptoss.yaml
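Provisioning the GPU node and pulling the vLLM image can take several minutes. You can follow progress like this; the status columns shown by kubectl get workspace depend on your KAITO version:
# Watch the workspace until it reports the resource and inference as ready
kubectl get workspace workspace-gpt-oss-vllm-nv-a10 -w

# Once the node is up, inspect the inference deployment and its logs
kubectl get deployment workspace-gpt-oss-vllm-nv-a10
kubectl logs deployment/workspace-gpt-oss-vllm-nv-a10 -f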
6. Expose the Service Publicly
Expose via LoadBalancer
kubectl expose deployment workspace-gpt-oss-vllm-nv-a10 \
--type=LoadBalancer \
--name=workspace-gpt-oss-vllm-nv-a10-pub
Check the IP:
kubectl get svc workspace-gpt-oss-vllm-nv-a10-pub
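Once the EXTERNAL-IP field is populated, you can capture it into a variable instead of copying it by hand:
export CLUSTERIP=$(kubectl get svc workspace-gpt-oss-vllm-nv-a10-pub \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $CLUSTERIP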
7. Test the Endpoint
Once the LoadBalancer IP is ready:
export CLUSTERIP=<IP>
kubectl run -it --rm --restart=Never curl \
  --image=curlimages/curl -- \
  curl -X POST http://$CLUSTERIP:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "What is Kubernetes?"}],
        "max_tokens": 50,
        "temperature": 0
      }'
Using the OpenAI Python SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://<ip>:5000/v1/",
    api_key="EMPTY"
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)

print(result.choices[0].message)
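Beyond chat completions, the vLLM server exposes other OpenAI-compatible routes that are handy for sanity checks and streaming (same <ip> and port as above):
# List the models the server is serving
curl http://<ip>:5000/v1/models

# Stream tokens as they are generated
curl -N -X POST http://<ip>:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Write a haiku about GPUs."}], "stream": true}'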
8. Load Testing the GPT-OSS Endpoint
To evaluate real-world performance, we used the llm-load-test-azure tool, which is designed for stress-testing Azure-hosted LLM endpoints.
- Input Tokens per Request: 250
- Output Tokens per Request: ~1,500 (target)
- Test Duration: ~10 minutes
- Concurrency: 10
- Model: openai/gpt-oss-20b
- GPU: 1× NVIDIA A10 (Standard_NV36ads_A10_v5)
Metric | Value | Description |
---|---|---|
TT_ACK | 0.60 s | Time to acknowledge the request |
TTFT | 0.77 s | Time to first token |
ITL | 0.038 s | Inter-token latency |
TPOT | 0.039 s | Time per output token |
Avg. Response Time | 57.77 s | Total latency for a single request |
Output Tokens Throughput | 257.77 tokens/sec | Average output tokens per second |
Performance Comparison - Concurrency 10 vs 5:
There is always a trade-off between latency and throughput. By reducing the target concurrency from 10 to 5, we achieve a lower response time per request at the cost of total throughput. The choice between the two depends on whether lower latency per request or higher overall throughput is the priority for your workload.
Metric | Concurrency 10 | Concurrency 5 | Observation |
---|---|---|---|
Average Response Time | 57.77 s | 45.80 s | Lower concurrency reduces per-request latency by ~21% |
Output Tokens per Second | 257.77 | 162.75 | Higher concurrency yields higher total throughput |
Completed Requests/sec | 0.173 | 0.109 | Concurrency 10 processes ~58% more requests per second |
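The figures above come from llm-load-test-azure. If you just want a rough feel for this trade-off on your own endpoint without setting up the tool, a crude sketch with curl and xargs works (request count, concurrency, prompt, and max_tokens below are all illustrative, and this only measures whole-request latency):
# Fire 20 requests, 10 at a time, and print per-request timings
seq 1 20 | xargs -P 10 -I{} curl -s -o /dev/null \
  -w "request {}: first byte %{time_starttransfer}s, total %{time_total}s\n" \
  -X POST http://$CLUSTERIP:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Summarize Kubernetes in one paragraph."}], "max_tokens": 256}'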
Conclusion
With these steps, we’ve successfully:
- Provisioned an Azure AKS cluster with NVIDIA A10 GPUs
- Installed KAITO to manage AI workloads
- Deployed the openai/gpt-oss-20B model with vLLM for fast inference
- Exposed it to the internet using a LoadBalancer
- Sent an OpenAI-style API request to our own hosted model
This setup can be scaled by increasing GPU count, tuning vLLM parameters, or integrating with your existing application pipelines.
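For example, bumping the workspace to two GPU nodes is a one-line change to the manifest; whether a given KAITO version reconciles an in-place count change or requires recreating the workspace is worth checking in its documentation:
# Option 1: edit workspace-gptoss.yaml, set resource.count: 2, then re-apply
kubectl apply -f workspace-gptoss.yaml

# Option 2: patch the live workspace object
kubectl patch workspace workspace-gpt-oss-vllm-nv-a10 \
  --type merge -p '{"resource": {"count": 2}}'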