
Azure AI Foundry Blog
4 MIN READ

Deploying OpenAI’s First Open-Source Model on Azure AKS with KAITO

maljazaery
Aug 15, 2025

In this tutorial, we’ll walk through deploying the openai/gpt-oss-20b model on Azure’s Standard_NV36ads_A10_v5 GPU instances using vLLM for fast inference — all running on AKS via KAITO. By the end, you’ll have a public endpoint where you can send API requests in OpenAI’s format.

Special Thanks
Thanks to Andrew Thomas, Kurt Niebuhr, and Sachi Desai for their invaluable support, insightful discussions, and for providing the compute resources that made testing this deployment possible. Your contributions were essential in bringing this project to life.

Introduction

OpenAI recently released GPT-OSS, its first open-weight large language model family since GPT-2. With high-performance GPUs now widely available in the cloud, running advanced AI inference workloads has become easier than ever.

Microsoft Azure’s AKS (Azure Kubernetes Service) paired with KAITO (Kubernetes AI Toolchain Operator) provides a powerful, scalable environment for deploying such models. KAITO simplifies provisioning GPU nodes, managing inference workloads, and integrating AI-optimized runtimes like vLLM.

Step-by-Step Deployment

Before you begin, ensure you have an active Azure subscription with permissions to create resource groups, virtual networks, and AKS clusters. You’ll need to request and be approved for NVIDIA GPU quotas in your target region — specifically for Standard_NVads_A10_v5 or similar GPU SKUs. Basic familiarity with Kubernetes (kubectl) and Azure CLI will help you follow the steps smoothly. 

Please upgrade to Azure CLI v2.76.0 or above.
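If you're unsure which version you have, a quick `sort -V` comparison works. This is just a sketch: the `current` value below is a placeholder for what `az version` reports on your machine.

```shell
# Compare the installed Azure CLI version against the required minimum.
# "2.75.0" is a placeholder; get your real version with:
#   az version --query '"azure-cli"' -o tsv
required="2.76.0"
current="2.75.0"
oldest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)
if [ "$oldest" != "$required" ]; then
  echo "Azure CLI $current is older than $required; run: az upgrade"
fi
```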

1. Set Up Environment Variables

We’ll define reusable variables for resource naming and region.

export RANDOM_ID="33000"
export REGION="swedencentral"
export AZURE_RESOURCE_GROUP="myKaitoResourceGroup$RANDOM_ID"
export CLUSTER_NAME="myClusterName$RANDOM_ID"
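The fixed suffix above works fine for a single run, but if you repeat the tutorial you may prefer a unique suffix so resource names don't collide. A small sketch, assuming a Linux/macOS shell with /dev/urandom:

```shell
# Optional: derive a 6-character hex suffix instead of the fixed "33000";
# this overrides the RANDOM_ID exported above.
export RANDOM_ID="$(head -c 3 /dev/urandom | od -An -tx1 | tr -d ' \n')"
export AZURE_RESOURCE_GROUP="myKaitoResourceGroup$RANDOM_ID"
export CLUSTER_NAME="myClusterName$RANDOM_ID"
echo "Resource group: $AZURE_RESOURCE_GROUP"
```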

2. Create Resource Group

az group create \
    --name $AZURE_RESOURCE_GROUP \
    --location $REGION

3. Create AKS Cluster with AI Toolchain Operator

az aks create \
    --location $REGION \
    --resource-group $AZURE_RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --node-count 1 \
    --enable-ai-toolchain-operator \
    --enable-oidc-issuer \
    --generate-ssh-keys

4. Connect kubectl to the Cluster

az aks get-credentials \
    --resource-group ${AZURE_RESOURCE_GROUP} \
    --name ${CLUSTER_NAME}

5. Create KAITO Workspace for GPT-OSS with vLLM

Save the following to workspace-gptoss.yaml: 

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-gpt-oss-vllm-nv-a10
resource:
  instanceType: "Standard_NV36ads_A10_v5"
  count: 1
  labelSelector:
    matchLabels:
      app: gpt-oss-20b-vllm
inference:
  template:
    spec:
      containers:
        - name: vllm-openai
          image: vllm/vllm-openai:gptoss
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - openai/gpt-oss-20b
            - --swap-space
            - "4"
            - --gpu-memory-utilization
            - "0.85"
            - --port
            - "5000"
          ports:
            - name: http
              containerPort: 5000
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "36"
              memory: "440Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "18"
              memory: "220Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 600
            periodSeconds: 20
          env:  # Special configuration for the A10 GPU
            - name: VLLM_ATTENTION_BACKEND
              value: "TRITON_ATTN_VLLM_V1"
            - name: VLLM_DISABLE_SINKS
              value: "1"

Apply it:

kubectl apply -f workspace-gptoss.yaml

6. Expose the Service Publicly

Expose the deployment via a LoadBalancer service:

kubectl expose deployment workspace-gpt-oss-vllm-nv-a10 \
    --type=LoadBalancer \
    --name=workspace-gpt-oss-vllm-nv-a10-pub

Check the IP:

kubectl get svc workspace-gpt-oss-vllm-nv-a10-pub

7. Test the Endpoint

Once the LoadBalancer IP is ready:

export SERVICE_IP=<IP>


kubectl run -it --rm --restart=Never curl \
  --image=curlimages/curl -- \
  curl -X POST http://$SERVICE_IP:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 50,
    "temperature": 0
  }'

Note the :5000 in the URL — the service forwards to the container port defined in the workspace.
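The reply comes back in OpenAI's chat-completions format, with the answer at choices[0].message.content. Here is a local sketch of extracting it; the JSON below is a trimmed, hypothetical response, and python3 stands in for jq:

```shell
# Hypothetical, trimmed chat-completions response body; a real one
# comes back from the curl call above.
response='{"choices":[{"message":{"role":"assistant","content":"Kubernetes is a container orchestrator."}}]}'
# Pull out choices[0].message.content.
answer=$(printf '%s' "$response" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')
echo "$answer"
```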

Using the OpenAI Python SDK:

from openai import OpenAI
 
client = OpenAI(
    base_url="http://<ip>:5000/v1/",
    api_key="EMPTY"
)
 
result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)
 
print(result.choices[0].message)

8. Load Testing the GPT-OSS Endpoint

To evaluate real-world performance, we used the llm-load-test-azure tool, which is designed for stress-testing Azure-hosted LLM endpoints.

  • Input Tokens per Request: 250
  • Output Tokens per Request: ~1,500 (target)
  • Test Duration: ~10 minutes 
  • Concurrency: 10
  • Model: 20B
  • GPU: A10
| Metric | Value | Description |
|---|---|---|
| TT_ACK | 0.60 s | Time to acknowledge the request |
| TTFT | 0.77 s | Time to first token |
| ITL | 0.038 s | Inter-token latency |
| TPOT | 0.039 s | Time per output token |
| Avg. Response Time | 57.77 s | Total latency for a single request |
| Output Tokens Throughput | 257.77 tokens/sec | Average output tokens per second |
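As a sanity check, these numbers hang together: aggregate throughput should be roughly concurrency / TPOT, and a ~1,500-token response should take roughly 1500 × TPOT seconds:

```shell
# Back-of-the-envelope check of the load-test numbers above.
throughput=$(awk 'BEGIN { printf "%.1f", 10 / 0.039 }')     # concurrency / TPOT
per_request=$(awk 'BEGIN { printf "%.1f", 1500 * 0.039 }')  # output tokens * TPOT
echo "estimated throughput:  $throughput tok/s (reported: 257.77)"
echo "estimated per-request: $per_request s (reported: 57.77)"
```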
Performance Comparison - Concurrency 10 vs 5:

There is always a trade-off between latency and throughput. By reducing the target concurrency from 10 to 5, we achieve a lower response time per request. The choice between the two depends on whether lower latency per request or higher overall throughput is the priority for your workload.

| Metric | Concurrency 10 | Concurrency 5 | Observation |
|---|---|---|---|
| Average Response Time | 57.77 s | 45.80 s | Lower concurrency reduces per-request latency by ~21% |
| Output Tokens per Second | 257.77 | 162.75 | Higher concurrency yields higher total throughput |
| Completed Requests/sec | 0.173 | 0.109 | Concurrency 10 processes ~58% more requests per second |
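The percentages in the Observation column follow directly from the raw values:

```shell
# Derive the table's observations from its raw numbers.
latency_drop=$(awk 'BEGIN { printf "%.1f", (1 - 45.80 / 57.77) * 100 }')
rps_gain=$(awk 'BEGIN { printf "%.1f", (0.173 / 0.109 - 1) * 100 }')
echo "latency reduction at concurrency 5: ~${latency_drop}%"
echo "req/s gain at concurrency 10: ~${rps_gain}%"
```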

Conclusion

With these steps, we’ve successfully:

  • Provisioned an Azure AKS cluster with NVIDIA A10 GPUs
  • Installed KAITO to manage AI workloads
  • Deployed the openai/gpt-oss-20b model with vLLM for fast inference
  • Exposed it to the internet using a LoadBalancer
  • Sent an OpenAI-style API request to our own hosted model

This setup can be scaled by increasing GPU count, tuning vLLM parameters, or integrating with your existing application pipelines.

Updated Aug 15, 2025
Version 5.0