In this tutorial, we’ll walk through deploying the openai/gpt-oss-20B model on Azure’s Standard_NV36ads_A10_v5 GPU instances using vLLM for fast inference — all running on AKS via KAITO. By the end, you’ll have a public endpoint where you can send API requests in OpenAI’s format.
Special Thanks
Thanks to Andrew Thomas, Kurt Niebuhr, and Sachi Desai for their invaluable support, insightful discussions, and for providing the compute resources that made testing this deployment possible. Your contributions were essential in bringing this project to life.
Introduction
OpenAI recently released gpt-oss, its first open-weight language models since GPT-2, available in 20B and 120B variants. With high-performance GPUs now widely available in the cloud, running advanced AI inference workloads has become easier than ever.
Microsoft Azure’s AKS (Azure Kubernetes Service) paired with KAITO (Kubernetes AI Toolchain Operator) provides a powerful, scalable environment for deploying such models. KAITO simplifies provisioning GPU nodes, managing inference workloads, and integrating AI-optimized runtimes like vLLM.
Step-by-Step Deployment
Before you begin, ensure you have an active Azure subscription with permissions to create resource groups, virtual networks, and AKS clusters. You’ll need to request and be approved for NVIDIA GPU quotas in your target region — specifically for Standard_NVads_A10_v5 or similar GPU SKUs. Basic familiarity with Kubernetes (kubectl) and Azure CLI will help you follow the steps smoothly.
Please upgrade to Azure CLI v2.76.0 or above.
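Before provisioning anything, it can help to confirm your CLI version and check that your subscription actually has A10 quota in the target region. A minimal check; the grep filter is illustrative and the exact quota family name varies by SKU:
# Confirm the Azure CLI version is 2.76.0 or later
az version

# Check vCPU quota/usage in the target region and look for the A10 family
az vm list-usage --location swedencentral --output table | grep -i "a10"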
1. Set Up Environment Variables
We’ll define reusable variables for resource naming and region.
export RANDOM_ID="33000"
export REGION="swedencentral"
export AZURE_RESOURCE_GROUP="myKaitoResourceGroup$RANDOM_ID"
export CLUSTER_NAME="myClusterName$RANDOM_ID"
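The hard-coded RANDOM_ID works fine for a one-off walkthrough; if you rerun the tutorial, you may want a fresh suffix so resource names don't collide. One optional way to generate it:
# Generate a short random suffix instead of a fixed one
export RANDOM_ID="$(openssl rand -hex 3)"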
2. Create Resource Group
az group create \
--name $AZURE_RESOURCE_GROUP \
--location $REGION
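Optionally confirm the group is ready before moving on (this queries the standard az group show output and should print "Succeeded"):
az group show \
  --name $AZURE_RESOURCE_GROUP \
  --query properties.provisioningState \
  --output tsv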
3. Create AKS Cluster with AI Toolchain Operator
az aks create \
--location $REGION \
--resource-group $AZURE_RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count 1 \
--enable-ai-toolchain-operator \
--enable-oidc-issuer \
--generate-ssh-keys
4. Connect kubectl to the Cluster
az aks get-credentials \
--resource-group ${AZURE_RESOURCE_GROUP} \
--name ${CLUSTER_NAME}
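At this point you can sanity-check both the cluster connection and the managed KAITO add-on. The exact operator pod names depend on the KAITO version, so treat the grep below as a rough filter:
# Confirm the cluster is reachable
kubectl get nodes

# The AI toolchain operator components run in kube-system; names vary by version
kubectl get pods -n kube-system | grep -i kaito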
5. Create KAITO Workspace for GPT-OSS with vLLM
Save the following to workspace-gptoss.yaml:
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-gpt-oss-vllm-nv-a10
resource:
  instanceType: "Standard_NV36ads_A10_v5"
  count: 1
  labelSelector:
    matchLabels:
      app: gpt-oss-20b-vllm
inference:
  template:
    spec:
      containers:
        - name: vllm-openai
          image: vllm/vllm-openai:gptoss
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - openai/gpt-oss-20b
            - --swap-space
            - "4"
            - --gpu-memory-utilization
            - "0.85"
            - --port
            - "5000"
          ports:
            - name: http
              containerPort: 5000
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "36"
              memory: "440Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "18"
              memory: "220Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 600
            periodSeconds: 20
          env: # special configs for the A10 GPU
            - name: VLLM_ATTENTION_BACKEND
              value: "TRITON_ATTN_VLLM_V1"
            - name: VLLM_DISABLE_SINKS
              value: "1"
Apply it:
kubectl apply -f workspace-gptoss.yaml
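Provisioning the GPU node and pulling the vLLM image can take several minutes. You can follow progress like this; the status columns shown by kubectl get workspace depend on your KAITO version:
# Watch the workspace until it reports the resource and inference as ready
kubectl get workspace workspace-gpt-oss-vllm-nv-a10 -w

# Once the node is up, inspect the inference deployment and its logs
kubectl get deployment workspace-gpt-oss-vllm-nv-a10
kubectl logs deployment/workspace-gpt-oss-vllm-nv-a10 -f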
6. Expose the Service Publicly
Expose via LoadBalancer
kubectl expose deployment workspace-gpt-oss-vllm-nv-a10 \
--type=LoadBalancer \
--name=workspace-gpt-oss-vllm-nv-a10-pub
Check the IP:
kubectl get svc workspace-gpt-oss-vllm-nv-a10-pub
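Once the EXTERNAL-IP field is populated, you can capture it into a variable instead of copying it by hand:
export CLUSTERIP=$(kubectl get svc workspace-gpt-oss-vllm-nv-a10-pub \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $CLUSTERIP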
7. Test the Endpoint
Once the LoadBalancer IP is ready:
export CLUSTERIP=<IP>
kubectl run -it --rm --restart=Never curl \
  --image=curlimages/curl -- \
  curl -X POST http://$CLUSTERIP:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "What is Kubernetes?"}],
        "max_tokens": 50,
        "temperature": 0
      }'
Using the OpenAI Python SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://<ip>:5000/v1/",
    api_key="EMPTY"
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)

print(result.choices[0].message)
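Beyond chat completions, the vLLM server exposes other OpenAI-compatible routes that are handy for sanity checks and streaming (same <ip> and port as above):
# List the models the server is serving
curl http://<ip>:5000/v1/models

# Stream tokens as they are generated
curl -N -X POST http://<ip>:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Write a haiku about GPUs."}], "stream": true}'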
8. Load Testing the GPT-OSS Endpoint
To evaluate real-world performance, we used the llm-load-test-azure tool, which is designed for stress-testing Azure-hosted LLM endpoints.
- Input Tokens per Request: 250
- Output Tokens per Request: ~1,500 (target)
- Test Duration: ~10 minutes
- Concurrency: 10
- Model: openai/gpt-oss-20b
- GPU: 1× NVIDIA A10 (Standard_NV36ads_A10_v5)
Metric | Value | Description |
---|---|---|
TT_ACK | 0.60 s | Time to acknowledge the request |
TTFT | 0.77 s | Time to first token |
ITL | 0.038 s | Inter-token latency |
TPOT | 0.039 s | Time per output token |
Avg. Response Time | 57.77 s | Total latency for a single request |
Output Tokens Throughput | 257.77 tokens/sec | Average output tokens per second |
Performance Comparison - Concurrency 10 vs 5:
There is always a trade-off between latency and throughput. By reducing the target concurrency from 10 to 5, we achieve a lower response time per request at the cost of total throughput. The choice between the two depends on whether lower latency per request or higher overall throughput is the priority for your workload.
Metric | Concurrency 10 | Concurrency 5 | Observation |
---|---|---|---|
Average Response Time | 57.77 s | 45.80 s | Lower concurrency reduces per-request latency by ~21% |
Output Tokens per Second | 257.77 | 162.75 | Higher concurrency yields higher total throughput |
Completed Requests/sec | 0.173 | 0.109 | Concurrency 10 processes ~58% more requests per second |
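The figures above come from llm-load-test-azure. If you just want a rough feel for this trade-off on your own endpoint without setting up the tool, a crude sketch with curl and xargs works (request count, concurrency, prompt, and max_tokens below are all illustrative, and this only measures whole-request latency):
# Fire 20 requests, 10 at a time, and print per-request timings
seq 1 20 | xargs -P 10 -I{} curl -s -o /dev/null \
  -w "request {}: first byte %{time_starttransfer}s, total %{time_total}s\n" \
  -X POST http://$CLUSTERIP:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Summarize Kubernetes in one paragraph."}], "max_tokens": 256}'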
Conclusion
With these steps, we’ve successfully:
- Provisioned an Azure AKS cluster with NVIDIA A10 GPUs
- Installed KAITO to manage AI workloads
- Deployed the openai/gpt-oss-20B model with vLLM for fast inference
- Exposed it to the internet using a LoadBalancer
- Sent an OpenAI-style API request to our own hosted model
This setup can be scaled by increasing GPU count, tuning vLLM parameters, or integrating with your existing application pipelines.
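For example, bumping the workspace to two GPU nodes is a one-line change to the manifest; whether a given KAITO version reconciles an in-place count change or requires recreating the workspace is worth checking in its documentation:
# Option 1: edit workspace-gptoss.yaml, set resource.count: 2, then re-apply
kubectl apply -f workspace-gptoss.yaml

# Option 2: patch the live workspace object
kubectl patch workspace workspace-gpt-oss-vllm-nv-a10 \
  --type merge -p '{"resource": {"count": 2}}'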