As AI workloads grow in complexity and scale, deploying large language models like GPT-OSS efficiently becomes critical. In this post, we’ll walk through how to deploy GPT-OSS using an Azure Machine Learning (Azure ML) online endpoint on managed compute (NV-A10 and NC-H100 GPUs), using a streamlined, script-driven approach.
❓Why Azure ML Online Endpoints?
Azure ML online endpoints provide a fully managed, scalable, and secure way to serve models like GPT-OSS. They support production-grade features like blue-green deployments for safe rollouts and traffic mirroring to test new versions without impacting live traffic. With built-in autoscaling, authentication, monitoring, and seamless REST API integration, they’re ideal for deploying large models on managed compute with minimal operational overhead.
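For example, once a second deployment is live on an endpoint, a blue-green rollout can be driven entirely from the CLI; the deployment names below are illustrative:
# Shift 10% of live traffic to the new "green" deployment
az ml online-endpoint update --name gptoss-endpoint-h100 --traffic "blue=90 green=10"
# Mirror a copy of live traffic to "green" without returning its responses to callers
az ml online-endpoint update --name gptoss-endpoint-h100 --mirror-traffic "green=10"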
🧰 What You’ll Need
Before diving in, make sure you have the following:
- Azure CLI installed and authenticated
- An Azure ML workspace set up
- Contributor or Owner permissions on your Azure subscription
- GPU quota for the target VM SKU (you can check this in Azure ML studio)
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# Login to Azure
az login
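The az ml commands used throughout this post come from the Azure ML CLI extension; installing it and setting defaults up front keeps the later commands shorter:
# Install the Azure ML CLI extension
az extension add --name ml
# Optional: set defaults so --resource-group/--workspace-name can be omitted later
az configure --defaults group=your-resource-group workspace=your-workspace-name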
📁 Clone the code
git clone https://github.com/maljazaery/AzureML_LLM_Endpoint_Deployment_Script.git
The repo includes one environment example and two deployment configs:
- AML_env/gpt_oss/: Dockerfile and environment setup for GPT-OSS
- configs/gpt_oss/config_a10.conf: Sample config for NV-A10 GPU
- configs/gpt_oss/config_h100.conf: Sample config for NC-H100 GPU
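To build your own config in the next step, a convenient starting point is to copy the sample that matches your GPU and edit it in place, for example:
cd AzureML_LLM_Endpoint_Deployment_Script
cp configs/gpt_oss/config_h100.conf config.conf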
⚙️ Configuration
Create a config.conf file tailored to your Azure environment (see the configs folder for examples):
# Azure subscription & workspace settings
AZ_SUBSCRIPTION_ID="your-subscription-id"
AZ_RESOURCE_GROUP="your-resource-group"
AZ_ML_WORKSPACE="your-workspace-name"
# Endpoint and deployment settings
AZ_ENDPOINT_NAME="gptoss-endpoint-h100"
AZ_INSTANCE_TYPE="Standard_NC40ads_H100_v5"
# ... other settings
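Before deploying, it can be worth sourcing the config and confirming the subscription and workspace resolve correctly (a quick sanity check, separate from the repo's scripts):
source config.conf
az account set --subscription "$AZ_SUBSCRIPTION_ID"
az ml workspace show --name "$AZ_ML_WORKSPACE" --resource-group "$AZ_RESOURCE_GROUP" --query name -o tsv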
🚦 Deployment Options
Option 1: Fully Automated Deployment
chmod +x deploy-main.sh
./deploy-main.sh config.conf
Option 2: Step-by-Step Deployment
# Create environment only
./deploy-main.sh --env-only config.conf
# Create endpoint and deployment
./deploy-main.sh --endpoint-only config.conf
🧪 Testing the Endpoint
Using curl
az ml online-endpoint get-credentials --name your-endpoint-name
curl -X POST "https://your-endpoint.region.inference.ml.azure.com" \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
      }'
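The scoring URI and key can also be pulled straight from the CLI instead of pasting them by hand; the --query fields below follow the CLI's JSON output:
SCORING_URI=$(az ml online-endpoint show --name your-endpoint-name --query scoring_uri -o tsv)
KEY=$(az ml online-endpoint get-credentials --name your-endpoint-name --query primaryKey -o tsv)
curl -X POST "$SCORING_URI" \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}'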
Using OpenAI SDK
from openai import OpenAI

# Endpoint URI from the deployment; key from `az ml online-endpoint get-credentials`
client = OpenAI(
    base_url="https://your-endpoint.region.inference.ml.azure.com",
    api_key="your-key",
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)

print(result.choices[0].message)
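If the serving container exposes the OpenAI-compatible streaming API (as vLLM-style servers typically do), responses can also be streamed token by token:
# Stream tokens as they are generated (assumes the server supports stream=True)
stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)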
📊 Monitoring & Management
# List endpoints
az ml online-endpoint list
# Show endpoint details
az ml online-endpoint show --name your-endpoint-name
# View deployment logs
az ml online-deployment get-logs --name current --endpoint-name your-endpoint-name --lines 100
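Scaling and cleanup can be handled from the same CLI; the --set syntax for instance_count follows Azure ML's safe-rollout examples, and deleting the endpoint stops GPU billing:
# Scale the live deployment out to two instances
az ml online-deployment update --name current --endpoint-name your-endpoint-name --set instance_count=2
# Tear everything down when finished
az ml online-endpoint delete --name your-endpoint-name --yes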
📈 Load Testing
We benchmarked gpt-oss-20b on a single H100 GPU at concurrency levels of 25, 50, and 75. Input and output token counts were held constant across all runs.
Settings:
- Input Tokens per Request: 250
- Output Tokens per Request: ~1,500 (target)
- Test Duration: ~10 minutes
- Model: gpt-oss-20b
- Concurrency: 25, 50, and 75
- VM: 1x NC-H100 (Standard_NC40ads_H100_v5)
Results:
- Latency increased with higher concurrency: ~13.7 ms → 16.7 ms → 22.0 ms.
- Throughput (output tokens/sec) scaled significantly with concurrency: ~2.6k → 4.1k → 4.6k tokens/sec.
- Request throughput also improved: ~1.8 → 2.8 → 3.1 requests/sec.
Conclusion: The H100 scales throughput efficiently as concurrency grows, but with a trade-off of gradually increasing response latency. At higher concurrency (50 → 75), throughput gains begin to plateau, indicating near-saturation of GPU capacity.
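For reference, a benchmark along these lines can be reproduced with a small async harness; the sketch below fires one wave of concurrent chat requests with the OpenAI SDK and reports aggregate output-token throughput (a minimal illustration, not the exact harness used for the numbers above):
import asyncio
import time

from openai import AsyncOpenAI

ENDPOINT = "https://your-endpoint.region.inference.ml.azure.com"  # scoring URI
API_KEY = "your-key"
CONCURRENCY = 25            # 25 / 50 / 75 in the runs above
PROMPT = "word " * 250      # roughly 250 input tokens
MAX_TOKENS = 1500           # ~1,500 output tokens per request

client = AsyncOpenAI(base_url=ENDPOINT, api_key=API_KEY)

async def one_request() -> int:
    # Send a single chat completion and return how many output tokens it produced
    resp = await client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens

async def main():
    start = time.time()
    tokens = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
    elapsed = time.time() - start
    print(f"{sum(tokens)} output tokens in {elapsed:.1f}s -> {sum(tokens) / elapsed:.0f} tokens/sec")

asyncio.run(main())
A sustained test would keep this concurrency level going for the full ~10 minutes rather than sending a single wave.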