Azure AI Foundry Blog

Deploying GPT-OSS as Azure ML Online Endpoint

maljazaery
Sep 10, 2025
As AI workloads grow in complexity and scale, deploying large language models like GPT-OSS efficiently becomes critical. In this post, we'll walk through how to deploy GPT-OSS using an Azure Machine Learning (Azure ML) Online Endpoint on managed compute (NV-A10 and NC-H100), leveraging a streamlined, script-driven approach.

❓Why Azure ML Online Endpoints?


Azure ML online endpoints provide a fully managed, scalable, and secure way to serve models like GPT-OSS. They support production-grade features like blue-green deployments for safe rollouts and traffic mirroring to test new versions without impacting live traffic. With built-in autoscaling, authentication, monitoring, and seamless REST API integration, they’re ideal for deploying large models on managed compute with minimal operational overhead.
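
For example, a blue-green rollout can shift, or mirror, a slice of live traffic to a new deployment with a single CLI call. A minimal sketch, assuming two deployments named blue and green already exist on the endpoint:

# Shift 10% of live traffic to the new (green) deployment
az ml online-endpoint update --name gptoss-endpoint-h100 --traffic "blue=90 green=10"

# Mirror 10% of live traffic to green for testing, without affecting responses
az ml online-endpoint update --name gptoss-endpoint-h100 --mirror-traffic "green=10"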

🧰 What You’ll Need

Before diving in, make sure you have the following:

  • Azure CLI installed and authenticated
  • An Azure ML workspace set up
  • Contributor or Owner permissions on your Azure subscription
  • Sufficient GPU quota in your Azure ML workspace for the target VM family (NV-A10 or NC-H100)
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# Login to Azure
az login
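
The az ml commands used later in this post come from the Azure ML CLI v2 extension; if it's not installed yet, add it and optionally set workspace defaults:

# Install the Azure ML CLI v2 extension
az extension add -n ml

# Optional: set defaults so later commands can omit --resource-group/--workspace-name
az configure --defaults group=your-resource-group workspace=your-workspace-name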

📁 Clone the code

git clone https://github.com/maljazaery/AzureML_LLM_Endpoint_Deployment_Script.git

The repo includes an environment example and two deployment configs:

  • AML_env/gpt_oss/: Dockerfile and environment setup for GPT-OSS
  • configs/gpt_oss/config_a10.conf: Sample config for NV-A10 GPU
  • configs/gpt_oss/config_h100.conf: Sample config for NC-H100 GPU

⚙️ Configuration

Create a config.conf file tailored to your Azure environment (see the configs folder for examples):

# Azure subscription & workspace settings
AZ_SUBSCRIPTION_ID="your-subscription-id"
AZ_RESOURCE_GROUP="your-resource-group"
AZ_ML_WORKSPACE="your-workspace-name"

# Endpoint and deployment settings
AZ_ENDPOINT_NAME="gptoss-endpoint-h100"
AZ_INSTANCE_TYPE="Standard_NC40ads_H100_v5"
# ... other settings
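
These .conf files are plain shell variable assignments (presumably sourced by deploy-main.sh), so you can sanity-check one before deploying:

# Load the config into the current shell and echo the key values
source config.conf
echo "$AZ_ENDPOINT_NAME on $AZ_INSTANCE_TYPE"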

 

🚦 Deployment Options

Option 1: Fully Automated Deployment

chmod +x deploy-main.sh

./deploy-main.sh config.conf

 

Option 2: Step-by-Step Deployment

# Create environment only
./deploy-main.sh --env-only config.conf

# Create endpoint and deployment
./deploy-main.sh --endpoint-only config.conf
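
Under the hood, Azure ML describes endpoints and deployments as YAML specs. A rough sketch of what the endpoint step amounts to, assuming the H100 config above (field names follow the public managed online endpoint/deployment schemas, not the repo's exact files):

# endpoint.yml (sketch)
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: gptoss-endpoint-h100
auth_mode: key

# deployment.yml (sketch); "current" matches the deployment name used in the log commands below
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: current
endpoint_name: gptoss-endpoint-h100
instance_type: Standard_NC40ads_H100_v5
instance_count: 1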

 

🧪 Testing the Endpoint

Using curl

# Retrieve the endpoint's auth key
az ml online-endpoint get-credentials --name your-endpoint-name

# The request path depends on the serving container; vLLM-style
# OpenAI-compatible servers expose /v1/chat/completions
curl -X POST "https://your-endpoint.region.inference.ml.azure.com/v1/chat/completions" \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
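
If you're unsure of the endpoint URL, you can read it straight from the workspace:

# Print the endpoint's scoring URI
az ml online-endpoint show --name your-endpoint-name --query scoring_uri -o tsv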

 

Using OpenAI SDK

from openai import OpenAI

client = OpenAI(
    # base_url points at the endpoint's OpenAI-compatible route
    # (assuming a vLLM-style server under /v1)
    base_url="https://your-endpoint.region.inference.ml.azure.com/v1",
    api_key="your-key"
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(result.choices[0].message)
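
For long generations you may prefer to stream tokens as they arrive; the same client supports the standard OpenAI streaming interface:

# Stream the response instead of waiting for the full completion
stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)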

 

📊 Monitoring & Management

# List endpoints
az ml online-endpoint list

# Show endpoint details
az ml online-endpoint show --name your-endpoint-name

# View deployment logs
az ml online-deployment get-logs --name current --endpoint-name your-endpoint-name --lines 100
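
When you're done experimenting, delete the endpoint so the GPU instance stops accruing charges:

# Delete the endpoint and all of its deployments
az ml online-endpoint delete --name your-endpoint-name --yes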

 

📈 Load Testing

We benchmarked gpt-oss-20B on 1xH100 GPU across concurrency levels of 25, 50, and 75. Input and output token sizes were held constant across all runs.

Settings:

  • Input Tokens per Request: 250
  • Output Tokens per Request: ~1,500 (target)
  • Test Duration: ~10 minutes 
  • Model: gpt-oss-20B
  • Concurrency: 25, 50, and 75
  • VM: 1x NC-H100 (Standard_NC40ads_H100_v5)
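
For reference, here's a minimal sketch of a comparable load test, using the async OpenAI client to hold a fixed concurrency against the endpoint (an illustration with placeholder URL and key, not the exact harness behind the numbers below):

import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://your-endpoint.region.inference.ml.azure.com/v1",  # placeholder
    api_key="your-key",  # placeholder
)

async def one_request() -> int:
    # One chat call: ~250 prompt tokens in, up to 1,500 completion tokens out
    resp = await client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": "your ~250-token prompt here"}],
        max_tokens=1500,
    )
    return resp.usage.completion_tokens

async def run(concurrency: int, duration_s: int = 600) -> None:
    start, tokens = time.monotonic(), 0

    async def worker():
        nonlocal tokens
        while time.monotonic() - start < duration_s:
            tokens += await one_request()

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    print(f"concurrency={concurrency}: ~{tokens / duration_s:.0f} output tokens/sec")

asyncio.run(run(concurrency=25))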

Results:

  • Latency (response time) increased with higher concurrency: ~13.7 s → 16.7 s → 22.0 s.
  • Throughput (output tokens/sec) scaled significantly with concurrency: ~2.6k → 4.1k → 4.6k tokens/sec.
  • Request throughput also improved: ~1.8 → 2.8 → 3.1 requests/sec.

 

Conclusion: The H100 scales throughput efficiently as concurrency grows, but with a trade-off of gradually increasing response latency. At higher concurrency (50 → 75), throughput gains begin to plateau, indicating near-saturation of GPU capacity.

Updated Sep 15, 2025
Version 3.0