Azure AI Foundry Blog

Deploying GPT-OSS as Azure ML Online Endpoint

maljazaery
Sep 10, 2025
As AI workloads grow in complexity and scale, deploying large language models like GPT-OSS efficiently becomes critical. In this post, we'll walk through how to deploy GPT-OSS using an Azure Machine Learning (Azure ML) Online Endpoint on managed compute (NV-A10 and NC-H100), leveraging a streamlined, script-driven approach.

❓Why Azure ML Online Endpoints?


Azure ML online endpoints provide a fully managed, scalable, and secure way to serve models like GPT-OSS. They support production-grade features like blue-green deployments for safe rollouts and traffic mirroring to test new versions without impacting live traffic. With built-in autoscaling, authentication, monitoring, and seamless REST API integration, they’re ideal for deploying large models on managed compute with minimal operational overhead.
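
For example, a blue-green rollout can shift, or mirror, a slice of live traffic to a new deployment with a single CLI call. A minimal sketch, assuming two deployments named blue and green already exist on the endpoint:

# Shift 10% of live traffic to the new (green) deployment
az ml online-endpoint update --name gptoss-endpoint-h100 --traffic "blue=90 green=10"

# Mirror 10% of live traffic to green for testing, without affecting responses
az ml online-endpoint update --name gptoss-endpoint-h100 --mirror-traffic "green=10"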

🧰 What You’ll Need

Before diving in, make sure you have the following:

  • Azure CLI installed and authenticated
  • An Azure ML workspace set up
  • Contributor or Owner permissions on your Azure subscription
  • Sufficient GPU quota in your Azure ML workspace for the target VM family (NV-A10 or NC-H100)
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# Login to Azure
az login
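
The az ml commands used later in this post come from the Azure ML CLI v2 extension; if it's not installed yet, add it and optionally set workspace defaults:

# Install the Azure ML CLI v2 extension
az extension add -n ml

# Optional: set defaults so later commands can omit --resource-group/--workspace-name
az configure --defaults group=your-resource-group workspace=your-workspace-name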

📁 Clone the code

git clone https://github.com/maljazaery/AzureML_LLM_Endpoint_Deployment_Script.git

The repo includes an environment example and two deployment configs:

  • AML_env/gpt_oss/: Dockerfile and environment setup for GPT-OSS
  • configs/gpt_oss/config_a10.conf: Sample config for NV-A10 GPU
  • configs/gpt_oss/config_h100.conf: Sample config for NC-H100 GPU

⚙️ Configuration

Create a config.conf file tailored to your Azure environment (see the configs folder for examples):

# Azure subscription & workspace settings
AZ_SUBSCRIPTION_ID="your-subscription-id"
AZ_RESOURCE_GROUP="your-resource-group"
AZ_ML_WORKSPACE="your-workspace-name"

# Endpoint and deployment settings
AZ_ENDPOINT_NAME="gptoss-endpoint-h100"
AZ_INSTANCE_TYPE="Standard_NC40ads_H100_v5"
# ... other settings
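
These .conf files are plain shell variable assignments (presumably sourced by deploy-main.sh), so you can sanity-check one before deploying:

# Load the config into the current shell and echo the key values
source config.conf
echo "$AZ_ENDPOINT_NAME on $AZ_INSTANCE_TYPE"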

 

🚦 Deployment Options

Option 1: Fully Automated Deployment

chmod +x deploy-main.sh

./deploy-main.sh config.conf

 

Option 2: Step-by-Step Deployment

# Create environment only
./deploy-main.sh --env-only config.conf

# Create endpoint and deployment
./deploy-main.sh --endpoint-only config.conf
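
Under the hood, Azure ML describes endpoints and deployments as YAML specs. A rough sketch of what the endpoint step amounts to, assuming the H100 config above (field names follow the public managed online endpoint/deployment schemas, not the repo's exact files):

# endpoint.yml (sketch)
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: gptoss-endpoint-h100
auth_mode: key

# deployment.yml (sketch); "current" matches the deployment name used in the log commands below
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: current
endpoint_name: gptoss-endpoint-h100
instance_type: Standard_NC40ads_H100_v5
instance_count: 1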

 

🧪 Testing the Endpoint

Using curl

# Retrieve the endpoint's auth key
az ml online-endpoint get-credentials --name your-endpoint-name

# The request path depends on the serving container; vLLM-style
# OpenAI-compatible servers expose /v1/chat/completions
curl -X POST "https://your-endpoint.region.inference.ml.azure.com/v1/chat/completions" \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
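
If you're unsure of the endpoint URL, you can read it straight from the workspace:

# Print the endpoint's scoring URI
az ml online-endpoint show --name your-endpoint-name --query scoring_uri -o tsv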

 

Using OpenAI SDK

from openai import OpenAI

client = OpenAI(
    # base_url points at the endpoint's OpenAI-compatible route
    # (assuming a vLLM-style server under /v1)
    base_url="https://your-endpoint.region.inference.ml.azure.com/v1",
    api_key="your-key"
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(result.choices[0].message)
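
For long generations you may prefer to stream tokens as they arrive; the same client supports the standard OpenAI streaming interface:

# Stream the response instead of waiting for the full completion
stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)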

 

📊 Monitoring & Management

# List endpoints
az ml online-endpoint list

# Show endpoint details
az ml online-endpoint show --name your-endpoint-name

# View deployment logs
az ml online-deployment get-logs --name current --endpoint-name your-endpoint-name --lines 100
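
When you're done experimenting, delete the endpoint so the GPU instance stops accruing charges:

# Delete the endpoint and all of its deployments
az ml online-endpoint delete --name your-endpoint-name --yes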

 

📈 Load Testing

We benchmarked gpt-oss-20B on 1xH100 GPU across concurrency levels of 25, 50, and 75. Input and output token sizes were held constant across all runs.

Settings:

  • Input Tokens per Request: 250
  • Output Tokens per Request: ~1,500 (target)
  • Test Duration: ~10 minutes 
  • Model: gpt-oss-20B
  • Concurrency: 25, 50, and 75
  • VM: 1x NC-H100 (Standard_NC40ads_H100_v5)
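
For reference, here's a minimal sketch of a comparable load test, using the async OpenAI client to hold a fixed concurrency against the endpoint (an illustration with placeholder URL and key, not the exact harness behind the numbers below):

import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://your-endpoint.region.inference.ml.azure.com/v1",  # placeholder
    api_key="your-key",  # placeholder
)

async def one_request() -> int:
    # One chat call: ~250 prompt tokens in, up to 1,500 completion tokens out
    resp = await client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": "your ~250-token prompt here"}],
        max_tokens=1500,
    )
    return resp.usage.completion_tokens

async def run(concurrency: int, duration_s: int = 600) -> None:
    start, tokens = time.monotonic(), 0

    async def worker():
        nonlocal tokens
        while time.monotonic() - start < duration_s:
            tokens += await one_request()

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    print(f"concurrency={concurrency}: ~{tokens / duration_s:.0f} output tokens/sec")

asyncio.run(run(concurrency=25))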

Results:

  • Latency (response time) increased with higher concurrency: ~13.7 s → 16.7 s → 22.0 s.
  • Throughput (output tokens/sec) scaled significantly with concurrency: ~2.6k → 4.1k → 4.6k tokens/sec.
  • Request throughput also improved: ~1.8 → 2.8 → 3.1 requests/sec.

 

Conclusion: The H100 scales throughput efficiently as concurrency grows, but with a trade-off of gradually increasing response latency. At higher concurrency (50 → 75), throughput gains begin to plateau, indicating near-saturation of GPU capacity.

Updated Sep 15, 2025
Version 3.0