Apps on Azure Blog

OpenAI's gpt-oss models on Azure Container Apps serverless GPUs

Cary_Chai
Aug 07, 2025

Deploy OpenAI gpt-oss-120b and gpt-oss-20b models to Azure Container Apps serverless GPUs with Ollama

Just yesterday, OpenAI announced the release of gpt-oss-120b and gpt-oss-20b, two new state-of-the-art open-weight language models. These models are designed to run on lighter-weight GPU resources, making them highly accessible for developers who want to self-host powerful language models within their own environments.

If you’re looking to deploy these models in the cloud, Azure Container Apps serverless GPUs are a great option. With support for both A100 and T4 GPUs, serverless GPUs can run both gpt-oss-120b and gpt-oss-20b, providing a cost-efficient, scalable platform with minimal infrastructure overhead.

OpenAI gpt-oss-120b running on Azure Container Apps serverless GPUs

In this blog post, we’ll walk through:

  • Understanding the benefits of using serverless GPUs for open-source model hosting
  • Choosing the right gpt-oss model for you
  • Deploying the Ollama container on Azure Container Apps serverless GPUs
  • Running OpenAI’s gpt-oss models in a scalable, cost-effective environment

Why use Azure Container Apps serverless GPUs?

Azure Container Apps is a fully managed, serverless container platform that simplifies the deployment and operation of containerized applications. With serverless GPU support, you can bring your own containers and deploy them to GPU-backed environments that automatically scale based on demand.

Key benefits:

  • Autoscaling – scale to zero when idle, scale out with usage
  • Pay-per-second billing – pay only for the compute you use
  • Ease of use – accelerate developer velocity and easily bring any container and run it on GPUs in the cloud
  • No infrastructure management – focus on your model and app
  • Enterprise-grade features – out-of-the-box support for bringing your own virtual network, managed identity, private endpoints, and more, with full data governance

Choosing the right gpt-oss model

The gpt-oss models deliver strong performance across common language benchmarks and are optimized for different use cases:

  • gpt-oss-120b is comparable to OpenAI's o4-mini and is a powerful reasoning model suitable for high-performance workloads. It can run on A100 GPUs on Azure Container Apps serverless GPUs.
  • gpt-oss-20b is comparable to OpenAI's o3-mini and is ideal for lighter-weight small language model (SLM) apps, with excellent performance for the cost. It runs more cheaply on T4 GPUs or faster on A100 GPUs.

Deploy Azure Container Apps resources

  1. Go to the Azure Portal.

  2. Click Create a resource.

  3. Search for Azure Container Apps.

  4. Select Container App and Create.

  5. On the Basics tab, you can leave most of the defaults. The region you’ll want to select will depend on the gpt-oss model that you want to use. To run the 120B parameter model, select one of the A100 regions. To run the 20B model, select either a T4 or A100 region.
    Region           A100   T4
    West US                  ✓
    West US 3         ✓      ✓
    Sweden Central    ✓      ✓
    Australia East    ✓      ✓
    West Europe              ✓


  6. In the Container tab, fill in the following details for the Ollama container.
    Field                  Value
    Image source           Docker Hub or other registries
    Image type             Public
    Registry login server  docker.io
    Image and tag          ollama/ollama:latest
    Workload profile       Consumption
    GPU                    Check the box
    GPU type               A100 for gpt-oss:120b; T4 or A100 for gpt-oss:20b

    *By default, pay-as-you-go and EA customers have quota. If you don't have quota for serverless GPUs in Azure Container Apps, request quota here.

  7. In the Ingress tab, fill in the following details:
    Field            Value
    Ingress          Enabled
    Ingress traffic  Accepting traffic from anywhere
    Target port      11434


  8. Select Review + Create at the bottom of the page, then select Create.
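
If you prefer to script the deployment, the Azure CLI can create similar resources. The following is a minimal, hedged sketch rather than an exact mirror of the portal flow: it assumes the serverless GPU workload profile type Consumption-GPU-NC24-A100 (Consumption-GPU-NC8as-T4 for T4) and uses placeholder resource names; confirm the profile types available in your region against the serverless GPU documentation.

    # Create a Container Apps environment (names are placeholders)
    az containerapp env create \
      --name ollama-env \
      --resource-group my-rg \
      --location swedencentral

    # Add a serverless A100 GPU workload profile to the environment
    # (profile type name assumed from the serverless GPU docs)
    az containerapp env workload-profile add \
      --name ollama-env \
      --resource-group my-rg \
      --workload-profile-name gpu \
      --workload-profile-type Consumption-GPU-NC24-A100

    # Deploy the public Ollama image with external ingress on port 11434
    az containerapp create \
      --name ollama-gpt-oss \
      --resource-group my-rg \
      --environment ollama-env \
      --image docker.io/ollama/ollama:latest \
      --ingress external \
      --target-port 11434 \
      --workload-profile-name gpu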

Use your gpt-oss model

  1. Once your deployment is complete, select Go to resource.

  2. Select the Application Url for your container app. This opens the application endpoint in your browser.

  3. (Optional) The following steps show how to interact with the model through the Azure Container Apps console. Console commands aren't counted as traffic, so the container app may scale back in after a set period of time. If you want the container app to remain active for longer while you work through these steps, go to the Scaling blade under Application and either set the min replica count to 1 or increase the cooldown period (a CLI sketch for this follows the list below). If you set the min replica count to 1, please ensure you reset it to 0 when not in use; otherwise your app will not scale back in, and you will be billed for the duration it is active.

  4. In the Azure portal, select the Monitoring dropdown. Then, select Console.

  5. Under Choose start up command, select Connect.

  6. Run the following command to pull the gpt-oss model. Use 120b or 20b depending on which model you want to run:
    ollama pull gpt-oss:120b


  7. Run the following command to start the gpt-oss model:
    ollama run gpt-oss:120b


  8. Input your prompt to see the model in action:
    Can you explain LLMs and recent developments in AI over the last few years?


  9. Congratulations! You've successfully run an OpenAI gpt-oss model on Azure Container Apps serverless GPUs!
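
To apply the scaling tip from step 3 without the portal, here is a hedged CLI sketch using the same placeholder names as the deployment sketch above:

    # Keep one replica warm while testing (billing continues while it runs)
    az containerapp update \
      --name ollama-gpt-oss \
      --resource-group my-rg \
      --min-replicas 1

    # When finished, allow the app to scale back in to zero
    az containerapp update \
      --name ollama-gpt-oss \
      --resource-group my-rg \
      --min-replicas 0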

(Optional) Call the Ollama gpt-oss API endpoint from your local machine

You can run the following curl commands from your local machine to call your container app's endpoint and interact with the Ollama-hosted gpt-oss model.

  1. Open your local shell

  2. Copy your container app URL


  3. Run the following command to set the OLLAMA_URL environment variable:
    export OLLAMA_URL="{Your application URL}"


  4. Run the following command to prompt the gpt-oss model. This curl request has streaming set to false, so it will return the fully generated response.

    curl -X POST "$OLLAMA_URL/api/generate" -H "Content-Type: application/json" -d '{
     "model": "gpt-oss:120b",
     "prompt": "Can you explain LLMs and recent developments in AI over the last few years?",
     "stream": false
    }'
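
Ollama also exposes a chat-style endpoint that takes a messages array, which is convenient for multi-turn conversations. Here's a minimal sketch, assuming the same OLLAMA_URL variable (the prompt text is just an example):

    curl -X POST "$OLLAMA_URL/api/chat" -H "Content-Type: application/json" -d '{
     "model": "gpt-oss:120b",
     "messages": [
      {"role": "user", "content": "Summarize what makes open-weight models useful."}
     ],
     "stream": false
    }'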

Congratulations!

You have successfully run a gpt-oss model on Azure Container Apps! You can follow these same steps to run any model from Ollama's library. In addition, Azure Container Apps is a completely agnostic compute platform: you can bring any Linux-based container for your AI workloads and run it on serverless GPUs.

Please comment below to let us know what you think of the experience and any AI workloads you're deploying to Azure Container Apps. 

Next steps

Azure Container Apps containers are ephemeral and don't have mounted storage by default. To persist your data and conversations, you can add a volume mount to your container app (a hedged CLI sketch follows below). For steps on how to add a volume mount, follow the steps here. To learn more about Azure Container Apps serverless GPUs, see the documentation.
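
As a minimal sketch of the first step, assuming an existing Azure Files share and the placeholder names used earlier (the storage account, key, and share values are hypothetical):

    # Hedged sketch: register an Azure Files share with the Container Apps
    # environment so it can later be mounted as a volume (placeholder names)
    az containerapp env storage set \
      --name ollama-env \
      --resource-group my-rg \
      --storage-name ollama-models \
      --azure-file-account-name mystorageaccount \
      --azure-file-account-key "$STORAGE_ACCOUNT_KEY" \
      --azure-file-share-name ollama \
      --access-mode ReadWrite

After the storage is registered, the volume still needs to be declared in the app's template and mounted into the container; for Ollama, mounting over /root/.ollama persists downloaded models across replicas. See the linked documentation for the exact template changes.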

Updated Aug 27, 2025
Version 17.0

3 Comments

  • Under "Use your gpt-oss model" step 8, update the command line to =>  

    ollama run <model name e.g. gpt-oss:120b> "Can you explain LLMs and recent developments in AI the last few years?" 


    powerofzero
    Copper Contributor

    This is great but since we can deploy the 120B serverlessly already, what would be really cool is access to the model via the Responses API so we can get the Assistants vector store and search, code interpreter, MCP support etc. When is this likely to happen?

      Cary_Chai

      Hi powerofzero, Azure Container Apps provides GPUs to run your containerized applications, and this post shows the simplest path to run gpt-oss via an Ollama container. If you prefer not to use Ollama, or need functionality that the model deployed through Ollama doesn't provide, you can deploy your own containerized version of gpt-oss today and build your own Responses API server on top of it. If you want to keep using Ollama but still get some of the Responses API functionality, solutions like Hugging Face's responses.js should be able to act as a proxy by mapping Responses API calls to chat completions against an Ollama-hosted gpt-oss. All of these options run on container apps today; just switch out the container and ingress port details in this post.