Apps on Azure Blog

OpenAI's gpt-oss models on Azure Container Apps serverless GPUs

Cary_Chai
Aug 07, 2025

Deploy OpenAI gpt-oss-120b and gpt-oss-20b models to Azure Container Apps serverless GPUs with Ollama

Just yesterday, OpenAI announced the release of gpt-oss-120b and gpt-oss-20b, two new state-of-the-art open-weight language models. These models are designed to run on lighter-weight GPU resources, making them highly accessible for developers who want to self-host powerful language models within their own environments.

If you’re looking to deploy these models in the cloud, Azure Container Apps serverless GPUs are a great option. With support for both A100 and T4 GPUs, serverless GPUs can run both gpt-oss-120b and gpt-oss-20b, providing a cost-efficient, scalable platform with minimal infrastructure overhead.

OpenAI gpt-oss-120b running on Azure Container Apps serverless GPUs

In this blog post, we’ll walk through:

  • Understanding the benefits of using serverless GPUs for open-source model hosting
  • Choosing the right gpt-oss model for you
  • Deploying the Ollama container on Azure Container Apps serverless GPUs
  • Running OpenAI’s gpt-oss models in a scalable, cost-effective environment

Why use Azure Container Apps serverless GPUs?

Azure Container Apps is a fully managed, serverless container platform that simplifies the deployment and operation of containerized applications. With serverless GPU support, you can bring your own containers and deploy them to GPU-backed environments that automatically scale based on demand.

Key benefits:

  • Autoscaling – scale to zero when idle, scale out with usage
  • Pay-per-second billing – pay only for the compute you use
  • Ease of use – accelerate developer velocity and easily bring any container and run it on GPUs in the cloud
  • No infrastructure management – focus on your model and app
  • Enterprise-grade features – out-of-the-box support for bringing your own virtual network, managed identity, private endpoints, and more, with full data governance

Choosing the right gpt-oss model

The gpt-oss models deliver strong performance across common language benchmarks and are optimized for different use cases:

  • gpt-oss-120b is comparable to OpenAI's o4-mini and is a powerful reasoning model suitable for high-performance workloads. It can run on A100 GPUs on Azure Container Apps serverless GPUs.
  • gpt-oss-20b is comparable to OpenAI's o3-mini and is ideal for lighter-weight small language model (SLM) apps, with excellent performance for the cost. It runs more cheaply on T4 GPUs or faster on A100 GPUs.

Deploy Azure Container Apps resources

  1. Go to the Azure Portal.

  2. Click Create a resource.

  3. Search for Azure Container Apps.

  4. Select Container App and Create.

  5. On the Basics tab, you can leave most of the defaults. The region you’ll want to select will depend on the gpt-oss model that you want to use. To run the 120B parameter model, select one of the A100 regions. To run the 20B model, select either a T4 or A100 region.
    Region           A100   T4
    West US                  ✓
    West US 3         ✓      ✓
    Sweden Central    ✓      ✓
    Australia East    ✓      ✓
    West Europe              ✓


  6. In the Container tab, fill in the following details for the Ollama container.
    Field                  Value
    Image source           Docker Hub or other registries
    Image type             Public
    Registry login server  docker.io
    Image and tag          ollama/ollama:latest
    Workload profile       Consumption
    GPU                    Check the box
    GPU type               A100 for gpt-oss:120b; T4 or A100 for gpt-oss:20b

    *By default, pay-as-you-go and EA customers have quota. If you don't have quota for serverless GPUs in Azure Container Apps, request quota here.

  7. In the Ingress tab, fill in the following details:
    Field            Value
    Ingress          Enabled
    Ingress traffic  Accepting traffic from anywhere
    Target port      11434


  8. Select Review + Create at the bottom of the page, then select Create.
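
If you prefer to script the deployment, the Azure CLI can create similar resources. The following is a minimal, hedged sketch rather than an exact mirror of the portal flow: it assumes the serverless GPU workload profile type Consumption-GPU-NC24-A100 (Consumption-GPU-NC8as-T4 for T4) and uses placeholder resource names; confirm the profile types available in your region against the serverless GPU documentation.

    # Create a Container Apps environment (names are placeholders)
    az containerapp env create \
      --name ollama-env \
      --resource-group my-rg \
      --location swedencentral

    # Add a serverless A100 GPU workload profile to the environment
    # (profile type name assumed from the serverless GPU docs)
    az containerapp env workload-profile add \
      --name ollama-env \
      --resource-group my-rg \
      --workload-profile-name gpu \
      --workload-profile-type Consumption-GPU-NC24-A100

    # Deploy the public Ollama image with external ingress on port 11434
    az containerapp create \
      --name ollama-gpt-oss \
      --resource-group my-rg \
      --environment ollama-env \
      --image docker.io/ollama/ollama:latest \
      --ingress external \
      --target-port 11434 \
      --workload-profile-name gpu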

Use your gpt-oss model

  1. Once your deployment is complete, select Go to resource.

  2. Select the Application Url for your container app. This opens the application endpoint in your browser.

  3. (Optional) The following steps show how to interact with the model through the Azure Container Apps console. Console commands aren't counted as traffic, so the container app may scale back in after a set period of time. If you want the container app to remain active for longer while you work through these steps, go to the Scaling blade under Application and either set the min replica count to 1 or increase the cooldown period (a CLI sketch for this follows the list below). If you set the min replica count to 1, please ensure you reset it to 0 when not in use; otherwise your app will not scale back in, and you will be billed for the duration it is active.

  4. In the Azure portal, select the Monitoring dropdown. Then, select Console.

  5. Under Choose start up command, select Connect.

  6. Run the following command to pull the gpt-oss model. Use 120b or 20b depending on which model you want to run:
    ollama pull gpt-oss:120b


  7. Run the following command to start the gpt-oss model:
    ollama run gpt-oss:120b


  8. Input your prompt to see the model in action:
    Can you explain LLMs and recent developments in AI over the last few years?


  9. Congratulations! You've successfully run an OpenAI gpt-oss model on Azure Container Apps serverless GPUs!
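
To apply the scaling tip from step 3 without the portal, here is a hedged CLI sketch using the same placeholder names as the deployment sketch above:

    # Keep one replica warm while testing (billing continues while it runs)
    az containerapp update \
      --name ollama-gpt-oss \
      --resource-group my-rg \
      --min-replicas 1

    # When finished, allow the app to scale back in to zero
    az containerapp update \
      --name ollama-gpt-oss \
      --resource-group my-rg \
      --min-replicas 0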

(Optional) Call the Ollama gpt-oss API endpoint from your local machine

You can run the following curl commands from your local machine to call your container app's endpoint and interact with the Ollama-hosted gpt-oss model.

  1. Open your local shell

  2. Copy your container app URL


  3. Run the following command to set the OLLAMA_URL environment variable:
    export OLLAMA_URL="{Your application URL}"


  4. Run the following command to prompt the gpt-oss model. This curl request has streaming set to false, so it will return the fully generated response.

    curl -X POST "$OLLAMA_URL/api/generate" -H "Content-Type: application/json" -d '{
     "model": "gpt-oss:120b",
     "prompt": "Can you explain LLMs and recent developments in AI over the last few years?",
     "stream": false
    }'
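
Ollama also exposes a chat-style endpoint that takes a messages array, which is convenient for multi-turn conversations. Here's a minimal sketch, assuming the same OLLAMA_URL variable (the prompt text is just an example):

    curl -X POST "$OLLAMA_URL/api/chat" -H "Content-Type: application/json" -d '{
     "model": "gpt-oss:120b",
     "messages": [
      {"role": "user", "content": "Summarize what makes open-weight models useful."}
     ],
     "stream": false
    }'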

Congratulations!

You have successfully run a gpt-oss model on Azure Container Apps! You can follow these same steps to run any model from Ollama's library. In addition, Azure Container Apps is a completely agnostic compute platform: you can bring any Linux-based container for your AI workloads and run it on serverless GPUs.

Please comment below to let us know what you think of the experience and any AI workloads you're deploying to Azure Container Apps. 

Next steps

Azure Container Apps containers are ephemeral and don't have mounted storage by default. To persist your data and conversations, you can add a volume mount to your container app (a hedged CLI sketch follows below). For steps on how to add a volume mount, follow the steps here. To learn more about Azure Container Apps serverless GPUs, see the documentation.
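
As a minimal sketch of the first step, assuming an existing Azure Files share and the placeholder names used earlier (the storage account, key, and share values are hypothetical):

    # Hedged sketch: register an Azure Files share with the Container Apps
    # environment so it can later be mounted as a volume (placeholder names)
    az containerapp env storage set \
      --name ollama-env \
      --resource-group my-rg \
      --storage-name ollama-models \
      --azure-file-account-name mystorageaccount \
      --azure-file-account-key "$STORAGE_ACCOUNT_KEY" \
      --azure-file-share-name ollama \
      --access-mode ReadWrite

After the storage is registered, the volume still needs to be declared in the app's template and mounted into the container; for Ollama, mounting over /root/.ollama persists downloaded models across replicas. See the linked documentation for the exact template changes.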

Updated Aug 27, 2025
Version 17.0

3 Comments

  • Under "Use your gpt-oss model" step 8, update the command line to =>  

    ollama run <model name e.g. gpt-oss:120b> "Can you explain LLMs and recent developments in AI the last few years?" 


    powerofzero
    Copper Contributor

    This is great but since we can deploy the 120B serverlessly already, what would be really cool is access to the model via the Responses API so we can get the Assistants vector store and search, code interpreter, MCP support etc. When is this likely to happen?

      Cary_Chai

      Hi powerofzero, Azure Container Apps provides GPUs to run your containerized applications, and this post shows the simplest path to run gpt-oss via an Ollama container. If you prefer not to use Ollama, or need functionality that the model deployed through Ollama doesn't provide, you can deploy your own containerized version of gpt-oss today and build your own Responses API server on top of it. If you want to keep using Ollama but still get some of the Responses API functionality, solutions like Hugging Face's responses.js should be able to act as a proxy by mapping Responses API calls to chat completions against an Ollama-hosted gpt-oss. All of these options run on container apps today; just switch out the container and ingress port details in this post.