OpenAI recently introduced gpt-oss, a family of open-weight language models that deliver strong real-world performance at low cost. Released under the flexible Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, demonstrate strong tool-use capabilities, and are optimized for efficient deployment on consumer hardware; see the announcement: https://openai.com/index/introducing-gpt-oss/.
It’s an excellent choice for scenarios where you want the security and efficiency of a smaller model running on your application instance — while still getting impressive reasoning capabilities.
By hosting it on Azure App Service, you can take advantage of enterprise-grade features without worrying about managing infrastructure:
- Built-in autoscaling
- Integration with VNet
- Enterprise-grade security and compliance
- Easy CI/CD integration
- Choice of deployment methods
In this post, we’ll walk through a complete sample that uses gpt-oss-20b as a sidecar container running alongside a Python Flask app on Azure App Service.
All the source code and Bicep templates are available here:
📂 Azure-Samples/appservice-ai-samples/gpt-oss-20b-sample
Architecture of our sample at a glance
- Web app (Flask) runs as a code-based App Service.
- Model runs in a sidecar container (Ollama) in the same App Service.
- The Flask app calls the model over localhost:11434.
- Bicep provisions the Web App and an Azure Container Registry (ACR). You push your model image to ACR and attach it as a sidecar in the Portal.
1. Wrapping gpt-oss-20b in a Container
Code location:
/gpt-oss-20b-sample/ollama-image in the sample repo: https://github.com/Azure-Samples/appservice-ai-samples/tree/main/gpt-oss-20b-sample/ollama-image.
What this image does (at a glance)
- Starts the Ollama server
- Pulls the gpt-oss:20b model on first run
- Exposes port 11434 for the Flask app to call locally
Dockerfile:

```dockerfile
FROM ollama/ollama
EXPOSE 11434
COPY startup.sh /
RUN chmod +x /startup.sh
ENTRYPOINT ["/startup.sh"]
```
startup.sh:

```bash
#!/bin/sh
# Start Ollama in the background so we can pull the model
ollama serve &
sleep 5

# Pull the gpt-oss:20b model (slow on first run; cached afterwards)
ollama pull gpt-oss:20b

# Restart Ollama and run it in the foreground so the container stays alive
pkill -f "ollama"
ollama serve
```
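Once the container is up, you can sanity-check that the model was pulled by querying Ollama's /api/tags endpoint, which lists the locally available models. A minimal sketch of that check (the helper name is ours; the response shape follows Ollama's API, here shown with a trimmed sample payload):

```python
def model_present(tags_json: dict, model_name: str) -> bool:
    """Return True if model_name appears in an Ollama /api/tags response."""
    return any(m.get("name") == model_name for m in tags_json.get("models", []))

# Trimmed example of what GET http://localhost:11434/api/tags returns
sample = {"models": [{"name": "gpt-oss:20b"}]}
print(model_present(sample, "gpt-oss:20b"))  # → True
```

In practice you would fetch the JSON from http://localhost:11434/api/tags (for example with requests) and pass it to this helper.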
Build the image
Choose one of the two common paths:
A. Build locally with Docker
From the ollama-image folder:
```bash
# 1) (optional) pick a registry/image name up front
ACR_NAME=<your-acr-name>   # e.g., myacr123
IMAGE=ollama-gpt-oss:20b

# 2) build locally
docker build -t $IMAGE .
```
If you’re new to building images, see Docker’s build docs for options and examples.
B. Build in Azure (no local Docker required) with ACR Tasks
Run a cloud build directly from the repo or your working directory:
```bash
ACR_NAME=<your-acr-name>
az acr build \
  --registry $ACR_NAME \
  --image ollama-gpt-oss:20b \
  ./gpt-oss-20b-sample/ollama-image
```
ACR Tasks builds the image in Azure and pushes it straight into your registry.
Push the image to Azure Container Registry (ACR)
If you built locally, tag and push to your ACR:
```bash
# Log in to ACR (Azure CLI recommended)
az acr login --name $ACR_NAME

# Tag and push (note: the registry FQDN must be all lowercase)
docker tag ollama-gpt-oss:20b $ACR_NAME.azurecr.io/ollama-gpt-oss:20b
docker push $ACR_NAME.azurecr.io/ollama-gpt-oss:20b
```
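If you script this step, it is easy to break the push by mixing case in the registry name, since the login server FQDN must be all lowercase even when the registry was created with capitals. A tiny illustrative helper (not part of the sample) that builds the fully qualified reference:

```python
def acr_image_ref(acr_name: str, repository: str, tag: str) -> str:
    """Build a fully qualified ACR image reference.

    The registry FQDN must be all lowercase; the repository name must
    already be lowercase per Docker naming rules.
    """
    return f"{acr_name.lower()}.azurecr.io/{repository}:{tag}"

print(acr_image_ref("MyACR123", "ollama-gpt-oss", "20b"))
# → myacr123.azurecr.io/ollama-gpt-oss:20b
```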
Full “push/pull with Docker CLI” quickstart is here if you need it.
2. The Flask Application
Our main app is a simple Python Flask service that connects to the model running in the sidecar.
Since the sidecar shares the same network namespace as the main app, we can call it at http://localhost:11434.
```python
import json

import requests
from flask import Flask, Response, request

app = Flask(__name__)

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "gpt-oss:20b"

@app.route("/chat", methods=["POST"])
def chat():
    data = request.get_json()
    prompt = data.get("prompt", "")

    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True
    }

    def generate():
        # Ollama streams one JSON object per line; relay each content chunk
        with requests.post(f"{OLLAMA_HOST}/api/chat", json=payload, stream=True) as r:
            for line in r.iter_lines(decode_unicode=True):
                if line:
                    event = json.loads(line)
                    if "message" in event:
                        yield event["message"]["content"]

    return Response(generate(), mimetype="text/plain")
```

This lets your app stream responses back to the browser in real time, giving a chat-like experience.
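The NDJSON parsing that `generate()` performs can be isolated into a pure function, which makes the streaming logic easy to reason about and test. The function name here is ours; the event shape matches Ollama's /api/chat streaming format:

```python
import json
from typing import Iterable, Iterator

def extract_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the content chunks from Ollama-style NDJSON chat events."""
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        event = json.loads(line)
        if "message" in event:
            yield event["message"]["content"]

# Simulated stream of NDJSON lines as Ollama would send them
stream = [
    '{"message": {"role": "assistant", "content": "Hello"}}',
    '',
    '{"message": {"role": "assistant", "content": ", world"}}',
    '{"done": true}',
]
print("".join(extract_content(stream)))  # → Hello, world
```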
3. Deploying to Azure App Service
Code location:
/gpt-oss-20b-sample/flask-app in the sample repo: https://github.com/Azure-Samples/appservice-ai-samples/tree/main/gpt-oss-20b-sample/flask-app
You can deploy the Flask app using your preferred method — VS Code, GitHub Actions, az webapp up, or via Bicep.
We’ve included a Bicep template that sets up:
- An Azure Container Registry for your sidecar image
- An Azure Web App running on Premium V4 for best performance and cost efficiency
🔗 Azure App Service Premium V4 now in Public Preview
If you want to use the azd template, clone the repo and run these commands from the sample folder:

```bash
azd init
azd up
```
Open the Web App in Azure Portal and add a sidecar:
- How-to: https://learn.microsoft.com/azure/app-service/configure-sidecar
- Choose your ACR image (the one you built in Step 1) and set the port to 11434
First-startup note: the sidecar pulls the gpt-oss:20b model on first run, so cold start will take time. Subsequent restarts are faster because the model layers are already cached.
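Because of that cold start, it can help to wait for the sidecar before routing traffic to it. A minimal retry loop, sketched with an injectable probe so the timing logic stays testable (the function, its defaults, and the probe suggestion are assumptions, not part of the sample):

```python
import time
from typing import Callable

def wait_for_sidecar(probe: Callable[[], bool],
                     timeout_s: float = 300.0,
                     interval_s: float = 5.0,
                     sleep=time.sleep,
                     clock=time.monotonic) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False

# In the Flask app, `probe` could issue a GET to
# http://localhost:11434/api/tags and return True once it succeeds.
```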
Try it out: open your site and you'll find a chat UI backed by gpt-oss-20b, running locally as a sidecar on Azure App Service.
Conclusion
With GPT-OSS-20B running as a sidecar on Azure App Service, you get the best of both worlds — the flexibility of open-source models and the reliability, scalability, and security of a fully managed platform. This setup makes it easy to integrate AI capabilities into your applications without having to provision or manage custom infrastructure.
Whether you’re building a lightweight chat experience, prototyping a new AI-powered feature, or experimenting with domain-specific fine-tuning, this approach provides a robust foundation. You can scale your application based on demand, swap out models as needed, and take advantage of the full Azure ecosystem for networking, observability, and deployment automation.
Next Steps & Resources
Here are some useful resources to help you go further:
- 📂 Sample Code & Templates – gpt-oss-20b Sample Repository
- 📖 About GPT-OSS – Introducing gpt-oss (OpenAI blog)
- 🛠 Deploying Sidecars – Configure Sidecars in Azure App Service
- 🚀 Premium V4 Plan – Azure App Service Premium V4 announcement
- 📦 Pushing Images to ACR – Push and pull container images in Azure Container Registry
- 💡 Advanced AI Patterns – Build RAG solutions with Azure AI Search