
Apps on Azure Blog

Gemma 4 on Azure Container Apps Serverless GPU

simonjj
Apr 15, 2026

Secure, private and performant OpenCode agent configuration -- batteries included.

Every prompt you send to a hosted AI service leaves your tenant. Your code, your architecture decisions, your proprietary logic — all of it crosses a network boundary you don't control. For teams building in regulated industries or handling sensitive IP, that's not a philosophical concern. It's a compliance blocker.

What if you could spin up a fully private AI coding agent — running on your own GPU, in your own Azure subscription — with a single command?

That's exactly what this template does. One azd up, 15 minutes, and you have Google's Gemma 4 running on Azure Container Apps serverless GPU with an OpenAI-compatible API, protected by auth, and ready to power OpenCode as your terminal-based coding agent. No data leaves your environment. No third-party model provider sees your code. Full control.

Why Self-Hosted AI on ACA?

Azure Container Apps serverless GPU gives you on-demand GPU compute without managing VMs, Kubernetes clusters, or GPU drivers. You get a container, a GPU, and an HTTPS endpoint — Azure handles the rest.

Here's what makes this approach different from calling a hosted model API:

  • Complete data privacy — your code and prompts never leave your Azure subscription. No PII exposure, no data leakage, no third-party processing. For teams navigating HIPAA, SOC 2, or internal IP policies, this is the simplest path to compliant AI-assisted development.
  • Predictable costs — you pay for GPU compute time, not per-token. Run as many prompts as you want against your deployed model.
  • No rate limits — the GPU is yours. No throttling, no queue, no waiting for capacity.
  • Model flexibility — swap models in minutes. Start with the 4B parameter Gemma 4 for fast iteration, scale up to 26B for complex reasoning tasks.

This isn't a tradeoff between convenience and privacy. ACA serverless GPU makes self-hosted AI as easy to deploy as any SaaS endpoint — but the data stays yours.

What You're Building

Here's what the configuration looks like to run Gemma 4 + Ollama securely on ACA serverless GPU.

The template deploys two containers into an Azure Container Apps environment:

  1. Ollama + Gemma 4 — running on a serverless GPU (NVIDIA T4 or A100), serving an OpenAI-compatible API
  2. Nginx auth proxy — a lightweight reverse proxy that adds basic authentication and exposes the endpoint over HTTPS

The Ollama container pulls the Gemma 4 model on first start, so there's nothing to pre-build or upload. The nginx proxy runs on the free Consumption profile — only the Ollama container needs GPU.
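For the curious, the proxy's job is small enough to sketch. The config below is illustrative rather than the template's actual nginx.conf — the upstream hostname, listen port, and file paths are assumptions; only Ollama's default port (11434) comes from Ollama itself:

```nginx
server {
    listen 8080;

    location / {
        # Reject requests without valid admin credentials.
        auth_basic           "Gemma 4 on ACA";
        auth_basic_user_file /etc/nginx/.htpasswd;

        # Forward everything to the Ollama container inside the ACA environment.
        proxy_pass         http://ollama:11434;
        proxy_http_version 1.1;

        # Token streaming: don't buffer responses, and allow long generations.
        proxy_buffering    off;
        proxy_read_timeout 600s;
    }
}
```

ACA terminates TLS at the environment's ingress, so the proxy itself only needs to handle plain HTTP and auth.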

After deployment, you get a single HTTPS endpoint that works with curl, any OpenAI-compatible SDK, or OpenCode — a terminal-based AI coding agent that turns the whole thing into a private GitHub Copilot alternative.

Step 1: Deploy with azd up

You need the Azure CLI and Azure Developer CLI (azd) installed.

git clone https://github.com/simonjj/gemma4-on-aca.git
cd gemma4-on-aca
azd up

The setup walks you through three choices:

GPU selection — T4 (16 GB VRAM) for smaller models, or A100 (80 GB VRAM) for the full Gemma 4 lineup.

Model selection — depends on your GPU choice. The defaults are tuned for the best quality-to-speed ratio on each GPU tier.

Proxy password — protects your endpoint with basic auth.

Region availability: Serverless GPUs are available in several regions, including australiaeast, brazilsouth, canadacentral, eastus, italynorth, swedencentral, uksouth, westus, and westus3. Pick one of these when prompted for a location.

That's it. Provisioning takes about 10 minutes — mostly waiting for the ACA environment to be created and the model to download.

The deployment output

Choose Your Model

Gemma 4 ships in four sizes. The right choice depends on your GPU and workload:

Model        Params   Architecture      Context   Modalities            Disk Size
gemma4:e2b   ~2B      Dense             128K      Text, Image, Audio    ~7 GB
gemma4:e4b   ~4B      Dense             128K      Text, Image, Audio    ~10 GB
gemma4:26b   26B      MoE (4B active)   256K      Text, Image           ~18 GB
gemma4:31b   31B      Dense             256K      Text, Image           ~20 GB

Real-World Performance on ACA

We benchmarked each model on every GPU tier it fits on, using Ollama v0.20 with Q4_K_M quantization and a 32K context in Sweden Central:

Model        GPU    Tokens/sec   TTFT     Notes
gemma4:e2b   T4     ~81          ~15 ms   Fastest on T4
gemma4:e4b   T4     ~51          ~17 ms   Default T4 choice — best quality/speed
gemma4:e2b   A100   ~184         ~9 ms    Ultra-fast
gemma4:e4b   A100   ~129         ~12 ms   Great for lighter workloads
gemma4:26b   A100   ~113         ~14 ms   Default A100 choice — strong reasoning
gemma4:31b   A100   ~40          ~30 ms   Highest quality, slower

51 tokens/second on a T4 with the 4B model is fast enough for interactive coding assistance. The 26B model on A100 delivers 113 tokens/second with noticeably better reasoning — ideal for complex refactoring, architecture questions, and multi-file changes.

The 26B and 31B models require A100 — they don't fit in T4's 16 GB VRAM.
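These figures are easy to sanity-check against your own deployment: stream a response, record when the request went out, when the first token arrived, and when the last one did, and both headline metrics fall out of simple arithmetic. A minimal sketch of that bookkeeping (the streaming HTTP call itself is omitted; the function names and timestamps are illustrative):

```python
def ttft_ms(request_sent: float, first_token: float) -> float:
    """Time to first token (TTFT), in milliseconds."""
    return (first_token - request_sent) * 1000.0


def tokens_per_sec(token_count: int, first_token: float, last_token: float) -> float:
    """Steady-state generation throughput after the first token arrives."""
    elapsed = last_token - first_token
    return token_count / elapsed if elapsed > 0 else 0.0


# Example: 256 tokens streamed over ~5 s after a ~17 ms TTFT —
# roughly the e4b-on-T4 profile from the table above.
print(ttft_ms(10.000, 10.017))              # ~17 ms
print(tokens_per_sec(256, 10.017, 15.017))  # ~51 tokens/sec
```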

Step 2: Verify Your Endpoint

After azd up completes, the post-provision hook prints your endpoint URL. Test it:

curl -u admin:<YOUR_PASSWORD> \
  https://<YOUR_PROXY_ENDPOINT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:e4b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

You should get a JSON response with Gemma 4's reply. The endpoint is fully OpenAI-compatible — it works with any tool or SDK that speaks the OpenAI API format.
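Beyond curl, any OpenAI-style client can hit the same endpoint. Here's a minimal, stdlib-only Python sketch — the chat helper and basic_auth_header function are illustrative names of my own, and the endpoint URL and password are placeholders from your deployment:

```python
import base64
import json
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Build the HTTP Basic Authorization value the nginx proxy checks."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def chat(endpoint: str, password: str, prompt: str, model: str = "gemma4:e4b") -> str:
    """Send one chat-completion request to the private Gemma 4 endpoint."""
    req = urllib.request.Request(
        f"{endpoint}/v1/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header("admin", password),
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# chat("https://<YOUR_PROXY_ENDPOINT>", "<YOUR_PASSWORD>", "Hello!")
```

The same pattern works with the official OpenAI SDKs: point the base URL at https://<YOUR_PROXY_ENDPOINT>/v1 and attach the Basic auth header as a default header.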

Step 3: Connect OpenCode

Here's where it gets powerful. OpenCode is a terminal-based AI coding agent — think GitHub Copilot, but running in your terminal and pointing at whatever model backend you choose.

The azd up post-provision hook automatically generates an opencode.json in your project directory with the correct endpoint and credentials. If you need to create it manually:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "gemma4-aca": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Gemma 4 on ACA",
      "options": {
        "baseURL": "https://<YOUR_PROXY_ENDPOINT>/v1",
        "headers": {
          "Authorization": "Basic <BASE64_OF_admin:YOUR_PASSWORD>"
        }
      },
      "models": {
        "gemma4:e4b": {
          "name": "Gemma 4 e4b (4B)"
        }
      }
    }
  }
}

Generate the Base64 value: echo -n "admin:YOUR_PASSWORD" | base64

Now run it:

opencode run -m "gemma4-aca/gemma4:e4b" "Write a binary search in Rust"

That command sends your prompt to Gemma 4 running on your ACA GPU, and streams the response back to your terminal. Every token is generated on your infrastructure. Nothing leaves your subscription.

For interactive sessions, launch the TUI:

opencode

Select your model with /models, pick Gemma 4, and start coding. OpenCode supports file editing, code generation, refactoring, and multi-turn conversations — all powered by your private Gemma 4 instance.

The Privacy Case

This matters most for teams that can't send code to external APIs:

  • HIPAA-regulated healthcare apps — patient data in code, schema definitions, and test fixtures stays in your Azure subscription
  • Financial services — proprietary trading algorithms and risk models never leave your network boundary
  • Defense and government — classified or CUI-adjacent codebases get AI assistance without external data processing agreements
  • Startups with sensitive IP — your secret sauce stays secret, even while you use AI to build faster

With ACA serverless GPU, you're not running a VM or managing a Kubernetes cluster to get this privacy. It's a managed container with a GPU attached. Azure handles the infrastructure, you own the data boundary.

Clean Up

When you're done:

azd down

This tears down all Azure resources. Since ACA serverless GPU bills only while your containers are running, you can also scale to zero replicas to pause costs without destroying the environment.

Get Started

Updated Apr 15, 2026
Version 2.0