Secure, private, and performant OpenCode agent configuration — batteries included.
Every prompt you send to a hosted AI service leaves your tenant. Your code, your architecture decisions, your proprietary logic — all of it crosses a network boundary you don't control. For teams building in regulated industries or handling sensitive IP, that's not a philosophical concern. It's a compliance blocker.
What if you could spin up a fully private AI coding agent — running on your own GPU, in your own Azure subscription — with a single command?
That's exactly what this template does. One azd up, 15 minutes, and you have Google's Gemma 4 running on Azure Container Apps serverless GPU with an OpenAI-compatible API, protected by auth, and ready to power OpenCode as your terminal-based coding agent. No data leaves your environment. No third-party model provider sees your code. Full control.
Why Self-Hosted AI on ACA?
Azure Container Apps serverless GPU gives you on-demand GPU compute without managing VMs, Kubernetes clusters, or GPU drivers. You get a container, a GPU, and an HTTPS endpoint — Azure handles the rest.
Here's what makes this approach different from calling a hosted model API:
- Complete data privacy — your code and prompts never leave your Azure subscription. No PII exposure, no data leakage, no third-party processing. For teams navigating HIPAA, SOC 2, or internal IP policies, this is the simplest path to compliant AI-assisted development.
- Predictable costs — you pay for GPU compute time, not per-token. Run as many prompts as you want against your deployed model.
- No rate limits — the GPU is yours. No throttling, no queue, no waiting for capacity.
- Model flexibility — swap models in minutes. Start with the 4B parameter Gemma 4 for fast iteration, scale up to 26B for complex reasoning tasks.
This isn't a tradeoff between convenience and privacy. ACA serverless GPU makes self-hosted AI as easy to deploy as any SaaS endpoint — but the data stays yours.
What You're Building
Here's what the configuration looks like when running Gemma 4 + Ollama securely on ACA serverless GPU. The template deploys two containers into an Azure Container Apps environment:
- Ollama + Gemma 4 — running on a serverless GPU (NVIDIA T4 or A100), serving an OpenAI-compatible API
- Nginx auth proxy — a lightweight reverse proxy that adds basic authentication and exposes the endpoint over HTTPS
The Ollama container pulls the Gemma 4 model on first start, so there's nothing to pre-build or upload. The nginx proxy runs on the free Consumption profile — only the Ollama container needs GPU.
After deployment, you get a single HTTPS endpoint that works with curl, any OpenAI-compatible SDK, or OpenCode — a terminal-based AI coding agent that turns the whole thing into a private GitHub Copilot alternative.
Step 1: Deploy with azd up
You need the Azure CLI and Azure Developer CLI (azd) installed.
git clone https://github.com/simonjj/gemma4-on-aca.git
cd gemma4-on-aca
azd up
The setup walks you through three choices:
- GPU selection — T4 (16 GB VRAM) for smaller models, or A100 (80 GB VRAM) for the full Gemma 4 lineup.
- Model selection — depends on your GPU choice. The defaults are tuned for the best quality-to-speed ratio on each GPU tier.
- Proxy password — protects your endpoint with basic auth.
Region availability: Serverless GPUs are available in a limited set of regions, including australiaeast, brazilsouth, canadacentral, eastus, italynorth, swedencentral, uksouth, westus, and westus3. Pick one of these when prompted for location.
That's it. Provisioning takes about 10 minutes — mostly waiting for the ACA environment to create and the model to download.
Choose Your Model
Gemma 4 ships in four sizes. The right choice depends on your GPU and workload:
| Model | Params | Architecture | Context | Modalities | Disk Size |
|---|---|---|---|---|---|
| gemma4:e2b | ~2B | Dense | 128K | Text, Image, Audio | ~7 GB |
| gemma4:e4b | ~4B | Dense | 128K | Text, Image, Audio | ~10 GB |
| gemma4:26b | 26B | MoE (4B active) | 256K | Text, Image | ~18 GB |
| gemma4:31b | 31B | Dense | 256K | Text, Image | ~20 GB |
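As a quick rule of thumb, the table above can be turned into a model-to-GPU fit check. This is a sketch, not an official sizing tool: the disk sizes come from the table, but the 4 GB headroom for KV cache and runtime overhead is my own rough assumption.

```python
# Rough model-to-GPU fit check based on the table above.
# Disk size approximates quantized weight size; VRAM also needs headroom
# for KV cache and runtime overhead (the 4 GB figure is an assumption).

GPU_VRAM_GB = {"T4": 16, "A100": 80}

MODEL_DISK_GB = {
    "gemma4:e2b": 7,
    "gemma4:e4b": 10,
    "gemma4:26b": 18,
    "gemma4:31b": 20,
}

def fits(model: str, gpu: str, headroom_gb: float = 4.0) -> bool:
    """True if the model weights plus headroom fit in the GPU's VRAM."""
    return MODEL_DISK_GB[model] + headroom_gb <= GPU_VRAM_GB[gpu]

# e2b and e4b fit on either tier; 26B and 31B need the A100.
print([m for m in MODEL_DISK_GB if fits(m, "T4")])
```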
Real-World Performance on ACA
We benchmarked every model on both GPU tiers using Ollama v0.20 with Q4_K_M quantization and 32K context in Sweden Central:
| Model | GPU | Tokens/sec | TTFT | Notes |
|---|---|---|---|---|
| gemma4:e2b | T4 | ~81 | ~15ms | Fastest on T4 |
| gemma4:e4b | T4 | ~51 | ~17ms | Default T4 choice — best quality/speed |
| gemma4:e2b | A100 | ~184 | ~9ms | Ultra-fast |
| gemma4:e4b | A100 | ~129 | ~12ms | Great for lighter workloads |
| gemma4:26b | A100 | ~113 | ~14ms | Default A100 choice — strong reasoning |
| gemma4:31b | A100 | ~40 | ~30ms | Highest quality, slower |
51 tokens/second on a T4 with the 4B model is fast enough for interactive coding assistance. The 26B model on A100 delivers 113 tokens/second with noticeably better reasoning — ideal for complex refactoring, architecture questions, and multi-file changes.
The 26B and 31B models require A100 — they don't fit in T4's 16 GB VRAM.
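To make the throughput numbers concrete, here's a back-of-envelope estimate of wall-clock response time from tokens/sec and TTFT. Treat it as a rough model only — real latency also depends on prompt length, context size, and concurrent load:

```python
def response_seconds(tokens: int, tok_per_sec: float, ttft_ms: float) -> float:
    """Estimated wall-clock time to stream a full response:
    time-to-first-token plus generation time at steady throughput."""
    return ttft_ms / 1000 + tokens / tok_per_sec

# A 400-token answer using the benchmark figures above:
for label, tps, ttft in [("gemma4:e4b on T4", 51, 17),
                         ("gemma4:26b on A100", 113, 14)]:
    print(f"{label}: ~{response_seconds(400, tps, ttft):.1f}s")
```

At these rates a typical 400-token answer streams in well under ten seconds on either tier, which is why the defaults feel interactive.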
Step 2: Verify Your Endpoint
After azd up completes, the post-provision hook prints your endpoint URL. Test it:
curl -u admin:<YOUR_PASSWORD> \
https://<YOUR_PROXY_ENDPOINT>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:e4b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
You should get a JSON response with Gemma 4's reply. The endpoint is fully OpenAI-compatible — it works with any tool or SDK that speaks the OpenAI API format.
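The same request can be made from Python with only the standard library. This is a sketch against the deployed proxy: the endpoint URL and password placeholders come from your azd output, and the `basic_auth`/`chat` helper names are mine:

```python
import base64
import json
import urllib.request

def basic_auth(user: str, password: str) -> str:
    """Build the HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

def chat(endpoint: str, password: str, prompt: str,
         model: str = "gemma4:e4b") -> str:
    """Send one chat completion through the basic-auth proxy."""
    req = urllib.request.Request(
        f"{endpoint}/v1/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth("admin", password),
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (fill in the values printed by azd up):
# print(chat("https://<YOUR_PROXY_ENDPOINT>", "<YOUR_PASSWORD>", "Hello!"))
```

Because the endpoint is OpenAI-compatible, the official OpenAI SDKs work the same way once pointed at your proxy's `/v1` base URL.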
Step 3: Connect OpenCode
Here's where it gets powerful. OpenCode is a terminal-based AI coding agent — think GitHub Copilot, but running in your terminal and pointing at whatever model backend you choose.
The azd up post-provision hook automatically generates an opencode.json in your project directory with the correct endpoint and credentials. If you need to create it manually:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"gemma4-aca": {
"npm": "@ai-sdk/openai-compatible",
"name": "Gemma 4 on ACA",
"options": {
"baseURL": "https://<YOUR_PROXY_ENDPOINT>/v1",
"headers": {
"Authorization": "Basic <BASE64_OF_admin:YOUR_PASSWORD>"
}
},
"models": {
"gemma4:e4b": {
"name": "Gemma 4 e4b (4B)"
}
}
}
}
}
Generate the Base64 value:
echo -n "admin:YOUR_PASSWORD" | base64
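If base64 isn't available on your platform (stock Windows, for example), Python produces the same value:

```python
import base64

# Equivalent to: echo -n "admin:YOUR_PASSWORD" | base64
print(base64.b64encode(b"admin:YOUR_PASSWORD").decode())
```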
Now run it:
opencode run -m "gemma4-aca/gemma4:e4b" "Write a binary search in Rust"
That command sends your prompt to Gemma 4 running on your ACA GPU, and streams the response back to your terminal. Every token is generated on your infrastructure. Nothing leaves your subscription.
For interactive sessions, launch the TUI:
opencode
Select your model with /models, pick Gemma 4, and start coding. OpenCode supports file editing, code generation, refactoring, and multi-turn conversations — all powered by your private Gemma 4 instance.
The Privacy Case
This matters most for teams that can't send code to external APIs:
- HIPAA-regulated healthcare apps — patient data in code, schema definitions, and test fixtures stays in your Azure subscription
- Financial services — proprietary trading algorithms and risk models never leave your network boundary
- Defense and government — classified or CUI-adjacent codebases get AI assistance without external data processing agreements
- Startups with sensitive IP — your secret sauce stays secret, even while you use AI to build faster
With ACA serverless GPU, you're not running a VM or managing a Kubernetes cluster to get this privacy. It's a managed container with a GPU attached. Azure handles the infrastructure, you own the data boundary.
Clean Up
When you're done:
azd down
This tears down all Azure resources. Since ACA serverless GPU bills only while your containers are running, you can also scale to zero replicas to pause costs without destroying the environment.
Get Started
- 📖 gemma4-on-aca on GitHub — clone it, run azd up, and you're live
- 🤖 OpenCode — the terminal AI agent that connects to your Gemma 4 endpoint
- 📌 Gemma 4 docs — model architecture and capabilities
- 📌 ACA serverless GPU — GPU regions and workload profile details