Reusable Azure architecture pattern for serving diffusion workloads on AKS with strong security, scalability, observability, and CI/CD fundamentals.
Diffusion workloads are simple at prototype scale and unforgiving in production. A single demo can run on one GPU-backed VM, but a real platform has to handle bursty demand, long-running jobs, model artifact distribution, secure public access, rollout safety, and hardware-level observability.
Azure Kubernetes Service (AKS) is a strong fit when the requirement is not just to run a model, but to operate a repeatable platform for GPU inference. The reusable pattern is straightforward: keep the API and control layer on CPU nodes, buffer work through a dispatch layer, run inference on isolated GPU capacity, push results to durable storage, and treat security, telemetry, and deployment automation as first-class platform features.
AKS reference architecture for diffusion workloads
The architecture above shows the core operating model. DNS and an edge layer (Application Gateway with WAF, and optionally Front Door for global entry) route traffic to an AKS CPU pool that hosts the API tier. GPU jobs run on a separate GPU pool, while shared add-ons and CSI drivers run on a system pool. Teams can keep dispatch inside Kubernetes or externalize it through Service Bus plus KEDA. Azure dependencies should be reached over Private Link, with Azure Monitor covering both application and hardware telemetry.
The storage block serves two purposes: durable output storage and, if needed, a shared Hugging Face model cache exposed to GPU pods through PV and PVC mounts.
The reference pattern
The core architecture separates control-plane traffic from GPU execution:
- A lightweight API tier on the CPU node pool receives requests, validates identity, and hands execution work to a dispatch layer.
- That dispatch layer can stay inside Kubernetes using native queueing and controller patterns, or it can publish work to Azure Service Bus for external queue-backed dispatch.
- Scaling can likewise stay AKS-native through cluster and workload autoscaling, or it can use KEDA to react directly to queue backlog.
- GPU work runs on a dedicated GPU node pool, isolated from the API and cluster add-ons.
- GPU workers should mount persistent storage for model caches so Hugging Face assets can survive pod restarts and repeated job submissions.
- Results are stored outside the pod lifecycle in blob-backed storage and returned through a stable status API (a minimal upload sketch follows this list).
- Edge routing, TLS termination, and WAF inspection happen at the ingress layer, while token validation is typically enforced in the API tier or a dedicated upstream auth component.
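To make the result path concrete, here is a minimal sketch of a worker persisting output to blob storage under Microsoft Entra Workload ID; the account URL, container name, and file naming are illustrative assumptions, not fixed choices.

```python
# Minimal sketch: persist a finished job's output to Blob Storage and
# return a stable reference the status API can hand back to callers.
# The account URL, container name, and job naming are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

def store_result(job_id: str, image_bytes: bytes) -> str:
    # DefaultAzureCredential picks up Microsoft Entra Workload ID when
    # the pod runs with a federated service account, so no keys appear here.
    service = BlobServiceClient(
        account_url="https://<storage-account>.blob.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    blob = service.get_blob_client(container="results", blob=f"{job_id}.png")
    blob.upload_blob(image_bytes, overwrite=True)
    return blob.url  # stored against the job record for the status API
```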
This split lets each lane scale on the right signal: request traffic for the API tier, backlog for dispatch, and job demand for GPU workers. It also keeps tuning simpler for cost, latency, and reliability. For single-region deployments, Application Gateway or Application Gateway for Containers is often enough; Azure Front Door becomes more useful for global entry, multi-region failover, or shared edge policy.
In the reference architecture, the CPU pool hosts the externally reachable APIs and control-plane components that submit work into the dispatch layer. The GPU pool hosts the actual model execution components, including short-lived diffusion jobs and longer-lived worker runtimes. A separate system pool hosts shared cluster services such as AGIC and the Secrets Store and Blob CSI drivers, while KEDA is added only when teams choose the Service Bus pattern. That keeps platform plumbing off the application and GPU lanes.
The persistence layer is most useful as a model cache rather than as general-purpose application state. There are two practical ways to back it:
- Node-local persistence keeps the cache close to the GPU worker and is the simplest option when jobs benefit from warm data already present on the same node.
- Azure Storage-backed persistence is more useful when model download times are long enough that keeping artifacts on shared durable storage materially reduces job startup latency (a cache-mount sketch follows this list).
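As a sketch of the shared-cache option, a GPU worker can point Hugging Face Diffusers at the PVC mount so weights persist across pod restarts. The mount path and model ID here are illustrative assumptions; setting the HF_HOME environment variable to the mount achieves the same effect for the whole Hugging Face cache.

```python
# Minimal sketch: point Diffusers at a cache directory backed by a PVC
# mount so model weights survive pod restarts and repeated submissions.
# /mnt/model-cache is an assumed mount path for the shared volume.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    cache_dir="/mnt/model-cache",  # hits the warm cache on repeat jobs
).to("cuda")

image = pipe(prompt="a watercolor skyline at dusk").images[0]
```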
Choose the dispatch model
The important design decision is not which branded queueing technology to use. It is whether the platform needs a fully Kubernetes-native control loop or an explicit external queue with backlog-driven scaling.
The Kubernetes-native option keeps dispatch inside the cluster:
- The API creates or signals internal Kubernetes work objects.
- A Kubernetes-native queue or controller pattern manages admission and dispatch.
- AKS workload autoscaling and cluster autoscaling handle most scale changes.
- This path is simpler when the team wants fewer external dependencies and the workload shape is already well understood (a minimal job-creation sketch follows this list).
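A minimal sketch of that hand-off, assuming the Kubernetes Python client, an in-cluster service account with Job-creation RBAC, and illustrative pool labels, taints, and image names:

```python
# Minimal sketch of the Kubernetes-native path: the API tier creates a
# Job targeted at the GPU pool. Namespace, image, labels, and taint
# values are assumptions to adapt to the actual cluster.
from kubernetes import client, config

def submit_diffusion_job(job_id: str, prompt: str) -> None:
    config.load_incluster_config()  # API pod authenticates via its service account
    container = client.V1Container(
        name="diffusion-worker",
        image="<acr-name>.azurecr.io/diffusion-worker:sha-abc123",
        args=["--job-id", job_id, "--prompt", prompt],
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )
    pod_spec = client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        node_selector={"workload": "gpu"},  # assumed GPU pool label
        tolerations=[client.V1Toleration(   # assumed GPU pool taint
            key="sku", operator="Equal", value="gpu", effect="NoSchedule")],
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"diffusion-{job_id}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=1,
            ttl_seconds_after_finished=3600,  # garbage-collect finished jobs
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="inference", body=job)
```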
The Azure Service Bus plus KEDA option externalizes the control loop:
- The API publishes work to Azure Service Bus.
- Queue consumers or schedulers materialize GPU execution from that queue.
- KEDA scales the scheduling or worker path directly from queue depth.
- This path is better when backlog visibility, queue durability, or burst-driven autoscaling needs to be explicit and independently observable (a queue-publish sketch follows this list).
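A minimal publish sketch, assuming the azure-servicebus SDK with workload identity; the namespace and queue names are placeholders:

```python
# Minimal sketch of the externalized path: the API publishes a work item
# to Service Bus instead of touching the Kubernetes API directly.
import json
from azure.identity import DefaultAzureCredential
from azure.servicebus import ServiceBusClient, ServiceBusMessage

def enqueue_job(job_id: str, prompt: str) -> None:
    credential = DefaultAzureCredential()  # workload identity, no connection strings
    with ServiceBusClient(
        fully_qualified_namespace="<sb-namespace>.servicebus.windows.net",
        credential=credential,
    ) as sb_client:
        sender = sb_client.get_queue_sender(queue_name="diffusion-jobs")
        with sender:
            sender.send_messages(ServiceBusMessage(
                json.dumps({"job_id": job_id, "prompt": prompt}),
                message_id=job_id,  # enables duplicate detection if the queue has it on
            ))
```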
Both models can fit the same AKS platform. The GPU isolation, security boundaries, storage pattern, and observability expectations remain the same.
A simple way to choose is:
- Start with Kubernetes-native dispatch when the team wants the fewest moving parts and the job profile is already predictable.
- Choose Azure Service Bus plus KEDA when durable backlog, explicit queue depth, and burst-driven worker scaling are important operating requirements (a ScaledObject sketch follows this list).
- Consider KAITO or the AI toolchain operator add-on when the primary need is managed serving of supported models rather than custom diffusion job orchestration.
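For the Service Bus path, the backlog-driven scaling piece is typically a KEDA ScaledObject. The sketch below registers one through the Kubernetes Python client; the dict mirrors the YAML CRD, and the deployment name, thresholds, Service Bus namespace, and the workload-identity TriggerAuthentication it references are all assumptions:

```python
# Minimal sketch: register a KEDA ScaledObject so the worker Deployment
# scales on Service Bus queue depth. Names and thresholds are assumptions,
# and the referenced TriggerAuthentication is assumed to use workload identity.
from kubernetes import client, config

config.load_incluster_config()
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "diffusion-worker-scaler", "namespace": "inference"},
    "spec": {
        "scaleTargetRef": {"name": "diffusion-worker"},
        "minReplicaCount": 0,   # scale to zero when the queue is empty
        "maxReplicaCount": 16,  # cap burst scale-out to the GPU pool's bounds
        "triggers": [{
            "type": "azure-servicebus",
            "metadata": {
                "queueName": "diffusion-jobs",
                "namespace": "<sb-namespace>",
                "messageCount": "5",  # target backlog per replica
            },
            "authenticationRef": {"name": "servicebus-workload-identity"},
        }],
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh", version="v1alpha1", namespace="inference",
    plural="scaledobjects", body=scaled_object,
)
```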
Scale by workload lane, not by one generic pool
Not every GPU workload should share the same execution path. Keep short-lived inference, queue-backed workers, and longer-running runtimes in separate lanes so one class does not block another. Where supported, GPU multi-instance configurations can further improve utilization for lighter jobs while leaving full GPUs available for heavier ones.
On AKS, the better pattern is to define separate operating lanes:
- An API admission lane on CPU nodes for authentication, validation, and request submission.
- A scheduling lane that can use either Kubernetes-native queueing with AKS autoscaling or Azure Service Bus with KEDA.
- A GPU execution lane for diffusion jobs and longer-lived worker runtimes.
- Dedicated labels, taints, autoscaling bounds, and dashboards per lane.
Within the GPU execution lane, teams can go one step further and define capacity classes for full-GPU and fractional-GPU jobs. That is useful when some models need the memory and throughput of a whole device, while others can run efficiently on a smaller GPU partition. The NVIDIA device plugin DaemonSet shown in the GPU pool is what advertises those GPU resources (and MIG slices) to the kube-scheduler so pods can request them like any other resource.
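As a small illustration of capacity classes, the difference between a full-GPU job and a fractional job is just the resource name the pod requests once the device plugin advertises MIG slices; the specific slice profile below is an assumption about what the cluster actually exposes.

```python
# Minimal sketch: with the device plugin advertising MIG slices (mixed
# strategy), fractional capacity is requested by resource name exactly
# like a full GPU. The 1g.10gb profile is an assumed partition choice.
from kubernetes import client

full_gpu = client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"})
mig_slice = client.V1ResourceRequirements(
    limits={"nvidia.com/mig-1g.10gb": "1"}
)
```

Pairing each capacity class with its own lane keeps a flood of light jobs from occupying full devices that heavier models need.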
That gives platform teams clean capacity isolation and avoids letting one workload class starve another.
Secure the edge, the workload, and the secret path
GPU platforms should treat security as a day-one requirement, not a later add-on.
At the edge, use DNS with Application Gateway and WAF, and add Front Door when global routing is needed. Store public TLS certificates in Azure Key Vault and project them into the cluster through the Secrets Store CSI Driver so renewals do not require redeployment. For protected APIs, validate Microsoft Entra ID tokens in the service or a dedicated auth layer, and keep health probe endpoints separate from business routes.
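A minimal token-validation sketch for the API tier, assuming PyJWT against the tenant's JWKS endpoint; the tenant ID and audience are placeholders, and a dedicated auth library or upstream component is an equally valid home for this check:

```python
# Minimal sketch: validate a Microsoft Entra ID (v2.0) bearer token in
# the API tier. Tenant ID and audience below are placeholders.
import jwt
from jwt import PyJWKClient

TENANT_ID = "<tenant-id>"
AUDIENCE = "api://<app-client-id>"
jwks = PyJWKClient(
    f"https://login.microsoftonline.com/{TENANT_ID}/discovery/v2.0/keys"
)

def validate_token(bearer_token: str) -> dict:
    # Resolve the signing key from the token's key ID, then verify
    # signature, audience, and issuer in one call.
    signing_key = jwks.get_signing_key_from_jwt(bearer_token)
    return jwt.decode(
        bearer_token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
        issuer=f"https://login.microsoftonline.com/{TENANT_ID}/v2.0",
    )
```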
Inside the cluster, keep authorization scoped tightly:
- Separate system, CPU, and GPU node pools.
- Use namespace boundaries for tenant or environment isolation.
- Give the API only the Kubernetes RBAC it needs to create and monitor jobs, plus Azure permissions only when the external queue option is enabled.
- Prefer Microsoft Entra Workload ID over long-lived credentials for workload access to Azure resources such as Key Vault and Blob Storage, and extend that to Service Bus when the external queue pattern is used.
For operator access, keep the cluster management path separate from the public request path. In the reference architecture, developers come in through Azure Bastion rather than broad direct exposure of cluster endpoints.
For secrets, move away from cluster-local secrets as early as possible. A production-ready path uses Azure Key Vault with the Secrets Store CSI Driver so credentials are not baked into images, manifests, or CI pipelines. If the platform uses Azure Service Bus, queue access should use managed identity as well. Blob-backed result storage should likewise use managed identity and CSI-based integration instead of embedding long-lived credentials into workloads.
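Both halves of that secret path are small in code. The sketch below shows a CSI-mounted secret read as a file and a direct Key Vault read under workload identity; the mount path, vault URL, and secret name are placeholders.

```python
# Minimal sketch of the two secretless patterns described above.
from pathlib import Path
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Pattern 1: the Secrets Store CSI Driver projects the secret as a file
# on an assumed mount path, so the workload never holds vault credentials.
api_key = Path("/mnt/secrets-store/external-api-key").read_text()

# Pattern 2: the workload reads Key Vault directly, with workload
# identity supplying the token instead of an embedded credential.
vault = SecretClient(
    vault_url="https://<vault-name>.vault.azure.net",
    credential=DefaultAzureCredential(),
)
api_key = vault.get_secret("external-api-key").value
```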
For the network path between the cluster and its Azure dependencies, prefer Private Endpoints over public service endpoints. The diagram uses a single PE icon as shorthand for this pattern: in practice, teams usually create private endpoints per service and pair them with private DNS so ACR pulls, Key Vault reads, Blob I/O, and optional Service Bus traffic resolve to private IPs inside the VNet. That keeps platform traffic off the public internet and simplifies firewall and DNS policy.
Observe both the application and the hardware
GPU workloads need two telemetry views: application behavior and hardware behavior. The first tracks request IDs, job IDs, latency, and failures; the second tracks GPU utilization, memory pressure, and device-level performance. Together they show whether the problem is code, dispatch pressure, or hardware saturation.
On Azure, that split maps well to:
- Structured application logs and OpenTelemetry exported to Application Insights (sketched after this list).
- Azure Monitor dashboards that include internal queue pressure or Service Bus backlog, plus AKS autoscale or KEDA scale activity depending on the chosen pattern.
- NVIDIA DCGM exporter metrics scraped into Azure Managed Prometheus and visualized in Azure Managed Grafana.
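On the application side, the wiring can be as small as the following sketch with the Azure Monitor OpenTelemetry distro; the span and attribute names are illustrative:

```python
# Minimal sketch: wire OpenTelemetry to Application Insights, then tag
# spans with the request and job IDs used for correlation.
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor()  # reads APPLICATIONINSIGHTS_CONNECTION_STRING
tracer = trace.get_tracer(__name__)

def handle_submit(request_id: str, job_id: str) -> None:
    with tracer.start_as_current_span("submit-diffusion-job") as span:
        span.set_attribute("app.request_id", request_id)
        span.set_attribute("app.job_id", job_id)
        # ... dispatch the work; latency and failures land on the span ...
```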
This model is what turns raw GPU hosting into an operable platform. Without it, teams can see requests failing but not whether the root cause is code, dispatch saturation, scheduling, or hardware contention.
The diagram reflects that split clearly. Application Insights and Azure dashboards track service and dispatch behavior, while Prometheus, Grafana, the NVIDIA device plugin, and DCGM exporter track cluster and GPU health. That combination is what allows teams to correlate dispatch delay, AKS or KEDA scale-out, execution time, and failure rates with actual GPU utilization and memory pressure.
Keep CI/CD small, secretless, and reversible
The deployment model does not need to be complex to be production-grade. A practical AKS pattern is:
- Pull request validation for code quality, tests, Dockerfiles, and secret scanning.
- Immutable container tags built from the commit SHA.
- GitHub Actions with OpenID Connect and Azure workload identity federation.
- ACR as the image source of truth.
- Environment-based promotion with approval gates for production.
- Rollout verification with Kubernetes health checks and smoke tests.
The key principle is separation of concerns. CI/CD should roll forward application images and validated configuration, not rebuild the whole platform on every deploy. Shared components such as ingress, node pools, identity, storage, monitoring, and optional KEDA or Service Bus should remain under controlled infrastructure change management.
What makes this pattern reusable
This AKS pattern generalizes well beyond one model family or one product surface because it is built on fundamentals:
- Separate API admission from dispatch and GPU execution.
- Choose the dispatch boundary that fits the workload: Kubernetes-native queueing with AKS autoscaling, or Service Bus plus KEDA.
- Isolate workload classes into different scaling lanes.
- Scale worker capacity from the most useful signal for the chosen model: workload pressure inside AKS or external queue backlog through KEDA.
- Put authentication, TLS, and routing at the edge.
- Use workload identity and externalized secrets.
- Instrument both software behavior and GPU behavior.
- Keep deployments automated, traceable, and easy to roll back.
That is the real architecture story. The model can change. The runner can change. Even the queue and gateway choices can evolve. But the engineering fundamentals stay stable, and that stability is what makes diffusion workloads viable at scale.
Possible alternatives: KAITO and the AI toolchain operator add-on
Some teams do not need a fully custom GPU execution platform. On AKS, two adjacent options are KAITO and the AI toolchain operator add-on.
KAITO is the lighter-weight choice for rapid experimentation with supported model presets. The AI toolchain operator add-on is the more managed option for standardized LLM or multimodal serving with AKS-native operational features. Both are less suitable when the platform needs custom diffusion pipelines, queue-backed job orchestration, artifact-heavy workflows, or application-specific dispatch logic.
Reference documents
- AKS node pools: Create node pools for a cluster in Azure Kubernetes Service (AKS)
- Microsoft Entra Workload ID for AKS: Use Microsoft Entra Workload ID with Azure Kubernetes Service (AKS)
- Key Vault integration for AKS: Use the Azure Key Vault provider for Secrets Store CSI Driver in an Azure Kubernetes Service (AKS) cluster
- Azure Blob and other CSI storage drivers on AKS: Use Container Storage Interface (CSI) drivers on Azure Kubernetes Service (AKS)
- AKS GPU multi-instance support: Use multi-instance GPUs in Azure Kubernetes Service (AKS)
- KEDA on AKS: Simplified application autoscaling with Kubernetes Event-driven Autoscaling (KEDA) add-on
- Application Gateway Ingress Controller: What is Application Gateway Ingress Controller?
- AKS monitoring with Azure Monitor, managed Prometheus, and Grafana: Enable monitoring for Azure Kubernetes Service (AKS) clusters
- Application telemetry with Application Insights: Introduction to Application Insights - OpenTelemetry observability
- Hugging Face Diffusers: Diffusers documentation
- NVIDIA DCGM exporter: dcgm-exporter
- NVIDIA device plugin DaemonSet: NVIDIA k8s-device-plugin
Closing thought
Running diffusion models in production is not mainly a model-hosting problem. It is a platform engineering problem with GPUs in the middle. Teams that treat AKS as the control surface for isolation, observability, identity, and repeatable rollout discipline end up with a system that can scale beyond a benchmark and survive real operational demand.