
Apps on Azure Blog

Building an Enterprise Platform for Inference at Scale

bobmital
Microsoft
Mar 17, 2026

This is part 3 of a three-part series. Part 1 covered why inference is hard at enterprise scale. Part 2 walked through the optimization stack — GPU utilization, quantization, continuous batching, disaggregated inference, and model selection. Here we cover the architecture decisions and platform foundation that make those optimizations production-safe: parallelism strategy, deployment topology, security, governance, and the metrics that determine whether your inference operation is actually profitable.

Architecture Decisions

With the optimization stack in place, the next layer of decisions is architectural — how you distribute compute across GPUs, nodes, and deployment environments to match your model size and traffic profile.

GPU Parallelism Strategy on AKS

| Strategy | How It Works | When to Use | Tradeoff |
|---|---|---|---|
| Tensor Parallelism | Splits weight matrices within each layer across GPUs (intra-layer sharding); all GPUs participate in every forward pass | Model exceeds single-GPU memory (e.g., 70B on A100 GPUs once weights, KV cache, and runtime overhead are included) | Inter-GPU communication overhead; requires fast interconnects (NVLink on ND-series) and is costly to scale beyond a single node without them |
| Pipeline Parallelism | Distributes layers sequentially across nodes, with each stage processing part of the model | Model exceeds single-node GPU memory; typically unquantized deployments beyond ~70–100B, depending on node GPU count and memory | Pipeline “bubbles” reduce utilization; unfriendly to small batches |
| Data Parallelism | Replicates the full model across GPUs | Scaling throughput/QPS on AKS node pools | Memory-inefficient (full copy per replica), though it is the only strategy that scales throughput linearly |
| Combined | Tensor parallelism within each node, pipeline parallelism across nodes, data parallelism for throughput scaling | Production at scale on AKS; any model requiring multi-node deployment | Operational complexity, though this is the standard pattern for large deployments |

When quality permits, quantize before introducing distributed sharding: a model that fits on a single GPU or a single node avoids cross-node communication entirely, and that often delivers the best latency and cost profile.
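A back-of-envelope memory estimate shows why quantization can remove the need for sharding at all. This is an illustrative sketch; `estimate_weights_gb` and the 20% runtime-overhead factor are assumptions for the example, not vendor figures:

```python
def estimate_weights_gb(num_params_b: float, bytes_per_param: float,
                        overhead: float = 0.2) -> float:
    """Rough GPU memory needed for model weights plus runtime overhead.

    num_params_b: parameter count in billions
    bytes_per_param: 2.0 for FP16/BF16, 1.0 for INT8, 0.5 for INT4
    overhead: fractional allowance for activations/runtime (assumed 20%)
    Note: excludes KV cache, which grows with batch size and context length.
    """
    weights_gb = num_params_b * bytes_per_param  # 1B params at 1 byte ~= 1 GB
    return weights_gb * (1 + overhead)

SINGLE_GPU_GB = 80  # e.g., one A100 80GB card

fp16 = estimate_weights_gb(70, 2.0)  # 70B at FP16: ~168 GB, needs sharding
int4 = estimate_weights_gb(70, 0.5)  # 70B at INT4: ~42 GB, fits one GPU
print(f"FP16: {fp16:.0f} GB, INT4: {int4:.0f} GB, budget: {SINGLE_GPU_GB} GB")
```

The same arithmetic scales to the single-node case: multiply the budget by the node's GPU count before deciding whether pipeline parallelism is needed.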

In practice, implementing combined parallelism requires coordinating placement of model shards across nodes, managing inter-GPU communication, and ensuring that scaling decisions don't break shard assignments. Anyscale on Azure handles this orchestration layer through Ray's distributed scheduling primitives — specifically placement groups, which allow tensor-parallel shards to be co-located within a node while data-parallel replicas scale independently across node pools. The result is that teams get the throughput benefits of combined parallelism without building and maintaining the scheduling logic themselves.

Deployment Topology

Parallelism strategy determines how you use GPUs inside a deployment. Topology determines where those deployments run.

Cloud (AKS) offers flexibility and elastic scaling across Azure GPU SKUs (ND GB200-v6, ND H100 v5, NC A100 v4). Anyscale on Azure adds managed Ray clusters that run inside the customer’s AKS environment, with Azure billing integration, Microsoft Entra ID integration, and connectivity to Azure storage services.

Edge enables ultra-low latency, avoids per-query cloud inference cost, and supports local data residency, which is critical in environments such as manufacturing, healthcare, and retail.

Hybrid is the pragmatic default for most enterprises. Sensitive data stays local with small quantized models; complex analysis routes to AKS. Azure Arc can extend governance across hybrid deployments.
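The hybrid routing decision reduces to a simple policy function. The sketch below is illustrative only; the `sensitive` and `complexity` fields and the threshold value are assumptions for the example, not a prescribed Azure API:

```python
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    prompt: str
    sensitive: bool    # contains data that must stay on-prem
    complexity: float  # 0..1 estimate, e.g. from a cheap upstream classifier

def route(req: InferenceRequest, complexity_threshold: float = 0.6) -> str:
    """Route to the edge (small quantized model) or to the cloud (AKS).

    Policy: sensitive data never leaves the edge; otherwise, hard requests
    go to the larger cloud-hosted model and easy ones stay local.
    """
    if req.sensitive:
        return "edge"
    return "cloud-aks" if req.complexity >= complexity_threshold else "edge"

print(route(InferenceRequest("summarize patient chart", True, 0.9)))    # edge
print(route(InferenceRequest("draft quarterly analysis", False, 0.8)))  # cloud-aks
print(route(InferenceRequest("classify support ticket", False, 0.2)))   # edge
```

In practice the complexity signal might come from prompt length, task type, or a small router model; the governance point is that the sensitivity check runs before any data leaves the local network.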

 

Across all three deployment patterns (cloud, edge, and hybrid), the operational challenge is consistent: managing distributed inference workloads without fragmenting your control plane. Anyscale on AKS addresses this directly. In pure cloud deployments, it provides managed Ray clusters inside your own Azure subscription, eliminating the need to operate Ray infrastructure yourself. In hybrid architectures, Ray clusters on AKS serve as the cloud leg, with Azure Arc extending Azure RBAC, Azure Policy for governance, and centralized audit logging to Arc-enabled servers and Kubernetes clusters on the edge infrastructure. The result is a single operational model regardless of where inference is actually executing: scheduling, scaling, and observability are handled by Ray, the network boundary stays inside your Azure environment, and the governance layer stays consistent across locations. Teams that would otherwise maintain separate orchestration stacks for cloud and edge workloads can run both through a unified Ray deployment managed by Anyscale.

The Enterprise Platform — Security, Compliance, and Governance on AKS

The optimizations in this series — quantization, continuous batching, disaggregated inference, MIG partitioning — all assume a platform that meets enterprise requirements for security, compliance, and data governance. Without that foundation, none of the performance work matters. A fraud detection model that leaks customer data is not “cost-efficient.” An inference endpoint exposed to the public internet is not “low-latency.” The platform has to be solid before the optimizations can be useful.

Self-hosting inference on AKS provides that foundation. Every inference request — input prompts, output tokens, KV cache, model weights, fine-tuning data — stays inside the customer’s own Azure subscription and virtual network. Data never traverses third-party infrastructure. This eliminates an entire class of data residency and sovereignty concerns that hosted API services cannot address by design.

Network Isolation and Access Control

AKS supports private clusters in which the Kubernetes API server is exposed through Azure Private Link rather than a public endpoint, limiting API-server access to approved private network paths. All traffic between the API server and GPU node pools stays internal. Network Security Groups (NSGs), Azure Firewall, and Kubernetes network policies enforced through Azure CNI powered by Cilium can restrict traffic between pods, namespaces, and external endpoints, enabling micro-segmentation between inference workloads.

Microsoft Entra ID integration with Kubernetes RBAC handles enterprise identity management: SSO, group-based role assignments, and automatic permission updates when team membership changes. Managed identities eliminate credentials in application code. Azure Key Vault stores secrets, certificates, and API keys with hardware-backed encryption.

The Anyscale on Azure integration inherits this entire stack. Workloads run inside the customer’s AKS cluster — with Entra ID authentication, Azure Blob storage connectivity via private endpoints, and unified Azure billing. There is no separate Anyscale-controlled infrastructure to audit or secure.

The Metrics That Determine Profitability

| Metric | What It Measures | Why It Matters |
|---|---|---|
| Tokens/second/GPU | Raw hardware throughput | How much work each GPU can do; supports capacity planning on AKS GPU node pools |
| Tokens/GPU-hour | Unit economics | Tokens generated per Azure VM billing hour; the number your CFO cares about |
| P95 / P99 latency | Tail latency | The experience of slower requests, which matters more than averages in real production systems |
| GPU utilization % | Paid vs. used Azure GPU capacity | Low utilization means paying for expensive GPU capacity that sits idle or underused |
| Output-to-input token ratio | Generation cost ratio | Higher output ratios increase generation time and reduce how many requests each GPU can serve per hour |
| KV cache hit rate | Context reuse efficiency | Low hit rates mean recomputing prior context, which increases latency and cost |

Product design directly affects inference economics. Defaulting to verbose responses when concise ones suffice consumes more GPU cycles per request, reducing how many requests each GPU can serve per hour.
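Tokens per GPU-hour and cost per million tokens fall out of a short calculation. The throughput and VM price below are placeholders for illustration, not quoted Azure rates:

```python
def cost_per_million_tokens(tokens_per_sec_per_gpu: float,
                            gpus_per_vm: int,
                            vm_price_per_hour: float) -> float:
    """Blended generation cost in $ per 1M output tokens for one VM SKU."""
    tokens_per_gpu_hour = tokens_per_sec_per_gpu * 3600
    tokens_per_vm_hour = tokens_per_gpu_hour * gpus_per_vm
    return vm_price_per_hour / tokens_per_vm_hour * 1_000_000

# Hypothetical: 2,500 tok/s/GPU on an 8-GPU VM at $30/hour (placeholder price)
print(f"${cost_per_million_tokens(2500, 8, 30.0):.3f} per 1M output tokens")
```

Because throughput appears in the denominator, every optimization in part 2 (quantization, batching, utilization) lowers this number directly, which is why tokens/GPU-hour is the metric to track as a core unit economic.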

Conclusion

Base model intelligence is increasingly commoditized. Inference efficiency compounds.

Organizations that treat inference as a first-class engineering and financial discipline win. By deliberately managing the accuracy–latency–cost tradeoff and tracking tokens per GPU-hour like a core unit metric, they deploy AI cheaper, scale faster, and protect margins as usage grows.

 

