Apps on Azure Blog

Part 1: Inference at Enterprise Scale: Why LLM Inference Is a Capital Allocation Problem

bobmital
Mar 02, 2026

Microsoft and Anyscale recently announced a strategic partnership that brings Ray — the open-source distributed compute framework powering AI workloads at scale — directly into Azure Kubernetes Service (AKS) as an Azure Native Integration. Azure customers can now provision and manage Anyscale-powered Ray clusters from the Azure Portal, with unified billing and Microsoft Entra ID integration. Workloads run inside the customer's own AKS clusters within their Azure tenant, so you keep full control over your data, compliance posture, and security boundaries.

The serving stack referenced throughout this series is built on two components: Anyscale’s services powered by Ray Serve for inference orchestration and vLLM as the inference engine for high-throughput token generation. Inference — the process of generating output tokens from a trained model — is where enterprise AI investments either compound or collapse. For organizations processing millions of requests daily across copilots, customer-facing assistants, analytics platforms, and agentic workflows, inference is what drives cloud spend and long-term AI unit economics. It is a capital allocation decision, not just an infrastructure one.

This is part one of a three-part series. In this post, we cover the core technical challenges that make inference hard at enterprise scale and how to address them. Part two walks through the optimization stack — including a survey of leading open-source models with a framework for choosing between them — ordered by implementation priority. Part three covers how to build and govern the enterprise platform underneath it all, including a look at how Anyscale on Azure addresses these as an enterprise platform.

One organizing principle ties it all together: inference systems live on a three-way tradeoff between accuracy, latency, and cost — the Pareto frontier of LLMs. You rarely get all three simultaneously, so optimize for two and consciously manage the third. Every architectural decision in this series maps back to that tradeoff while also ensuring the security, compliance, and governance that enterprise deployments can't skip.

Part I: The Challenges — Why Inference Is Hard

Challenge 1: The Pareto Frontier — You Cannot Optimize Everything Simultaneously

Enterprise inference teams run into the same constraint regardless of stack: accuracy, latency, and cost are interdependent. Improving one almost always pressures the others. A larger model improves accuracy but increases latency and GPU costs. A smaller model reduces cost but risks quality degradation. Aggressive optimization for speed sacrifices depth of reasoning.

These pressures play out across three dimensions that define every inference architecture:

Dimension 1: Model quality (accuracy). The baseline capability curve. Larger models, better fine-tuning, and RAG shift you to a higher-quality frontier.

Dimension 2: Throughput per GPU (cost). Tokens per GPU-hour — since self-hosted models on AKS are billed by VM uptime, not per token. Quantization, continuous batching, MIG partitioning, and batch inference all move this number.

Dimension 3: Latency per user (speed). How fast each user gets a response. Speculative decoding, prefix caching, disaggregated prefill/decode, and smaller context windows push this dimension.

In practice, this plays out in two stages. First, you choose the accuracy level your business requires — this is a model selection decision (model size, fine-tuning, RAG, quantization precision). That decision locks you onto a specific cost-latency curve. Second, you optimize along that curve: striving for more tokens per GPU-hour, lower tail latency, or both.

 

The frontier itself isn’t fixed – it shifts outward as your engineering matures. The tradeoffs don't disappear, but they get progressively less painful.

The practical question to anchor on: What is the minimum acceptable accuracy for this business outcome, and how far can I push the throughput-latency frontier at that level?

Once your team has answered that, the table below maps each priority to the engineering levers available.

| Priority | Tradeoff | Engineering Bridges |
| --- | --- | --- |
| Accuracy + Low Latency | Higher cost | Use smaller models to reduce serving cost; recover accuracy with RAG, fine-tuning, and tool use. Quantization cuts GPU memory footprint further. |
| Accuracy + Low Cost | Higher latency | Batch inference, async pipelines, and queue-tolerant architectures absorb the latency gracefully. |
| Low Latency + Low Cost | Accuracy at risk | Smaller or distilled models with quantization; improve accuracy via RAG and fine-tuning. |

 

Challenge 2: Two Phases, Two Bottlenecks

Inference has two computationally distinct phases, each constrained by different hardware resources.

Prefill processes the entire input prompt in parallel, builds the Key-Value (KV) cache and produces the first output token. It is compute-bound — limited by how fast the GPUs can execute matrix multiplications. Time scales with input length. This phase determines Time to First Token (TTFT).

Decode generates output tokens sequentially, one at a time. Each token depends on all prior tokens, so the GPU reads the full KV cache from memory at each step. It is memory-bandwidth-bound — limited by how fast data moves from GPU memory to processor. This phase determines Time Per Output Token (TPOT).

Total Latency = TTFT + TPOT × (Output Token Count − 1)

These bottlenecks don’t overlap. A document classification workload (long input, short output) is prefill-dominated and compute-bound. A content generation workload (short input, long output) is decode-dominated and memory-bandwidth-bound. Optimizing one phase does not automatically improve the other.
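The latency formula above makes the two regimes easy to compare. As a minimal sketch, the TTFT and TPOT figures below are hypothetical, chosen only to illustrate the shapes; measure your own stack before drawing conclusions:

```python
def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Total Latency = TTFT + TPOT x (output_tokens - 1)."""
    return ttft_s + tpot_s * (output_tokens - 1)

# Prefill-dominated: long input drives TTFT up; short output keeps decode cheap.
classification = total_latency_s(ttft_s=1.2, tpot_s=0.02, output_tokens=10)

# Decode-dominated: short input, but hundreds of sequential decode steps.
generation = total_latency_s(ttft_s=0.15, tpot_s=0.02, output_tokens=800)

print(f"classification: {classification:.2f} s")  # 1.38 s, mostly TTFT
print(f"generation:     {generation:.2f} s")      # 16.13 s, mostly decode
```

Note how the same TPOT produces wildly different totals: the classification workload is bounded by prefill compute, while the generation workload is bounded by memory bandwidth during decode.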

That is why advanced inference stacks now disaggregate these phases across different hardware to optimize each independently, a technique covered in depth in part two of this series.

Challenge 3: The KV Cache — The Hidden Cost Driver

Model weights are static — loaded once into GPU VRAM per replica. The KV cache is dynamic: it is allocated at runtime per request and grows linearly with context length, batch size, and number of attention layers. At high concurrency and long context, it is frequently the primary driver of out-of-memory (OOM) failures, often amplified by prefill workspace and runtime overhead.

A 7B-parameter model needs roughly 14 GB for weights in FP16. On an NC A100 v4 node on AKS (A100 80GB per GPU), a single idle replica has plenty of headroom. But KV cache scales with concurrent users. KV cache memory per sequence is determined by:

layers × KV_heads × head_dim × tokens × bytes_per_element × 2 (K and V)

The weights are fixed; the KV cache is where things get unpredictable. For Llama 3 8B, a single 8K-token sequence consumes about 1 GB of KV cache. That sounds manageable, but it compounds: 40 concurrent users at 8K context add ~40 GB. Combined with weights and runtime overhead, you're already at ~58 GB on an 80 GB GPU — and that's before context lengths grow. At 32K tokens per sequence, just 15 concurrent users produce the same KV pressure as 60 users at 8K. At 128K+, a single sequence can stress the GPU on its own. The weights didn't change. KV cache growth drove the failure.
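The per-sequence formula above is easy to wrap in a small calculator. The Llama 3 8B attention shape used below (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16) reflects the model's published configuration; plug in your own model's dimensions:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_element: int = 2) -> int:
    """layers x KV_heads x head_dim x tokens x bytes_per_element x 2 (K and V)."""
    return layers * kv_heads * head_dim * tokens * bytes_per_element * 2

# Llama 3 8B: 32 layers, 8 KV heads, head_dim 128, FP16 (2 bytes).
per_seq_8k = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=8192)
print(per_seq_8k / 2**30)       # 1.0 GiB per 8K-token sequence

# 40 concurrent users at 8K context -- KV cache alone:
print(40 * per_seq_8k / 2**30)  # 40.0 GiB
```

The same function shows why grouped-query attention matters: with 32 KV heads instead of 8, every figure above would be 4× larger.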

 

Context length is the sharpest lever you have. Match it to the workload — don't default to max.

| Context Length | Typical Use Cases | Memory Impact |
| --- | --- | --- |
| 4K–8K tokens | Q&A, simple chat | Low KV cache memory |
| 32K–128K tokens | Document analysis, summarization | Moderate — GPU memory pressure begins |
| 128K+ tokens | Multi-step agents, complex reasoning | KV cache dominates VRAM; drives architecture decisions |

Challenge 4: Agentic AI Multiplies Everything

Agentic workloads fundamentally change the resource profile. A single user interaction with an AI agent can trigger dozens or hundreds of sequential inference calls — planning, executing, verifying, iterating — each consuming context that grows over the session.

Agentic workloads stress every dimension of the Pareto frontier simultaneously: they need accuracy (autonomous decisions carry risk), low latency (multi-step chains compound delays), and cost efficiency (token consumption scales with autonomy duration).
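To see why token consumption scales so sharply, consider a sketch in which each agent step re-prefills the accumulated context before appending its own output (prefix caching, one of the levers noted earlier, mitigates this). All numbers are hypothetical:

```python
def agent_session_tokens(steps: int, tokens_per_step: int, base_context: int) -> int:
    """Total prompt tokens processed across an agent session.

    Each step re-prefills the context built so far, so total prefill
    work grows quadratically with step count, not linearly.
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # prefill the context accumulated so far
        context += tokens_per_step  # plan/tool output appended to the context
    return total

# Single-turn chat vs. a 20-step agent adding 500 tokens per step,
# both starting from a 2,048-token system prompt and task description:
print(agent_session_tokens(1, 500, 2048))   # 2048 -- prefilled once
print(agent_session_tokens(20, 500, 2048))  # 135960 -- ~66x the single turn
```

Twenty steps is not an extreme agent; the quadratic term is what turns a modest per-step budget into a dominant cost driver.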

Challenge 5: GPU Economics — Idle Capacity Is Burned Capital

Production inference traffic is bursty and unpredictable. Idle GPUs equal burned cash. Under-batching means low utilization. Choosing the wrong Azure VM SKU for your workload introduces significant cost inefficiency.

In self-hosted AKS deployments, cost is GPU-hours — you pay for the VM regardless of token throughput. Output tokens are more expensive per token than input tokens because decode is sequential, so generation-heavy workloads require more GPU-hours per request. Product design decisions like response verbosity and default generation length directly affect how many requests each GPU can serve per hour.
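Because the bill is VM-hours, unit economics reduce to a simple ratio: what you pay per hour divided by the tokens you push through in that hour. The VM price and throughput figures below are hypothetical, for illustration only:

```python
def cost_per_million_tokens(vm_hourly_usd: float, gpus_per_vm: int,
                            tokens_per_gpu_hour: float) -> float:
    """Self-hosted unit cost: the VM-hour is the bill; tokens are the denominator."""
    tokens_per_vm_hour = gpus_per_vm * tokens_per_gpu_hour
    return vm_hourly_usd / tokens_per_vm_hour * 1_000_000

# Same hypothetical $14/hr single-GPU VM at two utilization levels --
# throughput, not the sticker price, is what moves unit cost.
print(f"${cost_per_million_tokens(14.0, 1, 500_000):.2f} per 1M tokens")    # under-batched
print(f"${cost_per_million_tokens(14.0, 1, 2_000_000):.2f} per 1M tokens")  # well-batched
```

A 4× improvement in tokens per GPU-hour is a 4× reduction in cost per token, with zero change to the invoice line item.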

Token discipline is cost discipline — not because tokens are priced individually, but because they determine how efficiently you use the GPU-hours you're already paying for.

These five challenges don't operate in isolation — they compound. An agentic workload running at long context on the wrong GPU SKU hits all five simultaneously. Part two of this series walks through the optimization stack that addresses each one, ordered by implementation priority.

 

Continue to Part 2: The LLM Inference Optimization Stack

 

 

 

Updated Mar 02, 2026
Version 2.0