Inference at Enterprise Scale - A Three-Part Series
Part 1: Why LLM Inference Is a Capital Allocation Problem (you are here)
Covers the five core technical challenges that make enterprise inference hard, and the tradeoffs that govern every architectural decision.
Part 2: The LLM Inference Optimization Stack
A survey of leading open-source models and a prioritized optimization framework — from continuous batching and quantization to speculative decoding and disaggregated prefill.
Part 3: Building and Governing the Enterprise Platform
How to build the operational layer underneath the inference stack, including how Anyscale on Azure addresses enterprise requirements for security, compliance, and observability.
Most enterprise AI conversations focus on model selection and fine-tuning. The harder problem — and the one that determines whether AI investments produce returns or just costs — is inference: serving those models reliably, at scale, under real production load. For organizations running millions of daily requests across copilots, analytics pipelines, and agentic workflows, inference is what drives cloud spend. It is not purely an infrastructure decision. At scale, it becomes a capital allocation decision.
Microsoft and Anyscale recently announced a strategic partnership that brings Ray — the open-source distributed compute framework powering AI workloads at scale — directly into Azure Kubernetes Service (AKS) as an Azure Native Integration. Azure customers can provision and manage Anyscale-powered Ray clusters from the Azure Portal, with unified billing and Microsoft Entra ID integration. (Sign up for private preview). Workloads run inside the customer's own AKS clusters within their Azure tenant, so you keep full control over your data, compliance posture, and security boundaries.
The serving stack referenced throughout this series is built on two components: Anyscale’s services powered by Ray Serve for inference orchestration and vLLM as the inference engine for high-throughput token generation.
One organizing principle ties all three parts together: inference systems live on a three-way tradeoff between accuracy, latency, and cost — the Pareto frontier of LLMs. Pick two; engineer around the third. Improving one dimension almost always requires tradeoffs in another. A larger model improves accuracy but increases latency and GPU costs. A smaller model reduces cost but risks quality degradation. Optimizing aggressively for speed can sacrifice reasoning depth. The frontier itself is not fixed — it shifts outward as your engineering matures — but the tradeoffs never disappear. Every architectural decision in this series maps back to that constraint, alongside the security, compliance, and governance requirements enterprise deployments cannot skip.
The Challenges — Why Inference Is Hard
Challenge 1: The Pareto Frontier — You Cannot Optimize Everything Simultaneously
Enterprise inference teams run into the same constraint regardless of stack: accuracy, latency, and cost are interdependent. These pressures play out across three dimensions that define every inference architecture:
Dimension 1: Model quality (accuracy). The baseline capability curve. Larger models, better fine-tuning, and RAG shift you to a higher-quality frontier.
Dimension 2: Throughput per GPU (cost). Tokens per GPU-hour — since self-hosted models on AKS are billed by VM uptime, not per token. Quantization, continuous batching, MIG partitioning, and batch inference all move this number.
Dimension 3: Latency per user (speed). How fast each user gets a response. Speculative decoding, prefix caching, disaggregated prefill/decode, and smaller context windows push this dimension.
In practice, this plays out in two stages. First, you choose the accuracy level your business requires — this is a model selection decision (model size, fine-tuning, RAG, quantization precision). That decision locks you onto a specific cost-latency curve. Second, you optimize along that curve: striving for more tokens per GPU-hour, lower tail latency, or both.
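To make the cost dimension concrete, here is a back-of-envelope sketch of effective cost per token for a self-hosted deployment billed by VM uptime. The VM rate and throughput figures are illustrative assumptions, not quoted Azure prices:

```python
def cost_per_million_tokens(vm_hourly_usd: float, tokens_per_second: float) -> float:
    """Effective $/1M tokens when you pay for VM uptime, not per token."""
    tokens_per_hour = tokens_per_second * 3600
    return vm_hourly_usd / tokens_per_hour * 1_000_000

# Illustrative numbers only: a hypothetical $15/hr GPU VM sustaining
# 2,500 output tokens/sec aggregated across all batched requests.
print(round(cost_per_million_tokens(15.0, 2500), 2))  # ~1.67 $/1M tokens
```

The same VM at half the sustained throughput costs twice as much per token, which is why the throughput levers in part two translate directly into dollars.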
As noted above, the frontier shifts outward as your engineering matures: the tradeoffs don't disappear, but they get progressively less painful.
The practical question to anchor on: What is the minimum acceptable accuracy for this business outcome, and how far can I push the throughput-latency frontier at that level?
When your team has answered that, the table below maps directly to engineering levers available.
| Priority | Tradeoff | Engineering Bridges |
| --- | --- | --- |
| Accuracy + Low Latency | Higher cost | Use smaller models to reduce serving cost; recover accuracy with RAG, fine-tuning, and tool use. Quantization cuts GPU memory footprint further. |
| Accuracy + Low Cost | Higher latency | Batch inference, async pipelines, and queue-tolerant architectures absorb the latency gracefully. |
| Low Latency + Low Cost | Accuracy at risk | Smaller or distilled models with quantization; improve accuracy via RAG and fine-tuning. |
Challenge 2: Two Phases, Two Bottlenecks
Inference has two computationally distinct phases, each constrained by different hardware resources.
Prefill processes the entire input prompt in parallel, builds the Key-Value (KV) cache and produces the first output token. It is compute-bound — limited by how fast the GPUs can execute matrix multiplications. Time scales with input length. This phase determines Time to First Token (TTFT).
Decode generates output tokens sequentially, one at a time. Each token depends on all prior tokens, so the GPU reads the full KV cache from memory at each step. It is memory-bandwidth-bound — limited by how fast data moves from GPU memory to processor. This phase determines Time Per Output Token (TPOT).
Total Latency = TTFT + TPOT × (Output Token Count − 1)
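The formula can be sketched directly (variable names and timing values below are illustrative, not measured):

```python
def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """TTFT covers prefill plus the first output token; each remaining
    token adds one decode step of TPOT."""
    return ttft_s + tpot_s * (output_tokens - 1)

# 0.4 s to first token, 25 ms per subsequent token, 200-token response:
print(total_latency_s(0.4, 0.025, 200))  # 5.375 s end-to-end
```

Note how decode dominates for long responses: here TTFT is under 8% of total latency, which is why TPOT optimizations matter most for generation-heavy workloads.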
These bottlenecks don’t overlap. A document classification workload (long input, short output) is prefill-dominated and compute-bound. A content generation workload (short input, long output) is decode-dominated and memory-bandwidth-bound. Optimizing one phase does not automatically improve the other.
That is why advanced inference stacks now disaggregate these phases across different hardware to optimize each independently, a technique covered in depth in part two of this series.
Challenge 3: The KV Cache — The Hidden Cost Driver
Model weights are static, loaded once into GPU VRAM per replica. The KV cache is dynamic: it is allocated at runtime per request and grows linearly with context length, batch size, and number of attention layers. At high concurrency and long context, it is often the primary driver of out-of-memory (OOM) failures, amplified by prefill workspace and runtime overhead. In practice, LLM serving capacity is constrained less by model size and more by KV cache growth driven by context length and concurrency.
Consider Llama 3 8B: at ~8B parameters, its weights require roughly 16 GB in FP16. On an A100 80GB GPU (e.g., an Azure NC A100 v4 node on AKS), a single low-concurrency replica has plenty of headroom. The KV cache is where things get unpredictable, because it scales with concurrent users. Total KV cache memory is determined by:

KV_bytes_total = 2 (K and V) × batch_size × num_layers × num_KV_heads × head_dim × tokens_per_sequence × bytes_per_element
For Llama-3 8B, which uses Grouped Query Attention (GQA), an 8K-token sequence consumes roughly ~1 GB of KV cache in FP16/BF16.
This compounds quickly:
- 40 concurrent 8K-token requests → ~40 GB KV cache
- Add model weights (~16 GB) and runtime overhead, and total memory usage can approach ~60 GB+ on an 80 GB GPU, leaving limited headroom.
Because KV memory scales linearly with tokens:
- 15 users at 32K context create roughly the same KV pressure as 60 users at 8K.
- At 128K+ context lengths, even a single long-running sequence can materially reduce safe concurrency.
If a replica OOMs under this kind of load, the model weights never changed; KV cache growth drove the failure.
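The arithmetic above can be reproduced with the KV-cache formula, using Llama-3 8B's GQA configuration as given in the public model card (32 layers, 8 KV heads, head dimension 128, 2 bytes per element in FP16/BF16); this is a sketch, not a measurement of a running system:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_element: int = 2) -> int:
    """KV cache size for one sequence: K and V tensors at every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * tokens * bytes_per_element

# Llama-3 8B (GQA): 32 layers, 8 KV heads, head_dim 128, FP16
per_8k_seq = kv_cache_bytes(32, 8, 128, 8192)
print(per_8k_seq / 2**30)        # 1.0 GiB per 8K-token sequence
print(40 * per_8k_seq / 2**30)   # 40.0 GiB for 40 concurrent 8K requests
```

Without GQA (32 KV heads instead of 8), the same sequence would consume 4 GiB, which is why attention architecture is itself a serving-capacity decision.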
Context Length Is the Sharpest Lever
Context length is often the largest controllable driver of GPU memory consumption. Rather than defaulting to the maximum context window supported by a model, systems should match context size to the workload.
| Context Length | Typical Use Cases | Memory Impact |
| --- | --- | --- |
| 4K–8K tokens | Q&A, simple chat | Low KV cache memory |
| 32K–128K tokens | Document analysis, summarization | Moderate; GPU memory pressure begins |
| 128K+ tokens | Multi-step agents, complex reasoning | KV cache dominates VRAM; drives architecture decisions |
Context length directly controls KV cache growth, which is often the main cause of GPU memory exhaustion. Increasing context length reduces the number of concurrent sequences a GPU can safely serve, often more sharply than any other single variable. Controlling it at the application layer, through chunking, retrieval (RAG), or enforced limits, is one of the highest-leverage interventions available before reaching for more hardware.
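A minimal sketch of application-layer context enforcement. The whitespace token count is a deliberate placeholder; a real system would substitute the model's own tokenizer:

```python
def enforce_context_budget(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit within a token budget,
    dropping the oldest first. Token counting here is a naive
    whitespace split; swap in the serving model's real tokenizer."""
    kept, used = [], 0
    for msg in reversed(messages):   # walk newest-first
        n = len(msg.split())         # placeholder token count
        if used + n > max_tokens:
            break
        kept.append(msg)
        used += n
    return list(reversed(kept))

history = ["a b c", "d e f g", "h i"]
print(enforce_context_budget(history, 6))  # ['d e f g', 'h i']
```

Production variants summarize or retrieve over the dropped history rather than discarding it, but the core idea is the same: the application, not the model's maximum window, decides how much context each request is allowed to carry.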
Challenge 4: Agentic AI Multiplies Everything
Agentic workloads fundamentally change the resource profile. A single user interaction with an AI agent can trigger dozens or hundreds of sequential inference calls — planning, executing, verifying, iterating — each consuming context that grows over the session.
Agentic workloads stress every dimension of the Pareto frontier simultaneously: they need accuracy (autonomous decisions carry risk), low latency (multi-step chains compound delays), and cost efficiency (token consumption scales with autonomy duration).
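A toy sketch of the multiplication effect: when each agent step re-submits the full accumulated context, total prefill work grows quadratically with session length. Step counts and token sizes below are illustrative assumptions:

```python
def session_prefill_tokens(num_steps: int, tokens_added_per_step: int,
                           base_context: int) -> int:
    """Total prompt tokens prefilled across a session where every step
    re-submits the whole accumulated context."""
    total, context = 0, base_context
    for _ in range(num_steps):
        total += context                  # each call prefills everything so far
        context += tokens_added_per_step  # the session context keeps growing
    return total

print(session_prefill_tokens(1, 500, 2000))   # single chat turn: 2000 tokens
print(session_prefill_tokens(20, 500, 2000))  # 20-step agent: 135000 tokens
```

A 20-step agent session here processes ~67× the prompt tokens of a single chat turn, which is why techniques like prefix caching matter so much for agentic workloads.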
Challenge 5: GPU Economics — Idle Capacity Is Burned Capital
Production inference traffic is bursty and unpredictable. Idle GPUs equal burned cash. Under-batching means low utilization. Choosing the wrong Azure VM SKU for your workload introduces significant cost inefficiency.
In self-hosted AKS deployments, cost is GPU-hours — you pay for the VM regardless of token throughput. Output tokens are more expensive per token than input tokens because decode is sequential, so generation-heavy workloads require more GPU-hours per request. Product design decisions like response verbosity and default generation length directly affect how many requests each GPU can serve per hour.
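A rough sketch of how response length alone moves per-GPU capacity. This is a sequential single-request view; continuous batching multiplies the absolute numbers, but the relative effect of verbosity holds. All timing figures are illustrative:

```python
def requests_per_gpu_hour(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Sequential requests one GPU can serve per hour at a given verbosity."""
    return 3600 / (ttft_s + tpot_s * (output_tokens - 1))

# Same model, same hardware; only the default response length changes.
print(round(requests_per_gpu_hour(0.4, 0.025, 100)))  # concise responses: ~1252/hr
print(round(requests_per_gpu_hour(0.4, 0.025, 400)))  # verbose responses: ~347/hr
```

Quadrupling default response length cuts per-GPU capacity by roughly 3.6× here, which is the sense in which product decisions about verbosity are infrastructure decisions.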
Token discipline is cost discipline — not because tokens are priced individually, but because they determine how efficiently you use the GPU-hours you’re already paying for.

These five challenges don't operate in isolation — they compound. An agentic workload running at long context on the wrong GPU SKU hits all five simultaneously.
Part two of this series walks through the optimization stack that addresses each one, ordered by implementation priority. Continue to Part 2: The LLM Inference Optimization Stack (to be published soon).