This is part 2 of a three-part series on enterprise LLM inference. Part 1 covered why inference is hard — the Pareto frontier, the two-phase bottleneck, KV cache pressure, agentic workloads, and GPU economics. Here we walk through the optimization stack, ordered by implementation priority, starting with the highest-leverage changes. Part 3 covers the platform, security, and governance layer underneath.
The Solutions — An Optimization Stack for Enterprise Inference
The Three-Layer Serving Stack
Most enterprise LLM deployments operate across three layers, each responsible for a different part of the inference pipeline. Understanding which layer a bottleneck belongs to is often the fastest path to improving inference performance.
- Azure Kubernetes Service (AKS) orchestrates the infrastructure — GPU nodes, networking, and container lifecycle.
- Ray Serve provides the distributed model serving layer — handling request routing, autoscaling, batching, replica placement, and multi-model serving.
- Inference engines such as vLLM execute the model forward passes and implement token-generation optimizations such as continuous batching and KV-cache management.
In simple terms: AKS manages infrastructure, Ray Serve manages inference workloads, and vLLM generates tokens.
With that architecture in mind, we can examine the optimization stack.
1. GPU Utilization: Maximize What You Already Have
Before optimizing models or inference engines, start here: are you fully utilizing the GPUs you’re already paying for? For most enterprise deployments, the answer is no. GPU utilization below 50% means you’re effectively paying double for every token generated.
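The cost impact of low utilization is easy to quantify. A minimal sketch, using an illustrative hourly GPU rate and peak throughput (both hypothetical numbers, not Azure pricing):

```python
def effective_cost_per_m_tokens(gpu_hourly_usd: float,
                                peak_tokens_per_sec: float,
                                utilization: float) -> float:
    """Effective $ per million tokens when the GPU sits partly idle.

    Illustrative model: you pay for the full hour, but only
    `utilization` of that hour actually produces tokens.
    """
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical figures: $7/hr GPU, 2,500 tok/s at full load.
at_40 = effective_cost_per_m_tokens(7.0, 2500, 0.40)  # before tuning
at_80 = effective_cost_per_m_tokens(7.0, 2500, 0.80)  # after tuning
print(f"40% util: ${at_40:.2f}/M tokens, 80% util: ${at_80:.2f}/M tokens")
```

Doubling utilization halves the effective cost per token, which is exactly the "paying double" effect described above.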
Autoscaling on inference-specific signals. Autoscaling should be driven by request queue depth, GPU utilization, and P95 latency — not generic CPU or memory metrics, which are poor proxies for LLM serving load. AKS supports GPU-enabled node pools with cluster autoscaler integration across NC-series (A100, H100) and ND-series VMs. Scale to zero during idle periods; scale up based on token-level demand, not container-level metrics.
Inference-aware orchestration. AKS orchestrates infrastructure resources such as GPU nodes, pods, and containers. Ray Serve operates one layer above as the inference orchestration framework, managing model replicas, request routing, autoscaling, streaming responses, and backpressure handling, while inference engines like vLLM perform continuous batching and KV-cache management. The distinction matters because LLM serving load doesn't express well in CPU or memory metrics; Ray Serve operates at the level of tokens and requests, not containers.
Anyscale Runtime reports faster performance and lower compute cost than self-managed Ray OSS on selected workloads, though gains depend on workload and configuration.
Right-sizing Azure GPU selection.
The default instinct when deploying GenAI in production is often to grab the biggest, fastest hardware available. For inference, that is often the wrong call. For structured output tasks, a well-optimized, quantized 7B model running on an NCads H100 v5 (H100 NVL 94GB) or an NC A100 v4 (A100 80GB) node can easily outperform a generalized 70B model on a full ND allocation, at a fraction of the cost. New deployments should target NCads H100 v5.
The secret to cost-effective inference is matching your VM SKU to your workload's specific bottleneck. For compute-heavy prefill phases or massive multi-GPU parallelism, the ND H100 v5's ultra-fast interconnects are unmatched. However, autoregressive token generation (decode) is primarily bound by memory bandwidth. For single-GPU, decode-heavy workloads, the NCads series is the better fit: the H100 NVL 94GB has higher published HBM bandwidth (3.9 TB/s) than the H100 80GB (3.35 TB/s).
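The decode ceiling follows from a simple roofline argument: each generated token must stream every model weight through HBM once, so single-sequence throughput is bounded by bandwidth divided by weight footprint. A rough sketch (the 70B FP8 figures are illustrative; real throughput is lower once KV-cache reads and kernel overhead are included):

```python
def decode_tokens_per_sec_upper_bound(params_b: float,
                                      bytes_per_param: float,
                                      hbm_bandwidth_tb_s: float) -> float:
    """Roofline-style ceiling for single-sequence decode throughput.

    Each decode step reads all weights from HBM once, so the ceiling
    is bandwidth / weight footprint. Ignores KV-cache traffic and
    kernel overhead, so real numbers come in lower.
    """
    weight_bytes = params_b * 1e9 * bytes_per_param
    return hbm_bandwidth_tb_s * 1e12 / weight_bytes

# 70B model in FP8 (1 byte/param): H100 NVL (3.9 TB/s) vs H100 80GB (3.35 TB/s)
nvl = decode_tokens_per_sec_upper_bound(70, 1, 3.9)
sxm = decode_tokens_per_sec_upper_bound(70, 1, 3.35)
print(f"H100 NVL ceiling: {nvl:.0f} tok/s, H100 80GB ceiling: {sxm:.0f} tok/s")
```

The ratio of the two ceilings tracks the bandwidth ratio directly, which is why the NVL part wins for decode-heavy, single-GPU serving.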
ND H100 v5 remains the right choice when you need multi-GPU sharding, high aggregate throughput, or tightly coupled scale-out inference. You can extend utilization further with MIG partitioning to host multiple small models on a single NVL card, provided your application can tolerate the proportional drop in memory bandwidth per slice.
2. GPU Partitioning: MIG and Fractional GPU Allocation on AKS
For smaller models or moderate-concurrency workloads, dedicating an entire GPU to a single model replica wastes resources. Two techniques address this on AKS.
NVIDIA Multi-Instance GPU (MIG) partitions a single physical GPU into up to seven hardware-isolated instances, each with its own compute cores, memory, cache, and memory bandwidth. Each instance behaves as a standalone GPU with no code changes required. On AKS, MIG is supported on Standard_NC40ads_H100_v5, Standard_ND96isr_H100_v5, and A100 GPU VM sizes, configured at node pool creation using the --gpu-instance-profile parameter (e.g., MIG1g, MIG3g, MIG7g).
Fractional GPU allocation in Ray Serve is a scheduling and placement mechanism, not hardware partitioning. By assigning fractional GPU resources (say, 0.5 GPU per replica) through Ray placement groups, multiple model replicas can share a single physical GPU. Ray Serve propagates the configured fraction to the serving worker (i.e. vLLM), but, unlike MIG, replicas still share the same underlying GPU memory and memory bandwidth. There’s no hard isolation.
Because fractional allocation does not enforce hard VRAM limits, it requires careful memory management: conservative gpu_memory_utilization configuration, controlled concurrency and context length, and enough headroom for KV cache growth, CUDA overhead, and allocator fragmentation. It works best when model weights are relatively small, concurrency is predictable and moderate, and replica counts are stable.
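A conservative capacity check along these lines can be sketched as follows (all sizes and the 90% usable-VRAM cap are illustrative assumptions, not vLLM internals):

```python
def replica_fits(gpu_vram_gb: float, n_replicas: int, weight_gb: float,
                 kv_budget_gb: float, overhead_gb: float = 2.0,
                 utilization_cap: float = 0.90) -> bool:
    """Conservative check that N co-located replicas fit on one GPU.

    Each replica needs weights + a KV-cache budget + CUDA/allocator
    overhead; the total must stay under the usable fraction of VRAM.
    Illustrative accounting only, not an engine-level guarantee.
    """
    per_replica = weight_gb + kv_budget_gb + overhead_gb
    usable = gpu_vram_gb * utilization_cap
    return n_replicas * per_replica <= usable

# Two 7B FP16 replicas (~14 GB weights each) on a 94 GB H100 NVL:
print(replica_fits(94, 2, 14.0, kv_budget_gb=20.0))  # fits with headroom
print(replica_fits(94, 2, 14.0, kv_budget_gb=30.0))  # KV budget too large
```

Because nothing enforces these budgets at the hardware level, the check is only as good as the concurrency and context-length limits you actually impose upstream.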
For stronger isolation and guaranteed memory partitioning, use NVIDIA MIG. Fractional allocation is best treated as a GPU packing optimization, not an isolation mechanism.
3. Quantization: The Fastest Path to Cost Reduction
Quantization reduces the numerical precision of model weights, activations, and KV cache entries to shrink memory footprint and increase throughput. FP16 → INT8 roughly halves memory; 4-bit quantization cuts it by approximately 4×.
Post-Training Quantization (PTQ) is the fastest path to production gains. As one example, Llama-3.1-70B-Instruct reduces weight memory from ~140 GB in BF16 to ~70 GB in FP8, which can make single-GPU deployment feasible on an 80GB GPU for low-concurrency or short-context workloads. Production feasibility still depends on KV cache size, engine overhead, and concurrency, so careful capacity planning is required.
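The footprint arithmetic behind these numbers is straightforward:

```python
def weight_footprint_gb(params_b: float, bits: int) -> float:
    """Weight memory in GB for a dense model at a given precision."""
    return params_b * 1e9 * bits / 8 / 1e9

for bits, name in [(16, "BF16"), (8, "FP8/INT8"), (4, "INT4")]:
    print(f"Llama-3.1-70B @ {name}: {weight_footprint_gb(70, bits):.0f} GB")
```

Note that this covers weights only; the KV cache and engine overhead sit on top, which is why the 70 GB FP8 figure is feasible on an 80GB GPU only at low concurrency or short context.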
4. Inference Engine Optimizations in vLLM
Modern inference engines — particularly vLLM, which powers Anyscale’s Ray Serve on AKS — implement several optimizations that compound to deliver significant throughput improvements.
Continuous batching replaces static batching, where the system waits for all requests in a batch to complete before accepting new ones. With continuous batching, new requests join at every decode iteration, keeping GPUs more fully utilized. Anyscale has demonstrated up to 23x throughput improvement using continuous batching versus static batching (measured on OPT-13B on A100 40GB with varying concurrency levels). In practice, this can push GPU utilization from 30–40% to 80%+ on AKS GPU node pools.
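The waste that continuous batching eliminates is easy to model: under static batching, every slot stays occupied until the longest request in the batch finishes. A toy calculation (the request lengths are made up):

```python
def static_batch_slot_utilization(output_lens: list[int]) -> float:
    """Fraction of GPU batch slots doing useful work under static batching.

    The whole batch runs until the longest request finishes, so short
    requests leave their slot idle for the remaining decode steps.
    """
    steps = max(output_lens)
    useful = sum(output_lens)
    return useful / (steps * len(output_lens))

# Mixed-length batch: one long request pins the batch open.
lens = [20, 40, 60, 512]
util = static_batch_slot_utilization(lens)
print(f"static batching slot utilization: {util:.0%}")
# Continuous batching backfills freed slots at every decode step,
# so slot utilization approaches 100% under sustained load.
```

With realistic long-tail output lengths, static batching leaves most slots idle most of the time, which is where the large measured throughput gaps come from.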
PagedAttention manages KV cache allocation the way an operating system manages RAM — breaking it into small, non-contiguous pages to eliminate fragmentation. Naive KV cache allocation wastes significant reserved memory through internal and external fragmentation. PagedAttention eliminates this, enabling more concurrent requests per GPU. Enabled by default in vLLM.
Prefix caching automatically stores the KV cache of completed requests in a global on-GPU cache. When new requests share common prefixes — system prompts, shared context in RAG — vLLM reuses cached state instead of recomputing it, reducing TTFT and compute load. Anyscale’s PrefixCacheAffinityRouter extends this by routing requests with similar prefixes to the same replica, maximizing cache hit rates across AKS pods.
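The routing idea can be sketched with a toy hash-based router; this is an illustration of prefix affinity, not Anyscale's actual PrefixCacheAffinityRouter implementation:

```python
import hashlib

def route_by_prefix(prompt: str, n_replicas: int, prefix_chars: int = 256) -> int:
    """Toy prefix-affinity router (illustrative only).

    Requests sharing a long common prefix (e.g. the same system prompt)
    hash to the same replica index, keeping that replica's on-GPU
    prefix cache hot for the shared segment.
    """
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_replicas

system = "You are a helpful enterprise assistant. " * 10  # shared system prompt
r1 = route_by_prefix(system + "Summarize this contract.", 4)
r2 = route_by_prefix(system + "Draft a status update.", 4)
print(r1 == r2)  # same replica, so the cached prefix is reused
```

A production router would hash token blocks rather than raw characters and balance load across replicas, but the affinity principle is the same.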
Chunked prefill breaks large prefill operations into smaller chunks and interleaves them with decode steps. Without it, a long incoming prompt can stall all ongoing decode operations. Chunked prefill keeps streaming responses smooth even when new long prompts arrive, and improves GPU utilization by mixing compute-bound prefill chunks with memory-bound decode. Enabled by default in vLLM V1.
Speculative decoding addresses the sequential decode bottleneck directly. A smaller, faster “draft” model proposes multiple tokens ahead; the larger “target” model verifies them in parallel in a single forward pass. When the draft predicts correctly — which is frequent for routine language patterns — multiple tokens are generated in one step. Output quality is identical because every token is verified by the target model. Particularly effective for code completion, where token patterns are highly predictable.
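Under a simple i.i.d. acceptance model, the expected speedup is easy to compute: with acceptance probability p per drafted token and k drafted tokens, each target pass emits (1 - p^(k+1)) / (1 - p) tokens on average. A sketch (the acceptance rate is a made-up example):

```python
def expected_tokens_per_step(p_accept: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Counts the accepted run of drafted tokens plus the one token the
    target model itself produces on the first rejection. Assumes
    independent per-token acceptance, which is a simplification.
    """
    k = draft_len
    return (1 - p_accept ** (k + 1)) / (1 - p_accept)

# Hypothetical 80% acceptance with a 4-token draft:
print(f"{expected_tokens_per_step(0.8, 4):.2f} tokens/step vs 1.00 baseline")
```

At zero acceptance the formula degrades gracefully to one token per step, i.e. plain autoregressive decoding.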
5. Disaggregated Prefill and Decode
Since prefill is compute-bound and decode is memory-bandwidth-bound, running both on the same GPU forces a compromise — the hardware is optimized for neither. Disaggregated inference separates these phases across different hardware resources.
vLLM supports disaggregated prefill and decode and Ray Serve can orchestrate separate worker pools for each phase. In practice, this means Ray Serve routes each incoming request to a prefill worker first, then hands off the resulting KV cache to a dedicated decode worker — without the application layer needing to manage that handoff. This capability is evolving and should be validated against your Ray and vLLM versions before deploying to production. With MIG or separate node pools, prefill and decode resources can be isolated to better match each phase’s hardware requirements.
For large-scale deployments, Azure ND GB200-v6 VMs (each VM includes four NVIDIA GB200 Tensor Core GPUs on the Blackwell architecture) are designed for rack-scale NVLink connectivity (GB200 NVL72), enabling high-bandwidth GPU-to-GPU communication that disaggregated prefill/decode architectures depend on for KV-cache movement.
6. Multi-LoRA Adapters: Serve Many Use Cases from One Deployment
Fine-tuned Low-Rank Adaptation (LoRA) adapters for different domains can share a single base model in GPU memory, with lightweight task-specific layers swapped at inference time. Legal, HR, finance, and engineering copilots are served from one AKS GPU deployment instead of four separate ones.
This is a direct cost multiplier: instead of provisioning N separate model deployments for N departments, you provision one base model and swap adapters per request. Ray Serve and vLLM both support multi-LoRA serving on AKS.
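The memory arithmetic behind that multiplier, with illustrative adapter sizes (LoRA adapters are typically a small fraction of the base-model weights):

```python
def multi_lora_vram_gb(base_weights_gb: float, n_adapters: int,
                       adapter_gb: float = 0.2) -> float:
    """VRAM for one shared base model plus N LoRA adapters.

    Adapter size is an illustrative assumption; actual size depends
    on rank and which layers the adapter targets.
    """
    return base_weights_gb + n_adapters * adapter_gb

base = 16.0  # e.g. an 8B model in FP16
shared = multi_lora_vram_gb(base, 4)   # legal, HR, finance, engineering
separate = 4 * base                    # four full fine-tuned deployments
print(f"multi-LoRA: {shared:.1f} GB vs separate deployments: {separate:.0f} GB")
```

The savings compound with the batching optimizations above, since all adapters share one continuous-batching queue and one KV-cache pool.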
Open-Source Models for Enterprise Inference
The open-source model ecosystem has matured to the point where self-hosted inference on open-weight models, running on AKS with Ray Serve and vLLM, is a viable and often preferable alternative to proprietary API access. The strategic advantages are significant: full control over data residency and privacy (workloads run inside your Azure subscription), no per-token API fees (cost shifts to Azure GPU infrastructure), the ability to fine-tune and distill for domain-specific accuracy, no vendor lock-in, and predictable cost structures that don’t scale with usage volume.
Leading Open-Source Model Families
Meta Llama (Llama 3.1, Llama 4) is the most widely adopted open-weight model family. Llama 3.1 offers dense models from 8B to 405B parameters; Llama 4 introduces MoE variants. Strong general-purpose performance with native vLLM integration. The 70B variant offers a reasonable balance of quality to serving cost for most enterprise use cases. Available under Meta’s community license.
Qwen (Alibaba) excels in multilingual and reasoning tasks. Qwen3-235B is a MoE model activating roughly 22B parameters per token — delivering frontier-class quality at a fraction of dense-model inference cost. Strong on code, math, and structured output. Apache 2.0 license on most variants.
Mistral models are optimized for efficiency and inference speed. Mistral 7B remains one of the highest-performing models at its size class, making it well-suited for cost-sensitive, high-throughput deployments on smaller Azure GPU SKUs. Mixtral 8x22B provides MoE-based quality scaling. Mistral Large (123B) competes with frontier proprietary models. Licensing varies: most smaller models are Apache 2.0, while some larger releases use research or commercial licensing terms. Verify the license for the specific model prior to production deployment.
DeepSeek (DeepSeek AI) introduced aggressive MoE architectures with cost-efficient training. DeepSeek-V3 (671B total, 37B active per token) delivers strong reasoning quality at significantly lower per-token inference cost than dense models of comparable capability. Strong on math, code, and multilingual tasks. DeepSeek models are developed by a Chinese AI research lab. Organizations in regulated industries should evaluate applicable data sovereignty, export control, and vendor risk policies before deploying DeepSeek weights in production.
The examples below are illustrative starting points rather than fixed recommendations. Actual model and infrastructure choices should be validated against workload-specific latency, accuracy, and cost requirements.
Model Selection Examples
| Workload | Recommended Model Class | Azure Infrastructure | Rationale |
| --- | --- | --- | --- |
| Internal copilots, high-throughput APIs | 7B–13B (Llama 8B, Mistral 7B, Qwen 7B) | NCads H100 v5 with MIG, or NC A100 v4 (existing deployments) | 10–30x cheaper serving; recover accuracy via RAG and fine-tuning |
| Customer-facing assistants | 30B–70B (Llama 70B, Qwen 72B, Mistral Large) | NC A100 v4 (80GB, existing deployments) or ND H100 v5 | Quality directly impacts revenue and trust |
| Frontier quality at sub-frontier cost | MoE (Qwen3-235B, DeepSeek-R1, Mixtral) | ND H100 v5 or ND GB200-v6 | Active parameters determine inference cost, not total model size |
| Code completion and engineering copilots | Code-specialized (DeepSeek-Coder, Qwen-Coder) | NCads H100 v5 with MIG | Domain models outperform larger general models at lower cost |
| Multilingual | Qwen, DeepSeek | Matches workload sizing above | Strongest non-English performance in the open-weight ecosystem |
| Edge / on-device | Sub-7B (Phi-4, Gemma 2B, Llama 8B quantized) | Azure IoT Edge / local hardware | Fits within edge memory and power envelopes |
The rule of thumb: start with the smallest model that meets your quality threshold. Add RAG, caching, fine-tuning, and batching before scaling model size. Treat model choice as an ongoing decision — the open-source ecosystem evolves fast enough that what’s optimal today may not be in six months. Actual performance varies by workload, so these model and size recommendations should be validated through testing in your target environment.
All leading open-weight models are natively supported by vLLM and Ray Serve / Anyscale on AKS, with out-of-the-box quantization, multi-GPU parallelism, and Multi-LoRA support.
The optimizations above assume a platform that is already secure, governed, and production-hardened. Continuous batching on an exposed endpoint is not a production system. Part three covers the architecture decisions, security controls, and operational metrics that make enterprise inference deployable — and auditable. Continue to Part 3: Building an Enterprise Platform for Inference at Scale →
Part 1: Inference at Enterprise Scale: Why LLM Inference Is a Capital Allocation Problem | Microsoft Community Hub
Part 3: (coming soon)