If you had to name the top trends in IT these days, two strong candidates would be Generative AI and Cybersecurity. On the latter front, the sophistication, reach, and volume of cyberattacks have increased significantly in recent years, with added ingredients such as advanced persistent threats, state actors, and “crime-as-a-service” providers.
Interestingly enough, both trends go hand in hand: Artificial Intelligence extracts value from your data, and cybercriminals are after exactly the same thing: your data. It is not surprising that organizations have taken steps to protect themselves against data theft, often described as data exfiltration.
In this post we will explore how to deploy a Hugging Face-hosted model and an NVIDIA NIM™ microservice, a prebuilt, optimized inference container for rapidly deploying the latest AI models, in a Kubernetes cluster while protecting your infrastructure against data theft. You can find more information about NVIDIA NIM here: https://developer.nvidia.com/nim.
We will outline the process for deploying a Kubernetes cluster in Azure with the highest level of network security to prevent data exfiltration, and we will also demonstrate the deployment of container images and the required model artifacts for both options.
Why Kubernetes clusters
Unless you have been living under a rock, you are probably aware that Kubernetes has taken the IT world by storm, and the AI ecosystem is no exception. Kubernetes makes it extremely easy to package and deploy applications over any infrastructure, and hence it has become one of the most popular platforms to run AI workloads, especially AI inferencing.
Azure Kubernetes Service (AKS) is an Azure service that makes it easy to run Kubernetes clusters in Azure. Over time, AKS has introduced multiple deployment options to meet increasingly stringent requirements, particularly around security. One such option is the private cluster, where no public IP addresses are assigned to the Kubernetes control plane or nodes.
To understand this evolution, let’s have a look at what a “public” AKS cluster looks like:
Figure 1 - AKS cluster with public API enabled
As the previous figure shows, there are multiple traffic flows that go over the public Internet:
- In the bootstrap phase, the nodes pull images from the Microsoft Container Registry, as well as potentially from other sources such as the Ubuntu package repositories.
- The Kubernetes administrator operates the cluster accessing the Kubernetes API provided by Microsoft with a public IP address.
- When pulling container images, the cluster nodes can get them from publicly available repositories such as Docker Hub or Azure Container Registry (if configured to be publicly accessible).
- Lastly, administrators are allowed to expose applications that run in the cluster via public IP addresses, so that users will access them over the Internet too.
The first evolution of this concept towards a more restrictive environment was a commonly used pattern consisting of a combination of private clusters (https://learn.microsoft.com/azure/aks/private-clusters) and Azure Firewall to limit egress traffic (https://learn.microsoft.com/azure/aks/limit-egress-traffic) and prevent data exfiltration.
In this model, there are no longer any inbound connections to the cluster:
Figure 2 - AKS private cluster
- The AKS API control plane is fully integrated in the virtual network.
- Azure Container Registry and other Azure services such as Azure Storage or Azure Key Vault are also integrated with the virtual network through the Private Link technology (https://aka.ms/privatelink).
However, there are still outbound flows from the cluster nodes to the Internet, for example during the cluster creation process or the deployment of images stored in public repositories, which need to be explicitly allowed by the egress firewall.
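As an illustration, this private-cluster pattern can be sketched with the Azure CLI roughly as follows. This is a minimal sketch only: resource names and IDs are placeholders, and the egress rule uses the AzureKubernetesService FQDN tag to cover the endpoints AKS itself needs.

```shell
# Sketch only: names, subnet IDs and priorities are placeholders.

# Create a private AKS cluster whose egress flows through a user-defined route
az aks create \
  --resource-group MyResourceGroup \
  --name MyPrivateCluster \
  --enable-private-cluster \
  --outbound-type userDefinedRouting \
  --vnet-subnet-id <AKS_SUBNET_ID>

# Allow only the FQDNs AKS requires, via the AzureKubernetesService FQDN tag
az network firewall application-rule create \
  --resource-group MyResourceGroup \
  --firewall-name MyFirewall \
  --collection-name aks-egress \
  --name allow-aks \
  --action allow \
  --priority 100 \
  --protocols http=80 https=443 \
  --source-addresses '*' \
  --fqdn-tags AzureKubernetesService
```

Any flow not matched by an explicit rule is then dropped by the firewall.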
Air-gapped clusters
It can be argued that private clusters only provide security up to the robustness of your firewall ruleset: the firewall essentially fails open. If there is a misconfiguration in the firewall rules, you may unintentionally allow data exfiltration or theft.
To address this, AKS offers an even greater degree of isolation with network-isolated clusters (https://learn.microsoft.com/azure/aks/concepts-network-isolated), where all outbound connections are completely blocked without the need of a firewall:
Figure 3 - AKS isolated cluster
In this mode, AKS nodes are configured in a way so that no outbound flows to the Internet can exist.
If you are curious about what you need to do to make sure of that in Azure, here is the list:
- No public Azure load balancer attached to the AKS nodes.
- No NAT gateway attached to the AKS node subnet.
- No public IP address attached to the AKS nodes.
- The AKS node subnet configured for no default outbound access (https://learn.microsoft.com/azure/virtual-network/ip-services/default-outbound-access).
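For reference, creating such a network-isolated cluster can be sketched as follows. This is an assumption-laden sketch: names are placeholders, and the bootstrap flags follow the network-isolated cluster documentation and may change as the feature evolves.

```shell
# Sketch only: creates a private, network-isolated AKS cluster with no
# outbound Internet access; node bootstrap happens via the attached registry.
az aks create \
  --resource-group MyResourceGroup \
  --name MyIsolatedCluster \
  --enable-private-cluster \
  --outbound-type none \
  --bootstrap-artifact-source Cache \
  --bootstrap-container-registry-resource-id <ACR_RESOURCE_ID> \
  --vnet-subnet-id <AKS_SUBNET_ID>
```

With `--outbound-type none`, AKS provisions no load balancer or NAT gateway for egress, satisfying the checklist above.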
An important consideration is understanding how AKS nodes receive updates or how images are retrieved from public repositories (e.g. docker.io or nvcr.io). This is achieved through an Azure Container Registry feature known as “artifact cache”: https://learn.microsoft.com/azure/container-registry/artifact-cache-overview.
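For illustration, a cache rule might be created like this (the registry name and repositories are placeholders; check the artifact cache documentation for supported upstream registries and for credential sets where authentication is required):

```shell
# Sketch only: caches docker.io/library/ubuntu in the attached registry
az acr cache create \
  --registry <ACR_NAME> \
  --name ubuntu-cache \
  --source-repo docker.io/library/ubuntu \
  --target-repo library/ubuntu

# The isolated cluster then pulls through the registry instead of the Internet,
# e.g. image: <ACR_NAME>.azurecr.io/library/ubuntu:24.04
```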
However, a challenge arises when considering Large Language Models (LLMs): LLM container images sourced from Hugging Face or NVIDIA (or any other source) typically include the inference runtime (for example vLLM) but not the model weights. Instead, model artifacts are downloaded dynamically when the container starts.
Consequently, Azure Container Registry cannot cache these assets. The question then becomes: how can these model weights be made available within an air-gapped Kubernetes environment?
The Model Weights Challenge
While (re)loading model weights on container startup is a flexible approach in connected environments, it fails in air-gapped clusters where outbound network access is blocked.
To address this, we consider two viable strategies:
- Constructing a container image that includes all required components and pushing it to the container registry accessible by the isolated cluster
- Pre-downloading model artifacts to a private file share connected to the virtual network of the isolated cluster, and accessing these resources as needed.
Both methods will be demonstrated in detail, but before proceeding we will outline the example scenario. To provide context aligned with current priorities among financial clients and organizations operating in regulated sectors, this demonstration focuses on the model deployment process and the configuration of an isolated cluster for LLM inferencing.
For inferencing we use Llama-3.1-8B-Instruct-FP8 served by vLLM, a high-performance inference runtime designed specifically for large language models. In simple terms, vLLM is responsible for efficiently loading the model onto the GPU and handling incoming inference requests with very low latency and high throughput. vLLM is typically packaged as a container image, which can be sourced either from Hugging Face or from NVIDIA (in our examples), the latter being highly optimized for NVIDIA GPUs and CUDA®. As described earlier, these images usually contain the inference runtime and dependencies, but not the model weights themselves.
Instead, the model weights and other model-specific artifacts are downloaded dynamically when the container starts, allowing the same container image to be reused across different models and versions while keeping the image size small and deployment flexible. This approach is not suitable in isolated AKS clusters, where network traffic flowing outside of the deployed virtual network is not permitted.
From an architectural perspective, model serving is only one part of the overall inferencing platform, and the design of the underlying GPU infrastructure plays a critical role, especially in isolated AKS clusters. In such environments, challenges are not limited to downloading model weights at container startup; for example, setting up the GPU node pool is another important consideration.
Traditionally, enabling GPUs on AKS requires installing the NVIDIA device plugin for Kubernetes as well as the NVIDIA GPU drivers, most commonly by deploying the NVIDIA GPU Operator, which takes care of both. While the device plugin itself can be installed relatively easily via the artifact cache of an attached container registry, driver installation is more involved, especially in air-gapped or isolated environments: https://learn.microsoft.com/azure/aks/use-nvidia-gpu. NVIDIA also provides detailed guidance on how to deploy the GPU Operator in such scenarios in their documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-air-gapped.html.
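For reference, a GPU Operator installation pointing at a mirrored registry might look roughly like this. This is a sketch only: it assumes the operator images have been mirrored to the attached registry beforehand, and the chart values should be verified against NVIDIA's air-gapped guide linked above.

```shell
# Sketch only: chart values are illustrative and must match NVIDIA's
# air-gapped installation guide for your operator version.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.repository=<ACR_NAME>.azurecr.io/nvidia \
  --set validator.repository=<ACR_NAME>.azurecr.io/nvidia
```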
While the required procedures are clearly outlined and well documented, configuring the NVIDIA AKS GPU node pool within an air-gapped, isolated cluster continues to be a complex and time-consuming process.
Microsoft has recently unveiled a preview feature that allows users to create fully managed NVIDIA GPU node pools on AKS. With this option, all necessary NVIDIA components, including drivers, device plugins, and other supporting software, are pre-installed and maintained by Microsoft throughout their lifecycle.
This functionality is supported on isolated AKS clusters running Kubernetes v1.34.0 or later. It substantially decreases operational complexity, streamlining the deployment and maintenance of NVIDIA-accelerated AI inferencing solutions in restricted environments: https://learn.microsoft.com/en-us/azure/aks/aks-managed-gpu-nodes?tabs=add-ubuntu-gpu-node-pool.
For example, this Azure CLI command deploys a managed GPU node pool to an existing AKS cluster:
```shell
az aks nodepool add \
  --resource-group MyResourceGroup \
  --cluster-name MyAKSCluster \
  --name gpunp \
  --node-count 1 \
  --node-vm-size <GPU_SKU> \
  --node-taints sku=gpu:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3 \
  --tags EnableManagedGPUExperience=true
```
Keep in mind that if you want complete control over driver versions and related settings, you should follow the guidelines for deploying the NVIDIA GPU Operator in an air-gapped environment. With the managed option, Microsoft takes care of maintaining driver versions for you.
Container and model weight deployment
With the managed GPU node pool set up, we can begin implementing both inferencing scenarios.
Scenario 1: Baking Model Weights into the Container Image
Let’s start with the first one, where the model weights are downloaded from Hugging Face, baked into a container image and then pushed to the attached Container Registry which is reachable from the isolated AKS cluster:
Figure 4 - Baking model weights into the container image
The easiest way to achieve this is to trigger the container build directly from the container registry, which will take the local Dockerfile, pull the image and data needed from Hugging Face, and push the tagged, baked image to the container registry.
Make sure you have acquired a Hugging Face API Key. If you are using a gated model, access must be requested before building the image.
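As a minimal sketch of what such a Dockerfile could look like (the base image, model ID, and paths are assumptions for illustration, not the exact files from the repository):

```shell
# Sketch only: writes an example Dockerfile that bakes model weights into a
# vLLM image at build time, so no outbound access is needed at runtime.
cat > Dockerfile <<'EOF'
FROM vllm/vllm-openai:latest
ARG HF_TOKEN
ENV HF_HOME=/models
# Download the weights during the build (requires the HF token for gated models)
RUN pip install --no-cache-dir "huggingface_hub[cli]" && \
    HF_TOKEN=$HF_TOKEN huggingface-cli download \
      meta-llama/Llama-3.1-8B-Instruct --local-dir /models/llama-3.1-8b
# Serve the baked-in weights; no download happens at container start
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            "--model", "/models/llama-3.1-8b"]
EOF
```

The `az acr build` command below then picks up this local Dockerfile.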
```shell
az acr build \
  --registry <ACR_NAME> \
  --image llama3-vllm-fat:8b-instruct \
  --build-arg HF_TOKEN=$HF_TOKEN \
  .
```
Note: When you use the “az acr build” command instead of running docker build yourself, it automatically tags your image and pushes it to the Azure Container Registry.
Once this container image is available in the container registry, we can create a simple pod and an internal load balancer to expose the service endpoint to the user. The detailed instructions and code are available here: https://github.com/mocelj/aks-air-gap-vllm-deployment.
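As an illustrative sketch (pod and service names, image path, toleration, and port are assumptions; the annotation shown is the standard AKS internal load balancer annotation), such a manifest could look like this:

```shell
# Sketch only: writes an example pod + internal load balancer manifest.
cat > vllm-llama3-8b.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: vllm-llama3-8b
  labels:
    app: vllm-llama3-8b
spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: vllm
    image: <ACR_NAME>.azurecr.io/llama3-vllm-fat:8b-instruct
    ports:
    - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: vllm-llama3-8b
  ports:
  - port: 8000
    targetPort: 8000
EOF
```

Apply it with `kubectl apply -f vllm-llama3-8b.yaml`; the annotation keeps the load balancer IP inside the virtual network.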
You can test the deployment by querying the external IP of the deployed service and interacting with the endpoint in an OpenAI-compatible way. Note that since this is an isolated cluster, you need to connect to the cluster's network via VPN or run the curl command in a pod inside the cluster; see the aks_isolated.sh script in the repository (https://github.com/mocelj/aks-air-gap-vllm-deployment) for details on how to set up a point-to-site VPN in Azure. Here is how to get the service IP address and query the completions API:
```shell
# Get the service IP
svc_ip=$(kubectl get svc vllm-llama3-8b -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl -X POST "http://${svc_ip}:8000/v1/chat/completions" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a polite and respectful chatbot."},
      {"role": "user", "content": "Where should I go for lunch close to the Microsoft office in Pratteln?"}
    ],
    "model": "meta/llama3-8b-instruct",
    "max_tokens": 512,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "frequency_penalty": 0.0
  }'
```
Scenario 2: Using a Shared File System for Model Artifacts
In the second scenario, where we are downloading the model weights and other artifacts used by the NVIDIA NIM to a shared NFS drive, we must follow a slightly different strategy.
Figure 5 - Pre-downloading model weights to a private file share
For simplicity, we have used a virtual machine capable of downloading artifacts from the Internet and reaching both the internal container registry and the shared NFS volume deployed in a virtual network. In this example, we have created a simple NFS share using Azure Files (see the Azure CLI code to create the share and the endpoint: https://github.com/mocelj/aks-air-gap-vllm-deployment/blob/main/aks_isolated.sh#L352). For large-scale inferencing scenarios, you might want to consider other storage options to ensure reasonable startup times, given that the weights can be of considerable size.
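For reference, an NFS-enabled Azure Files share can be sketched like this (account and share names are placeholders; NFS shares require a Premium FileStorage account with secure transfer disabled, plus a private endpoint as shown in the linked script):

```shell
# Sketch only: placeholder names; a private endpoint into the virtual
# network is still needed so the isolated cluster can mount the share.
az storage account create \
  --resource-group MyResourceGroup \
  --name mynfsstorage \
  --sku Premium_LRS \
  --kind FileStorage \
  --https-only false

az storage share-rm create \
  --resource-group MyResourceGroup \
  --storage-account mynfsstorage \
  --name nim-cache \
  --enabled-protocols NFS \
  --quota 1024
```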
To facilitate model deployment on NVIDIA A100 GPUs, we have provisioned a jump box equipped with the same GPU type. If you start with a fresh virtual machine, you may need to install the appropriate GPU driver as well as NVIDIA’s container runtime (https://developer.nvidia.com/container-runtime). You can alternatively deploy a Linux virtual machine using DSVM Linux images, where only the container runtime needs to be added to ensure readiness for operation.
To download the container image from nvcr.io we can leverage the caching rules in the container registry and pull the image via our connected container registry. Once the container image is pulled locally, we can download the model profile with the appropriate artifacts and copy everything to the shared folder, e.g. by using rsync.
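A hedged sketch of these two steps, where the registry name, repository path, and mount point are assumptions:

```shell
# Pull the NIM container image through the artifact cache of the attached registry
docker pull <ACR_NAME>.azurecr.io/nim/meta/llama-3.1-8b-instruct:latest

# Once download-to-cache has populated $LOCAL_NIM_CACHE (see below),
# copy the model artifacts to the NFS share mounted on the jump box
rsync -avh --progress "$LOCAL_NIM_CACHE"/ /mnt/nim-cache/
```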
The artifacts can be downloaded by using the Utilities for NVIDIA NIM for LLMs: https://docs.nvidia.com/nim/large-language-models/latest/utilities.html
```shell
docker run --rm \
  --runtime=nvidia \
  --gpus all \
  -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
  -u $(id -u) \
  -e NGC_API_KEY \
  $TARGET_IMAGE \
  download-to-cache
```
Important: The NVIDIA API key must not be included in the AKS deployment manifests, as otherwise it would trigger outbound network calls that will fail in air‑gapped environments. The key is only required on the jump box during the download of the model artefacts.
Since this shared folder is reachable from both the jump box network and the isolated, air-gapped AKS cluster, the only remaining step is to point the NVIDIA NIM container to the model weights in the shared folder instead of downloading them when the pod starts.
The service can be tested in a similar way as before. First, find the external service IP via “kubectl get svc vllm-nim-llama3-service -o wide”, then interact with the service in the same way as before, adjusting to the new service IP address.
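For example, assuming the service name above, a quick smoke test could look like this (the /v1/models listing endpoint is part of the OpenAI-compatible API exposed by the runtime):

```shell
# Get the internal load balancer IP of the NIM service
svc_ip=$(kubectl get svc vllm-nim-llama3-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# List the served models as a quick smoke test before sending completions
curl -s "http://${svc_ip}:8000/v1/models"
```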
A more detailed description of the implementation steps can be found in the attached repository: https://github.com/mocelj/aks-air-gap-vllm-deployment.
Summary
This document presents a practical guide for deploying LLM inferencing solutions in isolated Azure Kubernetes Service (AKS) clusters.
It outlines two deployment approaches: one where model weights and artifacts are baked into a container image pushed to a registry reachable from the cluster, and another where they are pre-downloaded to a shared NFS volume accessible from both the jump box and the cluster.
Both strategies enable secure, air-gapped deployments without relying on outbound internet access. For step-by-step instructions and further technical details, consult the referenced GitHub repository https://github.com/mocelj/aks-air-gap-vllm-deployment.
For large-scale or production deployments, more performant storage options, such as local NVMe or other high-throughput solutions, can be explored; the services used in this guide are intentionally chosen to maximize clarity and reproducibility.