Introduction
This guide provides step-by-step instructions for running DeepSeek-R1 on Azure Kubernetes Service (AKS). The setup uses an ND-H100-v5 VM to host the 4-bit quantized, 671-billion-parameter model on a single node.
Prerequisites
Before proceeding, ensure you have an AKS cluster with a node pool containing at least one ND-H100-v5 instance. Additionally, make sure the NVIDIA GPU drivers are properly set up.
Important: Set the OS disk size to 1024GB to accommodate the model.
For detailed instructions on creating an AKS cluster, refer to this guide.
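If you still need to add the GPU node pool, a minimal Azure CLI sketch is shown below. The resource group, cluster, and pool names are placeholders, and the ND H100 v5 SKU name and disk size should be verified against your subscription:
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <aks-cluster> \
  --name gpupool \
  --node-count 1 \
  --node-vm-size Standard_ND96isr_H100_v5 \
  --node-osdisk-size 1024
Once the node is ready, you can confirm that the GPUs are advertised to Kubernetes:
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu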
Deploying Ollama with DeepSeek on AKS
We will use the PyTorch container from NVIDIA to deploy Ollama. Because this image is hosted on NVIDIA NGC (nvcr.io), first configure registry authentication in the AKS cluster:
- Create an NGC API Key
  - Visit NGC API Key Setup.
  - Generate an API key.
- Create a Secret in the AKS Cluster
  - Run the following command to create a Kubernetes secret for registry authentication:

kubectl create secret docker-registry nvcr-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<YOUR_NGC_API_KEY>
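Optionally, verify that the secret exists before deploying the pod:
kubectl get secret nvcr-secret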
Deploy the Pod
Use the following Kubernetes manifest to deploy Ollama, download the model, and start serving it:
apiVersion: v1
kind: Pod
metadata:
  name: ollama
spec:
  imagePullSecrets:
  - name: nvcr-secret
  containers:
  - name: ollama
    image: nvcr.io/nvidia/pytorch:24.09-py3
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    volumeMounts:
    - mountPath: /dev/shm
      name: shmem
    resources:
      requests:
        nvidia.com/gpu: 8
        nvidia.com/mlnxnics: 8
      limits:
        nvidia.com/gpu: 8
        nvidia.com/mlnxnics: 8
    command:
      - bash
      - -c
      - |
        export DEBIAN_FRONTEND=noninteractive
        export OLLAMA_MODELS=/mnt/data/models
        mkdir -p $OLLAMA_MODELS
        apt update
        apt install -y curl pciutils net-tools
        # Install Ollama
        curl -fsSL https://ollama.com/install.sh | sh
        # Start Ollama server in the foreground
        ollama serve 2>&1 | tee /mnt/data/ollama.log &
        # Wait for Ollama to fully start before pulling the model
        sleep 5
        # Pull the base model
        ollama pull deepseek-r1:671b
        # Create a variant with adjusted context/prediction parameters
        # (works around https://github.com/ollama/ollama/issues/8599)
        cat >Modelfile <<EOF
        FROM deepseek-r1:671b
        PARAMETER num_ctx 24576
        PARAMETER num_predict 8192
        EOF
        ollama create deepseek-r1:671b-fixed -f Modelfile
        # Keep the container running
        wait
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 128Gi
    name: shmem
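Save the manifest to a file (for example ollama-pod.yaml, an arbitrary name) and apply it. The model download takes a while, so it is worth following the pod log:
kubectl apply -f ollama-pod.yaml
kubectl get pod ollama
kubectl logs -f ollama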
Connecting to Ollama
By default, this setup does not create a Kubernetes Service. You can either define one or use port forwarding to reach Ollama from your local machine. To forward the port, run:
kubectl port-forward pod/ollama 11434:11434
Now you can interact with the DeepSeek reasoning model using your favorite chat client. For example, in Chatbox, you can ask:
"Tell me a joke about large language models."