Introduction
This guide provides step-by-step instructions for running DeepSeek-R1 on Azure Kubernetes Service (AKS). The setup uses an ND-H100-v5 VM to host the 4-bit quantized, 671-billion-parameter model on a single node.
Prerequisites
Before proceeding, ensure you have an AKS cluster with a node pool containing at least one ND-H100-v5 instance. Additionally, make sure the NVIDIA GPU drivers are properly set up.
Important: Set the OS disk size to 1024GB to accommodate the model.
For detailed instructions on creating an AKS cluster, refer to this guide.
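If you still need to add the GPU node pool, a minimal Azure CLI sketch is shown below. The resource group, cluster, and pool names are placeholders, and the ND H100 v5 SKU name and disk size should be verified against your subscription:
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <aks-cluster> \
  --name gpupool \
  --node-count 1 \
  --node-vm-size Standard_ND96isr_H100_v5 \
  --node-osdisk-size 1024
Once the node is ready, you can confirm that the GPUs are advertised to Kubernetes:
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu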
Deploying Ollama with DeepSeek on AKS
We will use the PyTorch container from NVIDIA to deploy Ollama. Because this image is hosted on NVIDIA NGC (nvcr.io), first configure registry authentication in the AKS cluster:
- Create an NGC API Key
  - Visit NGC API Key Setup.
  - Generate an API key.
- Create a Secret in the AKS Cluster
  - Run the following command to create a Kubernetes secret for registry authentication:

kubectl create secret docker-registry nvcr-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<YOUR_NGC_API_KEY>
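Optionally, verify that the secret exists before deploying the pod:
kubectl get secret nvcr-secret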
Deploy the Pod
Use the following Kubernetes manifest to deploy Ollama, download the model, and start serving it:
apiVersion: v1
kind: Pod
metadata:
  name: ollama
spec:
  imagePullSecrets:
  - name: nvcr-secret
  containers:
  - name: ollama
    image: nvcr.io/nvidia/pytorch:24.09-py3
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    volumeMounts:
    - mountPath: /dev/shm
      name: shmem
    resources:
      requests:
        nvidia.com/gpu: 8
        nvidia.com/mlnxnics: 8
      limits:
        nvidia.com/gpu: 8
        nvidia.com/mlnxnics: 8
    command:
      - bash
      - -c
      - |
        export DEBIAN_FRONTEND=noninteractive
        export OLLAMA_MODELS=/mnt/data/models
        mkdir -p $OLLAMA_MODELS
        apt update
        apt install -y curl pciutils net-tools
        # Install Ollama
        curl -fsSL https://ollama.com/install.sh | sh
        # Start Ollama server in the foreground
        ollama serve 2>&1 | tee /mnt/data/ollama.log &
        # Wait for Ollama to fully start before pulling the model
        sleep 5
        # Pull the base model
        ollama pull deepseek-r1:671b
        # Create a variant with adjusted context/prediction parameters
        # (works around https://github.com/ollama/ollama/issues/8599)
        cat >Modelfile <<EOF
        FROM deepseek-r1:671b
        PARAMETER num_ctx 24576
        PARAMETER num_predict 8192
        EOF
        ollama create deepseek-r1:671b-fixed -f Modelfile
        # Keep the container running
        wait
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 128Gi
    name: shmem
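Save the manifest to a file (for example ollama-pod.yaml, an arbitrary name) and apply it. The model download takes a while, so it is worth following the pod log:
kubectl apply -f ollama-pod.yaml
kubectl get pod ollama
kubectl logs -f ollama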
Connecting to Ollama
By default, this setup does not create a Kubernetes Service. You can either define one or use port forwarding to reach Ollama from your local machine. To forward the port, run:
kubectl port-forward pod/ollama 11434:11434
Now you can interact with the DeepSeek reasoning model using your favorite chat client. For example, in Chatbox, you can ask:
"Tell me a joke about large language models."