Apps on Azure Blog

Self-Hosted AI Application on AKS in a Day with KAITO and Copilot

owaino (Microsoft)
Jan 20, 2025

In this blog post I document my experience of spending a full day using KAITO and Copilot to accelerate the deployment and development of a self-managed, AI-enabled chatbot backed by a fine-tuned LLM running in a Kubernetes cluster. AI apps to create AI apps!

Introduction

In this blog post I document my experience of spending a full day using KAITO and Copilot to accelerate the deployment and development of a self-managed, AI-enabled chatbot deployed in a managed cluster. The goal is to show how quickly, using a mix of AI tooling, we can go from zero to a self-hosted, fine-tuned LLM and chatbot application.

Before diving in, I want to share my perspective on the future of projects such as KAITO. At the moment I believe KAITO to be somewhat ahead of its time: most enterprises are only beginning to adopt abstracted artificial intelligence, so it is brilliant to see projects like KAITO being developed, ready for the abstraction pendulum to eventually swing back, driven by the usual factors such as increased skills in the market, cost and governance. As GPUs become cheaper, more readily available and more powerful, enterprises will undoubtedly look to take centralised control of the AI models they use. When that shift happens, open-source projects like KAITO will become commonplace in enterprises.

It is also my opinion that Kubernetes lends itself perfectly to being the AI platform of the future, a position shared by the CNCF (albeit both sources here may be somewhat biased). The resiliency, scaling and existence of Kubernetes primitives such as Jobs mean that Kubernetes is already the de facto platform for machine learning training and inference. The same reasons make Kubernetes the best underlying platform for AI development. Companies including DHL, Wayve and even OpenAI already run ML or AI workloads on Kubernetes. That does not mean data scientists and engineers will suddenly be writing Dockerfiles or exploring admission controllers; Kubernetes as a platform will instead sit multiple layers of abstraction away (full-scale self-service platform engineering), but the engineers responsible for running and operating the platform will hail projects like KAITO.

Now, hopefully you are feeling suitably excited and ready to dive into my experience of spending the day using KAITO and Copilot to create a self-hosted chat app.

Pre-Requisites

The only real pre-requisite for this example is having GPU quota available for KAITO in the region you deploy the cluster to. In this blog I use 12 vCPUs of the NCv3 series, which is powered by NVIDIA Tesla V100 GPUs. To learn how to request changes to your subscription quota, see the Azure quota documentation.

This article is structured as a record of my development of the following repository. To deploy the resources in this blog post, use the "Deploy Yourself" section towards the end of the blog. The full file set, including the application, setup script and supporting files, can be found in this repository:

https://github.com/owainow/ai-in-a-day

Before we start, let's do some level setting on two of the key pieces of AI tooling I have used in this blog post.

KAITO Summary

KAITO (Kubernetes AI Toolchain Operator) is an open-source project created by Microsoft. It automates the deployment and management of AI/ML model inference and tuning workloads within Kubernetes clusters. It streamlines the integration of large language models (LLMs) by managing model files as container images, providing preset configurations tailored to various GPU hardware, and supporting popular inference runtimes such as vLLM and transformers. KAITO uses Custom Resource Definitions (CRDs) and the controller design pattern: users define a workspace custom resource that specifies GPU requirements and inference or tuning parameters, and the operator then automates the provisioning of GPU nodes and the deployment of workloads based on that specification, simplifying the process of running AI models on Kubernetes. KAITO supports popular open-source models such as Falcon and Phi.
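
To give a flavour of what that looks like in practice, here is a minimal inference workspace of the kind used later in this post (a sketch based on the KAITO examples; the instance type and label values are the ones used in this blog):

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"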

 

KAITO Architecture Diagram.

 

GitHub Copilot Summary

GitHub Copilot is an AI-powered code completion and suggestion tool developed by GitHub in collaboration with OpenAI. It integrates directly into code editors like Visual Studio Code, providing real-time assistance by suggesting code snippets, entire functions, or even boilerplate code as you type. Powered by OpenAI's Codex, a model trained on vast amounts of public code and natural language data, Copilot analyzes the context of the current file and related project files to generate contextually relevant suggestions. This enables developers to accelerate coding, reduce repetitive tasks, and focus on solving higher-level problems.

While Copilot is not a silver bullet, it is without doubt a great productivity tool that developers can leverage for everything from generating code to debugging, or one of my favourite uses: explaining other people's code.

Getting Started

To start with, I created a new project in VS Code in a new window.

Next I need to create the cluster that will host my AI models. Although this would usually be done with IaC, given that the intention here is to show speed of development and ease of consumption, I will create the cluster through CLI commands contained in a bash script.

Cluster Creation - Deploying KAITO

The bash script uses environment variables to register the KAITO feature flag, create the resource group and the AKS cluster, and then verify the connection to the cluster.
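
For reference, the core of those steps looks roughly like the following sketch (resource names are placeholders; the script in the repo is the source of truth):

# Illustrative sketch of the cluster-creation steps; names are placeholders.
export RESOURCE_GROUP="kaito-demo-rg"
export CLUSTER_NAME="kaito-demo-aks"
export LOCATION="eastus"

# Register the preview feature for the managed KAITO add-on (preview at the time of writing).
az extension add --name aks-preview
az feature register --namespace "Microsoft.ContainerService" --name "AIToolchainOperatorPreview"

# Create the resource group and an AKS cluster with the AI toolchain operator enabled.
az group create --name $RESOURCE_GROUP --location $LOCATION
az aks create --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
  --enable-ai-toolchain-operator --enable-oidc-issuer --generate-ssh-keys

# Verify the connection to the cluster.
az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME
kubectl get nodes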

 

We now need to set up some of the additional resources required by KAITO. As discussed above, KAITO uses the Kubernetes CRD/controller pattern and Karpenter, so we need to create a federated credential that allows KAITO to create GPU nodes in our resource group.
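
With the managed add-on, those commands look roughly like this (a sketch following the AKS KAITO documentation; the "ai-toolchain-operator-<cluster>" identity is created for you by the add-on):

# Sketch of the federated credential setup for the KAITO GPU provisioner.
export MC_RESOURCE_GROUP=$(az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
  --query nodeResourceGroup -o tsv)
export KAITO_IDENTITY_NAME="ai-toolchain-operator-${CLUSTER_NAME}"
export AKS_OIDC_ISSUER=$(az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

az identity federated-credential create --name "kaito-federated-identity" \
  --identity-name $KAITO_IDENTITY_NAME --resource-group $MC_RESOURCE_GROUP \
  --issuer $AKS_OIDC_ISSUER \
  --subject system:serviceaccount:"kube-system:kaito-gpu-provisioner" \
  --audience api://AzureADTokenExchange

# Restart the provisioner so it picks up the new credential.
kubectl rollout restart deployment/kaito-gpu-provisioner -n kube-system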

 

The KAITO-specific commands in the bash script create the federated credential and restart the provisioner pods. We then use a very simple YAML manifest to tell KAITO which model we want to deploy, and loop to check when the workspace is ready. Because Karpenter/node auto-provisioning manages the creation of the GPU nodes under the hood, the GPU nodes spin up as soon as the workspace resource is created on the cluster, and the LLM image (Falcon-7B in this case) is pulled. This can take up to an hour to complete, so we loop until the resource reports ready.
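
The wait itself is just a polling loop over the workspace's status conditions, along these lines (the manifest filename here is illustrative, and the condition name can vary between KAITO versions, so check kubectl describe workspace if it never flips):

kubectl apply -f workspace-falcon-7b.yaml

# Poll until the workspace reports ready; GPU provisioning and image pull can take up to an hour.
while [ "$(kubectl get workspace workspace-falcon-7b \
  -o jsonpath='{.status.conditions[?(@.type=="WorkspaceReady")].status}' 2>/dev/null)" != "True" ]; do
  echo "Workspace not ready yet, sleeping 60s..."
  sleep 60
done
echo "Workspace is ready."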

Finally, we get the service IP and run a simple query to ensure the model and inference server are running as expected.
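
That check looks something like the following (the /chat endpoint and JSON shape here are those of KAITO's transformers runtime at the time of writing; adjust if you use vLLM):

# Grab the ClusterIP of the inference service and send a test prompt from inside the cluster.
export SERVICE_IP=$(kubectl get svc workspace-falcon-7b -o jsonpath='{.spec.clusterIP}')

kubectl run -it --rm --restart=Never curl-test --image=curlimages/curl -- \
  curl -X POST "http://$SERVICE_IP/chat" \
  -H "accept: application/json" -H "Content-Type: application/json" \
  -d '{"prompt": "What is Kubernetes?"}'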

This is great and gives us a very easy way to deploy optimised, self-hosted LLMs on Azure; however, the model is not tuned for any specific purpose and will answer any question we pass to it. Let us now review KAITO fine-tuning.

KAITO Tuning

Now that our workspace is up and running, there is one clear problem: the model is a general LLM with no input restrictions or safeguarding of responses. System prompts can go a long way towards preventing LLMs from answering queries outside the model's use case. With KAITO we can also tune the models we deploy. KAITO supports tuning models using either LoRA or QLoRA, passing in data either from a public dataset hosted online or from data contained in a container image.

In our workspace example I generated a very small parquet file (partly to make the tuning run much faster: minutes instead of hours) and uploaded it to a public blob storage container.

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b-instruct
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: falcon-7b-instruct
tuning:
  preset:
    name: falcon-7b
  method: qlora
  input:
    urls:
      - https://oowpublic.blob.core.windows.net/parquet/microsoft_products_tuning_data.parquet
  output:
    image: llmrepooow.azurecr.io/adapters/myadapter:0.0.1
    imagePushSecret: acr-secret

Training data must follow the specific Hugging Face conversational training format and be passed as a parquet file.

{ "messages": [ {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."} ] }

 

Parquet Creation

In the application folder of the repository I have included a small example of the Python files used to create and validate my parquet. These can be used to run your own tuning with a unique dataset.
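
As a minimal sketch of what those files do (illustrative rows, assuming pandas and pyarrow are installed; the real files live in the application folder of the repo):

# create_parquet.py - minimal sketch of building a Hugging Face-style chat dataset.
import pandas as pd

# Each row is one conversation in the messages format shown above.
rows = [
    {"messages": [
        {"role": "system", "content": "You are a helpful Microsoft hardware shopping assistant."},
        {"role": "user", "content": "Which Surface would suit frequent travel?"},
        {"role": "assistant", "content": "The Surface Go line is the lightest option for travel."},
    ]},
]

df = pd.DataFrame(rows)
df.to_parquet("microsoft_products_tuning_data.parquet", index=False)

# Validate by reading the file back and inspecting the first conversation.
print(pd.read_parquet("microsoft_products_tuning_data.parquet").iloc[0]["messages"])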

KAITO creates an adapter container image based on the training job, which is then pushed to the repository you specify using a push secret. This is all done automatically for you in the setup script provided.

KAITO Inference Deployment with adapter.

When adapters are specified in the inference spec, the KAITO controller adds an init container for each adapter in addition to the main container.

If an image is specified as the adapter source, the corresponding init container uses that image as its container image. These init containers ensure all adapter data is available locally before the inference service starts. The main container uses a supported model image, launching the inference_api.py script.

All containers share local volumes by mounting the same emptyDir volumes, avoiding file copies between containers. This can be seen in the diagram above.

For a deeper dive into tuning with KAITO, review the documentation in the KAITO repo: https://github.com/kaito-project/kaito/tree/main/docs/tuning

 

Very much on theme for this blog post, when evaluating the KAITO tuning spec I leveraged GitHub Copilot within the GitHub repository to help explain the tuning spec parameters.

With the tuning job complete I was then able to create my inference workspace:

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b-inference-adapter
resource:
  instanceType: Standard_NC12s_v3
  labelSelector:
    matchLabels:
      apps: falcon-7b-adapter
inference:
  preset:
    name: falcon-7b-instruct
  adapters:
    - source:
        name: falcon-7b-adapter
        image: llmrepooow.azurecr.io/adapters/myadapter:0.0.1
        imagePullSecrets:
          - acr-secret
      strength: "1"

 

App Development

In this instance we will create a chatbot application to talk to the LLM: a Microsoft shopping agent that can process a web page and, based on some specific user details, recommend Microsoft hardware to a customer. To do this we need a frontend that allows a user to pass in a URL and a query describing the customer's specific ask. I'm sure, much to the disappointment of my software engineering lecturers, my software development skills are very rusty, so I will be using GitHub Copilot to help me create this application. AI apps to create AI apps...

 

To start with, I created a separate folder called "application" in the root of my directory and opened a new terminal (as I am starting this while the bash script is running in another terminal).

An hour after starting (not bad, given that I am writing this in parallel), I am ready to engage Copilot.

Prompt engineering is emerging as an important skill for all roles in the workplace, much like efficient "Googling" became second nature over the last couple of decades. As I know the intention of my application and the language I want to write my chatbot in, I can be very specific in my prompt.
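
For illustration, a prompt along these lines (illustrative wording, not my exact prompt) shows the level of specificity I mean:

"Create a Node.js chatbot web application. The frontend should let a user enter a product page URL and a free-text customer requirement, send both to an LLM inference endpoint via an HTTP POST request, and display the model's hardware recommendation. Include a Dockerfile and the npm commands to run it locally on port 3000."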

 

The output contains a tree of files that can be clicked into and examined before being created within my project at the click of a button.

Now that the files have been created, we could review them to ensure they are accurate, but to begin with I just want to run the application. So, to keep productivity high, let's ask Copilot how to run the application locally.

After some minor debugging (Copilot failed to create a .css file for the animated background, and created it once I pasted the error from the terminal into the chat window), the application is built and ready for some testing.

Note that at this point we have no inference server endpoint to use, so we are just evaluating what the application looks like and how it is interacted with.

 

We can see the application running. Let's make this application a little more resilient and ask Copilot to add some error catching to improve the user experience.

Now that our application is ready, we can check that it builds and runs locally with:

npm install
npm start

Once that is confirmed, we can build our Docker image and push it to the container registry. The Dockerfile was created for us, and it highlights something to keep an eye on when using Copilot: versioning. Because of the dataset used to train the model, the Dockerfile has been created with node:14. This does not matter too much in our case, but it will vary depending on the training data and prompt used.

FROM node:14

# Set the working directory
WORKDIR /app

# Copy package.json and package-lock.json
COPY package*.json ./

# Install dependencies
RUN npm install

# Copy the rest of the application code
COPY . .

# Build the application
RUN npm run build

# Expose the port the app runs on
EXPOSE 3000

# Command to run the application
CMD ["npm", "start"]
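
If you would rather not inherit an old base image from Copilot's training data, you can simply pin a current LTS release yourself, for example:

FROM node:20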

Once the image is built, we can test it locally with Docker and then push it to our Azure Container Registry. I won't detail those steps here, but if you get stuck while developing this yourself, that is another great candidate for a productivity saving: ask GitHub Copilot how to run the image locally and push it. The bash script I have created will automatically build and push the application for you!
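
For completeness, the local test and push look something like this (the registry name is a placeholder; substitute your own):

# Build, smoke test locally, then push to Azure Container Registry.
docker build -t chatbot:0.0.1 .
docker run --rm -p 3000:3000 chatbot:0.0.1

az acr login --name <your-registry>
docker tag chatbot:0.0.1 <your-registry>.azurecr.io/chatbot:0.0.1
docker push <your-registry>.azurecr.io/chatbot:0.0.1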

The application could do with some fine-tuning of its responses and layout; however, for the sake of time, we will continue and deploy the application as is.

Conclusion 

Now... I have to admit, this may have taken more than one single continuous day. The reason was primarily a lack of GPU quota in my Azure subscription, having recently changed to a brand-new subscription. In terms of elapsed hours working on this, including creating the bash script to automate the deployment, I am at around 12 hours. A more apt but less catchy title would be "AI in a working day and a half", or "AI in a day involving overtime and no conflicting meetings".

In all seriousness, I have been impressed by how much my development was accelerated using KAITO and Copilot. I was able to create all these resources in a very short time, which is especially impressive when you consider how long a project like this, running a locally fine-tuned LLM on optimised GPUs with a chatbot application, would have taken to develop from scratch. As always, I remind myself that during this exercise I was primarily using GPT-4o (not even o1!), and LLMs' ability to reason and solve increasingly complex tasks will only get better from here.

 

However, don't take my word for it, why not try it yourself? Especially now that there is a free GitHub Copilot tier for Visual Studio!

Deploy Yourself

The complete set of application files and bash script can be found at this repository:

https://github.com/owainow/ai-in-a-day

Simply run the bash script; once you complete the login flow and enter your initials for unique resource naming, the rest of the deployment, from tuning to app deployment, will be handled for you. Once complete, navigate to the URL shown at the bottom of the terminal and play with the chatbot that has been created.

The bash script can be run from the infrastructure folder of the cloned directory using the following command:

cd infrastructure
. deploycluster.sh

The script takes over an hour to run from start to finish, with interaction required at the start for sign-in and entering initials. You may also need to interact later on if this is not the first time you have run the script and you have not cleared your old kubeconfigs.

KubeTidy

Here's a tool called KubeTidy if you do want to stay on top of unused Kubernetes config files: https://github.com/KubeDeckio/KubeTidy

Once the script has completed, you should see the final output in your terminal. It's now time to test and explore the deployment!

Future Work

This is by no means a complete run-through of KAITO. There are some additional areas outside the scope of this article that I believe will be interesting to review going forward. One is the point at which self-hosting with KAITO becomes cost-effective versus Azure OpenAI Service for specific use cases. A single NCv3-series SKU with 12 vCPUs running 24/7 will cost around $4,800 a month before cost savings, discounts or reserved instance plans, and I am unsure how many requests per minute and tokens could be processed for that price. Would this type of architecture lend itself instead to batch jobs, if the content and responses did not need to be immediate? This is an interesting area that I believe will become more relevant as LLM usage and maturity grow across enterprises.
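
For context, that monthly figure is simple arithmetic: assuming a pay-as-you-go rate of roughly $6.60 per hour for a Standard_NC12s_v3 instance (an assumption; rates vary by region and over time), $6.60/hour × 730 hours/month ≈ $4,800/month before any discounts.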

There is certainly scope for an entirely separate article examining the benefits and drawbacks from a governance and operational standpoint when comparing Azure OpenAI services and self hosted models using KAITO. 

A final thought to support enterprise-ready adoption of KAITO on AKS would be to combine it with a self-hosted APIM gateway running on AKS. Although traditionally my advice would be to run APIM separately from your workloads and KAITO, using the managed service, there are use cases around latency, performance or financial commitment that would make a self-hosted API Management gateway running alongside KAITO an attractive architecture.

The application could do with some tweaks to its responses, the LLM's responses could be further grounded, and now that the RAG reranker for KAITO is GA (this week!), we could expand the example with retrieval-augmented results. Credit to Ishaan Sehgal and the KAITO team for their work on this one.

Updated Jan 20, 2025
Version 3.0