Level up your Generative AI development with Microsoft’s AI Toolkit! In the previous blog, we explored how AI Toolkit empowers you to run LLMs/SLMs locally.
The AI Toolkit lets us:
- Run pre-optimized AI models locally: Get started quickly with models designed for various setups, including Windows 11 with DirectML acceleration or running directly on the CPU, Linux with NVIDIA GPUs, or CPU-only environments.
- Test and integrate models seamlessly: Experiment with models in a user-friendly playground or use a REST API to incorporate them directly into your application.
- Fine-tune models for specific needs: Customize pre-trained models (like popular SLMs Phi-3 and Mistral) locally or in the cloud to enhance performance, tailor responses, and control their style.
- Deploy your AI-powered features: Choose between cloud deployment or embedding them within your device applications.
Port forwarding, a valuable feature of the AI Toolkit, serves as the gateway for communicating with the GenAI model, whether through a straightforward API call or through an SDK. Enabling port forwarding unlocks a wide range of scenarios for interacting with the LLM/SLM beyond the built-in playground.
Port forwarding is like setting up a special path for data to travel between two devices over the internet. In the context of AI Toolkit, port forwarding involves configuring a pathway for communication between the LLM and external applications or systems, enabling seamless data exchange and interaction.
The AI Toolkit automatically forwards port 5272 by default; if needed, this can be changed or additional ports can be added. The forwarding is visible the moment the AI Toolkit extension loads in VS Code: a notification appears on the right side of the screen stating "Your application running on port 5272 is available".
Port 5272 is the default port assigned by the AI Toolkit. If we wish to add more ports, we can do so by navigating to the PORTS tab in the VS Code panel and clicking the "Add Ports" button.
Here we can see the "Forwarded Address" column. This is the address that will be used for communicating with the SLM. For this tutorial, Phi-3 will be used; it is a Small Language Model from Microsoft and can be downloaded from the Model Catalog section.
Testing and comprehending the API Endpoint on POSTMAN:
Testing an API with a dedicated API testing tool gives a clear understanding of the API specification. This can also be done in code, but for this demonstration I will showcase it through the Postman application. Download and install Postman on the local machine, and sign up for an account if you don't have one. Once the application is launched, click on Create a new request, displayed as a "+" icon.
To do this testing, we need some basic information about the API being tested: the request method, request URL, request body, request body type, and authentication type are the mandatory details. For the Visual Studio Code AI Toolkit API with the Phi-3-mini-128k-cuda-int4-onnx model, the details are as follows:
Authentication: None
Request method: POST
Request URL:
http://127.0.0.1:5272/v1/chat/completions
Request Body type: Raw/JSON
Request Body:
{
    "model": "Phi-3-mini-128k-cuda-int4-onnx",
    "messages": [
        {
            "role": "user",
            "content": "Hi"
        }
    ],
    "temperature": 0.7,
    "top_p": 1,
    "top_k": 10,
    "max_tokens": 100,
    "stream": false
}
Authentication defaults to None, so it can be left as None here. The HTTP request method is POST. The request URL must be checked against the port that has been assigned: if it is the default port assigned by the Visual Studio Code AI Toolkit, it will be 5272; if it has been changed to another port, the URL must be updated accordingly. Since this is an HTTP request, the URL consists of the address, followed by the port, followed by the route, as shown above.
The "model" parameter in the request body must match the model that is loaded in the VS Code AI Toolkit playground section.
The request body type must be set to Raw, and JSON must be selected from the dropdown.
The stream parameter must be set to false; otherwise the response is streamed as a series of JSON chunks, which is hard to read in Postman. If the application needs streaming, the parameter can be set to true (a Python streaming sketch is shown at the end of the Python section below).
Once these are ready, click the Send button and wait for the response. If the API call is successful, 200 OK will be displayed and the response body will be visible in the Body tab.
NOTE: The VS Code AI Toolkit must be running in the background, and the model must be loaded in the playground, before sending the API request from Postman.
The response of the above request is as follows,
Meanwhile, this will also be reflected in the VS Code AI Toolkit Output window.
The response body carries quite a lot of information; the answer that will be used in the application is in the "content" field. The code must navigate to the first element of choices, then to message, and finally to content. In Python, this can be done with:
chat_completion.choices[0].message.content
This will be further used while building the playground application in the later part of this tutorial.
The response body also provides some more useful information, such as the ID, creation timestamp, and finish reason, to name a few. The role parameter shows that the response is from the model and hence reads "assistant".
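For orientation, the response body typically has the following shape; the values below are illustrative placeholders rather than actual output:
{
    "id": "chatcmpl-...",
    "created": 1710000000,
    "model": "Phi-3-mini-128k-cuda-int4-onnx",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! How can I help you today?"
            },
            "finish_reason": "stop"
        }
    ]
}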
A code snippet can also be generated with the Postman app, which can then be used for testing the API. Below is an example of Python code using the HTTP client.
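The exact snippet Postman generates may differ slightly; the following is a minimal sketch of the same request using Python's built-in http.client module, assuming the default port 5272:
import http.client
import json

# Open a connection to the locally forwarded AI Toolkit endpoint
conn = http.client.HTTPConnection("127.0.0.1", 5272)

# Same request body that was used in Postman
payload = json.dumps({
    "model": "Phi-3-mini-128k-cuda-int4-onnx",
    "messages": [{"role": "user", "content": "Hi"}],
    "temperature": 0.7,
    "top_p": 1,
    "top_k": 10,
    "max_tokens": 100,
    "stream": False
})
headers = {"Content-Type": "application/json"}

conn.request("POST", "/v1/chat/completions", payload, headers)
res = conn.getresponse()
print(res.read().decode("utf-8"))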
Using the VS Code AI Toolkit with Python:
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5272/v1/",
    api_key="xyz"  # required by API but not used
)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "What is the capital of India?"}
    ],
    model="Phi-3-mini-128k-cuda-int4-onnx",
)
print(chat_completion.choices[0].message.content)
The Python script interacts with an API (a local, OpenAI-compatible endpoint) to get a chat completion.
Step-by-step explanation of Python Implementation:
- Importing the OpenAI library: The script starts by importing the OpenAI class from the openai package. This class is used to interact with an OpenAI-compatible API. The package can be installed with the following pip command.
pip install openai
Although we are not communicating with OpenAI's API, we are utilizing the OpenAI client library to interact with the model running on the local machine.
- Create an OpenAI client instance: Initialize an OpenAI client with a base URL pointing to http://127.0.0.1:5272/v1/, which indicates the API is hosted locally rather than on OpenAI's cloud servers. An API key, "xyz", is also provided; it is required by the client but not actually used by the local API, so it can be set to any value, not necessarily "xyz".
- Create a chat completion request: The script then creates a chat completion request using the chat.completions.create method of the client. This method is called with two parameters:
- messages: A list containing a single message dictionary where the role is set to "user" and the content is the question "What is the capital of India?". This structure mimics a chat interaction where a user asks a question.
- model: Specifies the model to use for generating the completion, in this case, "Phi-3-mini-128k-cuda-int4-onnx". This indicates a specific model configuration.
- Print the response: Finally, the script prints the content of the first message from the response's choices. The API returns a list of possible completions (choices), and the script accesses the content of the message from the first choice to display the answer to the user's question, as demonstrated in the Postman section.
Run the code in a new VS Code window, preferably inside a virtual environment. Python is a prerequisite. To learn more, click here.
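If you want to set up a virtual environment first, the standard commands are shown below (run them in the VS Code terminal; the environment name .venv is just a convention):
# create and activate a virtual environment (the name .venv is only a convention)
python -m venv .venv
# on Windows:
.venv\Scripts\activate
# on macOS/Linux:
source .venv/bin/activate
pip install openai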
Execute the code using the command
python <filename>.py #replace filename with the respective filename.
or click on the run button on the top right side of the Visual Studio Code window.
The response is now printed on the terminal.
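As noted in the Postman section, the stream parameter can also be set to true. The following is a minimal sketch of how a streamed response could be consumed with the same client, assuming the local AI Toolkit endpoint supports OpenAI-style streaming:
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5272/v1/",
    api_key="xyz"  # required by API but not used
)

# Request a streamed completion; the answer arrives in small chunks
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the capital of India?"}],
    model="Phi-3-mini-128k-cuda-int4-onnx",
    stream=True,
)

for chunk in stream:
    # Each chunk's delta may carry a fragment of the content (or None at the end)
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()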
Developing a Basic Application:
Similarly, we can use it in applications as well. To demonstrate, let's build a basic application using Streamlit and Python. Streamlit turns Python scripts into shareable web apps in minutes; to know more, click here. To install Streamlit, use the following command in the Python/VS Code terminal.
pip install streamlit
The following script creates a web-based chat interface using Streamlit where users can input queries, which are then sent to the AI model via the local OpenAI-compatible API server. The model's responses are displayed back in the chat interface, facilitating a conversational interaction.
import streamlit as st
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5272/v1/",
    api_key="xyz"  # required by API but not used
)

st.title("Chat with Phi-3")

query = st.chat_input("Enter query:")
if query:
    with st.chat_message("user"):
        st.write(query)
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant that provides structured answers."},
            {"role": "user", "content": query}
        ],
        model="Phi-3-mini-128k-cuda-int4-onnx",
    )
    with st.chat_message("assistant"):
        st.write(chat_completion.choices[0].message.content)
Since the above code uses Streamlit, the startup command has a different syntax; the command is as follows:
streamlit run <filename>.py #replace filename with the respective filename.
Upon successful execution, Streamlit opens the web page in a new browser window (for example, in Microsoft Edge).
This is how we can create GenAI applications using models running in the local VS Code AI Toolkit environment. In the upcoming blog, let's see how to apply retrieval-augmented generation using the AI Toolkit.