The Phi-3 mini models are small language models (SLMs) from Microsoft. The short context version, Phi-3-mini-4k-instruct, accepts prompts of up to 4K tokens, while the long context version (Phi-3-mini-128k-instruct) can accept much longer prompts and produce longer output text.
In this tutorial, we will use the ONNX build of the short context version, Phi-3-mini-4k-instruct-onnx, downloaded from Hugging Face.
Before we begin, install the Git Large File Storage (LFS) extension and the Hugging Face CLI; both are needed to download the ONNX models. This tutorial focuses on running the models on the CPU. If you have a GPU, you can instead use the DirectML or NVIDIA CUDA setups for better performance, depending on your operating system and hardware.
Setting up your Python Environment
Navigate to your project directory using the cd command.
For example:
cd path/to/your/project
Create a new virtual environment by running the following command:
python -m venv .venv
This will create a .venv directory in your project folder, containing an isolated Python environment.
Activate the virtual environment
On Windows:
.venv\Scripts\activate
On macOS/Linux:
source .venv/bin/activate
You’ll see the virtual environment name in your command prompt (e.g., (.venv)). Now you can install Python packages specific to your project without affecting the global Python installation.
If you prefer a different name for the virtual environment, replace .venv in the commands above with your chosen name.
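To confirm that the virtual environment is active, you can ask Python which interpreter it is running; the printed path should point inside your .venv folder:
python -c "import sys; print(sys.executable)"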
Prerequisites: Install Git Large File Storage (LFS) Support
For Windows
First, install the prerequisites.
Use the winget tool to install and manage applications (see the winget documentation on Microsoft Learn).
After App Installer is installed, you can run winget by typing 'winget' from a Command Prompt.
winget install -e --id GitHub.GitLFS
For macOS
brew install git-lfs
For Linux (Debian/Ubuntu)
sudo apt-get install git-lfs
Now initialize Git LFS:
git lfs install
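If you want to confirm the extension is installed correctly, printing its version is a quick check:
git lfs version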
Deploying the Phi-3 model from Hugging Face
Install the Hugging Face CLI
pip install huggingface-hub[cli]
Now we are going to download the Phi-3 model and run it on the device's CPU.
Downloading Phi-3 from Hugging Face
Download the Phi-3-mini-4k-instruct-onnx model. Below is a batch script that downloads the variant of the Phi-3 model that matches your hardware. Save it with a .bat extension (e.g., download_phi3_model.bat) and run it:
@echo off
setlocal
REM Select which model to download
echo.
echo Choose an option:
echo 1. Download the Phi-3 Model for CPU
echo 2. Download the Phi-3 Model for Nvidia Cuda
echo 3. Download the Phi-3 Model for DirectML
set /p option=Enter the option number:
if "%option%"=="1" (
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
) else if "%option%"=="2" (
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
) else if "%option%"=="3" (
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .
) else (
echo Invalid option. Please choose 1, 2, or 3.
)
endlocal
The CPU option downloads the model into a folder called cpu_and_mobile; the CUDA and DirectML options download into cuda and directml folders respectively.
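The batch script above is Windows-specific. On macOS or Linux (or if you simply prefer Python), the same download can be done with the huggingface_hub library that ships with the CLI. This is a minimal sketch for the CPU variant; adjust allow_patterns if you want the CUDA or DirectML files instead:
# download_phi3.py - minimal sketch using huggingface_hub's snapshot_download
from huggingface_hub import snapshot_download

# Download only the CPU int4 variant of Phi-3-mini-4k-instruct-onnx into the current folder
snapshot_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-onnx",
    allow_patterns=["cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*"],
    local_dir=".",
)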
Next, install the ONNX Runtime generate() API package (onnxruntime-genai). Below is a batch script that lets you select which installation option to use. Save this script with a .bat extension (e.g., install_onnx_runtime.bat) and run it:
@echo off
setlocal
REM Install the numpy library
pip install numpy
REM Pick which ONNX runtime to install
echo.
echo Choose an option:
echo 1. For CPU (onnxruntime-genai)
echo 2. For GPU (onnxruntime-genai-cuda)
echo 3. For DirectML (onnxruntime-genai-directml)
set /p option=Enter the option number:
if "%option%"=="1" (
pip install --pre onnxruntime-genai
) else if "%option%"=="2" (
pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
) else if "%option%"=="3" (
pip install --pre onnxruntime-genai-directml
) else (
echo Invalid option. Please choose 1, 2, or 3.
)
endlocal
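To confirm the installation succeeded, try importing the package. All three options (CPU, CUDA, and DirectML) expose the same onnxruntime_genai Python module, so this check works regardless of which one you chose:
python -c "import onnxruntime_genai as og; print('onnxruntime-genai imported successfully')"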
Run the model using a Python script with a command-line switch for model selection
Depending on which requirements you installed (CPU, DirectML, or CUDA), run the Python script below with the matching switch.
For CPU
python filename.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4
For DirectML
python filename.py -m directml\directml-int4-awq-block-128
For CUDA
python filename.py -m cuda/cuda-int4-rtn-block-32
The complete Python script
Below is the complete, runnable Python script. Save it to a .py file and run it with python your_script_name.py, passing the path to your ONNX model folder via --model (or -m).
import onnxruntime_genai as og
import argparse
import time

def main(args):
    # If verbose mode is on, print loading model message
    if args.verbose: print("Loading model...")

    # If timings mode is on, initialize timing variables
    if args.timings:
        started_timestamp = 0
        first_token_timestamp = 0

    # Load the model
    model = og.Model(f'{args.model}')
    if args.verbose: print("Model loaded")

    # Initialize the tokenizer with the model
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()
    if args.verbose: print("Tokenizer created")

    # Print a newline for readability if verbose mode is on
    if args.verbose: print()

    # Create a dictionary of search options from the command line arguments
    search_options = {name: getattr(args, name) for name in ['do_sample', 'max_length', 'min_length', 'top_p', 'top_k', 'temperature', 'repetition_penalty'] if name in args}

    # Set a default max length if one is not provided
    if 'max_length' not in search_options:
        search_options['max_length'] = 2048

    # Define a template for the chat input
    chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

    # Main loop: ask for input and generate responses
    while True:
        # Get user input
        text = input("Input: ")

        # If the input is empty, print an error message and continue to the next iteration
        if not text:
            print("Error, input cannot be empty")
            continue

        # If timings mode is on, record the start time
        if args.timings: started_timestamp = time.time()

        # Format the input with the chat template
        prompt = f'{chat_template.format(input=text)}'

        # Tokenize the input
        input_tokens = tokenizer.encode(prompt)

        # Set up the generator parameters
        params = og.GeneratorParams(model)
        params.try_use_cuda_graph_with_max_batch_size(1)
        params.set_search_options(**search_options)
        params.input_ids = input_tokens

        # Create the generator
        generator = og.Generator(model, params)
        if args.verbose: print("Generator created")

        # Print a message if verbose mode is on
        if args.verbose: print("Running generation loop ...")

        # If timings mode is on, initialize variables for the generation loop
        if args.timings:
            first = True
            new_tokens = []

        # Print the output prompt
        print()
        print("Output: ", end='', flush=True)

        # Generation loop: produce one token at a time and stream it to the console
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()

            # Record the time to first token if timings mode is on
            if args.timings and first:
                first_token_timestamp = time.time()
                first = False

            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
            if args.timings: new_tokens.append(new_token)
        print()

        # Release the generator before the next prompt
        del generator

        # If timings mode is on, report prompt processing and generation throughput
        if args.timings:
            prompt_time = first_token_timestamp - started_timestamp
            run_time = time.time() - first_token_timestamp
            print(f"Prompt length: {len(input_tokens)}, New tokens: {len(new_tokens)}, Time to first token: {prompt_time:.2f}s, New tokens per second: {len(new_tokens)/run_time:.2f} tps")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run the chatbot script")
    parser.add_argument("-m", "--model", type=str, required=True, help="Path to the ONNX model folder")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose mode")
    parser.add_argument("--timings", action="store_true", help="Enable timings mode")
    args = parser.parse_args()
    main(args)
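For example, assuming you saved the script as phi3_chat.py and downloaded the CPU model, you can start an interactive session with verbose logging and timings enabled:
python phi3_chat.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --verbose --timings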
In conclusion, the Phi-3 mini models are capable small language models for local text generation. They can run on a variety of devices, including CPUs and GPUs. By following the steps in this tutorial, you can download and run these models on your own machine.
Resources
Phi-3 Cookbook: https://aka.ms/phi-3cookbook