In 2024, AI is ushering in the era of the AI PC. On May 20, Microsoft introduced the concept of the Copilot+ PC, a PC that can run SLMs/LLMs more efficiently with the support of an NPU. We can combine models from the Phi-3 family with the new AI PCs to build a simple, personalized Copilot application. This article uses an Intel AI PC together with Intel's OpenVINO, the Intel NPU Acceleration Library, and Microsoft's DirectML to create a local Copilot. An on-demand recording of the Microsoft Copilot+ PC event from May 20 is available.
Introducing the Phi-3 Family
Phi-3-Mini
Phi-3-Mini is a Transformer-based language model with 3.8 billion parameters. It was trained on high-quality data containing educationally useful information, augmented with new data sources consisting of various NLP synthetic texts and both internal and external chat datasets, which significantly improve its chat capabilities. Within the Phi-3 family, the Mini version comes in two variants, 4K and 128K, named for the context length (in tokens) each supports.
Phi-3-mini is a 3.8B parameter language model, available in two context lengths 128K and 4K.
Phi-3-Small
Phi-3-Small is a Transformer-based language model with 7 billion parameters. It was trained on high-quality data containing educationally useful information, augmented with new data sources consisting of various NLP synthetic texts and both internal and external chat datasets, which significantly improve its chat capabilities. Phi-3-Small is also trained more intensively on multilingual datasets than Phi-3-Mini. It comes in two variants, 8K and 128K, named for the context length (in tokens) each supports.
Phi-3-small is a 7B parameter language model, available in two context lengths 128K and 8K.
Phi-3-Medium
Phi-3-Medium is a Transformer-based language model with 14 billion parameters. It was trained on high-quality data containing educationally useful information, augmented with new data sources consisting of various NLP synthetic texts and both internal and external chat datasets, which significantly improve its chat capabilities. It comes in two variants, 4K and 128K, named for the context length (in tokens) each supports.
Phi-3-medium is a 14B parameter language model, available in two context lengths 128K and 4K.
Phi-3-Vision
Phi-3-Vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. The model belongs to the Phi-3 family, and the multimodal version supports a context length of 128K tokens. It underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
Phi-3-vision is a 4.2B parameter multimodal model with language and vision capabilities.
For AI PCs, I personally recommend Phi-3-mini. Phi-3-small, Phi-3-vision, and Phi-3-medium are better suited to running on NVIDIA CUDA devices.
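One way to see why the smallest model fits an AI PC best is to estimate how much memory the weights alone occupy at different precisions. The sketch below is only a back-of-the-envelope calculation: the parameter counts come from the descriptions above, the bytes-per-weight values are the standard sizes of each format, and it ignores activations, the KV cache, and any quantization metadata.

```python
# Parameter counts (in billions) as stated in the model descriptions above.
PHI3_PARAMS_BILLIONS = {
    "Phi-3-mini": 3.8,
    "Phi-3-vision": 4.2,
    "Phi-3-small": 7.0,
    "Phi-3-medium": 14.0,
}

# Storage cost of a single weight in each numeric format.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate weight storage in GB (10^9 bytes); weights only."""
    return params_billions * BYTES_PER_WEIGHT[dtype]


for name, params in PHI3_PARAMS_BILLIONS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB"
        for dtype in BYTES_PER_WEIGHT
    )
    print(f"{name:<13} {sizes}")
```

By this estimate, Phi-3-mini needs roughly a quarter of the weight memory of Phi-3-medium at the same precision, which is what makes it practical on shared laptop memory.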
What is an NPU
An NPU (Neural Processing Unit) is a dedicated processor, or a processing unit on a larger SoC, designed specifically to accelerate neural network operations and AI tasks. Unlike general-purpose CPUs and GPUs, NPUs are optimized for data-driven parallel computing, which makes them highly efficient at processing massive multimedia data such as videos and images and at processing data for neural networks. They are particularly adept at AI-related tasks such as speech recognition, background blurring in video calls, and photo or video editing processes like object detection.
NPU vs GPU
While many AI and machine learning workloads run on GPUs, there’s a crucial distinction between GPUs and NPUs. GPUs are known for their parallel computing capabilities, but not all GPUs are equally efficient beyond processing graphics. NPUs, on the other hand, are purpose-built for complex computations involved in neural network operations, making them highly effective for AI tasks.
In summary, NPUs are the math whizzes that turbocharge AI computations, and they play a key role in the emerging era of AI PCs!
This example is based on Intel’s latest Intel Core Ultra processor.
1. Use NPU to run Phi-3 model
The Intel® NPU is an AI inference accelerator integrated into Intel client CPUs, starting with the Intel® Core™ Ultra generation of CPUs (formerly known as Meteor Lake). It enables energy-efficient execution of artificial neural network tasks.
Intel NPU Acceleration Library
The Intel NPU Acceleration Library https://github.com/intel/intel-npu-acceleration-library is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.
Install the Python Library with pip
pip install intel-npu-acceleration-library
Note: The project is still under development, but the reference model is already very complete.
Running Phi-3 with Intel NPU Acceleration Library
Using Intel NPU acceleration does not change the usual coding workflow. You only need to use this library to quantize the original Phi-3 model to a type such as FP16 or INT4:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import intel_npu_acceleration_library
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"
# Load the original model and its tokenizer from Hugging Face
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", use_cache=True, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("Compile model for the NPU")
# Compile the model for the NPU in FP16
model = intel_npu_acceleration_library.compile(model, dtype=torch.float16)
After quantization succeeds, continue on to call the NPU to run the Phi-3 model.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}
query = "<|system|>You are a helpful AI assistant.<|end|><|user|>Can you introduce yourself?<|end|><|assistant|>"
output = pipe(query, **generation_args)
# Print the generated reply (a bare expression only displays in a notebook)
print(output[0]['generated_text'])
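The query string above follows the Phi-3 chat layout: each turn is wrapped as `<|role|>text<|end|>`, and the prompt ends with an open `<|assistant|>` marker so the model writes the reply. As a small sketch, a helper like the following (the name `build_phi3_prompt` is hypothetical, not part of any library) builds the same string from a list of turns:

```python
def build_phi3_prompt(messages):
    """Join (role, text) pairs with Phi-3 markers and leave the assistant turn open."""
    parts = [f"<|{role}|>{text}<|end|>" for role, text in messages]
    return "".join(parts) + "<|assistant|>"


query = build_phi3_prompt([
    ("system", "You are a helpful AI assistant."),
    ("user", "Can you introduce yourself?"),
])
print(query)
```

When the tokenizer ships a chat template, `tokenizer.apply_chat_template` from transformers is usually the more robust way to produce the prompt.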
While the code is running, we can view the status of the NPU in Task Manager.
2. Use DirectML + ONNX Runtime to run Phi-3 Model
What is DirectML
DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.
When used standalone, the DirectML API is a low-level DirectX 12 library and is suitable for high-performance, low-latency applications such as frameworks, games, and other real-time applications. The seamless interoperability of DirectML with Direct3D 12 as well as its low overhead and conformance across hardware makes DirectML ideal for accelerating machine learning when both high performance is desired, and the reliability and predictability of results across hardware is critical.
Note: The latest DirectML already supports NPUs (https://devblogs.microsoft.com/directx/introducing-neural-processor-unit-npu-support-in-directml-developer-preview/).
Comparing DirectML and CUDA in terms of their capabilities and performance:
DirectML is a machine learning library developed by Microsoft. It is designed to accelerate machine learning workloads on Windows devices, including desktops, laptops, and edge devices.
- DX12-Based: DirectML is built on top of DirectX 12 (DX12), which provides a wide range of hardware support across GPUs, including both NVIDIA and AMD.
- Wider Support: Since it leverages DX12, DirectML can work with any GPU that supports DX12, even integrated GPUs.
- Image Processing: DirectML processes images and other data using neural networks, making it suitable for tasks like image recognition, object detection, and more.
- Ease of Setup: Setting up DirectML is straightforward, and it doesn’t require specific SDKs or libraries from GPU manufacturers.
- Performance: In some cases, DirectML performs well and can be faster than CUDA, especially for certain workloads.
- Limitations: However, there are instances where DirectML may be slower, particularly for large batch sizes in float16.
CUDA is NVIDIA’s parallel computing platform and programming model. It allows developers to harness the power of NVIDIA GPUs for general-purpose computing, including machine learning and scientific simulations.
- NVIDIA-Specific: CUDA is tightly integrated with NVIDIA GPUs and is specifically designed for them.
- Highly Optimized: It provides excellent performance for GPU-accelerated tasks, especially when using NVIDIA GPUs.
- Widely Used: Many machine learning frameworks and libraries (such as TensorFlow and PyTorch) have CUDA support.
- Customization: Developers can fine-tune CUDA settings for specific tasks, which can lead to optimal performance.
- Limitations: However, CUDA’s dependency on NVIDIA hardware can be limiting if you want broader compatibility across different GPUs.
Choosing Between DirectML and CUDA:
The choice between DirectML and CUDA depends on your specific use case, hardware availability, and preferences. If you’re looking for broader compatibility and ease of setup, DirectML might be a good choice. However, if you have NVIDIA GPUs and need highly optimized performance, CUDA remains a strong contender. In summary, both DirectML and CUDA have their strengths and weaknesses, so consider your requirements and available hardware when making a decision.
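In ONNX Runtime, this choice shows up as the execution provider used for a session. The sketch below picks the first available provider from a preference list. The provider names are the ones ONNX Runtime uses for CUDA, DirectML, and CPU; the helper `pick_provider` and the preference order are assumptions for illustration, and the available list is hard-coded so the example runs without ONNX Runtime installed.

```python
# Preference order is an assumption: CUDA on NVIDIA hardware, DirectML for any
# DX12 device, CPU as the fallback.
PREFERRED_PROVIDERS = [
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]


def pick_provider(available, preferred=PREFERRED_PROVIDERS):
    """Return the first preferred execution provider present in `available`."""
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError(f"None of {preferred} found in {list(available)}")


# With onnxruntime installed, the list would come from
# onnxruntime.get_available_providers(); here it is hard-coded.
available = ["DmlExecutionProvider", "CPUExecutionProvider"]
print(pick_provider(available))
```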
Generative AI with ONNX Runtime
In the era of AI, the portability of AI models is very important. ONNX Runtime makes it easy to deploy trained models to different devices. Developers do not need to worry about the inference framework and can use a unified API to run model inference. For generative AI, ONNX Runtime has also been optimized (https://onnxruntime.ai/docs/genai/). With the optimized ONNX Runtime, quantized generative AI models can run inference on different devices. Generative AI with ONNX Runtime exposes its inference API through Python, C#, and C/C++. Deployment on iPhone, for example, can use the C++ API of Generative AI with ONNX Runtime.
Compile the Generative AI with ONNX Runtime library
winget install --id=Kitware.CMake -e
git clone https://github.com/microsoft/onnxruntime.git
cd .\onnxruntime\
./build.bat --build_shared_lib --skip_tests --parallel --use_dml --config Release
cd ../
git clone https://github.com/microsoft/onnxruntime-genai.git
cd .\onnxruntime-genai\
mkdir ort
cd ort
mkdir include
mkdir lib
cd ..
copy ..\onnxruntime\include\onnxruntime\core\providers\dml\dml_provider_factory.h ort\include
copy ..\onnxruntime\include\onnxruntime\core\session\onnxruntime_c_api.h ort\include
copy ..\onnxruntime\build\Windows\Release\Release\*.dll ort\lib
copy ..\onnxruntime\build\Windows\Release\Release\onnxruntime.lib ort\lib
python build.py --use_dml
Install the library
pip install .\onnxruntime_genai_directml-0.3.0.dev0-cp310-cp310-win_amd64.whl
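Once the wheel is installed, a Phi-3 ONNX model can be run from Python. The following is a minimal sketch that assumes the Python API of onnxruntime-genai 0.3.x (later releases changed parts of this interface) and a local folder containing a DirectML-ready Phi-3 ONNX model. The function name `generate_with_genai` and the `model_dir` argument are hypothetical.

```python
def generate_with_genai(model_dir, prompt, max_length=512):
    """Stream a completion from an ONNX model folder (assumes onnxruntime-genai 0.3.x)."""
    # Imported lazily so the helper can be defined without the package installed.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    stream = tokenizer.create_stream()

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    params.input_ids = tokenizer.encode(prompt)

    generator = og.Generator(model, params)
    pieces = []
    # Generate one token at a time and print it as soon as it is decoded.
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        token = generator.get_next_tokens()[0]
        piece = stream.decode(token)
        pieces.append(piece)
        print(piece, end="", flush=True)
    return "".join(pieces)
```

A call would look like `generate_with_genai(model_dir, "<|user|>Can you introduce yourself?<|end|><|assistant|>")`, with `model_dir` pointing at the ONNX model folder.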
This is the running result.
3. Use Intel OpenVINO to run Phi-3 Model
What is OpenVINO
OpenVINO is an open-source toolkit for optimizing and deploying deep learning models. It provides boosted deep learning performance for vision, audio, and language models from popular frameworks like TensorFlow, PyTorch, and more. OpenVINO can also be used with the CPU and GPU to run the Phi-3 model.
Note: OpenVINO does not currently support the NPU.
Install OpenVINO Library
pip install git+https://github.com/huggingface/optimum-intel.git
pip install git+https://github.com/openvinotoolkit/nncf.git
pip install openvino-nightly
Running Phi-3 with OpenVINO
As with the NPU, OpenVINO runs generative AI models in quantized form. We first need to quantize the Phi-3 model, which can be done on the command line with optimum-cli.
INT4
optimum-cli export openvino --model "microsoft/Phi-3-mini-4k-instruct" --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.6 --sym --trust-remote-code ./openvinomodel/phi3/int4
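To give some intuition for the `--group-size` and `--sym` options in the command above, here is a pure-Python sketch of group-wise symmetric 4-bit quantization: weights are split into groups, each group gets its own scale, and every weight is stored as a small signed integer. This illustrates the general technique only. It is not NNCF's implementation, the scale rule (maximum absolute value divided by 7) is an assumption, and a group size of 4 stands in for 128 to keep the example short.

```python
def quantize_group_sym_int4(weights, group_size):
    """Return (codes, scales): one scale per group, integer codes in [-7, 7]."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric: zero maps to zero, and the largest weight maps to +/-7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in group)
    return codes, scales


def dequantize(codes, scales, group_size):
    """Rebuild approximate weights from integer codes and per-group scales."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]


# The second group holds much smaller weights than the first; a per-group
# scale keeps them from all collapsing to zero.
weights = [0.7, -0.35, 0.1, 0.0, 0.02, -0.014, 0.007, 0.001]
codes, scales = quantize_group_sym_int4(weights, group_size=4)
restored = dequantize(codes, scales, group_size=4)
print(codes)
print([round(w, 4) for w in restored])
```

Smaller groups track the weights more closely but store more scales, which is the trade-off the group size controls.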
FP16
optimum-cli export openvino --model "microsoft/Phi-3-mini-4k-instruct" --task text-generation-with-past --weight-format fp16 --trust-remote-code ./openvinomodel/phi3/fp16
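The export writes the converted model to the output directory given on the command line (for example ./openvinomodel/phi3/int4).
Load the model through OVModelForCausalLM, passing the model path (model_dir), the related configuration (ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}), and the hardware-accelerated device (GPU.0):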
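from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig

# Output directory of the INT4 export above
model_dir = "./openvinomodel/phi3/int4"
ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device='GPU.0',
    ov_config=ov_config,
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)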
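With the model loaded, generation works much like it does with a regular transformers model. The sketch below wraps loading and generation in one helper. The function name `generate_with_openvino` and its defaults are assumptions for illustration, and it expects `model_dir` to be an export directory produced by optimum-cli, which also contains the tokenizer files.

```python
def generate_with_openvino(model_dir, prompt, device="GPU.0", max_new_tokens=128):
    """Load an exported OpenVINO model and return the generated reply text."""
    # Imported lazily so the helper can be defined without the packages installed.
    from optimum.intel.openvino import OVModelForCausalLM
    from transformers import AutoConfig, AutoTokenizer

    ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    ov_model = OVModelForCausalLM.from_pretrained(
        model_dir,
        device=device,
        ov_config=ov_config,
        config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
        trust_remote_code=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = ov_model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For example, `generate_with_openvino("./openvinomodel/phi3/int4", "<|user|>Can you introduce yourself?<|end|><|assistant|>")` would run the INT4 export on the first GPU.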
While the code is running, we can view the status of the GPU in Task Manager.
Note: Each of the three methods above has its own advantages, but NPU acceleration is recommended for AI PC inference.
Resources
- Phi-3 Microsoft Blog https://aka.ms/phi3blog-april
- Phi-3 technical report https://aka.ms/phi3-tech-report
- Phi-3 Cookbook https://aka.ms/Phi-3CookBook