Blog Post

Educator Developer Blog

5 MIN READ

Running Phi-3-vision via ONNX on Jetson Platform

Brass Contributor

Jul 19, 2024

Hi, I'm Jambo a Microsoft Learn Student Ambassador.

This article aims to run the quantized Phi-3-vision model in ONNX format on the Jetson platform and successfully perform inference for image+text dialogue tasks.

Writing Environment:

Jetpack 6.0 [L4T 36.3.0]
Compilation Platform: Jetson Orin
Inference Platform: Jetson Orin Nano

What is Jetson?

The Jetson platform, introduced by NVIDIA, consists of small arm64 devices equipped with powerful GPU computing capabilities. Designed specifically for edge computing and AI applications, Jetson devices run on Linux, enabling complex computing tasks with low power consumption. This makes them ideal for developing embedded AI and machine learning projects.

For other versions of the Phi-3 model, we can use llama.cpp to convert them into GUFF format to run on Jetson, and easily switch between different quantizations. Alternatively, you can conveniently use services like ollama or llamaedge which are based on llama.cpp. More information can be found in the Phi-3CookBook.

However, for the vision version, there is currently no way to convert it into GUFF format (#7444). Additionally, resource-constrained edge devices struggle to run the original model without quantization via transformers. Therefore, we can use ONNX Runtime to run the quantized model in ONNX format.

What is ONNX Runtime?

ONNX Runtime is a high-performance inference engine designed to accelerate and execute AI models in the ONNX (Open Neural Network Exchange) format. The onnxruntime-genai is an API specifically built for LLM (Large Language Model) models, providing a simple way to run models like Llama, Phi, Gemma, and Mistral.

When writing this article, onnxruntime-genai does not have a precompiled version for aarch64 + GPU, so we need to compile it ourselves.

Compiling onnxruntime-genai

Preparation

Upgrade CMake

sudo apt purge cmake
pip3 install cmake -U

Install cuDNN 9

Cloning the onnxruntime-genai Repository

git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai

The latest onnxruntime-genai repository cannot be successfully compiled for unknown reasons, so we need to switch to an earlier commit. Below is the latest commit that has been tested and can be successfully compiled:

git checkout 940bc102a317e886f488ad5e120533b96a34ddcd

ONNXRuntime

You can compile ONNXRuntime from the source yourself, but this can be a very time-consuming process for the Jetson platform. Therefore, we will directly use the version compiled by dusty-nv for the Jetson platform. Do not worry about the cu124 in the URL; it runs well on CUDA 12.2.

wget http://jetson.webredirect.org:8000/jp6/cu124/onnxruntime-gpu-1.19.0.tar.gz
mkdir ort
tar -xvf onnxruntime-gpu-1.19.0.tar.gz -C ort

mv ort/include/onnxruntime/onnxruntime_c_api.h ort/include/
rm -rf ort/include/onnxruntime/

Compiling onnxruntime-genai

You should still be in the onnxruntime-genai directory at this point.

Now we need to prepare to build the Python API. You can use Python >=3.6 for the compilation. JetPack 6.0 comes with Python 3.10 by default, but you can switch to other versions for the compilation. The compiled whl can only be installed on the Python version used during the compilation.

Note: The compilation process will require a significant amount of memory. Therefore, if your Jetson device has limited memory (like the Orin NX), do not use the --parallel parameter.

python3 build.py --use_cuda --cuda_home /usr/local/cuda-12.2 --skip_tests --skip_csharp [--parallel]

The compiled files will be located in the build/Linux/Release/dist/wheel directory, and we only need the .whl file. Note that the .whl file should be around 110 MB.

You can copy the whl file to other Jetson platforms with the same environment (CUDA) for installation.

Note: The generated subdirectory may differ, but we only need the .whl file from the build directory.

Installing onnxruntime-genai

If you have multiple CUDA versions, you might need to set the CUDA_PATH environment variable to ensure it points to the same version used during compilation.

export CUDA_PATH=/usr/local/cuda-12.2

Navigate to the directory where the whl file is located, or copy the whl file to another directory for installation using the following command.

pip3 install *.whl

Running the Phi-3-vision Model

Downloading the Model

Download the Phi-3-vision model for onnx-cuda from huggingface.

pip3 install huggingface-hub[cli]

The FP16 model requires 8 GB of VRAM. If you are running on a device with more resources like the Jetson Orin, you can opt for the FP32 model.

The Int 4 model is a quantized version, requiring only 3 GB of VRAM. This is suitable for more compact devices like the Jetson Orin Nano.

huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-fp16/* --local-dir .
# Or
huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-int4-rtn-block-32/* --local-dir .

Running the Example Script

Download the official example script and an example image.

# Download example script
wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py
# Download example image
wget https://onnxruntime.ai/images/table.png

Run the example script.

python3 phi3v.py -m cuda-int4-rtn-block-32

First, input the path to the image, for example, table.png.

Next, input the prompt text, for example: Convert this image to markdown format.

```markdown
| Product             | Qtr 1    | Qtr 2    | Grand Total |
|---------------------|----------|----------|-------------|
| Chocolade           | $744.60  | $162.56  | $907.16     |
| Gummibarchen        | $5,079.60| $1,249.20| $6,328.80   |
| Scottish Longbreads | $1,267.50| $1,062.50| $2,330.00   |
| Sir Rodney's Scones | $1,418.00| $756.00  | $2,174.00   |
| Tarte au sucre      | $4,728.00| $4,547.92| $9,275.92   |
| Chocolate Biscuits  | $943.89  | $349.60  | $1,293.49   |
| Total               | $14,181.59| $8,127.78| $22,309.37  |
```

The table lists various products along with their sales figures for Qtr 1, Qtr 2, and the Grand Total. The products include Chocolade, Gummibarchen, Scottish Longbreads, Sir Rodney's Scones, Tarte au sucre, and Chocolate Biscuits. The Grand Total column sums up the sales for each product across the two quarters.

Note: The first round of dialogue during script execution might be slow, but subsequent dialogues will be faster.

We can use Jtop to monitor resource usage:

The above inference is run on Jetson Orin Nano using the Int 4 quantized model. As shown, the Python process occupies 5.4 GB of VRAM for inference, with minimal CPU load and nearly full GPU utilization during inference.

We can modify the example script to use the time function at key points to measure the inference speed, which is remarkably fast.

All of this is achieved on a device with a power consumption of just 15W.

Updated Jul 29, 2024

Version 3.0

Brass Contributor

Joined February 12, 2023

View Profile

Educator Developer Blog

Follow this blog board to get notified when there's new activity

RBrown955
Copper Contributor
Aug 29, 2024
Heya! Have we gotten any word from Dusty regarding a container yet?

Jambo0321
Jambo0321
Brass Contributor
Aug 05, 2024
RBrown955 I have not yet posted this process on the NVIDIA forums.

For other models, you can still try running GGUF models using LlamaEdge (vision models are still not supported for conversion). It has better support on Jetson compared to Ollama (but Jetson currently supports CUDA 12).

I also greatly appreciate your efforts in trying out my process, as it helps me significantly in refining the entire workflow.
RBrown955
Copper Contributor
Aug 05, 2024
At anyrate, I appreciate you working through this.
Is this crossposted on the dev forum yet?
Jambo0321
Brass Contributor
Aug 05, 2024
RBrown955 I have encountered this issue on the Orin Nano, but the same script did not have this problem on the Orin (with --parallel). Additionally, when I exported the image built on the Orin and ran it on the Nano, the issue did not occur.

I will continue to try on the Nano, but the best solution would still be for the ONNX team or the NVIDIA team to provide relevant support.

RBrown955

Copper Contributor

Aug 05, 2024

I was able to sucessfully build the container using the docker files after removing the parallel flag. It took 30 minutes.

docker build --tag phi3_vision .
[+] Building 1963.8s (8/8) FINISHED                              docker:default
 => [internal] load build definition from dockerfile                       0.0s
 => => transferring dockerfile: 134B                                       0.0s
 => [internal] load metadata for docker.io/dustynv/onnxruntime:r36.2.0     0.0s
 => [internal] load .dockerignore                                          0.0s
 => => transferring context: 2B                                            0.0s
 => [internal] load build context                                          0.0s
 => => transferring context: 856B                                          0.0s
 => CACHED [1/3] FROM docker.io/dustynv/onnxruntime:r36.2.0                0.0s
 => [2/3] COPY build_genai.sh /tmp/genai/                                  0.0s
 => [3/3] RUN /tmp/genai/build_genai.sh                                 1922.4s
 => exporting to image                                                    41.1s 
 => => exporting layers                                                   41.0s 
 => => writing image sha256:82ebcdac2fc6a77810c1a10c8af3465107b53fe1ad2df  0.0s 
 => => naming to docker.io/library/phi3_vision                             0.0s

The .whl file also installed correctly when installing to Dusty's onnx image.

However, I think the build_genai.sh file forgets to include

pip3 install /ort/*.whl

Before installing, I got

Traceback (most recent call last):
  File "/home/phi3v.py", line 9, in <module>
    import onnxruntime_genai as og
ModuleNotFoundError: No module named 'onnxruntime_genai'

Unfortunately, after installing the whl file, and running the example script I got a different error than you posted

$ python3 phi3v.py -m cuda-int4-rtn-block-32
Loading model...
terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
  what():  /opt/onnxruntime/onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/cudaDriverWrapper.cc:42 onnxruntime::contrib::cuda::CUDADriverWrapper::CUDADriverWrapper() handle != nullptr was false. 

Aborted (core dumped)

I went ahead and installed cuDDN 9 but received the same error.

Jambo0321
Brass Contributor
Aug 04, 2024
onnx-genai

This is a wheel file that I compiled. It contains two versions:

genai compiled based on a self-compiled onnxruntime on cuda-12.2.

genai compiled based on the precompiled ort-cuda12.4 library from Dusty.

Additionally, in the version based on Dusty's library, I have included the Dockerfile and the compilation script that I used.

RBrown955
RBrown955
Copper Contributor
Aug 04, 2024
Jambo0321 pardon my ignorance but could you share your whl and I run on my system?
RBrown955
Copper Contributor
Aug 04, 2024
I'm having a lot of difficulty setting up my cross compile environment so hopefully Dusty will be able to come through with a container soon.
RBrown955
Copper Contributor
Jul 30, 2024
Due to memory constraints, compiling on Jetson devices is challenging (Orin requires single-threaded compilation, and Orin Nano is unable to compile).

Maybe this is my issue. I am using orin nano dev kit and it fails at 45%. I will try cross compiling on a different host.
Jambo0321
Brass Contributor
Jul 29, 2024
dlsuper RBrown955 Please try the new process:

Upgrade CMake

sudo apt purge cmake pip3 install cmake -U

Cloning the onnxruntime-genai Repository

git clone https://github.com/microsoft/onnxruntime-genai cd onnxruntime-genai git checkout 940bc102a317e886f488ad5e120533b96a34ddcd

ONNXRuntime

wget http://jetson.webredirect.org:8000/jp6/cu124/onnxruntime-gpu-1.19.0.tar.gz mkdir ort tar -xvf onnxruntime-gpu-1.19.0.tar.gz -C ort mv ort/include/onnxruntime/onnxruntime_c_api.h ort/include/ rm -rf ort/include/onnxruntime/

Compiling onnxruntime-genai

python3 build.py --use_cuda --cuda_home /usr/local/cuda-12.2 --skip_tests --skip_csharp [--parallel]

If you encounter the following issues when runtime, please try Install cuDNN 9

If the above process still doesn’t work, you will have to compile ONNXRuntime yourself to get the library and compile onnx-genai. I will also update my article.

I am still looking for a more convenient process, but the most convenient way is through making a Docker image or having the ONNX team provide a precompiled whl. Currently, Dusty has contacted me and expressed interest in creating the image, but this will take time. I also appreciate you trying my method, as it helps me identify where the issues are.