Imagine transforming your everyday device into an AI powerhouse, capable of handling complex tasks with ease and efficiency. With the introduction of Surface's new line of products, powered by cutting-edge Neural Processing Units (NPUs), this vision is now a reality. These specialized hardware components are set to revolutionize AI performance, bringing unprecedented speed, efficiency, and privacy to your fingertips.
What's an NPU?
The NPU architecture is optimized for matrix operations, executing neural network layers quickly and accelerating inference with remarkable efficiency. By offloading AI workloads from the CPU or GPU, the NPU reduces power consumption, enabling better performance, longer-lasting battery life, and improved thermal management on the device. The NPU runs models locally on the device, reducing latency for real-time applications and giving developers more control over data privacy – the data never has to leave the device.
To learn more, see these Qualcomm resources:
- What is an NPU? And why is it key to unlocking on-device generative AI? | Qualcomm
- Whitepaper describing the need for an NPU and heterogeneous computing
Our journey begins
As data scientists on the Surface development team, we couldn’t wait to get our hands on our new PCs and leverage the NPU for our own models and applications. But before jumping into the deep end, we first needed to understand the basics – how to connect to the NPU and run models on it. In this article, we’ll share the story of how we ran a “Hello World” model on the NPU, so you can do it too. We’ll walk you through the steps we took to deploy a classic Convolutional Neural Network (CNN) model, Handwritten Digit Classification, onto an NPU.
Building the model
We started by creating a “Hello World” neural network model using the publicly available MNIST dataset. The MNIST dataset contains images of handwritten digits (MNIST — Torchvision main documentation), and there are numerous blog posts online that provide guidance on how to train a classification model. (Here’s the one we used: MNIST Handwritten Digit Recognition in PyTorch – Nextjournal.) Note that you can swap in any other trained model if you prefer.
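For reference, here’s a minimal sketch of the kind of CNN we mean, along the lines of the tutorial above – the class name, layer sizes, and training details are illustrative, so feel free to substitute your own architecture:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """Small CNN for 28x28 grayscale MNIST digits (illustrative architecture)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)   # 1x28x28 -> 10x24x24
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)  # 10x12x12 -> 20x8x8
        self.fc1 = nn.Linear(320, 50)                  # 20x4x4 = 320 features after pooling
        self.fc2 = nn.Linear(50, 10)                   # 10 digit classes

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)

model = Net()
# ... train on the MNIST training set as in the tutorial before exporting ...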
Converting to ONNX
Since PyTorch models cannot directly run on the NPU, we converted our model to ONNX format using the PyTorch function `torch.onnx.export`. ONNX is a file format that allows models to be run on various hardware platforms and runtime environments. You can find a detailed tutorial on converting models to ONNX here: Exporting a Model from PyTorch to ONNX. If you use another framework, like TensorFlow, you will need to follow a similar conversion process.
TIP: We found that the NPU on Surface will not accept dynamic axes. Therefore, when you export the model to ONNX format, you need to fix all the input dimensions, including batch size. Here is the code snippet we used to convert our PyTorch model to ONNX format.
dummy_input = torch.randn(1, 1, 28, 28) # dummy input has the same dimensions as the actual MNIST data: batch size*C*H*W = 1*1*28*28
torch.onnx.export(model, # model being run
dummy_input, # model input (or a tuple for multiple inputs)
"model.onnx", # where to save the model (can be a file or file-like object)
export_params=True, # store the trained parameter weights inside the model file
opset_version=11, # the ONNX version to export the model to
do_constant_folding=True, # whether to execute constant folding for optimization
input_names = ['input'], # the model's input names
output_names = ['output'] # the model's output names
)
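As a quick sanity check before moving on, you can load the exported file with the onnx package (assuming it’s installed) and confirm that the graph is well-formed and has the fixed input shape you expect:
import onnx

onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)  # raises an error if the exported model is malformed
print(onnx_model.graph.input)         # inspect the input name and the fixed 1x1x28x28 shape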
Configuring the development environment for NPU deployment
After converting the model to ONNX, we set up the development environment for running the model on the NPU. We followed the ONNX Runtime documentation for the QNN Execution Provider.
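As a rough sketch of our setup check (package names may change, so treat the documentation as the source of truth): we installed the QNN-enabled ONNX Runtime package into a Windows ARM64 Python environment and confirmed the QNN Execution Provider was visible before going any further.
# Install the QNN-enabled ONNX Runtime package, e.g.:
#   pip install onnxruntime-qnn
import onnxruntime

# "QNNExecutionProvider" should appear in this list if the environment is set up correctly.
print(onnxruntime.get_available_providers())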
Quantizing the model
Next, we quantized the model, converting the 32-bit floating point model to an 8-bit integer model. Quantization balances the trade-offs between model accuracy and computational efficiency. There are two primary types of quantization frameworks: dynamic and static.
The Qualcomm Hexagon NPU in Copilot+ PCs, including Surface Pro and Surface Laptop, supports static quantization, which means both the weights and activations are quantized before deployment.
We followed the QNN documentation for pre-processing and quantization: Quantize ONNX Models. Using a static quantization method, we created calibration data with representative data samples to determine optimal quantization parameters. We used the default parameters in the documentation example: uint16 for activations and uint8 for weights. Now, we have a quantized ONNX model ready to run on the NPU!
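For illustration, here’s roughly what that step looked like for us, following the Quantize ONNX Models guide. The MnistDataReader class and the calibration samples below are placeholders, and the helper names (qnn_preprocess_model, get_qnn_qdq_config) come from the QNN quantization utilities described in that documentation, so check the current docs if they’ve moved:
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import (
    get_qnn_qdq_config, qnn_preprocess_model)

class MnistDataReader(CalibrationDataReader):
    """Feeds representative MNIST samples to the quantizer to calibrate activation ranges."""
    def __init__(self, samples):
        # samples: float32 array of shape (N, 1, 28, 28), preprocessed like the training data
        self._batches = iter([{"input": s[np.newaxis, ...]} for s in samples])

    def get_next(self):
        return next(self._batches, None)

# Pre-process the float model (fusions, shape inference) before quantizing.
changed = qnn_preprocess_model("model.onnx", "model.preproc.onnx")
float_model = "model.preproc.onnx" if changed else "model.onnx"

# Placeholder calibration data – swap in real MNIST images for meaningful ranges.
calibration_samples = np.random.rand(32, 1, 28, 28).astype(np.float32)

# Defaults from the documentation example: uint16 activations, uint8 weights.
qdq_config = get_qnn_qdq_config(float_model,
                                MnistDataReader(calibration_samples),
                                activation_type=QuantType.QUInt16,
                                weight_type=QuantType.QUInt8)
quantize(float_model, "model.qdq.onnx", qdq_config)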
TIP: Supported ONNX operators. At the time of this blog post, not all operators can be quantized. To fully utilize NPU acceleration, ensure all layers are compatible with QNN. Check the supported operator set here: Supported ONNX operators.
Running the model
To run our model on the NPU, we created an ONNX Runtime inference session. Different parameters can be configured to fit your particular use case: Configuration Options.
import onnxruntime

options = onnxruntime.SessionOptions()  # default session options; tune as needed
session = onnxruntime.InferenceSession("model.qdq.onnx",  # path to our quantized model
                                       sess_options=options,
                                       providers=["QNNExecutionProvider"],
                                       provider_options=[{"backend_path": "QnnHtp.dll"}])  # path to the HTP backend DLL in the QNN SDK
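With the session in place, running inference is just a matter of passing a correctly shaped float32 array under the input name we chose during export ("input"); the random tensor below is a stand-in for a real preprocessed digit image:
import numpy as np

# Stand-in for a preprocessed 28x28 grayscale digit – replace with real data.
image = np.random.rand(1, 1, 28, 28).astype(np.float32)

outputs = session.run(None, {"input": image})  # None -> return all model outputs
print("Predicted digit:", int(np.argmax(outputs[0])))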
We created a simple Flask app to give our model a user-friendly interface. Each time we ran our model, we observed the resource usage of our NPU in Task Manager, confirming that the NPU was indeed being utilized!
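The Flask app itself isn’t the focus of this post, but as a hypothetical sketch of what that wrapper can look like (the route, request format, and preprocessing are placeholders to adapt to your own UI):
import numpy as np
import onnxruntime
from flask import Flask, jsonify, request

app = Flask(__name__)
session = onnxruntime.InferenceSession(
    "model.qdq.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}])

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body with a flattened 28x28 grayscale image under "pixels".
    pixels = np.array(request.json["pixels"], dtype=np.float32).reshape(1, 1, 28, 28)
    outputs = session.run(None, {"input": pixels})
    return jsonify({"digit": int(np.argmax(outputs[0]))})

if __name__ == "__main__":
    app.run()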
Conclusion
Our journey to run a “Hello World” model on the NPU was not without its share of challenges. There were many learnings along the way, most notably how much control we have in the quantization step to balance accuracy and efficiency. We believe this will be one of the most important considerations when we develop future applications.
Our other major learning is that certain workloads – like large neural network models, or reasoning over audio, vision, or language data – are best suited to the NPU, whereas other tasks may be better suited to the CPU or GPU. Just as we experiment with different models during development, in future projects we plan to experiment with different implementation methods to determine which is most performant in terms of inference time and power consumption.
This exploration has only added fuel to our fire as we consider where we can take this next. We believe we can take advantage of the NPU’s unique power to build even greater Surface devices and experiences. (And we can’t wait to see what you do too!)
This space is growing fast, thanks to groups like the Applied Sciences Group at Microsoft, Qualcomm, and ONNX. With many libraries being open source, we anticipate these assets will only get better with time. Together, we can unlock the NPU’s limitless potential. How will you use it?
Learn more