Driven by the need for user privacy, real-time performance, platform flexibility, and cost efficiency, edge AI is coming into the spotlight and transforming the AI landscape. Today at Ignite, we are excited to announce four features in the ONNX Runtime ecosystem that make edge AI more accessible to everyone.
While edge inferencing offers substantial advantages, implementing these solutions presents distinct challenges. Edge devices often have limited computing power and memory, necessitating lightweight inference solutions that can be difficult to implement. Edge devices also feature a variety of hardware options, including increasingly popular NPUs (neural processing units). This diversity of hardware and platforms makes it hard to manage dependencies and ensure interoperability.
That’s where ONNX Runtime comes in. Built to be lightweight and flexible, ONNX Runtime is the ideal solution for resource-constrained edge and mobile devices, accelerating inference across a variety of hardware and platforms, including CPUs, GPUs, and NPUs. This blog provides more details on how ONNX Runtime enables universal AI across edge devices, empowering AI developers to implement advanced AI capabilities either directly at the edge or as part of a hybrid AI solution.
Accelerate small language models (SLMs) on NPUs across platforms
A Neural Processing Unit (NPU) significantly enhances AI performance by efficiently handling machine learning tasks. Another key benefit of the NPU is its energy efficiency, which is crucial for mobile and embedded devices and battery-operated applications, enabling longer usage times without sacrificing performance.
As a high-performance, cross-platform inference engine, ONNX Runtime has broadened its hardware support to include NPUs. Microsoft has partnered with various hardware vendors to seamlessly integrate their NPU accelerators into the ONNX Runtime framework. ONNX Runtime with the Qualcomm AI Engine Direct SDK (QNN accelerator) enables Phi Silica model inference on the Snapdragon X series NPU in Copilot+ PCs, as announced at Build 2024. Now, with the ONNX Runtime Generative API and the QNN accelerator, we can accelerate advanced SLMs, including the Phi 3.5 mini and Llama 3.2 3B models, on Qualcomm NPUs across diverse platforms, spanning both PCs and mobile devices. We observe a time to first token of about 100 ms during prompt processing with the Llama 3.2 3B model and a prompt of up to 128 tokens on a Snapdragon 8 Elite mobile device. Powered by ONNX Runtime, LM Studio is able to run SLMs on Copilot+ PCs, taking advantage of Snapdragon NPUs. Please check out the Ignite session Boost edge AI: Accelerate model inference with ONNX Runtime and NPU for more details.
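To give a flavor of what this looks like in practice, below is a minimal sketch of streaming token generation with the onnxruntime-genai Python bindings. The model folder path is a placeholder, the NPU backend is typically selected through the model folder's genai_config.json rather than in code, and exact method names can vary slightly between package releases.

```python
# Minimal sketch: streaming generation with onnxruntime-genai.
# Assumes an NPU-ready model folder produced by the model builder; the
# execution provider is configured in that folder's genai_config.json.
import onnxruntime_genai as og

model = og.Model("path/to/llama-3.2-3b-qnn")   # placeholder model folder
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Summarize the benefits of on-device AI."))

# Generate and print tokens as they are produced.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```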
NPU-accelerated in-browser inference with ONNX Runtime Web and WebNN
ONNX Runtime Web enables developers to run and deploy machine learning models directly in the browser. This means that web applications can leverage the power of machine learning without requiring the user to install any additional software or plugins, while also protecting the user’s privacy and security. Visit this blog to learn how ONNX Runtime Web enabled Goodnotes to bring its Scribble to Erase AI feature to the Windows, Web, and Android platforms using web technologies.
WebNN is an emerging web standard that defines a powerful way to run machine learning model inference on the web. WebNN taps into the capabilities of local hardware accelerators, such as GPUs and NPUs, to run machine learning models efficiently and securely. It can enable a variety of use cases, such as generative AI, object recognition, natural language processing, and more. Today, you can use the ONNX Runtime Web with WebNN Developer Preview on Intel and Qualcomm NPUs. Check out the latest ONNX Runtime with WebNN news in the Ignite breakout session Get ready for End of Support (EOS) and the future of AI at work with Windows 11 and Copilot+ PCs. To try out our samples and learn how to get started, visit aka.ms/webnn.
Multi LoRA for more flexible and lightweight model fine-tuning and inference
LoRA (Low-Rank Adaptation) is a popular method of fine-tuning generative models. Not only is it more efficient than fine-tuning the entire model, but LoRA also enables shipping multiple fine-tuned adapters for different scenarios without having to re-deploy the entire model. As shown in the image below, the memory footprint can be reduced by a factor of four by utilizing the multi-LoRA feature. For resource-constrained environments such as edge and mobile devices, this opens possibilities for extending model quality and flexibility.
The ONNX Runtime ecosystem now supports this end-to-end workflow: Olive enables you to quantize, fine-tune, and optimize your LoRA adapters for ONNX Runtime, or you can convert existing PyTorch LoRA adapters for use with ONNX Runtime. Switching adapters in and out at runtime is fast and efficient. Find more details here.
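As an illustration, here is a rough sketch of switching between LoRA adapters at generation time with the onnxruntime-genai Python API. The adapter file names below are hypothetical placeholders, and the adapter-related calls may differ slightly across package versions; the adapters themselves would be produced beforehand with Olive or converted from PyTorch.

```python
# Rough sketch: serving one base model with multiple LoRA adapters.
# Adapter files and names below are hypothetical placeholders.
import onnxruntime_genai as og

model = og.Model("path/to/base-model")            # shared base model
tokenizer = og.Tokenizer(model)

adapters = og.Adapters(model)
adapters.load("travel.onnx_adapter", "travel")    # e.g. exported with Olive
adapters.load("medical.onnx_adapter", "medical")

def generate(prompt, adapter_name):
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=128)
    generator = og.Generator(model, params)
    generator.set_active_adapter(adapters, adapter_name)   # switch per request
    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))

print(generate("Plan a weekend in Lisbon.", "travel"))
print(generate("Explain what an MRI measures.", "medical"))
```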
Even easier model optimization with the new Olive CLI
At Build 2023, Microsoft announced Olive, an advanced model optimization toolkit designed to streamline the process of optimizing AI models for deployment with ONNX Runtime.
Olive operates through a structured workflow consisting of a series of model optimization tasks known as passes. These passes can include model compression, graph capture, quantization, and graph optimization. Each pass has adjustable parameters that can be tuned to achieve optimal metrics such as accuracy and latency. While the workflow paradigm used in Olive is very flexible, the learning curve can be challenging for AI developers new to model optimization. To make model optimization more approachable, we have curated a set of Olive workflows for common scenarios and exposed them as simple commands in a new, easy-to-use CLI. Please visit this blog for more information.
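For context on the workflow paradigm that the new CLI simplifies, the sketch below drives Olive from Python with a small chain of passes. The model path, pass names, and config keys are illustrative assumptions whose exact schema varies by Olive version, so check the Olive documentation; with the new CLI, common scenarios like this reduce to a single command.

```python
# Illustrative sketch of an Olive workflow: a chain of optimization "passes",
# each with tunable parameters. Keys and pass names are assumptions that may
# differ by Olive version; the new CLI wraps curated workflows like this one.
from olive.workflows import run as olive_run

workflow = {
    # Hypothetical input model (a Hugging Face or local model).
    "input_model": {"type": "HfModel", "model_path": "microsoft/Phi-3.5-mini-instruct"},
    "passes": {
        "conversion": {"type": "OnnxConversion"},                  # graph capture / ONNX export
        "optimization": {"type": "OrtTransformersOptimization"},   # graph optimization
        "quantization": {"type": "OnnxDynamicQuantization"},       # model compression
    },
    "output_dir": "models/phi-3.5-optimized",
}

olive_run(workflow)
```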
Call to Action
Now is the perfect time to explore these new features and start optimizing your edge AI deployments. Whether you're working with embedded devices, mobile platforms, or hybrid edge-cloud environments, ONNX Runtime can help you unlock the full potential of your AI models. Check out this end-to-end Ignite lab showcasing model fine-tuning and hybrid inference with Olive and ONNX Runtime. We encourage you to try out ONNX Runtime in your next edge AI project and experience firsthand how it can streamline deployment and boost performance. Visit the ONNX Runtime GitHub repository and our documentation to get started with more guides, tutorials, and use cases.