Bringing AI to the edge: Hackathon Windows ML
AI Developer Hackathon: Windows ML, hosted by Qualcomm Technologies on Snapdragon X

We're excited to announce our support for and participation in the upcoming global series of Edge AI hackathons hosted by Qualcomm Technologies. The first takes place on June 14-15 in Bangalore.

We see a world of hybrid AI developing rapidly as a new generation of intelligent applications is built for diverse scenarios, ranging from mobile, desktop and spatial computing all the way to industrial and automotive. Mission-critical workloads oscillate between in-the-moment decision-making on device and fine-tuning models in the cloud. We believe we are in the early stages of agentic applications that run efficiently on the edge for scenarios that need local deployment and on-device inferencing.

Microsoft Windows ML

Windows ML is a cutting-edge runtime optimized for performant on-device model inference and simplified deployment, and the foundation of Windows AI Foundry. Windows ML is designed to help developers create AI-infused applications with ease, harnessing the strength of Windows' diverse hardware ecosystem, whether that's an entry-level laptop, a Copilot+ PC or a top-of-the-line AI workstation. It's built to help developers leverage the client silicon best suited for their specific workload on any given device, whether that's an NPU for low-power, sustained inference, a GPU for raw horsepower, or a CPU for the broadest footprint and flexibility.

Introducing Windows ML: The future of machine learning development on Windows - Windows Developer Blog

Getting Started

To get started, install AI Toolkit, leverage one of our conversion and optimization templates, or start building your own. Explore the documentation and code samples available on Microsoft Learn, and check out AI Dev Gallery (install, documentation) for demos and more samples to help you get started with Windows ML.

Microsoft and Qualcomm Technologies: A strong collaboration

Microsoft and Qualcomm Technologies' collaboration brings new advanced AI features to Copilot+ PCs, leveraging the Snapdragon X Elite. Microsoft Research has played a pivotal role by optimizing new lightweight LLMs, such as Phi Silica, specifically for on-device execution on the Hexagon NPU. These models are designed to run efficiently on Hexagon NPUs, enabling multimodal AI experiences like vision-language tasks directly on Copilot+ PCs without relying on the cloud. Additionally, Microsoft has made DeepSeek R1 7B and 14B distilled models available via Azure AI Foundry, further expanding the AI ecosystem on the edge. This collaboration marks a significant step in democratizing AI by making powerful, efficient models accessible on everyday devices.

Windows AI Foundry expands AI capabilities by providing high-performance built-in models and supports developers' custom models with silicon-level performance. This developer platform plays a key role in the collaboration. Windows ML enables Windows 11 and Copilot+ PCs to use the Hexagon NPU for power-efficient inference.

Scaling optimization through the Olive toolchain

Windows ML, the foundation of Windows AI Foundry, provides a unified platform for AI development across hardware architectures and delivers silicon performance through the QNN Execution Provider. The stack includes Windows ML and toolchains like Olive, easily accessible in AI Toolkit for VS Code, which streamline model optimization and deployment.
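To make the QNN Execution Provider path more concrete, here is a minimal, hedged sketch of loading an ONNX model with ONNX Runtime and requesting the Qualcomm NPU backend on a Snapdragon X device. The model path is a placeholder, the float32 dummy input is only an assumption for illustration, and this requires an ONNX Runtime build that includes the QNN EP (for example the onnxruntime-qnn package):

```python
import numpy as np
import onnxruntime as ort

# Placeholder path: any ONNX model prepared for the NPU (for example via Olive).
MODEL_PATH = "model.onnx"

# Request the QNN Execution Provider with the HTP (NPU) backend, falling back to
# the CPU provider for any operators the NPU backend cannot run.
session = ort.InferenceSession(
    MODEL_PATH,
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
        "CPUExecutionProvider",
    ],
)

# Build a dummy input matching the model's first input and run one inference.
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # replace symbolic dims with 1
dummy = np.zeros(shape, dtype=np.float32)  # assumes a float32 input; adjust to your model
outputs = session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])
```

This is only a sketch of how an application can target the NPU through the execution provider interface; real applications would of course feed real pre-processed inputs and handle the model's actual input and output types.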
Qualcomm Technologies has contributed to Microsoft's Olive, an open-source model optimization tool that enhances AI performance by optimizing models for efficient inference on client systems. This tool is particularly beneficial for running LLM and GenAI workloads on Qualcomm Technologies' platforms.

Real-World Applications

Through the Qualcomm Technologies and Microsoft collaboration, we have partnered with top developers, independent software vendors (ISVs) such as Powder, Topaz Labs, Camo and McAfee, to adopt Windows ML, and they have demonstrated impressive performance for their AI features.

Join us at the Hackathon

With the recent launch of Qualcomm Snapdragon® X Elite-powered Windows laptops, developers can now take advantage of powerful NPUs (Neural Processing Units) to deploy AI applications that are both responsive and energy-efficient. These new devices open up a world of opportunities to rethink how applications are built, from productivity tools to creative assistants and intelligent agents, all running directly on the device. Our mission has always been to enable high-quality AI experiences using compact, optimized models. These models are tailor-made for edge computing, offering faster inference, lower memory usage and enhanced privacy without compromising performance.

We encourage all application developers, whether you're building with open-source SLMs (small language models), working on smart assistants, or exploring new on-device AI use cases, to join us at the event. You can register here: https://www.qualcomm.com/support/contact/forms/edge-ai-developer-hackathon-bengaluru-proposal-submission

Dive deeper into these innovative developer solutions:

Windows AI Foundry & Windows ML on Qualcomm NPU
Microsoft and Qualcomm Technologies collaborate on Windows 11, Copilot+ PCs and Windows AI Foundry | Qualcomm
Unlocking the power of Qualcomm QNN Execution Provider GPU backend
Introducing Windows ML: The future of machine learning development on Windows - Windows Developer Blog

Using Advanced Reasoning Model on EdgeAI Part 1 - Quantization, Conversion, Performance
DeepSeek-R1 is very popular, and it can achieve capabilities comparable to OpenAI o1 in advanced reasoning. Microsoft has also added DeepSeek-R1 models to Azure AI Foundry and GitHub Models, so we can compare DeepSeek-R1 with other available models in the GitHub Models playground.

Note: This series revolves around deploying SLMs to edge devices ("Edge AI"). We will focus on deploying advanced reasoning models across different application scenarios. You can learn more in the AI Tour session BRK453.

In this experiment we want to deploy advanced reasoning models to the edge so that they can run on edge devices with limited computing power and in offline environments. At this time, the recommendation is to use the traditional ONNX model format. We can use Microsoft Olive to convert the DeepSeek-R1 Distill models.

Getting started with Microsoft Olive is very straightforward. Install the Microsoft Olive library from the command line (Python 3.10+ is recommended):

pip install olive-ai

The DeepSeek-R1 Distill model series comes in different parameter sizes, such as 1.5B, 7B, 8B, 14B, 32B and 70B. This article is mainly based on the 1.5B, 7B and 14B models (that is, small language models).

CPU Inference

Let's start with 1.5B and 7B, the models with lower parameter counts. We can use the CPU directly for inference to test the results (hardware environment: Azure DevBox, AMD EPYC 7763 64-core + 64GB memory + 2TB SSD).

Quantization and conversion

olive auto-opt --model_name_or_path <Your DeepSeek-R1-Distill-Qwen-1.5B/7B local location> --output_path <Your converted ONNX INT4 model local location> --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1

You can also download the converted models directly from my Hugging Face repo (note: these models are for testing, have not been fully vetted by AI Content Safety, and are not official models):

DeepSeek-R1-Distill-Qwen-1.5B-ONNX-INT4-CPU
DeepSeek-R1-Distill-Qwen-7B-ONNX-INT4-CPU

Running with ONNX Runtime GenAI

Install ONNX Runtime GenAI and the ONNX Runtime CPU support libraries:

pip install onnxruntime-genai
pip install onnxruntime

Sample code (a minimal generation-loop sketch is also included at the end of this CPU section):

https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-1.5b.ipynb
https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-7b.ipynb

Performance comparison: 1.5B vs 7B

We compare two different inference scenarios:

1. Explain 1+1=2

1.5B quantized ONNX model: memory occupied, time consumed and number of tokens generated:

7B quantized ONNX model: memory occupied, time consumed and number of tokens generated:

2. Find all pairwise non-isomorphic groups of order 147 with no elements of order 49

1.5B quantized ONNX model: memory occupied, time consumed and number of tokens generated:

7B quantized ONNX model: memory occupied, time consumed and number of tokens generated:

Results

Through these tests, we can see that the 1.5B DeepSeek distilled model is well suited to CPU inference and can be deployed on traditional PCs or IoT devices. The 7B model, although it reasons better, is not very effective when running on a CPU.
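For readers who want to try the converted CPU models without opening the notebooks, here is a minimal, hedged sketch of the ONNX Runtime GenAI generation loop, including rough timing and token counting in the spirit of the comparison above. The model directory, prompt and search options are placeholders, and the exact API surface can differ slightly between onnxruntime-genai releases:

```python
import time
import onnxruntime_genai as og

# Placeholder: folder produced by `olive auto-opt` (contains the ONNX model and genai_config.json).
MODEL_DIR = "./DeepSeek-R1-Distill-Qwen-1.5B-ONNX-INT4-CPU"

model = og.Model(MODEL_DIR)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Illustrative prompt only; in practice apply the model's own chat template before encoding.
prompt = "Explain why 1+1=2."
input_ids = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048, temperature=0.6)

generator = og.Generator(model, params)
generator.append_tokens(input_ids)  # older onnxruntime-genai releases use params.input_ids instead

start, generated = time.time(), 0
while not generator.is_done():
    generator.generate_next_token()
    generated += 1
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)

elapsed = time.time() - start
print(f"\n\n{generated} tokens in {elapsed:.1f}s ({generated / elapsed:.1f} tok/s)")
```

If you also want to reproduce the memory comparison, process memory can be sampled separately during generation, for example with a library such as psutil.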
GPU Inference

It is ideal if we have a GPU on the edge device. We can quantize and convert a model to ONNX for CPU inference through Microsoft Olive, and of course we can also convert it to a model for GPU inference.

Here I take DeepSeek-R1-Distill-Qwen-14B as an example and compare its inference with Microsoft's Phi-4-14B.

Quantization and conversion

olive auto-opt --model_name_or_path <Your Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path> --output_path <Your converted Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path> --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1

You can also download the converted models directly from my Hugging Face repo (note: these models are for testing, have not been fully vetted by AI Content Safety, and are not official models):

DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU
Phi-4-14B-ONNX-INT4-GPU

Running with ONNX Runtime GenAI CUDA

Install ONNX Runtime GenAI and the ONNX Runtime GPU support libraries:

pip install onnxruntime-genai-cuda
pip install onnxruntime-gpu

Compare the results in the GPU environment with Gradio

A GPU with more than 8GB of memory is recommended. To broaden the comparison, we compare Phi-4-14B-ONNX-INT4-GPU and DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU side by side, and we also include OpenAI o1-mini (it is recommended to access o1-mini through GitHub Models). A minimal side-by-side Gradio sketch is included at the end of this post.

Sample code: https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/Performance_AdvancedReasoning_ONNX_CPU.ipynb

You can test any prompt in Gradio to compare the results of Phi-4-14B-ONNX-INT4-GPU, DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU and OpenAI o1-mini.

DeepSeek-R1 reduces the cost of reasoning models and produces more instructive results on specialized problems, while Phi-4-14B also has advantages in reasoning and completes inference with lower computing power. As for OpenAI o1-mini, it is more comprehensive and can handle a wider range of problems. If you want to deploy to an edge device, Phi-4-14B and the quantized DeepSeek-R1 are good choices.

This blog is just a simple test and the first post in this series. Please share your feedback and continue the discussion in the Microsoft AI Discord Channel. Feel free to send me a message or comment. We look forward to sharing more about the opportunity of Edge AI and more content in this series.

Resources

DeepSeek-R1 in GitHub Models: https://github.com/marketplace/models/azureml-deepseek/DeepSeek-R1
DeepSeek-R1 in Azure AI Foundry: https://ai.azure.com/explore/models/DeepSeek-R1/version/1/registry/azureml-deepseek
Phi-4-14B on Hugging Face: https://huggingface.co/microsoft/phi-4
Learn about Microsoft Olive: https://github.com/microsoft/olive
Learn about ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai
Microsoft AI Discord Channel
BRK453 Exploring cutting-edge models: LLMs, SLMs, local development and more: https://aka.ms/aitour/brk453
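As a rough illustration of the Gradio comparison described above, here is a minimal sketch that loads the two converted INT4 GPU models with ONNX Runtime GenAI and shows their answers side by side. The model directory names are placeholders (point them at your own converted or downloaded folders), both models must fit in GPU memory at the same time, and the o1-mini column from the original comparison is omitted because it requires separate GitHub Models API access:

```python
import gradio as gr
import onnxruntime_genai as og

# Placeholder paths to the converted INT4 GPU model folders.
MODEL_DIRS = {
    "Phi-4-14B-ONNX-INT4-GPU": "./Phi-4-14B-ONNX-INT4-GPU",
    "DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU": "./DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU",
}

# Load both models up front; together they need well over 8GB of GPU memory.
loaded = {}
for name, path in MODEL_DIRS.items():
    model = og.Model(path)
    loaded[name] = (model, og.Tokenizer(model))

def generate(name, prompt):
    model, tokenizer = loaded[name]
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=2048)
    generator = og.Generator(model, params)
    # In practice, apply each model's chat template to the prompt before encoding.
    generator.append_tokens(tokenizer.encode(prompt))  # older releases use params.input_ids
    while not generator.is_done():
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))

def compare(prompt):
    return tuple(generate(name, prompt) for name in MODEL_DIRS)

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(label="Prompt", lines=3),
    outputs=[gr.Textbox(label=name, lines=12) for name in MODEL_DIRS],
    title="Phi-4-14B vs DeepSeek-R1-Distill-Qwen-14B (ONNX INT4, GPU)",
)

demo.launch()
```

This sketch generates the two answers sequentially; for an interactive demo you may prefer streaming output or loading one model at a time if GPU memory is tight.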