Using Advanced Reasoning Model on EdgeAI Part 1 - Quantization, Conversion, Performance
DeepSeek-R1 is very popular, and it can achieve capabilities comparable to OpenAI o1 in advanced reasoning. Microsoft has also added the DeepSeek-R1 models to Azure AI Foundry and GitHub Models, so we can compare DeepSeek-R1 with other available models through the GitHub Models Playground.

Note: this series revolves around deploying SLMs to edge devices ("Edge AI"). We will focus on deploying advanced reasoning models across different application scenarios. You can learn more in the AI Tour session BRK453.

In this experiment we want to deploy advanced reasoning models to the edge, so that they can run on edge devices with limited computing power and in offline environments. At this time, the recommendation is to use the traditional ONNX model. We can use Microsoft Olive to convert the DeepSeek-R1 Distill models. Getting started with Microsoft Olive is very straightforward: install the Microsoft Olive library from the command line (Python 3.10+ recommended).

```bash
pip install olive-ai
```

The DeepSeek-R1 Distill model series comes in different parameter sizes, such as 1.5B, 7B, 8B, 14B, 32B, and 70B. This article is mainly based on the 1.5B, 7B, and 14B models (that is, Small Language Models).

CPU Inference

Let's start with 1.5B and 7B, the models with the lower parameter counts. We can use the CPU directly as the compute for inference to test the effect (hardware environment: Azure DevBox, AMD EPYC 7763 64-Core + 64GB memory + 2TB SSD).

Quantization conversion

```bash
olive auto-opt --model_name_or_path <Your DeepSeek-R1-Distill-Qwen-1.5B/7B local location> --output_path <Your converted ONNX INT4 model local location> --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1
```

You can also download the converted models directly from my Hugging Face repo (note: these models are for testing, have not been fully vetted by AI Content Safety, and are not provided as official models):

- DeepSeek-R1-Distill-Qwen-1.5B-ONNX-INT4-CPU
- DeepSeek-R1-Distill-Qwen-7B-ONNX-INT4-CPU

Running with ONNX Runtime GenAI

Install ONNX Runtime GenAI and the ONNX Runtime CPU support libraries:

```bash
pip install onnxruntime-genai
pip install onnxruntime
```

Sample code:

- https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-1.5b.ipynb
- https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-7b.ipynb

Performance comparison: 1.5B vs 7B

We compare two different inference scenarios:

1. "explain 1+1=2"

- 1.5B quantized ONNX model: memory occupied, time consumption, and number of tokens generated
- 7B quantized ONNX model: memory occupied, time consumption, and number of tokens generated

2. "Find all pairwise different isomorphism groups with order 147 and no elements with order 49"

- 1.5B quantized ONNX model: memory occupied, time consumption, and number of tokens generated
- 7B quantized ONNX model: memory occupied, time consumption, and number of tokens generated

Results

Through these tests, we can see that the 1.5B DeepSeek model is well suited to CPU inference and can be deployed on traditional PCs or IoT devices. As for the 7B model, although its reasoning quality is better, it is not very practical for CPU-only operation.
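For reference, measurements like the ones above (memory occupied, time consumed, tokens generated) can be collected with a short generation loop. Below is a minimal sketch of that pattern with ONNX Runtime GenAI on CPU, not the notebooks' literal code: the model path is a hypothetical local folder for the converted INT4 model, and the calls shown assume a recent onnxruntime-genai release (the exact API surface has changed across versions).

```python
# Minimal sketch: stream tokens from a local INT4 ONNX model on CPU and record
# rough time / token-count / memory numbers. Assumptions: MODEL_PATH is a
# hypothetical local folder holding the converted model; the onnxruntime-genai
# API shown matches the installed release (it has varied across versions).
import os
import time

import onnxruntime_genai as og
import psutil  # pip install psutil; used only for the memory reading

MODEL_PATH = "./DeepSeek-R1-Distill-Qwen-1.5B-ONNX-INT4-CPU"  # hypothetical path

model = og.Model(MODEL_PATH)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("explain 1+1=2"))

start, n_tokens = time.time(), 0
while not generator.is_done():
    generator.generate_next_token()
    n_tokens += 1
    # Print each new token as it is produced.
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)

elapsed = time.time() - start
rss_mib = psutil.Process(os.getpid()).memory_info().rss / 2**20
print(f"\n{n_tokens} tokens in {elapsed:.1f}s "
      f"({n_tokens / elapsed:.1f} tok/s), ~{rss_mib:.0f} MiB resident")
```

The same loop can be pointed at the 7B model folder to produce the corresponding numbers for the larger model.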
GPU Inference

It is ideal if we have a GPU on the edge device. Microsoft Olive can quantize and convert a model to ONNX for CPU inference, and of course it can also convert one for GPU inference. Here I take DeepSeek-R1-Distill-Qwen-14B as an example and make an inference comparison with Microsoft's Phi-4-14B.

Quantization conversion

```bash
olive auto-opt --model_name_or_path <Your Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path> --output_path <Your converted Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path> --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1
```

You can download the converted models directly from my Hugging Face repo (note: these models are for testing, have not been fully vetted by AI Content Safety, and are not official models):

- DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU
- Phi-4-14B-ONNX-INT4-GPU

Running with ONNX Runtime GenAI CUDA

Install ONNX Runtime GenAI and the ONNX Runtime GPU support libraries:

```bash
pip install onnxruntime-genai-cuda
pip install onnxruntime-gpu
```

Compare the results in the GPU environment with Gradio

A GPU with more than 8GB of memory is recommended. To broaden the comparison, we compare Phi-4-14B-ONNX-INT4-GPU and DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU side by side, and we also include OpenAI o1-mini (it is recommended to access o1-mini through GitHub Models). A minimal sketch of such a comparison app appears at the end of this article.

Sample code:

- https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/Performance_AdvancedReasoning_ONNX_CPU.ipynb

You can test any prompt in Gradio to compare the results of Phi-4-14B-ONNX-INT4-GPU, DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU, and OpenAI o1-mini.

DeepSeek-R1 reduces the cost of reasoning models and produces more instructive results on specialized problems, while Phi-4-14B also has strengths in reasoning and completes inference with lower computing power. As for OpenAI o1-mini, it is more comprehensive and can handle all kinds of problems. If you want to deploy to an edge device, Phi-4-14B and quantized DeepSeek-R1 are good choices.

This blog is just a simple test and the first in this series. Please share your feedback and continue the discussion in the Microsoft AI Discord Channel. Feel free to send me a message or comment. We look forward to sharing more around the opportunities of Edge AI in upcoming posts in this series.
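As promised above, here is a minimal sketch of a Gradio app that runs one prompt through both local INT4 models for a side-by-side comparison. The model paths are hypothetical local folders, the sketch assumes both 14B INT4 models fit in GPU memory at once, and the onnxruntime-genai calls assume a recent release of the library. An o1-mini column would require calling the GitHub Models endpoint with credentials, so it is left out of the sketch.

```python
# Minimal sketch: a Gradio app comparing two local INT4 ONNX models.
# Assumptions: the paths below are hypothetical local folders for the converted
# models; both fit in GPU memory simultaneously (load lazily if constrained);
# the onnxruntime-genai API shown matches the installed release.
import gradio as gr
import onnxruntime_genai as og

MODEL_PATHS = {
    "Phi-4-14B-ONNX-INT4-GPU": "./Phi-4-14B-ONNX-INT4-GPU",
    "DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU": "./DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU",
}

# Load each model once at startup so each request only pays generation cost.
models = {name: og.Model(path) for name, path in MODEL_PATHS.items()}

def generate(model, prompt, max_length=1024):
    # Run one prompt through one model and return the completion text.
    tokenizer = og.Tokenizer(model)
    stream = tokenizer.create_stream()
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))
    pieces = []
    while not generator.is_done():
        generator.generate_next_token()
        pieces.append(stream.decode(generator.get_next_tokens()[0]))
    return "".join(pieces)

def compare(prompt):
    # Same prompt through every loaded model, in declaration order.
    return tuple(generate(m, prompt) for m in models.values())

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(label="Prompt", lines=3),
    outputs=[gr.Textbox(label=name) for name in MODEL_PATHS],
    title="Phi-4-14B vs DeepSeek-R1-Distill-Qwen-14B (ONNX INT4, GPU)",
)
demo.launch()
```

Loading the models once at startup is deliberate: reloading a 14B INT4 checkpoint per request would dominate latency and hide the generation-speed differences the comparison is meant to show.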
Resources

- DeepSeek-R1 in GitHub Models: https://github.com/marketplace/models/azureml-deepseek/DeepSeek-R1
- DeepSeek-R1 in Azure AI Foundry: https://ai.azure.com/explore/models/DeepSeek-R1/version/1/registry/azureml-deepseek
- Phi-4-14B on Hugging Face: https://huggingface.co/microsoft/phi-4
- Learn about Microsoft Olive: https://github.com/microsoft/olive
- Learn about ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai
- Microsoft AI Discord Channel
- BRK453 Exploring cutting-edge models: LLMs, SLMs, local development and more: https://aka.ms/aitour/brk453

Tiny But Mighty: Unleashing the Power of Small Language Models 🚀

While Large Language Models (LLMs) like GPT-4 dominate headlines with their extensive capabilities, they often come at the cost of high computational requirements and complexity. For developers and organizations looking to implement AI solutions on edge devices or with limited resources, Small Language Models (SLMs) are emerging as a practical alternative.

SLMs are not just "smaller" versions of their larger counterparts; they're designed to be faster, more efficient, and adaptable for specific tasks. With fewer parameters and lower computational needs, SLMs open the door to deploying AI on mobile devices, IoT systems, and edge environments without compromising performance.

What You Stand to Learn 🧠

- Introduction to Microsoft's AI Ecosystem: Discover Microsoft's end-to-end AI development tools, from Azure AI Services to ONNX Runtime, enabling efficient and secure deployment of AI models across cloud and edge environments.
- The Advantages of SLMs over LLMs: SLMs are game-changers for edge AI applications, providing faster training and inference times, reduced energy costs, and scalability across diverse devices.
- Hands-On with Phi-3 and ONNX Runtime: Experience live demonstrations of SLMs in action with tools like Phi-3 and ONNX Runtime, showcasing how to fine-tune and deploy models on mobile devices, IoT, and hybrid cloud environments.
- Responsible AI Practices: Understand how to safeguard your AI applications with Microsoft's Responsible AI toolkit, ensuring ethical and trustworthy deployments.

Watch the Full Session 👨💻

📅 Date: December 12, 2024
⏰ Time: 4 PM GMT | 5 PM CEST | 8 AM PT | 11 AM ET | 7 PM EAT

A session packed with live demos, practical examples, and Q&A opportunities. Register NOW | Events | Microsoft Reactor

Agenda 🔍

- Introduction (5 min): A brief overview of the session and its focus on SLMs and LLMs.
- Microsoft AI Tooling (5 min): Explore the latest tools like Azure AI Services, Azure Machine Learning, and Responsible AI Tooling.
- How to Choose the Right Model (10 min): Key considerations such as performance, customizability, and ethical implications.
- Comparing SLMs vs LLMs (10 min): The strengths, weaknesses, and best use cases for both Small and Large Language Models.
- Deploying Models at the Edge (10 min): Insights into optimizing AI for mobile, IoT, and edge devices.
- Q&A: Addressing participant questions about AI development and deployment.