AI Foundry
Edge AI for Beginners: Getting Started with Foundry Local
In Module 08 of the EdgeAI for Beginners course, Microsoft introduces Foundry Local, a toolkit that helps you deploy and test Small Language Models (SLMs) completely offline. In this blog, I'll share how I installed Foundry Local, ran the Phi-3.5-mini model on my Windows laptop, and what I learned through the process.

What Is Foundry Local?

Foundry Local allows developers to run AI models locally on their own hardware. It supports text generation, summarization, and code completion — all without sending data to the cloud. Unlike cloud-based systems, everything happens on your computer, so your data never leaves your device.

Prerequisites

Before starting, make sure you have:
- Windows 10 or 11
- Python 3.10 or newer
- Git
- An internet connection (for the first-time model download)
- Foundry Local installed

Step 1 — Verify Installation

After installing Foundry Local, open Command Prompt and type:

foundry --version

If you see a version number, Foundry Local is installed correctly.

Step 2 — Start the Service

Start the Foundry Local service using:

foundry service start

You should see a confirmation message that the service is running.

Step 3 — List Available Models

To view the models supported by your system, run:

foundry model list

You'll get a list of locally available SLMs; the exact list you see depends on your machine. Note: model availability depends on your device's hardware. For most laptops, phi-3.5-mini works smoothly on CPU.

Step 4 — Run the Phi-3.5 Model

Now let's start chatting with the model:

foundry model run phi-3.5-mini-instruct-generic-cpu:1

Once it loads, you'll enter an interactive chat mode. Try a simple prompt:

Hello! What can you do?

The model replies instantly — right from your laptop, no cloud needed. To exit, type:

/exit

How It Works

Foundry Local loads the model weights from your device and performs inference locally. This means text generation happens using your CPU (or GPU, if available). The result: complete privacy, no internet dependency, and instant responses.

Benefits for Students

For students beginning their journey in AI, Foundry Local offers several key advantages:
- No need for high-end GPUs or expensive cloud subscriptions.
- Easy setup for experimenting with multiple models.
- Perfect for class assignments, AI workshops, and offline learning sessions.
- Promotes a deeper understanding of model behavior by allowing step-by-step local interaction.

These factors make Foundry Local a practical choice for learning environments, especially in universities and research institutions where accessibility and affordability are important.

Why Use Foundry Local

Running models locally offers several practical benefits compared to using AI Foundry in the cloud. With Foundry Local, you do not need an internet connection, and all computation happens on your personal machine. This makes it faster for small models and more private, since your data never leaves your device. In contrast, AI Foundry runs entirely in the cloud, requiring internet access and charging based on usage. For students and developers, Foundry Local is ideal for quick experiments, offline testing, and understanding how models behave in real time. On the other hand, AI Foundry is better suited for large-scale or production-level scenarios where models need to be deployed at scale. In summary, Foundry Local provides a flexible and affordable environment for hands-on learning, especially when working with smaller models such as Phi-3, Qwen2.5, or TinyLlama.
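If you prefer to drive the model from a script rather than the interactive chat, the Foundry Local service also exposes an OpenAI-compatible endpoint on localhost. The sketch below is a minimal example of calling it with the standard openai Python client; treat the port and model id as placeholders and substitute whatever `foundry service status` and `foundry model list` report on your machine.

```python
# Minimal sketch: chat with a locally running Foundry Local model over its
# OpenAI-compatible endpoint. Assumptions: the service is already running
# ("foundry service start"), and the base_url port and model id below are
# placeholders to be replaced with the values reported on your machine.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5273/v1",   # placeholder port -- check `foundry service status`
    api_key="not-needed-locally",          # the local endpoint does not validate the key
)

response = client.chat.completions.create(
    model="phi-3.5-mini-instruct-generic-cpu",  # use the exact id from `foundry model list`
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
)
print(response.choices[0].message.content)
```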
Foundry Local allows you to experiment freely, learn efficiently, and better understand the fundamentals of Edge AI development.

Optional: Restart Later

Next time you open your laptop, you don't have to reinstall anything. Just run these two commands again:

foundry service start
foundry model run phi-3.5-mini-instruct-generic-cpu:1

What I Learned

Following the EdgeAI for Beginners Study Guide helped me understand:
- How edge AI applications work
- How small models like Phi-3.5 can run on a local machine
- How to test prompts and build chat apps with zero cloud usage

Conclusion

Running the Phi-3.5-mini model locally with Foundry Local gave me hands-on insight into edge AI. It's an easy, private, and cost-free way to explore generative AI development. If you're new to Edge AI, start with the EdgeAI for Beginners course and follow its Study Guide to get comfortable with local inference and small language models.

Resources:
- EdgeAI for Beginners GitHub Repo
- Foundry Local Official Site
- Phi Model Link

Building a Multi-Agent System with Azure AI Agent Service: Campus Event Management
Personal Background

My name is Peace Silly. I studied French and Spanish at the University of Oxford, where I developed a strong interest in how language is structured and interpreted. That curiosity about syntax and meaning eventually led me to computer science, which I came to see as another language built on logic and structure. In the academic year 2024–2025, I completed the MSc in Computer Science at University College London, where I developed this project as part of my Master's thesis.

Project Introduction

Can large-scale event management be handled through a simple chat interface? This was the question that guided my Master's thesis project at UCL. As part of the Industry Exchange Network (IXN) and in collaboration with Microsoft, I set out to explore how conversational interfaces and autonomous AI agents could simplify one of the most underestimated coordination challenges in campus life: managing events across multiple departments, societies, and facilities.

At large universities, event management is rarely straightforward. Rooms are shared between academic timetables, student societies, and one-off events. A single lecture theatre might host a departmental seminar in the morning, a society meeting in the afternoon, and a careers talk in the evening, each relying on different systems, staff, and communication chains. Double bookings, last-minute cancellations, and maintenance issues are common, and coordinating changes often means long email threads, manual spreadsheets, and frustrated users.

These inefficiencies do more than waste time; they directly affect how a campus functions day to day. When venues are unavailable or notifications fail to reach the right people, even small scheduling errors can ripple across entire departments. A smarter, more adaptive approach was needed, one that could manage complex workflows autonomously while remaining intuitive and human for end users.

The result was the Event Management Multi-Agent System, a cloud-based platform where staff and students can query events, book rooms, and reschedule activities simply by chatting. Behind the scenes, a network of Azure-powered AI agents collaborates to handle scheduling, communication, and maintenance in real time, working together to keep the campus running smoothly. The user scenario shown in the figure below exemplifies the vision that guided the development of this multi-agent system.

Starting with Microsoft Learning Resources

I began my journey with Microsoft's tutorial Build Your First Agent with Azure AI Foundry, which introduced the fundamentals of the Azure AI Agent Service and provided an ideal foundation for experimentation. Within a few weeks, using the Azure Foundry environment, I extended those foundations into a fully functional multi-agent system.

Azure Foundry's visual interface was an invaluable learning space. It allowed me to deploy, test, and adjust model parameters such as temperature, system prompts, and function calling while observing how each change influenced the agents' reasoning and collaboration. Through these experiments, I developed a strong conceptual understanding of orchestration and coordination before moving to the command line for more complex development later.

When development issues inevitably arose, I relied on the Discord support community and the GitHub forum for troubleshooting. These communities were instrumental in addressing configuration issues and providing practical examples, ensuring that each agent performed reliably within the shared-thread framework.
This early engagement with Microsoft's learning materials not only accelerated my technical progress but also shaped how I approached experimentation, debugging, and iteration. It transformed a steep learning curve into a structured, hands-on process that mirrored professional software development practice.

A Decentralised Team of AI Agents

The system's intelligence is distributed across three specialised agents, powered by OpenAI's GPT-4.1 models through Azure OpenAI Service. They each perform a distinct role within the event management workflow:
- Scheduling Agent – interprets natural language requests, checks room availability, and allocates suitable venues.
- Communications Agent – notifies stakeholders when events are booked, modified, or cancelled.
- Maintenance Agent – monitors room readiness, posts fault reports when venues become unavailable, and triggers rescheduling when needed.

Each agent operates independently but communicates through a shared thread, a transparent message log that serves as the coordination backbone. This thread acts as a persistent state space where agents post updates, react to changes, and maintain a record of every decision. For example, when a maintenance fault is detected, the Maintenance Agent logs the issue, the Scheduling Agent identifies an alternative venue, and the Communications Agent automatically notifies attendees. These interactions happen autonomously, with each agent responding to the evolving context recorded in the shared thread.

Interfaces and Backend

The system was designed with both developer-focused and user-facing interfaces, supporting rapid iteration and intuitive interaction.

The Terminal Interface

Initially, the agents were deployed and tested through a terminal interface, which provided a controlled environment for debugging and verifying logic step by step. This setup allowed quick testing of individual agents and observation of their interactions within the shared thread.

The Chat Interface

As the project evolved, I introduced a lightweight chat interface to make the system accessible to staff and students. This interface allows users to book rooms, query events, and reschedule activities using plain language. Recognising that some users might still want to see what happens behind the scenes, I added an optional toggle that reveals the intermediate steps of agent reasoning. This transparency feature proved valuable for debugging and for more technical users who wanted to understand how the agents collaborated.

When a user interacts with the chat interface, they are effectively communicating with the Scheduling Agent, which acts as the primary entry point. The Scheduling Agent interprets natural-language commands such as "Book the Engineering Auditorium for Friday at 2 PM" or "Reschedule the robotics demo to another room." It then coordinates with the Maintenance and Communications Agents to complete the process.

Behind the scenes, the chat interface connects to a FastAPI backend responsible for core logic and data access. A Flask + HTMX layer handles lightweight rendering and interactivity, while the Azure AI Agent Service manages orchestration and shared-thread coordination. This combination enables seamless agent communication and reliable task execution without exposing any of the underlying complexity to the end user.
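To make the agent-and-thread pattern concrete, here is a minimal sketch of creating one agent and a shared thread with the Azure AI Agent Service Python SDK. The agent name, instructions, and connection string are hypothetical, and the exact client methods have shifted between preview releases of the azure-ai-projects package, so treat this as an illustration of the pattern rather than the project's actual code.

```python
# Minimal sketch: one agent plus a shared thread via the Azure AI Agent Service SDK.
# Assumptions: azure-ai-projects (preview) and azure-identity are installed, the
# connection string is a placeholder, and method/parameter names may differ
# between preview SDK versions. This is not the thesis project's actual code.
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<your-foundry-project-connection-string>",  # placeholder
)

# A hypothetical scheduling agent; the real system defines three such agents.
scheduler = project.agents.create_agent(
    model="gpt-4.1",  # placeholder deployment name
    name="scheduling-agent",
    instructions="Interpret booking requests, check room availability, and allocate venues.",
)

# The shared thread is the coordination log that all agents read from and write to.
thread = project.agents.create_thread()
project.agents.create_message(
    thread_id=thread.id,
    role="user",
    content="Book the Engineering Auditorium for Friday at 2 PM.",
)
run = project.agents.create_and_process_run(thread_id=thread.id, agent_id=scheduler.id)
print(run.status)
```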
Automated Notifications and Fault Detection

Once an event is scheduled, the Scheduling Agent posts the confirmation to the shared thread. The Communications Agent, which subscribes to thread updates, automatically sends notifications to all relevant stakeholders by email. This ensures that every participant stays informed without any manual follow-up.

The Maintenance Agent runs routine availability checks. If a fault is detected, it logs the issue to the shared thread, prompting the Scheduling Agent to find an alternative room. The Communications Agent then notifies attendees of the change, ensuring minimal disruption to ongoing events.

Testing and Evaluation

The system underwent several layers of testing to validate both functional and non-functional requirements.

Unit and Integration Tests

Backend reliability was evaluated through unit and integration tests to ensure that room allocation, conflict detection, and database operations behaved as intended. Automated test scripts verified end-to-end workflows for event creation, modification, and cancellation across all agents. Integration results confirmed that the shared-thread orchestration functioned correctly, with all test cases passing consistently. However, coverage analysis revealed that approximately 60% of the codebase was tested, leaving some areas such as Azure service integration and error-handling paths outside automated validation. These trade-offs were deliberate, balancing test depth with project scope and the constraints of mocking live dependencies.

Azure AI Evaluation

While functional testing confirmed correctness, it did not capture the agents' reasoning or language quality. To assess this, I used Azure AI Evaluation, which measures conversational performance across metrics such as relevance, coherence, fluency, and groundedness. The results showed high scores in relevance (4.33) and groundedness (4.67), confirming the agents' ability to generate accurate and context-aware responses. However, slightly lower fluency scores and weaker performance in multi-turn tasks revealed a retrieval–execution gap typical in task-oriented dialogue systems.

Limitations and Insights

The evaluation also surfaced several key limitations:
- Synthetic data: all tests were conducted with simulated datasets rather than live campus systems, limiting generalisability.
- Scalability: a non-functional requirement in the form of horizontal scalability was not tested. The architecture supports scaling conceptually but requires validation under heavier load.

Despite these constraints, the testing process confirmed that the system was both technically reliable and linguistically robust, capable of autonomous coordination under normal conditions. The results provided a realistic picture of what worked well and what future iterations should focus on improving.
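For readers who want to reproduce this kind of check, the sketch below shows roughly how individual built-in evaluators from the azure-ai-evaluation Python package can score a single agent reply for relevance and groundedness. The judge-model configuration and sample texts are placeholders, and field names may vary slightly between SDK versions; this illustrates the approach rather than reproducing the thesis code.

```python
# Minimal sketch: scoring one agent reply with Azure AI Evaluation built-in evaluators.
# Assumptions: azure-ai-evaluation is installed, the Azure OpenAI judge configuration
# below is a placeholder, and field names may differ slightly across SDK versions.
from azure.ai.evaluation import RelevanceEvaluator, GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-openai-resource>.openai.azure.com",  # placeholder
    "azure_deployment": "gpt-4o",                                          # placeholder judge model
    "api_key": "<your-api-key>",                                           # placeholder
}

relevance = RelevanceEvaluator(model_config)
groundedness = GroundednessEvaluator(model_config)

query = "Book the Engineering Auditorium for Friday at 2 PM."
context = "Engineering Auditorium is free on Friday 14:00-16:00; Robotics Lab is booked."
response = "The Engineering Auditorium is available and has been booked for Friday at 2 PM."

print(relevance(query=query, response=response))
print(groundedness(query=query, context=context, response=response))
```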
Impact and Future Work

This project demonstrates how conversational AI and multi-agent orchestration can streamline real operational processes. By combining the Azure AI Agent Service with modular design principles, the system automates scheduling, communication, and maintenance while keeping the user experience simple and intuitive. The architecture also establishes a foundation for future extensions:
- Predictive maintenance to anticipate venue faults before they occur.
- Microsoft Teams integration for seamless in-chat scheduling.
- Scalability testing and real-user trials to validate performance at institutional scale.

Beyond its technical results, the project underscores the potential of multi-agent systems in real-world coordination tasks. It illustrates how modularity, transparency, and intelligent orchestration can make everyday workflows more efficient and human-centred.

Acknowledgements

What began with a simple Microsoft tutorial evolved into a working prototype that reimagines how campuses could manage their daily operations through conversation and collaboration. This was both a challenging and rewarding journey, and I am deeply grateful to Professor Graham Roberts (UCL) and Professor Lee Stott (Microsoft) for their guidance, feedback, and support throughout the project.

Monitoring and Evaluating LLMs in Clinical Contexts with Azure AI Foundry
👀 Missed Session 02? Don't worry—you can still catch up. But first, here's what AI HLS Ignited is all about:

What is AI HLS Ignited?

AI HLS Ignited is a Microsoft-led technical series for healthcare innovators, solution architects, and AI engineers. Each session brings to life real-world AI solutions that are reshaping the Healthcare and Life Sciences (HLS) industry. Through live demos, architectural deep dives, and GitHub-hosted code, we equip you with the tools and knowledge to build with confidence.

Session 02 Recap

In this session, we introduced MedEvals, an end-to-end evaluation framework for medical AI applications built on Azure AI Foundry. Inspired by Stanford's MedHELM benchmark, MedEvals enables providers and payers to systematically validate the performance, safety, and compliance of AI solutions across clinical decision support, documentation, patient communication, and more.

🧠 Why Scalable Evaluation Is Critical for Medical AI

"Large language models (LLMs) hold promise for tasks ranging from clinical decision support to patient education. However, evaluating the performance of LLMs in medical contexts presents unique challenges due to the complex and critical nature of medical information." — Evaluating large language models in medical applications: a survey

As AI systems become deeply embedded in healthcare workflows, the need for rigorous evaluation frameworks intensifies. Although large language models (LLMs) can augment tasks ranging from clinical documentation to decision support, their deployment in patient-facing settings demands systematic validation to guarantee safety, fidelity, and robustness. Benchmarks such as MedHELM address this requirement by subjecting models to a comprehensive battery of clinically derived tasks built on ground-truth datasets, enabling fine-grained, multi-metric performance assessment across the full spectrum of clinical use cases.

However, shipping a medical LLM is only step one. Without a repeatable, metrics-driven evaluation loop, quality erodes, regulatory gaps widen, and patient safety is put at risk. This project accelerates your ability to operationalize trustworthy LLMs by delivering plug-and-play medical benchmarks, configurable evaluators, and CI/CD templates—so every model update triggers an automated, domain-specific "health check" that flags drift, surfaces bias, and validates clinical accuracy before it ever reaches production.

🚀 How to Get Started with MedEvals

Kick off your MedEvals journey by following our curated labs. Newcomers to Azure AI Foundry can start with the foundational workflow; seasoned practitioners can dive into advanced evaluation pipelines and CI/CD integration.

🧪 Labs (a minimal evaluation sketch follows this list)

- 🧪 Foundry Basics & Custom Evaluations (🧾 Notebook): Authenticate, initialize a Foundry project, run built-in metrics, and build custom evaluators with EvalAI and PromptEval.
- 🧪 Search & Retrieval Evaluations (🧾 Notebook): Prepare datasets, execute search metrics (precision, recall, NDCG), visualize results, and register evaluators in Foundry.
- 🧪 Repeatable Evaluations & CI/CD (🧾 Notebook): Define evaluation schemas, build deterministic pipelines with PyTest, and automate drift detection using GitHub Actions.
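To give a feel for what a repeatable, metrics-driven loop can look like in code, here is a hedged sketch using the azure-ai-evaluation package's batch evaluate() entry point over a small JSONL dataset, combining a built-in evaluator with a simple custom one. The file name, judge configuration, and column names are illustrative assumptions rather than the MedEvals notebooks themselves.

```python
# Minimal sketch: a batch evaluation run with azure-ai-evaluation.
# Assumptions: azure-ai-evaluation is installed, "clinical_qa.jsonl" is a hypothetical
# dataset with "query", "context", and "response" columns, and the judge-model
# configuration is a placeholder. This is not the MedEvals notebook code.
from azure.ai.evaluation import evaluate, GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-openai-resource>.openai.azure.com",  # placeholder
    "azure_deployment": "gpt-4o",                                          # placeholder
    "api_key": "<your-api-key>",                                           # placeholder
}

def contains_dosage(response, **kwargs):
    """A trivial custom evaluator: flags whether the answer mentions a dosage unit."""
    return {"mentions_dosage": float(any(u in response.lower() for u in ("mg", "ml", "mcg")))}

result = evaluate(
    data="clinical_qa.jsonl",  # hypothetical JSONL file, one record per line
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "dosage_check": contains_dosage,
    },
)
print(result["metrics"])  # aggregate scores across the dataset
```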
🏥 Use Cases

📝 Creating Your Clinical Evaluation with RevCycle Determinations

Select a model and metric that best supports the determination and the rationale behind AI-assisted prior authorizations, based on real payor policy. This notebook use case includes:
- Selecting multiple candidate LLMs (e.g., gpt-4o, o1)
- Breaking down determinations into both deterministic results (approved vs. rejected) and the supporting rationale and logic
- Running evaluations across multiple dimensions
- Combining deterministic evaluators and LLM-as-a-Judge methods
- Evaluating the differential impacts of evaluators on the rationale across scenarios

🧾 Get Started with the Notebook

Why it matters: enables data-driven metric selection for clinical workflows, ensures transparent benchmarking, and accelerates safe AI adoption in healthcare.

📝 Evaluating AI Medical Notes Summarization Applications

Systematically assess how different foundation models and prompting strategies perform on clinical summarization tasks, following the MedHELM framework. This notebook use case includes:
- Preparing real-world datasets of clinical notes and summaries
- Benchmarking summarization quality using relevance, coherence, factuality, and harmfulness metrics
- Testing prompting techniques (zero-shot, few-shot, and chain-of-thought prompting)
- Evaluating outputs using both automated metrics and human-in-the-loop scoring

🧾 Get Started with the Notebook

Why it matters: ensures responsible deployment of AI applications for clinical summarization, guaranteeing high standards of quality, trustworthiness, and usability.

📣 Join Us for the Next Session

Help shape the future of healthcare by sharing AI HLS Ignited with your network—and don't miss what's coming next!
- 📅 Register for the upcoming session → AI HLS Ignited Event Page
- 💻 Explore the code, demos, and architecture → AI HLS Ignited GitHub Repository

Orchestrate multimodal AI insights within your healthcare data estate (Public Preview)
In today's healthcare landscape, there is an increasing emphasis on leveraging artificial intelligence (AI) to extract meaningful insights from diverse datasets to improve patient care and drive clinical research. However, incorporating AI into your healthcare data estate often brings significant costs and challenges, especially when dealing with siloed and unstructured data. Healthcare organizations produce and consume data that is not only vast but also varied in format—ranging from structured EHR entries to unstructured clinical notes and imaging data. Traditional methods require manual effort to prepare and harmonize this data for AI, specify the AI output format, set up API calls, and then store, integrate, and analyze the AI outputs for each AI model or service you decide to use.

Orchestrate multimodal AI insights is designed to streamline and scale healthcare AI within your data estate by building on the data transformations in healthcare data solutions in Microsoft Fabric. This capability provides a framework to generate AI insights by connecting your multimodal healthcare data to an ecosystem of AI services and models and integrating structured AI-generated insights back into your data estate. When you combine these AI-generated insights with the existing healthcare data in your data estate, you can power advanced analytics scenarios for your organization and patient population.

Key features:
- Metadata store lakehouse acts as a central repository for the metadata for AI orchestration, effectively capturing and managing enrichment definitions, view definitions, and contextual information for traceability purposes.
- Execution notebooks define the enrichment view and enrichment definition based on the model configuration and input mappings. They also specify the model processor and transformer. The model processor calls the model API, and the transformer produces the standardized output while saving the output in the bronze lakehouse in the Ingest folder.
- Transformation pipeline ingests AI-generated insights through the healthcare data solutions medallion lakehouse layers and persists the insights in an enrichment store within the silver layer.

Conceptual architecture:

The data transformations in healthcare data solutions in Microsoft Fabric allow you to ingest, store, and analyze multimodal data. With the orchestrate multimodal AI insights capability, this standardized data serves as the input for healthcare AI models. The model results are stored in a standardized format and provide new insights from your data. The diagram below shows the flow of integrating AI-generated insights into the data estate, starting as raw data in the bronze lakehouse and being transformed to delta tables in the silver lakehouse.

This capability simplifies AI integration across modalities for data-driven research and care, and currently supports:
- Text Analytics for health in Azure AI Language to extract medical entities such as conditions and medications from unstructured clinical notes. This utilizes the data in the DocumentReference FHIR resource (a standalone sketch of this service follows the list).
- MedImageInsight healthcare AI model in Azure AI Foundry to generate medical image embeddings from imaging data. This model leverages the data in the ImagingStudy FHIR resource.
- MedImageParse healthcare AI model in Azure AI Foundry to enable segmentation, detection, and recognition from imaging data across numerous object types and imaging modalities. This model uses the data in the ImagingStudy FHIR resource.
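As a flavour of the kind of enrichment the Text Analytics for health integration produces, here is a standalone sketch that calls the service directly with the azure-ai-textanalytics Python SDK. The endpoint, key, and sample note are placeholders; within the Fabric capability this call is wired up for you by the execution notebooks and model processor rather than written by hand.

```python
# Standalone sketch: extracting medical entities from a clinical note with
# Text Analytics for health. Endpoint and key are placeholders; inside the
# Fabric capability this step is handled by the execution notebooks instead.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="https://<your-language-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                               # placeholder
)

documents = [
    "Patient has type 2 diabetes, currently managed with metformin 500 mg twice daily."
]

poller = client.begin_analyze_healthcare_entities(documents)
for doc in poller.result():
    for entity in doc.entities:
        # e.g. "type 2 diabetes" (Diagnosis), "metformin" (MedicationName), "500 mg" (Dosage)
        print(entity.text, entity.category, round(entity.confidence_score, 2))
```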
By using orchestrate multimodal AI insights to leverage the data in healthcare data solutions for these models and integrate the results into the data estate, you can analyze your existing data alongside AI enrichments. This allows you to explore use cases such as creating image segmentations and combining them with your existing imaging metadata and clinical data to enable quick insights and disease-progression trends for clinical research at the patient level.

Get started today!

This capability is now available in public preview, and you can use the in-product sample data to test this feature with any of the three models listed above. For more information and to learn how to deploy the capability, please refer to the product documentation. We will dive deeper into more detailed aspects of the capability, such as the enrichment store and custom AI use cases, in upcoming blogs.

Medical device disclaimer: Microsoft products and services (1) are not designed, intended, or made available as a medical device, and (2) are not designed or intended to be a substitute for professional medical advice, diagnosis, treatment, or judgment and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment. Customers/partners are responsible for ensuring solutions comply with applicable laws and regulations.

FHIR® is the registered trademark of HL7 and is used with permission of HL7.

AI Agents: Key Principles and Guidelines - Part 3
This blog post, the third in a series on AI agents, focuses on user-centric design principles for building effective and trustworthy agentic systems. Drawing from the "Agentic Design Patterns" section of Microsoft's "AI Agents for Beginners" GitHub repository, the post outlines key principles categorized by Agent (Space), Agent (Time), and Agent (Core). These principles emphasize connection, accessibility, leveraging historical context, adapting to future needs, and establishing trust through transparency and control. Practical implementation guidelines are provided, along with a travel agent example to illustrate how these principles can be applied in real-world scenarios. The post also links to additional resources and previous installments in the series for a comprehensive learning experience.

Unleashing the Power of AI Agents: Transforming Business Operations
Let's get started with AI agents. In this short blog I want to explore the evolution, capabilities, and applications of AI agents, highlighting their potential to enhance productivity and efficiency. We take a peek at the challenges of developing AI agents and introduce powerful tools like Azure AI Foundry and the Azure AI Agent Service that empower developers to build, deploy, and scale AI agents securely and efficiently.

In today's rapidly evolving technological landscape, the integration of AI agents into business processes is becoming increasingly essential. Let's delve into the transformative potential of AI agents and how they can revolutionize various aspects of our operations. We begin by exploring the evolution of LLM-based solutions, tracing the journey from no agents to sophisticated multi-agent systems. This progression highlights the growing complexity and capabilities of AI agents, which are now poised to handle wide-scope, complex use cases requiring diverse skills.

Let's now look at agentic AI capabilities. AI agents can significantly enhance employee productivity and process efficiency, making our operations faster and more effective. Key applications of AI agents span industries, such as travel booking and expense management, employee onboarding, personalized customer support, and data analytics and reporting.

However, developing AI agents is not without its challenges. Some of the primary considerations include tool integration, interoperability, scalability, real-time processing, maintenance, flexibility, error handling, and security. These challenges underscore the need for robust platforms that enable rapid development and secure deployment of AI agents.

To this end, we introduce Azure AI Foundry and the Azure AI Agent Service. These tools empower developers to build, deploy, and scale AI agents securely and efficiently. Azure AI Foundry offers a comprehensive suite of tools, including model catalogs, content safety features, and machine learning capabilities. The Azure AI Agent Service, currently in public preview, provides flexible model selection, extensive data connections, enterprise-grade security, and rapid development and automation capabilities.

When building multi-agent or agentic systems, multi-agent orchestration is critically important. Tools like AutoGen and Semantic Kernel facilitate the orchestration of multi-agent systems, enabling seamless integration and collaboration between different AI agents.

In conclusion, AI agents have transformative potential to drive productivity, efficiency, and innovation. By leveraging the capabilities of Azure AI Foundry and the Azure AI Agent Service, we can overcome the challenges of AI agent development and unlock new opportunities for growth and success.

Resources
- Azure AI Discord - https://aka.ms/AzureAI/Discord
- Global AI community - https://globalai.community
- Generative AI for beginners - https://aka.ms/genai-beginners
- AI Agents for beginners - https://aka.ms/ai-agents-beginners
- Attend one of the Global AI Bootcamps near you - https://globalai.community/bootcamp/
- Build AI Tour open content - https://aka.ms/aitour/repos
- Build your first Agent with Azure AI Agent Service - Slide deck and code - https://github.com/microsoft/aitour-build-your-first-agent-with-azure-ai-agent-service

Using Advanced Reasoning Model on EdgeAI Part 1 - Quantization, Conversion, Performance
DeepSeek-R1 is very popular, and it can achieve capabilities comparable to OpenAI o1 in advanced reasoning. Microsoft has also added DeepSeek-R1 models to Azure AI Foundry and GitHub Models. We can compare DeepSeek-R1 with other available models through the GitHub Models Playground.

Note: this series revolves around the deployment of SLMs to edge devices ("Edge AI"). We will focus on the deployment of advanced reasoning models across different application scenarios. You can learn more in the session AI Tour BRK453.

In this experiment we want to deploy advanced reasoning models to the edge, so that they can run on edge devices with limited computing power and in offline environments. At this time, the recommendation is to use the traditional ONNX model format. We can use Microsoft Olive to convert the DeepSeek-R1 Distill models.

Getting started with Microsoft Olive is very straightforward. Install the Microsoft Olive library through the command line, with Python 3.10+ recommended:

pip install olive-ai

The DeepSeek-R1 Distill model series comes in different parameter sizes such as 1.5B, 7B, 8B, 14B, 32B, 70B, etc. This article is mainly based on the 1.5B, 7B, and 14B models (i.e., Small Language Models).

CPU Inference

Let's discuss 1.5B and 7B, the models with lower parameter counts. We can use the CPU directly for inference to test the effect (hardware environment: Azure DevBox, AMD EPYC 7763 64-Core + 64 GB memory + 2 TB SSD).

Quantization conversion

olive auto-opt --model_name_or_path <Your DeepSeek-R1-Distill-Qwen-1.5B/7B local location> --output_path <Your Convert ONNX INT4 Model local location> --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1

You can download the converted models directly from my Hugging Face repo (note: these models are for testing, have not been fully tested by AI Content Safety, and are not provided as official models):
- DeepSeek-R1-Distill-Qwen-1.5B-ONNX-INT4-CPU
- DeepSeek-R1-Distill-Qwen-7B-ONNX-INT4-CPU

Running with ONNX Runtime GenAI

Install ONNX Runtime GenAI and the ONNX Runtime CPU support libraries:

pip install onnxruntime-genai
pip install onnxruntime

Sample code:
- https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-1.5b.ipynb
- https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-7b.ipynb

Performance comparison: 1.5B vs 7B

We compare two different inference scenarios:
1. "explain 1+1=2"
- 1.5B quantized ONNX model: memory occupied, time consumed, and number of tokens generated
- 7B quantized ONNX model: memory occupied, time consumed, and number of tokens generated
2. "Find all pairwise different isomorphism groups with order 147 and no elements with order 49"
- 1.5B quantized ONNX model: memory occupied, time consumed, and number of tokens generated
- 7B quantized ONNX model: memory occupied, time consumed, and number of tokens generated

Results of the numbers

Through the test, we can see that the 1.5B DeepSeek model is more suitable for CPU inference and can be deployed on traditional PCs or IoT devices. As for 7B, although its reasoning is better, it does not run very effectively on the CPU.
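The linked sample notebooks drive the converted model with ONNX Runtime GenAI. As a rough illustration of what that generation loop looks like, here is a hedged sketch; the generator API has shifted between onnxruntime-genai releases (older versions set input_ids on the params and call compute_logits), the model path is a placeholder, and the prompt template is an assumption, so use the chat template from your model's tokenizer configuration.

```python
# Rough sketch of local inference with ONNX Runtime GenAI on CPU.
# Assumptions: the model folder is a placeholder path to an INT4 ONNX conversion,
# the plain prompt is illustrative (wrap it in the model's own chat template in
# practice), and generator method names may differ between onnxruntime-genai releases.
import onnxruntime_genai as og

model = og.Model("./DeepSeek-R1-Distill-Qwen-1.5B-ONNX-INT4-CPU")  # placeholder path
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

prompt = "explain 1+1=2"  # apply the model's chat template here in real use

params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```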
GPU Inference

It is ideal if we have a GPU on the edge device. Just as Microsoft Olive can quantize and convert a model to ONNX for CPU inference, it can also convert a model for GPU inference. Here I take the 14B DeepSeek-R1-Distill-Qwen-14B as an example and make an inference comparison with Microsoft's Phi-4-14B.

Quantization conversion

olive auto-opt --model_name_or_path <Your Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path> --output_path <Your converted Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path> --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1

You can download the converted models directly from my Hugging Face repo (note: these models are for testing, have not been fully tested by AI Content Safety, and are not official models):
- DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU
- Phi-4-14B-ONNX-INT4-GPU

Running with ONNX Runtime GenAI CUDA

Install ONNX Runtime GenAI and the ONNX Runtime GPU support libraries:

pip install onnxruntime-genai-cuda
pip install onnxruntime-gpu

Compare the results in the GPU environment with Gradio

A GPU with more than 8 GB of memory is recommended. To broaden the comparison, we compare Phi-4-14B-ONNX-INT4-GPU and DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU, and we also include OpenAI o1-mini (it is recommended to access o1-mini through GitHub Models).

Sample code:
- https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/Performance_AdvancedReasoning_ONNX_CPU.ipynb

You can test any prompt in Gradio to compare the results of Phi-4-14B-ONNX-INT4-GPU, DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU, and OpenAI o1-mini. DeepSeek-R1 reduces the cost of inference and produces more instructive results on professional problems, but Phi-4-14B also has advantages in reasoning and uses less computing power to complete inference. As for OpenAI o1-mini, it is more comprehensive and can handle all kinds of problems. If you want to deploy to an edge device, Phi-4-14B and quantized DeepSeek-R1 are good choices for you.

This blog is just a simple test and the first in this series. Please share your feedback and continue the discussion in the Microsoft AI Discord Channel. Feel free to send me a message or comment. We look forward to sharing more around the opportunity of Edge AI and more content in this series.

Resources
- DeepSeek-R1 in GitHub Models: https://github.com/marketplace/models/azureml-deepseek/DeepSeek-R1
- DeepSeek-R1 in Azure AI Foundry: https://ai.azure.com/explore/models/DeepSeek-R1/version/1/registry/azureml-deepseek
- Phi-4-14B on Hugging Face: https://huggingface.co/microsoft/phi-4
- Learn about Microsoft Olive: https://github.com/microsoft/olive
- Learn about ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai
- Microsoft AI Discord Channel
- BRK453 Exploring cutting-edge models: LLMs, SLMs, local development and more: https://aka.ms/aitour/brk453