Introducing Azure AI Travel Agents: A Flagship MCP-Powered Sample for AI Travel Solutions
We are excited to introduce AI Travel Agents, a sample application with enterprise functionality that demonstrates how developers can coordinate multiple AI agents (written in multiple languages) to explore travel planning scenarios. It's built with LlamaIndex.TS for agent orchestration, Model Context Protocol (MCP) for structured tool interactions, and Azure Container Apps for scalable deployment.
TL;DR: Experience the power of MCP and Azure Container Apps with The AI Travel Agents! Try out the live demo locally on your computer for free to see real-time agent collaboration in action. Share your feedback on our community forum. We're already planning enhancements, like new MCP-integrated agents, secure communication between the AI agents and MCP servers, and more.
NOTE: This example uses mock data and is intended for demonstration purposes rather than production use.

The Challenge: Scaling Personalized Travel Planning
Travel agencies grapple with complex tasks: analyzing diverse customer needs, recommending destinations, and crafting itineraries, all while integrating real-time data like trending spots or logistics. Traditional systems falter on latency, scalability, and coordination, leading to delays and frustrated clients. The AI Travel Agents tackles these issues with a technical trifecta:
LlamaIndex.TS orchestrates six AI agents for efficient task handling.
MCP equips agents with travel-specific data and tools.
Azure Container Apps ensures scalable, serverless deployment.
This architecture delivers operational efficiency and personalized service at scale, transforming chaos into opportunity.

LlamaIndex.TS: Orchestrating AI Agents
The heart of The AI Travel Agents is LlamaIndex.TS, a powerful agentic framework that orchestrates multiple AI agents to handle travel planning tasks. Built on a Node.js backend, LlamaIndex.TS manages agent interactions in a seamless and intelligent manner:
Task Delegation: The Triage Agent analyzes queries and routes them to specialized agents, like the Itinerary Planning Agent, ensuring efficient workflows.
Agent Coordination: LlamaIndex.TS maintains context across interactions, enabling coherent responses for complex queries, such as multi-city trip plans.
LLM Integration: Connects to Azure OpenAI, GitHub Models, or any local LLM using Foundry Local for advanced AI capabilities.
LlamaIndex.TS's modular design supports extensibility, allowing new agents to be added with ease. LlamaIndex.TS is the conductor, ensuring agents work in sync to deliver accurate, timely results. Its lightweight orchestration minimizes latency, making it ideal for real-time applications.

MCP: Fueling Agents with Data and Tools
The Model Context Protocol (MCP) empowers AI agents by providing travel-specific data and tools, enhancing their functionality. MCP acts as a data and tool hub:
Real-Time Data: Supplies up-to-date travel information, such as trending destinations or seasonal events, via the Web Search Agent using Bing Search.
Tool Access: Connects agents to external tools, like the .NET-based customer query analyzer for sentiment analysis, the Python-based itinerary planner for trip schedules, or the destination recommendation tools written in Java.
For example, when the Destination Recommendation Agent needs current travel trends, MCP delivers them via the Web Search Agent. This modularity allows new tools to be integrated seamlessly, future-proofing the platform. MCP's role is to enrich agent capabilities, leaving orchestration to LlamaIndex.TS.
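To make the agent-to-tool flow concrete, here is a minimal, hypothetical sketch of what an MCP client interaction looks like. The actual sample drives this from LlamaIndex.TS; the Python snippet below (using the official MCP Python SDK) only illustrates the protocol flow (connect over SSE, list tools, call a tool), and the endpoint URL and tool name are assumptions rather than the sample's real values.

import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client


async def main() -> None:
    # Connect to a hypothetical MCP tool server exposed over SSE
    # (for example, one of the containerized tool services).
    async with sse_client("http://localhost:5000/sse") as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()

            # Discover the tools this server exposes.
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])

            # "search_web" is an illustrative tool name, not the sample's actual API.
            result = await session.call_tool("search_web", {"query": "trending destinations"})
            print(result)


asyncio.run(main())

The same initialize/list/call sequence is what the LlamaIndex.TS orchestrator performs (in TypeScript) whenever an agent such as the Web Search Agent needs live data from an MCP server.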
Azure Container Apps: Scalability and Resilience
Azure Container Apps powers The AI Travel Agents sample application with a serverless, scalable platform for deploying microservices. It ensures the application handles varying workloads with ease:
Dynamic Scaling: Automatically adjusts container instances based on demand, managing booking surges without downtime.
Polyglot Microservices: Supports .NET (Customer Query), Python (Itinerary Planning), Java (Destination Recommendation), and Node.js services in isolated containers.
Observability: Integrates tracing, metrics, and logging, enabling real-time monitoring.
Serverless Efficiency: Abstracts infrastructure, reducing costs and accelerating deployment.
Azure Container Apps' global infrastructure delivers low-latency performance, critical for travel agencies serving clients worldwide.

The AI Agents: A Quick Look
While MCP and Azure Container Apps are the stars, they support a team of multiple AI agents that drive the application's functionality. Built and orchestrated with LlamaIndex.TS and connected via MCP, these agents collaborate to handle travel planning tasks:
Triage Agent: Directs queries to the right agent, leveraging MCP for task delegation.
Customer Query Agent: Analyzes customer needs (emotions, intents), using .NET tools.
Destination Recommendation Agent: Suggests tailored destinations, using Java.
Itinerary Planning Agent: Crafts efficient itineraries, powered by Python.
Web Search Agent: Fetches real-time data via Bing Search.
These agents rely on MCP's real-time communication and Azure Container Apps' scalability to deliver responsive, accurate results. It's worth noting, though, that this sample application uses mock data for demonstration purposes. In a real-world scenario, the application would communicate with MCP servers that are plugged into real production travel APIs.

Key Features and Benefits
The AI Travel Agents offers features that showcase the power of MCP and Azure Container Apps:
Real-Time Chat: A responsive Angular UI streams agent responses via MCP's SSE transport, ensuring fluid interactions.
Modular Tools: MCP enables tools like analyze_customer_query to integrate seamlessly, supporting future additions.
Scalable Performance: Azure Container Apps ensures the UI, the backend, and the MCP servers handle high traffic effortlessly.
Transparent Debugging: An accordion UI displays agent reasoning, providing backend insights.
Benefits:
Efficiency: LlamaIndex.TS streamlines operations.
Personalization: MCP's data drives tailored recommendations.
Scalability: Azure ensures reliability at scale.

Thank You to Our Contributors!
The AI Travel Agents wouldn't exist without the incredible work of our contributors. Their expertise in MCP development, Azure deployment, and AI orchestration brought this project to life. A special shoutout to:
Pamela Fox – Leading the development of the Python MCP server.
Aaron Powell and Justin Yoo – Leading the development of the .NET MCP server.
Rory Preddy – Leading the development of the Java MCP server.
Lee Stott and Kinfey Lo – Leading the development of the Foundry Local integration.
Anthony Chu and Vyom Nagrani – Leading the Azure Container Apps roadmap.
Matt Soucoup and Julien Dubois – Leading the ACA DevRel strategy.
Wassim Chegham – Architected MCP and backend orchestration.
And many more! See the GitHub repository for all contributors. Thank you for your dedication to pushing the boundaries of AI and cloud technology!

Try It Out
Experience the power of MCP and Azure Container Apps with The AI Travel Agents!
Try out the live demo locally on your computer for free to see real-time agent collaboration in action.

Conclusion
Developers can explore the open-source project on GitHub today, with setup and deployment instructions. Share your feedback on our community forum. We're already planning enhancements, like new MCP-integrated agents, secure communication between the AI agents and MCP servers, and more. This is still a work in progress and we welcome all kinds of contributions. Please fork and star the repo to stay tuned for updates! We would love your feedback and to continue the discussion in the Azure AI Foundry Discord: aka.ms/foundry/discord. On behalf of the Microsoft DevRel Team.

Make Phi-4-mini-reasoning more powerful with industry reasoning on edge devices
In situations with limited compute, Phi-4-mini-reasoning is an excellent model choice. We can use Microsoft Olive or the Apple MLX framework to quantize Phi-4-mini-reasoning and deploy it on edge devices such as IoT hardware, laptops, and mobile devices.

Quantization
To address the problem that the model is difficult to deploy directly to specific hardware, we reduce the complexity of the model through quantization. The quantization process inevitably causes some precision loss.

Quantize Phi-4-mini-reasoning using Microsoft Olive
Microsoft Olive is an AI model optimization toolkit for ONNX Runtime. Given a model and target hardware, Olive (short for Onnx LIVE) will combine the most appropriate optimization techniques to output the most efficient ONNX model for inference in the cloud or on the edge. We can combine Microsoft Olive and Phi-4-mini-reasoning from the Azure AI Foundry Model Catalog to quantize Phi-4-mini-reasoning into an ONNX format model.

Create your Notebook on Azure ML

Install Microsoft Olive

pip install git+https://github.com/Microsoft/Olive.git

Quantize using Microsoft Olive

olive auto-opt --model_name_or_path {Azure Model Catalog path, such as azureml://registries/azureml/models/Phi-4-mini-reasoning/versions/1} --device cpu --provider CPUExecutionProvider --use_model_builder --precision int4 --output_path ./phi-4-mini-reasoning-onnx --log_level 1

Register your quantized model
Register the quantized ONNX model in your Azure ML workspace (in this example it is registered as phi-4-mini-onnx-int4-cpu, the name used in the download step below).

Download to local and run

Download the ONNX model to your local device

ml_client.models.download("phi-4-mini-onnx-int4-cpu", 1)

Running the ONNX model with onnxruntime-genai

Install onnxruntime-genai (this is the CPU version)

pip install onnxruntime-genai

Run it

import onnxruntime_genai as og

model_folder = "Your ONNX Model Path"
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

search_options = {}
search_options['max_length'] = 32768

chat_template = "<|user|>{input}<|end|><|assistant|>"
text = 'A school arranges dormitories for students. If each dormitory accommodates 5 people, 4 people cannot live there; if each dormitory accommodates 6 people, one dormitory only has 4 people, and two dormitories are empty. Find the number of students in this grade and the number of dormitories.'
prompt = f'{chat_template.format(input=text)}'
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

Get the notebook from the Phi Cookbook: https://aka.ms/phicookbook

Quantize the Phi-4-mini-reasoning model using Apple MLX

Install the Apple MLX framework

pip install -U mlx-lm

Convert the Phi-4-mini-reasoning model through Apple MLX quantization

python -m mlx_lm.convert --hf-path {Phi-4-mini-reasoning Hugging Face id} -q

Run Phi-4-mini-reasoning with Apple MLX in the terminal

python -m mlx_lm.generate --model ./mlx_model --max-token 2048 --prompt "A school arranges dormitories for students. If each dormitory accommodates 5 people, 4 people cannot live there; if each dormitory accommodates 6 people, one dormitory only has 4 people, and two dormitories are empty. Find the number of students in this grade and the number of dormitories." --extra-eos-token "<|end|>" --temp 0.0

Fine-tuning
We can fine-tune on chain-of-thought (CoT) data for different scenarios to give Phi-4-mini-reasoning reasoning capabilities for those scenarios. Here we use medical CoT data from a public Hugging Face dataset as our example (this is just an example; if you need rigorous medical reasoning, please seek more professional data support). We can fine-tune on our CoT data in Azure ML.

Fine-tune Phi-4-mini-reasoning using Microsoft Olive in Azure ML
Note: please use Standard_NC24ads_A100_v4 to run this sample.

Get data from Hugging Face datasets

pip install datasets

Run this script to get the training data

from datasets import load_dataset

# Prompt template wrapping question, chain of thought, and answer
# (same template as in the Apple MLX example below)
prompt_template = """<|user|>{}<|end|><|assistant|><think>{}</think>{}<|end|>"""

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = prompt_template.format(input, cot, output) + "<|end|>"
        # text = prompt_template.format(input, cot, output) + "<|endoftext|>"
        texts.append(text)
    return {
        "text": texts,
    }

# Create the English dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train", trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=["Question", "Complex_CoT", "Response"])
dataset.to_json("en_dataset.jsonl")

Fine-tuning with Microsoft Olive

olive finetune \
    --method lora \
    --model_name_or_path {Azure Model Catalog path, e.g. azureml://registries/azureml/models/Phi-4-mini-reasoning/versions/1} \
    --trust_remote_code \
    --data_name json \
    --data_files ./en_dataset.jsonl \
    --train_split "train[:16000]" \
    --eval_split "train[16000:19700]" \
    --text_field "text" \
    --max_steps 100 \
    --logging_steps 10 \
    --output_path {Your fine-tuning save path} \
    --log_level 1

Convert the model to ONNX with Microsoft Olive

olive capture-onnx-graph \
    --model_name_or_path {Azure Model Catalog path, e.g. azureml://registries/azureml/models/Phi-4-mini-reasoning/versions/1} \
    --adapter_path {Your fine-tuning adapter path} \
    --use_model_builder \
    --output_path {Your save onnx path} \
    --log_level 1

olive generate-adapter \
    --model_name_or_path {Your save onnx path} \
    --output_path {Your save onnx adapter path} \
    --log_level 1

Run the model with onnxruntime-genai-cuda

Install the onnxruntime-genai-cuda SDK

pip install onnxruntime-genai-cuda

Inference the model with onnxruntime-genai CUDA

import onnxruntime_genai as og
import numpy as np
import os

model_folder = "./models/phi-4-mini-reasoning/adapter-onnx/model/"
model = og.Model(model_folder)

adapters = og.Adapters(model)
adapters.load('./models/phi-4-mini-reasoning/adapter-onnx/model/adapter_weights.onnx_adapter', "en_medical_reasoning")

tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False
search_options['temperature'] = 1
search_options['top_k'] = 1

prompt_template = """<|user|>{}<|end|><|assistant|><think>"""
question = """
A 33-year-old woman is brought to the emergency department 15 minutes after being stabbed in the chest with a screwdriver. Given her vital signs of pulse 110/min, respirations 22/min, and blood pressure 90/65 mm Hg, along with the presence of a 5-cm deep stab wound at the upper border of the 8th rib in the left midaxillary line, which anatomical structure in her chest is most likely to be injured?
"""
prompt = prompt_template.format(question, "")

input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.set_active_adapter(adapters, "en_medical_reasoning")
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

Fine-tune Phi-4-mini-reasoning using Apple MLX locally on macOS
Note: we recommend Apple Silicon devices with a minimum of 64 GB of memory.

Get the dataset from Hugging Face datasets

pip install datasets

Run this script to get the training and validation data

from datasets import load_dataset

prompt_template = """<|user|>{}<|end|><|assistant|><think>{}</think>{}<|end|>"""

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        # text = prompt_template.format(input, cot, output) + "<|end|>"
        text = prompt_template.format(input, cot, output) + "<|endoftext|>"
        texts.append(text)
    return {
        "text": texts,
    }

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", trust_remote_code=True)

split_dataset = dataset["train"].train_test_split(test_size=0.2, seed=200)
train_dataset = split_dataset['train']
validation_dataset = split_dataset['test']

train_dataset = train_dataset.map(formatting_prompts_func, batched=True, remove_columns=["Question", "Complex_CoT", "Response"])
train_dataset.to_json("./data/train.jsonl")

validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True, remove_columns=["Question", "Complex_CoT", "Response"])
validation_dataset.to_json("./data/valid.jsonl")

Fine-tuning with Apple MLX

python -m mlx_lm.lora --model ./phi-4-mini-reasoning --train --data ./data --iters 100

Running the model
python -m mlx_lm.generate --model ./phi-4-mini-reasoning --adapter-path ./adapters --max-token 4096 --prompt "A 54-year-old construction worker with a long history of smoking presents with swelling in his upper extremity and face, along with dilated veins in this region. After conducting a CT scan and venogram of the neck, what is the most likely diagnosis for the cause of these symptoms?" --extra-eos-token "<|end|>"

Get the notebook from the Phi Cookbook: https://aka.ms/phicookbook

We hope this sample has inspired you to use Phi-4-mini-reasoning and Phi-4-reasoning to build industry reasoning for your own scenarios.

Related resources
Phi-4-mini-reasoning Tech Report: https://aka.ms/phi4-mini-reasoning/techreport
Phi-4-mini-reasoning on Azure AI Foundry: https://aka.ms/phi4-mini-reasoning/azure
Phi-4 Reasoning Blog: https://aka.ms/phi4-mini-reasoning/blog
Phi Cookbook: https://aka.ms/phicookbook
Showcasing Phi-4-Reasoning: A Game-Changer for AI Developers | Microsoft Community Hub

Models
Phi-4 Reasoning: https://huggingface.co/microsoft/Phi-4-reasoning
Phi-4 Reasoning Plus: https://huggingface.co/microsoft/Phi-4-reasoning-plus
Phi-4-mini-reasoning on Hugging Face: https://aka.ms/phi4-mini-reasoning/hf
Phi-4-mini-reasoning on Azure AI Foundry: https://aka.ms/phi4-mini-reasoning/azure
Microsoft models on Hugging Face
Phi-4 Reasoning models in Azure AI Foundry Models
Phi models at Azure AI Foundry Models
Phi models on Hugging Face
Phi models on GitHub Marketplace Models

Build AI Agents with MCP Tool Use in Minutes with AI Toolkit for VSCode
We're excited to announce Agent Builder, the newest evolution of what was formerly known as Prompt Builder, now reimagined and supercharged for intelligent app development. This powerful tool in AI Toolkit enables you to create, iterate, and optimize agents, from prompt engineering to tool integration, all in one seamless workflow. Whether you're designing simple chat interactions or complex task-performing agents with tool access, Agent Builder simplifies the journey from idea to integration.

Why Agent Builder?
Agent Builder is designed to empower developers and prompt engineers to:
🚀 Generate starter prompts with natural language
🔁 Iterate and refine prompts based on model responses
🧩 Break down tasks with prompt chaining and structured outputs
🧪 Test integrations with real-time runs and tool use such as MCP servers
💻 Generate production-ready code for rapid app development
And a lot of features are coming soon, stay tuned for:
📝 Use variables in prompts
🔄 Run your agent with test cases to test it easily
📊 Evaluate the accuracy and performance of your agent with built-in or custom metrics
☁️ Deploy your agent to the cloud

Build Smart Agents with Tool Use (MCP Servers)
Agents can now connect to external tools through MCP (Model Context Protocol) servers, enabling them to perform real-world actions like querying a database, accessing APIs, or executing custom logic.

Connect to an Existing MCP Server
To use an existing MCP server in Agent Builder:
In the Tools section, select + MCP Server.
Choose a connection type: Command (stdio) – run a local command that implements the MCP protocol; HTTP (server-sent events) – connect to a remote server implementing the MCP protocol.
If the MCP server supports multiple tools, select the specific tool you want to use.
Enter your prompts and click Run to test the agent's interaction with the tool.
This integration allows your agents to fetch live data or trigger custom backend services as part of the conversation flow.

Build and Scaffold a New MCP Server
Want to create your own tool? Agent Builder helps you scaffold a new MCP server project:
In the Tools section, select + MCP Server.
Choose MCP server project.
Select your preferred programming language: Python or TypeScript.
Pick a folder to create your server project.
Name your project and click Create.
Agent Builder generates a scaffolded implementation of the MCP protocol that you can extend. Use the built-in VS Code debugger: press F5 or click Debug in Agent Builder, then test with prompts like:
System: You are a weather forecast professional that can tell weather information based on given location.
User: What is the weather in Shanghai?
Agent Builder will automatically connect to your running server and show the response, making it easy to test and refine the tool-agent interaction.
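As an illustration of the kind of Python MCP server you might end up with after scaffolding and extending it for the weather example above, here is a minimal, hypothetical sketch; the get_weather tool, its mock reply, and the server name are assumptions, not the exact code Agent Builder generates.

# Illustrative only: a tiny MCP server exposing one weather tool over stdio.
# Assumes the official MCP Python SDK is installed: pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")  # hypothetical server name


@mcp.tool()
def get_weather(location: str) -> str:
    """Return a (mock) weather report for the given location."""
    # A real implementation would call a weather API here.
    return f"It is 22°C and partly cloudy in {location}."


if __name__ == "__main__":
    # stdio transport matches the "Command (stdio)" connection type in Agent Builder.
    mcp.run()

With a server like this running, the test prompt "What is the weather in Shanghai?" should lead the agent to call get_weather with location set to "Shanghai" and stream the tool's reply back into the conversation.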
AI Sparks: from Prototype to Production with AI Toolkit
Building AI-powered applications from scratch or infusing intelligence into existing systems? AI Sparks is your go-to webinar series for mastering the AI Toolkit (AITK), from foundational concepts to cutting-edge techniques. In this bi-weekly, hands-on series, we'll cover:
🚀 SLMs & Local Models – Test and deploy AI models and applications efficiently on your own terms: locally, to edge devices, or to the cloud.
🔍 Embedding Models & RAG – Supercharge retrieval for smarter applications using existing data.
🎨 Multi-Modal AI – Work with images, text, and beyond.
🤖 Agentic Frameworks – Build autonomous, decision-making AI systems.
Watch on Demand

Share your feedback
Get started with the latest version, share your feedback, and let us know how these new features help you in your AI development journey. As always, we're here to listen, collaborate, and grow alongside our amazing user community. Thank you for being a part of this journey; let's build the future of AI together! Join our Microsoft Azure AI Foundry Discord channel to continue the discussion 🚀

Selecting and upgrading models using Evaluations – Part 2
In the previous article, we explored why evaluations are crucial and how they can help you choose the right model for your specific industry, domain, or app-level data. We also introduced the "bulk-run" feature in AI Toolkit for Visual Studio Code, which allows you to automate parts of the human evaluation process. In this article, we'll take things a step further by using a more capable model to evaluate the responses of a less capable one. For example, you might compare older versions of a model against a newer, more powerful version, or evaluate a fine-tuned small language model (SLM) using a larger model like GPT-4o. You can access this functionality through the "Evaluations" option in the tools menu of the AI Toolkit for Visual Studio Code extension (see below). But before we start using it, let's take a moment to understand the distinct types of evaluation methods available for assessing responses from large language models.

Evaluators
When testing AI models, it's not enough to just look at outputs manually. Evaluators help us systematically measure how well a model is performing across dimensions like relevance, coherence, and fluency; more specific metrics cover grammar, similarity to ground truth, and more. Below is a brief overview of the key evaluators commonly used:
Coherence - Evaluates how naturally and logically a model's response flows. It checks whether the answer makes sense in context and follows a consistent train of thought. Required columns: query, response
Fluency - Assesses grammatical correctness and fluency. A fluent response reads smoothly, like something a human would write. Required columns: response
Relevance - Checks how well the response answers the original question or prompt. It's all about staying on topic and being helpful. Required columns: query, response
Similarity - Measures how similar the model's response is to a reference (ground truth), taking both the question and answer into account. Required columns: query, response, ground_truth
BLEU (Bilingual Evaluation Understudy) - A popular metric that compares how closely the model's output matches reference texts using n-gram overlaps. Required columns: response, ground_truth
F1 Score - Calculates the overlap of words between the model's output and the correct answer, balancing precision and recall. Required columns: response, ground_truth
GLEU - Similar to BLEU but optimized for sentence-level evaluation. It uses n-gram overlap to assess how well the output matches the reference. Required columns: response, ground_truth
METEOR - Goes beyond simple word overlap by aligning synonyms and related phrases, while also focusing on precision, recall, and word order. Required columns: response, ground_truth

Using Evaluations
Now that we have an overview of the evaluators, let's use a sample dataset to run an evaluation. Open Visual Studio Code and select the AI Toolkit extension. In the AI Toolkit extension, click on the Tools Menu > Evaluations and you should get a window like below: You can either create a new evaluation or create a new evaluation run (see the blue button on the top right of the screen). If you create a new evaluation, you can choose one or more of the evaluators we talked about above. You can use the sample dataset, or you can use your own dataset (a minimal example of the expected JSON Lines format follows below).
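For reference, such a dataset is a JSON Lines (.jsonl) file in which every line is one record carrying the columns your chosen evaluators require. The snippet below is a minimal, hypothetical sketch of producing such a file with query, response, and ground_truth columns; the rows and file name are made up, not the bundled sample dataset.

import json

# Hypothetical rows; the column names match the evaluator requirements listed above.
rows = [
    {
        "query": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "ground_truth": "Paris is the capital of France.",
    },
    {
        "query": "Summarize the return policy.",
        "response": "Items can be returned within 30 days with a receipt.",
        "ground_truth": "Returns are accepted within 30 days when a receipt is provided.",
    },
]

# Write one JSON object per line to produce a JSON Lines dataset.
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")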
Just be aware that if you are running a large dataset of your own, you might run into the rate limits for GitHub Models if you choose those for evaluating the output. You can create your own dataset in the JSON Lines format we discussed in the earlier part of this blog post. In addition to using your own dataset to evaluate the model, you can also use your own Python evaluators. Click on the Evaluators tab and you should see the following screen. Using the Create Evaluation button (highlighted in blue in the top right-hand corner of the pane), you can create and add your own evaluator. The fields are self-explanatory.

Evaluation run
Let's now run the evaluation, and you should see something like the output below. You can see the line-by-line input (from the JSON Lines dataset that you used) and output against each of the evaluators that you have selected. You can also see the details of the run in the output pane below as the evaluations run. You will see each evaluation start (once per evaluator) and run through each of the lines in your dataset. You might also see occasional errors due to rate limiting, which can be retried automatically by the AI Toolkit executor. You can see the scores for each of the evaluators by scrolling horizontally. You can additionally back up and check these scores using human evaluations as well, if necessary, especially for fields where domain expertise is important and the risk of harm due to errors is higher.

Evaluations play a key role in understanding and selecting models and in improving model performance across tasks and domains. By using a mix of automated and human-in-the-loop evaluators, you can get a clearer picture of your model's strengths and weaknesses. Start small, measure often, and let the data guide your AI application iterations.

Further reading
Selecting and upgrading models using Evaluations – Part 1
Evaluating generative AI applications - https://aka.ms/evaluate-genAI
AI Toolkit Samples
Generative AI for Beginners guide - https://microsoft.github.io/generative-ai-for-beginners
AI Toolkit for VS Code Marketplace - https://aka.ms/AIToolkit
Docs for AI Toolkit - https://aka.ms/AIToolkit/doc
AI Sparks series - https://developer.microsoft.com/en-us/reactor/events/25040/

The Startup Stage: Powered by Microsoft for Startups at European AI & Cloud Summit
🚀 The Startup Stage: Powered by Microsoft for Startups
Take center stage in the AI and Cloud Startup Program, designed to showcase groundbreaking solutions and foster collaboration between ambitious startups and influential industry leaders. Whether you're looking to engage with potential investors, connect with clients, or share your boldest ideas, this is the platform to shine.

Why Join the Startup Stage?
Pitch to Top Investors: Present your ideas and products to key decision-makers in the tech world.
Gain Visibility: Showcase your startup in a vibrant space dedicated to innovation, and prove that you are the next game-changer.
Learn from the Best: Hear from visionary thought leaders and Microsoft AI experts about the latest trends and opportunities in AI and cloud.

AI Competition: Propel Your Startup
Stand out from the crowd by participating in the European AI & Cloud Startup Stage competition, exclusively designed for startups leveraging Microsoft AI and Azure Cloud services. Compete for prestigious awards, including:
$25,000 in Microsoft Azure Credits.
A mentoring session with Marco Casalaina, VP of Products at Azure AI.
Fast-track access to exclusive resources through the Microsoft for Startups Program.
Get ready to deliver a pitch in front of a live audience and an expert panel on 28 May 2025!

How to Apply:
Ensure your startup solution runs on Microsoft AI and Azure Cloud.
Register as a conference attendee and submit your competition application form before the deadline: 14 April 2025, at the European Cloud and AI Summit.

Be Part of Something Bigger
This isn't just an exhibition—it's a thriving community where innovation meets opportunity. Don't miss out! With tickets already 70% sold out, now's the time to secure your spot. Join the European AI and Cloud Startup Area with a booth or launchpad, and accelerate your growth in the tech ecosystem. Visit the European AI and Cloud Summit website (https://ecs.events) to learn more, purchase tickets, or apply for the AI competition. Download the sponsorship brochure for detailed insights into this once-in-a-lifetime event. Together, let's shape the future of cloud technology. See you in Düsseldorf! 🎉

Getting Started with the AI Dev Gallery
March Update: The Gallery is now available on the Microsoft Store!
The AI Dev Gallery is a new open-source project designed to inspire and support developers in integrating on-device AI functionality into their Windows apps. It offers an intuitive UX for exploring and testing interactive AI samples powered by local models. Key features include:
Quickly explore and download models from well-known sources on GitHub and HuggingFace.
Test different models with interactive samples across over 25 different scenarios, including text, image, audio, and video use cases.
See all relevant code and library references for every sample.
Switch between models that run on CPU and GPU depending on your device capabilities.
Quickly get started with your own projects by exporting any sample to a fresh Visual Studio project that references the same model cache, preventing duplicate downloads.
Part of the motivation behind the Gallery was exposing developers to the host of benefits that come with on-device AI. Some of these benefits include improved data security and privacy, increased control and parameterization, and no dependence on an internet connection or third-party cloud provider.

Requirements
Device Requirements
Minimum OS Version: Windows 10, version 1809 (10.0; Build 17763)
Architecture: x64, ARM64
Memory: At least 16 GB is recommended
Disk Space: At least 20 GB of free space is recommended
GPU: 8 GB of VRAM is recommended for running samples on the GPU

Using the Gallery
The AI Dev Gallery can be navigated in two ways: the Samples View and the Models View.

Navigating Samples
In this view, samples are broken up into categories (Text, Code, Image, etc.) and then into more specific samples, like the Translate Text sample pictured below. On clicking a sample, you will be prompted to choose a model to download if you haven't run this sample before. Next to the model you can see the size of the model, whether it will run on CPU or GPU, and the associated license. Pick the model that makes the most sense for your machine. You can also download new models and change the model for a sample later from the sample view; just click the model dropdown at the top of the sample. The last thing you can do from the sample pane is view the sample code and export the project to Visual Studio. Both buttons are found in the top right corner of the sample, and the code view will look like this:

Navigating Models
If you would rather navigate by models instead of samples, the Gallery also provides the model view. The model view contains a similar navigation menu on the right to navigate between models based on category. Clicking on a model will allow you to see a description of the model, the versions of it that are available to download, and the samples that use the model. Clicking on a sample will take you back over to the samples view, where you can see the model in action.

Deleting and Managing Models
If you need to clear up space or see download details for the models you are using, you can head over to the Settings page to manage your downloads. From here, you can easily see every model you have downloaded and how much space on your drive they are taking up. You can clear your entire cache for a fresh start or delete individual models that you are no longer using. Any deleted model can be redownloaded through either the models or samples view.
Next Steps for the Gallery
The AI Dev Gallery is still a work in progress, and we plan on adding more samples, models, APIs, and features; we are also evaluating adding support for NPUs to take the experience even further. If you have feedback, noticed a bug, or have any ideas for features or samples, head over to the issue board and submit an issue. We also have a discussion board for any other topics relevant to the Gallery. The Gallery is an open-source project, and we would love contributions, feedback, and ideas! Happy modeling!

Join the ONNX Generative AI Runtime teams for a discussion on the newest releases
Join Us for an Exclusive Round Table Discussion on the ONNX Generative AI Runtime!
Date: 24 March 2025
Time: 8:30am PT
Location: Microsoft AI Discord Community

What is an AMA? An "Ask Me Anything" (AMA) is an informal discussion where the floor is opened to the general public to ask the host or a guest anything they want to know. It's a great opportunity to interact directly with experts and get your questions answered in real time. Don't miss this opportunity to connect with our experts and enhance your understanding of ONNX. Mark your calendars and prepare your questions for an engaging and informative session! Join the AMA session!
How to join: Join the Azure AI Community Discord

Unlock the Future of AI with ONNX Runtime
Discover the ONNX Generative AI runtime and explore the limitless possibilities of generative AI. Whether you're an AI enthusiast, developer, or industry expert, this round table is your chance to dive deep into the innovative world of ONNX Runtime.
Event Highlights:
In-depth overview of the ONNX Generative AI Runtime
Interactive session with step-by-step coding examples
User experiences and success stories
Why Attend?
Gain expert insights into the ONNX Generative AI Runtime
Network with like-minded professionals
Enhance your AI skills with practical sessions
Stay ahead of the curve with cutting-edge technology

Speakers and Panelists:
Kunal Vaishnavi – Software engineer in the AI Platform team at Microsoft, focusing on optimizing the latest state-of-the-art models. He is a co-founder of ONNX Runtime GenAI and invented the model builder.
Baiju Meswani – Designer of the pipelined model runtime and the multimodal model API, and general continuous integration and package publishing guru.
Ryan Hill – Software engineer, initial creator of the ONNX Generative AI project and its core architecture and APIs.
Natalie Kershaw – Program manager of the ONNX Generative AI Runtime and general wrangler.

Pose Estimation with the AI Dev Gallery
What's Going On Here?
This blog post is the first in an upcoming series that will spotlight the local AI samples contained in the new AI Dev Gallery. The Gallery is a preview project that aims to showcase local AI scenarios on Windows and to give developers the guidance they need to enable those scenarios themselves. The Gallery is open-source and contains a wide selection of different models and samples, including text, image, audio, and video use cases. In addition to being able to see a given model in action, each sample contains a source code view and a button to export the sample directly to a new Visual Studio project. The Gallery is available on the Microsoft Store and is entirely open sourced on GitHub. For this first sample spotlight, we will be taking a look at one of my favorite scenarios: Human Pose Estimation with HRNet. This sample is enabled by ONNX Runtime, and depending on the processor in your Windows device, this sample supports running on the CPU and NPU. I'll cover how to check which hardware is supported and how to switch between them later in the post.

Pose Estimation Demo
This sample takes in an uploaded photo and renders pose estimations onto the main human figure in the photo. It will render connections between the torso and limbs, along with five points corresponding to key facial features (eyes, nose, and ears). Before diving into the code for this sample, here's a quick video example. Let's get right to the code to see how this is implemented.

Code Walkthrough
This walkthrough will focus on essential code and may gloss over some UI logic and helper functions. The full code for this sample can be browsed in depth in the AI Dev Gallery itself or in the GitHub repository. When this sample is first opened, it will make an initial call to LoadModelAsync, which looks like this:

protected override async Task LoadModelAsync(SampleNavigationParameters sampleParams)
{
    // Tell our inference session where our model lives and which hardware to run it on
    await InitModel(sampleParams.ModelPath, sampleParams.HardwareAccelerator);
    sampleParams.NotifyCompletion();

    // Make first call to inference once model is loaded
    await DetectPose(Path.Join(Windows.ApplicationModel.Package.Current.InstalledLocation.Path, "Assets", "pose_default.png"));
}

In this function, a ModelPath and HardwareAccelerator are passed into our InitModel function, which handles instantiating an ONNX Runtime InferenceSession with our model location and the hardware that inference will be performed on. You can jump to Switching to NPU Execution later in this post for more in-depth information on how the InferenceSession is instantiated. Once the model has finished initializing, this function calls for an initial round of inference via DetectPose on a default image.

Preprocessing, Calling For Inference, and Postprocessing Output
The inference logic, along with the required preprocessing and postprocessing, takes place in the DetectPose function. This is a pretty long function, so let's go through it piece by piece.
First, this function checks that it was passed a valid file path and performs some updates to our XAML:

private async Task DetectPose(string filePath)
{
    // Check if the passed in file path exists, and return if not
    if (!Path.Exists(filePath))
    {
        return;
    }

    // Update XAML to put the view into the "Loading" state
    Loader.IsActive = true;
    Loader.Visibility = Visibility.Visible;
    UploadButton.Visibility = Visibility.Collapsed;

    DefaultImage.Source = new BitmapImage(new Uri(filePath));

Next, the input image is loaded into a Bitmap and then resized to the expected input size of the HRNet model (256x192) with the helper function ResizeBitmap:

    // Load bitmap from image filepath
    using Bitmap originalImage = new(filePath);

    // Store expected input dimensions in variables, as these will be used later
    int modelInputWidth = 256;
    int modelInputHeight = 192;

    // Resize Bitmap to expected dimensions with ResizeBitmap helper
    using Bitmap resizedImage = BitmapFunctions.ResizeBitmap(originalImage, modelInputWidth, modelInputHeight);

Once the image is stored in a bitmap of the proper size, we create a Tensor of dimensionality 1x3x256x192 that will represent the image. Each dimension, in order, corresponds to these values:
Batch Size: our first value of 1 is just the number of inputs that are being processed. This implementation processes a single image at a time, so the batch size is just one.
Color Channels: The next dimension has a value of 3 and corresponds to each of the typical color channels: red, green, and blue. This will define the color of each pixel in the image.
Width: The next value of 256 (passed as modelInputWidth) is the pixel width of our image.
Height: The last value of 192 (passed as modelInputHeight) is the pixel height of our image.
Taken as a whole, this tensor represents a single image where each pixel in that image is defined by an X (width) and Y (height) pixel value and three color values (red, green, blue). Also, it is good to note that the processing and inference section of this function is being run in a Task to prevent the UI from becoming blocked:

    // Run our processing and inference logic as a Task to prevent the UI from being blocked
    var predictions = await Task.Run(() =>
    {
        // Define a tensor that represents every pixel of a single image
        Tensor<float> input = new DenseTensor<float>([1, 3, modelInputWidth, modelInputHeight]);

To improve the quality of the input, instead of just passing in the original pixel values to the tensor, the pixel values are normalized with the PreprocessBitmapWithStdDev helper function. This function uses the mean of each RGB value and the standard deviation (how far a value typically varies away from its mean) to "level out" outlier color values. You can think of it as a way of preventing images with really dramatic color differences from confusing the model. This step does not affect the dimensionality of the input. It only adjusts the values that will be stored in the tensor:

        // Normalize our input and store it in the "input" tensor. Dimension is still 1x3x256x192
        input = BitmapFunctions.PreprocessBitmapWithStdDev(resizedImage, input);

There is one last small step of setup before the input is passed to the InferenceSession, as ONNX expects a certain input format for inference. A List of type NamedOnnxValue is created with only one entry, representing the input tensor that was just processed.
Each NamedOnnxValue expects a metadata name (which is grabbed from the model itself using the InferenceSession) and a value (the tensor that was just processed):

        // Snag the input metadata name from the inference session
        var inputMetadataName = _inferenceSession!.InputNames[0];

        // Create a list of NamedOnnxValues, with one entry
        var onnxInputs = new List<NamedOnnxValue>
        {
            // Call NamedOnnxValue.CreateFromTensor and pass in input metadata name and input tensor
            NamedOnnxValue.CreateFromTensor(inputMetadataName, input)
        };

The onnxInputs list that was just created is passed to InferenceSession.Run. It returns a collection of DisposableNamedOnnxValues to be processed:

        // Call Run to perform inference
        using IDisposableReadOnlyCollection<DisposableNamedOnnxValue> results = _inferenceSession!.Run(onnxInputs);

The output of the HRNet model is a bit more verbose than a list of coordinates that correspond with human pose key points (like left knee or right shoulder). Instead of exact predictions, it returns a heatmap for every pose key point that scores each location on the image with a probability that a certain joint exists there. So, there's a bit more work to do to get points that can be placed on an image. First, the function sets up the necessary values for post processing:

        // Fetch the heatmaps list from the inference results
        var heatmaps = results[0].AsTensor<float>();

        // Get the output name from the inference session
        var outputName = _inferenceSession!.OutputNames[0];

        // Use the output name to get the dimensions of the output from the inference session
        var outputDimensions = _inferenceSession!.OutputMetadata[outputName].Dimensions;

        // Finally, get the output width and height from those dimensions
        float outputWidth = outputDimensions[2];
        float outputHeight = outputDimensions[3];

The output width and height are passed, along with the heatmaps list and the original image dimensions, to the PostProcessResults helper function. This function does two actions with each heatmap: It iterates over every value in the heatmap to find the coordinates where the probability is highest for each pose key point. It scales that value back to the size of the original image, since it was changed when it was passed into inference. This is why the original image dimensions were passed. From this function, a list of tuples containing the X and Y location of each key point is returned, so that they can be properly rendered onto the image:

        // Post process heatmap results to get key point coordinates
        List<(float X, float Y)> keypointCoordinates = PoseHelper.PostProcessResults(heatmaps, originalImage.Width, originalImage.Height, outputWidth, outputHeight);

        // Return those coordinates from the task
        return keypointCoordinates;
    });

Next up is rendering.

Rendering Pose Predictions
Rendering is handled by the RenderPredictions helper function, which takes in the original image, the predictions that were generated, and a marker ratio to define how large to draw the predictions on the image. Note that this code is still being called from the DetectPose function:

    using Bitmap output = PoseHelper.RenderPredictions(originalImage, predictions, .02f);

Rendering predictions is pretty key to the pose estimation flow, so let's dive into this function. This function will draw two things:
Red ellipses at each pose key point (right knee, left eye, etc.)
Blue lines connecting joint key points (right knee to right ankle, left shoulder to left elbow, etc.)
Face key points (eyes, nose, ears) do not have any connections and will just have ellipses rendered for them. The first thing the function does is set up the Graphics, Pen, and Brush objects necessary for drawing:

public static Bitmap RenderPredictions(Bitmap image, List<(float X, float Y)> keypoints, float markerRatio, Bitmap? baseImage = null)
{
    // Create a graphics object from the image
    using (Graphics g = Graphics.FromImage(image))
    {
        // Average out width and height of image.
        // Ignore baseImage portion, it is used by another sample.
        var averageOfWidthAndHeight = baseImage != null ? baseImage.Width + baseImage.Height : image.Width + image.Height;

        // Get the marker size from the average dimension value and the marker ratio
        int markerSize = (int)(averageOfWidthAndHeight * markerRatio / 2);

        // Create a Red brush for the keypoints and a Blue pen for the connections
        Brush brush = Brushes.Red;
        using Pen linePen = new(Color.Blue, markerSize / 2);

Next, a list of (int, int) tuples is instantiated that represents each connection. Each tuple has a StartIdx (where the connection starts, like left shoulder) and an EndIdx (where the connection ends, like left elbow). These indexes are always the same based on the output of the pose model and move from top to bottom on the human figure. As a result, you'll notice that indexes 0-4 are skipped, as those indexes represent the face key points, which don't have any connections:

        // Create a list of index tuples that represents each pose connection, face key points are excluded.
        List<(int StartIdx, int EndIdx)> connections =
        [
            (5, 6),   // Left shoulder to right shoulder
            (5, 7),   // Left shoulder to left elbow
            (7, 9),   // Left elbow to left wrist
            (6, 8),   // Right shoulder to right elbow
            (8, 10),  // Right elbow to right wrist
            (11, 12), // Left hip to right hip
            (5, 11),  // Left shoulder to left hip
            (6, 12),  // Right shoulder to right hip
            (11, 13), // Left hip to left knee
            (13, 15), // Left knee to left ankle
            (12, 14), // Right hip to right knee
            (14, 16)  // Right knee to right ankle
        ];

Next, for each tuple in that list, a blue line representing a connection is drawn on the image with DrawLine. It takes in the Pen that was created, along with start and end coordinates from the keypoints list that was passed into the function:

        // Iterate over connections with a foreach loop
        foreach (var (startIdx, endIdx) in connections)
        {
            // Store keypoint start and end values in tuples
            var (startPointX, startPointY) = keypoints[startIdx];
            var (endPointX, endPointY) = keypoints[endIdx];

            // Pass those start and end coordinates, along with the Pen, to DrawLine
            g.DrawLine(linePen, startPointX, startPointY, endPointX, endPointY);
        }

Next, the exact same thing is done for the red ellipses representing the keypoints. The entire keypoints list is iterated over because every key point gets an indicator regardless of whether or not it was included in a connection.
The red ellipses are drawn second as they should be rendered on top of the blue lines representing connections:

        // Iterate over keypoints with a foreach loop
        foreach (var (x, y) in keypoints)
        {
            // Draw an ellipse using the red brush, the x and y coordinates, and the marker size
            g.FillEllipse(brush, x - markerSize / 2, y - markerSize / 2, markerSize, markerSize);
        }

Now just return the image:

        return image;

Jumping back over to DetectPose, the last thing left to do is to update the UI with the rendered predictions on the image:

    // Convert the output to a BitmapImage
    BitmapImage outputImage = BitmapFunctions.ConvertBitmapToBitmapImage(output);

    // Enqueue all our UI updates to ensure they don't happen off the UI thread.
    DispatcherQueue.TryEnqueue(() =>
    {
        DefaultImage.Source = outputImage;
        Loader.IsActive = false;
        Loader.Visibility = Visibility.Collapsed;
        UploadButton.Visibility = Visibility.Visible;
    });

That's it! The final output looks like this:

Switching to NPU Execution
This sample also supports running on the NPU, in addition to the CPU, if you have met the correct device requirements. You will need a Windows device with a Qualcomm NPU to run NPU samples in the Gallery. The easiest way to check if your device is NPU capable is within the Gallery itself. Using the Select Model dropdown, you can see which execution providers are supported on your device. I'm on a device with a Qualcomm NPU, so the Gallery is giving the option to run the sample on both CPU and NPU.

How Gallery Samples Handle Switching Between Execution Providers
When the pose sample is selected with a specific hardware accelerator, that information is passed to the InitModel function that handles how the inference session is instantiated. It will specify the Qualcomm QNN execution provider that enables NPU execution. It looks like this:

private Task InitModel(string modelPath, HardwareAccelerator hardwareAccelerator)
{
    return Task.Run(() =>
    {
        // Check if we already have an inference session
        if (_inferenceSession != null)
        {
            return;
        }

        // Set up ONNX Runtime (ORT) session options object
        SessionOptions sessionOptions = new();
        sessionOptions.RegisterOrtExtensions();

        if (hardwareAccelerator == HardwareAccelerator.QNN) // Check if QNN was passed
        {
            // Add the QNN execution provider if so
            Dictionary<string, string> options = new()
            {
                { "backend_path", "QnnHtp.dll" },
                { "htp_performance_mode", "high_performance" },
                { "htp_graph_finalization_optimization_mode", "3" }
            };
            sessionOptions.AppendExecutionProvider("QNN", options);
        }

        // Create a new inference session with these sessionOptions; if CPU is selected, they will be default
        _inferenceSession = new InferenceSession(modelPath, sessionOptions);
    });
}

With this function, an InferenceSession can be instantiated to fit whatever execution provider is passed in that particular situation, and then that InferenceSession can be used throughout the sample.

What's Next
More in-depth coverage of the other samples in the gallery will be released periodically, covering a range of what is possible with local AI on Windows. Stay tuned for more sample breakdowns coming soon. In the meantime, go check out the AI Dev Gallery to explore more samples and models on Windows. If you run into any problems, feel free to open an issue on the GitHub repository. This project is open-sourced and any feedback to help us improve the Gallery is highly appreciated.

Using Advanced Reasoning Model on EdgeAI Part 1 - Quantization, Conversion, Performance
DeepSeek-R1 is very popular, and it can achieve the same capabilities as OpenAI o1 in advanced reasoning. Microsoft has also added DeepSeek-R1 models to Azure AI Foundry and GitHub Models. We can compare DeepSeek-R1 with other available models through the GitHub Models Playground.
Note: This series revolves around the deployment of SLMs to edge devices ("Edge AI"); we will focus on deploying advanced reasoning models across different application scenarios. You can learn more in the AI Tour session BRK453.
In this experiment we want to deploy advanced reasoning models to the edge, so that they can run on edge devices with limited computing power and in offline environments. At this time, the recommendation is to use the traditional ONNX model. We can use Microsoft Olive to convert the DeepSeek-R1 Distill models. Getting started with Microsoft Olive is very straightforward. Install the Microsoft Olive library through the command line with Python 3.10+ (recommended):

pip install olive-ai

The DeepSeek-R1 Distill model series comes in different parameter sizes such as 1.5B, 7B, 8B, 14B, 32B, 70B, etc. This article is mainly based on the 1.5B, 7B, and 14B models (so Small Language Models).

CPU Inference
Let's discuss 1.5B and 7B, which are the models with lower parameter counts. We can use the CPU directly for inference to test the effect (hardware environment: Azure DevBox, AMD EPYC 7763 64-Core + 64GB Memory + 2T SSD).

Quantization conversion

olive auto-opt --model_name_or_path <Your DeepSeek-R1-Distill-Qwen-1.5B/7B local location> --output_path <Your Convert ONNX INT4 Model local location> --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1

You can download the converted models directly from my Hugging Face repo (note: these models are for testing and have not been fully tested by AI Content Safety, nor are they provided as official models):
DeepSeek-R1-Distill-Qwen-1.5B-ONNX-INT4-CPU
DeepSeek-R1-Distill-Qwen-7B-ONNX-INT4-CPU

Running with ONNX Runtime GenAI
Install ONNX Runtime GenAI and the ONNX Runtime CPU support libraries:

pip install onnxruntime-genai
pip install onnxruntime

Sample code:
https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-1.5b.ipynb
https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-7b.ipynb

Performance comparison: 1.5B vs 7B
We compare two different inference scenarios:
1. Explain 1+1=2
1.5B quantized ONNX model: memory occupied, time consumption, and number of tokens generated.
7B quantized ONNX model: memory occupied, time consumption, and number of tokens generated.
2. Find all pairwise different isomorphism groups with order 147 and no elements with order 49
1.5B quantized ONNX model: memory occupied, time consumption, and number of tokens generated.
7B quantized ONNX model: memory occupied, time consumption, and number of tokens generated.
Results: Through these tests, we can see that the 1.5B DeepSeek model is more suitable for CPU inference and can be deployed on traditional PCs or IoT devices. As for the 7B model, although its reasoning is better, it does not run very effectively on the CPU.
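As a rough, hypothetical sketch of how numbers like these can be collected (not the exact methodology behind the figures in this post), the snippet below times generation and reads process memory for one of the quantized ONNX models; the model path, prompt template, and the psutil-based memory reading are all assumptions.

import time
import psutil
import onnxruntime_genai as og

model_folder = "Your ONNX INT4 Model Path"       # e.g. the converted 1.5B model folder
prompt = "<|User|>explain 1+1=2<|Assistant|>"    # placeholder; apply your model's actual chat template

process = psutil.Process()
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

start = time.perf_counter()
generated = 0
while not generator.is_done():
    generator.generate_next_token()
    generated += 1
elapsed = time.perf_counter() - start

print(f"tokens generated : {generated}")
print(f"time consumption : {elapsed:.2f} s ({generated / elapsed:.2f} tokens/s)")
print(f"memory occupied  : {process.memory_info().rss / 1024**3:.2f} GB (resident set size)")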
GPU Inference
It is ideal if we have a GPU on the edge device. With Microsoft Olive we can quantize and convert a model to ONNX for CPU inference; of course, it can also be converted into a model for GPU inference. Here I take the 14B DeepSeek-R1-Distill-Qwen-14B as an example and make an inference comparison with Microsoft's Phi-4-14B.

Quantization conversion

olive auto-opt --model_name_or_path <Your Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path> --output_path <Your converted Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path> --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1

You can download the converted models directly from my Hugging Face repo (note: these models are for testing and have not been fully tested by AI Content Safety, and are not official models):
DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU
Phi-4-14B-ONNX-INT4-GPU

Running with ONNX Runtime GenAI CUDA
Install ONNX Runtime GenAI and the ONNX Runtime GPU support libraries:

pip install onnxruntime-genai-cuda
pip install onnxruntime-gpu

Compare the results in the GPU environment with Gradio
It is recommended to use a GPU with more than 8 GB of memory. To broaden the comparison, we compare Phi-4-14B-ONNX-INT4-GPU and DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU to see the different results. We also include OpenAI o1-mini (it is recommended to access o1-mini through GitHub Models).

Sample code:
https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/Performance_AdvancedReasoning_ONNX_CPU.ipynb

You can test any prompt in Gradio to compare the results of Phi-4-14B-ONNX-INT4-GPU, DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU, and OpenAI o1-mini. DeepSeek-R1 reduces the cost of inference models and produces more instructive results on professional problems, but Phi-4-14B also has advantages in reasoning and uses lower computing power to complete inference. As for OpenAI o1-mini, it is more comprehensive and can handle all kinds of problems. If you want to deploy to an edge device, Phi-4-14B and the quantized DeepSeek-R1 are good choices for you.

This blog is just a simple test and the first in this series. Please share your feedback and continue the discussion in the Microsoft AI Discord Channel. Feel free to send me a message or comment. We look forward to sharing more about the opportunity of Edge AI and more content in this series.

Resources
DeepSeek-R1 in GitHub Models: https://github.com/marketplace/models/azureml-deepseek/DeepSeek-R1
DeepSeek-R1 in Azure AI Foundry: https://ai.azure.com/explore/models/DeepSeek-R1/version/1/registry/azureml-deepseek
Phi-4-14B on Hugging Face: https://huggingface.co/microsoft/phi-4
Learn about Microsoft Olive: https://github.com/microsoft/olive
Learn about ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai
Microsoft AI Discord Channel
BRK453 Exploring cutting-edge models: LLMs, SLMs, local development and more: https://aka.ms/aitour/brk453

AI Toolkit for VS Code January Update
AI Toolkit is a VS Code extension aiming to empower AI engineers in transforming their curiosity into advanced generative AI applications. This toolkit, featuring both local-enabled and cloud-accelerated inner loop capabilities, is set to ease model exploration, prompt engineering, and the creation and evaluation of generative applications. We are pleased to announce the January update to the toolkit, with support for OpenAI's o1 model and enhancements in the Model Playground and Bulk Run features.

What's New?
January's update brings several exciting new features to boost your productivity in AI development. Here's a closer look at what's included:
Support for OpenAI's new o1 model: We've added access to the GitHub-hosted OpenAI o1 model. This new model replaces o1-preview and offers even better performance in handling complex tasks. You can start interacting with the o1 model within VS Code for free by using the latest AI Toolkit update.
Chat history support in Model Playground: We have heard your feedback that tracking past model interactions is crucial. The Model Playground has been updated to include support for chat history. This feature saves chat history as individual files stored entirely on your local machine, ensuring privacy and security.
Bulk Run with prompt templating: The Bulk Run feature, introduced in the AI Toolkit December release, now supports prompt templating with variables. This allows users to create templates for prompts, insert variables, and run them in bulk. This enhancement simplifies the process of testing multiple scenarios and models.
Stay tuned for more updates and enhancements as we continue to innovate and support your journey in AI development. Try out the AI Toolkit for Visual Studio Code, share your thoughts, and file issues and suggest features in our GitHub repo. Thank you for being a part of this journey with us!