slm
40 TopicsLeveraging the power of NPU to run Gen AI tasks on Copilot+ PCs
Thanks to their massive scale and impressive technical evolution, large language models (LLMs) have become the public face of Generative AI innovation. However, bigger isn’t always better. While LLMs like the ones behind Microsoft Copilot are incredibly capable at a wide range of tasks, less-discussed small language models (SLMs) expand the utility of Gen AI for real-time and edge applications. SLMs can run efficiently on a local device with low power consumption and fast performance, enabling new scenarios and cost models. SLMs can run on universally available chips like CPUs and GPUs, but their potential really comes alive running on Neural Processing Units (NPUs), such as the ones found in Microsoft Surface Copilot+ PCs. NPUs are specifically designed for processing machine learning workloads, leading to high performance per watt and thermal efficiency compared to CPUs or GPUs [1]. SLMs and NPUs together support running quite powerful Gen AI workloads efficiently on a laptop, even when running on battery power or multitasking. In this blog, we focus on running SLMs on Snapdragon® X Plus processors on the recently launched Surface Laptop 13-inch, using the Qualcomm® AI Hub, leading to efficient local inference, increased hardware utilization and minimal setup complexity. This is only one of many methods available - before diving into this specific use case, let’s first provide an overview of the possibilities for deploying small language models on Copilot+ PC NPUs. Qualcomm AI Engine Direct (QNN) SDK: This process requires converting SLMs into QNN binaries that can be executed through the NPU. The Qualcomm AI Hub provides a convenient way to compile any PyTorch, TensorFlow, or ONNX-converted models into QNN binaries executable by the Qualcomm AI Engine Direct SDK. Various precompiled models are directly available in the Qualcomm AI Hub, their collection of over 175 pre-optimized models, ready for download and integration into your application. ONNX Runtime: ONNX Runtime is an open-source inference engine from Microsoft designed to run models in the ONNX format. The QNN Execution Provider (EP) by Qualcomm Technologies optimizes inference on Snapdragon processors using AI acceleration hardware, mainly for mobile and embedded use. ONNX Runtime Gen AI is a specialized version optimized for generative AI tasks, including transformer-based models, aiming for high-performance inference in applications like large language models. Although ONNX Runtime with QNN EP can run models on Copilot+ PCs, some operator support is missing for Gen AI workloads. ONNX Runtime Gen AI is not yet publicly available for NPU; a private beta is currently out with an unclear ETA on public release at the time of releasing this blog. Here is the link to the Git repo for more info on upcoming releases microsoft/onnxruntime-genai: Generative AI extensions for onnxruntime Windows AI Foundry: Windows AI Foundry provides AI-supported features and APIs for Copilot+ PCs. It includes pre-built models such as Phi-Silica that can be inferred using Windows AI APIs. Additionally, it offers the capability to download models from the cloud for local inference on the device using Foundry Local. This feature is still in preview. You can learn more about Windows AI Foundry here: Windows AI Foundry | Microsoft Developer AI Toolkit for VS Code: The AI Toolkit for Visual Studio Code (VS Code) is a VS Code extension that simplifies generative AI app development by bringing together cutting-edge AI development tools and models from the Azure AI Foundry catalog and other catalogs like Hugging Face. This platform allows users to download multiple models either from the cloud or locally. It currently houses several models optimized to run on CPU, with support for NPU-based models forthcoming, starting with Deepseek R1. Comparison between different approaches Feature Qualcomm AI Hub ONNX Runtime (ORT) Windows AI Foundry AI Toolkit for VS code Availability of Models Wide set of AI models (vision, Gen AI, object detection, and audio). Any models can be integrated. NPU support for Gen AI tasks and ONNX Gen AI Runtime are not yet generally available. Phi Silica model is available through Windows AI APIs, additional AI models from cloud can be downloaded for local inference using Foundry Local Access to models from sources such as Azure AI Foundry and Hugging Face. Currently only supports Deepseek R1 and Phi 4 Mini models for NPU inference. Ease of development The API is user-friendly once the initial setup and end-to-end replication are complete. Simple setup, developer-friendly; however, limited support for custom operators means not all models deploy through ORT. Easiest framework to adopt—developers familiar with Windows App SDK face no learning curve. Intuitive interface for testing models via prompt-response, enabling quick experimentation and performance validation. Is processor or SoC independent No. Supports Qualcomm Technologies processors only. Models must be compiled and optimized for the specific SOC on the device. A list of supported chipsets is provided, and the resulting .bin files are SOC-specific. Limitations exist with QNN EP’s HTP backend: only quantized models and those with static shapes are currently supported. Yes. The tool can operate independently of SoC. It is part of the broader Windows Copilot Runtime framework, now rebranded as the Windows AI Foundry. Model-dependent. Easily deployable on-device; model download and inference are straightforward. As of writing this article and based on our team's research, we found Qualcomm AI Hub to be the most user-friendly and well-supported solution available at this time. In contrast, most other frameworks are still under development and not yet generally available. Before we dive into how to use Qualcomm AI Hub to run Small Language Models (SLMs), let’s first understand what Qualcomm AI Hub is. What is Qualcomm AI Hub? Qualcomm AI Hub is a platform designed to simplify the deployment of AI models for vision, audio, speech, and text applications on edge devices. It allows users to upload, optimize, and validate their models for specific target hardware—such as CPU, GPU, or NPU—within minutes. Models developed in PyTorch or ONNX are automatically converted for efficient on-device execution using frameworks like TensorFlow Lite, ONNX Runtime, or Qualcomm AI Engine Direct. The Qualcomm AI Hub offers access to a collection of over 100 pre-optimized models, with open-source deployment recipes available on GitHub and Hugging Face. Users can also test and profile these models on real devices with Snapdragon and Qualcomm platforms hosted in the cloud. In this blog we will be showing how you can use Qualcomm AI Hub to get a QNN context binary for models and use Qualcomm AI Engine to run those context binaries. The context binary is a SoC-specific deployment mechanism. When compiled for a device, it is expected that the model will be deployed to the same device. The format is operating system agnostic so the same model can be deployed on Android, Linux, or Windows. The context binary is designed only for the NPU. For more details on how to compile models in other formats, please visit the documentation here Overview of Qualcomm AI Hub — qai-hub documentation. The following case study details the efficient execution of the Phi-3.5 model using optimized, hardware-specific binaries on a Surface Laptop 13-inch powered by the Qualcomm Snapdragon X Plus processor, Hexagon™ NPU, and Qualcomm Al Hub. Microsoft Surface Engineering Case Study: Running Phi-3.5 Model Locally on Snapdragon X Plus on Surface Laptop 13-inch This case study details how the Phi-3.5 model was deployed on a Surface Laptop 13-inch powered by the Snapdragon X Plus processor. The study was developed and documented by the Surface DASH team, which specializes in delivering AI/ML solutions to Surface devices and generating data-driven insights through advanced telemetry. Using Qualcomm AI Hub, we obtained precompiled QNN context binaries tailored to the target SoC, enabling efficient local inference. This method maximizes hardware utilization and minimizes setup complexity. We used a Surface Laptop 13-inch with the Snapdragon X Plus processor as our test device. The steps below apply to the Snapdragon X Plus processor; however, the process remains similar for other Snapdragon X Series processors and devices as well. For the other processors, you may need to download different model variants of the desired models from Qualcomm AI Hub. Before you begin to follow along, please check the make and models of your NPU by navigating to Device Manager --> Neural Processors. We also used Visual Studio Code and Python (3.10.3.11, 3.12). We used the 3.11 version to run these steps below and recommend using the same, although there should be no difference in using a higher Python version. Before starting, let's create a new virtual environment in Python as a best practice. Follow the steps to create a new virtual environment here: https://code.visualstudio.com/docs/python/environments?from=20423#_creating-environments Create a folder named ‘genie_bundle’ store config and bin files. Download the QNN context binaries specific to your NPU and place the config files into the genie_bundle folder. Copy the .dll files from QNN SDK into the genie_bundle folder. Finally, execute the test prompt through genie-sdk in the required format for Phi-3.5. Setup steps in details Step 1: Setup local development environment Download QNN SDK: Go to the Qualcomm Software Center Qualcomm Neural Processing SDK | Qualcomm Developer and download the QNN SDK by clicking on Get Software (by default latest version of SDK gets downloaded). For the purpose of this demo, we used latest version available (2.34) . You may need to make an account on the Qualcomm website to access it. Step 2: Download QNN Context Binaries from Qualcomm AI Hub Models Download Binaries: Download the context binaries (.bin files) for the Phi-3.5-mini-instruct model from (Link to Download Phi-3.5 context binaries). Clone AI Hub Apps repo: Use the Genie SDK (Generative Runtime built on top of Qualcomm AI Direct Engine), and leverage the sample provided in https://github.com/quic/ai-hub-apps Setup folder structure to follow along the code: Create a folder named "genie_bundle" outside of the folder where AI Hub Apps repo was cloned. Selectively copy configuration files from AI Hub sample repo to 'genie_bundle' Step 3: Copy config files and edit files Copy config files to genie_bundle folder from ai-hub-apps. You will need two config files. You can use the PowerShell script below to copy the config files from repo to local genie folder created in previous steps. You also need to copy HTP backend config file as well as the genie config file from the repo # Define the source paths $sourceFile1 = "ai-hub-apps/tutorials/llm_on_genie/configs/htp/htp_backend_ext_config.json.template" $sourceFile2 = "ai-hub-apps/tutorials/llm_on_genie/configs/genie/phi_3_5_mini_instruct.json" # Define the local folder path $localFolder = "genie_bundle" # Define the destination file paths using the local folder $destinationFile1 = Join-Path -Path $localFolder -ChildPath "htp_backend_ext_config.json" $destinationFile2 = Join-Path -Path $localFolder -ChildPath "genie_config.json" # Create the local folder if it doesn't exist if (-not (Test-Path -Path $localFolder)) { New-Item -ItemType Directory -Path $localFolder } # Copy the files to the local folder Copy-Item -Path $sourceFile1 -Destination $destinationFile1 -Force Copy-Item -Path $sourceFile2 -Destination $destinationFile2 -Force Write-Host "Files have been successfully copied to the genie_bundle folder with updated names." After copying the files, you will need to make sure to change the default values of the parameters provided with template files copied. Edit HTP backend file in the newly pasted location - Change dsp_arch and soc_model to match with your configuration pdate soc model and dsp arch in HTP backend config files Edit genie_config file to include the downloaded binaries for Phi 3 models in previous steps Step 4: Download the tokenizer file from Hugging Face Visit the Hugging Face Website: Open your web browser and go to https://huggingface.co/microsoft/Phi-3.5-mini-instruct/tree/main Locate the Tokenizer File: On the Hugging Face page, find the tokenizer file for the Phi-3.5-mini-instruct model Download the File: Click on the download button to save the tokenizer file to your computer Save the File: Navigate to your genie_bundle folder and save the downloaded tokenizer file there. Note: There is an issue with the tokenizer.json file for the Phi 3.5 mini instruct model, where the output does not break words using spaces. To resolve this, you need to delete lines #192-197 in the tokenizer.json file. Download tokenizer files from the hugging face repo (Image Source - Hugging Face) Step 5: Copy files from QNN SDK Locate the QNN SDK Folder: Open the folder where you have installed the QNN SDK in step 1 and identify the required files. You need to copy the files from the below mentioned folder. Exact folder naming may change based on SDK version <QNN-SDK ROOT FOLDER>/qairt/2.34.0.250424/lib/hexagon-v75/unsigned <QNN-SDK ROOT FOLDER> /qairt/2.34.0.250424/lib/aarch64-windows-msvc <QNN-SDK ROOT FOLDER> /qairt/2.34.0.250424/bin/aarch64-windows-msvc Navigate to your genie_bundle folder and paste the copied files there. Step 6: Execute the Test Prompt Open Your Terminal: Navigate to your genie_bundle folder using your terminal or command prompt. Run the Command: Copy and paste the following command into your terminal: ./genie-t2t-run.exe -c genie_config.json -p "<|system|>\nYou are an assistant. Provide helpful and brief responses.\n<|user|>What is an NPU? \n<|end|>\n<|assistant|>\n" Check the Output: After running the command, you should see the response from the assistant in your terminal. This case study demonstrates the process of deploying a small language model (SLM) like Phi-3.5 on a Copilot+ PC using the Hexagon NPU and Qualcomm AI Hub. It outlines the setup steps, tooling, and configuration required for local inference using hardware-specific binaries. As deployment methods mature, this approach highlights a viable path toward efficient, scalable Gen AI execution directly on edge devices. Snapdragon® and Qualcomm® branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm, Snapdragon and Hexagon™ are trademarks or registered trademarks of Qualcomm Incorporated.2.7KViews6likes2CommentsMicrosoft Semantic Kernel and AutoGen: Open Source Frameworks for AI Solutions
Explore Microsoft’s open-source frameworks, Semantic Kernel and AutoGen. Semantic Kernel enables developers to create AI solutions across various domains using a single Large Language Model (LLM). AutoGen, on the other hand, uses AI Agents to perform smart tasks through agent dialogues. Discover how these technologies serve different scenarios and can be used to build powerful AI applications.47KViews6likes1CommentGetting Started with the AI Dev Gallery
March Update: The Gallery is now available on the Microsoft Store! The AI Dev Gallery is a new open-source project designed to inspire and support developers in integrating on-device AI functionality into their Windows apps. It offers an intuitive UX for exploring and testing interactive AI samples powered by local models. Key features include: Quickly explore and download models from well-known sources on GitHub and HuggingFace. Test different models with interactive samples over 25 different scenarios, including text, image, audio, and video use cases. See all relevant code and library references for every sample. Switch between models that run on CPU and GPU depending on your device capabilities. Quickly get started with your own projects by exporting any sample to a fresh Visual Studio project that references the same model cache, preventing duplicate downloads. Part of the motivation behind the Gallery was exposing developers to the host of benefits that come with on-device AI. Some of these benefits include improved data security and privacy, increased control and parameterization, and no dependence on an internet connection or third-party cloud provider. Requirements Device Requirements Minimum OS Version: Windows 10, version 1809 (10.0; Build 17763) Architecture: x64, ARM64 Memory: At least 16 GB is recommended Disk Space: At least 20GB free space is recommended GPU: 8GB of VRAM is recommended for running samples on the GPU Using the Gallery The AI Dev Gallery has can be navigated in two ways: The Samples View The Models View Navigating Samples In this view, samples are broken up into categories (Text, Code, Image, etc.) and then into more specific samples, like in the Translate Text pictured below: On clicking a sample, you will be prompted to choose a model to download if you haven’t run this sample before: Next to the model you can see the size of the model, whether it will run on CPU or GPU, and the associated license. Pick the model that makes the most sense for your machine. You can also download new models and change the model for a sample later from the sample view. Just click the model drop down at the top of the sample: The last thing you can do from the Sample pane is view the sample code and export the project to Visual Studio. Both buttons are found in the top right corner of the sample, and the code view will look like this: Navigating Models If you would rather navigate by models instead of samples, the Gallery also provides the model view: The model view contains a similar navigation menu on the right to navigate between models based on category. Clicking on a model will allow you to see a description of the model, the versions of it that are available to download, and the samples that use the model. Clicking on a sample will take back over to the samples view where you can see the model in action. Deleting and Managing Models If you need to clear up space or see download details for the models you are using, you can head over the Settings page to manage your downloads: From here, you can easily see every model you have downloaded and how much space on your drive they are taking up. You can clear your entire cache for a fresh start or delete individual models that you are no longer using. Any deleted model can be redownload through either the models or samples view. Next Steps for the Gallery The AI Dev Gallery is still a work in progress, and we plan on adding more samples, models, APIs, and features, and we are evaluating adding support for NPUs to take the experience even further If you have feedback, noticed a bug, or any ideas for features or samples, head over to the issue board and submit an issue. We also have a discussion board for any other topics relevant to the Gallery. The Gallery is an open-source project, and we would love contribution, feedback, and ideation! Happy modeling!6.2KViews5likes3CommentsBringing AI to the edge: Hackathon Windows ML
AI Developer Hackathon Windows ML Hosted by Qualcomm on SnapDragonX We’re excited to announce our support and participation for the upcoming global series of Edge AI hackathons, hosted by Qualcomm Technologies. The first is on June 14-15 in Bangalore. We see a world of hybrid AI, developing rapidly as new generation of intelligent applications get built for diverse scenarios. These range from mobile, desktop, spatial computing and extending all the way to industrial and automotive. Mission critical workloads oscillate between decision-making in the moment, on device, to fine tuning models on the cloud. We believe we are in the early stages of development of agentic applications that efficiently run on the edge for scenarios needing local deployment and on-device inferencing. Microsoft Windows ML Windows ML – a cutting-edge runtime optimized for performant on-device model inference and simplified deployment, and the foundation of Windows AI Foundry. Windows ML is designed to support developers creating AI-infused applications with ease, harnessing the incredible strength of Windows’ diverse hardware ecosystem whether it’s for entry-level laptops, Copilot+ PCs or top-of-the-line AI workstations. It’s built to help developers leverage the client silicon best suited for their specific workload on any given device whether it’s an NPU for low-power and sustained inference, a GPU for raw horsepower or CPU for the broadest footprint and flexibility. Introducing Windows ML: The future of machine learning development on Windows - Windows Developer Blog Getting Started To get started, install AI Toolkit, leverage one of our conversion and optimization templates, or start building your own. Explore documentation and code samples available on Microsoft Learn, check out AI Dev Gallery (install, documentation) for demos and more samples to help you get started with Windows ML. Microsoft and Qualcomm Technologies: A strong collaboration Microsoft and Qualcomm Technologies’ collaboration bring new advanced AI features into Copilot+ PCs, leveraging the Snapdragon X Elite. Microsoft Research has played a pivotal role by optimizing new lightweight LLMs, such as Phi Silica, specifically for on-device execution with the Hexagon NPU. These models are designed to run efficiently on Hexagon NPUs, enabling multimodal AI experiences like vision-language tasks directly on Copilot+ PCs without relying on the cloud. Additionally, Microsoft has made DeepSeek R1 7B and 14B distilled models available via Azure AI Foundry, further expanding the AI ecosystem on the edge. This collaboration marks a significant step in democratizing AI by making powerful, efficient models accessible on everyday devices Windows AI Foundry expands AI capabilities by providing high-performance built-in models and supports developers' custom models with silicon performance. This developer platform plays a key role in this collaboration. Windows ML enables Windows 11 and Copilot+ PCs to use the Hexagon NPU for power efficient inference. Scaling optimization through Olive toolchain The Windows ML foundation of the Windows AI Foundry provides a unified platform for AI development across various hardware architectures and brings silicon performance using QNN Execution provider. This stack includes Windows ML and toolchains like Olive, easily accessible in AI Toolkit for VS Code, which streamlines model optimization and deployment. Qualcomm Technologies has contributed to Microsoft’s Olive, an open-source model optimization tool that enhances AI performance by optimizing models for efficient inference on client systems. This tool is particularly beneficial for running LLMs and GenAI workloads on Qualcomm Technologies’ platforms. Real-World Applications Through Qualcomm Technologies and Microsoft’s collaboration we have partnered with top developers to adopt Windows ML and have demonstrated impressive performance for their AI features. Independent Solution Vendors (ISVs) such as Powder, Topaz Labs, Camo and McAfee, Join us at the Hackathon With the recent launch of Qualcomm Snapdragon® X Elite-powered Windows laptops, developers can now take advantage of powerful NPUs (Neural Processing Units) to deploy AI applications that are both responsive and energy-efficient. These new devices open up a world of opportunities for developers to rethink how applications are built from productivity tools to creative assistants and intelligent agents all running directly on the device. Our mission has always been to enable high-quality AI experiences using compact, optimized models. These models are tailor-made for edge computing, offering faster inference, lower memory usage, and enhanced privacy without compromising performance. We encourage all application developers whether you’re building with open-source SLMs (small language models), working on smart assistants, or exploring new on-device AI use cases to join us at the event. You can register here: https://www.qualcomm.com/support/contact/forms/edge-ai-developer-hackathon-bengaluru-proposal-submission Dive deeper into these innovative developer solutions: Windows AI Foundry & Windows ML on Qualcomm NPU Microsoft and Qualcomm Technologies collaborate on Windows 11, Copilot+ PCs and Windows AI Foundry | Qualcomm Unlocking the power of Qualcomm QNN Execution Provider GPU backen Introducing Windows ML: The future of machine learning development on Windows - Windows Developer Blog630Views4likes0CommentsBuilding AI Agents on edge devices using Ollama + Phi-4-mini Function Calling
The new Phi-4-mini and Phi-4-multimodal now support Function Calling. This feature enables the models to connect with external tools and APIs. By deploying Phi-4-mini and Phi-4-multimodal with Function Calling capabilities on edge devices, we can achieve local expansion of knowledge capabilities and enhance their task execution efficiency. This blog will focus on how to use Phi-4-mini's Function Calling capabilities to build efficient AI Agents on edge devices. What‘s Function Calling How it works First we need to learn how Function Calling works Tool Integration: Function Calling allows LLM/SLM to interact with external tools and APIs, such as weather APIs, databases, or other services. Function Definition: Defines a function (tool) that LLM/SLM can call, specifying its name, parameters, and expected output. LLM Detection: LLM/SLM analyzes the user's input and determines if a function call is required and which function to use. JSON Output: LLM/SLM outputs a JSON object containing the name of the function to call and the parameters required by the function. External Execution: The application executes the function call using the parameters provided by LLM/SLM. Response to LLM: Returns the output of Function Calling to LLM/SLM, and LLM/SLM can use this information to generate a response to the user. Application scenarios Data retrieval: convert natural language queries into API calls to fetch data (e.g., "show my recent orders" triggers a database query) Operation execution: convert user requests into specific function calls (e.g., "schedule a meeting" becomes a calendar API call) Computational tasks: handle mathematical or logical operations through dedicated functions (e.g., calculate compound interest or statistical analysis) Data processing: chain multiple function calls together (e.g., get data → parse → transform → store) UI/UX integration: trigger interface updates based on user interactions (e.g., update map markers or display charts) Phi-4-mini / Phi-4-multimodal's Function Calling Phi-4-mini / Phi-4-multimodal supports single and parallel Function Calling. Things to note when calling You need to define Tools in System to start single or parallel Function Calling If you want to start parallel Function Calling, you also need to add 'some tools' to the System prompt The following is an example Single Function Calling tools = [ { "name": "get_match_result", "description": "get match result", "parameters": { "match": { "description": "The name of the match", "type": "str", "default": "Arsenal vs ManCity" } } }, ] messages = [ { "role": "system", "content": "You are a helpful assistant", "tools": json.dumps(tools), # pass the tools into system message using tools argument }, { "role": "user", "content": "What is the result of Arsenal vs ManCity today?" } ] Full Sample : Click Parallel Function Calling AGENT_TOOLS = { "booking_fight": { "name": "booking_fight", "description": "booking fight", "parameters": { "departure": { "description": "The name of Departure airport code", "type": "str", }, "destination": { "description": "The name of Destination airport code", "type": "str", }, "outbound_date": { "description": "The date of outbound flight", "type": "str", }, "return_date": { "description": "The date of return flight", "type": "str", } } }, "booking_hotel": { "name": "booking_hotel", "description": "booking hotel", "parameters": { "query": { "description": "The name of the city", "type": "str", }, "check_in_date": { "description": "The date of check in", "type": "str", }, "check_out_date": { "description": "The date of check out", "type": "str", } } }, } SYSTEM_PROMPT = """ You are my travel agent with some tools available. """ messages = [ { "role": "system", "content": SYSTEM_PROMPT, "tools": json.dumps(AGENT_TOOLS), # pass the tools into system message using tools argument }, { "role": "user", "content": """I have a business trip from London to New York in March 21 2025 to March 27 2025, can you help me to book a hotel and flight tickets""" } ] Full sample : click Using Ollama and Phi-4-mini Function Calling to Create AI Agents on Edge Devices Ollama is a popular free tool for deploying LLM/SLM locally and can be used in combination with AI Toolkit for VS Code. In addition to being deployed on your PC/Laptop, it can also be deployed on IoT, mobile phones, containers, etc. To use Phi-4-mini on Ollama, you need to use Ollama 0.5.13+. Different quantitative versions are supported on Ollama, as shown in the figure below: Using Ollama, we can deploy Phi-4-mini on the edge, and implement AI Agent with Function Calling under limited computing power, so that Generative AI can be applied more effectively on the edge. Current Issues A sad experience - If you directly use the interface to try to call Ollama in the above way, you will find that Function Calling will not be triggered. There are discussions on Ollama's GitHub Issue. You can enter the Issue https://github.com/ollama/ollama/issues/9437. By modifying the Phi-4-mini Template on the ModelFile to implement a single Function Calling, but the call to Parallel Function Calling still failed. Resolution We have implemented a fix by making a adjustments to the template. We have improved it according to Phi-4-mini's Chat Template and re-modified the Modelfile. Of course, the quantitative model has a huge impact on the results. The adjustments are as follows: TEMPLATE """ {{- if .Messages }} {{- if or .System .Tools }}<|system|> {{ if .System }}{{ .System }} {{- end }} In addition to plain text responses, you can chose to call one or more of the provided functions. Use the following rule to decide when to call a function: * if the response can be generated from your internal knowledge (e.g., as in the case of queries like "What is the capital of Poland?"), do so * if you need external information that can be obtained by calling one or more of the provided functions, generate a function calls If you decide to call functions: * prefix function calls with functools marker (no closing marker required) * all function calls should be generated in a single JSON list formatted as functools[{"name": [function name], "arguments": [function arguments as JSON]}, ...] * follow the provided JSON schema. Do not hallucinate arguments or values. Do to blindly copy values from the provided samples * respect the argument type formatting. E.g., if the type if number and format is float, write value 7 as 7.0 * make sure you pick the right functions that match the user intent Available functions as JSON spec: {{- if .Tools }} {{ .Tools }} {{- end }}<|end|> {{- end }} {{- range .Messages }} {{- if ne .Role "system" }}<|{{ .Role }}|> {{- if and .Content (eq .Role "tools") }} {"result": {{ .Content }}} {{- else if .Content }} {{ .Content }} {{- else if .ToolCalls }} functools[ {{- range .ToolCalls }}{{ "{" }}"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}{{ "}" }} {{- end }}] {{- end }}<|end|> {{- end }} {{- end }}<|assistant|> {{ else }} {{- if .System }}<|system|> {{ .System }}<|end|>{{ end }}{{ if .Prompt }}<|user|> {{ .Prompt }}<|end|>{{ end }}<|assistant|> {{ end }}{{ .Response }}{{ if .Response }}<|user|>{{ end }} """ We have tested the solution using different quantitative models. In the laptop environment, we recommend that you use the following model to enable single/parallel Function Calling: phi4-mini:3.8b-fp16. Note: you need to bind the defined Modelfile and phi4-mini:3.8b-fp16 together to enable this to work. Please execute the following command in the command line: #If you haven't downloaded it yet, please execute this command firstr ollama run phi4-mini:3.8b-fp16 #Binding with the adjusted Modelfile ollama create phi4-mini:3.8b-fp16 -f {Your Modelfile Path} To test the single Function Calling and Parallel Function Calling of Phi-4-mini. Single Function Calling Parallel Function Calling Full Sample in notebook The above example is just a simple introduction. As we move forward with the development we hope to find simpler ways to apply it on the edge, use Function Calling to expand the scenarios of Phi-4-mini / Phi-4-multimodal, and also develop more usecases in vertical industries. Resources Phi-4 model on Hugging face https://huggingface.co/collections/microsoft/phi-4-677e9380e514feb5577a40e4 Phi-4-mini on Ollama https://ollama.com/library/phi4-mini Learn Function Calling https://huggingface.co/docs/hugs/en/guides/function-calling Phi Cookbook - Samples and Resources for Phi Models https://aka.ms/phicookbook5.7KViews4likes1CommentAI Toolkit for Visual Studio Code: October 2024 Update Highlights
The AI Toolkit’s October 2024 update revolutionizes Visual Studio Code with game-changing features for developers, researchers, and enthusiasts. Explore multi-model integration, including GitHub Models, ONNX, and Google Gemini, alongside custom model support. Dive into multi-modal capabilities for richer AI testing and seamless multi-platform compatibility across Windows, macOS, and Linux. Tailored for productivity, the enhanced Model Catalog simplifies choosing the best tools for your projects. Try it now and share feedback to shape the future of AI in VS Code!2.9KViews4likes0CommentsGetting Started - Generative AI with Phi-3-mini: A Guide to Inference and Deployment
Getting started with Microsoft Phi-3-mini - Inference Phi-3-mini models, Discover how Phi-3-mini, a new series of models from Microsoft, enables deployment of Large Language Models (LLMs) on edge devices and IoT devices. Learn how to use Semantic Kernel, Ollama/LlamaEdge, and ONNX Runtime to access and infer phi3-mini models, and explore the possibilities of generative AI in various application scenarios50KViews4likes13CommentsEssential Microsoft Resources for MVPs & the Tech Community from the AI Tour
Unlock the power of Microsoft AI with redeliverable technical presentations, hands-on workshops, and open-source curriculum from the Microsoft AI Tour! Whether you’re a Microsoft MVP, Developer, or IT Professional, these expertly crafted resources empower you to teach, train, and lead AI adoption in your community. Explore top breakout sessions covering GitHub Copilot, Azure AI, Generative AI, and security best practices—designed to simplify AI integration and accelerate digital transformation. Dive into interactive workshops that provide real-world applications of AI technologies. Take it a step further with Microsoft’s Open-Source AI Curriculum, offering beginner-friendly courses on AI, Machine Learning, Data Science, Cybersecurity, and GitHub Copilot—perfect for upskilling teams and fostering innovation. Don’t just learn—lead. Access these resources, host impactful training sessions, and drive AI adoption in your organization. Start sharing today! Explore now: Microsoft AI Tour Resources.AI Sparks: Unleashing Agents with the AI Toolkit
The final episode of our "AI Sparks" series delved deep into the exciting world of AI Agents and their practical implementation. We also covered a fair part of MCP with Microsoft AI Toolkit extension for VS Code. We kicked off by charting the evolutionary path of intelligent conversational systems. Starting with the rudimentary rule-based Basic Chatbots, we then explored the advancements brought by Basic Generative AI Chatbots, which offered contextually aware interactions. Then we explored the Retrieval-Augmented Generation (RAG), highlighting its ability to ground generative models in specific knowledge bases, significantly enhancing accuracy and relevance. The limitations were also discussed for the above mentioned techniques. The session was then centralized to the theme – Agents and Agentic Frameworks. We uncovered the fundamental shift from basic chatbots to autonomous agents capable of planning, decision-making, and executing tasks. We moved forward with detailed discussion on the distinction between Single Agentic systems, where one core agent orchestrates the process, and Multi-Agent Architectures, where multiple specialized agents collaborate to achieve complex goals. A key part of building robust and reliable AI Agents, as we discussed, revolves around carefully considering four critical factors. Firstly, Knowledge-Providing agents with the right context is paramount for them to operate effectively and make informed decisions. Secondly, equipping agents with the necessary Actions by granting them access to the appropriate tools allows them to execute tasks and achieve desired outcomes. Thirdly, Security is non-negotiable; ensuring agents have access only to the data and services they genuinely need is crucial for maintaining privacy and preventing unintended actions. Finally, establishing robust Evaluations mechanisms is essential to verify that agents are completing tasks correctly and meeting the required standards. These four pillars – Knowledge, Actions, Security, and Evaluation – form the bedrock of any successful agentic implementation. To illustrate the transformative power of AI Agents, we explored several interesting use cases and applications. These ranged from intelligent personal assistants capable of managing schedules and automating workflows to sophisticated problem-solving systems in domains like customer service. A significant portion of the session was dedicated to practical implementation through demonstrations. We highlighted key frameworks that are empowering developers to build agentic systems.: Semantic Kernel: We highlighted its modularity and rich set of features for integrating various AI services and tools. Autogen Studio: The focus here was on its capabilities for facilitating the creation and management of multi-agent conversations and workflows. Agent Service: We discussed its role in providing a more streamlined and managed environment for deploying and scaling AI agents. The major point of attraction was that these were demonstrated using the local LLMs which were hosted using AI Toolkit. This showcased the ease with which developers can utilize VS Code AI toolkit to build and experiment with agentic workflows directly within their familiar development environment. Finally, we demystified the concept of Model Context Protocol (MCP) and demonstrated how seamlessly it can be implemented using the Agent Builder within the VS Code AI Toolkit. We demonstrated this with a basic Website development using MCP. This practical demonstration underscored the toolkit's power in simplifying the development of complex solutions that can maintain context and engage in more natural, multi-step interactions. The "AI Sparks" series concluded with a discussion, where attendees had a clearer understanding of the evolution, potential and practicalities of AI Agents. The session underscored that we are on the cusp of a new era of intelligent systems that are not just reactive but actively work alongside us to achieve goals. The tools and frameworks are maturing, and the possibilities for agentic applications are sparking innovation across various industries. It was an exciting journey, and engagement during the final session on AI Sparks around Agents truly highlighted the transformative potential of this field. "AI Sparks" Series Roadmap: The "AI Sparks" series delved deeper into specific topics using AI Toolkit for Visual Studio Code, including: Introduction to AI toolkit and feature walkthrough: Introduction to the AI Toolkit extension for VS Code a powerful way to explore and integrate the latest AI models from OpenAI, Meta, Deepseek, Mistral, and more. Introduction to SLMs and local model with use cases: Explore Small Language Models (SLMs) and how they compare to larger models. Building RAG Applications: Create powerful applications that combine the strengths of LLMs with external knowledge sources. Multimodal Support and Image Analysis: Working with vision models and building multimodal applications. Evaluation and Model Selection: Evaluate model performance and choose the best model for your needs. Agents and Agentic Frameworks: Exploring the cutting edge of AI agents and how they can be used to build more complex and autonomous systems. The full playlist of the series with all the episodes of "AI Sparks" is available at AI Sparks Playlist. Continue the discussion and questions in Microsoft AI Discord Community where we have a dedicated AI-sparks channel. All the code samples can be found on AI_Toolkit_Samples .We look forward to continuing these insightful discussions in future series!305Views2likes0Comments