Introducing Phi-4-Reasoning-Vision to Microsoft Foundry
Vision reasoning models unlock a critical capability for developers: the ability to move beyond passive perception toward systems that can understand, reason over, and act on visual information. Instead of treating images, diagrams, documents, or UI screens as unstructured inputs, vision reasoning models enable developers to build applications that can interpret visual structure, connect it with textual context, and perform multi-step reasoning to reach actionable conclusions.

Today, we are excited to announce that Phi-4-Reasoning-Vision-15B is available in Microsoft Foundry and on Hugging Face. This model brings high-fidelity vision to the reasoning-focused Phi-4 family, extending small language models (SLMs) beyond perception into structured, multi-step visual reasoning for agents, analytical tools, and scientific workflows.

What’s new?

The Phi model family has advanced toward combining efficient visual understanding with strong reasoning in small language models. Earlier Phi-4 models demonstrated reliable perception and grounding across images and text, while later iterations introduced structured reasoning to improve performance on complex tasks. Phi-4-reasoning-vision-15B brings these threads together, pairing high-resolution visual perception with selective, task-aware reasoning. As a result, the model can reason deeply when needed while remaining fast and efficient for perception-focused scenarios, which makes it well suited for interactive, real-world applications.

Key capabilities

Reasoning behavior is explicitly enabled via prompting: developers can enable or disable reasoning to balance latency and accuracy at runtime.
Optimized for vision reasoning and can be used for: diagram-based math; document, chart, and table understanding; GUI interpretation and grounding for agent scenarios that need to interpret screens and actions; computer-use agent scenarios; and general image chat and question answering.

Benchmarks

The following results summarize Phi-4-reasoning-vision-15B performance across a set of established multimodal reasoning, mathematics, and computer use benchmarks. All figures come from internal evaluations.

| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B - force no think | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K |
|---|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85 |
| ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84 |
| HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9 |
| MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2 |
| MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5 |
| MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8 |
| MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6 |
| MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3 |
| OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5 |
| ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9 |

Table 1: Accuracy comparisons relative to popular open-weight, non-thinking models

| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B - force thinking | Kimi-VL-A3B-Thinking | gemma-3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K |
|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2 |
| ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1 |
| HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6 |
| MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2 |
| MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6 |
| MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8 |
| MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2 |
| MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7 |
| OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85 |
| ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1 |

Table 2: Accuracy comparisons relative to popular open-weight, thinking models

All results were obtained using a consistent evaluation setup and prompts across models; numbers are provided for comparison and analysis rather than as leaderboard claims. For more information regarding benchmarks and evaluations, please read the technical paper on the Microsoft Research hub.

Suggested use cases and applications

Phi-4-Reasoning-Vision-15B supports applications that require both high-fidelity visual perception and structured inference. Two representative scenarios are scientific and mathematical reasoning over visual inputs, and computer-using agents (CUAs) that operate directly on graphical user interfaces. In both cases, the model provides grounded visual understanding paired with controllable, low-latency reasoning suitable for interactive systems.

Computer use agents in retail scenarios

For computer use agents, Phi-4-Reasoning-Vision-15B provides the perception and grounding layer required to understand and act within live ecommerce interfaces. For example, in an online shopping experience, the model interprets screen content (products, prices, filters, promotions, buttons, and cart state) and produces grounded observations that agentic models like Fara-7B can use to select actions. Its compact size and low-latency inference make it well suited for CUA workflows and agentic applications.

Visual reasoning for education

Another practical use of visual reasoning models is education. A developer could build a K-12 tutoring app with Phi-4-Reasoning-Vision-15B where students upload photos of worksheets, charts, or diagrams to get guided help rather than answers. The model can understand the visual content, identify where the student went wrong, and explain the correct steps clearly.
Over time, the app can adapt by serving new examples matched to the student’s learning level, turning visual problem-solving into a personalized learning experience.

Microsoft Responsible AI principles

At Microsoft, our mission to empower people and organizations remains constant, especially in the age of AI, where the potential for human achievement is greater than ever. We recognize that trust is foundational to AI adoption, and earning that trust requires a commitment to transparency, safety, and accountability. As with other Phi models, Phi-4-Reasoning-Vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. These safety-focused training signals help the model recognize and decline requests that fall outside intended or acceptable use. Additional details on the model’s safety considerations, evaluation approach, and known limitations are provided in the accompanying technical blog and model card.

Getting started

Start using Phi-4-Reasoning-Vision-15B in Microsoft Foundry today. Microsoft Foundry provides a unified environment for model discovery, evaluation, and deployment, making it straightforward to move from initial experimentation to production use while applying appropriate safety and governance practices.

Deploy the new model on Microsoft Foundry.
Learn more about the Phi family on Foundry Labs and in the Phi Cookbook
Connect to the Microsoft Developer Community on Discord
Read the technical paper on Microsoft Research
Read more use cases on the Educators Developer blog

Transforming Android Development: Unveiling MediaTek’s latest chipset with Microsoft's Phi models
Imagine running advanced AI applications, such as intelligent copilots and Retrieval-Augmented Generation (RAG), directly on Android devices, completely offline. With the rapid evolution of Neural Processing Units (NPUs), this is no longer a future vision; it is happening now.

Optimized AI at the Edge: Phi-4-mini on MediaTek

Thanks to MediaTek’s conversion and quantization tools, Microsoft’s Phi-4-mini and Phi-4-mini-reasoning models are now optimized for MediaTek NPUs. This collaboration empowers developers to build fast, responsive, and privacy-preserving AI experiences on Android without needing cloud connectivity. MediaTek’s flagship Dimensity 9400 and 9400+ platforms, together with Dimensity GenAI Toolkit 2.0, run the Phi-4-mini (3.8B) model with a prefill speed above 800 tokens/sec and a decode speed above 21 tokens/sec.

Unlock Enhanced Performance: Introducing MediaTek's NeuroPilot SDK

The MediaTek NeuroPilot SDK is a software development toolkit designed to accelerate AI application development and deployment across MediaTek’s hardware ecosystem. It provides developers with advanced optimization tools and cross-platform compatibility, enabling efficient implementation of neural networks while balancing performance, power efficiency, and resource utilization.

Comprehensive toolchain and documentation support

The NeuroPilot platform offers a complete toolchain, including SDKs, APIs, and documentation, for model quantization and conversion, compilation, and integration. Developers can use these tools to optimize neural networks, improving on-device performance while reducing power consumption and memory usage. MediaTek’s Dimensity GenAI Toolkit 2.0 now supports the Phi-4 series and provides best practices. Users can convert and quantize Phi-4-mini models in just a few steps for deployment on Dimensity series platforms.
A key advantage is that developers do not require specialized hardware expertise to rapidly prototype and deploy customized AI solutions.

One-time coding, cross-platform deployment

The MediaTek NeuroPilot SDK supports all AI-capable MediaTek hardware, empowering developers to adopt a "code once, deploy everywhere" strategy across smartphones, tablets, automotive, smart home devices, IoT products, and future platforms. This aligns with MediaTek’s corporate philosophy of bringing AI to everyone. The unified approach streamlines development, reduces costs, and accelerates time-to-market. The SDK integrates with the Android and Linux ecosystems, providing complete compiler suites, analyzers, and application libraries to ensure compatibility and optimize performance.

Demo 1: Deploying Phi-4-mini-reasoning with NeuroPilot SDK

This demo shows developers how to use the NeuroPilot SDK to deploy the Phi-4-mini-reasoning model on edge devices. The SDK enables efficient conversion and optimization, making it possible to bring advanced reasoning capabilities to smartphones and other local hardware. The Phi-4-mini-reasoning model brings logical and problem-solving capabilities to the edge. With MediaTek’s conversion tools, the model can be transformed for MediaTek’s DLA, enabling a new class of intelligent applications on mobile devices. Bringing reasoning capabilities to the edge allows developers to build faster, more responsive AI experiences without relying on cloud access.

Demo 2: Deploying Phi-4-mini with NeuroPilot SDK

This video demonstrates how to convert and run the Phi-4-mini model using the NeuroPilot SDK. With a focus on instruction-following tasks, this deployment lets developers build responsive, embedded AI assistants that understand and execute user commands locally. Whether for productivity tools or context-sensitive automation, Phi-4-mini brings natural interaction and reliability directly to the device.
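To put the throughput figures quoted earlier for Phi-4-mini on Dimensity 9400 (prefill above 800 tokens/sec, decode above 21 tokens/sec) into latency terms, here is a rough back-of-the-envelope sketch. It assumes the quoted rates are sustained for the whole request and ignores model load time and any other overhead; because the quoted rates are lower bounds, the results are approximate upper bounds under that assumption. The function name and the example token counts are illustrative and are not part of any MediaTek or Microsoft toolkit.

```python
def estimate_latency_seconds(prompt_tokens: int, output_tokens: int,
                             prefill_tps: float = 800.0,
                             decode_tps: float = 21.0) -> dict:
    """Estimate response latency from prefill and decode throughput.

    prefill_tps and decode_tps default to the figures quoted in the post;
    everything else here is an illustrative assumption.
    """
    if prompt_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput must be positive")
    prefill_s = prompt_tokens / prefill_tps   # time to process the prompt
    decode_s = output_tokens / decode_tps     # time to generate the reply
    return {"prefill_s": prefill_s, "decode_s": decode_s,
            "total_s": prefill_s + decode_s}


if __name__ == "__main__":
    # Hypothetical request: a 1600-token prompt and a 210-token answer.
    est = estimate_latency_seconds(prompt_tokens=1600, output_tokens=210)
    print(f"prefill: {est['prefill_s']:.1f} s")  # prefill: 2.0 s
    print(f"decode:  {est['decode_s']:.1f} s")   # decode:  10.0 s
    print(f"total:   {est['total_s']:.1f} s")    # total:   12.0 s
```

For this hypothetical request, decode time dominates, which is why response length matters more than prompt length for perceived latency at these rates.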
Imagine the possibilities: Real-world scenarios

Intelligent information access with on-device RAG

Picture this: your application intelligently accesses and reasons over on-device documents, like PDFs or internal knowledge bases, using an advanced embedding model paired with the MediaTek-optimized Phi-4-mini. This enables developers to create:

Personalized Assistants: Apps that understand user context from their own documents.
Offline Knowledge Hubs: Instant access to relevant information without needing cloud connectivity.
Enhanced Productivity Tools: Smart summarization, Q&A, and content generation based on local data.

Demo 3: Private RAG chatbot on device

People are on their mobile devices every day: saving new documents, sending messages, taking notes, and more. With how much we are able to store on our phones and laptops, it can get hard to find specific files or pieces of information when we need them most. What if you could implement a personal assistant that understands your question and fetches exactly what you’re looking for, without you needing to dig through your device? This demo showcases a Retrieval-Augmented Generation (RAG) implementation of the Phi model embedded directly on a smartphone. The chatbot allows users to ask natural language questions and instantly retrieve relevant information from local files. Because the model runs on-device, there is no need for a cloud connection, so your data stays private while the chatbot still offers intelligent, context-aware results. With a RAG-based Phi-4-mini solution, a search on your device can parse through every document to help you find the exact one you are looking for.

Stay ahead of the curve:

If you're eager to explore the Phi-4 family of models on edge devices and master building next-gen apps with MediaTek's NPU, don't miss the key sessions at Microsoft Build and Computex Taipei happening this week. This is your chance to get direct insights from the experts.
Microsoft Build 2025:
Uncover the latest on Azure AI Foundry on May 20 during “Unveiling Latest Innovations in Azure AI Foundry Model Catalog”
If you are attending in person on May 20, catch the second lab, “Fine-Tune End-to-End Distillation Models with Azure AI Foundry Models”
Learn about Phi on Windows devices on May 20 in “Enable seamless deployment across Intel Copilot+ AI PCs and Azure”

Computex 2025:
MediaTek Booth (M0806), May 20-23. See MediaTek’s AI vision and hardware innovations firsthand.

Resources

Explore the Phi-4 Model Family on Azure AI Foundry and Hugging Face
Get access to the Phi Cookbook: your practical guide and code repository for building with Phi models.
Learn more about MediaTek NeuroPilot
Connect with the MediaTek Developer Application
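As a closing illustration, the on-device RAG flow described in the scenarios above (retrieve the most relevant local document, then hand it to the model as context) can be sketched in a few lines. This is a toy sketch under stated assumptions: a bag-of-words cosine similarity stands in for the embedding model, the documents and file names are made up, and the final prompt would be passed to an on-device Phi-4-mini runtime, which is not shown because its API is not described here.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: dict[str, str], k: int = 1) -> str:
    """Assemble a grounded prompt from the retrieved documents."""
    context = "\n".join(f"[{name}] {docs[name]}"
                        for name in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical local files; a real app would read these from device storage.
LOCAL_DOCS = {
    "lease.txt": "The apartment lease renews on the first of March each year.",
    "trip_notes.txt": "Flight to Taipei departs Monday morning from gate twelve.",
    "recipe.txt": "Mix flour, butter and sugar, then bake for twenty minutes.",
}

if __name__ == "__main__":
    question = "When does my lease renew?"
    print(retrieve(question, LOCAL_DOCS))      # ['lease.txt']
    print(build_prompt(question, LOCAL_DOCS))  # prompt for the local model
```

Because both retrieval and prompt assembly run locally, nothing in this flow needs a network call, which is the privacy property the demo emphasizes.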