A small language model that combines high-resolution vision with selective, task-aware reasoning
Vision reasoning models unlock a critical capability for developers: the ability to move beyond passive perception toward systems that can understand, reason over, and act on visual information. Instead of treating images, diagrams, documents, or UI screens as unstructured inputs, vision reasoning models enable developers to build applications that can interpret visual structure, connect it with textual context, and perform multi-step reasoning to reach actionable conclusions.
Today, we are excited to announce that Phi-4-Reasoning-Vision-15B is available in Microsoft Foundry and on Hugging Face. This model brings high‑fidelity vision to the reasoning‑focused Phi‑4 family, extending small language models (SLMs) beyond perception into structured, multi‑step visual reasoning for agents, analytical tools, and scientific workflows.
What’s new?
The Phi model family has advanced toward combining efficient visual understanding with strong reasoning in small language models. Earlier Phi‑4 models demonstrated reliable perception and grounding across images and text, while later iterations introduced structured reasoning to improve performance on complex tasks. Phi‑4‑reasoning-vision-15B brings these threads together, pairing high‑resolution visual perception with selective, task‑aware reasoning. As a result, the model can reason deeply when needed while remaining fast and efficient for perception‑focused scenarios—making it well suited for interactive, real‑world applications.
Key capabilities
- Reasoning on demand: reasoning behavior is controlled via prompting, so developers can enable or disable it at runtime to balance latency and accuracy (see the sketch after this list).
- Optimized for vision reasoning across a range of tasks: diagram-based math; document, chart, and table understanding; GUI interpretation and grounding so agents can interpret screens and actions; computer-use agent scenarios; and general image chat and question answering.
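The reasoning toggle can be exercised directly against a Microsoft Foundry deployment. The sketch below is a minimal illustration using the azure-ai-inference Python package; the endpoint, API key, deployment name, and the system-prompt wording used to switch reasoning on or off are placeholders, so consult the model card for the exact control instruction.

```python
# Minimal sketch: toggling reasoning at runtime against a Microsoft Foundry
# deployment with the azure-ai-inference package. Endpoint, key, deployment
# name, and the system-prompt phrasing are placeholders, not the official API.
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
    SystemMessage, UserMessage, TextContentItem, ImageContentItem, ImageUrl,
)
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-foundry-resource>.services.ai.azure.com/models",  # placeholder
    credential=AzureKeyCredential("<your-api-key>"),                          # placeholder
)

def ask(image_url: str, question: str, reasoning: bool) -> str:
    # Hypothetical toggle: one system prompt requests step-by-step reasoning,
    # the other asks for a direct, low-latency answer.
    system = (
        "Think step by step before answering."
        if reasoning
        else "Answer directly and concisely, without showing intermediate reasoning."
    )
    response = client.complete(
        model="Phi-4-Reasoning-Vision-15B",  # your deployment name may differ
        messages=[
            SystemMessage(content=system),
            UserMessage(content=[
                TextContentItem(text=question),
                ImageContentItem(image_url=ImageUrl(url=image_url)),
            ]),
        ],
    )
    return response.choices[0].message.content

# Fast, perception-only query vs. a deeper reasoning pass over the same chart.
print(ask("https://example.com/chart.png", "What is the 2023 revenue?", reasoning=False))
print(ask("https://example.com/chart.png", "Explain the trend and estimate 2026 revenue.", reasoning=True))
```

In an interactive application, the non-reasoning path keeps latency low for simple perception queries, while the reasoning path is reserved for multi-step questions.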
Benchmarks
The following results summarize Phi-4-Reasoning-Vision-15B performance across a set of established multimodal reasoning, mathematics, and computer-use benchmarks. All numbers come from internal evaluations.
| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B (force no think) | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K |
|---|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85 |
| ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84 |
| HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9 |
| MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2 |
| MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5 |
| MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8 |
| MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6 |
| MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3 |
| OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5 |
| ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9 |
Table 1: Accuracy comparisons relative to popular open-weight, non-thinking models
| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B (force thinking) | Kimi-VL-A3B-Thinking | gemma-3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K |
|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2 |
| ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1 |
| HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6 |
| MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2 |
| MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6 |
| MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8 |
| MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2 |
| MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7 |
| OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85 |
| ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1 |
Table 2: Accuracy comparisons relative to popular open-weight, thinking models
All results were obtained using a consistent evaluation setup and prompts across models; numbers are provided for comparison and analysis rather than as leaderboard claims. For more information regarding benchmarks and evaluations, please read the technical paper on the Microsoft Research hub.
Suggested use cases and applications
Phi‑4‑Reasoning-Vision-15B supports applications that require both high‑fidelity visual perception and structured inference. Two representative scenarios include scientific and mathematical reasoning over visual inputs, and computer‑using agents (CUAs) that operate directly on graphical user interfaces. In both cases, the model provides grounded visual understanding paired with controllable, low‑latency reasoning suitable for interactive systems.
Figure 1. Phi-4-Reasoning-Vision-15B can interpret sequences of images.
Computer-use agents in retail scenarios
For computer-use agents, Phi‑4‑Reasoning-Vision-15B provides the perception and grounding layer required to understand and act within live ecommerce interfaces. For example, in an online shopping experience, the model interprets screen content—products, prices, filters, promotions, buttons, and cart state—and produces grounded observations that agentic models like Fara-7B can use to select actions. Its compact size and low-latency inference make it well suited for CUA workflows and agentic applications.
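As a rough illustration of how that grounding layer might be wired into an agent loop, the sketch below sends a screenshot to the model and asks for a structured observation that a downstream planner (such as Fara-7B) could act on. The function name, prompt, and JSON observation schema are assumptions for illustration, and a production system would validate the model output before parsing it; the client object is the one created in the earlier example.

```python
import base64
import json

from azure.ai.inference.models import (
    SystemMessage, UserMessage, TextContentItem, ImageContentItem, ImageUrl,
)

def ground_screen(client, deployment: str, screenshot_png: bytes, goal: str) -> dict:
    """Ask the model for a structured observation of an ecommerce screen.

    The observation schema (page_summary / candidate_actions) is illustrative,
    not a fixed contract; a real agent loop would validate the returned JSON.
    """
    data_url = "data:image/png;base64," + base64.b64encode(screenshot_png).decode()
    response = client.complete(
        model=deployment,
        messages=[
            SystemMessage(content=(
                "You are the perception layer for a computer-use agent on a shopping site. "
                "Respond with JSON only, with keys 'page_summary' and 'candidate_actions'; "
                "each action has 'element', 'label', and an approximate 'box' [x1, y1, x2, y2]."
            )),
            UserMessage(content=[
                TextContentItem(text=f"Goal: {goal}. Describe the UI elements relevant to this goal."),
                ImageContentItem(image_url=ImageUrl(url=data_url)),
            ]),
        ],
    )
    return json.loads(response.choices[0].message.content)

# Example usage (screenshot_bytes comes from the agent's browser automation layer):
# observation = ground_screen(client, "Phi-4-Reasoning-Vision-15B",
#                             screenshot_bytes, "add the cheapest blue kettle to the cart")
```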
Visual reasoning for education
Another practical use of visual reasoning models is education. A developer could build a K‑12 tutoring app with Phi‑4‑Reasoning‑Vision‑15B where students upload photos of worksheets, charts, or diagrams to get guided help—not answers. The model can understand the visual content, identify where the student went wrong, and explain the correct steps clearly. Over time, the app can adapt by serving new examples matched to the student’s learning level, turning visual problem‑solving into a personalized learning experience.
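Much of a tutoring app like this comes down to prompt design: the system prompt steers the model toward hints rather than answers. The snippet below is one possible prompt pattern, reusing the message types and client from the earlier example; the wording and grade-level framing are assumptions to adapt to your curriculum.

```python
from azure.ai.inference.models import (
    SystemMessage, UserMessage, TextContentItem, ImageContentItem, ImageUrl,
)

# Illustrative system prompt: guide the student instead of solving the problem outright.
TUTOR_SYSTEM_PROMPT = (
    "You are a patient K-12 math tutor. The student shares a photo of their worksheet. "
    "Do not give the final answer. Instead: (1) restate the problem from the image, "
    "(2) point out the first step where the student's work goes wrong, and "
    "(3) offer one hint that lets the student correct it on their own."
)

def build_tutor_messages(worksheet_image_url: str, student_question: str):
    """Build the chat messages for a guided-hint request."""
    return [
        SystemMessage(content=TUTOR_SYSTEM_PROMPT),
        UserMessage(content=[
            TextContentItem(text=student_question),
            ImageContentItem(image_url=ImageUrl(url=worksheet_image_url)),
        ]),
    ]

# reply = client.complete(model="Phi-4-Reasoning-Vision-15B",
#                         messages=build_tutor_messages(photo_url, "Why is my answer wrong?"))
```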
Microsoft Responsible AI principles
At Microsoft, our mission to empower people and organizations remains constant—especially in the age of AI, where the potential for human achievement is greater than ever. We recognize that trust is foundational to AI adoption, and earning that trust requires a commitment to transparency, safety, and accountability. As with other Phi models, Phi-4-Reasoning-Vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft's Responsible AI Principles. These safety-focused training signals help the model recognize and decline requests that fall outside intended or acceptable use. Additional details on the model's safety considerations, evaluation approach, and known limitations are provided in the accompanying technical blog and model card.
Getting started
Start using Phi‑4‑Reasoning-Vision-15B in Microsoft Foundry today. Microsoft Foundry provides a unified environment for model discovery, evaluation, and deployment, making it straightforward to move from initial experimentation to production use while applying appropriate safety and governance practices.
- Deploy the new model on Microsoft Foundry.
- Learn more about the Phi family on Foundry Labs and in the Phi Cookbook
- Connect to the Microsoft Developer Community on Discord
- Read the technical paper on Microsoft Research
- Read more use cases on the Educators Developer blog