Microsoft Foundry Blog

Introducing Phi-4-Reasoning-Vision to Microsoft Foundry

yashlara
Mar 04, 2026

A small language model that combines high-resolution vision with selective, task-aware reasoning

Vision reasoning models unlock a critical capability for developers: the ability to move beyond passive perception toward systems that can understand, reason over, and act on visual information. Instead of treating images, diagrams, documents, or UI screens as unstructured inputs, vision reasoning models enable developers to build applications that can interpret visual structure, connect it with textual context, and perform multi-step reasoning to reach actionable conclusions.

Today, we are excited to announce that Phi-4-Reasoning-Vision-15B is available in Microsoft Foundry and on Hugging Face. This model brings high‑fidelity vision to the reasoning‑focused Phi‑4 family, extending small language models (SLMs) beyond perception into structured, multi‑step visual reasoning for agents, analytical tools, and scientific workflows.

What’s new?

The Phi model family has advanced toward combining efficient visual understanding with strong reasoning in small language models. Earlier Phi‑4 models demonstrated reliable perception and grounding across images and text, while later iterations introduced structured reasoning to improve performance on complex tasks. Phi-4-Reasoning-Vision-15B brings these threads together, pairing high‑resolution visual perception with selective, task‑aware reasoning. As a result, the model can reason deeply when needed while remaining fast and efficient for perception‑focused scenarios, making it well suited for interactive, real‑world applications.

Key capabilities

  • Reasoning is controlled via prompting: developers can explicitly enable or disable reasoning at runtime to balance latency and accuracy.
  • Optimized for vision reasoning, including: diagram-based math; document, chart, and table understanding; GUI interpretation and grounding, so agents can interpret screens and actions; computer-use agent scenarios; and general image chat and question answering.
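As a sketch of the runtime toggle, the snippet below switches reasoning on or off through the system prompt. The directive wording and the OpenAI-style payload shape are assumptions for illustration; consult the model card for the exact prompt syntax Phi-4-Reasoning-Vision-15B expects.

```python
# Sketch: toggling reasoning per request via the system prompt.
# ASSUMPTION: the mode directives below are illustrative placeholders;
# the model card defines the real syntax for enabling/disabling thinking.

def build_request(question: str, image_url: str, reasoning: bool) -> dict:
    """Assemble an OpenAI-style chat payload for one vision question."""
    mode = (
        "Think step by step before answering."  # deep reasoning: higher accuracy, higher latency
        if reasoning
        else "Answer directly without showing reasoning."  # perception-only: lower latency
    )
    return {
        "model": "Phi-4-Reasoning-Vision-15B",
        "messages": [
            {"role": "system", "content": f"You are a visual assistant. {mode}"},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
    }

fast = build_request("What does this chart show?", "https://example.com/chart.png", reasoning=False)
print(fast["messages"][0]["content"])
# → You are a visual assistant. Answer directly without showing reasoning.
```

Because the toggle is just a per-request prompt change, an application can route easy perception queries through the fast path and reserve full reasoning for harder inputs.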

Benchmarks

The following results summarize Phi-4-Reasoning-Vision-15B performance across a set of established multimodal reasoning, mathematics, and computer-use benchmarks; all numbers are from internal evaluations.

| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B (force no think) | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K |
|---|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85 |
| ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84 |
| HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9 |
| MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2 |
| MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5 |
| MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8 |
| MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6 |
| MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3 |
| OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5 |
| ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9 |

Table 1: Accuracy comparisons relative to popular open-weight, non-thinking models

| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B (force thinking) | Kimi-VL-A3B-Thinking | gemma-3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K |
|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2 |
| ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1 |
| HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6 |
| MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2 |
| MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6 |
| MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8 |
| MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2 |
| MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7 |
| OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85 |
| ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1 |

Table 2: Accuracy comparisons relative to popular open-weight, thinking models

All results were obtained using a consistent evaluation setup and prompts across models; numbers are provided for comparison and analysis rather than as leaderboard claims. For more information regarding benchmarks and evaluations, please read the technical paper on the Microsoft Research hub.

Suggested use cases and applications

Phi‑4‑Reasoning-Vision-15B supports applications that require both high‑fidelity visual perception and structured inference. Two representative scenarios include scientific and mathematical reasoning over visual inputs, and computer‑using agents (CUAs) that operate directly on graphical user interfaces. In both cases, the model provides grounded visual understanding paired with controllable, low‑latency reasoning suitable for interactive systems.

 

Figure 1. Phi-4-Reasoning-Vision-15B can interpret sequences of images.

Computer use agents in retail scenarios

For computer use agents, Phi‑4‑Reasoning-Vision-15B provides the perception and grounding layer required to understand and act within live e-commerce interfaces. For example, in an online shopping experience, the model interprets screen content (products, prices, filters, promotions, buttons, and cart state) and produces grounded observations that agentic models like Fara-7B can use to select actions. Its compact size and low-latency inference make it well suited for CUA workflows and agentic applications.
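To illustrate the perception-to-action handoff, the sketch below assumes the vision model is prompted to emit grounded screen observations as JSON, which an action model then filters for actionable elements. The `label`/`role`/`bbox` schema is hypothetical, not a Foundry or Fara-7B contract.

```python
import json

# Sketch of a CUA perception step: the vision model is prompted to return
# grounded screen observations as JSON; an action model consumes them.
# ASSUMPTION: the observation schema below is illustrative only.

def parse_observation(raw: str) -> list[dict]:
    """Validate the model's JSON observation and keep actionable elements."""
    elements = json.loads(raw)
    actionable = []
    for el in elements:
        # Keep only well-formed elements an agent can act on.
        if {"label", "role", "bbox"} <= el.keys() and el["role"] in {"button", "link", "input"}:
            actionable.append(el)
    return actionable

# Example observation for a shopping page: one button, one static price label.
raw = (
    '[{"label": "Add to cart", "role": "button", "bbox": [812, 640, 990, 684]},'
    ' {"label": "Price: $24.99", "role": "text", "bbox": [812, 560, 930, 590]}]'
)
print(parse_observation(raw))  # only the "Add to cart" button is actionable
```

Validating the model's structured output before handing it to an action policy keeps malformed or non-interactive elements out of the agent's action space.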

Visual reasoning for education

Another practical use of visual reasoning models is education. A developer could build a K‑12 tutoring app with Phi‑4‑Reasoning‑Vision‑15B where students upload photos of worksheets, charts, or diagrams to get guided help—not answers. The model can understand the visual content, identify where the student went wrong, and explain the correct steps clearly. Over time, the app can adapt by serving new examples matched to the student’s learning level, turning visual problem‑solving into a personalized learning experience.
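The adaptive loop described above can be sketched as a simple policy over a difficulty scale. The 1-5 scale and the 80%/40% thresholds are illustrative choices for this sketch, not tuned values or part of any Phi-4 API.

```python
# Sketch: picking the next worksheet difficulty from recent results.
# ASSUMPTION: the 1-5 scale and 80%/40% thresholds are illustrative.

def next_difficulty(current: int, recent_correct: list[bool]) -> int:
    """Step difficulty up, down, or hold based on the recent success rate."""
    if not recent_correct:
        return current
    rate = sum(recent_correct) / len(recent_correct)
    if rate >= 0.8:
        return min(current + 1, 5)  # mastering this level: advance
    if rate <= 0.4:
        return max(current - 1, 1)  # struggling: step back
    return current                  # mixed results: stay put

print(next_difficulty(3, [True, True, True, True, False]))  # → 4
```

In the tutoring app, a policy like this would sit outside the model: the model grades and explains each uploaded problem, and the app uses the running results to choose which examples to serve next.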

Microsoft Responsible AI principles

At Microsoft, our mission to empower people and organizations remains constant—especially in the age of AI, where the potential for human achievement is greater than ever. We recognize that trust is foundational to AI adoption, and earning that trust requires a commitment to transparency, safety, and accountability. As with other Phi models, Phi-4-Reasoning-Vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft's Responsible AI Principles. These safety-focused training signals help the model recognize and decline requests that fall outside intended or acceptable use. Additional details on the model's safety considerations, evaluation approach, and known limitations are provided in the accompanying technical blog and model card.

Getting started

Start using Phi‑4‑Reasoning-Vision-15B in Microsoft Foundry today. Microsoft Foundry provides a unified environment for model discovery, evaluation, and deployment, making it straightforward to move from initial experimentation to production use while applying appropriate safety and governance practices.
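To sketch a first call, the snippet below inlines a local image for a multimodal chat request. The commented client code assumes your Foundry deployment exposes an OpenAI-compatible endpoint; the endpoint URL and deployment name are placeholders, and your deployment's names may differ.

```python
import base64
import mimetypes

def image_to_data_url(path: str) -> str:
    """Inline a local image as a data: URL for a multimodal chat message."""
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Hedged sketch of the call itself (placeholder endpoint and deployment name):
#
#   from openai import OpenAI
#   client = OpenAI(base_url="https://<your-foundry-endpoint>/v1", api_key="<key>")
#   resp = client.chat.completions.create(
#       model="Phi-4-Reasoning-Vision-15B",  # your deployment name may differ
#       messages=[{"role": "user", "content": [
#           {"type": "image_url", "image_url": {"url": image_to_data_url("chart.png")}},
#           {"type": "text", "text": "Summarize this chart."},
#       ]}],
#   )
#   print(resp.choices[0].message.content)
```

Data URLs avoid hosting images just to test the model; for production workloads, passing hosted image URLs keeps request payloads small.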

Updated Mar 04, 2026
Version 3.0