Phi-4-Reasoning-Vision-15B is Microsoft's latest vision reasoning model, released on Microsoft Foundry. It combines high-resolution visual perception with selective, task-aware reasoning, making it the first model in the Phi-4 family to simultaneously achieve both "seeing clearly" and "thinking deeply" as a small language model (SLM).
Traditional vision models only perform passive perception — recognizing "what's in" an image. Phi-4-Reasoning-Vision-15B goes further by performing structured, multi-step reasoning: understanding visual structure in images, connecting it with textual context, and reaching actionable conclusions. This enables developers to build intelligent applications ranging from chart analysis to GUI automation.
Core Design Features
2.1 Selective Reasoning
The model's most critical design feature is its hybrid reasoning behavior. It can switch between "reasoning mode" and "non-reasoning mode" based on the prompt:
- When deep reasoning is needed (e.g., math problems, logical analysis) → Multi-step reasoning chain is activated
- When fast perception is sufficient (e.g., OCR, element localization) → Direct output with reduced latency
2.2 Three Thinking Modes (from Notebook Examples)
Developers can precisely control reasoning behavior via the thinking_mode parameter:
| Mode | Trigger | Description | Best For |
|------|---------|-------------|----------|
| `hybrid` (Mixed) | Default | Model autonomously decides whether deep reasoning is needed | General use, balancing speed and accuracy |
| `think` (Deep Thinking) | Appends `<think>` token | Forces full reasoning chain | Complex math / science / logic problems |
| `nothink` (Fast Response) | Appends `<nothink>` token | Skips reasoning chain, outputs directly | Low-latency perception tasks, simple Q&A |
The corresponding code implementation:
```python
def run_inference(processor, model, prompt, image, thinking_mode="hybrid"):
    # Form the chat message
    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    # Render the chat template to a plain prompt string
    prompt = processor.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        return_dict=False,
    )
    if thinking_mode == "think":
        prompt = str(prompt) + "<think>"
    elif thinking_mode == "nothink":
        prompt = str(prompt) + "<|dummy_84|>"  # special token that disables the reasoning chain
    print(f"Prompt: {prompt}")

    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

    # Generate the response (greedy decoding)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=None,
        top_p=None,
        do_sample=False,
        use_cache=False,
    )

    # Decode only the newly generated tokens
    sequence_length = inputs["input_ids"].shape[1]
    # In think mode, keep the appended <think> token so it appears in the decoded output
    sequence_length -= 1 if thinking_mode == "think" else 0
    new_output_ids = output_ids[:, sequence_length:]
    model_output = processor.batch_decode(
        new_output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return model_output
```
This design allows developers to dynamically balance latency and accuracy at runtime — essential for real-time interactive applications.
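As a sketch of how an application might exploit this at runtime, the mode choice can be made explicit with a small dispatcher that maps a coarse task category to a `thinking_mode` value. The category names and the mapping below are illustrative assumptions for demonstration, not part of the model's API:

```python
# Illustrative sketch: choose a thinking_mode per task category.
# The task labels and mapping are assumptions, not part of the
# Phi-4-Reasoning-Vision-15B API.

FAST_TASKS = {"ocr", "element_localization", "simple_qa"}
DEEP_TASKS = {"math", "logic", "science"}

def select_thinking_mode(task_type: str) -> str:
    """Pick a thinking mode based on a coarse task category."""
    task_type = task_type.lower()
    if task_type in FAST_TASKS:
        return "nothink"   # skip the reasoning chain for low latency
    if task_type in DEEP_TASKS:
        return "think"     # force the full reasoning chain
    return "hybrid"        # let the model decide

print(select_thinking_mode("ocr"))   # nothink
print(select_thinking_mode("math"))  # think
```

The returned string can then be passed straight to `run_inference(..., thinking_mode=...)` above.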
Key Use Cases
Use Case 1: GUI Agents (Computer Use Agents)
This is one of the model's most important application areas. The model receives a screenshot and a natural language instruction, then outputs the normalized bounding box coordinates for the target UI element. The Notebook also provides a plot_boxes() visualization function that compares model predictions (red box) against ground truth annotations (green box).
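To make the grounding output usable, an application must parse the normalized box from the model's text and scale it to pixel coordinates; comparing against ground truth (as plot_boxes() does visually) is typically done with intersection-over-union. The sketch below assumes the model emits the box as four bracketed floats in 0..1, which is an assumption about the output format, not a documented guarantee:

```python
import re

def parse_normalized_box(text: str) -> list[float]:
    """Extract the first [x1, y1, x2, y2] normalized box from model output.

    Assumes coordinates appear as bracketed floats in 0..1,
    e.g. "[0.12, 0.30, 0.45, 0.62]" (an illustrative format assumption).
    """
    nums = re.findall(r"[-+]?\d*\.?\d+", text)
    if len(nums) < 4:
        raise ValueError("no bounding box found in output")
    return [float(n) for n in nums[:4]]

def to_pixel_box(box, width, height):
    """Scale a normalized box to pixel coordinates for clicking or drawing."""
    x1, y1, x2, y2 = box
    return [x1 * width, y1 * height, x2 * width, y2 * height]

def iou(a, b):
    """Intersection-over-union between two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

pred = parse_normalized_box("The button is at [0.10, 0.20, 0.30, 0.40].")
print(to_pixel_box(pred, 1920, 1080))  # [192.0, 216.0, 576.0, 432.0]
```

An upstream agent can click the center of the pixel box, and a high IoU against an annotated box indicates an accurate grounding prediction.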
Real-World Example — E-Commerce Shopping Agent:
As described in the official documentation, in retail scenarios the model serves as the perception layer for computer-use agents:
- Screen comprehension: Identifies products, prices, filters, promotions, buttons, and cart states
- Grounded output: Produces actionable coordinates for upstream agent models (e.g., Fara-7B) to execute clicks, scrolls, and other interactions
- Real-time decision support: Compact model size and low-latency inference, suitable for navigating dense product listings and comparing options
Use Case 2: Mathematical and Scientific Visual Reasoning
Typical applications:
- Interpreting geometric figures and function graphs for problem-solving
- Analyzing scientific experiment diagrams and data charts
- Education: Students photograph and upload problems; the model shows the complete reasoning process and solution steps
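For applications that show students the full working, it helps to separate the reasoning chain from the final answer in the model's output. The sketch below assumes the chain of thought is wrapped in `<think>...</think>` tags, which is an illustrative assumption about the output format:

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Separate the reasoning chain from the final answer.

    Assumes the model wraps its chain of thought in <think>...</think>
    tags (an illustrative assumption about the output format).
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = output[match.end():].strip()
        return reasoning, answer
    # nothink-style output: no reasoning chain, everything is the answer
    return "", output.strip()

raw = "<think>Legs are 3 and 4, so the hypotenuse is 5.</think>\nThe answer is 5."
reasoning, answer = split_reasoning(raw)
print(answer)  # The answer is 5.
```

A tutoring UI could then render the reasoning in a collapsible "show work" panel and the answer prominently.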
Use Case 3: Document, Chart, and Table Understanding
Typical applications:
- IT Operations: Interpreting monitoring dashboards, performance charts, and incident reports to assist diagnosis and decision-making
- Financial Analysis: Extracting metrics from report screenshots and interpreting trends
- Enterprise Report Automation: Processing scanned documents and tables to generate structured summaries
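For report automation, a common pattern is to prompt the model to return an extracted table as markdown and then parse it into structured records downstream. A minimal parsing sketch, assuming the model returns a well-formed pipe-delimited table (an assumption, not a guaranteed output format):

```python
def parse_markdown_table(text: str) -> list[dict]:
    """Convert a markdown table (as a model might emit) into row dicts.

    Illustrative helper for report automation; assumes a well-formed
    pipe-delimited table with a header row.
    """
    lines = [l.strip() for l in text.strip().splitlines() if l.strip().startswith("|")]
    rows = [[c.strip() for c in line.strip("|").split("|")] for line in lines]
    # Drop the header separator row (e.g. |---|---|)
    rows = [r for r in rows if not all(set(c) <= set("-: ") for c in r)]
    header, *body = rows
    return [dict(zip(header, r)) for r in body]

table = """
| Metric  | Q1 | Q2 |
|---------|----|----|
| Revenue | 10 | 12 |
"""
print(parse_markdown_table(table))
# [{'Metric': 'Revenue', 'Q1': '10', 'Q2': '12'}]
```

The resulting dicts can feed directly into a summary generator or a dataframe for trend analysis.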
Samples
1. Using Phi-4-Reasoning-Vision-15B to detect jaywalking
Go to - Sample Code
2. Using Phi-4-Reasoning-Vision-15B for math reasoning
Go to - Sample Code
3. Using Phi-4-Reasoning-Vision-15B for GUI Agent
Go to - Sample Code
Model Comparison at a Glance
Below is a comparison of Phi-4-Reasoning-Vision-15B against comparable models on key tasks:
[Benchmark comparison charts: "No Thinking Mode" and "Thinking Mode"]
Phi-4-Reasoning-Vision-15B shows clear advantages in math reasoning and GUI grounding tasks while remaining competitive in general multimodal understanding.
Summary
Phi-4-Reasoning-Vision-15B represents a significant milestone for small vision reasoning models:
- Sees clearly: High-resolution visual perception supporting documents, charts, UI screenshots, and more
- Thinks deeply: Selective multi-step reasoning chains that rival larger models on complex tasks
- Runs fast: 15B parameters + NoThink mode, suitable for real-time interactive applications
- Adapts flexibly: Three thinking modes switchable on the fly, letting developers dynamically balance accuracy and latency at runtime
Whether building e-commerce shopping agents, IT operations assistants, or educational tutoring tools, this model provides a complete capability chain from "seeing" to "understanding" to "acting."
Resources
1. Read the official blog - Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
2. Learn more about Phi-4-reasoning-vision on Hugging Face - https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B
3. Learn more about the Microsoft Phi family - Microsoft Phi CookBook