Microsoft Developer Community Blog

Phi-4-Reasoning-Vision-15B: Use Cases In-Depth

kinfey
Mar 04, 2026

Phi-4-Reasoning-Vision-15B is Microsoft's latest vision reasoning model, released on Microsoft Foundry. It combines high-resolution visual perception with selective, task-aware reasoning, making it the first model in the Phi-4 family to achieve both "seeing clearly" and "thinking deeply" as a small language model (SLM).

Traditional vision models only perform passive perception — recognizing "what's in" an image. Phi-4-Reasoning-Vision-15B goes further by performing structured, multi-step reasoning: understanding visual structure in images, connecting it with textual context, and reaching actionable conclusions. This enables developers to build intelligent applications ranging from chart analysis to GUI automation.

Core Design Features

Selective Reasoning

The model's most critical design feature is its hybrid reasoning behavior. It can switch between "reasoning mode" and "non-reasoning mode" based on the prompt:

  • When deep reasoning is needed (e.g., math problems, logical analysis) → Multi-step reasoning chain is activated
  • When fast perception is sufficient (e.g., OCR, element localization) → Direct output with reduced latency

Three Thinking Modes (from Notebook Examples)

Developers can precisely control reasoning behavior via the thinking_mode parameter:

| Mode | Trigger | Description | Best For |
|------|---------|-------------|----------|
| `hybrid` (Mixed) | Default | Model autonomously decides whether deep reasoning is needed | General use, balancing speed and accuracy |
| `think` (Deep Thinking) | Appends `<think>` token | Forces full reasoning chain | Complex math / science / logic problems |
| `nothink` (Fast Response) | Appends `<nothink>` token | Skips reasoning chain, outputs directly | Low-latency perception tasks, simple Q&A |

The corresponding code implementation:

def run_inference(processor, model, prompt, image, thinking_mode="hybrid"):
    ## FORM MESSAGE
    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    ## PROCESS INPUTS

    prompt = processor.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        return_dict=False,
    )

    if thinking_mode == "think":
        prompt = str(prompt) + "<think>"
    elif thinking_mode == "nothink":
        prompt = str(prompt) + "<|dummy_84|>"

    print(f"Prompt: {prompt}")

    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

    ## GENERATE RESPONSE
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=None,
        top_p=None,
        do_sample=False,
        use_cache=False,
    )

    ## DECODE RESPONSE
    sequence_length = inputs["input_ids"].shape[1]

    sequence_length -= 1 if thinking_mode == "think" else 0  # in think mode, step back one token so the appended <think> token is covered by the decode

    new_output_ids = output_ids[:, sequence_length:]
    model_output = processor.batch_decode(
        new_output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    return model_output

This design allows developers to dynamically balance latency and accuracy at runtime — essential for real-time interactive applications.
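As a sketch of how an application might exploit that flexibility, the snippet below routes requests to a thinking mode by task type before calling `run_inference`. Note that `TASK_MODES` and `choose_thinking_mode` are hypothetical helpers invented for illustration, not part of the model's API; only the mode strings (`hybrid`, `think`, `nothink`) come from the Notebook.

```python
# Hypothetical routing table: map task types to the thinking_mode strings
# accepted by run_inference above. The task names are assumptions.
TASK_MODES = {
    "ocr": "nothink",          # fast perception, no reasoning needed
    "ui_grounding": "nothink", # low-latency element localization
    "math": "think",           # force the full reasoning chain
    "chart_analysis": "think", # multi-step interpretation
}

def choose_thinking_mode(task_type: str) -> str:
    """Pick a thinking mode for a task, defaulting to hybrid."""
    return TASK_MODES.get(task_type, "hybrid")
```

A dispatcher like this keeps the latency/accuracy trade-off in one place instead of scattering mode strings across the codebase.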

Key Use Cases

Use Case 1: GUI Agents (Computer Use Agents)

This is one of the model's most important application areas. The model receives a screenshot and a natural language instruction, then outputs the normalized bounding box coordinates for the target UI element. The Notebook also provides a plot_boxes() visualization function that compares model predictions (red box) against ground truth annotations (green box).

Real-World Example — E-Commerce Shopping Agent:

As described in the official documentation, in retail scenarios the model serves as the perception layer for computer-use agents:

  • Screen comprehension: Identifies products, prices, filters, promotions, buttons, and cart states
  • Grounded output: Produces actionable coordinates for upstream agent models (e.g., Fara-7B) to execute clicks, scrolls, and other interactions
  • Real-time decision support: Compact model size and low-latency inference, suitable for navigating dense product listings and comparing options
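Because the model emits normalized bounding boxes, downstream agent code has to map them to pixel coordinates before clicking, and evaluation code needs a similarity metric to compare predictions against ground truth. The helpers below are a minimal sketch (not from the Notebook), assuming boxes in `[x1, y1, x2, y2]` format with coordinates in the 0–1 range:

```python
def to_pixels(box, width, height):
    """Convert a normalized [x1, y1, x2, y2] box (0-1 range) to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0
```

An IoU threshold (commonly 0.5) is how grounding accuracy is typically scored, which is essentially what a plot_boxes()-style red-vs-green comparison shows visually.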

Use Case 2: Mathematical and Scientific Visual Reasoning

Typical applications:

  • Interpreting geometric figures and function graphs for problem-solving
  • Analyzing scientific experiment diagrams and data charts
  • Education: Students photograph and upload problems; the model shows the complete reasoning process and solution steps
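For tutoring scenarios like the one above, it is often useful to show the reasoning chain and the final answer separately. Assuming the reasoning is delimited by a closing `</think>` tag (an assumption consistent with the `<think>` token appended in `run_inference`; verify against the actual model output format), a minimal parser could look like:

```python
def split_reasoning(model_output: str):
    """Separate a reasoning chain from the final answer.

    Assumes reasoning is delimited by a closing </think> tag; if the tag
    is absent (e.g. nothink mode), the whole output is the answer.
    """
    if "</think>" in model_output:
        reasoning, _, answer = model_output.partition("</think>")
        return reasoning.strip(), answer.strip()
    return "", model_output.strip()
```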

Use Case 3: Document, Chart, and Table Understanding

Typical applications:

  • IT Operations: Interpreting monitoring dashboards, performance charts, and incident reports to assist diagnosis and decision-making
  • Financial Analysis: Extracting metrics from report screenshots and interpreting trends
  • Enterprise Report Automation: Processing scanned documents and tables to generate structured summaries
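For report automation pipelines, a common pattern is to prompt the model to answer with a JSON object and then parse the reply defensively, since models sometimes wrap JSON in explanatory text. The sketch below is an illustration of that pattern, not part of the model's API; the field names in the example are invented:

```python
import json
import re

def extract_json(model_output: str):
    """Pull the first JSON object out of a free-form model response.

    Returns the parsed dict, or None when no valid JSON object is found.
    """
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```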

Samples


1. Using Phi-4-Reasoning-Vision-15B to detect jaywalking

    Go to - Sample Code

2. Using Phi-4-Reasoning-Vision-15B for math reasoning

    Go to - Sample Code

3. Using Phi-4-Reasoning-Vision-15B for GUI Agents

    Go to - Sample Code

Model Comparison at a Glance

Below is a comparison of Phi-4-Reasoning-Vision-15B against comparable models on key tasks:

(Benchmark comparison charts: "No Thinking Mode" and "Thinking Mode")
Phi-4-Reasoning-Vision-15B shows clear advantages in math reasoning and GUI grounding tasks while remaining competitive in general multimodal understanding.

Summary

Phi-4-Reasoning-Vision-15B represents a significant milestone for small vision reasoning models:

  1. Sees clearly: High-resolution visual perception supporting documents, charts, UI screenshots, and more
  2. Thinks deeply: Selective multi-step reasoning chains that rival larger models on complex tasks
  3. Runs fast: 15B parameters + NoThink mode, suitable for real-time interactive applications
  4. Adapts flexibly: Three thinking modes switchable on the fly, letting developers dynamically balance accuracy and latency at runtime

Whether building e-commerce shopping agents, IT operations assistants, or educational tutoring tools, this model provides a complete capability chain from "seeing" to "understanding" to "acting."

Resources

1. Read the official blog - Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
2. See the model card on Hugging Face - https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B
3. Learn more about the Microsoft Phi family - Microsoft Phi CookBook

Updated Mar 03, 2026
Version 1.0