Microsoft Developer Community Blog

Make Phi-4-mini-reasoning more powerful with industry reasoning on edge devices

Lee_Stott, Microsoft
May 01, 2025

Advanced reasoning models have officially joined the Phi-4 family. Depending on your available compute, you can use Phi-4-mini-reasoning (3.8B) or Phi-4-reasoning (14B) for reasoning over mathematics and simple logic scenarios. In practice, however, reasoning is needed not only in mathematics but also in many industry scenarios, which we can address by fine-tuning with chain-of-thought (CoT) datasets. In this post, I will use a simple medical data scenario as an example.

In situations with limited compute, Phi-4-mini-reasoning is an excellent model choice. We can use Microsoft Olive or the Apple MLX framework to quantize Phi-4-mini-reasoning and deploy it on edge devices such as IoT hardware, laptops, and mobile devices.

Quantization

A full-precision model is often difficult to deploy directly to specific hardware, so we reduce its size and complexity through model quantization. Quantization inevitably causes some precision loss, but it dramatically lowers the memory and compute required for inference.
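
To build intuition for what quantization does, here is a minimal, framework-free sketch of symmetric 4-bit quantization: each float weight is snapped to a small integer grid via a scale factor and mapped back at inference time. This is purely illustrative; Olive and MLX use more sophisticated schemes, such as INT4 with per-group scales.

import numpy as np

# Illustrative symmetric quantization to the 4-bit signed grid [-8, 7].
# Real INT4 schemes quantize per group of weights with per-group scales;
# this sketch uses a single scale for the whole tensor.
def quantize_int4(weights):
    scale = np.abs(weights).max() / 7.0   # map the largest magnitude onto the grid edge
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale   # approximate reconstruction

w = np.random.randn(8).astype(np.float32)
q, s = quantize_int4(w)
print("original:", w)
print("restored:", dequantize(q, s))      # close, but not identical -> precision loss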

Quantize Phi-4-mini-reasoning using Microsoft Olive

Microsoft Olive is an AI model optimization toolkit for ONNX Runtime. Given a model and target hardware, Olive (short for ONNX LIVE) combines the most appropriate optimization techniques to output the most efficient ONNX model for inference in the cloud or on the edge. We can combine Microsoft Olive with Phi-4-mini-reasoning from Azure AI Foundry's Model Catalog to quantize it to an ONNX format model.

  1. Create your Notebook on Azure ML
  2. Install Microsoft Olive

 

pip install git+https://github.com/Microsoft/Olive.git

Quantize using Microsoft Olive

olive auto-opt --model_name_or_path {Azure Model Catalog path, such as azureml://registries/azureml/models/Phi-4-mini-reasoning/versions/1} --device cpu --provider CPUExecutionProvider --use_model_builder --precision int4 --output_path ./phi-4-mini-reasoning-onnx --log_level 1

 

Register your quantized model

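You can register the quantized model so it can be tracked and downloaded from your workspace. Below is a minimal sketch using the Azure ML Python SDK v2; it assumes an authenticated MLClient named ml_client, and registers the ONNX output folder under the name phi-4-mini-onnx-int4-cpu used in the download step that follows.

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# Register the quantized ONNX folder as a custom model asset
# (assumes `ml_client` is an authenticated azure.ai.ml.MLClient)
model = Model(
    path="./phi-4-mini-reasoning-onnx",
    type=AssetTypes.CUSTOM_MODEL,
    name="phi-4-mini-onnx-int4-cpu",
    description="Phi-4-mini-reasoning quantized to INT4 ONNX (CPU)",
)
ml_client.models.create_or_update(model)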

Download and run locally

Download the ONNX model to your local device

ml_client.models.download("phi-4-mini-onnx-int4-cpu", 1)

Run the ONNX model with onnxruntime-genai

Install onnxruntime-genai (this is the CPU version)

pip install onnxruntime-genai

Run it

import onnxruntime_genai as og

# Load the quantized ONNX model and its tokenizer
model_folder = "Your ONNX Model Path"
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

search_options = {}
search_options['max_length'] = 32768

chat_template = "<|user|>{input}<|end|><|assistant|>"
text = 'A school arranges dormitories for students. If each dormitory accommodates 5 people, 4 people cannot live there; if each dormitory accommodates 6 people, one dormitory only has 4 people, and two dormitories are empty. Find the number of students in this grade and the number of dormitories.'
prompt = f'{chat_template.format(input=text)}'

# Generate token by token, streaming the decoded text to stdout
input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

Get the notebook from the Phi Cookbook: https://aka.ms/phicookbook

Quantize Phi-4-mini-reasoning model using Apple MLX

Install Apple MLX Framework

pip install -U mlx-lm

 

Convert the Phi-4-mini-reasoning model with Apple MLX quantization

python -m mlx_lm.convert --hf-path {Phi-4-mini-reasoning Hugging Face id} -q

 

Run Phi-4-mini-reasoning with Apple MLX in the terminal

    

python -m mlx_lm.generate --model ./mlx_model --max-token 2048 --prompt "A school arranges dormitories for students. If each dormitory accommodates 5 people, 4 people cannot live there; if each dormitory accommodates 6 people, one dormitory only has 4 people, and two dormitories are empty. Find the number of students in this grade and the number of dormitories." --extra-eos-token "<|end|>" --temp 0.0
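
If you prefer to call MLX from Python rather than the terminal, the following is a minimal sketch using mlx_lm's high-level load and generate helpers; the paths and parameters simply mirror the command above.

from mlx_lm import load, generate

# Load the quantized model produced by mlx_lm.convert
model, tokenizer = load("./mlx_model")

prompt = ("A school arranges dormitories for students. If each dormitory accommodates "
          "5 people, 4 people cannot live there; if each dormitory accommodates 6 people, "
          "one dormitory only has 4 people, and two dormitories are empty. Find the number "
          "of students in this grade and the number of dormitories.")

# verbose=True streams tokens to stdout as they are generated
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048, verbose=True)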

Fine-tuning

We can fine-tune Phi-4-mini-reasoning on CoT data from different scenarios to give it reasoning capabilities for those scenarios. Here we use the medical CoT data from a public Hugging Face dataset as our example (this is just an example; if you need rigorous medical reasoning, please seek more professional data support).
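
Concretely, each training record concatenates the question, the chain-of-thought, and the final answer into one chat-formatted string. Using the prompt template from the data-preparation scripts below, a formatted sample looks like this:

<|user|>{Question}<|end|><|assistant|><think>{Complex_CoT}</think>{Response}<|end|>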

We can run this fine-tuning in Azure ML.

Fine-tune Phi-4-mini-reasoning using Microsoft Olive in Azure ML

Note: Please use a Standard_NC24ads_A100_v4 instance to run this sample.

  1. Get the data from Hugging Face datasets

pip install datasets

Run this script to get the training data:

from datasets import load_dataset

# Chat template used to assemble each training sample
# (same template as in the MLX data-preparation script below)
prompt_template = """<|user|>{}<|end|><|assistant|><think>{}</think>{}<|end|>"""

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = prompt_template.format(input, cot, output) + "<|end|>"
        # text = prompt_template.format(input, cot, output) + "<|endoftext|>"
        texts.append(text)
    return {
        "text": texts,
    }

# Create the English dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train", trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=["Question", "Complex_CoT", "Response"])
dataset.to_json("en_dataset.jsonl")

 

Fine-tuning with Microsoft Olive

olive finetune \
    --method lora \
    --model_name_or_path {Azure Model Catalog path, e.g. azureml://registries/azureml/models/Phi-4-mini-reasoning/versions/1} \
    --trust_remote_code \
    --data_name json \
    --data_files ./en_dataset.jsonl \
    --train_split "train[:16000]" \
    --eval_split "train[16000:19700]" \
    --text_field "text" \
    --max_steps 100 \
    --logging_steps 10 \
    --output_path {Your fine-tuning save path} \
    --log_level 1

 

Convert the model to ONNX with Microsoft Olive

olive capture-onnx-graph \
    --model_name_or_path {Azure Model Catalog path, e.g. azureml://registries/azureml/models/Phi-4-mini-reasoning/versions/1} \
    --adapter_path {Your fine-tuning adapter path} \
    --use_model_builder \
    --output_path {Your save onnx path} \
    --log_level 1

olive generate-adapter \
    --model_name_or_path {Your save onnx path} \
    --output_path {Your save onnx adapter path} \
    --log_level 1
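
Here capture-onnx-graph exports the fine-tuned model as an ONNX graph, and generate-adapter extracts the LoRA weights into a separate .onnx_adapter file. This split lets the runtime load the base model once and activate adapters on demand, which is exactly what the inference script below does with og.Adapters.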

 

Run the model with onnxruntime-genai-cuda

Install the onnxruntime-genai-cuda SDK

pip install onnxruntime-genai-cuda

Run the inference with onnxruntime-genai-cuda

import onnxruntime_genai as og
import numpy as np
import os

# Load the fine-tuned ONNX model and attach the LoRA adapter
model_folder = "./models/phi-4-mini-reasoning/adapter-onnx/model/"
model = og.Model(model_folder)
adapters = og.Adapters(model)
adapters.load('./models/phi-4-mini-reasoning/adapter-onnx/model/adapter_weights.onnx_adapter', "en_medical_reasoning")

tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False
search_options['temperature'] = 1
search_options['top_k'] = 1

prompt_template = """<|user|>{}<|end|><|assistant|><think>"""
question = """
A 33-year-old woman is brought to the emergency department 15 minutes after being stabbed in the chest with a screwdriver. Given her vital signs of pulse 110/min, respirations 22/min, and blood pressure 90/65 mm Hg, along with the presence of a 5-cm deep stab wound at the upper border of the 8th rib in the left midaxillary line, which anatomical structure in her chest is most likely to be injured?
"""
prompt = prompt_template.format(question)

# Generate with the medical-reasoning adapter active
input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.set_active_adapter(adapters, "en_medical_reasoning")
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

Fine-tune Phi-4-mini-reasoning using Apple MLX locally on macOS

Note: We recommend an Apple Silicon device with a minimum of 64GB of memory.

  1. Get the dataset from Hugging Face datasets

pip install datasets

Run this script to get the training and validation data:

from datasets import load_dataset

prompt_template = """<|user|>{}<|end|><|assistant|><think>{}</think>{}<|end|>"""

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        # text = prompt_template.format(input, cot, output) + "<|end|>"
        text = prompt_template.format(input, cot, output) + "<|endoftext|>"
        texts.append(text)
    return {
        "text": texts,
    }

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", trust_remote_code=True)

# Hold out 20% of the training split for validation
split_dataset = dataset["train"].train_test_split(test_size=0.2, seed=200)
train_dataset = split_dataset['train']
validation_dataset = split_dataset['test']

train_dataset = train_dataset.map(formatting_prompts_func, batched=True, remove_columns=["Question", "Complex_CoT", "Response"])
train_dataset.to_json("./data/train.jsonl")

validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True, remove_columns=["Question", "Complex_CoT", "Response"])
validation_dataset.to_json("./data/valid.jsonl")

 

Fine-tuning with Apple MLX

python -m mlx_lm.lora --model ./phi-4-mini-reasoning --train --data ./data --iters 100
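
Optionally, after training you can merge the LoRA adapters back into the base weights so the model ships as a single folder. A minimal sketch with mlx_lm's fuse command (the save path is just an example name):

python -m mlx_lm.fuse --model ./phi-4-mini-reasoning --adapter-path ./adapters --save-path ./phi-4-mini-reasoning-medical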

 

Running the model

! python -m mlx_lm.generate --model ./phi-4-mini-reasoning --adapter-path ./adapters --max-token 4096 --prompt "A 54-year-old construction worker with a long history of smoking presents with swelling in his upper extremity and face, along with dilated veins in this region. After conducting a CT scan and venogram of the neck, what is the most likely diagnosis for the cause of these symptoms?" --extra-eos-token "<|end|>"

Get the notebook from the Phi Cookbook: https://aka.ms/phicookbook

We hope this sample has inspired you to use Phi-4-mini-reasoning and Phi-4-reasoning to complete industry reasoning for your own scenarios.

Related resources

Phi-4-mini-reasoning Tech Report https://aka.ms/phi4-mini-reasoning/techreport

Phi-4-Mini-Reasoning Technical Report · microsoft/Phi-4-mini-reasoning

Phi-4-mini-reasoning on Azure AI Foundry https://aka.ms/phi4-mini-reasoning/azure

Phi-4 Reasoning Blog https://aka.ms/phi4-mini-reasoning/blog

Phi Cookbook https://aka.ms/phicookbook

Showcasing Phi-4-Reasoning: A Game-Changer for AI Developers | Microsoft Community Hub

Models 
Phi-4 Reasoning https://huggingface.co/microsoft/Phi-4-reasoning

Phi-4 Reasoning Plus https://huggingface.co/microsoft/Phi-4-reasoning-plus
Phi-4-mini-reasoning Hugging Face https://aka.ms/phi4-mini-reasoning/hf
Phi-4-mini-reasoning on Azure AI Foundry https://aka.ms/phi4-mini-reasoning/azure
Microsoft Models on Hugging Face
Phi-4 Reasoning Models on Azure AI Foundry

Access Phi-4-reasoning models

Phi Models at Azure AI Foundry Models
Phi Models on Hugging Face
Phi Models on GitHub Marketplace

 
