Add library_name and improve paper links
#5
by nielsr - opened
README.md
CHANGED
@@ -1,69 +1,70 @@
  ---
- license: mit
- language:
- - en
- tags:
- - multimodal
- - vision-language
- - reasoning
- - math
- - ocr
- - gui-grounding
- - computer-use
- - chain-of-thought
  base_model: microsoft/Phi-4-reasoning
  pipeline_tag: image-text-to-text
  model-index:
  ---

  # Phi-4-Reasoning-Vision-15B
@@ -71,13 +72,15 @@ model-index:
  [](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/)
  [](https://aka.ms/Phi-4-r-v-foundry)
  [](https://github.com/microsoft/phi-4-reasoning-vision-15B)
- [<br>
- [Technical Report](https://
  [Github](https://github.com/microsoft/phi-4-reasoning-vision-15B)<br>
  [Try Phi-4-Reasoning-Vision-15B on Microsoft Foundry](https://aka.ms/Phi-4-r-v-foundry)<br>

  **Developer:** Microsoft Corporation
  **Authorized Representative:** Microsoft Ireland Operations Limited, 70 Sir John Rogerson's Quay, Dublin 2, D02 R296, Ireland
  **Release Date:** March 4, 2026
@@ -99,10 +102,6 @@ Phi-4-Reasoning-Vision-15B is a compact open-weight multimodal reasoning model b

  Phi-4-Reasoning-Vision-15B is trained with Supervised Fine-Tuning (SFT) on a carefully curated mixture of reasoning and non-reasoning data. Rather than training separate models for each mode, the model operates as a single system that can invoke extended chain-of-thought reasoning (using `<think>...</think>` blocks) for tasks like mathematical and scientific reasoning, or default to direct inference (tagged with `<nothink>`) for perception-focused tasks such as captioning, object detection, and grounding. The training data consists primarily of meticulously filtered and improved open-source vision-language datasets, supplemented by high-quality domain-specific data from internal Microsoft teams and targeted data acquisitions. This data-centric approach, combined with moderate training compute requirements (240 NVIDIA B200 GPUs for 4 days), distinguishes Phi-4-Reasoning-Vision-15B from models that rely on substantially more training data and compute.

- ### 1.1 Alignment Approach
-
- Phi-4-Reasoning-Vision-15B has adopted a safety post-training approach leveraging a combination of open-source and in-house generated synthetic datasets. The safety alignment is achieved through Supervised Fine-Tuning (SFT) using data that includes both helpfulness and harmlessness examples, as well as targeted questions and answers across multiple safety categories. The model's training data explicitly includes safety-oriented samples designed to teach appropriate refusal behavior for harmful content categories including hate speech, violence, self-harm content, and sexually explicit material. Automated red teaming was performed on Azure to assess safety risks including groundedness, jailbreak susceptibility, harmful content generation, and copyright violations for protected material.
-
  ---

  ## 2. Usage
@@ -114,23 +113,68 @@ Phi-4-Reasoning-Vision-15B is designed for general-purpose multimodal AI systems
  - **Scientific and mathematical reasoning over visual inputs:** such as solving math problems presented as handwritten equations or diagrams, extracting and reasoning over quantitative information in documents, charts, and tables, and supporting multi-step reasoning in educational or scientific analysis contexts.
  - **Computer-use agent (CUA) tasks:** such as interpreting screen content, localizing interactive GUI elements, and selecting actions within graphical user interfaces.

- ### 2.

  Given the nature of the training data, always use the chat template and system prompt for inference. For example, for the prompt "Please describe the image", the fully formatted chat-templated prompt is the following:
@@ -152,101 +196,33 @@ To force a thinking response, append the `<think>` token to the generation templ
  <|im_start|>assistant<|im_sep|><think>
  ```

- To force a non-thinking response, append the `<nothink>` token to the generation template:
-
- ```
- <|im_start|>assistant<|im_sep|><nothink>
- ```
-
- ### 2.5 Technical Requirements and Integration Guidance
-
- The following software packages are required for running Phi-4-Reasoning-Vision:
-
- - `torch >= 2.7.1`
- - `transformers >= 4.57.1`
- - `vllm >= 0.15.2` (only required if using vLLM)
-
- Phi-4-Reasoning-Vision-15B has been tested on NVIDIA A6000, A100, H100, and B200 GPUs with the Ubuntu 22.04.5 LTS operating system. In principle, other GPU architectures with enough memory to fit the model could suffice, but these have not been tested. It is recommended that users host Phi-4-Reasoning-Vision-15B on a vLLM server using bf16 precision.
-
- ### 2.6 Responsible AI Considerations
-
- Like other models, Phi-4-Reasoning-Vision-15B can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
-
- - **Quality of Service:** The model is trained primarily on English text. Languages other than English may experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Phi-4-Reasoning-Vision-15B is not intended to support multilingual use.
- - **Representation of Harms & Perpetuation of Stereotypes:** The model may over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
- - **Inappropriate or Offensive Content:** The model may produce inappropriate or offensive content, which may make it inappropriate to deploy in sensitive contexts without additional mitigations specific to the use case.
- - **Information Reliability:** Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
-
- Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g., privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include:
-
- - **Allocation:** Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (e.g., housing, employment, credit) without further assessments and additional debiasing techniques.
- - **High-Risk Scenarios:** Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable, or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (e.g., legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
- - **Misinformation:** Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
- - **Generation of Harmful Content:** Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
- - **Misuse:** Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
-
  ---

  ## 3. Quality and Performance Evaluation

- Phi-4-Reasoning-Vision-15B was evaluated across a broad range of public benchmarks spanning multimodal reasoning, mathematical problem solving, document and chart understanding, visual perception, OCR, and computer-use grounding tasks.
-
- The model was evaluated on the following benchmarks via VLMEvalKit: AI2D (diagram understanding), BLINK (core visual perception), ChartQA (chart reasoning), DocVQA (document question answering), HallusionBench (hallucination and visual illusion detection), MathVerse (visual math with varying multimodal information), MathVision (competition-level mathematical reasoning), MathVista (math reasoning in visual contexts), MMMU (multi-discipline multimodal understanding), MMStar (vision-indispensable multimodal evaluation), OCRBench (OCR capabilities), ScreenSpot-V2 for Desktop, Mobile, and Web (GUI element localization), WeMath (human-like mathematical reasoning process evaluation), WildVision (real-world human preference evaluation), and ZeroBench (challenging visual reasoning). During development, additional benchmarks including MMMU-CoT, ScreenSpot-Pro, and V*Bench were evaluated using Eureka ML Insights.

  ### Table 1: Accuracy Comparisons Relative to Popular Open-Weight, Non-Thinking Models

- | Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-
- |---|---|---|---|---|---|---|---|
- | AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 |
- | ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 |
- | MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6 |
- | MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3 |
- | OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5 |
- | ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9 |
-
- ### Table 2: Accuracy Comparisons Relative to Popular Open-Weight, Thinking Models
-
- | Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B - force thinking | Kimi-VL-A3B-Thinking | gemma3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K |
- |---|---|---|---|---|---|---|---|---|
- | AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2 |
- | ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1 |
- | HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6 |
- | MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2 |
- | MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6 |
- | MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8 |
- | MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2 |
- | MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7 |
- | OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85 |
- | ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1 |
-
- ### 3.1 Safety Evaluation and Red-Teaming
-
- Phi-4-Reasoning-Vision-15B was trained on a mixture of public safety data and internally generated tasks that it ought to refuse based on Microsoft's Responsible AI Policy.
-
- Phi-4-Reasoning-Vision-15B's safety was evaluated using both quantitative and qualitative approaches prior to release. Automated red teaming was performed on Azure to assess safety risks across multiple risk categories, including disallowed content (sexual, violent, hateful, or self-harm content), copyright content and intellectual property, and jailbreak susceptibility. The evaluation assessed the model's groundedness and its tendency to generate fabricated or misleading information.
-
- The safety evaluation built upon the established practices from the Phi-4-Reasoning model's safety assessment. The model's training data included explicit safety-oriented samples across both reasoning and non-reasoning modes, designed to teach appropriate refusal and harm-avoidance behaviors. The multimodal nature of the model introduces additional safety considerations around visual content interpretation, and evaluations were conducted to assess the model's behavior when presented with potentially harmful or misleading visual inputs.
-
- | Evaluation | Description | Defect Rate |
- |---|---|---|
- | Text to Text Safety | Automated content safety evaluation measuring safety policies | 1.4% |
- | Image to Text Safety | Automated content safety evaluation measuring safety policies | 4.5% |

  ---

- ## 4.
-
- To learn more about the training data used for Phi-4-Reasoning-Vision-15B please refer to the full data card: RRRR_nnnn_Data Card for Foundation+Frontier Models.

  ---
@@ -258,40 +234,15 @@ Authorized representative: Microsoft Ireland Operations Limited, 70 Sir John Rog

  ---

- ## 6.
-
- ### A. Benchmarking Methodology
-
- Phi-4-Reasoning-Vision-15B was evaluated using two complementary open-source evaluation frameworks:
-
- **1. [Eureka ML Insights](https://github.com/microsoft/eureka-ml-insights)**
-
- Used during development for internal benchmarks and ablation studies. The following benchmarks were evaluated through this framework:
-
- - **MathVista:** Mathematical reasoning over visual inputs including diagrams, charts, and figures
- - **MMMU-CoT:** Multi-discipline multimodal understanding with chain-of-thought reasoning
- - **ScreenSpot / ScreenSpot-V2:** GUI element localization on desktop and mobile screenshots
- - **ScreenSpot-Pro:** High-resolution professional GUI grounding tasks
- - **V\*Bench:** Visual reasoning benchmark
-
- **2. [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)**
-
- Used for standardized community benchmark evaluation. The following benchmarks were evaluated through this framework:
-
- - **BLINK:** Core visual perception benchmark with 3,807 multiple-choice questions spanning 14 classic computer vision tasks including relative depth estimation, visual correspondence, and multi-view reasoning.
- - **ChartQA (TEST split):** Chart understanding and reasoning benchmark with 9,600 human-written questions assessing complex visual and logical reasoning over chart data.
- - **DocVQA (VAL split):** Document visual question answering over 12,000+ document images, evaluating text extraction and comprehension within document layouts.
- - **HallusionBench:** Diagnostic benchmark evaluating image-context reasoning, language hallucination tendencies, and visual illusion susceptibility in vision-language models.
- - **MathVerse (MINI split):** Visual math benchmark with 2,612 multi-subject math problems transformed into six versions offering varying degrees of multimodal information content.
- - **MathVision (MINI split):** 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions, spanning 16 mathematical disciplines across 5 difficulty levels.
- - **MathVista (MINI split):** Mathematical reasoning in visual contexts including geometry, algebra, and data interpretation.
- - **MMMU (DEV_VAL split):** Massive multi-discipline multimodal understanding benchmark with 11.5K questions from college exams covering six core disciplines and 30 subjects.
- - **MMStar:** Vision-indispensable multimodal benchmark with 1,500 carefully curated samples evaluating six core capabilities: coarse perception, fine-grained perception, instance reasoning, logical reasoning, science and technology, and mathematics.
- - **OCRBench:** Comprehensive OCR evaluation with 1,000 question-answer pairs spanning text recognition, scene text VQA, document-oriented VQA, key information extraction, and handwritten mathematical expression recognition.
- - **ScreenSpot-V2 (Desktop, Mobile, Web):** GUI element localization benchmark across desktop, mobile, and web interfaces.
- - **WeMath:** Mathematical reasoning process benchmark with 6.5K visual math problems spanning 67 hierarchical knowledge concepts, evaluating knowledge acquisition and generalization beyond end-to-end performance.
- - **WildVision:** Real-world human preference evaluation benchmark with 500 high-quality samples curated from 8,000 user submissions, using GPT-4o as judge.
- - **ZeroBench:** Challenging visual reasoning benchmark with 100 manually curated questions designed to probe the limits of spatial reasoning, object recognition, and complex visual scene interpretation.
  ---
  base_model: microsoft/Phi-4-reasoning
+ language:
+ - en
+ license: mit
  pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - multimodal
+ - vision-language
+ - reasoning
+ - math
+ - ocr
+ - gui-grounding
+ - computer-use
+ - chain-of-thought
  model-index:
+ - name: Phi-4-Reasoning-Vision-15B
+   results:
+   - task:
+       type: visual-question-answering
+     dataset:
+       name: AI2D
+       type: ai2d
+     metrics:
+     - type: accuracy
+       value: 84.8
+   - task:
+       type: visual-question-answering
+     dataset:
+       name: ChartQA
+       type: chartqa
+     metrics:
+     - type: accuracy
+       value: 83.3
+   - task:
+       type: visual-question-answering
+     dataset:
+       name: MathVista (MINI)
+       type: mathvista
+     metrics:
+     - type: accuracy
+       value: 75.2
+   - task:
+       type: visual-question-answering
+     dataset:
+       name: MMMU
+       type: mmmu
+     metrics:
+     - type: accuracy
+       value: 54.3
+   - task:
+       type: visual-question-answering
+     dataset:
+       name: OCRBench
+       type: ocrbench
+     metrics:
+     - type: accuracy
+       value: 76.0
+   - task:
+       type: visual-question-answering
+     dataset:
+       name: ScreenSpot-V2
+       type: screenspot-v2
+     metrics:
+     - type: accuracy
+       value: 88.2
  ---

  # Phi-4-Reasoning-Vision-15B
  [](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/)
  [](https://aka.ms/Phi-4-r-v-foundry)
  [](https://github.com/microsoft/phi-4-reasoning-vision-15B)
+ [](https://huggingface.co/papers/2603.03975)

  [Official Microsoft Blog](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/)<br>
+ [Technical Report](https://huggingface.co/papers/2603.03975)<br>
  [Github](https://github.com/microsoft/phi-4-reasoning-vision-15B)<br>
  [Try Phi-4-Reasoning-Vision-15B on Microsoft Foundry](https://aka.ms/Phi-4-r-v-foundry)<br>

+ **Authors:** Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, Eduardo Salinas.
+
  **Developer:** Microsoft Corporation
  **Authorized Representative:** Microsoft Ireland Operations Limited, 70 Sir John Rogerson's Quay, Dublin 2, D02 R296, Ireland
  **Release Date:** March 4, 2026

  Phi-4-Reasoning-Vision-15B is trained with Supervised Fine-Tuning (SFT) on a carefully curated mixture of reasoning and non-reasoning data. Rather than training separate models for each mode, the model operates as a single system that can invoke extended chain-of-thought reasoning (using `<think>...</think>` blocks) for tasks like mathematical and scientific reasoning, or default to direct inference (tagged with `<nothink>`) for perception-focused tasks such as captioning, object detection, and grounding. The training data consists primarily of meticulously filtered and improved open-source vision-language datasets, supplemented by high-quality domain-specific data from internal Microsoft teams and targeted data acquisitions. This data-centric approach, combined with moderate training compute requirements (240 NVIDIA B200 GPUs for 4 days), distinguishes Phi-4-Reasoning-Vision-15B from models that rely on substantially more training data and compute.

  ---

  ## 2. Usage
  - **Scientific and mathematical reasoning over visual inputs:** such as solving math problems presented as handwritten equations or diagrams, extracting and reasoning over quantitative information in documents, charts, and tables, and supporting multi-step reasoning in educational or scientific analysis contexts.
  - **Computer-use agent (CUA) tasks:** such as interpreting screen content, localizing interactive GUI elements, and selecting actions within graphical user interfaces.

+ ### 2.2 Sample Usage (Azure Foundry)
+
+ The following sample script demonstrates how to interact with the model via Azure Foundry, as provided in the official repository:
+
+ ```python
+ import base64
+
+ import requests
+
+ IMAGE_PATH = "<replace_with_your_image>.jpg"
+
+ ENDPOINT_BASE = "<your_base_endpoint_url>"
+ API_KEY = "<your_api_key_here>"
+ DEPLOYMENT_NAME = "Phi-4-Reasoning-Vision-15B"  # replace with your deployment name
+
+
+ def main():
+     with open(IMAGE_PATH, "rb") as f:
+         image_b64 = base64.b64encode(f.read()).decode("utf-8")
+
+     payload = {
+         "model": "Phi-4-Reasoning-Vision-15B",
+         "messages": [
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "text", "text": "Describe this image in detail."},
+                     {
+                         "type": "image_url",
+                         "image_url": {
+                             "url": f"data:image/jpeg;base64,{image_b64}"
+                         },
+                     },
+                 ],
+             }
+         ],
+         "max_tokens": 4096,
+         "temperature": 0.0,
+     }
+
+     headers = {
+         "Authorization": f"Bearer {API_KEY}",
+         "Content-Type": "application/json",
+         "azureml-model-deployment": DEPLOYMENT_NAME,
+     }
+
+     url = f"{ENDPOINT_BASE}/v1/chat/completions"
+     print(f"Requesting: {url}")
+
+     resp = requests.post(url, json=payload, headers=headers, timeout=120)
+     resp.raise_for_status()
+
+     result = resp.json()
+     print("\n--- Response ---")
+     print(result["choices"][0]["message"]["content"])
+
+
+ if __name__ == "__main__":
+     main()
+ ```
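The script above inlines the image as a base64 `data:` URL before posting it to the chat-completions endpoint. A minimal standalone sketch of just that encoding step (the `to_data_url` helper name is ours, not from the repository):

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in the "image_url" message field."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"


# Example: the three bytes b"abc" encode to "YWJj".
print(to_data_url(b"abc"))  # data:image/jpeg;base64,YWJj
```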

+ ### 2.3 Input Formats

  Given the nature of the training data, always use the chat template and system prompt for inference. For example, for the prompt "Please describe the image", the fully formatted chat-templated prompt is the following:
  <|im_start|>assistant<|im_sep|><think>
  ```
  ---

  ## 3. Quality and Performance Evaluation

+ Phi-4-Reasoning-Vision-15B was evaluated across a broad range of public benchmarks spanning multimodal reasoning, mathematical problem solving, document and chart understanding, visual perception, OCR, and computer-use grounding tasks.

  ### Table 1: Accuracy Comparisons Relative to Popular Open-Weight, Non-Thinking Models

+ | Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-32B-Instruct-4K |
+ |---|---|---|---|---|---|---|---|
+ | AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 84.8 |
+ | ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 84.3 |
+ | MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 82.5 |
+ | MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 68.6 |
+ | OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 88.5 |
+ | ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 93.7 |

  ---

+ ## 4. Responsible AI Considerations

+ Like other models, Phi-4-Reasoning-Vision-15B can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

+ - **Quality of Service:** The model is trained primarily on English text. Languages other than English may experience worse performance.
+ - **Information Reliability:** Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.

+ Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations.

  ---

  ---

+ ## 6. Citation

+ If you use Phi-4-Reasoning-Vision in your research, please use the following BibTeX entry.

+ ```bibtex
+ @article{phi4vr14b2026,
+   title={Phi-4-reasoning-vision-15B Technical Report},
+   author={Aneja, Jyoti and Harrison, Michael and Joshi, Neel and LaBonte, Tyler and Langford, John and Salinas, Eduardo},
+   journal={arXiv:2511.19663},
+   year={2026}
+ }
+ ```