Add library_name and improve paper links

#5
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +152 -201
README.md CHANGED
@@ -1,69 +1,70 @@
1
  ---
2
- license: mit
3
- language:
4
- - en
5
- tags:
6
- - multimodal
7
- - vision-language
8
- - reasoning
9
- - math
10
- - ocr
11
- - gui-grounding
12
- - computer-use
13
- - chain-of-thought
14
  base_model: microsoft/Phi-4-reasoning
15
  pipeline_tag: image-text-to-text
16
  model-index:
17
- - name: Phi-4-Reasoning-Vision-15B
18
- results:
19
- - task:
20
- type: visual-question-answering
21
- dataset:
22
- name: AI2D
23
- type: ai2d
24
- metrics:
25
- - type: accuracy
26
- value: 84.8
27
- - task:
28
- type: visual-question-answering
29
- dataset:
30
- name: ChartQA
31
- type: chartqa
32
- metrics:
33
- - type: accuracy
34
- value: 83.3
35
- - task:
36
- type: visual-question-answering
37
- dataset:
38
- name: MathVista (MINI)
39
- type: mathvista
40
- metrics:
41
- - type: accuracy
42
- value: 75.2
43
- - task:
44
- type: visual-question-answering
45
- dataset:
46
- name: MMMU
47
- type: mmmu
48
- metrics:
49
- - type: accuracy
50
- value: 54.3
51
- - task:
52
- type: visual-question-answering
53
- dataset:
54
- name: OCRBench
55
- type: ocrbench
56
- metrics:
57
- - type: accuracy
58
- value: 76.0
59
- - task:
60
- type: visual-question-answering
61
- dataset:
62
- name: ScreenSpot-V2
63
- type: screenspot-v2
64
- metrics:
65
- - type: accuracy
66
- value: 88.2
67
  ---
68
 
69
  # Phi-4-Reasoning-Vision-15B
@@ -71,13 +72,15 @@ model-index:
71
  [![Microsoft](https://img.shields.io/badge/Microsoft-Project-0078D4?logo=microsoft)](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/)
72
  [![Foundry](https://img.shields.io/badge/Azure-Foundry-0089D6)](https://aka.ms/Phi-4-r-v-foundry)
73
  [![Github](https://img.shields.io/badge/Github-181717?logo=github&logoColor=white)](https://github.com/microsoft/phi-4-reasoning-vision-15B)
74
- [![Paper](https://img.shields.io/badge/Paper-2511.19663-red)](https://aka.ms/Phi-4-reasoning-vision-15B-TR)
75
 
76
  [Official Microsoft Blog](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/)<br>
77
- [Technical Report](https://aka.ms/Phi-4-reasoning-vision-15B-TR)<br>
78
  [Github](https://github.com/microsoft/phi-4-reasoning-vision-15B)<br>
79
  [Try Phi-4-Reasoning-Vision-15B on Microsoft Foundry](https://aka.ms/Phi-4-r-v-foundry)<br>
80
 
 
 
81
  **Developer:** Microsoft Corporation
82
  **Authorized Representative:** Microsoft Ireland Operations Limited, 70 Sir John Rogerson's Quay, Dublin 2, D02 R296, Ireland
83
  **Release Date:** March 4, 2026
@@ -99,10 +102,6 @@ Phi-4-Reasoning-Vision-15B is a compact open-weight multimodal reasoning model b
99
 
100
  Phi-4-Reasoning-Vision-15B is trained with Supervised Fine-Tuning (SFT) on a carefully curated mixture of reasoning and non-reasoning data. Rather than training separate models for each mode, the model operates as a single system that can invoke extended chain-of-thought reasoning (using `<think>...</think>` blocks) for tasks like mathematical and scientific reasoning, or default to direct inference (tagged with `<nothink>`) for perception-focused tasks such as captioning, object detection, and grounding. The training data consists primarily of meticulously filtered and improved open-source vision-language datasets, supplemented by high-quality domain-specific data from internal Microsoft teams and targeted data acquisitions. This data-centric approach, combined with moderate training compute requirements (240 NVIDIA B200 GPUs for 4 days), distinguishes Phi-4-Reasoning-Vision-15B from models that rely on substantially more training data and compute.
101
 
102
- ### 1.1 Alignment Approach
103
-
104
- Phi-4-Reasoning-Vision-15B has adopted a safety post-training approach leveraging a combination of open-source and in-house generated synthetic datasets. The safety alignment is achieved through Supervised Fine-Tuning (SFT) using data that includes both helpfulness and harmlessness examples, as well as targeted questions and answers across multiple safety categories. The model's training data explicitly includes safety-oriented samples designed to teach appropriate refusal behavior for harmful content categories including hate speech, violence, self-harm content, and sexually explicit material. Automated red teaming was performed on Azure to assess safety risks including groundedness, jailbreak susceptibility, harmful content generation, and copyright violations for protected material.
105
-
106
  ---
107
 
108
  ## 2. Usage
@@ -114,23 +113,68 @@ Phi-4-Reasoning-Vision-15B is designed for general-purpose multimodal AI systems
114
  - **Scientific and mathematical reasoning over visual inputs:** such as solving math problems presented as handwritten equations or diagrams, extracting and reasoning over quantitative information in documents, charts, and tables, and supporting multi-step reasoning in educational or scientific analysis contexts.
115
  - **Computer-use agent (CUA) tasks:** such as interpreting screen content, localizing interactive GUI elements, and selecting actions within graphical user interfaces.
116
 
117
- The model is also capable of general multimodal tasks including image captioning, visual question answering, optical character recognition, object localization, and grounding. Its hybrid reasoning design allows it to produce fast, direct responses for perception-focused tasks while engaging in structured chain-of-thought reasoning when the task benefits from it, making it suitable as a building block for generative AI-powered features across a range of applications.
118
-
119
- ### 2.2 Out-of-Scope Use Cases
120
-
121
- Phi-4-Reasoning-Vision-15B is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of vision-language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.
122
-
123
- The model is trained primarily on English text and image-text pairs. Languages other than English may experience degraded performance. The model should not be used in scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (e.g., housing, employment, credit) without further assessments and additional debiasing techniques. It is not suitable for providing medical diagnoses, legal advice, or financial planning. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
124
-
125
- ### 2.3 Distribution Channels
126
-
127
- Some of Phi-4-Reasoning-Vision-15B's distribution channels include:
128
-
129
- - Public access through open-source repositories: [Hugging Face](https://huggingface.co/microsoft/Phi-4-Reasoning-Vision-15B)
130
- - Public access through open-source code repositories: [GitHub](https://github.com/microsoft/Phi-4-vision)
131
- - Enterprise or subscription-based access through [Azure AI Foundry](https://ai.azure.com)
 
 
132
 
133
- ### 2.4 Input Formats
134
 
135
  Given the nature of the training data, always use the chat template and a system prompt for inference. For example, for the prompt "Please describe the image", the fully formatted, chat-templated prompt is the following:
136
 
@@ -152,101 +196,33 @@ To force a thinking response, append the `<think>` token to the generation templ
152
  <|im_start|>assistant<|im_sep|><think>
153
  ```
154
 
155
- To force a non-thinking response, append the `<nothink>` token to the generation template:
156
-
157
- ```
158
- <|im_start|>assistant<|im_sep|><nothink>
159
- ```
160
-
161
- ### 2.5 Technical Requirements and Integration Guidance
162
-
163
- The following software packages are required for running Phi-4-Reasoning-Vision:
164
-
165
- - `torch >= 2.7.1`
166
- - `transformers >= 4.57.1`
167
- - `vllm >= 0.15.2` (only required if using vLLM)
168
-
169
- Phi-4-Reasoning-Vision-15B has been tested on NVIDIA A6000, A100, H100, and B200 GPUs with the Ubuntu 22.04.5 LTS operating system. In principle, other GPU architectures with enough memory to fit the model could suffice, but these have not been tested. It is recommended that users host Phi-4-Reasoning-Vision-15B on a vLLM server using bf16 precision.
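
For self-hosted serving, a request against the recommended vLLM deployment can look like the minimal sketch below; the server address, port, image path, and served model name are assumptions for illustration and should be adjusted to your setup:

```python
# Minimal sketch: query a locally hosted vLLM OpenAI-compatible server
# (assumes the model is already being served in bf16 on localhost:8000).
import base64
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed local vLLM server address
MODEL = "microsoft/Phi-4-Reasoning-Vision-15B"

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe the image"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    "max_tokens": 1024,
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```
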
170
-
171
- ### 2.6 Responsible AI Considerations
172
-
173
- Like other models, Phi-4-Reasoning-Vision-15B can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
174
-
175
- - **Quality of Service:** The model is trained primarily on English text. Languages other than English may experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Phi-4-Reasoning-Vision-15B is not intended to support multilingual use.
176
- - **Representation of Harms & Perpetuation of Stereotypes:** The model may over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
177
- - **Inappropriate or Offensive Content:** The model may produce inappropriate or offensive content, which may make it inappropriate to deploy in sensitive contexts without additional mitigations specific to the use case.
178
- - **Information Reliability:** Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
179
-
180
- Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g., privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include:
181
-
182
- - **Allocation:** Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (e.g., housing, employment, credit) without further assessments and additional debiasing techniques.
183
- - **High-Risk Scenarios:** Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable, or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (e.g., legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
184
- - **Misinformation:** Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
185
- - **Generation of Harmful Content:** Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
186
- - **Misuse:** Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
187
-
188
  ---
189
 
190
  ## 3. Quality and Performance Evaluation
191
 
192
- Phi-4-Reasoning-Vision-15B was evaluated across a broad range of public benchmarks spanning multimodal reasoning, mathematical problem solving, document and chart understanding, visual perception, OCR, and computer-use grounding tasks. Two evaluation frameworks were used: Microsoft's Eureka ML Insights for internal development benchmarks, and VLMEvalKit for standardized community benchmarks. Evaluation logs will be released publicly.
193
-
194
- The model was evaluated on the following benchmarks via VLMEvalKit: AI2D (diagram understanding), BLINK (core visual perception), ChartQA (chart reasoning), DocVQA (document question answering), HallusionBench (hallucination and visual illusion detection), MathVerse (visual math with varying multimodal information), MathVision (competition-level mathematical reasoning), MathVista (math reasoning in visual contexts), MMMU (multi-discipline multimodal understanding), MMStar (vision-indispensable multimodal evaluation), OCRBench (OCR capabilities), ScreenSpot-V2 for Desktop, Mobile, and Web (GUI element localization), WeMath (human-like mathematical reasoning process evaluation), WildVision (real-world human preference evaluation), and ZeroBench (challenging visual reasoning). During development, additional benchmarks including MMMU-CoT, ScreenSpot-Pro, and V*Bench were evaluated using Eureka ML Insights.
195
 
196
  ### Table 1: Accuracy Comparisons Relative to Popular Open-Weight, Non-Thinking Models
197
 
198
- | Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K |
199
- |---|---|---|---|---|---|---|---|---|---|
200
- | AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85 |
201
- | ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84 |
202
- | HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9 |
203
- | MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2 |
204
- | MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5 |
205
- | MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8 |
206
- | MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6 |
207
- | MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3 |
208
- | OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5 |
209
- | ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9 |
210
-
211
- ### Table 2: Accuracy Comparisons Relative to Popular Open-Weight, Thinking Models
212
-
213
- | Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B - force thinking | Kimi-VL-A3B-Thinking | gemma3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K |
214
- |---|---|---|---|---|---|---|---|---|
215
- | AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2 |
216
- | ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1 |
217
- | HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6 |
218
- | MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2 |
219
- | MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6 |
220
- | MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8 |
221
- | MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2 |
222
- | MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7 |
223
- | OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85 |
224
- | ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1 |
225
-
226
- ### 3.1 Safety Evaluation and Red-Teaming
227
-
228
- Phi-4-Reasoning-Vision-15B was trained on a mixture of public safety data and internally generated tasks that it ought to refuse based on Microsoft's Responsible AI Policy.
229
-
230
- Phi-4-Reasoning-Vision-15B's safety was evaluated using both quantitative and qualitative approaches prior to release. Automated red teaming was performed on Azure to assess safety risks across multiple risk categories, including disallowed content (sexual, violent, hateful, or self-harm content), copyright content and intellectual property, and jailbreak susceptibility. The evaluation assessed the model's groundedness and its tendency to generate fabricated or misleading information.
231
-
232
- The safety evaluation built upon the established practices from the Phi-4-Reasoning model's safety assessment. The model's training data included explicit safety-oriented samples across both reasoning and non-reasoning modes, designed to teach appropriate refusal and harm-avoidance behaviors. The multimodal nature of the model introduces additional safety considerations around visual content interpretation, and evaluations were conducted to assess the model's behavior when presented with potentially harmful or misleading visual inputs.
233
-
234
- | Evaluation | Description | Defect Rate |
235
- |---|---|---|
236
- | Text to Text Safety | Automated content safety evaluation measuring safety policies | 1.4% |
237
- | Image to Text Safety | Automated content safety evaluation measuring safety policies | 4.5% |
238
 
239
  ---
240
 
241
- ## 4. Data Overview
242
 
243
- ### 4.1 Training, Testing, and Validation Datasets
244
-
245
- To learn more about the training data used for Phi-4-Reasoning-Vision-15B please refer to the full data card: RRRR_nnnn_Data Card for Foundation+Frontier Models.
246
 
247
- ### 4.2 List of Data Sources
 
248
 
249
- To learn more about the training data used for Phi-4-Reasoning-Vision-15B please refer to the full data card: RRRR_nnnn_Data Card for Foundation+Frontier Models.
250
 
251
  ---
252
 
@@ -258,40 +234,15 @@ Authorized representative: Microsoft Ireland Operations Limited, 70 Sir John Rog
258
 
259
  ---
260
 
261
- ## 6. Appendix
262
-
263
- ### A. Benchmarking Methodology
264
-
265
- Phi-4-Reasoning-Vision-15B was evaluated using two complementary open-source evaluation frameworks:
266
-
267
- **1. [Eureka ML Insights](https://github.com/microsoft/eureka-ml-insights)**
268
-
269
- Used during development for internal benchmarks and ablation studies. The following benchmarks were evaluated through this framework:
270
-
271
- - **MathVista:** Mathematical reasoning over visual inputs including diagrams, charts, and figures
272
- - **MMMU-CoT:** Multi-discipline multimodal understanding with chain-of-thought reasoning
273
- - **ScreenSpot / ScreenSpot-V2:** GUI element localization on desktop and mobile screenshots
274
- - **ScreenSpot-Pro:** High-resolution professional GUI grounding tasks
275
- - **V\*Bench:** Visual reasoning benchmark
276
-
277
- **2. [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)**
278
-
279
- Used for standardized community benchmark evaluation. The following benchmarks were evaluated through this framework:
280
 
281
- - **AI2D (TEST split):** Diagram understanding over ~5K illustrative diagrams from grade school natural sciences, evaluating the ability to interpret diagrammatic elements, relationships, and structure.
282
- - **BLINK:** Core visual perception benchmark with 3,807 multiple-choice questions spanning 14 classic computer vision tasks including relative depth estimation, visual correspondence, and multi-view reasoning.
283
- - **ChartQA (TEST split):** Chart understanding and reasoning benchmark with 9,600 human-written questions assessing complex visual and logical reasoning over chart data.
284
- - **DocVQA (VAL split):** Document visual question answering over 12,000+ document images, evaluating text extraction and comprehension within document layouts.
285
- - **HallusionBench:** Diagnostic benchmark evaluating image-context reasoning, language hallucination tendencies, and visual illusion susceptibility in vision-language models.
286
- - **MathVerse (MINI split):** Visual math benchmark with 2,612 multi-subject math problems transformed into six versions offering varying degrees of multimodal information content.
287
- - **MathVision (MINI split):** 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions, spanning 16 mathematical disciplines across 5 difficulty levels.
288
- - **MathVista (MINI split):** Mathematical reasoning in visual contexts including geometry, algebra, and data interpretation.
289
- - **MMMU (DEV_VAL split):** Massive multi-discipline multimodal understanding benchmark with 11.5K questions from college exams covering six core disciplines and 30 subjects.
290
- - **MMStar:** Vision-indispensable multimodal benchmark with 1,500 carefully curated samples evaluating six core capabilities: coarse perception, fine-grained perception, instance reasoning, logical reasoning, science and technology, and mathematics.
291
- - **OCRBench:** Comprehensive OCR evaluation with 1,000 question-answer pairs spanning text recognition, scene text VQA, document-oriented VQA, key information extraction, and handwritten mathematical expression recognition.
292
- - **ScreenSpot-V2 (Desktop, Mobile, Web):** GUI element localization benchmark across desktop, mobile, and web interfaces.
293
- - **WeMath:** Mathematical reasoning process benchmark with 6.5K visual math problems spanning 67 hierarchical knowledge concepts, evaluating knowledge acquisition and generalization beyond end-to-end performance.
294
- - **WildVision:** Real-world human preference evaluation benchmark with 500 high-quality samples curated from 8,000 user submissions, using GPT-4o as judge.
295
- - **ZeroBench:** Challenging visual reasoning benchmark with 100 manually curated questions designed to probe the limits of spatial reasoning, object recognition, and complex visual scene interpretation.
296
 
297
- Evaluation logs will be released publicly.
 
 
 
1
  ---
 
 
2
  base_model: microsoft/Phi-4-reasoning
3
+ language:
4
+ - en
5
+ license: mit
6
  pipeline_tag: image-text-to-text
7
+ library_name: transformers
8
+ tags:
9
+ - multimodal
10
+ - vision-language
11
+ - reasoning
12
+ - math
13
+ - ocr
14
+ - gui-grounding
15
+ - computer-use
16
+ - chain-of-thought
17
  model-index:
18
+ - name: Phi-4-Reasoning-Vision-15B
19
+ results:
20
+ - task:
21
+ type: visual-question-answering
22
+ dataset:
23
+ name: AI2D
24
+ type: ai2d
25
+ metrics:
26
+ - type: accuracy
27
+ value: 84.8
28
+ - task:
29
+ type: visual-question-answering
30
+ dataset:
31
+ name: ChartQA
32
+ type: chartqa
33
+ metrics:
34
+ - type: accuracy
35
+ value: 83.3
36
+ - task:
37
+ type: visual-question-answering
38
+ dataset:
39
+ name: MathVista (MINI)
40
+ type: mathvista
41
+ metrics:
42
+ - type: accuracy
43
+ value: 75.2
44
+ - task:
45
+ type: visual-question-answering
46
+ dataset:
47
+ name: MMMU
48
+ type: mmmu
49
+ metrics:
50
+ - type: accuracy
51
+ value: 54.3
52
+ - task:
53
+ type: visual-question-answering
54
+ dataset:
55
+ name: OCRBench
56
+ type: ocrbench
57
+ metrics:
58
+ - type: accuracy
59
+ value: 76.0
60
+ - task:
61
+ type: visual-question-answering
62
+ dataset:
63
+ name: ScreenSpot-V2
64
+ type: screenspot-v2
65
+ metrics:
66
+ - type: accuracy
67
+ value: 88.2
68
  ---
69
 
70
  # Phi-4-Reasoning-Vision-15B
 
72
  [![Microsoft](https://img.shields.io/badge/Microsoft-Project-0078D4?logo=microsoft)](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/)
73
  [![Foundry](https://img.shields.io/badge/Azure-Foundry-0089D6)](https://aka.ms/Phi-4-r-v-foundry)
74
  [![Github](https://img.shields.io/badge/Github-181717?logo=github&logoColor=white)](https://github.com/microsoft/phi-4-reasoning-vision-15B)
75
+ [![Paper](https://img.shields.io/badge/Paper-2603.03975-red)](https://huggingface.co/papers/2603.03975)
76
 
77
  [Official Microsoft Blog](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/)<br>
78
+ [Technical Report](https://huggingface.co/papers/2603.03975)<br>
79
  [Github](https://github.com/microsoft/phi-4-reasoning-vision-15B)<br>
80
  [Try Phi-4-Reasoning-Vision-15B on Microsoft Foundry](https://aka.ms/Phi-4-r-v-foundry)<br>
81
 
82
+ **Authors:** Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, Eduardo Salinas.
83
+
84
  **Developer:** Microsoft Corporation
85
  **Authorized Representative:** Microsoft Ireland Operations Limited, 70 Sir John Rogerson's Quay, Dublin 2, D02 R296, Ireland
86
  **Release Date:** March 4, 2026
 
102
 
103
  Phi-4-Reasoning-Vision-15B is trained with Supervised Fine-Tuning (SFT) on a carefully curated mixture of reasoning and non-reasoning data. Rather than training separate models for each mode, the model operates as a single system that can invoke extended chain-of-thought reasoning (using `<think>...</think>` blocks) for tasks like mathematical and scientific reasoning, or default to direct inference (tagged with `<nothink>`) for perception-focused tasks such as captioning, object detection, and grounding. The training data consists primarily of meticulously filtered and improved open-source vision-language datasets, supplemented by high-quality domain-specific data from internal Microsoft teams and targeted data acquisitions. This data-centric approach, combined with moderate training compute requirements (240 NVIDIA B200 GPUs for 4 days), distinguishes Phi-4-Reasoning-Vision-15B from models that rely on substantially more training data and compute.
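
In practice, a response can be post-processed to separate the optional reasoning block from the final answer. The helper below is a minimal illustrative sketch (not part of the original card), assuming the decoded text contains the literal `<think>...</think>` or `<nothink>` tags described above:

```python
import re

def split_reasoning(generated_text: str) -> tuple[str, str]:
    """Illustrative helper: split an optional <think>...</think> block
    from the final answer of a decoded response."""
    match = re.search(r"<think>(.*?)</think>", generated_text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = generated_text[match.end():].strip()
        return reasoning, answer
    # Non-thinking responses are tagged with <nothink> and carry no reasoning block.
    return "", generated_text.replace("<nothink>", "", 1).strip()

reasoning, answer = split_reasoning("<think>The chart shows a rising trend.</think> Sales increased in 2024.")
print(answer)  # -> "Sales increased in 2024."
```
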
104
 
 
 
 
 
105
  ---
106
 
107
  ## 2. Usage
 
113
  - **Scientific and mathematical reasoning over visual inputs:** such as solving math problems presented as handwritten equations or diagrams, extracting and reasoning over quantitative information in documents, charts, and tables, and supporting multi-step reasoning in educational or scientific analysis contexts.
114
  - **Computer-use agent (CUA) tasks:** such as interpreting screen content, localizing interactive GUI elements, and selecting actions within graphical user interfaces.
115
 
116
+ ### 2.2 Sample Usage (Azure Foundry)
117
+
118
+ The following sample script demonstrates how to interact with the model via Azure Foundry, as provided in the official repository:
119
+
120
+ ```python
121
+ import base64
122
+ import os
123
+ import requests
124
+
125
+ IMAGE_PATH = "<replace_with_your_image>.jpg"
126
+
127
+ ENDPOINT_BASE = "<your_base_endpoint_url>"
128
+ API_KEY = "<your_api_key_here>"
129
+ DEPLOYMENT_NAME = "Phi-4-Reasoning-Vision-15B"  # replace with your deployment name
130
+
131
+ def main():
132
+     with open(IMAGE_PATH, "rb") as f:
133
+         image_b64 = base64.b64encode(f.read()).decode("utf-8")
134
+
135
+     payload = {
136
+         "model": "Phi-4-Reasoning-Vision-15B",
137
+         "messages": [
138
+             {
139
+                 "role": "user",
140
+                 "content": [
141
+                     {"type": "text", "text": "Describe this image in detail."},
142
+                     {
143
+                         "type": "image_url",
144
+                         "image_url": {
145
+                             "url": f"data:image/jpeg;base64,{image_b64}"
146
+                         },
147
+                     },
148
+                 ],
149
+             }
150
+         ],
151
+         "max_tokens": 4096,
152
+         "temperature": 0.0,
153
+     }
154
+
155
+     headers = {
156
+         "Authorization": f"Bearer {API_KEY}",
157
+         "Content-Type": "application/json",
158
+         "azureml-model-deployment": DEPLOYMENT_NAME,
159
+     }
160
+
161
+     url = f"{ENDPOINT_BASE}/v1/chat/completions"
162
+     print(f"Requesting: {url}")
163
+
164
+     resp = requests.post(url, json=payload, headers=headers, timeout=120)
165
+     resp.raise_for_status()
166
+
167
+     result = resp.json()
168
+     print("\n--- Response ---")
169
+     print(result["choices"][0]["message"]["content"])
170
+
171
+
172
+
173
+ if __name__ == "__main__":
174
+     main()
175
+ ```
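
The script posts to the endpoint's OpenAI-compatible `/v1/chat/completions` route, passes the image inline as a base64 data URL, selects the target deployment through the `azureml-model-deployment` header, and sets `temperature` to 0.0 for stable, greedy-style decoding.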
176
 
177
+ ### 2.3 Input Formats
178
 
179
  Given the nature of the training data, always use the chat template and a system prompt for inference. For example, for the prompt "Please describe the image", the fully formatted, chat-templated prompt is the following:
180
 
 
196
  <|im_start|>assistant<|im_sep|><think>
197
  ```
198
 
 
 
 
 
199
  ---
200
 
201
  ## 3. Quality and Performance Evaluation
202
 
203
+ Phi-4-Reasoning-Vision-15B was evaluated across a broad range of public benchmarks spanning multimodal reasoning, mathematical problem solving, document and chart understanding, visual perception, OCR, and computer-use grounding tasks.
 
 
204
 
205
  ### Table 1: Accuracy Comparisons Relative to Popular Open-Weight, Non-Thinking Models
206
 
207
+ | Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-32B-Instruct-4K |
208
+ |---|---|---|---|---|---|---|---|
209
+ | AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 84.8 |
210
+ | ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 84.3 |
211
+ | MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 82.5 |
212
+ | MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 68.6 |
213
+ | OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 88.5 |
214
+ | ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 93.7 |
 
 
 
215
 
216
  ---
217
 
218
+ ## 4. Responsible AI Considerations
219
 
220
+ Like other models, Phi-4-Reasoning-Vision-15B can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
 
 
221
 
222
+ - **Quality of Service:** The model is trained primarily on English text. Languages other than English may experience worse performance.
223
+ - **Information Reliability:** Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
224
 
225
+ Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations.
226
 
227
  ---
228
 
 
234
 
235
  ---
236
 
237
+ ## 6. Citation
 
 
 
238
 
239
+ If you use Phi-4-Reasoning-Vision in your research, please use the following BibTeX entry.
 
 
 
240
 
241
+ ```bibtex
242
+ @article{phi4vr14b2026,
243
+   title={Phi-4-reasoning-vision-15B Technical Report},
244
+   author={Aneja, Jyoti and Harrison, Michael and Joshi, Neel and LaBonte, Tyler and Langford, John and Salinas, Eduardo},
245
+   journal={arXiv:2511.19663},
246
+   year={2026}
247
+ }
248
+ ```