---
license: apache-2.0
---
# S1-VL-32B: Scientific Multimodal Reasoning Model
[中文版](./README_zh.md) | [English](./README.md)
## 🔬 Introduction
**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne AI team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms, **Scientific Reasoning** and **Thinking with Images**, and achieves state-of-the-art performance across multiple mainstream scientific multimodal evaluation benchmarks.
- **Scientific Reasoning**: Chain-of-thought-based multimodal scientific reasoning, designed for analyzing and solving complex, multi-step problems.
- **Thinking with Images**: Enables the model to actively invoke code tools during the reasoning process to perform image operations (cropping, zooming, image enhancement, bounding box annotation, and keypoint marking) before generating responses.
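The image operations listed above are the kind of code the model emits into its sandbox during reasoning. The following is an illustrative Pillow sketch of such operations, not S1-VL-32B's actual tool implementation; the function names here are hypothetical.

```python
# Illustrative sketch of "Thinking with Images"-style operations using Pillow.
# These are NOT S1-VL-32B's internal tools, just the kind of code the model
# can write into its sandbox to inspect an image more closely.
from PIL import Image, ImageDraw

def crop_and_zoom(img: Image.Image, box: tuple, factor: int = 2) -> Image.Image:
    """Crop a region of interest and magnify it for closer inspection."""
    region = img.crop(box)  # box = (left, upper, right, lower)
    w, h = region.size
    return region.resize((w * factor, h * factor), Image.LANCZOS)

def annotate_bbox(img: Image.Image, box: tuple, label: str = "") -> Image.Image:
    """Draw a bounding box (and optional label) on a copy of the image."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    draw.rectangle(box, outline="red", width=3)
    if label:
        draw.text((box[0], max(box[1] - 12, 0)), label, fill="red")
    return out
```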
We have established a **cross-disciplinary data processing pipeline** that evaluates and filters visual reasoning trajectories along multiple utility dimensions to ensure the quality of the training trajectories.
<div align="center">
<img src="./image/data_pipeline.png"/>
</div>
We adopt a **four-stage post-training procedure** to progressively unlock the scientific reasoning capabilities of S1-VL-32B:
- **Stage 1 - Scientific Reasoning SFT**: Large-scale multimodal instruction data spanning multiple disciplines (**mathematics, physics, chemistry, astronomy, earth sciences, and biology**) is used for mixed training to enhance the model's scientific visual understanding and logical reasoning abilities, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
- **Stage 2 - Thinking-with-Images Cold-Start SFT**: The **Thinking with Images** reasoning paradigm is introduced. Through joint training with high-quality **scientific reasoning curriculum learning data** and image-thinking data, the model acquires the ability to perform **image operations via code** during inference. This approach yields particularly outstanding performance in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and astronomical observation data (S1-VL-32B-SFT).
- **Stage 3 - Scientific Reasoning RL**: Based on the **SAPO algorithm** and a multi-task scientific reward function, reinforcement learning is applied to challenging scientific multimodal reasoning samples to push beyond the performance ceiling of the SFT stage.
- **Stage 4 - Thinking-with-Images RL**: Based on the **SAPO algorithm** and a four-dimensional composite reward function, the model's image operation invocation timing and quality are further optimized, enabling stable and efficient multi-round visual reasoning (S1-VL-32B-RL).
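The exact reward terms and weights for Stage 4 are defined in the technical report, not here. Purely as a hypothetical illustration, a four-dimensional composite reward can be modeled as a weighted sum of per-dimension scores; the dimension names below are assumptions:

```python
# Hypothetical sketch of a composite reward. The actual Stage 4 reward
# dimensions and weights are defined in the S1-VL technical report;
# the keys and values below are illustrative only.
def composite_reward(scores: dict[str, float],
                     weights: dict[str, float]) -> float:
    """Weighted sum over reward dimensions."""
    assert set(scores) == set(weights), "each dimension needs a weight"
    return sum(weights[k] * scores[k] for k in scores)

# Example with four illustrative dimensions (assumed names):
r = composite_reward(
    {"correctness": 1.0, "format": 1.0, "tool_valid": 1.0, "tool_efficiency": 0.5},
    {"correctness": 0.6, "format": 0.1, "tool_valid": 0.2, "tool_efficiency": 0.1},
)
```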
<div align="center">
<img src="./image/s1-vl-training-pipeline.png"/>
</div>
🔥 **[NEW]** Technical report released: [S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images](https://arxiv.org/abs/2604.21409)
🔥 **[NEW]** Stage 3 and Stage 4 reinforcement learning training added; [S1-VL-32B-RL](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) model weights updated.
## 📦 Model Weights
| Model | Parameters | HuggingFace | ModelScope |
|-------|-----------|-------------|------------|
| S1-VL-32B-SFT | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
| S1-VL-32B-RL | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B-RL) |
## 📊 Evaluation Results
The evaluation covers **2 dimensions** and **13 benchmarks**. The **Scientific Multimodal Reasoning** dimension includes MMMU, SFE, MathVision, Physics, ScienceOlympiad, VRSBench-MINI, GMAI-MMBench, and Galaxy-10-DECaLS, spanning mathematics, physics, medicine, remote sensing, astronomy, and other professional fields. The **Image Manipulation Reasoning** dimension includes HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, focusing on high-resolution image understanding and real-world visual reasoning.
<div align="center">
<img src="./image/s1-vl-32b-benchmark.png"/>
</div>
S1-VL-32B demonstrates strong overall competitiveness across these evaluations. In **scientific multimodal reasoning** tasks, it leads on multiple authoritative benchmarks, including MMMU, MathVision, and VRSBench-MINI, surpassing its base model Qwen3-VL-32B in overall performance while remaining highly competitive against open-source models with substantially larger parameter counts (e.g., Qwen3-VL-235B, Intern-S1) as well as closed-source flagship models (e.g., Gemini 2.5 Pro, GPT-5). In **image manipulation reasoning** tasks, S1-VL-32B ranks **first on all five benchmarks**, outperforming models of comparable and larger scale and surpassing dedicated "Thinking with Images" models such as Thyme-VL and Skywork-R1V4. These results validate its ability to deliver efficient, high-quality multimodal reasoning at the 32B parameter scale.
## 🧠 Case Study
The following presents a reasoning example of S1-VL-32B operating in **Thinking with Images** mode. When processing a low-resolution cervical CT image, S1-VL-32B proactively invokes code tools during its reasoning process to **crop and magnify** the region of interest. After obtaining a clearer local view, the model combines the enhanced visual information with its internal knowledge to complete the reasoning.
<div align="center">
<img src="./image/s1-vl-32b-twi.png"/>
</div>
📚 More cases are available in [CASES.md](./CASES.md).
## 🚀 Quick Start
### 1. Install Dependencies
```bash
# Requires vLLM >= 0.11.0
pip install -U vllm
pip install qwen-vl-utils==0.0.14
```
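If in doubt about whether the installed vLLM satisfies the minimum, a quick check can be run. This is a minimal sketch: it only reads package metadata, compares the numeric version prefix, and degrades gracefully when vLLM is absent.

```python
# Sanity-check the installed vLLM version against the >= 0.11.0 requirement.
# Minimal sketch: compares dotted version prefixes numerically.
from importlib import metadata

def version_tuple(v: str) -> tuple:
    """'0.11.0' -> (0, 11, 0); non-numeric suffixes like 'rc1' are ignored."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

try:
    installed = metadata.version("vllm")
    ok = version_tuple(installed) >= version_tuple("0.11.0")
    print(f"vllm {installed}: {'OK' if ok else 'too old, need >= 0.11.0'}")
except metadata.PackageNotFoundError:
    print("vllm is not installed")
```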
### 2. Start the vLLM Service
```bash
vllm serve ScienceOne-AI/S1-VL-32B-RL \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--limit-mm-per-prompt image=15 \
--reasoning-parser deepseek_r1 \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--port 9200
```
### 3. Multimodal Reasoning Mode
```python
from openai import OpenAI
import base64

client = OpenAI(api_key="EMPTY", base_url="http://localhost:9200/v1")

# Encode the input image as base64
with open("path/to/your/image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ScienceOne-AI/S1-VL-32B-RL",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": "Please describe the physical phenomenon shown in the image and derive the relevant equations."},
            ],
        }
    ],
    temperature=0.2,
    max_tokens=16384,
)

# The reasoning process is in the reasoning_content field
print("Thinking process:\n", response.choices[0].message.reasoning_content)
print("\nFinal answer:\n", response.choices[0].message.content)
```
### 4. Thinking with Images Mode
Thinking with Images mode requires deploying a **code sandbox** to support the model invoking code tools during reasoning for image operations (cropping, zooming, enhancement, annotation, etc.).
#### Step 1: Deploy the Code Sandbox
We recommend deploying the AIO Sandbox with Docker:
```bash
git clone https://github.com/agent-infra/sandbox
cd sandbox
# Mount the host image directory into the container
# (host path -> sandbox path)
docker run -d \
--name twi-sandbox \
-p 18081:18081 \
-v /data/images:/mnt/data/images \
sandbox:latest
```
The mount path must match the path configuration in the FastAPI service.
#### Step 2: Start the Thinking with Images FastAPI Service
Download [twi_server.py](twi_server.py) and update the path configuration at the top of the file:
```python
CHAT_API = "http://localhost:9200/v1/chat/completions" # vLLM address
JUPYTER_API = "http://localhost:18081/v1/jupyter" # Sandbox address
HOST_IMG_DIR = "/data/images"  # Host image directory (must match the docker -v mount)
```
Start the service:
```bash
pip install fastapi uvicorn httpx pillow
python twi_server.py # Listens on port 10044
```
#### Step 3: Call the Thinking with Images Endpoint
```python
import httpx
import base64

# Encode the input image as base64
with open("path/to/your/image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {"type": "text", "text": "Please carefully analyze this scientific image."},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
]

response = httpx.post(
    "http://localhost:10044/process",
    json={
        "messages": messages,
        "image_path_list": ["/data/images/your_image.png"],  # Absolute host path
    },
    timeout=300,
)
result = response.json()

# The final answer is the last message with role="assistant"
final = [m for m in result["messages"] if m["role"] == "assistant"][-1]
print(final["content"])
```
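Since `result["messages"]` contains the full multi-turn trajectory, it can also be split into the tool-use turns and the final answer. The sketch below assumes the response format shown above; the `"tool"` role used for sandbox outputs is an assumption, not a documented contract of `twi_server.py`.

```python
# Sketch: split a Thinking-with-Images trajectory into tool turns and the
# final answer. Assumes messages are {"role": ..., "content": ...} dicts;
# the "tool" role for sandbox outputs is an assumption.
def split_trajectory(messages: list[dict]) -> tuple[list[dict], str]:
    assistant = [m for m in messages if m["role"] == "assistant"]
    tool_turns = [m for m in messages if m["role"] == "tool"]
    final = assistant[-1]["content"] if assistant else ""
    return tool_turns, final

# Example with a mock trajectory:
mock = [
    {"role": "user", "content": "Analyze this image."},
    {"role": "assistant", "content": "Let me crop the region of interest."},
    {"role": "tool", "content": "<cropped image>"},
    {"role": "assistant", "content": "The region shows the answer."},
]
turns, answer = split_trajectory(mock)
```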
## 📖 Citation
If you use S1-VL-32B in your research, please cite:
```latex
@article{li2026s1vl,
title = {S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images},
author = {Li, Qingxiao and Xu, Lifeng and Wang, QingLi and Bai, Yudong and Ou, Mingwei and Hu, Shu and Xu, Nan},
journal = {arXiv preprint arXiv:2604.21409},
year = {2026},
}
```
## 📄 License
This project is released under the Apache 2.0 License.
## 🙏 Acknowledgements
We thank the open-source communities and pioneering works of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B. |