---
license: apache-2.0
---
# S1-VL-32B: Scientific Multimodal Reasoning Model

[δΈ­ζ–‡η‰ˆ](./README_zh.md) | [English](./README.md)

## πŸ”¬ Introduction

**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne AI team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms β€” **Scientific Reasoning** and **Thinking with Images** β€” and achieves state-of-the-art performance across multiple mainstream scientific multimodal evaluation benchmarks.

- **Scientific Reasoning**: Chain-of-thought-based multimodal scientific reasoning, designed for analyzing and solving complex, multi-step problems.
- **Thinking with Images**: Enables the model to actively invoke code tools during reasoning to perform image operations β€” including cropping, zooming, image enhancement, bounding box annotation, and keypoint marking β€” before generating a response (see the sketch below).
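
For intuition, here is a minimal, purely illustrative sketch of the kind of image-manipulation code such a tool call might execute. It assumes Pillow is available in the sandbox; the function name and parameters are hypothetical, not the model's actual tool schema.

```python
from PIL import Image, ImageEnhance

# Illustrative only: the kind of crop-and-zoom operation the model
# might emit as code during "Thinking with Images" reasoning.
def crop_and_zoom(path: str, box: tuple[int, int, int, int],
                  scale: float = 2.0, contrast: float = 1.2) -> Image.Image:
    """Crop a region of interest, upscale it, and boost contrast."""
    img = Image.open(path).convert("RGB")
    region = img.crop(box)  # box = (left, upper, right, lower)
    w, h = region.size
    region = region.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    return ImageEnhance.Contrast(region).enhance(contrast)

# Example: magnify a suspicious region before re-inspecting it.
patch = crop_and_zoom("image.png", box=(120, 80, 360, 240))
patch.save("image_zoomed.png")
```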

We have established a **cross-disciplinary data processing pipeline** that performs multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure the quality of the training data.

<div align="center">
<img src="./image/data_pipeline.png"/>
</div>
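
The concrete evaluation dimensions are described in the technical report; as a rough, purely illustrative sketch (the dimensions, weights, and threshold below are all assumptions), a utility filter over candidate trajectories could look like this:

```python
from dataclasses import dataclass

# Illustrative only: the scoring dimensions and weights are assumptions,
# not the pipeline's actual filtering criteria.
@dataclass
class Trajectory:
    answer_correct: bool      # final answer matches the reference
    steps_verified: float     # fraction of reasoning steps that check out
    tool_calls_valid: float   # fraction of image operations that executed

def utility(t: Trajectory) -> float:
    """Weighted combination of per-dimension scores."""
    return (0.5 * float(t.answer_correct)
            + 0.3 * t.steps_verified
            + 0.2 * t.tool_calls_valid)

def filter_trajectories(pool: list[Trajectory],
                        threshold: float = 0.8) -> list[Trajectory]:
    """Keep only trajectories above an assumed utility threshold."""
    return [t for t in pool if utility(t) >= threshold]
```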

We adopt a **four-stage progressive post-training procedure** to unlock the scientific reasoning capabilities of S1-VL-32B:

- **Stage 1 - Scientific Reasoning SFT**: Large-scale multimodal instruction data spanning multiple disciplines β€” including **mathematics, physics, chemistry, astronomy, earth sciences, and biology** β€” is used for joint training to enhance the model's scientific visual understanding and logical reasoning, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
- **Stage 2 - Thinking-with-Images Cold-Start SFT**: The **Thinking with Images** reasoning paradigm is introduced. Through joint training with high-quality **scientific reasoning curriculum learning data** and image-thinking data, the model acquires the ability to perform **image operations via code** during inference. This approach yields particularly outstanding performance in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and astronomical observation data (S1-VL-32B-SFT).
- **Stage 3 - Scientific Reasoning RL**: Based on the **SAPO algorithm** and a multi-task scientific reward function, reinforcement learning is applied to challenging scientific multimodal reasoning samples to push beyond the performance ceiling of the SFT stage.
- **Stage 4 - Thinking-with-Images RL**: Based on the **SAPO algorithm** and a four-dimensional composite reward function, the timing and quality of the model's image-operation invocations are further optimized, enabling stable and efficient multi-round visual reasoning (S1-VL-32B-RL). An illustrative sketch of such a composite reward appears after the figure below.

<div align="center">
<img src="./image/s1-vl-training-pipeline.png"/>
</div>
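
For concreteness, the following is a purely illustrative sketch of a four-term composite reward; the actual reward dimensions and weights used in Stage 4 are defined in the technical report, and every term below is an assumption.

```python
# Illustrative only: the four terms and weights are assumptions, not the
# Stage 4 reward actually used with SAPO (see the technical report).
def composite_reward(answer_correct: bool,
                     format_ok: bool,
                     tool_calls_valid: float,  # fraction of image ops that executed
                     efficiency: float) -> float:  # [0, 1], higher = fewer wasted turns
    """Weighted sum of four per-rollout reward terms."""
    return (0.6 * float(answer_correct)
            + 0.1 * float(format_ok)
            + 0.2 * tool_calls_valid
            + 0.1 * efficiency)
```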

πŸ”₯ **[NEW]** Technical report released: [S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images](https://arxiv.org/abs/2604.21409)    
πŸ”₯ **[NEW]** Stage 3 and Stage 4 reinforcement learning training added; [S1-VL-32B-RL](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) model weights updated.


## πŸ“‚ Model Weights

| Model | Parameters | HuggingFace | ModelScope |
|-------|-----------|-------------|------------|
| S1-VL-32B-SFT | 32B | πŸ€— [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | πŸ€– [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
| S1-VL-32B-RL | 32B | πŸ€— [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) | πŸ€– [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B-RL) |


## πŸ† Evaluation Results

The evaluation covers **2 dimensions** and **13 benchmarks**. The **Scientific Multimodal Reasoning** dimension includes MMMU, SFE, MathVision, Physics, ScienceOlympiad, VRSBench-MINI, GMAI-MMBench, and Galaxy-10-DECaLS, spanning mathematics, physics, medicine, remote sensing, astronomy, and other professional fields. The **Image Manipulation Reasoning** dimension includes HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, focusing on high-resolution image understanding and real-world visual reasoning.

<div align="center">
<img src="./image/s1-vl-32b-benchmark.png"/>
</div>

S1-VL-32B demonstrates strong overall competitiveness across these evaluations. In **scientific multimodal reasoning** tasks, the model achieves significant advantages on multiple authoritative benchmarks β€” including MMMU, MathVision, and VRSBench-MINI β€” surpassing its base model Qwen3-VL-32B in overall performance while remaining highly competitive against open-source models with substantially larger parameter counts (e.g., Qwen3-VL-235B, Intern-S1) as well as closed-source flagship models (e.g., Gemini 2.5 Pro, GPT-5). In **image manipulation reasoning** tasks, S1-VL-32B ranks **first on all five benchmarks**, outperforming models of comparable and larger scale and surpassing dedicated "Thinking with Images" models such as Thyme-VL and Skywork-R1V4. These results validate its ability to deliver efficient, high-quality multimodal reasoning at the 32B parameter scale.

## 🧠 Case Study

The following presents a reasoning example of S1-VL-32B operating in **Thinking with Images** mode. When processing a low-resolution cervical CT image, S1-VL-32B proactively invokes code tools during its reasoning to **crop and magnify** the region of interest. With the resulting clearer local view, the model combines the enhanced visual information with its internal knowledge to complete the reasoning.

<div align="center">
<img src="./image/s1-vl-32b-twi.png"/>
</div>

πŸ“ More cases are available in [CASES.md](./CASES.md).

## πŸš€ Quick Start

### 1. Install Dependencies

```bash
# Requires vLLM >= 0.11.0
pip install -U vllm
pip install qwen-vl-utils==0.0.14
```
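
To confirm the environment, you can check the installed vLLM version from Python (`vllm` exposes `__version__`):

```python
import vllm

# The serving setup below assumes vLLM >= 0.11.0.
print(vllm.__version__)
```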

### 2. Start the vLLM Service

```bash
vllm serve ScienceOne-AI/S1-VL-32B-RL \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --limit-mm-per-prompt image=15 \
    --reasoning-parser deepseek_r1 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.95 \
    --port 9200
```
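
Once the server is up, a quick way to verify it is to list the served models through the OpenAI-compatible API:

```python
from openai import OpenAI

# Sanity check: the served model list should include "ScienceOne-AI/S1-VL-32B-RL".
client = OpenAI(api_key="EMPTY", base_url="http://localhost:9200/v1")
print([m.id for m in client.models.list().data])
```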

### 3. Multimodal Reasoning Mode

```python
from openai import OpenAI
import base64

client = OpenAI(api_key="EMPTY", base_url="http://localhost:9200/v1")

with open("path/to/your/image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ScienceOne-AI/S1-VL-32B-RL",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": "Please describe the physical phenomenon shown in the image and derive the relevant equations."},
            ],
        }
    ],
    temperature=0.2,
    max_tokens=16384,
)

# The reasoning process is in the reasoning_content field
print("Thinking process:\n", response.choices[0].message.reasoning_content)
print("\nFinal answer:\n", response.choices[0].message.content)
```

### 4. Thinking with Images Mode

Thinking with Images mode requires a deployed **code sandbox** so that the model can invoke code tools during reasoning to perform image operations (cropping, zooming, enhancement, annotation, etc.).

#### Step 1: Deploy the Code Sandbox

We recommend deploying the AIO Sandbox with Docker:

```bash
git clone https://github.com/agent-infra/sandbox
cd sandbox
# Mount the host image directory into the container
# (host path /data/images β†’ sandbox path /mnt/data/images)
docker run -d \
    --name twi-sandbox \
    -p 18081:18081 \
    -v /data/images:/mnt/data/images \
    sandbox:latest
```
The mount path must match the path configuration in the FastAPI service.

#### Step 2: Start the Thinking with Images FastAPI Service

Download [twi_server.py](twi_server.py) and update the path configuration at the top of the file:

```python
CHAT_API        = "http://localhost:9200/v1/chat/completions"  # vLLM address
JUPYTER_API     = "http://localhost:18081/v1/jupyter"          # Sandbox address
HOST_IMG_DIR    = "/data/images"     # ← Host image directory (must match docker -v mount)
```

Start the service:

```bash
pip install fastapi uvicorn httpx pillow
python twi_server.py   # Listens on port 10044
```

#### Step 3: Call the Thinking with Images Endpoint

```python
import httpx
import base64

with open("path/to/your/image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {"type": "text", "text": "Please carefully analyze this scientific image."},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
]

response = httpx.post(
    "http://localhost:10044/process",
    json={
        "messages": messages,
        "image_path_list": ["/data/images/your_image.png"],  # Absolute host path
    },
    timeout=300,
)

result = response.json()

# The final answer is the last message with role="assistant"
final = [m for m in result["messages"] if m["role"] == "assistant"][-1]
print(final["content"])
```
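
The returned `messages` list also contains the intermediate tool-call turns, which is useful for inspecting when and how the model manipulated the image. A minimal inspection loop (assuming each message is a dict with `role` and `content` keys, as in the extraction above):

```python
# Print the full multi-round trajectory, truncating long content.
for i, msg in enumerate(result["messages"]):
    content = msg["content"]
    if not isinstance(content, str):  # e.g., multimodal parts with cropped images
        content = str(content)
    print(f"[{i}] {msg['role']}: {content[:120]}")
```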

## πŸ“„ Citation

If you use S1-VL-32B in your research, please cite:

```bibtex
@article{li2026s1vl,
  title     = {S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images},
  author    = {Li, Qingxiao and Xu, Lifeng and Wang, QingLi and Bai, Yudong and Ou, Mingwei and Hu, Shu and Xu, Nan},
  journal   = {arXiv preprint arXiv:2604.21409},
  year      = {2026},
}
```

## πŸ“œ License

This project is released under the Apache 2.0 License.

## πŸ™ Acknowledgements

We thank the open-source communities and pioneering works of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B.