---
license: apache-2.0
---
# S1-VL-32B: Scientific Multimodal Reasoning Model

[δΈ­ζ–‡η‰ˆ](./README_zh.md) | [English](./README.md)

## πŸ”¬ Introduction

**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne AI team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms β€” **Scientific Reasoning** and **Thinking with Images** β€” and achieves state-of-the-art performance across multiple mainstream scientific multimodal evaluation benchmarks.

- **Scientific Reasoning**: Chain-of-thought-based multimodal scientific reasoning, designed for analyzing and solving complex, multi-step problems.
- **Thinking with Images**: Enables the model to actively invoke code tools during reasoning to perform image operations β€” including cropping, zooming, image enhancement, bounding box annotation, and keypoint marking β€” before generating a response (see the sketch below).
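
For intuition, here is a minimal, purely illustrative sketch of the kind of image-manipulation code such a tool call might execute. It assumes Pillow is available in the sandbox; the function name and parameters are hypothetical, not the model's actual tool schema.

```python
from PIL import Image, ImageEnhance

# Illustrative only: the kind of crop-and-zoom operation the model
# might emit as code during "Thinking with Images" reasoning.
def crop_and_zoom(path: str, box: tuple[int, int, int, int],
                  scale: float = 2.0, contrast: float = 1.2) -> Image.Image:
    """Crop a region of interest, upscale it, and boost contrast."""
    img = Image.open(path).convert("RGB")
    region = img.crop(box)  # box = (left, upper, right, lower)
    w, h = region.size
    region = region.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    return ImageEnhance.Contrast(region).enhance(contrast)

# Example: magnify a suspicious region before re-inspecting it.
patch = crop_and_zoom("image.png", box=(120, 80, 360, 240))
patch.save("image_zoomed.png")
```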

We have established a **cross-disciplinary data processing pipeline** that performs multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure the quality of the training data.

<div align="center">
<img src="./image/data_pipeline.png"/>
</div>
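
The concrete evaluation dimensions are described in the technical report; as a rough, purely illustrative sketch (the dimensions, weights, and threshold below are all assumptions), a utility filter over candidate trajectories could look like this:

```python
from dataclasses import dataclass

# Illustrative only: the scoring dimensions and weights are assumptions,
# not the pipeline's actual filtering criteria.
@dataclass
class Trajectory:
    answer_correct: bool      # final answer matches the reference
    steps_verified: float     # fraction of reasoning steps that check out
    tool_calls_valid: float   # fraction of image operations that executed

def utility(t: Trajectory) -> float:
    """Weighted combination of per-dimension scores."""
    return (0.5 * float(t.answer_correct)
            + 0.3 * t.steps_verified
            + 0.2 * t.tool_calls_valid)

def filter_trajectories(pool: list[Trajectory],
                        threshold: float = 0.8) -> list[Trajectory]:
    """Keep only trajectories above an assumed utility threshold."""
    return [t for t in pool if utility(t) >= threshold]
```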

We adopt a **four-stage progressive post-training procedure** to unlock the scientific reasoning capabilities of S1-VL-32B:

- **Stage 1 - Scientific Reasoning SFT**: Large-scale multimodal instruction data spanning multiple disciplines β€” including **mathematics, physics, chemistry, astronomy, earth sciences, and biology** β€” is used for joint training to enhance the model's scientific visual understanding and logical reasoning, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
- **Stage 2 - Thinking-with-Images Cold-Start SFT**: The **Thinking with Images** reasoning paradigm is introduced. Through joint training with high-quality **scientific reasoning curriculum learning data** and image-thinking data, the model acquires the ability to perform **image operations via code** during inference. This approach yields particularly outstanding performance in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and astronomical observation data (S1-VL-32B-SFT).
- **Stage 3 - Scientific Reasoning RL**: Based on the **SAPO algorithm** and a multi-task scientific reward function, reinforcement learning is applied to challenging scientific multimodal reasoning samples to push beyond the performance ceiling of the SFT stage.
- **Stage 4 - Thinking-with-Images RL**: Based on the **SAPO algorithm** and a four-dimensional composite reward function, the timing and quality of the model's image-operation invocations are further optimized, enabling stable and efficient multi-round visual reasoning (S1-VL-32B-RL). An illustrative sketch of such a composite reward appears after the figure below.

<div align="center">
<img src="./image/s1-vl-training-pipeline.png"/>
</div>
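
For concreteness, the following is a purely illustrative sketch of a four-term composite reward; the actual reward dimensions and weights used in Stage 4 are defined in the technical report, and every term below is an assumption.

```python
# Illustrative only: the four terms and weights are assumptions, not the
# Stage 4 reward actually used with SAPO (see the technical report).
def composite_reward(answer_correct: bool,
                     format_ok: bool,
                     tool_calls_valid: float,  # fraction of image ops that executed
                     efficiency: float) -> float:  # [0, 1], higher = fewer wasted turns
    """Weighted sum of four per-rollout reward terms."""
    return (0.6 * float(answer_correct)
            + 0.1 * float(format_ok)
            + 0.2 * tool_calls_valid
            + 0.1 * efficiency)
```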

πŸ”₯ **[NEW]** Technical report released: [S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images](https://arxiv.org/abs/2604.21409)    
πŸ”₯ **[NEW]** Stage 3 and Stage 4 reinforcement learning training added; [S1-VL-32B-RL](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) model weights updated.


## πŸ“‚ Model Weights

| Model | Parameters | HuggingFace | ModelScope |
|-------|-----------|-------------|------------|
| S1-VL-32B-SFT | 32B | πŸ€— [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | πŸ€– [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
| S1-VL-32B-RL | 32B | πŸ€— [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) | πŸ€– [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B-RL) |


## πŸ† Evaluation Results

The evaluation covers **2 dimensions** and **13 benchmarks**. The **Scientific Multimodal Reasoning** dimension includes MMMU, SFE, MathVision, Physics, ScienceOlympiad, VRSBench-MINI, GMAI-MMBench, and Galaxy-10-DECaLS, spanning mathematics, physics, medicine, remote sensing, astronomy, and other professional fields. The **Image Manipulation Reasoning** dimension includes HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, focusing on high-resolution image understanding and real-world visual reasoning.

<div align="center">
<img src="./image/s1-vl-32b-benchmark.png"/>
</div>

S1-VL-32B demonstrates strong overall competitiveness across these evaluations. In **scientific multimodal reasoning** tasks, the model achieves significant advantages on multiple authoritative benchmarks β€” including MMMU, MathVision, and VRSBench-MINI β€” surpassing its base model Qwen3-VL-32B in overall performance while remaining highly competitive against open-source models with substantially larger parameter counts (e.g., Qwen3-VL-235B, Intern-S1) as well as closed-source flagship models (e.g., Gemini 2.5 Pro, GPT-5). In **image manipulation reasoning** tasks, S1-VL-32B ranks **first on all five benchmarks**, outperforming models of comparable and larger scale and surpassing dedicated "Thinking with Images" models such as Thyme-VL and Skywork-R1V4. These results validate its ability to deliver efficient, high-quality multimodal reasoning at the 32B parameter scale.

## 🧠 Case Study

The following presents a reasoning example of S1-VL-32B operating in **Thinking with Images** mode. When processing a low-resolution cervical CT image, S1-VL-32B proactively invokes code tools during its reasoning to **crop and magnify** the region of interest. With the resulting clearer local view, the model combines the enhanced visual information with its internal knowledge to complete the reasoning.

<div align="center">
<img src="./image/s1-vl-32b-twi.png"/>
</div>

πŸ“ More cases are available in [CASES.md](./CASES.md).

## πŸš€ Quick Start

### 1. Install Dependencies

```bash
# Requires vLLM >= 0.11.0
pip install -U vllm
pip install qwen-vl-utils==0.0.14
```
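
To confirm the environment, you can check the installed vLLM version from Python (`vllm` exposes `__version__`):

```python
import vllm

# The serving setup below assumes vLLM >= 0.11.0.
print(vllm.__version__)
```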

### 2. Start the vLLM Service

```bash
vllm serve ScienceOne-AI/S1-VL-32B-RL \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --limit-mm-per-prompt image=15 \
    --reasoning-parser deepseek_r1 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.95 \
    --port 9200
```
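
Once the server is up, a quick way to verify it is to list the served models through the OpenAI-compatible API:

```python
from openai import OpenAI

# Sanity check: the served model list should include "ScienceOne-AI/S1-VL-32B-RL".
client = OpenAI(api_key="EMPTY", base_url="http://localhost:9200/v1")
print([m.id for m in client.models.list().data])
```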

### 3. Multimodal Reasoning Mode

```python
from openai import OpenAI
import base64

client = OpenAI(api_key="EMPTY", base_url="http://localhost:9200/v1")

with open("path/to/your/image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ScienceOne-AI/S1-VL-32B-RL",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": "Please describe the physical phenomenon shown in the image and derive the relevant equations."},
            ],
        }
    ],
    temperature=0.2,
    max_tokens=16384,
)

# The reasoning process is in the reasoning_content field
print("Thinking process:\n", response.choices[0].message.reasoning_content)
print("\nFinal answer:\n", response.choices[0].message.content)
```

### 4. Thinking with Images Mode

Thinking with Images mode requires a deployed **code sandbox** so that the model can invoke code tools during reasoning to perform image operations (cropping, zooming, enhancement, annotation, etc.).

#### Step 1: Deploy the Code Sandbox

We recommend deploying the AIO Sandbox with Docker:

```bash
git clone https://github.com/agent-infra/sandbox
cd sandbox
# Mount the host image directory into the container
# (host path /data/images β†’ sandbox path /mnt/data/images)
docker run -d \
    --name twi-sandbox \
    -p 18081:18081 \
    -v /data/images:/mnt/data/images \
    sandbox:latest
```
The mount path must match the path configuration in the FastAPI service.

#### Step 2: Start the Thinking with Images FastAPI Service

Download [twi_server.py](twi_server.py) and update the path configuration at the top of the file:

```python
CHAT_API        = "http://localhost:9200/v1/chat/completions"  # vLLM address
JUPYTER_API     = "http://localhost:18081/v1/jupyter"          # Sandbox address
HOST_IMG_DIR    = "/data/images"     # ← Host image directory (must match docker -v mount)
```

Start the service:

```bash
pip install fastapi uvicorn httpx pillow
python twi_server.py   # Listens on port 10044
```

#### Step 3: Call the Thinking with Images Endpoint

```python
import httpx
import base64

with open("path/to/your/image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {"type": "text", "text": "Please carefully analyze this scientific image."},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
]

response = httpx.post(
    "http://localhost:10044/process",
    json={
        "messages": messages,
        "image_path_list": ["/data/images/your_image.png"],  # Absolute host path
    },
    timeout=300,
)

result = response.json()

# The final answer is the last message with role="assistant"
final = [m for m in result["messages"] if m["role"] == "assistant"][-1]
print(final["content"])
```
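
The returned `messages` list also contains the intermediate tool-call turns, which is useful for inspecting when and how the model manipulated the image. A minimal inspection loop (assuming each message is a dict with `role` and `content` keys, as in the extraction above):

```python
# Print the full multi-round trajectory, truncating long content.
for i, msg in enumerate(result["messages"]):
    content = msg["content"]
    if not isinstance(content, str):  # e.g., multimodal parts with cropped images
        content = str(content)
    print(f"[{i}] {msg['role']}: {content[:120]}")
```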

## πŸ“„ Citation

If you use S1-VL-32B in your research, please cite:

```bibtex
@article{li2026s1vl,
  title     = {S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images},
  author    = {Li, Qingxiao and Xu, Lifeng and Wang, QingLi and Bai, Yudong and Ou, Mingwei and Hu, Shu and Xu, Nan},
  journal   = {arXiv preprint arXiv:2604.21409},
  year      = {2026},
}
```

## πŸ“œ License

This project is released under the Apache 2.0 License.

## πŸ™ Acknowledgements

We thank the open-source communities and pioneering works of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B.