## News
- [2026.04.09] We officially open-source the CRUX dataset and ViRC models as scheduled.
- [2026.03.01] The CRUX dataset and ViRC models are ready and are currently under internal review at Ant Group. Full open-sourcing will be completed no later than 2026.04.19.
- [2026.02.21] ViRC has been accepted by CVPR 2026.
- [2025.12.16] We release the arXiv paper and the code.
## About ViRC
Existing MLLMs typically perform textual reasoning over a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the image and reason step by step to prove intermediate propositions. We propose ViRC, a framework for multimodal mathematical tasks that introduces a Reason Chunking mechanism: it structures the multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to mimic the problem-solving patterns of human experts. CRUs ensure intra-unit textual coherence for verifying intermediate propositions, while integrating visual information across units to generate subsequent propositions and support structured reasoning.

To this end, we present the CRUX dataset, which uses three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem.

Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, consisting of Instructional SFT, Practice SFT, and Strategic RL, which further strengthens the model's Reason Chunking ability.

The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks and cross-domain high-resolution image benchmarks.
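The Reason Chunking structure described above can be pictured as a simple record type. A minimal sketch in Python — the field names and the `visual_ops` example are hypothetical illustrations, not the CRUX dataset's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative only: one way to represent the Critical Reasoning Units
# (CRUs) described above. Field names are assumptions, not the CRUX
# dataset's actual schema.
@dataclass
class CRU:
    proposition: str   # intermediate proposition this unit verifies
    reasoning: str     # coherent textual reasoning within the unit
    visual_ops: list = field(default_factory=list)  # visual-tool calls feeding the next unit

@dataclass
class ReasonChunkedCoT:
    question: str
    units: list        # consecutive CRUs forming one reasoning path

cot = ReasonChunkedCoT(
    question="Find angle ABC.",
    units=[CRU("AB = AC", "Triangle ABC is isosceles because ...",
               visual_ops=["crop(region=triangle_ABC)"])],
)
print(len(cot.units))  # 1
```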
## Get Started

### Installation

Clone the repository:

```bash
git clone https://github.com/Leon-LihongWang/ViRC.git
cd ViRC
```

Create and activate a conda environment:

```bash
conda create -n virc python=3.10 -y
conda activate virc
```

Install additional dependencies:

```bash
bash src/setup.sh
```
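After setup, a quick import check can catch a broken environment early. The package list below is an assumption based on the training and inference stack used later in this README (LLaMA-Factory, verl, vLLM); adjust it to whatever `src/setup.sh` actually installs:

```python
# Optional sanity check after installation. The package list is an
# assumption based on the stack used later in this README.
import importlib.util
import sys

def missing(packages):
    """Return the subset of `packages` that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

print("python", sys.version.split()[0])   # the conda env above pins 3.10
print("missing:", missing(["llamafactory", "verl", "vllm"]))
```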
## Training

### Preparation

Download our dataset and extract ViRC_images.tar.lz4:

```bash
huggingface-cli download --repo-type dataset LeonMiao/CRUX --local-dir ./data
cd ./data && lz4 -dc ViRC_images.tar.lz4 | tar -xf -
```

Download Qwen2.5-VL-7B-Instruct, the base model used for training:

```bash
huggingface-cli download --repo-type model Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./model
```
### Stage 1: Instructional SFT

```bash
cd ViRC
pip install datasets==3.6.0
DISABLE_VERSION_CHECK=1 llamafactory-cli train src/train/stage_1_InstrSFT.yaml
```
### Stage 2: Practice SFT

```bash
DISABLE_VERSION_CHECK=1 llamafactory-cli train src/train/stage_2_PracSFT.yaml
```
### Stage 3: Strategic RL

```bash
pip install datasets==4.0.0
# Start Qwen2.5-VL-72B-Instruct with vLLM, and update Lines 32, 36, and 37
# accordingly (the model-related settings).
\cp -f src/train/reward_for_scale_dynamic.py ./verl/utils/reward_score/__init__.py
bash src/train/stage_3_StratRL
```
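The copied `reward_for_scale_dynamic.py` provides the scoring function verl calls during Strategic RL. As a rough illustration only — the function name, signature, and matching rule here are assumptions, not the repo's actual code — an answer-matching reward might look like:

```python
import re

# Hypothetical stand-in for the reward in reward_for_scale_dynamic.py:
# score a rollout by comparing its final boxed answer to the ground truth.
# Signature and matching rule are assumptions, not the repo's actual code.
def compute_score(solution_str: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^{}]*)\}", solution_str)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

print(compute_score(r"Thus the area is \boxed{12}.", "12"))  # 1.0
```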
### Notes on Training Config

- We provide `src/train/merge_rl_result.sh` to merge Strategic RL outputs and export the final model weights in safetensors format.
- Our training data and scripts support two image-resolution settings:
  - Dynamic resolution (for models such as Qwen2.5-VL).
  - Fixed resolution (images resized to 1000×1000, for models such as Qwen2-VL/Qwen3-VL).

  These are distinguished by the suffixes `_scale_dynamic` and `_scale_fixed` under `data/` and `src/`.
- For reproducibility, we release a 50K subset of the training data (suffix `_50K`) and the full 100K set (suffix `_100K`).
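To make the suffix convention concrete, a small helper — illustrative only; the helper name is hypothetical, and only the suffixes themselves come from this README:

```python
# Illustrative helper for the suffix convention above; the helper is
# hypothetical, only the suffix strings come from this README.
def crux_suffix(setting: str, subset: str) -> str:
    if setting not in ("dynamic", "fixed"):
        raise ValueError(f"setting must be 'dynamic' or 'fixed', got {setting!r}")
    if subset not in ("50K", "100K"):
        raise ValueError(f"subset must be '50K' or '100K', got {subset!r}")
    return f"_scale_{setting}_{subset}"

print(crux_suffix("dynamic", "50K"))   # _scale_dynamic_50K
```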
## Inference

We provide inference scripts for two image-resolution settings:

- Dynamic resolution (`src/evaluation/ViRC_scale_dynamic.py`): for ViRC-7B and ViRC-3B, based on the Qwen2.5-VL-Instruct series.
- Fixed resolution (1000×1000) (`src/evaluation/ViRC_scale_fixed.py`): for ViRC-Qwen2VL-7B and ViRC-Qwen2VL-2B, based on the Qwen2-VL-Instruct series.

`src/evaluation/` also includes a sample input image (`image.png`) and an expected output example in `src/evaluation/response/` for a quick sanity check.
Start the ViRC model with vLLM, then run the evaluation script:

```bash
model_path=./ViRC/models/ViRC-7B
model_name=ViRC
tensor_parallel_size=4
port=8000

python -u -m vllm.entrypoints.openai.api_server \
    --model $model_path \
    --served-model-name $model_name \
    --dtype auto \
    --tensor-parallel-size $tensor_parallel_size \
    --gpu-memory-utilization 0.9 \
    --port $port

python src/evaluation/ViRC_scale_dynamic.py
```
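Under the hood, the evaluation script talks to the vLLM server through its OpenAI-compatible chat endpoint (`http://localhost:8000/v1/chat/completions`, using the port set above). A simplified sketch of the kind of request it might build — the prompt wording and image transport here are placeholders, not the actual script's logic:

```python
import json

def build_chat_request(question: str, image_url: str,
                       model: str = "ViRC", max_tokens: int = 2048) -> dict:
    """Assemble an OpenAI-style multimodal chat request for the vLLM server.
    Prompt wording and image transport (URL vs. base64) are placeholders."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
    }

payload = build_chat_request("Solve the problem in the image.",
                             "file:///path/to/image.png")
print(json.dumps(payload, indent=2)[:80])
# POST the payload to http://localhost:8000/v1/chat/completions,
# e.g. with urllib.request or the openai client pointed at the local port.
```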
## Acknowledgements
We would like to thank LLaMA-Factory and verl, upon which our repo is built.
## Citation

```bibtex
@article{wang2025virc,
  title={ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking},
  author={Wang, Lihong and Li, Liangqi and Feng, Weiwei and Wu, Jiamin and Miao, Changtao and Wu, Tieru and Ma, Rui and Zhang, Bo and Li, Zhe},
  journal={arXiv preprint arXiv:2512.14654},
  year={2025}
}
```