# Model Card for VGPO-RL-7B
## Overview of VGPO
Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to signal dilution: generic text tokens receive the same reinforcement as critical visually-grounded reasoning steps. Meanwhile, temporal visual forgetting causes attention to visual inputs to decay progressively as reasoning chains extend.
VGPO addresses these issues through three key mechanisms:
- Visual Attention Compensation (VAC): Uses the inherent hidden-state similarity between generated tokens and image tokens as a Visual Focus Score to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
- Intra-Trajectory Re-weighting: At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually-grounded tokens.
- Inter-Trajectory Re-weighting: At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.
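The three mechanisms above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the similarity measure (cosine), the linear progressive schedule, the centering-based advantage weights, and all function names and hyperparameters (`alpha`, `gamma`) are assumptions for exposition.

```python
import numpy as np

def visual_focus_scores(token_hidden: np.ndarray, image_hidden: np.ndarray) -> np.ndarray:
    """Visual Focus Score per generated token: mean cosine similarity between
    each token's hidden state and the image-token hidden states (a sketch of
    VAC's hidden-state-similarity localization, no external supervision)."""
    t = token_hidden / np.linalg.norm(token_hidden, axis=-1, keepdims=True)
    v = image_hidden / np.linalg.norm(image_hidden, axis=-1, keepdims=True)
    sim = t @ v.T                       # (num_tokens, num_image_tokens)
    return sim.mean(axis=-1)            # one score per generated token

def progressive_incentive(focus_scores: np.ndarray, gamma: float = 0.01) -> np.ndarray:
    """Assumed linear schedule: boost later positions more, counteracting
    temporal visual forgetting in long reasoning chains."""
    positions = np.arange(len(focus_scores))
    return focus_scores * (1.0 + gamma * positions)

def reweight_advantages(advantages: np.ndarray, focus_scores: np.ndarray,
                        alpha: float = 1.0) -> np.ndarray:
    """Intra-trajectory re-weighting: amplify token advantages whose visual
    focus is above the trajectory mean, dampen those below it."""
    weights = 1.0 + alpha * (focus_scores - focus_scores.mean())
    return advantages * weights

def trajectory_weight(focus_scores: np.ndarray) -> float:
    """Inter-trajectory re-weighting: a rollout's weight grows with its
    accumulated visual focus, favoring sustained visual grounding."""
    return float(focus_scores.sum())
```

For example, a token whose hidden state aligns perfectly with an image token receives focus 1.0 and sees its advantage amplified, while an off-image token with focus 0.0 is dampened.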
## Model Sources
- GitHub Repository: VGPO
- arXiv Paper: 2604.09349
## Training Datasets
| Split | Dataset | Link |
|---|---|---|
| Train | ViRL39K | PAPOGalaxy/PAPO_ViRL39K_train |
| Val | MMK12 | PAPOGalaxy/PAPO_MMK12_test |
## Evaluation
We follow the evaluation script of Look-Back. All results are reported as average accuracy with an inference temperature of 0.0.
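In this setting, temperature 0.0 conventionally means deterministic greedy decoding (sampling disabled) rather than literally dividing logits by zero, and the reported number is an unweighted mean over benchmark accuracies. A minimal sketch of both conventions, assuming Hugging Face-style `generate()` keyword arguments; the helper names are illustrative:

```python
def greedy_generation_kwargs(temperature: float) -> dict:
    """Map an eval temperature to generation kwargs (assumed HF-style
    convention): 0.0 disables sampling for deterministic greedy decoding."""
    if temperature == 0.0:
        return {"do_sample": False}
    return {"do_sample": True, "temperature": temperature}

def average_accuracy(per_benchmark_acc: dict) -> float:
    """Unweighted mean accuracy over the evaluated benchmarks."""
    return sum(per_benchmark_acc.values()) / len(per_benchmark_acc)
```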
### Supported Evaluation Benchmarks
| Benchmark | Focus Domain |
|---|---|
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperCLEVR Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |
## Citation
If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work:
```bibtex
@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning},
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}
```
## ❤️ Acknowledgements
Our codebase is built upon EasyR1, VPPO-RL, PAPO, and Look-Back. We thank the authors for their excellent work.
Base model: Qwen/Qwen2.5-VL-32B-Instruct