Model Card for VGPO-RL-7B

πŸ“– Overview of VGPO

Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to signal dilution: generic text tokens receive the same reinforcement as critical visually grounded reasoning steps. Meanwhile, temporal visual forgetting causes attention to visual inputs to progressively decay as reasoning chains extend.

VGPO addresses these issues through three key mechanisms:

  • Visual Attention Compensation (VAC): Uses the inherent hidden-state similarity between generated tokens and image tokens as a Visual Focus Score to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
  • Intra-Trajectory Re-weighting: At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually-grounded tokens.
  • Inter-Trajectory Re-weighting: At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.
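The three mechanisms above can be sketched in NumPy. This is a minimal illustration, not the released implementation: the cosine-similarity Visual Focus Score proxy, the linear incentive schedule, the normalization choices, and all function and parameter names (`gamma`, `visual_focus_scores`, etc.) are assumptions for exposition.

```python
import numpy as np

def visual_focus_scores(token_hidden, image_hidden):
    """Hypothetical VFS proxy: mean cosine similarity between each
    generated-token hidden state and the image-token hidden states."""
    t = token_hidden / np.linalg.norm(token_hidden, axis=-1, keepdims=True)
    v = image_hidden / np.linalg.norm(image_hidden, axis=-1, keepdims=True)
    return (t @ v.T).mean(axis=-1)          # shape (T,)

def progressive_incentive(T, gamma=0.5):
    """Assumed linear schedule: later reasoning steps get a larger
    incentive, counteracting temporal visual forgetting."""
    pos = np.arange(T) / max(T - 1, 1)
    return 1.0 + gamma * pos                # 1.0 -> 1.0 + gamma

def intra_trajectory_reweight(advantages, focus, incentive):
    """Token level: amplify advantages of visually grounded tokens,
    normalized so the average advantage scale is preserved."""
    w = focus * incentive
    return advantages * (w / w.mean())

def inter_trajectory_weights(focus_per_traj):
    """Trajectory level: weight rollouts by their accumulated visual
    focus, favoring sustained visual grounding."""
    acc = np.array([f.mean() for f in focus_per_traj])
    return acc / acc.sum()
```

The intent is only to show how the two re-weighting levels compose: token advantages are scaled by focus-times-schedule, and whole rollouts are then weighted by their accumulated focus.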

πŸ”— Model Sources

πŸ“• Training Datasets

| Split | Dataset | Link |
|-------|---------|------|
| Train | ViRL39K | PAPOGalaxy/PAPO_ViRL39K_train |
| Val | MMK12 | PAPOGalaxy/PAPO_MMK12_test |

πŸ“Š Evaluation

We follow the evaluation script of Look-Back. All results are reported as average accuracy at an inference temperature of 0.0.
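A minimal sketch of the reported metric: per-benchmark exact-match accuracy under deterministic (temperature 0.0, i.e. greedy) decoding, averaged across benchmarks. All names here are illustrative assumptions, not the Look-Back script itself.

```python
# Greedy decoding disables sampling, so temperature 0.0 is deterministic
# (illustrative config dict, not tied to a specific generation API).
GEN_CONFIG = {"do_sample": False, "temperature": 0.0}

def benchmark_accuracy(predictions, references):
    """Exact-match accuracy on a single benchmark."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

def average_accuracy(per_benchmark):
    """Mean accuracy over the supported benchmarks."""
    return sum(per_benchmark.values()) / len(per_benchmark)
```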

Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
|-----------|--------------|
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperClevr Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |

✍️ Citation

If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work πŸ“:

@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning}, 
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}

❀️ Acknowledgements

Our codebase is built upon EasyR1, VPPO-RL, PAPO, and Look-Back. We thank the authors for their excellent work.
