Model Card for VGPO-RL-7B

πŸ“– Overview of VGPO

Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to signal dilution: generic text tokens receive the same reinforcement as critical visually grounded reasoning steps. Meanwhile, temporal visual forgetting causes attention to visual inputs to progressively decay as reasoning chains extend.

VGPO addresses these issues through three key mechanisms:

  • Visual Attention Compensation (VAC): Uses the inherent hidden-state similarity between generated tokens and image tokens as a Visual Focus Score to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
  • Intra-Trajectory Re-weighting: At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually-grounded tokens.
  • Inter-Trajectory Re-weighting: At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.
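The three mechanisms above can be sketched in NumPy. This is a minimal illustration, not the released implementation: the cosine-similarity Visual Focus Score proxy, the linear incentive schedule, the normalization choices, and all function and parameter names (`gamma`, `visual_focus_scores`, etc.) are assumptions for exposition.

```python
import numpy as np

def visual_focus_scores(token_hidden, image_hidden):
    """Hypothetical VFS proxy: mean cosine similarity between each
    generated-token hidden state and the image-token hidden states."""
    t = token_hidden / np.linalg.norm(token_hidden, axis=-1, keepdims=True)
    v = image_hidden / np.linalg.norm(image_hidden, axis=-1, keepdims=True)
    return (t @ v.T).mean(axis=-1)          # shape (T,)

def progressive_incentive(T, gamma=0.5):
    """Assumed linear schedule: later reasoning steps get a larger
    incentive, counteracting temporal visual forgetting."""
    pos = np.arange(T) / max(T - 1, 1)
    return 1.0 + gamma * pos                # 1.0 -> 1.0 + gamma

def intra_trajectory_reweight(advantages, focus, incentive):
    """Token level: amplify advantages of visually grounded tokens,
    normalized so the average advantage scale is preserved."""
    w = focus * incentive
    return advantages * (w / w.mean())

def inter_trajectory_weights(focus_per_traj):
    """Trajectory level: weight rollouts by their accumulated visual
    focus, favoring sustained visual grounding."""
    acc = np.array([f.mean() for f in focus_per_traj])
    return acc / acc.sum()
```

The intent is only to show how the two re-weighting levels compose: token advantages are scaled by focus-times-schedule, and whole rollouts are then weighted by their accumulated focus.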

πŸ”— Model Sources

πŸ“• Training Datasets

| Split | Dataset | Link |
|-------|---------|------|
| Train | ViRL39K | PAPOGalaxy/PAPO_ViRL39K_train |
| Val | MMK12 | PAPOGalaxy/PAPO_MMK12_test |

πŸ“Š Evaluation

We follow the evaluation script of Look-Back. All results are reported as average accuracy at an inference temperature of 0.0.
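A minimal sketch of the reported metric: per-benchmark exact-match accuracy under deterministic (temperature 0.0, i.e. greedy) decoding, averaged across benchmarks. All names here are illustrative assumptions, not the Look-Back script itself.

```python
# Greedy decoding disables sampling, so temperature 0.0 is deterministic
# (illustrative config dict, not tied to a specific generation API).
GEN_CONFIG = {"do_sample": False, "temperature": 0.0}

def benchmark_accuracy(predictions, references):
    """Exact-match accuracy on a single benchmark."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

def average_accuracy(per_benchmark):
    """Mean accuracy over the supported benchmarks."""
    return sum(per_benchmark.values()) / len(per_benchmark)
```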

Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
|-----------|--------------|
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperClevr Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |

✍️ Citation

If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work πŸ“:

@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning}, 
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}

❀️ Acknowledgements

Our codebase is built upon EasyR1, VPPO-RL, PAPO, and Look-Back. We thank the authors for their excellent work.
