Qwen3-VL-4B-Instruct-SFT-PRISM-DAPO
Overview
Large multimodal models (LMMs) are commonly post-trained with supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). However, SFT can introduce distributional drift: the post-SFT policy neither fully preserves the base model's capabilities nor faithfully matches the supervision distribution. The issue is especially challenging for multimodal reasoning, where visual-perception errors and logical-reasoning errors can drift in different ways and further degrade downstream RL.
We introduce PRISM, a three-stage post-training pipeline that inserts an explicit pre-alignment stage between SFT and RLVR. PRISM performs black-box adversarial on-policy distillation with a Mixture-of-Experts discriminator, providing separate corrective signals for visual grounding and reasoning consistency. This model, Qwen3-VL-4B-Instruct-SFT-PRISM-DAPO, is based on Qwen3-VL-4B-Instruct and further optimized with the PRISM pipeline using DAPO as the final RL algorithm.
Usage & Evaluation
For detailed instructions on inference, training, and evaluation, please refer to our GitHub repository. We recommend using the scripts and environment provided there to reproduce our results.
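As a quick start outside those scripts, inference should follow the standard Qwen3-VL pattern in recent versions of Hugging Face transformers. This is a minimal sketch, not the repository's official script: the model ID is the one this card describes, while the sample image URL and question are illustrative placeholders, and a transformers version recent enough to support Qwen3-VL checkpoints is assumed.

```python
MODEL_ID = "prism-vlm/Qwen3-VL-4B-Instruct-SFT-PRISM-DAPO"

def build_messages(image_url: str, question: str) -> list:
    """Build a single-turn multimodal chat message in the structure
    expected by the processor's chat template."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

if __name__ == "__main__":
    # Heavy imports and model download happen only when run as a script.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = build_messages(
        "https://example.com/sample.jpg",  # placeholder image
        "Describe the figure and reason step by step.",
    )
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the prompt.
    answer = processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )[0]
    print(answer)
```

For evaluation-quality numbers, still use the scripts and environment in the GitHub repository, since decoding settings can shift benchmark results.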
Citation
If you find PRISM useful for your research or applications, please cite it with the following BibTeX entry:
@misc{wang2026sfttorlprealignmentblackboxonpolicy,
      title={Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL},
      author={Sudong Wang and Weiquan Huang and Xiaomin Yu and Zuhao Yang and Hehai Lin and Keming Wu and Chaojun Xiao and Chen Chen and Wenxuan Wang and Beier Zhu and Yunjian Zhang and Chengwei Qin},
      year={2026},
      eprint={2604.28123},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.28123},
}
Check out this paper: https://arxiv.org/abs/2604.28123
Acknowledgements
We gratefully acknowledge the following open-source projects that made this work possible:
- LLaMA-Factory for the supervised fine-tuning infrastructure and tools.
- verl for the reinforcement learning training framework.
- lmms-eval for the comprehensive evaluation framework for large multimodal models.
We thank the developers and contributors of these projects for their excellent work and for making their code publicly available.
Model tree for prism-vlm/Qwen3-VL-4B-Instruct-SFT-PRISM-DAPO
Base model: Qwen/Qwen3-VL-4B-Instruct