---
license: apache-2.0
library_name: diffusers
pipeline_tag: image-to-video
base_model:
- DarthZhu/VideoRLVR-Wan2.2-Base
datasets:
- DarthZhu/VideoRLVR-Data
---

# VideoRLVR

VideoRLVR is a reinforcement learning (RL) recipe for training video reasoning models with verifiable rewards. This model is a reinforcement-learning optimized version of [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B), presented in the paper [Video Models Can Reason with Verifiable Rewards](https://huggingface.co/papers/2605.15458).

The model uses an SDE-GRPO optimization backbone and rule-based feedback to improve visual reasoning in complex, procedurally generated tasks such as Maze, FlowFree, and Sokoban.

- **Paper:** [Video Models Can Reason with Verifiable Rewards](https://huggingface.co/papers/2605.15458)
- **Project Page:** [https://darthzhu.github.io/VideoRLVR-page/](https://darthzhu.github.io/VideoRLVR-page/)
- **Code:** [https://github.com/luka-group/VideoRLVR](https://github.com/luka-group/VideoRLVR)

## Method Overview

VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories. Key components include:

1. **SDE-GRPO**: An optimization backbone for video diffusion models.
2. **Dense Decomposed Rewards**: Verifiable, rule-based feedback to guide the model.
3. **Early-Step Focus**: A strategy that restricts policy optimization to the early denoising phase, significantly reducing training latency while preserving performance.

## Citation

```bibtex
@article{zhu2026video,
  title={Video Models Can Reason with Verifiable Rewards},
  author={Tinghui Zhu and Sheng Zhang and James Y. Huang and Selena Song and Xiaofei Wen and Yuankai Li and Hoifung Poon and Muhao Chen},
  journal={arXiv preprint arXiv:2605.15458},
  year={2026}
}
```
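
## Illustrative Sketch

The three components listed under Method Overview can be pictured with a small, dependency-free Python sketch. This is not the code from the VideoRLVR repository: the function names, the reward component names, the weights, and the `early_fraction` value are all hypothetical, and the advantage formula is the generic group-relative standardization commonly associated with GRPO, assumed here rather than taken from the paper.

```python
from statistics import mean, pstdev


def decomposed_reward(components, weights):
    """Combine rule-based reward components into one scalar.

    Component names and weights are hypothetical placeholders.
    """
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt.

    Generic GRPO-style normalization (assumed): (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps, early_fraction=0.5):
    """Mark which denoising steps take part in the policy update.

    Returns 1 for the first `early_fraction` of steps and 0 for the rest.
    The cutoff value is an assumption, not a setting from the paper.
    """
    cutoff = max(1, int(num_steps * early_fraction))
    return [1 if t < cutoff else 0 for t in range(num_steps)]


# Two sampled videos for the same puzzle, scored by hypothetical rules.
weights = {"valid_moves": 0.5, "goal_reached": 0.5}
rewards = [
    decomposed_reward({"valid_moves": 1.0, "goal_reached": 1.0}, weights),
    decomposed_reward({"valid_moves": 1.0, "goal_reached": 0.0}, weights),
]
advantages = group_relative_advantages(rewards)
mask = early_step_mask(num_steps=8, early_fraction=0.25)
```

In a training loop built this way, each sample's per-step policy loss would be scaled by its advantage and multiplied by the mask, so that only the early denoising steps contribute gradients.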