	---
	license: apache-2.0
	library_name: diffusers
	pipeline_tag: image-to-video
	base_model:
	- DarthZhu/VideoRLVR-Wan2.2-Base
	datasets:
	- DarthZhu/VideoRLVR-Data
	---

	# VideoRLVR

	VideoRLVR is a reinforcement learning (RL) recipe for training video reasoning models with verifiable rewards. This model is an RL-optimized version of [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B), trained with the recipe presented in the paper [Video Models Can Reason with Verifiable Rewards](https://huggingface.co/papers/2605.15458).

	The model uses an SDE-GRPO optimization backbone and rule-based feedback to improve visual reasoning in complex, procedurally generated tasks such as Maze, FlowFree, and Sokoban.
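To make "rule-based feedback" concrete, here is a minimal sketch of what a verifiable reward for a Maze-style task could look like. It is not the reward used in the paper: the grid encoding (`#` for walls), the `maze_reward` name, the three reward terms, and their equal weighting are all assumptions made for illustration. It also assumes the agent's path has already been decoded from the generated video into a list of grid cells, a step this sketch does not cover.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col)


def maze_reward(grid: List[str], path: List[Cell], start: Cell, goal: Cell) -> Dict[str, float]:
    """Score a decoded path on a grid maze with simple, checkable rules.

    Hypothetical encoding: each string in `grid` is a row, '#' marks a wall,
    and any other character is a free cell.
    """
    rows, cols = len(grid), len(grid[0])

    def is_free(cell: Cell) -> bool:
        r, c = cell
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] != "#"

    # Term 1: the path must begin at the start cell.
    starts_ok = bool(path) and path[0] == start

    # Term 2: fraction of moves that step to a 4-adjacent free cell.
    legal = 0
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        if abs(r0 - r1) + abs(c0 - c1) == 1 and is_free((r1, c1)):
            legal += 1
    validity = legal / max(len(path) - 1, 1)

    # Term 3: the goal only counts if the whole path is legal.
    reached = starts_ok and validity == 1.0 and path[-1] == goal

    parts = {"start": float(starts_ok), "validity": validity, "goal": float(reached)}
    parts["total"] = sum(parts.values()) / 3.0  # equal weights, chosen arbitrarily
    return parts


if __name__ == "__main__":
    grid = ["..#", ".#.", "..."]
    good_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
    print(maze_reward(grid, good_path, start=(0, 0), goal=(2, 2)))
```

Because every term is computed by a deterministic rule, the score can be verified without a learned reward model, and a partially correct path still receives a graded signal through the validity term.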

	- Paper: [Video Models Can Reason with Verifiable Rewards](https://huggingface.co/papers/2605.15458)
	- Project Page: [https://darthzhu.github.io/VideoRLVR-page/](https://darthzhu.github.io/VideoRLVR-page/)
	- Code: [https://github.com/luka-group/VideoRLVR](https://github.com/luka-group/VideoRLVR)

	## Method Overview

	VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories. Key components include:
	1. SDE-GRPO: An optimization backbone for video diffusion models.
	2. Dense Decomposed Rewards: Verifiable, rule-based feedback to guide the model.
	3. Early-Step Focus: A strategy that restricts policy optimization to the early denoising phase, significantly reducing training latency while preserving performance.
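The components above can be illustrated with a small, framework-free sketch. It shows a generic GRPO-style update signal: rewards are normalized within a group of samples generated for the same prompt, and a PPO-style clipped surrogate is accumulated only over the early denoising steps. This is an assumption-laden illustration, not the authors' implementation: the function names, the `early_fraction` parameter, the clipping range, and the per-step log-ratio inputs are all hypothetical, and the SDE sampler that would produce those log-probabilities is not shown.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps: int, early_fraction: float) -> List[float]:
    """Return 1.0 for the first `early_fraction` of denoising steps, else 0.0."""
    cutoff = max(1, int(num_steps * early_fraction))
    return [1.0 if t < cutoff else 0.0 for t in range(num_steps)]


def clipped_surrogate_loss(
    log_ratios: List[List[float]],  # log_ratios[i][t] = log pi_new - log pi_old
    advantages: List[float],        # one advantage per sample i
    mask: List[float],              # one entry per denoising step t
    clip_eps: float = 0.2,          # illustrative clipping range
) -> float:
    """PPO-style clipped objective, averaged over the unmasked steps only."""
    total, count = 0.0, 0
    for sample_log_ratios, adv in zip(log_ratios, advantages):
        for log_ratio, m in zip(sample_log_ratios, mask):
            if m == 0.0:
                continue  # late denoising steps are skipped entirely
            ratio = math.exp(log_ratio)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            total += min(ratio * adv, clipped * adv)
            count += 1
    return -total / max(count, 1)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.5, 0.5]           # e.g. rule-based scores for 4 samples
    adv = group_relative_advantages(rewards)
    mask = early_step_mask(num_steps=4, early_fraction=0.5)
    log_ratios = [[0.0] * 4 for _ in rewards]  # new policy equals old policy
    print(adv, mask, clipped_surrogate_loss(log_ratios, adv, mask))
```

Skipping the masked steps means no log-probabilities or gradients are needed for the late denoising phase, which is where the training-time savings of an early-step strategy would come from in a sketch like this one.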

	## Citation

	```bibtex
	@article{zhu2026video,
	title={Video Models Can Reason with Verifiable Rewards},
	author={Tinghui Zhu and Sheng Zhang and James Y. Huang and Selena Song and Xiaofei Wen and Yuankai Li and Hoifung Poon and Muhao Chen},
	journal={arXiv preprint arXiv:2605.15458},
	year={2026}
	}
	```