Improve model card with paper link and metadata

This PR improves the model card by adding metadata (license, library name, and pipeline tag) and linking it to the associated paper [Video Models Can Reason with Verifiable Rewards](https://huggingface.co/papers/2605.15458), project page, and code repository.

Files changed (1)

README.md +34 -3

@@ -1,6 +1,37 @@
 ---
-datasets:
-- DarthZhu/VideoRLVR-Data
 base_model:
 - DarthZhu/VideoRLVR-Wan2.2-Base
----

 ---
+license: apache-2.0
+library_name: diffusers
+pipeline_tag: image-to-video
 base_model:
 - DarthZhu/VideoRLVR-Wan2.2-Base
+datasets:
+- DarthZhu/VideoRLVR-Data
+---
+# VideoRLVR
+VideoRLVR is a reinforcement learning (RL) recipe for training video reasoning models with verifiable rewards. This model is an RL-optimized version of [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B), presented in the paper [Video Models Can Reason with Verifiable Rewards](https://huggingface.co/papers/2605.15458).
+The model uses an SDE-GRPO optimization backbone and rule-based feedback to improve visual reasoning in complex, procedurally generated tasks such as Maze, FlowFree, and Sokoban.
+- **Paper:** [Video Models Can Reason with Verifiable Rewards](https://huggingface.co/papers/2605.15458)
+- **Project Page:** [https://darthzhu.github.io/VideoRLVR-page/](https://darthzhu.github.io/VideoRLVR-page/)
+- **Code:** [https://github.com/luka-group/VideoRLVR](https://github.com/luka-group/VideoRLVR)
+## Method Overview
+VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories. Key components include:
+1. **SDE-GRPO**: An optimization backbone for video diffusion models.
+2. **Dense Decomposed Rewards**: Verifiable, rule-based feedback to guide the model.
+3. **Early-Step Focus**: A strategy that restricts policy optimization to the early denoising phase, significantly reducing training latency while preserving performance.
+## Citation
+```bibtex
+@article{zhu2026video,
+  title={Video Models Can Reason with Verifiable Rewards},
+  author={Tinghui Zhu and Sheng Zhang and James Y. Huang and Selena Song and Xiaofei Wen and Yuankai Li and Hoifung Poon and Muhao Chen},
+  journal={arXiv preprint arXiv:2605.15458},
+  year={2026}
+}
+```
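
Note on the Method Overview in the diff above: the three components can be illustrated with a small, self-contained sketch. It is not taken from the VideoRLVR code repository. The maze reward terms, their equal weighting, and all function names (`decomposed_maze_reward`, `group_relative_advantages`, `early_step_mask`) are hypothetical; the only part assumed from the literature is the group-relative reward normalization commonly used by GRPO-style methods. The actual SDE-GRPO objective, reward design, and step schedule may differ.

```python
from statistics import mean, pstdev


def decomposed_maze_reward(path, walls, start, goal):
    """Toy rule-based reward (hypothetical): mean of three checks in [0, 1]."""
    if not path:
        return 0.0
    starts_ok = 1.0 if path[0] == start else 0.0
    reaches_goal = 1.0 if path[-1] == goal else 0.0
    # Fraction of moves that are one grid step into a non-wall cell.
    moves = list(zip(path, path[1:]))
    legal = [
        abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 and b not in walls
        for a, b in moves
    ]
    legal_frac = sum(legal) / len(legal) if legal else 0.0
    return (starts_ok + reaches_goal + legal_frac) / 3.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps, num_train_steps):
    """True for the first `num_train_steps` denoising steps, False afterwards."""
    return [i < num_train_steps for i in range(num_steps)]


# Two hypothetical rollouts for the same maze prompt.
walls = {(1, 1)}
start, goal = (0, 0), (0, 2)
rollouts = [
    [(0, 0), (0, 1), (0, 2)],  # legal path to the goal
    [(0, 0), (1, 1), (0, 2)],  # diagonal jumps through a wall cell
]
rewards = [decomposed_maze_reward(p, walls, start, goal) for p in rollouts]
advantages = group_relative_advantages(rewards)
mask = early_step_mask(num_steps=10, num_train_steps=4)
```

In this sketch, only denoising steps where `mask` is `True` would contribute to the policy loss, which is one way to read "restricts policy optimization to the early denoising phase"; how many steps to keep is left as a free parameter here.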