| --- |
| license: mit |
| language: |
| - en |
| base_model: |
| - Qwen/Qwen2.5-VL-7B-Instruct |
| --- |
| # VIDEO-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning |
|
|
| This is the official implementation for Video-RTS. |
|
|
| [Project Page](https://sites.google.com/cs.unc.edu/videorts2025/) [Paper](https://arxiv.org/abs/2507.06485) [Model](https://huggingface.co/Ted412/Video-RTS) |
|
|
| ### Authors: [Ziyang Wang*](https://ziyangw2000.github.io/), [Jaehong Yoon*](https://jaehong31.github.io/), [Shoubin Yu](https://yui010206.github.io/), [Md Mohaiminul Islam](https://md-mohaiminul.github.io/), [Gedas Bertasius](https://www.gedasbertasius.com/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/) |
|
|
| ### University of North Carolina at Chapel Hill |
|
|
|
|
| We introduce Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient reinforcement learning (RL) with a video-adaptive test-time scaling (TTS) strategy. |
|
|
|
|
| ## **Installation** |
|
|
| ```bash |
| git clone https://github.com/Ziyang412/Video-RTS.git |
| cd Video-RTS |
| |
| # build environment |
| conda create -n video-rts python=3.11 |
| conda activate video-rts |
| bash setup.sh |
| |
| # qwen-vl-utils holds the video extraction settings (e.g., max frames, resolution) |
| # Install it with the [decord] extra to speed up video loading |
| cd src/qwen-vl-utils |
| pip install -e ".[decord]" |
| cd .. |
| ``` |
|
|
|
|
| Following Video-R1, please install the provided version of transformers: |
|
|
| ```bash |
| unzip transformers-main.zip |
| cd ./transformers-main |
| pip install . |
| ``` |
|
|
| ## **Download Dataset** |
| Please refer to the official GitHub repository of each dataset to download the videos. |
|
|
| For evaluation, we provide the annotation file in `./src/r1-v/Evaluation`. Please refer to `./src/r1-v/Evaluation/path_coversion.py` to update the video paths so that they point to your local copies. |
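The path update can be illustrated with a minimal sketch. It is not the repository's `path_coversion.py`; it assumes the annotations are a list of JSON objects whose video location is stored under a hypothetical `path` key, and the function name `rebase_video_paths` is also hypothetical. Check the actual annotation file and adjust the key before using anything like this.

```python
def rebase_video_paths(records, old_root, new_root, key="path"):
    """Return copies of `records` with a leading `old_root` swapped for `new_root`.

    `key` is an assumed field name; entries without a matching prefix
    are returned unchanged.
    """
    updated = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the input list is not modified
        path = rec.get(key)
        if isinstance(path, str) and path.startswith(old_root):
            rec[key] = new_root + path[len(old_root):]
        updated.append(rec)
    return updated


if __name__ == "__main__":
    # Toy records standing in for entries of an annotation file.
    demo = [{"path": "./Evaluation/a.mp4"}, {"path": "/other/b.mp4"}]
    print(rebase_video_paths(demo, "./Evaluation", "/data/videos"))
```

In practice you would load the annotation file with `json.load`, pass the resulting list through the function, and write it back with `json.dump`.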
|
|
| For training, we provide the training data annotations in `./src/training_data`. Please refer to the [CG-Bench](https://huggingface.co/datasets/CG-Bench/CG-Bench) repo for the video data. |
|
|
| ## **Download Video-RTS model checkpoint** |
| We provide the model checkpoint on [Hugging Face](https://huggingface.co/Ted412/Video-RTS). Note that this model is trained on only about 2k samples but yields performance similar to training on 6k samples. |
|
|
| ## **Video-RTS Training** |
|
|
| We use [Open-R1-Video](https://github.com/Wang-Xiaodong1899/Open-R1-Video) as the training codebase. We provide our modified files in `./src/training_files`; please use them to replace the files of the same name in the original repo. You can also use [Video-R1](https://github.com/tulerfeng/Video-R1/tree/main) as the training codebase; we find that the results are similar. |
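One way to perform the file replacement is sketched below. This is an illustration, not a script shipped with the repository: the helper name `overlay_files` is hypothetical, and it assumes each modified file should overwrite the single file with the same name in the cloned training repo. Files with no match or more than one match are left untouched and reported as `None` so they can be handled by hand.

```python
import shutil
from pathlib import Path


def overlay_files(patch_dir, repo_dir, dry_run=True):
    """Copy each file in `patch_dir` over the uniquely same-named file in `repo_dir`.

    Returns a dict mapping each source file to its target, or to None when
    the target is missing or ambiguous. With dry_run=True nothing is copied.
    """
    patch_dir, repo_dir = Path(patch_dir), Path(repo_dir)
    plan = {}
    for src in sorted(p for p in patch_dir.rglob("*") if p.is_file()):
        matches = [p for p in repo_dir.rglob(src.name) if p.is_file()]
        if len(matches) != 1:
            plan[src] = None  # missing or ambiguous: resolve manually
            continue
        plan[src] = matches[0]
        if not dry_run:
            shutil.copy2(src, matches[0])
    return plan
```

Running it first with `dry_run=True` and inspecting the returned plan is a cheap way to confirm that every file maps to the intended target before anything is overwritten.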
|
|
|
|
| ## **Inference with S2D Video TTS** |
|
|
| Please update the input model, file name, and output file in the given bash file (`src/video_rts_eval.sh`). After running the inference code, please update `json_path` in `cal_results_acc.py` to point to the inference output, then run it to calculate the final video reasoning accuracy. |
| |
| ```bash |
| bash src/video_rts_eval.sh |
| python src/cal_results_acc.py |
| ``` |
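For orientation, the accuracy step amounts to comparing each predicted answer with its ground truth. The sketch below is not the repository's `cal_results_acc.py`; it assumes the output is a list of records with hypothetical `prediction` and `answer` fields holding multiple-choice letters, and the real output format may differ.

```python
def normalize(ans):
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return str(ans).strip().upper()


def accuracy(records, pred_key="prediction", gold_key="answer"):
    """Fraction of records whose prediction matches the ground truth.

    `pred_key` and `gold_key` are assumed field names. Records missing
    either field count as incorrect.
    """
    if not records:
        return 0.0
    correct = 0
    for rec in records:
        if pred_key in rec and gold_key in rec:
            if normalize(rec[pred_key]) == normalize(rec[gold_key]):
                correct += 1
    return correct / len(records)


if __name__ == "__main__":
    demo = [
        {"prediction": "A", "answer": "A"},
        {"prediction": " b ", "answer": "B"},
        {"prediction": "C", "answer": "D"},
        {"answer": "A"},  # no prediction recorded
    ]
    print(f"accuracy = {accuracy(demo):.2f}")
```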
| |
| |
| ## Acknowledgments |
| We thank the developers of [Open-R1-Video](https://github.com/Wang-Xiaodong1899/Open-R1-Video), [Video-R1](https://github.com/tulerfeng/Video-R1/tree/main), [Qwen-2.5-VL](https://github.com/QwenLM/Qwen2.5-VL/tree/main) and [TRL](https://github.com/huggingface/trl) for their public code release. |
| |
| ## Reference |
| Please cite our paper if you use our models in your work: |
| |
| ```bibtex |
| @misc{wang2025videortsrethinkingreinforcementlearning, |
| title={Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning}, |
| author={Ziyang Wang and Jaehong Yoon and Shoubin Yu and Md Mohaiminul Islam and Gedas Bertasius and Mohit Bansal}, |
| year={2025}, |
| eprint={2507.06485}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV}, |
| url={https://arxiv.org/abs/2507.06485}, |
| } |
| ``` |