mwxely commited on
Commit
e051198
·
verified ·
1 Parent(s): 08e2f90

Initial README

Browse files
Files changed (1) hide show
  1. README.md +84 -0
README.md ADDED
@@ -0,0 +1,84 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ base_model:
3
+ - Qwen/Qwen3-VL-8B-Instruct
4
+ datasets:
5
+ - ParaVT/ParaVT-Parquet
6
+ - ParaVT/ParaVT-Source
7
+ license: apache-2.0
8
+ library_name: transformers
9
+ pipeline_tag: video-text-to-text
10
+ language:
11
+ - en
12
+ tags:
13
+ - video
14
+ - long-video
15
+ - reasoning
16
+ - tool-calling
17
+ - agentic-rl
18
+ - grpo
19
+ - multimodal
20
+ ---
21
+
22
+ # ParaVT: From Format Fragility to Parallel Tool Mastery in Agentic Video RL
23
+
24
+ <div align="center">
25
+
26
+ [![Paper](https://img.shields.io/badge/Paper-000000?style=for-the-badge&logo=arxiv&logoColor=white)](#citation)
27
+ [![Code](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/mwxely/ParaVT)
28
+ [![Data](https://img.shields.io/badge/Data-0040A1?style=for-the-badge&logo=huggingface&logoColor=ffffff)](https://huggingface.co/datasets/ParaVT/ParaVT-Parquet)
29
+ [![Source](https://img.shields.io/badge/Source-0040A1?style=for-the-badge&logo=huggingface&logoColor=ffffff)](https://huggingface.co/datasets/ParaVT/ParaVT-Source)
30
+
31
+ </div>
32
+
33
+ ## Overview
34
+
35
+ Training large multimodal models (LMMs) via reinforcement learning to natively invoke video-processing tools (such as temporal cropping) has become a promising route to long-video understanding. Existing native-RL methods, however, dispatch tool calls sequentially (one per turn): a single wrong crop propagates errors without peer correction, multi-turn calls corrupt context, and inference cost scales linearly with the number of turns.
36
+
37
+ **ParaVT** is the first multi-agent end-to-end RL-trained framework for **Para**llel **V**ideo **T**ool calling: it dispatches multiple time-window crops in a single turn for cleaner context and better fault tolerance. Applying standard RL to ParaVT surfaces an obstacle we term the *Tool Prior Paradox*, where the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose a skip-tool reward shortcut under temperature sampling. We address this with **PARA-GRPO** (Parseability-Anchored and Ratio-gAted GRPO): a targeted format reward applied only at the structural-token positions most prone to collapse, and a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it.
38
+
39
+ ## Model Card
40
+
41
+ This repository hosts the final post-RL checkpoint (`ParaVT-8B`), obtained by running PARA-GRPO on top of the cold-start SFT checkpoint [`mwxely/ParaVT-8B-SFT`](https://huggingface.co/mwxely/ParaVT-8B-SFT). The base architecture is `Qwen3VLForConditionalGeneration`, identical to [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct); only the language-model weights are updated.
42
+
43
+ | Field | Value |
44
+ |---|---|
45
+ | Architecture | `Qwen3VLForConditionalGeneration` |
46
+ | Parameters | 8 B |
47
+ | Base model | `Qwen/Qwen3-VL-8B-Instruct` |
48
+ | Training stages | SFT (Plan B, 500 steps) → PARA-GRPO (54 steps) |
49
+ | Training data | [`ParaVT/ParaVT-Parquet`](https://huggingface.co/datasets/ParaVT/ParaVT-Parquet) (`sft` + `rl` configs) |
50
+ | Source videos | [`ParaVT/ParaVT-Source`](https://huggingface.co/datasets/ParaVT/ParaVT-Source) |
51
+ | Native tool | Temporal cropping (start time, end time, optional sub-frame count) |
52
+
53
+ ## Usage
54
+
55
+ `ParaVT-8B` is a drop-in `transformers` / `vllm` model for video-text-to-text. The full evaluation driver, prompt templates, and reproduction scripts live in the [ParaVT GitHub repository](https://github.com/mwxely/ParaVT); please refer to it for the exact environment that produced the reported numbers.
56
+
57
+ ```bash
58
+ # Reproduce the headline numbers (after installing the eval venv)
59
+ git clone https://github.com/mwxely/ParaVT.git && cd ParaVT
60
+ cp .secrets.env.example .secrets.env && $EDITOR .secrets.env
61
+ bash scripts/setup_env.sh eval
62
+ PARAVT_EVAL_MODEL=ParaVT/ParaVT-8B \
63
+ bash paravt/eval/scripts/reproduce_paravt_8b.sh
64
+ ```
65
+
66
+ For inference outside the eval driver, treat the model exactly like `Qwen/Qwen3-VL-8B-Instruct`: vLLM `--model ParaVT/ParaVT-8B`, the same tokenizer, the same chat template. The agentic system prompt and the tool schema used during PARA-GRPO are documented in [`paravt/eval/configs/withtool.yaml`](https://github.com/mwxely/ParaVT/blob/paravt-release/paravt/eval/configs/withtool.yaml) and [`paravt/eval/utils.py`](https://github.com/mwxely/ParaVT/blob/paravt-release/paravt/eval/utils.py).
67
+
68
+ ## Citation
69
+
70
+ If you find ParaVT useful for your research and applications, please cite:
71
+
72
+ ```bibtex
73
+ @misc{yang2026paravt,
74
+ title={{ParaVT}: From Format Fragility to Parallel Tool Mastery in Agentic Video {RL}},
75
+ author={Zuhao Yang and others},
76
+ year={2026},
77
+ archivePrefix={arXiv},
78
+ primaryClass={cs.CV}
79
+ }
80
+ ```
81
+
82
+ ## Acknowledgements
83
+
84
+ ParaVT builds on the [LongVT](https://github.com/EvolvingLMMs-Lab/LongVT) (CVPR 2026) framework for native video tool calling, the [`lmms-engine`](https://github.com/EvolvingLMMs-Lab/lmms-engine) cold-start SFT infrastructure, the [`AReaL`](https://github.com/inclusionAI/AReaL) RL training stack, and the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval) evaluation harness. We thank the maintainers of all of the above.