Video-Text-to-Text
Transformers
Safetensors
English
qwen3_vl
image-text-to-text
video
long-video
reasoning
tool-calling
agentic-rl
grpo
multimodal
Instructions to use ParaVT/ParaVT-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ParaVT/ParaVT-8B with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("ParaVT/ParaVT-8B") model = AutoModelForImageTextToText.from_pretrained("ParaVT/ParaVT-8B") - Notebooks
- Google Colab
- Kaggle
Initial README
Browse files
README.md
ADDED
|
@@ -0,0 +1,84 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
base_model:
|
| 3 |
+
- Qwen/Qwen3-VL-8B-Instruct
|
| 4 |
+
datasets:
|
| 5 |
+
- ParaVT/ParaVT-Parquet
|
| 6 |
+
- ParaVT/ParaVT-Source
|
| 7 |
+
license: apache-2.0
|
| 8 |
+
library_name: transformers
|
| 9 |
+
pipeline_tag: video-text-to-text
|
| 10 |
+
language:
|
| 11 |
+
- en
|
| 12 |
+
tags:
|
| 13 |
+
- video
|
| 14 |
+
- long-video
|
| 15 |
+
- reasoning
|
| 16 |
+
- tool-calling
|
| 17 |
+
- agentic-rl
|
| 18 |
+
- grpo
|
| 19 |
+
- multimodal
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
# ParaVT: From Format Fragility to Parallel Tool Mastery in Agentic Video RL
|
| 23 |
+
|
| 24 |
+
<div align="center">
|
| 25 |
+
|
| 26 |
+
[](#citation)
|
| 27 |
+
[](https://github.com/mwxely/ParaVT)
|
| 28 |
+
[](https://huggingface.co/datasets/ParaVT/ParaVT-Parquet)
|
| 29 |
+
[](https://huggingface.co/datasets/ParaVT/ParaVT-Source)
|
| 30 |
+
|
| 31 |
+
</div>
|
| 32 |
+
|
| 33 |
+
## Overview
|
| 34 |
+
|
| 35 |
+
Training large multimodal models (LMMs) via reinforcement learning to natively invoke video-processing tools (such as temporal cropping) has become a promising route to long-video understanding. Existing native-RL methods, however, dispatch tool calls sequentially (one per turn): a single wrong crop propagates errors without peer correction, multi-turn calls corrupt context, and inference cost scales linearly with the number of turns.
|
| 36 |
+
|
| 37 |
+
**ParaVT** is the first multi-agent end-to-end RL-trained framework for **Para**llel **V**ideo **T**ool calling: it dispatches multiple time-window crops in a single turn for cleaner context and better fault tolerance. Applying standard RL to ParaVT surfaces an obstacle we term the *Tool Prior Paradox*, where the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose a skip-tool reward shortcut under temperature sampling. We address this with **PARA-GRPO** (Parseability-Anchored and Ratio-gAted GRPO): a targeted format reward applied only at the structural-token positions most prone to collapse, and a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it.
|
| 38 |
+
|
| 39 |
+
## Model Card
|
| 40 |
+
|
| 41 |
+
This repository hosts the final post-RL checkpoint (`ParaVT-8B`), obtained by running PARA-GRPO on top of the cold-start SFT checkpoint [`mwxely/ParaVT-8B-SFT`](https://huggingface.co/mwxely/ParaVT-8B-SFT). The base architecture is `Qwen3VLForConditionalGeneration`, identical to [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct); only the language-model weights are updated.
|
| 42 |
+
|
| 43 |
+
| Field | Value |
|
| 44 |
+
|---|---|
|
| 45 |
+
| Architecture | `Qwen3VLForConditionalGeneration` |
|
| 46 |
+
| Parameters | 8 B |
|
| 47 |
+
| Base model | `Qwen/Qwen3-VL-8B-Instruct` |
|
| 48 |
+
| Training stages | SFT (Plan B, 500 steps) → PARA-GRPO (54 steps) |
|
| 49 |
+
| Training data | [`ParaVT/ParaVT-Parquet`](https://huggingface.co/datasets/ParaVT/ParaVT-Parquet) (`sft` + `rl` configs) |
|
| 50 |
+
| Source videos | [`ParaVT/ParaVT-Source`](https://huggingface.co/datasets/ParaVT/ParaVT-Source) |
|
| 51 |
+
| Native tool | Temporal cropping (start time, end time, optional sub-frame count) |
|
| 52 |
+
|
| 53 |
+
## Usage
|
| 54 |
+
|
| 55 |
+
`ParaVT-8B` is a drop-in `transformers` / `vllm` model for video-text-to-text. The full evaluation driver, prompt templates, and reproduction scripts live in the [ParaVT GitHub repository](https://github.com/mwxely/ParaVT); please refer to it for the exact environment that produced the reported numbers.
|
| 56 |
+
|
| 57 |
+
```bash
|
| 58 |
+
# Reproduce the headline numbers (after installing the eval venv)
|
| 59 |
+
git clone https://github.com/mwxely/ParaVT.git && cd ParaVT
|
| 60 |
+
cp .secrets.env.example .secrets.env && $EDITOR .secrets.env
|
| 61 |
+
bash scripts/setup_env.sh eval
|
| 62 |
+
PARAVT_EVAL_MODEL=ParaVT/ParaVT-8B \
|
| 63 |
+
bash paravt/eval/scripts/reproduce_paravt_8b.sh
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
For inference outside the eval driver, treat the model exactly like `Qwen/Qwen3-VL-8B-Instruct`: vLLM `--model ParaVT/ParaVT-8B`, the same tokenizer, the same chat template. The agentic system prompt and the tool schema used during PARA-GRPO are documented in [`paravt/eval/configs/withtool.yaml`](https://github.com/mwxely/ParaVT/blob/paravt-release/paravt/eval/configs/withtool.yaml) and [`paravt/eval/utils.py`](https://github.com/mwxely/ParaVT/blob/paravt-release/paravt/eval/utils.py).
|
| 67 |
+
|
| 68 |
+
## Citation
|
| 69 |
+
|
| 70 |
+
If you find ParaVT useful for your research and applications, please cite:
|
| 71 |
+
|
| 72 |
+
```bibtex
|
| 73 |
+
@misc{yang2026paravt,
|
| 74 |
+
title={{ParaVT}: From Format Fragility to Parallel Tool Mastery in Agentic Video {RL}},
|
| 75 |
+
author={Zuhao Yang and others},
|
| 76 |
+
year={2026},
|
| 77 |
+
archivePrefix={arXiv},
|
| 78 |
+
primaryClass={cs.CV}
|
| 79 |
+
}
|
| 80 |
+
```
|
| 81 |
+
|
| 82 |
+
## Acknowledgements
|
| 83 |
+
|
| 84 |
+
ParaVT builds on the [LongVT](https://github.com/EvolvingLMMs-Lab/LongVT) (CVPR 2026) framework for native video tool calling, the [`lmms-engine`](https://github.com/EvolvingLMMs-Lab/lmms-engine) cold-start SFT infrastructure, the [`AReaL`](https://github.com/inclusionAI/AReaL) RL training stack, and the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval) evaluation harness. We thank the maintainers of all of the above.
|