Video-Text-to-Text
Transformers
Safetensors
qwen3_vl
image-text-to-text
grpo
trl
video
planner
long-video
Instructions to use michalsr/toolmerge-planner-grpo with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use michalsr/toolmerge-planner-grpo with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("michalsr/toolmerge-planner-grpo") model = AutoModelForImageTextToText.from_pretrained("michalsr/toolmerge-planner-grpo") - Notebooks
- Google Colab
- Kaggle
| license: apache-2.0 | |
| base_model: Qwen/Qwen3-VL-8B-Instruct | |
| library_name: transformers | |
| tags: | |
| - grpo | |
| - trl | |
| - video | |
| - video-text-to-text | |
| - planner | |
| - long-video | |
| pipeline_tag: video-text-to-text | |
| # ToolMerge planner — GRPO-finetuned Qwen3-VL-8B (step 50) | |
| GRPO-finetuned planner from | |
| [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct), used | |
| as the text-only query decomposer in the ToolMerge keyframe-retrieval pipeline. | |
| Trained with TRL's GRPO trainer on Molmo-2 Moments (M2M) training data, | |
| optimizing the `frames-in-GT` + `consistency` reward at `global_step=50`. | |
| ## Quick start | |
| ```python | |
| from transformers import AutoProcessor, AutoModelForCausalLM | |
| processor = AutoProcessor.from_pretrained("michalsr/toolmerge-planner-grpo") | |
| model = AutoModelForCausalLM.from_pretrained( | |
| "michalsr/toolmerge-planner-grpo", | |
| torch_dtype="bfloat16", | |
| ) | |
| ``` | |
| To use inside ToolMerge, override the planner checkpoint at the CLI: | |
| ```bash | |
| toolmerge config=configs/m2m/qwen3_8.yaml \ | |
| model.base=michalsr/toolmerge-planner-grpo | |
| ``` | |
| ## Training recipe | |
| | Setting | Value | | |
| |---|---| | |
| | Base model | `Qwen/Qwen3-VL-8B-Instruct` | | |
| | Reward | `frames_in_gt=1.0`, `consistency=1.0` | | |
| | Training data | `train_correct_uniform_8f_clip_max1.json` (filtered M2M train split, ~1500 items) | | |
| | Optimizer | `paged_adamw_8bit`, lr=1e-6, bf16 | | |
| | Compute | 2 nodes × 4 GPUs | | |
| | Step | `global_step=50` | | |
| | Framework | TRL 0.27.2, transformers 4.57.6, PyTorch 2.10.0 | | |
| Full training config: [`training/configs/m2m_grpo.yaml`](https://github.com/michalsr/ToolMerge/blob/main/training/configs/m2m_grpo.yaml) | |
| in the ToolMerge repo. | |
| ## Citation | |
| ```bibtex | |
| @misc{shlapentokhrothman2026decomposingqueriestoolcalls, | |
| title = {Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval}, | |
| author = {Michal Shlapentokh-Rothman and Prachi Garg and Yu-Xiong Wang and Derek Hoiem}, | |
| year = {2026}, | |
| eprint = {2605.23826}, | |
| archivePrefix = {arXiv}, | |
| primaryClass = {cs.CV}, | |
| url = {https://arxiv.org/abs/2605.23826}, | |
| } | |
| ``` | |
| Cite the GRPO method: | |
| ```bibtex | |
| @article{shao2024deepseekmath, | |
| title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}}, | |
| author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo}, | |
| year = 2024, | |
| eprint = {arXiv:2402.03300}, | |
| } | |
| ``` | |
| Code repo: <https://github.com/michalsr/ToolMerge>. | |