michalsr
/

toolmerge-planner-grpo

Video-Text-to-Text

image-text-to-text

Model card Files Files and versions

toolmerge-planner-grpo / README.md

michalsr's picture

Update citation with arXiv bibtex

9053991 verified 2 days ago

|

history blame contribute delete

2.54 kB

	---
	license: apache-2.0
	base_model: Qwen/Qwen3-VL-8B-Instruct
	library_name: transformers
	tags:
	- grpo
	- trl
	- video
	- video-text-to-text
	- planner
	- long-video
	pipeline_tag: video-text-to-text
	---

	# ToolMerge planner — GRPO-finetuned Qwen3-VL-8B (step 50)

	GRPO-finetuned planner from
	[Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct), used
	as the text-only query decomposer in the ToolMerge keyframe-retrieval pipeline.

	Trained with TRL's GRPO trainer on Molmo-2 Moments (M2M) training data,
	optimizing the `frames-in-GT` + `consistency` reward at `global_step=50`.

	## Quick start

	```python
	from transformers import AutoProcessor, AutoModelForCausalLM

	processor = AutoProcessor.from_pretrained("michalsr/toolmerge-planner-grpo")
	model = AutoModelForCausalLM.from_pretrained(
	"michalsr/toolmerge-planner-grpo",
	torch_dtype="bfloat16",
	)
	```

	To use inside ToolMerge, override the planner checkpoint at the CLI:

	```bash
	toolmerge config=configs/m2m/qwen3_8.yaml \
	model.base=michalsr/toolmerge-planner-grpo
	```

	## Training recipe

	\| Setting \| Value \|
	\|---\|---\|
	\| Base model \| `Qwen/Qwen3-VL-8B-Instruct` \|
	\| Reward \| `frames_in_gt=1.0`, `consistency=1.0` \|
	\| Training data \| `train_correct_uniform_8f_clip_max1.json` (filtered M2M train split, ~1500 items) \|
	\| Optimizer \| `paged_adamw_8bit`, lr=1e-6, bf16 \|
	\| Compute \| 2 nodes × 4 GPUs \|
	\| Step \| `global_step=50` \|
	\| Framework \| TRL 0.27.2, transformers 4.57.6, PyTorch 2.10.0 \|

	Full training config: [`training/configs/m2m_grpo.yaml`](https://github.com/michalsr/ToolMerge/blob/main/training/configs/m2m_grpo.yaml)
	in the ToolMerge repo.

	## Citation

	```bibtex
	@misc{shlapentokhrothman2026decomposingqueriestoolcalls,
	title = {Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval},
	author = {Michal Shlapentokh-Rothman and Prachi Garg and Yu-Xiong Wang and Derek Hoiem},
	year = {2026},
	eprint = {2605.23826},
	archivePrefix = {arXiv},
	primaryClass = {cs.CV},
	url = {https://arxiv.org/abs/2605.23826},
	}
	```

	Cite the GRPO method:

	```bibtex
	@article{shao2024deepseekmath,
	title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
	author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
	year = 2024,
	eprint = {arXiv:2402.03300},
	}
	```

	Code repo: <https://github.com/michalsr/ToolMerge>.