---
license: apache-2.0
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
tags:
  - grpo
  - trl
  - video
  - video-text-to-text
  - planner
  - long-video
pipeline_tag: video-text-to-text
---

# ToolMerge planner — GRPO-finetuned Qwen3-VL-8B (step 50)

GRPO-finetuned planner from
[Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct), used
as the text-only query decomposer in the ToolMerge keyframe-retrieval pipeline.

Trained with TRL's GRPO trainer on Molmo-2 Moments (M2M) training data,
optimizing the `frames-in-GT` + `consistency` reward at `global_step=50`.

## Quick start

```python
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("michalsr/toolmerge-planner-grpo")
model = AutoModelForCausalLM.from_pretrained(
    "michalsr/toolmerge-planner-grpo",
    torch_dtype="bfloat16",
)
```

To use inside ToolMerge, override the planner checkpoint at the CLI:

```bash
toolmerge config=configs/m2m/qwen3_8.yaml \
    model.base=michalsr/toolmerge-planner-grpo
```

## Training recipe

| Setting | Value |
|---|---|
| Base model | `Qwen/Qwen3-VL-8B-Instruct` |
| Reward | `frames_in_gt=1.0`, `consistency=1.0` |
| Training data | `train_correct_uniform_8f_clip_max1.json` (filtered M2M train split, ~1500 items) |
| Optimizer | `paged_adamw_8bit`, lr=1e-6, bf16 |
| Compute | 2 nodes × 4 GPUs |
| Step | `global_step=50` |
| Framework | TRL 0.27.2, transformers 4.57.6, PyTorch 2.10.0 |

Full training config: [`training/configs/m2m_grpo.yaml`](https://github.com/michalsr/ToolMerge/blob/main/training/configs/m2m_grpo.yaml)
in the ToolMerge repo.

## Citation

```bibtex
@misc{shlapentokhrothman2026decomposingqueriestoolcalls,
  title         = {Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval},
  author        = {Michal Shlapentokh-Rothman and Prachi Garg and Yu-Xiong Wang and Derek Hoiem},
  year          = {2026},
  eprint        = {2605.23826},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.23826},
}
```

Cite the GRPO method:

```bibtex
@article{shao2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}
```

Code repo: <https://github.com/michalsr/ToolMerge>.