RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
This is a LoRA adapter trained with RLVMR (GRPO-MR), starting from Kaito-F/qwen3-4b-sft-v13-mixed-rlvmr.
Based on the RLVMR paper (arXiv:2507.22844):
| Task | Success (count/total, rate) |
|---|---|
| put | 31/48 (64.6%) |
| clean | 25/60 (41.7%) |
| heat | 6/30 (20.0%) |
| cool | 7/36 (19.4%) |
| examine | 28/54 (51.9%) |
| puttwo | 34/72 (47.2%) |
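
The table reports per-task counts only. As a rough sketch, an overall success rate can be derived by pooling those counts. This aggregate is not reported in the source; it assumes the six task groups are disjoint and that each denominator is the number of evaluation attempts for that task.

```python
# Per-task (successes, total) pairs copied from the table above.
results = {
    "put": (31, 48),
    "clean": (25, 60),
    "heat": (6, 30),
    "cool": (7, 36),
    "examine": (28, 54),
    "puttwo": (34, 72),
}

# Pool the counts across tasks (micro-average), so larger task groups
# contribute proportionally more than smaller ones.
total_success = sum(s for s, _ in results.values())
total_attempts = sum(n for _, n in results.values())
overall_rate = total_success / total_attempts

print(f"{total_success}/{total_attempts} ({overall_rate:.1%})")  # 131/300 (43.7%)
```

Pooling weights each task by its size; averaging the six percentages instead would give every task equal weight and a slightly different number.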
Base model: Qwen/Qwen3-4B-Instruct-2507
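
A minimal sketch of how a LoRA adapter like this one might be loaded with `transformers` and `peft`. The adapter repository id below is a hypothetical placeholder, since this card does not state it, and the sketch assumes the adapter applies directly to the listed base model. If the adapter was instead trained on top of the SFT checkpoint, that checkpoint would need to be passed as `base_id`.

```python
BASE_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"  # taken from the base model field
ADAPTER_ID = "your-namespace/your-adapter-repo"  # hypothetical placeholder, not from the card


def load_adapter(base_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model and attach the LoRA adapter weights to it."""
    # Imported inside the function so this file can be read and imported
    # without the heavy dependencies installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_id)
    # Wraps the base model with the adapter; base weights stay unchanged.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    return tokenizer, model
```

Calling `load_adapter()` downloads the base weights, so it needs network access and enough memory for a 4B-parameter model.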