qwen3-4b-rlvmr-alfworld-v7

A LoRA adapter trained with RLVMR (GRPO-MR) on top of Kaito-F/qwen3-4b-sft-v13-mixed-rlvmr.

Method: RLVMR (Reinforcement Learning with Verifiable Meta-Reasoning Rewards)

Based on the RLVMR paper (arXiv:2507.22844):

  • Dense meta-reasoning rewards for planning, exploration, reflection, monitoring
  • GRPO-MR advantage: alpha=0.5 (outcome + tag-level grouping)
  • Format reward penalty for missing meta-reasoning tags
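The GRPO-MR advantage described above can be sketched as a blend of two group-normalized reward streams. This is a minimal illustration, not the training code: the function name, the separate-normalization scheme, and the exact blending formula are assumptions; only alpha=0.5 and the outcome/tag split come from this card.

```python
import numpy as np

def grpo_mr_advantage(outcome_rewards, tag_rewards, alpha=0.5, eps=1e-8):
    """Hypothetical GRPO-MR advantage: blend outcome-level and tag-level
    group-normalized advantages over one rollout group.

    outcome_rewards: task success signal per trajectory.
    tag_rewards: dense meta-reasoning rewards (planning, exploration,
                 reflection, monitoring), minus any format penalty for
                 missing meta-reasoning tags.
    """
    r_out = np.asarray(outcome_rewards, dtype=float)
    r_tag = np.asarray(tag_rewards, dtype=float)
    # GRPO-style normalization within the group, per reward stream.
    adv_out = (r_out - r_out.mean()) / (r_out.std() + eps)
    adv_tag = (r_tag - r_tag.mean()) / (r_tag.std() + eps)
    # alpha=0.5 as in this run: equal weight on outcome and tag grouping.
    return alpha * adv_out + (1 - alpha) * adv_tag
```

With a group of K=6 rollouts per environment (as in this run), each call would take the 6 outcome rewards and 6 aggregated tag rewards for one environment.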

Rollout Statistics

  • Total trajectories: 300
  • Overall success: 43.7%
  • Tasks: all 6 ALFWorld types
  • K=6 per environment, max_steps=30

Task Success

  • put: 31/48 (64.6%)
  • clean: 25/60 (41.7%)
  • heat: 6/30 (20.0%)
  • cool: 7/36 (19.4%)
  • examine: 28/54 (51.9%)
  • puttwo: 34/72 (47.2%)
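The overall 43.7% figure follows directly from the per-task counts; a quick check:

```python
# Per-task success counts (successes, trajectories) from the table above.
results = {
    "put": (31, 48),
    "clean": (25, 60),
    "heat": (6, 30),
    "cool": (7, 36),
    "examine": (28, 54),
    "puttwo": (34, 72),
}
successes = sum(s for s, _ in results.values())
total = sum(n for _, n in results.values())
print(successes, total, round(100 * successes / total, 1))  # 131 300 43.7
```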

Training Configuration

  • Base model: Kaito-F/qwen3-4b-sft-v13-mixed-rlvmr
  • LoRA: r=16, alpha=32
  • LR: 2e-06, Epochs: 1
  • KL coefficient: 0.01
  • PPO clip epsilon: 0.2
  • Grad clip norm: 1.0
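The KL coefficient and clip epsilon above slot into a standard clipped PPO surrogate. A minimal sketch, assuming a k1 KL estimate against the frozen SFT reference policy; the actual loss implementation in the training code may differ:

```python
import numpy as np

def ppo_loss(logp_new, logp_old, logp_ref, advantages,
             clip_eps=0.2, kl_coef=0.01):
    """Clipped PPO surrogate plus a KL penalty to the reference policy.

    Uses this run's hyperparameters (clip_eps=0.2, kl_coef=0.01);
    the loss structure itself is an illustrative assumption.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -np.minimum(unclipped, clipped).mean()
    # Simple k1 estimator of KL(new || ref) over the sampled tokens.
    kl = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl
```

Gradients of this loss would then be clipped to the listed norm of 1.0 before each optimizer step.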
Format: Safetensors · 4B params · BF16