Qwen2.5-7B-Instruct GRPO (AlfWorld)

Qwen2.5-7B-Instruct fine-tuned with GRPO (Group Relative Policy Optimization) on AlfWorld, using veRL on 4x NVIDIA B200 GPUs.

Results

Metric                   Value
Best val success rate    86.7%
Best step                105 (saved checkpoint: step 100)

Per-Task Breakdown

Task type                 Val success rate
pick_clean_then_place     96.0%
look_at_obj_in_light      80.0%
pick_and_place            75.9%
pick_cool_then_place      71.9%
pick_heat_then_place      71.4%
pick_two_obj_and_place    66.7%
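
For quick comparison across task types, the breakdown can be summarized with a short script. This is only a sketch over the six values listed above: it computes an unweighted (macro) average, which need not match the headline success rate, since the card does not state the number of validation episodes per task type or the step at which the breakdown was measured. The variable names are illustrative.

```python
# Per-task validation success rates (%), copied from the breakdown above.
val_sr = {
    "pick_clean_then_place": 96.0,
    "look_at_obj_in_light": 80.0,
    "pick_and_place": 75.9,
    "pick_cool_then_place": 71.9,
    "pick_heat_then_place": 71.4,
    "pick_two_obj_and_place": 66.7,
}

# Unweighted mean: every task type counts equally, regardless of how many
# validation episodes it has (those counts are not given in the card).
macro_avg = sum(val_sr.values()) / len(val_sr)

best_task = max(val_sr, key=val_sr.get)
worst_task = min(val_sr, key=val_sr.get)

print(f"macro average: {macro_avg:.1f}%")  # macro average: 77.0%
print(f"best:  {best_task} ({val_sr[best_task]}%)")
print(f"worst: {worst_task} ({val_sr[worst_task]}%)")
```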

Training Config

  • Framework: veRL (FSDP + vLLM)
  • Algorithm: GRPO, group size n=8, batch size 16, learning rate 1e-6
  • Hardware: 4x NVIDIA B200 180GB
  • Max response tokens: 512
  • W&B: msuczk/agentmem-rl/32tpsj6p
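
The n=8 setting is the GRPO group size: several rollouts are sampled for the same prompt, and each rollout's reward is normalized against the others in its group. The sketch below illustrates that step under stated assumptions. It assumes a binary episode-success reward (the card does not describe the reward), assumes batch=16 counts prompts per step, and uses the population standard deviation (implementations differ on this detail). `group_advantages` is a hypothetical helper, not a veRL function.

```python
from statistics import mean, pstdev

GROUP_SIZE = 8         # n=8 from the training config
PROMPTS_PER_STEP = 16  # assumption: batch=16 counts prompts, not rollouts


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for rollouts that share one prompt.

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: 5 of 8 rollouts solved the task (reward 1.0).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
advantages = group_advantages(rewards)

# Under the batch assumption above, rollouts sampled per training step.
rollouts_per_step = GROUP_SIZE * PROMPTS_PER_STEP

for r, a in zip(rewards, advantages):
    print(f"reward={r:.0f}  advantage={a:+.3f}")
```

Successful rollouts get a positive advantage and failed ones a negative advantage. A group in which every rollout succeeds or every rollout fails yields all-zero advantages, so it contributes no policy-gradient signal.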

Model tree for zkchen/qwen25-7b-grpo-alfworld

Base model: Qwen/Qwen2.5-7B