Qwen2.5-7B-Instruct GRPO (AlfWorld)
Fine-tuned with GRPO on AlfWorld using veRL on 4x B200 GPUs.
Results
| Metric | Value |
|---|---|
| Best Val Success Rate | 86.7% |
| Best Step | 105 (saved ckpt: 100) |
Per-Task Breakdown
| Task | Val SR |
|---|---|
| pick_clean_then_place | 96.0% |
| look_at_obj_in_light | 80.0% |
| pick_and_place | 75.9% |
| pick_cool_then_place | 71.9% |
| pick_heat_then_place | 71.4% |
| pick_two_obj_and_place | 66.7% |
Training Config
- Framework: veRL (FSDP + vLLM)
- Algorithm: GRPO, n=8, batch=16, lr=1e-6
- Hardware: 4x NVIDIA B200 180GB
- Max response tokens: 512
- W&B: msuczk/agentmem-rl/32tpsj6p
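
For readers unfamiliar with GRPO, here is a minimal sketch of the group-relative advantage step that the `n=8` setting refers to: each prompt gets 8 rollouts, and every rollout's reward is normalized against the mean and standard deviation of its own group. This is an illustration only, not veRL's implementation. The function name, the `eps` value, the use of the population standard deviation, and the binary success rewards are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (all rollouts of a single prompt).

    Hypothetical helper for illustration; eps avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for n=8 rollouts of one task: 1.0 = success, 0.0 = failure.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

Successful rollouts end up with positive advantages and failed ones with negative advantages. A group where all 8 rollouts share the same reward yields all-zero advantages, so it contributes no gradient signal.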