zkchen
/

qwen25-7b-grpo-alfworld

Reinforcement Learning

Model card Files Files and versions

Qwen2.5-7B-Instruct GRPO (AlfWorld)

Fine-tuned with GRPO on AlfWorld using veRL on 4x B200 GPUs.

Results

Metric	Value
Best Val Success Rate	86.7%
Best Step	105 (saved ckpt: 100)

Per-Task Breakdown

Task	Val SR
pick_clean_then_place	96.0%
look_at_obj_in_light	80.0%
pick_and_place	75.9%
pick_cool_then_place	71.9%
pick_heat_then_place	71.4%
pick_two_obj_and_place	66.7%

Training Config

Framework: veRL (FSDP + vLLM)
Algorithm: GRPO, n=8, batch=16, lr=1e-6
Hardware: 4x NVIDIA B200 180GB
Max response tokens: 512
W&B: msuczk/agentmem-rl/32tpsj6p

Downloads last month: 15

Safetensors

Model size

2B params

Tensor type

F32

·

Video Preview

Reinforcement Learning

loading

Model tree for zkchen/qwen25-7b-grpo-alfworld

Base model

Qwen/Qwen2.5-7B

Finetuned

Qwen/Qwen2.5-7B-Instruct

Finetuned

(3214)

this model