Qwen3-32B-NL2Bash-31step

Qwen3-32B fine-tuned with reinforcement learning on NL2Bash terminal tasks.

Training Details

  • Base model: Qwen/Qwen3-32B
  • Training method: RLOO (async)
  • Training data: 1,570 NL2Bash tasks (DCAgent2/nl2bash-tasks-cleaned-oracle)
  • Steps: 31 global steps (3 epochs)
  • Infrastructure: 17 nodes × 4 GH200 GPUs (JSC); FSDP2, with TP=2 for the inference engines (26 inference engines + 4 policy/reference nodes)
  • Sandbox environment: Beta9/Beam containers for code execution
  • Batch size: 64, 8 samples per prompt
  • Learning rate: 1e-5
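
With 8 samples per prompt, RLOO computes each sample's advantage against a leave-one-out baseline: the mean reward of the other 7 samples for the same prompt. A minimal sketch of that computation (the function name and shape are illustrative, not taken from the actual training code):

```python
from typing import List

def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for the K reward samples of one prompt.

    For sample i, the baseline is the mean reward of the other K-1
    samples, so a_i = r_i - (sum(r) - r_i) / (K - 1).
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# With 8 samples per prompt and binary task rewards, a single success
# gets a large positive advantage and the failures small negative ones:
advantages = rloo_advantages([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

One consequence of the leave-one-out baseline is that the advantages for a prompt always sum to zero, so prompts where all 8 samples pass (or all fail) contribute no gradient signal.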

Training Curve

Metric          Step 1   Step 10   Step 20   Step 31
Avg Raw Reward  0.214    0.314     0.416     0.264
Pass@8          0.563    0.563     0.594     0.422
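
Pass@8 here is presumably the fraction of prompts for which at least one of the 8 samples passes. The standard unbiased pass@k estimator (from n samples with c passing) generalizes this; a short sketch, with illustrative names:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn for a prompt
    c: samples that passed
    k: budget being estimated
    """
    if n - c < k:
        # Too few failures to fill a k-sample draw without a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 8 this reduces to "any sample passed", matching the Pass@8 row; smaller k (e.g. pass@1) can be estimated from the same 8 samples without resampling.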

License

Apache 2.0
