# wordle-grpo-Qwen3-1.7B
A fine-tune of Qwen/Qwen3-1.7B, trained to play Wordle with GRPO via TRL and the OpenEnv framework.
## Training details
- Method: GRPO (Group Relative Policy Optimization)
- Base model: Qwen/Qwen3-1.7B (thinking disabled at inference via `enable_thinking=False`)
- Environment: TextArena Wordle-v0 via OpenEnv, hosted on a Hugging Face Space
- Compute: NVIDIA A100-80GB on Modal (~94 minutes)
- Steps: 62 gradient updates, 4 rollouts per prompt
- Total games played: ~15,800 live Wordle episodes during training
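GRPO scores each prompt's group of rollouts against one another instead of against a learned value baseline: each rollout's reward is normalized by the group's mean and standard deviation to get its advantage. A minimal sketch of that group-relative step in plain Python (the function name and the degenerate-group handling are illustrative, not taken from the TRL implementation):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO-style advantages for one prompt's group of rollouts.

    Each reward is normalized against the group mean and standard
    deviation, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All rollouts scored the same: no learning signal from this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: one prompt, 4 rollouts (the group size used in this run)
advantages = group_relative_advantages([1.0, 0.5, 2.0, 0.5])
```

Rollouts that beat their group average get positive advantages and are reinforced; below-average rollouts are pushed down.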
## Results (50-episode eval)
| Metric | Baseline (Qwen3-1.7B) | GRPO fine-tuned |
|---|---|---|
| Win rate | 0% | 8% |
| Avg shaped reward | 1.33 | 1.44 |
| Avg yellow reward | 0.184 | 0.236 |
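The shaped and yellow rewards reflect partial credit from Wordle's per-letter feedback rather than wins alone. The exact reward function used in training is not documented here; the sketch below shows standard Wordle feedback scoring (greens first, then yellows consuming remaining letter counts so duplicates are not over-credited) with hypothetical per-green/per-yellow weights:

```python
from collections import Counter

def wordle_feedback(guess, secret):
    """Per-letter feedback: 'g' (green), 'y' (yellow), '-' (gray)."""
    feedback = ["-"] * len(guess)
    remaining = Counter()
    # Pass 1: mark exact-position matches, count leftover secret letters.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "g"
        else:
            remaining[s] += 1
    # Pass 2: yellows consume remaining counts, so duplicate guess
    # letters are only credited as often as they appear in the secret.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and remaining[g] > 0:
            feedback[i] = "y"
            remaining[g] -= 1
    return "".join(feedback)

def shaped_reward(guess, secret, green_w=0.2, yellow_w=0.05, win_bonus=1.0):
    # Hypothetical weights: the weights used in this run are not documented.
    fb = wordle_feedback(guess, secret)
    r = green_w * fb.count("g") + yellow_w * fb.count("y")
    if guess == secret:
        r += win_bonus
    return r
```

Under such a scheme, the gap in average yellow reward (0.184 vs 0.236) indicates the fine-tuned model surfaces more correct-but-misplaced letters per game, even in losses.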