wordle-grpo-Qwen3-1.7B

A fine-tune of Qwen/Qwen3-1.7B, trained to play Wordle using GRPO via TRL and the OpenEnv environment framework.

Training details

  • Method: GRPO (Group Relative Policy Optimization)
  • Base model: Qwen/Qwen3-1.7B (thinking disabled at inference — enable_thinking=False)
  • Environment: TextArena Wordle-v0 via OpenEnv, hosted on a Hugging Face Space
  • Compute: NVIDIA A100-80GB on Modal (~94 minutes)
  • Steps: 62 gradient updates, 4 rollouts per prompt
  • Total games played: ~15,800 live Wordle episodes during training
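
GRPO (Group Relative Policy Optimization) scores each of the 4 rollouts per prompt against the others in its group rather than against a learned value function: rewards are normalized by the group's mean and standard deviation to produce advantages. A minimal sketch of that group-relative normalization (illustrative only; TRL's GRPOTrainer handles this internally during training):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Compute GRPO-style advantages for one group of rollouts.

    Each rollout's reward is normalized by the group's mean and std,
    so above-average rollouts get positive advantages and below-average
    ones get negative advantages. With 4 rollouts per prompt (as in this
    run), each group contains 4 rewards.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the advantages are centered on the group mean, they always sum to zero within a group; the policy gradient then pushes probability mass toward the rollouts that beat their group.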

Results (50-episode eval)

| Metric | Baseline (Qwen3-1.7B) | GRPO fine-tuned |
|---|---|---|
| Win rate | 0% | 8% |
| Avg shaped reward | 1.33 | 1.44 |
| Avg yellow reward | 0.184 | 0.236 |
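
The "shaped" and "yellow" rewards above imply partial credit for per-letter Wordle feedback (greens and yellows) on top of the win signal. The exact shaping used in training is not documented here, so the following is a hypothetical sketch: `win_bonus`, `green_w`, and `yellow_w` are assumed weights, not the values from the actual run.

```python
def wordle_feedback(guess: str, target: str) -> list[str]:
    """Per-letter Wordle feedback: 'G' (green), 'Y' (yellow), '-' (gray)."""
    feedback = ["-"] * len(guess)
    counts: dict[str, int] = {}
    # First pass: mark greens; count target letters not consumed by a green.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "G"
        else:
            counts[t] = counts.get(t, 0) + 1
    # Second pass: mark yellows, respecting letter multiplicity.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and counts.get(g, 0) > 0:
            feedback[i] = "Y"
            counts[g] -= 1
    return feedback

def shaped_reward(guess: str, target: str, win_bonus: float = 1.0,
                  green_w: float = 0.2, yellow_w: float = 0.05) -> float:
    """Hypothetical shaped reward: win bonus plus credit per green/yellow."""
    fb = wordle_feedback(guess, target)
    reward = green_w * fb.count("G") + yellow_w * fb.count("Y")
    if guess == target:
        reward += win_bonus
    return reward
```

Shaping like this gives the policy a learning signal even on lost games, which matters when the baseline's win rate is 0%: the yellow-reward column tracks exactly this kind of partial credit.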

Visualize in Trackio
