wordle-grpo-Qwen3-1.7B

A fine-tune of Qwen/Qwen3-1.7B, trained to play Wordle using GRPO via TRL and the OpenEnv environment framework.

Training details

  • Method: GRPO (Group Relative Policy Optimization)
  • Base model: Qwen/Qwen3-1.7B (thinking disabled at inference — enable_thinking=False)
  • Environment: TextArena Wordle-v0 via OpenEnv, hosted on a Hugging Face Space
  • Compute: NVIDIA A100-80GB on Modal (~94 minutes)
  • Steps: 62 gradient updates, 4 rollouts per prompt
  • Total games played: ~15,800 live Wordle episodes during training
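
GRPO (Group Relative Policy Optimization) scores each of the 4 rollouts per prompt against the others in its group rather than against a learned value function: rewards are normalized by the group's mean and standard deviation to produce advantages. A minimal sketch of that group-relative normalization (illustrative only; TRL's GRPOTrainer handles this internally during training):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Compute GRPO-style advantages for one group of rollouts.

    Each rollout's reward is normalized by the group's mean and std,
    so above-average rollouts get positive advantages and below-average
    ones get negative advantages. With 4 rollouts per prompt (as in this
    run), each group contains 4 rewards.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the advantages are centered on the group mean, they always sum to zero within a group; the policy gradient then pushes probability mass toward the rollouts that beat their group.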

Results (50-episode eval)

| Metric | Baseline (Qwen3-1.7B) | GRPO fine-tuned |
|---|---|---|
| Win rate | 0% | 8% |
| Avg shaped reward | 1.33 | 1.44 |
| Avg yellow reward | 0.184 | 0.236 |
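
The "shaped" and "yellow" rewards above imply partial credit for per-letter Wordle feedback (greens and yellows) on top of the win signal. The exact shaping used in training is not documented here, so the following is a hypothetical sketch: `win_bonus`, `green_w`, and `yellow_w` are assumed weights, not the values from the actual run.

```python
def wordle_feedback(guess: str, target: str) -> list[str]:
    """Per-letter Wordle feedback: 'G' (green), 'Y' (yellow), '-' (gray)."""
    feedback = ["-"] * len(guess)
    counts: dict[str, int] = {}
    # First pass: mark greens; count target letters not consumed by a green.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "G"
        else:
            counts[t] = counts.get(t, 0) + 1
    # Second pass: mark yellows, respecting letter multiplicity.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and counts.get(g, 0) > 0:
            feedback[i] = "Y"
            counts[g] -= 1
    return feedback

def shaped_reward(guess: str, target: str, win_bonus: float = 1.0,
                  green_w: float = 0.2, yellow_w: float = 0.05) -> float:
    """Hypothetical shaped reward: win bonus plus credit per green/yellow."""
    fb = wordle_feedback(guess, target)
    reward = green_w * fb.count("G") + yellow_w * fb.count("Y")
    if guess == target:
        reward += win_bonus
    return reward
```

Shaping like this gives the policy a learning signal even on lost games, which matters when the baseline's win rate is 0%: the yellow-reward column tracks exactly this kind of partial credit.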

Visualize in Trackio
