# Gin Rummy GRPO LoRA v2 – Run 9 (Qwen3-8B)

LoRA adapter trained via GRPO on Gin Rummy with shaped rewards against a WeakKnocker opponent. This is the final run of a nine-run iteration.
| Field | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Method | GRPO (TRL GRPOTrainer) |
| LoRA | r=16, alpha=32, q/k/v/o_proj |
| Steps | 500 (~2000 games) |
| Opponent | WeakKnocker (random draws, smart discards, always knocks) |
| Win rate | 0.5% |
| Infra | Together AI 8x H100 80GB, ~3.5 hours |
| Git commit | 2a987af docs: add Phase 3 (V2 migration) to experiment log |
| W&B | good-start-labs/ginrummy-grpo |
| Dataset | Self-play rollouts vs WeakKnocker bot |
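The WeakKnocker opponent in the table above (random draws, smart discards, always knocks) can be sketched roughly as follows. This is an illustrative reconstruction, not the actual bot from the training repo; class, method, and parameter names are assumptions.

```python
import random


class WeakKnocker:
    """Illustrative sketch of the opponent policy:
    random draws, greedy discards, knocks whenever legal."""

    def choose_draw(self, rng=random):
        # Random draws: pick the stock or the discard pile with equal probability.
        return rng.choice(["stock", "discard"])

    def choose_discard(self, hand, deadwood_values):
        # "Smart" discards: drop the card carrying the most deadwood.
        return max(hand, key=lambda card: deadwood_values[card])

    def should_knock(self, deadwood_total, knock_threshold=10):
        # Always knocks: knock as soon as the rules allow (deadwood <= 10).
        return deadwood_total <= knock_threshold
```

Because the bot knocks at the first legal opportunity, it ends games quickly, which makes it a weak but fast baseline opponent for rollout generation.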
## Config

- enable_thinking=True, simplified rules-only prompt, single vLLM call/turn
- num_generations=8, lr=1e-5, beta=0.1, max_completion_length=2048
- Shaped reward: 40% terminal + 40% deadwood progress + 20% game length
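The 40/40/20 shaped-reward mix can be sketched as below. The function name, argument names, and input ranges are illustrative assumptions, not the run's actual reward code.

```python
def shaped_reward(terminal, deadwood_progress, length_bonus):
    """Illustrative 40/40/20 shaped-reward mix (assumed signature).

    terminal:          terminal outcome signal, e.g. +1 win / -1 loss / 0 otherwise
    deadwood_progress: normalized deadwood reduction over the game, in [-1, 1]
    length_bonus:      bonus in [0, 1] rewarding shorter games
    """
    return 0.4 * terminal + 0.4 * deadwood_progress + 0.2 * length_bonus
```

Weighting deadwood progress and game length alongside the terminal outcome gives the policy a denser signal than win/loss alone, which matters when wins are as rare as the 0.5% rate reported above.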
## Key Finding

GRPO alone cannot teach Gin Rummy from scratch: the game is too complex for random exploration to find winning play, and the reward signal is too sparse. SFT on expert data should precede RL. See the full experiment log.
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype="bfloat16", device_map="auto"
)
model = PeftModel.from_pretrained(base, "GoodStartLabs/ginrummy-qwen3-8b-grpo-lora-v2")
```