Gin Rummy GRPO LoRA v2 – Run 9 (Qwen3-8B)

LoRA adapter trained via GRPO on Gin Rummy with shaped rewards against a WeakKnocker opponent. Run 9 is the final run of the nine-run iteration series.

| Field | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Method | GRPO (TRL `GRPOTrainer`) |
| LoRA | r=16, alpha=32, targets q/k/v/o_proj |
| Steps | 500 (~2000 games) |
| Opponent | WeakKnocker (random draws, smart discards, always knocks) |
| Win rate | 0.5% |
| Infra | Together AI, 8x H100 80GB, ~3.5 hours |
| Git commit | `2a987af` docs: add Phase 3 (V2 migration) to experiment log |
| W&B | good-start-labs/ginrummy-grpo |
| Dataset | Self-play rollouts vs WeakKnocker bot |
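
The WeakKnocker opponent is described as drawing at random, discarding sensibly, and knocking whenever legal. A minimal sketch of that policy, assuming a simplified hand representation (a list of point values for unmelded cards) rather than the repo's actual game API:

```python
import random


def deadwood(hand):
    # Hypothetical helper: deadwood is the point total of unmelded cards.
    # Here the hand is already reduced to a list of unmelded point values,
    # so deadwood is just their sum.
    return sum(hand)


class WeakKnocker:
    """Illustrative sketch of the WeakKnocker policy: random draws,
    greedy discards, and knocking whenever the rules allow.
    Method names and the hand representation are assumptions."""

    def choose_draw(self):
        # Random draw: pick the stock or the discard pile with no strategy.
        return random.choice(["stock", "discard_pile"])

    def choose_discard(self, hand):
        # "Smart" discard in the simplest sense: shed the card that
        # contributes the most deadwood.
        return max(hand)

    def should_knock(self, hand):
        # Always knock when legal (standard threshold: deadwood <= 10).
        return deadwood(hand) <= 10
```

This makes the opponent a useful curriculum baseline: its draws are noise, but its discards and eager knocking still punish a policy that hoards high deadwood.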

Config

  • enable_thinking=True, simplified rules-only prompt, single vLLM call per turn
  • num_generations=8, lr=1e-5, beta=0.1, max_completion_length=2048
  • Shaped reward: 40% terminal + 40% deadwood progress + 20% game length
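
The 40/40/20 reward split above can be sketched as a single scoring function. The weights come from the config; the normalization of each term (terminal outcome in {0, 0.5, 1}, deadwood progress as a fraction of starting deadwood, length bonus decaying linearly with turns) is an assumption for illustration:

```python
def shaped_reward(outcome, deadwood_start, deadwood_end, turns, max_turns=50):
    """Sketch of the shaped reward: 40% terminal outcome, 40% deadwood
    progress, 20% game-length bonus. The weights match the model card;
    each term's exact normalization is assumed, not confirmed."""
    # Terminal term: win = 1, loss = 0, anything else (e.g. wall game) = 0.5.
    terminal_term = {"win": 1.0, "loss": 0.0}.get(outcome, 0.5)
    # Progress term: fraction of starting deadwood shed during the game.
    progress = max(0.0, (deadwood_start - deadwood_end) / max(deadwood_start, 1))
    # Length term: shorter games score higher, linearly down to 0 at max_turns.
    length_term = max(0.0, 1.0 - turns / max_turns)
    return 0.4 * terminal_term + 0.4 * progress + 0.2 * length_term
```

Dense terms like deadwood progress give GRPO a within-group gradient even when every rollout in a group loses, which is why they dominate the signal at a 0.5% win rate.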

Key Finding

GRPO alone cannot teach gin rummy from scratch. The game is too complex for random exploration. SFT on expert data should precede RL. See full experiment log.

Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then apply the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype="bfloat16", device_map="auto"
)
model = PeftModel.from_pretrained(base, "GoodStartLabs/ginrummy-qwen3-8b-grpo-lora-v2")
```