# Gin Rummy GRPO LoRA v2 – Run 9 (Qwen3-8B)

LoRA adapter trained via GRPO on Gin Rummy with shaped rewards against a WeakKnocker opponent. This is the final run of a nine-run iteration.
| Field | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Method | GRPO (TRL GRPOTrainer) |
| LoRA | r=16, alpha=32, q/k/v/o_proj |
| Steps | 500 (~2000 games) |
| Opponent | WeakKnocker (random draws, smart discards, always knocks) |
| Win rate | 0.5% |
| Infra | Together AI 8x H100 80GB, ~3.5 hours |
| Git commit | 2a987af docs: add Phase 3 (V2 migration) to experiment log |
| W&B | good-start-labs/ginrummy-grpo |
| Dataset | Self-play rollouts vs WeakKnocker bot |
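The WeakKnocker opponent in the table above (random draws, smart discards, always knocks) can be sketched roughly as follows. This is an illustrative reconstruction, not the actual bot from the training repo; class, method, and parameter names are assumptions.

```python
import random


class WeakKnocker:
    """Illustrative sketch of the opponent policy:
    random draws, greedy discards, knocks whenever legal."""

    def choose_draw(self, rng=random):
        # Random draws: pick the stock or the discard pile with equal probability.
        return rng.choice(["stock", "discard"])

    def choose_discard(self, hand, deadwood_values):
        # "Smart" discards: drop the card carrying the most deadwood.
        return max(hand, key=lambda card: deadwood_values[card])

    def should_knock(self, deadwood_total, knock_threshold=10):
        # Always knocks: knock as soon as the rules allow (deadwood <= 10).
        return deadwood_total <= knock_threshold
```

Because the bot knocks at the first legal opportunity, it ends games quickly, which makes it a weak but fast baseline opponent for rollout generation.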
## Config

- enable_thinking=True, simplified rules-only prompt, single vLLM call/turn
- num_generations=8, lr=1e-5, beta=0.1, max_completion_length=2048
- Shaped reward: 40% terminal + 40% deadwood progress + 20% game length
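The 40/40/20 shaped-reward mix can be sketched as below. The function name, argument names, and input ranges are illustrative assumptions, not the run's actual reward code.

```python
def shaped_reward(terminal, deadwood_progress, length_bonus):
    """Illustrative 40/40/20 shaped-reward mix (assumed signature).

    terminal:          terminal outcome signal, e.g. +1 win / -1 loss / 0 otherwise
    deadwood_progress: normalized deadwood reduction over the game, in [-1, 1]
    length_bonus:      bonus in [0, 1] rewarding shorter games
    """
    return 0.4 * terminal + 0.4 * deadwood_progress + 0.2 * length_bonus
```

Weighting deadwood progress and game length alongside the terminal outcome gives the policy a denser signal than win/loss alone, which matters when wins are as rare as the 0.5% rate reported above.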
## Key Finding

GRPO alone cannot teach Gin Rummy from scratch: the game is too complex for random exploration to find winning play, and the reward signal is too sparse. SFT on expert data should precede RL. See the full experiment log.
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype="bfloat16", device_map="auto"
)
model = PeftModel.from_pretrained(base, "GoodStartLabs/ginrummy-qwen3-8b-grpo-lora-v2")
```