# ginrummy-qwen3-8b-grpo-lora
A LoRA adapter trained via GRPO (Group Relative Policy Optimization) on Gin Rummy self-play, built on Qwen/Qwen3-8B.
## Evaluation Results vs Base Model
Benchmarked against the base Qwen3-8B (via OpenRouter) with n=300 samples per benchmark and Wilson 95% confidence intervals.
| Benchmark | Base Qwen3-8B | This Model | Delta | 95% CI (this model) | Significant? |
|---|---|---|---|---|---|
| GSM8K (math) | 96.3% | 91.1% | -5.2% | [87.9%, 93.5%] | Borderline |
| ARC Challenge (science) | 70.7% | 96.3% | +25.6% | [93.5%, 97.9%] | Yes |
| TruthfulQA (factual) | 65.7% | 69.0% | +3.3% | [63.6%, 74.0%] | No |
| MMLU-Pro (knowledge) | 70.3% | 59.7% | -10.6% | [54.1%, 65.1%] | Yes |
| HellaSwag (commonsense) | 69.0% | 73.3% | +4.3% | [68.0%, 78.0%] | No |
**Key findings:**
- Large improvement on ARC Challenge (+25.6 points), likely due to improved strategic reasoning from RL training
- Regressions on MMLU-Pro (-10.6 points) and GSM8K (-5.2 points), consistent with RL fine-tuning trading broad knowledge for task-specific reasoning
- TruthfulQA and HellaSwag differences are within noise
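The reported intervals use the standard Wilson score formula for a binomial proportion; a minimal sketch in pure Python (exact bounds in the table may differ slightly depending on rounding and the per-benchmark sample count actually used):

```python
import math

def wilson_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# e.g. this model's GSM8K accuracy of 91.1% over n=300 samples
lo, hi = wilson_ci(0.911, 300)
print(f"[{lo:.1%}, {hi:.1%}]")
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and remains well-behaved for accuracies near 100%, which matters for scores like ARC's 96.3%.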
## Training Details
| Parameter | Value |
|---|---|
| Method | GRPO (TRL GRPOTrainer) |
| Base model | Qwen/Qwen3-8B |
| Training steps | 200 (800 games) |
| Learning rate | 1e-6 |
| Training setup | Self-play vs algorithmic bot (GinRummyBot) |
| Win rate achieved | 16.8% |
| Hardware | Together AI 8x H100 80GB |
| Training time | 11.9 minutes |
## LoRA Configuration
```python
r = 16
lora_alpha = 32
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
task_type = "CAUSAL_LM"
```
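With r=16 on the four attention projections, the adapter is tiny relative to the 8B base. A rough trainable-parameter count, assuming Qwen3-8B's published shapes (36 layers, hidden size 4096, grouped-query attention with 8 KV heads of dim 128 — verify against the model's `config.json` before relying on these numbers):

```python
# LoRA adds two matrices per target module: A (d_in x r) and B (r x d_out),
# so each module contributes r * (d_in + d_out) trainable parameters.
# Projection shapes below are assumed from Qwen3-8B's config.
r = 16
layers = 36
shapes = {
    "q_proj": (4096, 4096),  # 32 query heads * head_dim 128
    "k_proj": (4096, 1024),  # 8 KV heads * head_dim 128 (GQA)
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * layers
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions that is about 15.3M trainable parameters, roughly 0.2% of the base model — which is consistent with the short 11.9-minute training run reported above.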
## Training Hyperparameters
- `enable_thinking=False` (no reasoning tokens)
- `num_generations=4`
- Sparse terminal reward only (+1/-1/0)
- Minimal prompt (no tool use)
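The sparse terminal reward can be written as a simple function of the game outcome. A hypothetical sketch (function and outcome names are illustrative, not the actual training code):

```python
def terminal_reward(outcome: str) -> float:
    """Sparse terminal reward: +1 for a win, -1 for a loss, 0 otherwise.

    Intermediate moves receive no signal; GRPO's group-relative baseline
    (here, num_generations=4 completions per prompt) converts these sparse
    terminal outcomes into per-completion advantages.
    """
    return {"win": 1.0, "loss": -1.0}.get(outcome, 0.0)
```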
## Usage

### With PEFT (direct loading)
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "GoodStartLabs/ginrummy-qwen3-8b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
```
### With vLLM (serving)
```shell
vllm serve Qwen/Qwen3-8B \
  --enable-lora \
  --max-lora-rank 16 \
  --lora-modules ginrummy=GoodStartLabs/ginrummy-qwen3-8b-grpo-lora \
  --max-model-len 4096 \
  --enforce-eager \
  --port 8000
```
Then query via OpenAI-compatible API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="ginrummy",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
)
```
### Merge into base model (standalone)
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "GoodStartLabs/ginrummy-qwen3-8b-grpo-lora")
model = model.merge_and_unload()
model.save_pretrained("./qwen3-8b-ginrummy-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("./qwen3-8b-ginrummy-merged")
```
## Limitations
- This is a baseline run (run 9 of iteration series) with no reasoning tokens or tool use
- Win rate of 16.8% indicates early-stage training; further iterations expected
- See the experiment log for the full iteration history
## Eval Methodology

Evaluations were run using Inspect AI (v0.3.x). The fine-tuned model was served via vLLM 0.18.0 on an A100-80GB; the base model was accessed via OpenRouter. Full results with Wilson CIs are available at GoodStartLabs/huggingface-evals.