ginrummy-qwen3-8b-grpo-lora

A LoRA adapter trained via GRPO (Group Relative Policy Optimization) on Gin Rummy self-play, built on Qwen/Qwen3-8B.

Evaluation Results vs Base Model

Benchmarked against the base Qwen3-8B (served via OpenRouter), with n=300 samples per benchmark and Wilson 95% confidence intervals.

| Benchmark | Base Qwen3-8B | This Model | Delta | 95% CI (this model) | Significant? |
|---|---|---|---|---|---|
| GSM8K (math) | 96.3% | 91.1% | -5.2% | [87.9%, 93.5%] | Borderline |
| ARC Challenge (science) | 70.7% | 96.3% | +25.6% | [93.5%, 97.9%] | Yes |
| TruthfulQA (factual) | 65.7% | 69.0% | +3.3% | [63.6%, 74.0%] | No |
| MMLU-Pro (knowledge) | 70.3% | 59.7% | -10.6% | [54.1%, 65.1%] | Yes |
| HellaSwag (commonsense) | 69.0% | 73.3% | +4.3% | [68.0%, 78.0%] | No |

Key findings:

  • Large improvement on ARC Challenge (+25.6 points), plausibly due to improved multi-step reasoning from RL training
  • Regressions on MMLU-Pro (-10.6 points) and GSM8K (-5.2 points), consistent with RL fine-tuning trading broad knowledge for task-specific reasoning
  • TruthfulQA and HellaSwag differences fall within the confidence intervals and are not statistically significant
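The reported intervals can be reproduced with the standard Wilson score formula. Below is a minimal sketch; the exact per-benchmark success counts behind the rounded percentages are not published here, so recomputed bounds may differ from the table by a few tenths of a point.

```python
import math

def wilson_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    p_hat: observed accuracy (e.g. 0.911 for 91.1%)
    n:     number of samples (300 per benchmark here)
    z:     normal quantile (1.96 for a 95% interval)
    """
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# GSM8K row: 91.1% accuracy over 300 samples
lo, hi = wilson_ci(0.911, 300)  # ≈ (0.873, 0.938)
```

Note that the interval is asymmetric around the point estimate, which is why the table's bounds are not centered on each accuracy figure.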

Training Details

| Parameter | Value |
|---|---|
| Method | GRPO (TRL GRPOTrainer) |
| Base model | Qwen/Qwen3-8B |
| Training steps | 200 (800 games) |
| Learning rate | 1e-6 |
| Training setup | Self-play vs algorithmic bot (GinRummyBot) |
| Win rate achieved | 16.8% |
| Hardware | Together AI 8x H100 80GB |
| Training time | 11.9 minutes |

LoRA Configuration

r = 16
lora_alpha = 32
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
task_type = "CAUSAL_LM"
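Expressed as a PEFT `LoraConfig`, the settings above correspond to the following sketch (the dropout value is not stated in the training details and is left at the library default):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # LoRA rank
    lora_alpha=32,         # scaling factor (alpha / r = 2x)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
```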

Training Hyperparameters

  • enable_thinking=False (no reasoning tokens)
  • num_generations=4
  • Sparse terminal reward only (+1/-1/0)
  • Minimal prompt (no tool use)

Usage

With PEFT (direct loading)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "GoodStartLabs/ginrummy-qwen3-8b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

With vLLM (serving)

vllm serve Qwen/Qwen3-8B \
  --enable-lora \
  --max-lora-rank 16 \
  --lora-modules ginrummy=GoodStartLabs/ginrummy-qwen3-8b-grpo-lora \
  --max-model-len 4096 \
  --enforce-eager \
  --port 8000

Then query via OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="ginrummy",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
)

Merge into base model (standalone)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "GoodStartLabs/ginrummy-qwen3-8b-grpo-lora")
model = model.merge_and_unload()
model.save_pretrained("./qwen3-8b-ginrummy-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("./qwen3-8b-ginrummy-merged")

Limitations

  • This is a baseline run (run 9 in the iteration series) with no reasoning tokens or tool use
  • The 16.8% win rate against GinRummyBot indicates early-stage training; further iterations are planned
  • See the experiment log for the full iteration history

Eval Methodology

Evaluations were run using Inspect AI (v0.3.x). The fine-tuned model was served via vLLM 0.18.0 on an A100-80GB; the base model was accessed via OpenRouter. Full results with Wilson CIs are available at GoodStartLabs/huggingface-evals.
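An evaluation against the vLLM endpoint above follows the standard Inspect AI CLI pattern. This is a hedged sketch: the task name comes from the separate `inspect_evals` package, and the model/base-URL flags assume the OpenAI-compatible server from the Usage section.

```shell
pip install inspect-ai inspect-evals

# Point Inspect at the local vLLM OpenAI-compatible endpoint serving the adapter
OPENAI_API_KEY=unused inspect eval inspect_evals/gsm8k \
  --model openai/ginrummy \
  --model-base-url http://localhost:8000/v1 \
  --limit 300
```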
