# ginrummy-qwen3-8b-grpo-lora
A LoRA adapter trained via GRPO (Group Relative Policy Optimization) on Gin Rummy self-play, built on Qwen/Qwen3-8B.
## Evaluation Results vs Base Model
Benchmarked against the base Qwen3-8B (via OpenRouter) with n=300 samples per benchmark and Wilson 95% confidence intervals.
| Benchmark | Base Qwen3-8B | This Model | Delta | 95% CI (this model) | Significant? |
|---|---|---|---|---|---|
| GSM8K (math) | 96.3% | 91.1% | -5.2% | [87.9%, 93.5%] | Borderline |
| ARC Challenge (science) | 70.7% | 96.3% | +25.6% | [93.5%, 97.9%] | Yes |
| TruthfulQA (factual) | 65.7% | 69.0% | +3.3% | [63.6%, 74.0%] | No |
| MMLU-Pro (knowledge) | 70.3% | 59.7% | -10.6% | [54.1%, 65.1%] | Yes |
| HellaSwag (commonsense) | 69.0% | 73.3% | +4.3% | [68.0%, 78.0%] | No |
**Key findings:**
- Large improvement on ARC Challenge (+25.6 points), likely due to improved strategic reasoning from RL training
- Regressions on MMLU-Pro (-10.6 points) and GSM8K (-5.2 points), consistent with RL fine-tuning trading broad knowledge for task-specific reasoning
- TruthfulQA and HellaSwag differences are within noise
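The reported intervals use the standard Wilson score formula for a binomial proportion; a minimal sketch in pure Python (exact bounds in the table may differ slightly depending on rounding and the per-benchmark sample count actually used):

```python
import math

def wilson_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# e.g. this model's GSM8K accuracy of 91.1% over n=300 samples
lo, hi = wilson_ci(0.911, 300)
print(f"[{lo:.1%}, {hi:.1%}]")
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and remains well-behaved for accuracies near 100%, which matters for scores like ARC's 96.3%.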
## Training Details
| Parameter | Value |
|---|---|
| Method | GRPO (TRL GRPOTrainer) |
| Base model | Qwen/Qwen3-8B |
| Training steps | 200 (800 games) |
| Learning rate | 1e-6 |
| Training setup | Self-play vs algorithmic bot (GinRummyBot) |
| Win rate achieved | 16.8% |
| Hardware | Together AI 8x H100 80GB |
| Training time | 11.9 minutes |
## LoRA Configuration
```python
r = 16
lora_alpha = 32
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
task_type = "CAUSAL_LM"
```
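With r=16 on the four attention projections, the adapter is tiny relative to the 8B base. A rough trainable-parameter count, assuming Qwen3-8B's published shapes (36 layers, hidden size 4096, grouped-query attention with 8 KV heads of dim 128 — verify against the model's `config.json` before relying on these numbers):

```python
# LoRA adds two matrices per target module: A (d_in x r) and B (r x d_out),
# so each module contributes r * (d_in + d_out) trainable parameters.
# Projection shapes below are assumed from Qwen3-8B's config.
r = 16
layers = 36
shapes = {
    "q_proj": (4096, 4096),  # 32 query heads * head_dim 128
    "k_proj": (4096, 1024),  # 8 KV heads * head_dim 128 (GQA)
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * layers
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions that is about 15.3M trainable parameters, roughly 0.2% of the base model — which is consistent with the short 11.9-minute training run reported above.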
## Training Hyperparameters
- `enable_thinking=False` (no reasoning tokens)
- `num_generations=4`
- Sparse terminal reward only (+1/-1/0)
- Minimal prompt (no tool use)
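The sparse terminal reward can be written as a simple function of the game outcome. A hypothetical sketch (function and outcome names are illustrative, not the actual training code):

```python
def terminal_reward(outcome: str) -> float:
    """Sparse terminal reward: +1 for a win, -1 for a loss, 0 otherwise.

    Intermediate moves receive no signal; GRPO's group-relative baseline
    (here, num_generations=4 completions per prompt) converts these sparse
    terminal outcomes into per-completion advantages.
    """
    return {"win": 1.0, "loss": -1.0}.get(outcome, 0.0)
```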
## Usage

### With PEFT (direct loading)
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "GoodStartLabs/ginrummy-qwen3-8b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
```
### With vLLM (serving)
```shell
vllm serve Qwen/Qwen3-8B \
  --enable-lora \
  --max-lora-rank 16 \
  --lora-modules ginrummy=GoodStartLabs/ginrummy-qwen3-8b-grpo-lora \
  --max-model-len 4096 \
  --enforce-eager \
  --port 8000
```
Then query via OpenAI-compatible API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="ginrummy",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
)
```
### Merge into base model (standalone)
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "GoodStartLabs/ginrummy-qwen3-8b-grpo-lora")
model = model.merge_and_unload()
model.save_pretrained("./qwen3-8b-ginrummy-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("./qwen3-8b-ginrummy-merged")
```
## Limitations
- This is a baseline run (run 9 of iteration series) with no reasoning tokens or tool use
- Win rate of 16.8% indicates early-stage training; further iterations expected
- See the experiment log for the full iteration history
## Eval Methodology

Evaluations were run using Inspect AI (v0.3.x). The fine-tuned model was served via vLLM 0.18.0 on an A100-80GB; the base model was accessed via OpenRouter. Full results with Wilson CIs are available at GoodStartLabs/huggingface-evals.