# Gin Rummy HBC - Qwen3.5 2B
Behavioral cloning model for Gin Rummy trained via supervised fine-tuning on expert trajectories.
This model was trained on 32,000 stratified expert game states to learn optimal Gin Rummy decision-making. It serves as the initialization for subsequent GRPO (Group Relative Policy Optimization) self-play training.
## Model Details
- Model type: Causal language model (decoder-only transformer)
- Base model: Qwen/Qwen3.5-2B
- Parameters: 2B parameters
- Training method: LoRA (Low-Rank Adaptation) fine-tuning
- Task: Gin Rummy move prediction
- License: Apache 2.0
## Training Data
Dataset: GoodStartLabs/gin-rummy-trajectories-32k
- Training samples: 32,000 (stratified sampling, minimum 1,000 per action type)
- Validation samples: 1,000 (perfectly balanced, 200 per action type)
- Source: Expert agent gameplay using Monte Carlo Tree Search (MCTS)
Action distribution (training set):
- `discard` (discard a card): 44.6%
- `draw` (draw from stock): 33.1%
- `+discard` (pick from discard pile): 14.9%
- `KNOCK-[card]` (knock and discard): 4.0%
- `pass` (pass on upcard): 3.5%
Validation set: Perfectly balanced with exactly 200 samples per action type for unbiased evaluation.
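For illustration, the stratification above (a minimum quota per action type, with the remainder filled at random) can be sketched as follows; `stratified_sample` and its parameters are hypothetical helpers, not the actual data pipeline:

```python
import random
from collections import defaultdict

def stratified_sample(examples, label_fn, total, min_per_type, seed=0):
    """Draw `total` examples while guaranteeing at least `min_per_type` per action type."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for ex in examples:
        by_type[label_fn(ex)].append(ex)
    picked = []
    # First pass: satisfy the per-type quota.
    for group in by_type.values():
        rng.shuffle(group)
        picked.extend(group[:min_per_type])
    # Second pass: fill the remainder from the leftover pool at random,
    # so the overall mix still roughly tracks the natural action distribution.
    leftovers = [ex for group in by_type.values() for ex in group[min_per_type:]]
    rng.shuffle(leftovers)
    picked.extend(leftovers[: total - len(picked)])
    rng.shuffle(picked)
    return picked
```

The balanced validation split is the degenerate case where the quota equals the per-type target and no remainder is drawn.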
## Training Procedure
Fine-tuning platform: Together AI (serverless LoRA training)
Hyperparameters:
- LoRA rank: 16 (0.8B, 2B) / 32 (4B)
- LoRA alpha: 16 (0.8B, 2B) / 32 (4B)
- LoRA dropout: 0.05
- LoRA modules: all-linear
- Learning rate: 1e-4 (0.8B) / 5e-5 (2B, 4B)
- Batch size: 8
- Epochs: 3
- Warmup ratio: 0.1
- Weight decay: 0.01
- Max gradient norm: 1.0
- Train on inputs: False (loss calculated only on assistant response tokens)
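As a rough sketch, the 2B adapter settings above would correspond to a `peft` `LoraConfig` along these lines (assuming the Hugging Face `peft` library; the actual Together AI training config may differ):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # LoRA rank for the 2B model
    lora_alpha=16,                # scaling alpha, matched to the rank
    lora_dropout=0.05,
    target_modules="all-linear",  # adapt every linear layer
    task_type="CAUSAL_LM",
)
```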
Training duration: ~2-4 hours per model
Infrastructure:
- Platform: Together AI
- GPUs: NVIDIA H100 (serverless)
- Precision: bfloat16
## Intended Use

### Primary Use Case

This model serves as the warm-start initialization for GRPO self-play training:

1. **HBC (Behavioral Cloning)** ← this model
   - Learn from expert trajectories
   - Acquire a strong baseline policy
   - Converge quickly to competent play
2. **GRPO (Group Relative Policy Optimization)** ← next stage
   - Self-play reinforcement learning
   - Discover novel strategies
   - Optimize for win rate
### Inference

The model predicts the next action given the current game state, formatted as a chat conversation.

Input format:

```
[SYSTEM]
You are an expert Gin Rummy player. Your goal is to minimize deadwood and form melds.
[USER]
History:
1. You: +D6x -C3
2. Opp: draw -CK
Now:
Hand: CK D2 D3 D4 D5 D6 D9 H7 HK HQ S9
Stock: 28 | Deadwood: 45 | Phase: discard_or_knock
YOUR TURN | Can: no
[ASSISTANT]
```

Output (predicted action):

```
-H7
```
Action format:

- `draw` - Draw from the stock pile
- `+discard` - Pick from the discard pile
- `-[CARD]` - Discard a card (e.g., `-H7` = discard the 7 of Hearts)
- `KNOCK-[CARD]` - Knock and discard (e.g., `KNOCK-C3`)
- `pass` - Pass on the initial upcard

Card notation: suit (C/D/H/S) followed by rank (A/2-9/T/J/Q/K)

- Examples: `H7` = 7 of Hearts, `CK` = King of Clubs, `SA` = Ace of Spades
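Since the model emits free-form text, downstream code may want to validate generated actions against this grammar before applying them. A minimal parser sketch (the function names are ours, not part of the release; the point values follow standard Gin Rummy scoring):

```python
import re

# Card = suit (C/D/H/S) followed by rank (A, 2-9, T, J, Q, K), e.g. "H7", "CK".
CARD = r"[CDHS][A2-9TJQK]"
ACTION_RE = re.compile(rf"^(draw|\+discard|pass|-(?P<discard>{CARD})|KNOCK-(?P<knock>{CARD}))$")

def parse_action(text):
    """Return (kind, card) for a model-emitted action string, or None if invalid."""
    m = ACTION_RE.match(text.strip())
    if not m:
        return None
    if m.group("discard"):
        return ("discard", m.group("discard"))
    if m.group("knock"):
        return ("knock", m.group("knock"))
    return (m.group(1), None)

def card_points(card):
    """Deadwood value of a card: ace = 1, T/J/Q/K = 10, others = pip value."""
    rank = card[1]
    if rank == "A":
        return 1
    if rank in "TJQK":
        return 10
    return int(rank)
```

Rejecting malformed outputs (returning `None`) and falling back to a legal move is a simple guard against the occasional ungrammatical generation.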
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "GoodStartLabs/gin-rummy-hbc-qwen3.5-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)

# Format game state as chat
messages = [
    {
        "role": "system",
        "content": "You are an expert Gin Rummy player. Your goal is to minimize deadwood and form melds.",
    },
    {
        "role": "user",
        "content": """History:
1. Opp: draw -SQ
2. You: draw(DT) -DT
Now:
Hand: C9 D3 D9 H3 H6 HJ HQ HT S6 S9
Stock: 22 | Deadwood: 18 | Phase: draw
YOUR TURN | Can: no""",
    },
]

# Generate prediction
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,  # greedy decoding for deterministic play
)
action = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
print(f"Predicted action: {action}")
```
## Limitations
- Behavioral cloning ceiling: Model is limited by the quality of expert demonstrations. Cannot exceed expert performance without RL.
- Distribution shift: May struggle on game states not represented in training data.
- Stochastic policy: Model predicts a distribution over actions; greedy decoding gives deterministic play but may not explore optimally.
- No opponent modeling: Does not explicitly model opponent strategy (though may learn implicit patterns from game history).
- Fixed strategy: Cannot adapt during a game; uses the same policy throughout.
## Evaluation
Validation accuracy (on balanced 1K validation set):
- Overall: TBD (check W&B: good-start-labs/gin-rummy-hbc)
- Per action type: TBD
Win rate vs. baselines:
- Random policy: TBD
- Greedy heuristic: TBD
- Expert policy: TBD
## Ethical Considerations
This model is trained for the game of Gin Rummy and should only be used for:
- Game AI research
- Educational purposes
- Entertainment (single-player practice, AI opponents)
Not intended for:
- Real-money gambling
- Cheating in online games
- Deceptive or manipulative applications
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{gin-rummy-hbc-2b,
  author       = {Good Start Labs},
  title        = {Gin Rummy HBC - Qwen3.5 2B},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{GoodStartLabs/gin-rummy-hbc-qwen3.5-2b}},
}
```
## Model Card Authors
- Good Start Labs
- Contact: GitHub
## Model Card Contact
For questions or issues with this model:
- Open an issue on the model repository
- Check W&B training logs
Model trained on Together AI • Base model: Qwen3.5 • Training date: March 2026