Gin Rummy HBC - Qwen3.5 2B

Behavioral cloning model for Gin Rummy trained via supervised fine-tuning on expert trajectories.

This model was trained on 32,000 stratified expert game states to learn optimal Gin Rummy decision-making. It serves as the initialization for subsequent GRPO (Group Relative Policy Optimization) self-play training.

Model Details

  • Model type: Causal language model (decoder-only transformer)
  • Base model: Qwen/Qwen3.5-2B
  • Parameters: 2B parameters
  • Training method: LoRA (Low-Rank Adaptation) fine-tuning
  • Task: Gin Rummy move prediction
  • License: Apache 2.0

Training Data

Dataset: GoodStartLabs/gin-rummy-trajectories-32k

  • Training samples: 32,000 (stratified sampling, minimum 1,000 per action type)
  • Validation samples: 1,000 (perfectly balanced, 200 per action type)
  • Source: Expert agent gameplay using Monte Carlo Tree Search (MCTS)

Action distribution (training set):

  • discard (discard a card): 44.6%
  • draw (draw from stock): 33.1%
  • +discard (pick from discard pile): 14.9%
  • KNOCK-[card] (knock and discard): 4.0%
  • pass (pass on upcard): 3.5%

Validation set: Perfectly balanced with exactly 200 samples per action type for unbiased evaluation.
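The stratification above (a fixed total with a guaranteed minimum per action type) can be sketched as follows. This is a hypothetical helper for illustration, not the published dataset pipeline; the `action_type` field name is an assumption.

```python
import random
from collections import defaultdict

def stratified_sample(examples, total=32_000, min_per_type=1_000, seed=0):
    """Draw `total` examples while guaranteeing at least `min_per_type`
    examples of every action type (illustrative sketch only)."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for ex in examples:
        by_type[ex["action_type"]].append(ex)

    sample = []
    # First pass: satisfy the per-type minimum.
    for group in by_type.values():
        rng.shuffle(group)
        sample.extend(group[:min_per_type])

    # Second pass: fill the remainder from whatever is left over.
    leftover = [ex for group in by_type.values() for ex in group[min_per_type:]]
    rng.shuffle(leftover)
    sample.extend(leftover[: total - len(sample)])
    rng.shuffle(sample)
    return sample
```

Rare actions such as KNOCK (4.0% of expert play) would otherwise be underrepresented; the minimum quota keeps them learnable.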

Training Procedure

Fine-tuning platform: Together AI (serverless LoRA training)

Hyperparameters:

  • LoRA rank: 16 (0.8B, 2B) / 32 (4B)
  • LoRA alpha: 16 (0.8B, 2B) / 32 (4B)
  • LoRA dropout: 0.05
  • LoRA modules: all-linear
  • Learning rate: 1e-4 (0.8B) / 5e-5 (2B, 4B)
  • Batch size: 8
  • Epochs: 3
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Max gradient norm: 1.0
  • Train on inputs: False (loss calculated only on assistant response tokens)
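"Train on inputs: False" means prompt tokens are masked out of the loss. A minimal sketch of that masking, assuming the common convention that the label value -100 is ignored by the cross-entropy loss (this is an illustration, not Together AI's training code):

```python
IGNORE_INDEX = -100  # label value skipped by cross-entropy in most trainers

def mask_prompt_labels(input_ids, response_start):
    """Return labels that incur loss only on assistant-response tokens.

    `input_ids` is the full tokenized conversation; `response_start` is the
    index where the assistant's reply begins (hypothetical helper).
    """
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])

# Tokens 0..2 are the prompt; only tokens 3..4 contribute to the loss.
print(mask_prompt_labels([10, 11, 12, 13, 14], response_start=3))
# [-100, -100, -100, 13, 14]
```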

Training duration: ~2-4 hours per model

Infrastructure:

  • Platform: Together AI
  • GPUs: NVIDIA H100 (serverless)
  • Precision: bfloat16

Intended Use

Primary Use Case

This model serves as the warm-start initialization for GRPO self-play training:

  1. HBC (Behavioral Cloning): this model

    • Learn from expert trajectories
    • Acquire a strong baseline policy
    • Converge quickly to competent play
  2. GRPO (Group Relative Policy Optimization): next stage

    • Self-play reinforcement learning
    • Discover novel strategies
    • Optimize for win rate

Inference

The model predicts the next action given the current game state formatted as a chat conversation:

Input format:

[SYSTEM]
You are an expert Gin Rummy player. Your goal is to minimize deadwood and form melds.

[USER]
History:
1. You: +D6 -C3
2. Opp: draw -CK

Now:
Hand: CK D2 D3 D4 D5 D6 D9 H7 HK HQ S9
Stock: 28 | Deadwood: 45 | Phase: discard_or_knock
YOUR TURN | Can: no

[ASSISTANT]

Output (predicted action):

-H7

Action format:

  • draw - Draw from stock pile
  • +discard - Pick from discard pile
  • -[CARD] - Discard a card (e.g., -H7 = discard 7 of Hearts)
  • KNOCK-[CARD] - Knock and discard (e.g., KNOCK-C3)
  • pass - Pass on the initial upcard

Card notation: Suit (C/D/H/S) + Rank (A/2-9/T/J/Q/K)

  • Example: H7 = 7 of Hearts, CK = King of Clubs, SA = Ace of Spades

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "GoodStartLabs/gin-rummy-hbc-qwen3.5-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)

# Format game state as chat
messages = [
    {
        "role": "system",
        "content": "You are an expert Gin Rummy player. Your goal is to minimize deadwood and form melds."
    },
    {
        "role": "user",
        "content": '''History:
1. Opp: draw -SQ
2. You: draw(DT) -DT

Now:
Hand: C9 D3 D9 H3 H6 HJ HQ HT S6 S9
Stock: 22 | Deadwood: 18 | Phase: draw
YOUR TURN | Can: no'''
    }
]

# Generate prediction
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,  # greedy decoding for deterministic play
)

action = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip()
print(f"Predicted action: {action}")
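Even with greedy decoding, a language model can emit a syntactically valid but illegal move (for example, discarding a card it does not hold). A minimal legality filter, assuming the phase names shown in the prompt format above (`is_legal` is a hypothetical helper; real Gin Rummy rules additionally gate knocking on the deadwood count):

```python
def is_legal(action, hand, phase):
    """Cheap legality check for a predicted action string (sketch only)."""
    action = action.strip()
    if phase == "draw":
        return action in ("draw", "+discard")
    if phase == "discard_or_knock":
        if action.startswith("KNOCK-"):
            return action[len("KNOCK-"):] in hand
        return action.startswith("-") and action[1:] in hand
    return action == "pass"

hand = "CK D2 D3 D4 D5 D6 D9 H7 HK HQ S9".split()
print(is_legal("-H7", hand, "discard_or_knock"))  # True: H7 is in hand
print(is_legal("-SA", hand, "discard_or_knock"))  # False: SA is not in hand
```

A game loop can combine this with a fallback (e.g. discard the highest-deadwood card) whenever the model's prediction is rejected.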

Limitations

  • Behavioral cloning ceiling: Model is limited by the quality of expert demonstrations. Cannot exceed expert performance without RL.
  • Distribution shift: May struggle on game states not represented in training data.
  • Stochastic policy: Model predicts a distribution over actions; greedy decoding gives deterministic play but may not explore optimally.
  • No opponent modeling: Does not explicitly model opponent strategy (though may learn implicit patterns from game history).
  • Fixed strategy: Cannot adapt during a game; uses the same policy throughout.

Evaluation

Validation accuracy (on balanced 1K validation set): TBD

Win rate vs. baselines:

  • Random policy: TBD
  • Greedy heuristic: TBD
  • Expert policy: TBD

Ethical Considerations

This model is trained for the game of Gin Rummy and should only be used for:

  • Game AI research
  • Educational purposes
  • Entertainment (single-player practice, AI opponents)

Not intended for:

  • Real-money gambling
  • Cheating in online games
  • Deceptive or manipulative applications

Citation

If you use this model in your research, please cite:

@misc{gin-rummy-hbc-2b,
  author = {Good Start Labs},
  title = {Gin Rummy HBC - Qwen3.5 2B},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{GoodStartLabs/gin-rummy-hbc-qwen3.5-2b}},
}

Model Card Authors

  • Good Start Labs
  • Contact: GitHub

Model trained on Together AI • Base model: Qwen3.5 • Training date: March 2026
