Gin Rummy HBC - Qwen3.5 2B

Behavioral cloning model for Gin Rummy trained via supervised fine-tuning on expert trajectories.

This model was trained on 32,000 stratified expert game states to learn optimal Gin Rummy decision-making. It serves as the initialization for subsequent GRPO (Group Relative Policy Optimization) self-play training.

Model Details

  • Model type: Causal language model (decoder-only transformer)
  • Base model: Qwen/Qwen3.5-2B
  • Parameters: 2B parameters
  • Training method: LoRA (Low-Rank Adaptation) fine-tuning
  • Task: Gin Rummy move prediction
  • License: Apache 2.0

Training Data

Dataset: GoodStartLabs/gin-rummy-trajectories-32k

  • Training samples: 32,000 (stratified sampling, minimum 1,000 per action type)
  • Validation samples: 1,000 (perfectly balanced, 200 per action type)
  • Source: Expert agent gameplay using Monte Carlo Tree Search (MCTS)

Action distribution (training set):

  • discard (discard a card): 44.6%
  • draw (draw from stock): 33.1%
  • +discard (pick from discard pile): 14.9%
  • KNOCK-[card] (knock and discard): 4.0%
  • pass (pass on upcard): 3.5%

Validation set: Perfectly balanced with exactly 200 samples per action type for unbiased evaluation.
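The stratification above (a fixed total with a guaranteed minimum per action type) can be sketched as follows. This is a hypothetical helper for illustration, not the published dataset pipeline; the `action_type` field name is an assumption.

```python
import random
from collections import defaultdict

def stratified_sample(examples, total=32_000, min_per_type=1_000, seed=0):
    """Draw `total` examples while guaranteeing at least `min_per_type`
    examples of every action type (illustrative sketch only)."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for ex in examples:
        by_type[ex["action_type"]].append(ex)

    sample = []
    # First pass: satisfy the per-type minimum.
    for group in by_type.values():
        rng.shuffle(group)
        sample.extend(group[:min_per_type])

    # Second pass: fill the remainder from whatever is left over.
    leftover = [ex for group in by_type.values() for ex in group[min_per_type:]]
    rng.shuffle(leftover)
    sample.extend(leftover[: total - len(sample)])
    rng.shuffle(sample)
    return sample
```

Rare actions such as KNOCK (4.0% of expert play) would otherwise be underrepresented; the minimum quota keeps them learnable.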

Training Procedure

Fine-tuning platform: Together AI (serverless LoRA training)

Hyperparameters:

  • LoRA rank: 16 (0.8B, 2B) / 32 (4B)
  • LoRA alpha: 16 (0.8B, 2B) / 32 (4B)
  • LoRA dropout: 0.05
  • LoRA modules: all-linear
  • Learning rate: 1e-4 (0.8B) / 5e-5 (2B, 4B)
  • Batch size: 8
  • Epochs: 3
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Max gradient norm: 1.0
  • Train on inputs: False (loss calculated only on assistant response tokens)
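"Train on inputs: False" means prompt tokens are masked out of the loss. A minimal sketch of that masking, assuming the common convention that the label value -100 is ignored by the cross-entropy loss (this is an illustration, not Together AI's training code):

```python
IGNORE_INDEX = -100  # label value skipped by cross-entropy in most trainers

def mask_prompt_labels(input_ids, response_start):
    """Return labels that incur loss only on assistant-response tokens.

    `input_ids` is the full tokenized conversation; `response_start` is the
    index where the assistant's reply begins (hypothetical helper).
    """
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])

# Tokens 0..2 are the prompt; only tokens 3..4 contribute to the loss.
print(mask_prompt_labels([10, 11, 12, 13, 14], response_start=3))
# [-100, -100, -100, 13, 14]
```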

Training duration: ~2-4 hours per model

Infrastructure:

  • Platform: Together AI
  • GPUs: NVIDIA H100 (serverless)
  • Precision: bfloat16

Intended Use

Primary Use Case

This model serves as the warm-start initialization for GRPO self-play training:

  1. HBC (Behavioral Cloning): this model

    • Learn from expert trajectories
    • Acquire a strong baseline policy
    • Converge quickly to competent play
  2. GRPO (Group Relative Policy Optimization): next stage

    • Self-play reinforcement learning
    • Discover novel strategies
    • Optimize for win rate

Inference

The model predicts the next action given the current game state formatted as a chat conversation:

Input format:

[SYSTEM]
You are an expert Gin Rummy player. Your goal is to minimize deadwood and form melds.

[USER]
History:
1. You: +D6 -C3
2. Opp: draw -CK

Now:
Hand: CK D2 D3 D4 D5 D6 D9 H7 HK HQ S9
Stock: 28 | Deadwood: 45 | Phase: discard_or_knock
YOUR TURN | Can: no

[ASSISTANT]

Output (predicted action):

-H7

Action format:

  • draw - Draw from stock pile
  • +discard - Pick from discard pile
  • -[CARD] - Discard a card (e.g., -H7 = discard 7 of Hearts)
  • KNOCK-[CARD] - Knock and discard (e.g., KNOCK-C3)
  • pass - Pass on the initial upcard

Card notation: Suit (C/D/H/S) + Rank (A/2-9/T/J/Q/K)

  • Example: H7 = 7 of Hearts, CK = King of Clubs, SA = Ace of Spades

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "GoodStartLabs/gin-rummy-hbc-qwen3.5-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)

# Format game state as chat
messages = [
    {
        "role": "system",
        "content": "You are an expert Gin Rummy player. Your goal is to minimize deadwood and form melds."
    },
    {
        "role": "user",
        "content": '''History:
1. Opp: draw -SQ
2. You: draw(DT) -DT

Now:
Hand: C9 D3 D9 H3 H6 HJ HQ HT S6 S9
Stock: 22 | Deadwood: 18 | Phase: draw
YOUR TURN | Can: no'''
    }
]

# Generate prediction
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,  # greedy decoding for deterministic play
)

action = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip()
print(f"Predicted action: {action}")
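Even with greedy decoding, a language model can emit a syntactically valid but illegal move (for example, discarding a card it does not hold). A minimal legality filter, assuming the phase names shown in the prompt format above (`is_legal` is a hypothetical helper; real Gin Rummy rules additionally gate knocking on the deadwood count):

```python
def is_legal(action, hand, phase):
    """Cheap legality check for a predicted action string (sketch only)."""
    action = action.strip()
    if phase == "draw":
        return action in ("draw", "+discard")
    if phase == "discard_or_knock":
        if action.startswith("KNOCK-"):
            return action[len("KNOCK-"):] in hand
        return action.startswith("-") and action[1:] in hand
    return action == "pass"

hand = "CK D2 D3 D4 D5 D6 D9 H7 HK HQ S9".split()
print(is_legal("-H7", hand, "discard_or_knock"))  # True: H7 is in hand
print(is_legal("-SA", hand, "discard_or_knock"))  # False: SA is not in hand
```

A game loop can combine this with a fallback (e.g. discard the highest-deadwood card) whenever the model's prediction is rejected.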

Limitations

  • Behavioral cloning ceiling: Model is limited by the quality of expert demonstrations. Cannot exceed expert performance without RL.
  • Distribution shift: May struggle on game states not represented in training data.
  • Stochastic policy: Model predicts a distribution over actions; greedy decoding gives deterministic play but may not explore optimally.
  • No opponent modeling: Does not explicitly model opponent strategy (though may learn implicit patterns from game history).
  • Fixed strategy: Cannot adapt during a game; uses the same policy throughout.

Evaluation

Validation accuracy (on balanced 1K validation set): TBD

Win rate vs. baselines:

  • Random policy: TBD
  • Greedy heuristic: TBD
  • Expert policy: TBD

Ethical Considerations

This model is trained for the game of Gin Rummy and should only be used for:

  • Game AI research
  • Educational purposes
  • Entertainment (single-player practice, AI opponents)

Not intended for:

  • Real-money gambling
  • Cheating in online games
  • Deceptive or manipulative applications

Citation

If you use this model in your research, please cite:

@misc{gin-rummy-hbc-2b,
  author = {Good Start Labs},
  title = {Gin Rummy HBC - Qwen3.5 2B},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{GoodStartLabs/gin-rummy-hbc-qwen3.5-2b}},
}

Model Card Authors

  • Good Start Labs
  • Contact: GitHub

Model trained on Together AI • Base model: Qwen3.5 • Training date: March 2026
