Training Dashboard

GRPO fine-tuning on Qwen2.5-1.5B-Instruct · Parlay reward functions

Status: Ready

Model Comparison
Each card reports Avg Reward, Deal Rate, ZOPA Efficiency, ToM Accuracy, and Drift Adaptation for one checkpoint (values populate once evaluation has run):
Base: Qwen2.5-1.5B-Instruct
SFT: Qwen2.5-1.5B + SFT Warmup
GRPO: Qwen2.5-1.5B + SFT + GRPO
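For reference, a hypothetical sketch of one evaluation record and the aggregation behind these five card metrics; the field names are illustrative, not the repo's actual schema:

from dataclasses import dataclass

@dataclass
class EpisodeResult:
    reward: float           # total Parlay reward for the episode
    deal_reached: bool      # did the negotiation close?
    zopa_efficiency: float  # surplus captured within the ZOPA, in [0, 1]
    tom_correct: bool       # theory-of-mind prediction correct?
    drift_adapted: bool     # did the agent adjust after preference drift?

def card_metrics(results: list[EpisodeResult]) -> dict[str, float]:
    # Each card metric is a simple mean over the eval episodes.
    n = len(results)
    return {
        "avg_reward": sum(r.reward for r in results) / n,
        "deal_rate": sum(r.deal_reached for r in results) / n,
        "zopa_efficiency": sum(r.zopa_efficiency for r in results) / n,
        "tom_accuracy": sum(r.tom_correct for r in results) / n,
        "drift_adaptation": sum(r.drift_adapted for r in results) / n,
    }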

Reward Curve (chart; populated during a training run)

GRPO Configuration
Base Model: Qwen/Qwen2.5-1.5B-Instruct
Generations (G): 8
KL β (beta): 0.001
Clip ε (epsilon): 0.2
Learning Rate: 5e-7
Reward Scale: batch
LoRA r: 16
LoRA α: 32
Target Modules: q_proj, v_proj
Top-Player θ: 0.60
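These values map directly onto TRL's GRPOTrainer with a PEFT LoRA adapter. A minimal sketch, assuming the pipeline is built on trl and peft; the dataset and the parlay_reward stub are placeholders, and Top-Player θ (presumably the threshold for selecting top-player episodes) does not enter the trainer config:

from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def parlay_reward(prompts, completions, **kwargs):
    # Placeholder: the real reward scores each completion with the weighted
    # Parlay terms listed under "Reward Function Weights" below.
    return [0.0 for _ in completions]

# Stand-in for the generated self-play episodes (GRPO needs a "prompt" column).
train_dataset = Dataset.from_list([{"prompt": "You are negotiating a vendor contract..."}])

config = GRPOConfig(
    output_dir="results/grpo",
    learning_rate=5e-7,
    num_generations=8,      # G: completions sampled per prompt
    beta=0.001,             # KL coefficient toward the reference policy
    epsilon=0.2,            # PPO-style clip range
    scale_rewards="batch",  # reward scaling mode; string values need a recent trl
)

lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=parlay_reward,
    args=config,
    train_dataset=train_dataset,
    peft_config=lora,
)
trainer.train()

With LoRA restricted to q_proj and v_proj, only a small fraction of the 1.5B parameters receive gradients, which helps keep G=8 generations per prompt affordable on a single GPU.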
Data Generation
Min Episodes: 500
Per-Pair Min: 20
Noise Rate: 30%
Drift Rate: 40%
Coalition (Act 3): 25%
Train / Eval Split: 90 / 10
LLM Generator: Gemini 2.0 Flash
Personas × Scenarios: 5 × 5 = 25
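A hypothetical sketch of what python -m training.generate_data could look like under these settings, using the google-generativeai client. The persona and scenario names, the prompt template, and all helper names are illustrative; only the rates, counts, and the Gemini model name come from the table above:

import os
import random
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

# 5 personas x 5 scenarios = 25 pairs (names are illustrative).
PERSONAS = ["anchor-heavy", "collaborative", "deceptive", "risk-averse", "time-pressured"]
SCENARIOS = ["salary", "vendor contract", "housing lease", "merger terms", "budget split"]

def sample_episode_config(rng: random.Random) -> dict:
    return {
        "persona": rng.choice(PERSONAS),
        "scenario": rng.choice(SCENARIOS),
        "noise": rng.random() < 0.30,      # Noise Rate 30%
        "drift": rng.random() < 0.40,      # Drift Rate 40%
        "coalition": rng.random() < 0.25,  # Coalition (Act 3) 25%
    }

def generate_episode(cfg: dict) -> str:
    prompt = (
        f"Simulate a three-act negotiation. Opponent persona: {cfg['persona']}. "
        f"Scenario: {cfg['scenario']}. Inject message noise: {cfg['noise']}. "
        f"Inject preference drift: {cfg['drift']}. Act-3 coalition: {cfg['coalition']}."
    )
    return model.generate_content(prompt).text

# Per-pair minimum (20 episodes per persona/scenario pair) omitted for brevity.
rng = random.Random(0)
episodes = [generate_episode(sample_episode_config(rng)) for _ in range(500)]  # Min Episodes
split = int(len(episodes) * 0.9)  # 90 / 10 train/eval split
train, eval_ = episodes[:split], episodes[split:]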
Run Training Pipeline
1. Install dependencies and set environment variables:
   pip install -r requirements.txt, then set GOOGLE_API_KEY and HF_TOKEN.
2. Generate self-play training data via Gemini:
   python -m training.generate_data --episodes 500
3. Run the SFT warmup on top-player episodes:
   python -m training.sft_train
4. Run GRPO fine-tuning with the Parlay reward functions:
   python -m training.grpo_train
5. Evaluate and generate the comparison chart:
   python -m training.evaluate --output results/
6. Push the trained model to the Hugging Face Hub:
   python -m training.push_to_hub
A one-shot driver for steps 2-6 is sketched below.
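A minimal driver that chains steps 2-6, assuming nothing beyond the module commands listed above (the repo may already ship an equivalent entry point); each stage stops the run on first failure:

import subprocess
import sys

STAGES = [
    ["python", "-m", "training.generate_data", "--episodes", "500"],
    ["python", "-m", "training.sft_train"],
    ["python", "-m", "training.grpo_train"],
    ["python", "-m", "training.evaluate", "--output", "results/"],
    ["python", "-m", "training.push_to_hub"],
]

for stage in STAGES:
    print(f"[pipeline] running: {' '.join(stage)}")
    result = subprocess.run(stage)
    if result.returncode != 0:
        sys.exit(f"[pipeline] stage failed: {' '.join(stage)}")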
Reward Function Weights
Deal Efficiency (ZOPA): 0.35
ToM Accuracy: 0.20
Drift Adaptation: 0.15
Move Diversity: 0.10
Act Completion Bonus: 0.10
Format Validity: 0.10
Capitulation Cliff: −Ω (hard penalty; the six weighted terms above sum to 1.00, so the cliff penalty presumably applies outside the weighted sum)
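Putting the table together as code: a sketch of the weighted combination, with the capitulation-cliff penalty modeled as a large constant Ω subtracted when triggered (an assumption about how −Ω is realized); the per-term scores themselves are stubbed:

WEIGHTS = {
    "zopa_efficiency": 0.35,
    "tom_accuracy": 0.20,
    "drift_adaptation": 0.15,
    "move_diversity": 0.10,
    "act_completion": 0.10,
    "format_validity": 0.10,
}
OMEGA = 10.0  # hypothetical magnitude for the capitulation-cliff penalty

def combined_reward(terms: dict[str, float], capitulation_cliff: bool) -> float:
    """Weighted sum of per-term scores in [0, 1], minus the cliff penalty."""
    reward = sum(WEIGHTS[name] * terms.get(name, 0.0) for name in WEIGHTS)
    if capitulation_cliff:
        reward -= OMEGA
    return reward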
Training Progress: Step 0 / —
Stages: SFT Data Gen · SFT Train · GRPO Train · Evaluate

Training Log
[ Parlay Training Dashboard ] Ready. Click "Run Training" to begin, or use the CLI commands above.