Training Dashboard
GRPO fine-tuning on Qwen2.5-1.5B-Instruct · Parlay reward functions
Status: Ready
Model comparison (metrics populate after evaluation):

  Model   Checkpoint                   Avg Reward   Deal Rate   ZOPA Efficiency   ToM Accuracy   Drift Adaptation
  Base    Qwen2.5-1.5B-Instruct        —            —           —                 —              —
  SFT     Qwen2.5-1.5B + SFT Warmup    —            —           —                 —              —
  GRPO    Qwen2.5-1.5B + SFT + GRPO    —            —           —                 —              —
Reward Curve
GRPO Configuration
Base Model: Qwen/Qwen2.5-1.5B-Instruct
Generations (G): 8
KL β: 0.001
Clip ε: 0.2
Learning Rate: 5e-7
Reward Scale: batch
LoRA r: 16
LoRA α: 32
Target Modules: q_proj, v_proj
Top Player θ: 0.60
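
For reference, a minimal sketch of how these settings could map onto TRL's GRPOTrainer with a PEFT LoRA adapter. The dataset path, output directory, and placeholder reward function are assumptions, not the pipeline's actual code; the "Reward Scale: batch" setting would correspond to TRL's reward-scaling option, whose accepted values vary by TRL version, so it is omitted here.

    # Hypothetical mapping of the settings above onto TRL + PEFT (sketch only).
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import GRPOConfig, GRPOTrainer

    def format_validity_reward(completions, **kwargs):
        # Placeholder for one Parlay reward: 1.0 if the completion is non-empty.
        # Assumes string completions (standard, non-conversational dataset format).
        return [1.0 if c.strip() else 0.0 for c in completions]

    peft_config = LoraConfig(
        r=16,                                 # LoRA r
        lora_alpha=32,                        # LoRA α
        target_modules=["q_proj", "v_proj"],  # Target Modules
        task_type="CAUSAL_LM",
    )

    args = GRPOConfig(
        output_dir="checkpoints/grpo",  # assumed path
        num_generations=8,              # Generations (G)
        beta=0.001,                     # KL β
        epsilon=0.2,                    # Clip ε
        learning_rate=5e-7,             # Learning Rate
    )

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-1.5B-Instruct",     # Base Model
        reward_funcs=[format_validity_reward],  # real pipeline would pass all weighted rewards
        args=args,
        # GRPO expects a dataset with a "prompt" column; path is a placeholder.
        train_dataset=load_dataset("json", data_files="data/train.jsonl", split="train"),
        peft_config=peft_config,
    )
    trainer.train()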
Data Generation
Min Episodes: 500
Per-Pair Min: 20
Noise Rate: 30%
Drift Rate: 40%
Coalition (Act 3): 25%
Train/Eval Split: 90/10
LLM Generator: Gemini 2.0 Flash
Personas × Scenarios: 5 × 5 = 25
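
The rates above suggest per-episode perturbation sampling. A hedged sketch of how training.generate_data might allocate episodes across persona/scenario pairs and apply the noise, drift, and coalition flags; the EpisodeSpec fields and persona/scenario names are hypothetical stand-ins.

    # Hypothetical sampler mirroring the Data Generation settings above.
    import itertools
    import random
    from dataclasses import dataclass

    NOISE_RATE = 0.30      # Noise Rate
    DRIFT_RATE = 0.40      # Drift Rate
    COALITION_RATE = 0.25  # Coalition (Act 3)
    MIN_EPISODES = 500     # Min Episodes
    PER_PAIR_MIN = 20      # Per-Pair Min
    EVAL_FRACTION = 0.10   # Train/Eval Split 90/10

    PERSONAS = [f"persona_{i}" for i in range(5)]   # illustrative names
    SCENARIOS = [f"scenario_{j}" for j in range(5)]  # 5 × 5 = 25 pairs

    @dataclass
    class EpisodeSpec:
        persona: str
        scenario: str
        noisy: bool
        drift: bool
        coalition_act3: bool
        split: str

    def sample_specs(rng: random.Random) -> list[EpisodeSpec]:
        pairs = list(itertools.product(PERSONAS, SCENARIOS))
        # Enough per pair to satisfy both the per-pair and total minimums.
        per_pair = max(PER_PAIR_MIN, MIN_EPISODES // len(pairs))
        specs = [
            EpisodeSpec(
                persona=p,
                scenario=s,
                noisy=rng.random() < NOISE_RATE,
                drift=rng.random() < DRIFT_RATE,
                coalition_act3=rng.random() < COALITION_RATE,
                split="eval" if rng.random() < EVAL_FRACTION else "train",
            )
            for p, s in pairs
            for _ in range(per_pair)
        ]
        rng.shuffle(specs)
        return specs

    specs = sample_specs(random.Random(0))
    print(len(specs), "episode specs")  # 500 total, 20 per persona/scenario pair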
Run Training Pipeline
1. Install dependencies and set environment variables:
       pip install -r requirements.txt
   then export GOOGLE_API_KEY and HF_TOKEN.
2. Generate self-play training data via Gemini:
       python -m training.generate_data --episodes 500
3. Run SFT warmup on top-player episodes:
       python -m training.sft_train
4. Run GRPO fine-tuning with the Parlay reward functions:
       python -m training.grpo_train
5. Evaluate and generate the comparison chart:
       python -m training.evaluate --output results/
6. Push to the Hugging Face Hub (see the sketch after this list):
       python -m training.push_to_hub
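
Step 6 isn't specified beyond the command; a minimal sketch of what a push-to-Hub step could look like for a LoRA run, using standard peft and transformers calls. The checkpoint path and repo id are placeholders; HF_TOKEN from step 1 is picked up automatically by huggingface_hub.

    # Hypothetical push-to-Hub step (paths and repo id are placeholders).
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    CHECKPOINT = "checkpoints/grpo"        # where training.grpo_train saved the adapter (assumed)
    REPO_ID = "your-username/parlay-grpo"  # placeholder Hub repo

    model = AutoPeftModelForCausalLM.from_pretrained(CHECKPOINT)
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

    model.push_to_hub(REPO_ID)      # uploads the LoRA adapter weights + config
    tokenizer.push_to_hub(REPO_ID)  # ships the tokenizer alongside for easy loading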
Reward Function Weights
Deal Efficiency (ZOPA): 0.35
Capitulation Cliff Penalty: −Ω
ToM Accuracy: 0.20
Drift Adaptation: 0.15
Move Diversity: 0.10
Act Completion Bonus: 0.10
Format Validity: 0.10
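
A hedged sketch of how these weights might combine into the scalar reward GRPO optimizes. The component scorers are placeholders, and treating the Capitulation Cliff as an unweighted penalty Ω subtracted from the weighted sum is an assumption about how −Ω enters.

    # Hypothetical composite reward combining the weights above.
    # Component scores are stand-ins in [0, 1]; the capitulation-cliff
    # term is subtracted as a raw penalty Ω (assumed, per the −Ω entry).
    WEIGHTS = {
        "zopa_efficiency": 0.35,   # Deal Efficiency (ZOPA)
        "tom_accuracy": 0.20,      # ToM Accuracy
        "drift_adaptation": 0.15,  # Drift Adaptation
        "move_diversity": 0.10,    # Move Diversity
        "act_completion": 0.10,    # Act Completion Bonus
        "format_validity": 0.10,   # Format Validity
    }

    def composite_reward(scores: dict[str, float], capitulation_penalty: float) -> float:
        """Weighted sum of [0, 1] component scores minus the capitulation cliff Ω."""
        weighted = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
        return weighted - capitulation_penalty

    # Example: a decent deal with no capitulation cliff triggered.
    example = {
        "zopa_efficiency": 0.8,
        "tom_accuracy": 0.6,
        "drift_adaptation": 0.5,
        "move_diversity": 0.7,
        "act_completion": 1.0,
        "format_validity": 1.0,
    }
    print(round(composite_reward(example, capitulation_penalty=0.0), 3))  # 0.745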
Training Progress
Step 0 / —
Stages: SFT Data Gen → SFT Train → GRPO Train → Evaluate
Training Log
[ Parlay Training Dashboard ]
Ready. Click "Run Training" to begin, or use the CLI commands above.