# LFM2-2.6B-ttt-sft
Supervised Fine-Tuning checkpoint of LiquidAI/LFM2-2.6B for Tic Tac Toe.
The goal of this SFT warm-up was to teach the model the correct output format and valid move syntax before applying Reinforcement Learning. At this stage, the model is not yet a strong player.
This is an intermediate checkpoint from 🎓 LLM RL Environments Lil Course, a hands-on course on building RL environments for Language Models, where models learn from rewards, not examples. It walks through the full process of turning a small open model into a specialist that outperforms a large proprietary one on a specific task (Tic Tac Toe). The final model is anakin87/LFM2-2.6B-mr-tictactoe.
🤗🕹️ Play against the final model
## Training
- Method: SFT with PRIME-RL
- Dataset: anakin87/tictactoe-filtered (174 examples, ~5.5 epochs)
- Steps: 30, batch size 32, lr 1e-5, seq_len 700
- Hardware: NVIDIA RTX Pro 6000 96GB (~5 min)
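Since the whole point of this warm-up is output format and move validity, here is a minimal sketch of the kind of parsing and validation such an evaluation relies on. The `<move>row,col</move>` tag format is a hypothetical placeholder for illustration, not necessarily the format used by the actual dataset:

```python
import re

def parse_move(completion: str):
    """Extract a (row, col) move from model output, or None if malformed.
    The "<move>row,col</move>" tag format is a HYPOTHETICAL example,
    not necessarily the format the SFT dataset actually uses."""
    m = re.search(r"<move>\s*(\d)\s*,\s*(\d)\s*</move>", completion)
    if m is None:
        return None
    row, col = int(m.group(1)), int(m.group(2))
    if not (0 <= row <= 2 and 0 <= col <= 2):
        return None  # syntactically well-formed but off the board
    return (row, col)

def is_valid(board, move):
    """A parsed move is only playable if the target cell is empty."""
    r, c = move
    return board[r][c] == " "

board = [[" "] * 3 for _ in range(3)]
board[1][1] = "X"
print(parse_move("I'll take the corner. <move>0,0</move>"))  # (0, 0)
print(is_valid(board, (1, 1)))  # False: cell already occupied
```

"Follows format" and "invalid moves" in the tables below correspond to these two distinct failure modes: output that cannot be parsed at all, and well-formed output that names an illegal cell.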
## Evaluation
100 games per setting.
| Model vs random opponent | % Wins | % Draws | % Losses | % Follows format | % Games w invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 40 | 11 | 49 | 27.8 | 40 |
| anakin87/LFM2-2.6B-ttt-sft | 74 | 13 | 13 | 99.8 | 11 |

| Model vs optimal opponent | % Wins | % Draws | % Losses | % Follows format | % Games w invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 0 | 11 | 89 | 24.7 | 43 |
| anakin87/LFM2-2.6B-ttt-sft | 0 | 52 | 48 | 99 | 14 |
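The evaluation protocol above (win/draw/loss rates over repeated games against a fixed opponent) can be sketched as follows. The `model_move` policy here is a random stand-in for actual model inference, so the numbers it prints are not the table's numbers:

```python
import random

# All eight winning lines on a flat 3x3 board (indices 0..8).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_move(board):
    return random.choice([i for i, cell in enumerate(board) if cell == " "])

# Stand-in for model inference: the real evaluation queries the
# fine-tuned model and also tracks format / move-validity failures.
model_move = random_move

def play_game():
    """Model plays X and moves first; returns 'X', 'O', or 'draw'."""
    board = [" "] * 9
    for turn in range(9):
        player = "X" if turn % 2 == 0 else "O"
        move = model_move(board) if player == "X" else random_move(board)
        board[move] = player
        w = winner(board)
        if w is not None:
            return w
    return "draw"

results = [play_game() for _ in range(100)]
print({k: results.count(k) for k in ("X", "O", "draw")})
```

Swapping `random_move` for a minimax opponent gives the "optimal opponent" setting, under which a correct player can at best draw.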
Format following jumped from below 30% to ~99%, and gameplay improved as a side effect: against the optimal opponent, draws rose from 11% to 52%, though the model still never wins and loses the remaining games.