# LFM2-2.6B-ttt-sft
Supervised Fine-Tuning checkpoint of LiquidAI/LFM2-2.6B for Tic Tac Toe.
The goal of this SFT warm-up was to teach the model the correct output format and valid move syntax before applying Reinforcement Learning. At this stage, the model is not yet a strong player.
This is an intermediate checkpoint from 🎓 LLM RL Environments Lil Course, a hands-on course on building RL environments for Language Models, where models learn from rewards, not examples. It walks through the full process of turning a small open model into a specialist that outperforms a large proprietary one on a specific task (Tic Tac Toe). The final model is anakin87/LFM2-2.6B-mr-tictactoe.
🤗🕹️ Play against the final model
## Training
- Method: SFT with PRIME-RL
- Dataset: anakin87/tictactoe-filtered (174 examples, ~5.5 epochs)
- Steps: 30, batch size 32, lr 1e-5, seq_len 700
- Hardware: NVIDIA RTX Pro 6000 96GB (~5 min)
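Since the whole point of this warm-up is output format and move validity, here is a minimal sketch of the kind of parsing and validation such an evaluation relies on. The `<move>row,col</move>` tag format is a hypothetical placeholder for illustration, not necessarily the format used by the actual dataset:

```python
import re

def parse_move(completion: str):
    """Extract a (row, col) move from model output, or None if malformed.
    The "<move>row,col</move>" tag format is a HYPOTHETICAL example,
    not necessarily the format the SFT dataset actually uses."""
    m = re.search(r"<move>\s*(\d)\s*,\s*(\d)\s*</move>", completion)
    if m is None:
        return None
    row, col = int(m.group(1)), int(m.group(2))
    if not (0 <= row <= 2 and 0 <= col <= 2):
        return None  # syntactically well-formed but off the board
    return (row, col)

def is_valid(board, move):
    """A parsed move is only playable if the target cell is empty."""
    r, c = move
    return board[r][c] == " "

board = [[" "] * 3 for _ in range(3)]
board[1][1] = "X"
print(parse_move("I'll take the corner. <move>0,0</move>"))  # (0, 0)
print(is_valid(board, (1, 1)))  # False: cell already occupied
```

"Follows format" and "invalid moves" in the tables below correspond to these two distinct failure modes: output that cannot be parsed at all, and well-formed output that names an illegal cell.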
## Evaluation
100 games per setting.
| Model vs random opponent | % Wins | % Draws | % Losses | % Follows format | % Games w invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 40 | 11 | 49 | 27.8 | 40 |
| anakin87/LFM2-2.6B-ttt-sft | 74 | 13 | 13 | 99.8 | 11 |

| Model vs optimal opponent | % Wins | % Draws | % Losses | % Follows format | % Games w invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 0 | 11 | 89 | 24.7 | 43 |
| anakin87/LFM2-2.6B-ttt-sft | 0 | 52 | 48 | 99 | 14 |
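The evaluation protocol above (win/draw/loss rates over repeated games against a fixed opponent) can be sketched as follows. The `model_move` policy here is a random stand-in for actual model inference, so the numbers it prints are not the table's numbers:

```python
import random

# All eight winning lines on a flat 3x3 board (indices 0..8).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_move(board):
    return random.choice([i for i, cell in enumerate(board) if cell == " "])

# Stand-in for model inference: the real evaluation queries the
# fine-tuned model and also tracks format / move-validity failures.
model_move = random_move

def play_game():
    """Model plays X and moves first; returns 'X', 'O', or 'draw'."""
    board = [" "] * 9
    for turn in range(9):
        player = "X" if turn % 2 == 0 else "O"
        move = model_move(board) if player == "X" else random_move(board)
        board[move] = player
        w = winner(board)
        if w is not None:
            return w
    return "draw"

results = [play_game() for _ in range(100)]
print({k: results.count(k) for k in ("X", "O", "draw")})
```

Swapping `random_move` for a minimax opponent gives the "optimal opponent" setting, under which a correct player can at best draw.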
Format following jumped from below 30% to ~99%, and gameplay improved as a side effect: against the optimal opponent, draws rose from 11% to 52%, though the model still never wins and loses the remaining games.