Training plots
Evidence of real training runs. Two distinct phases captured:
SFT prime (Algorithm Distillation, the final pipeline)
sft_v3_training_loss.png: loss curve from the SFT v3 run (525 steps, 5,040 (state, response) pairs from gpt-5.4 teacher trajectories, Qwen 2.5-3B 4-bit + LoRA r=16). Loss drops from 2.77 to 0.083 with smooth convergence and no sign of overfitting.
- Source:
InosLihka/rhythm-env-meta-trained-sft-v3/log_history.json
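The loss curve can be regenerated from that `log_history.json`. A minimal sketch, assuming the file follows the Hugging Face Trainer `log_history` format (a list of dicts where training-step entries carry `"step"` and `"loss"` keys); the synthetic entries below are illustrative, not the real run's numbers:

```python
import json

def extract_loss_curve(log_history):
    """Pull (step, loss) pairs from a HF Trainer log_history list.

    Entries without a "loss" key (eval logs, final run summaries)
    are skipped.
    """
    return [(e["step"], e["loss"]) for e in log_history if "loss" in e]

if __name__ == "__main__":
    # In practice: history = json.load(open("log_history.json"))
    # Synthetic stand-in entries (not the real run's values):
    history = [
        {"step": 1, "loss": 2.77, "learning_rate": 2e-4},
        {"step": 250, "loss": 0.41, "learning_rate": 1e-4},
        {"step": 525, "loss": 0.083, "learning_rate": 0.0},
        {"step": 525, "train_runtime": 1234.5},  # summary entry, no loss
    ]
    print(extract_loss_curve(history))
    # [(1, 2.77), (250, 0.41), (525, 0.083)]
```

The (step, loss) pairs can then be fed to any plotting library to reproduce the PNG.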
GRPO iteration 2 (the journey before the AD pivot)
These plots come from a real 400-step GRPO run on Qwen 2.5-3B + Unsloth, before we discovered that pure GRPO from scratch wasn't going to beat the heuristic baseline at this model scale and pivoted to Algorithm Distillation:
grpo_iter2_training_loss.png: GRPO loss over 400 steps
grpo_iter2_baseline_vs_trained.png: final scores vs random and heuristic baselines across 3 eval conditions
(More detailed component plots, such as the reward curve, reward_components, and belief_accuracy trajectory, are available in the iter2 model repo. They were too large to inline here without Git LFS setup.)
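The per-condition comparison behind grpo_iter2_baseline_vs_trained.png boils down to averaging episode scores per (policy, condition). A minimal sketch; the policy labels, condition names, and score values below are hypothetical placeholders, since the real numbers live in the iter2 model repo:

```python
from statistics import mean

def summarize(scores_by_policy):
    """Mean score per (policy, condition).

    Input maps policy name -> {condition name: [episode scores]}.
    """
    return {
        policy: {cond: mean(vals) for cond, vals in conds.items()}
        for policy, conds in scores_by_policy.items()
    }

if __name__ == "__main__":
    # Synthetic scores; condition names are placeholders for the 3 eval conditions.
    scores = {
        "random":     {"cond_a": [0.1, 0.2], "cond_b": [0.0, 0.1]},
        "heuristic":  {"cond_a": [0.6, 0.7], "cond_b": [0.5, 0.6]},
        "grpo_iter2": {"cond_a": [0.5, 0.6], "cond_b": [0.4, 0.5]},
    }
    for policy, conds in summarize(scores).items():
        print(policy, conds)
```

A grouped bar chart over the resulting dict reproduces the comparison plot.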
The full iteration journey (5 GRPO iterations → AD pivot) is in
docs/iterations.md.