# Training plots

Evidence of real training runs. Two distinct phases captured:

## SFT prime (Algorithm Distillation, the final pipeline)

- **`sft_v3_training_loss.png`** – Loss curve from the SFT v3 run (525 steps,
  5,040 (state, response) pairs from gpt-5.4 teacher trajectories, Qwen 2.5-3B
  + 4-bit + LoRA r=16). Loss drops from 2.77 → 0.083, smooth convergence,
  no overfitting.
- Source: `InosLihka/rhythm-env-meta-trained-sft-v3/log_history.json`
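To regenerate the loss plot, the curve can be pulled out of a Trainer-style `log_history.json` along these lines (a minimal sketch; the `loss_curve` helper and the synthetic entries are illustrative, not the repo's actual code):

```python
import json

def loss_curve(log_history):
    """Extract (steps, losses) from a Trainer-style log-history list,
    skipping entries (eval stats, runtimes) that carry no "loss" key."""
    pairs = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
    steps = [s for s, _ in pairs]
    losses = [v for _, v in pairs]
    return steps, losses

# Tiny synthetic stand-in for log_history.json; the real file holds the
# full 525-step history. Values here are illustrative only.
history = json.loads("""[
    {"step": 1,   "loss": 2.77},
    {"step": 260, "loss": 0.41},
    {"step": 525, "loss": 0.083},
    {"step": 525, "eval_runtime": 9.3}
]""")

steps, losses = loss_curve(history)
# steps/losses can then be passed straight to matplotlib's plt.plot.
```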
## GRPO iteration 2 (the journey before the AD pivot)

These plots come from a real 400-step GRPO run on Qwen 2.5-3B + Unsloth, before we
discovered that pure GRPO from scratch wasn't going to beat the heuristic baseline
at this model scale and pivoted to Algorithm Distillation:

- **`grpo_iter2_training_loss.png`** – GRPO loss over 400 steps
- **`grpo_iter2_baseline_vs_trained.png`** – final scores vs. random and heuristic
  baselines across 3 eval conditions

(More detailed component plots – reward curve, reward_components, belief_accuracy
trajectory – are available in the iter2 model repo. They were too large to
inline here without Git LFS setup.)
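A baseline-comparison figure of this shape can be regenerated roughly like so (a sketch only: the condition labels and scores below are placeholders, not the run's actual numbers):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt

# Placeholder eval conditions and final scores (NOT the real results).
conditions = ["condition 1", "condition 2", "condition 3"]
scores = {
    "random":     [0.10, 0.12, 0.09],
    "heuristic":  [0.40, 0.38, 0.42],
    "grpo iter2": [0.35, 0.36, 0.33],
}

# Grouped bars: one cluster per eval condition, one bar per policy.
width = 0.25
for i, (name, vals) in enumerate(scores.items()):
    xs = [j + (i - 1) * width for j in range(len(conditions))]
    plt.bar(xs, vals, width, label=name)
plt.xticks(range(len(conditions)), conditions)
plt.ylabel("final score")
plt.legend()
plt.savefig("grpo_iter2_baseline_vs_trained.png")
```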
The full iteration journey (5 GRPO iterations → AD pivot) is in
[`docs/iterations.md`](../docs/iterations.md).