
# Training plots

Evidence of real training runs. Two distinct phases captured:

## SFT prime (Algorithm Distillation, the final pipeline)

- `sft_v3_training_loss.png` — loss curve from the SFT v3 run (525 steps, 5,040 (state, response) pairs from gpt-5.4 teacher trajectories, Qwen 2.5-3B 4-bit + LoRA r=16). Loss drops from 2.77 → 0.083: smooth convergence, no overfitting.
  - Source: `InosLihka/rhythm-env-meta-trained-sft-v3/log_history.json`
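To regenerate a loss curve like this one from `log_history.json`, the entries can be filtered for steps that logged a loss. A minimal sketch, assuming the file follows the Hugging Face Trainer `log_history` convention (a list of dicts with `step` and `loss` keys); the inline sample data here is illustrative, not the real log:

```python
import json

# Illustrative excerpt in the Trainer log_history format (an assumption);
# the real file lives in the sft-v3 model repo.
raw = """
[
  {"step": 1,   "loss": 2.77},
  {"step": 260, "loss": 0.41},
  {"step": 525, "loss": 0.083}
]
"""
log_history = json.loads(raw)

# Keep only entries that actually logged a training loss
# (eval-only entries in a real log would lack the "loss" key).
steps = [e["step"] for e in log_history if "loss" in e]
losses = [e["loss"] for e in log_history if "loss" in e]

# Plotting step, if matplotlib is installed:
# import matplotlib.pyplot as plt
# plt.plot(steps, losses)
# plt.xlabel("step"); plt.ylabel("loss")
# plt.savefig("sft_v3_training_loss.png")
```

With the real file, replace the inline sample with `json.load(open("log_history.json"))`.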

## GRPO iteration 2 (the journey before the AD pivot)

These came from a real 400-step GRPO run on Qwen 2.5-3B + Unsloth, before we discovered that pure GRPO from scratch wasn't going to beat the heuristic baseline at this model scale and pivoted to Algorithm Distillation:

- `grpo_iter2_training_loss.png` — GRPO loss over 400 steps
- `grpo_iter2_baseline_vs_trained.png` — final scores vs. the random and heuristic baselines across 3 eval conditions

(More detailed component plots — reward curve, `reward_components`, `belief_accuracy` trajectory — are available in the iter2 model repo. They were too large to inline here without Git LFS setup.)

The full iteration journey (5 GRPO iters → AD pivot) is in `docs/iterations.md`.