
# Training plots

Evidence of real training runs. Two distinct phases captured:

## SFT prime (Algorithm Distillation, the final pipeline)

- `sft_v3_training_loss.png` — loss curve from the SFT v3 run (525 steps, 5,040 (state, response) pairs from gpt-5.4 teacher trajectories, Qwen 2.5-3B 4-bit + LoRA r=16). Loss drops from 2.77 → 0.083: smooth convergence, no overfitting.
  - Source: `InosLihka/rhythm-env-meta-trained-sft-v3/log_history.json`
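To regenerate a loss curve like this one from `log_history.json`, the entries can be filtered for steps that logged a loss. A minimal sketch, assuming the file follows the Hugging Face Trainer `log_history` convention (a list of dicts with `step` and `loss` keys); the inline sample data here is illustrative, not the real log:

```python
import json

# Illustrative excerpt in the Trainer log_history format (an assumption);
# the real file lives in the sft-v3 model repo.
raw = """
[
  {"step": 1,   "loss": 2.77},
  {"step": 260, "loss": 0.41},
  {"step": 525, "loss": 0.083}
]
"""
log_history = json.loads(raw)

# Keep only entries that actually logged a training loss
# (eval-only entries in a real log would lack the "loss" key).
steps = [e["step"] for e in log_history if "loss" in e]
losses = [e["loss"] for e in log_history if "loss" in e]

# Plotting step, if matplotlib is installed:
# import matplotlib.pyplot as plt
# plt.plot(steps, losses)
# plt.xlabel("step"); plt.ylabel("loss")
# plt.savefig("sft_v3_training_loss.png")
```

With the real file, replace the inline sample with `json.load(open("log_history.json"))`.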

## GRPO iteration 2 (the journey before the AD pivot)

These came from a real 400-step GRPO run on Qwen 2.5-3B + Unsloth, before we discovered that pure GRPO from scratch wasn't going to beat the heuristic baseline at this model scale and pivoted to Algorithm Distillation:

- `grpo_iter2_training_loss.png` — GRPO loss over 400 steps
- `grpo_iter2_baseline_vs_trained.png` — final scores vs. the random and heuristic baselines across 3 eval conditions

(More detailed component plots — reward curve, `reward_components`, `belief_accuracy` trajectory — are available in the iter2 model repo. They were too large to inline here without Git LFS setup.)

The full iteration journey (5 GRPO iters → AD pivot) is in `docs/iterations.md`.