# Training plots
Evidence of real training runs. Two distinct phases captured:
## SFT prime (Algorithm Distillation, the final pipeline)
- **`sft_v3_training_loss.png`** — Loss curve from the SFT v3 run (525 steps,
  5,040 (state, response) pairs from gpt-5.4 teacher trajectories, Qwen 2.5-3B
  + 4-bit quantization + LoRA r=16). Loss drops smoothly from 2.77 → 0.083,
  with no sign of overfitting.
- Source: `InosLihka/rhythm-env-meta-trained-sft-v3/log_history.json`
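
A rough sketch of how a curve like this can be regenerated from that file. This assumes `log_history.json` holds the standard Hugging Face `TrainerState.log_history` list of dicts, where training entries carry `"step"` and `"loss"` keys (the exact key names in the real file are an assumption):

```python
import json

def extract_loss_curve(log_history):
    """Return (steps, losses) for entries that logged a training loss.

    Entries without a "loss" key (e.g. eval-only logs) are skipped.
    """
    steps, losses = [], []
    for entry in log_history:
        if "loss" in entry:
            steps.append(entry.get("step"))
            losses.append(entry["loss"])
    return steps, losses

if __name__ == "__main__":
    with open("log_history.json") as f:
        log_history = json.load(f)
    steps, losses = extract_loss_curve(log_history)

    # Plot with matplotlib (optional dependency, not needed for extraction).
    import matplotlib.pyplot as plt
    plt.plot(steps, losses)
    plt.xlabel("step")
    plt.ylabel("training loss")
    plt.savefig("sft_v3_training_loss.png", dpi=150)
```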
## GRPO iteration 2 (the journey before the AD pivot)
These came from a real 400-step GRPO run on Qwen 2.5-3B + Unsloth, before we
discovered that pure GRPO from scratch wasn't going to beat the heuristic
baseline at this model scale and pivoted to Algorithm Distillation:
- **`grpo_iter2_training_loss.png`** — GRPO loss over 400 steps
- **`grpo_iter2_baseline_vs_trained.png`** — final scores vs. random and
  heuristic baselines across 3 eval conditions
(More detailed component plots — reward curve, reward_components, belief_accuracy
trajectory — are available in the iter2 model repo. They were too large to
inline here without Git LFS setup.)
The full iteration journey (5 GRPO iters → AD pivot) is in
[`docs/iterations.md`](../docs/iterations.md).