# Training plots

Evidence of real training runs. Two distinct phases captured:

## SFT prime (Algorithm Distillation, the final pipeline)

- **`sft_v3_training_loss.png`** – Loss curve from the SFT v3 run (525 steps,
  5,040 (state, response) pairs from gpt-5.4 teacher trajectories, Qwen 2.5-3B
  + 4-bit + LoRA r=16). Loss drops from 2.77 → 0.083 with smooth convergence
  and no overfitting (a regeneration sketch follows this list).
  - Source: `InosLihka/rhythm-env-meta-trained-sft-v3/log_history.json`
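A minimal sketch for regenerating the curve, assuming `log_history.json` is the
standard Hugging Face `Trainer.state.log_history` dump (a JSON list of dicts
where training entries carry `step` and `loss` keys):

```python
import json

import matplotlib.pyplot as plt

# Assumes the standard HF Trainer log_history format: training entries
# are dicts with "step" and "loss" keys (eval entries use "eval_loss").
with open("log_history.json") as f:
    history = json.load(f)

train_entries = [e for e in history if "loss" in e]
steps = [e["step"] for e in train_entries]
losses = [e["loss"] for e in train_entries]

plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.title("SFT v3 training loss")
plt.savefig("sft_v3_training_loss.png", dpi=150)
```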

## GRPO iteration 2 (the journey before the AD pivot)

These came from a real 400-step GRPO run on Qwen 2.5-3B + Unsloth, before we
discovered that pure GRPO from scratch wasn't going to beat the heuristic
baseline at this model scale and pivoted to Algorithm Distillation (a setup
sketch follows the plot list):

- **`grpo_iter2_training_loss.png`** – GRPO loss over 400 steps
- **`grpo_iter2_baseline_vs_trained.png`** – final scores vs random + heuristic
  across 3 eval conditions
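A minimal sketch of the model setup behind this run, assuming the standard
Unsloth loading API and assuming the GRPO run mirrored the 4-bit + LoRA r=16
configuration stated for SFT v3 (the checkpoint name, sequence length, and
target modules below are likewise assumptions, not confirmed values):

```python
from unsloth import FastLanguageModel

# Hedged reconstruction of the iter2 setup. 4-bit + LoRA r=16 mirrors
# the SFT v3 description above; everything else here is an assumption.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",  # assumed exact checkpoint
    max_seq_length=2048,                    # assumed
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank as stated for the SFT v3 run
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```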

(More detailed component plots – reward curve, reward_components, belief_accuracy
trajectory – are available in the iter2 model repo. They were too large to
inline here without a Git LFS setup.)

The full iteration journey (5 GRPO iters → AD pivot) is in
[`docs/iterations.md`](../docs/iterations.md).