# Results
A trained 3B model that reads observations and infers a hidden personality:
not because we told it to, but because it learned the skill from a teacher.
## What's actually happening
Each episode, our agent watches a person live one week. Five life meters
drift up and down based on the actions it picks. The same actions hit
different people differently: the introvert crashes from socializing, the
extrovert thrives on it, the workaholic recovers from deep work. **The agent
never sees who it's helping.** It has to read the response patterns and infer.
We set out to train a small model to do this. The journey to "actually
beats the baseline" turned on one realization: **our grader didn't measure
the skill we wanted to teach.**
## The realization that fixed everything
Five iterations into training, the agent kept matching the heuristic
baseline (~0.59) but never beating it. We assumed the model was too weak.
Reading the actual model outputs proved otherwise. The model was reasoning
correctly:
> *"Last step's socialize gave Vβˆ’0.12 (anomaly βˆ’0.06, much worse than
> neutral) β€” high social drain, suggests low S. Morning DEEP_WORK earlier
> gave bonus cognition (anomaly +0.04) β†’ high M..."*
It was inferring the profile. But the inference didn't help its score,
because **the grader rewarded keeping meters healthy, not knowing the
person**. An agent that played safe (heuristic-style) and an agent that
genuinely inferred the profile both got rewarded for the same actions.
The fix: **add belief_accuracy as 20% of the grade.** Now an agent that
emits a belief close to the true hidden profile vector earns up to 0.20
extra. Heuristic baselines never emit a belief, so they score 0 on this
component by design. The grader now measures inference, not just reflex.
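For concreteness, here's a minimal sketch of how such a blended grade can be
computed. This is an illustrative assumption, not the repo's actual grader:
the function name, the 0.8/0.2 blend mechanics, and the MAE-to-bonus mapping
are made up here; only the "up to 0.20 for an accurate belief, 0 for no
belief" behavior comes from the description above.

```python
import numpy as np

def grade_episode_v2(meter_score, belief, true_profile):
    """Illustrative v2 grade: 80% meter health + up to 0.20 belief accuracy.

    Hypothetical sketch, not the repo's implementation. `meter_score` is the
    keep-meters-healthy component in [0, 1]; `belief` is the agent's emitted
    profile vector, or None for agents that never emit one.
    """
    base = 0.8 * meter_score
    if belief is None:
        return base  # heuristic/random baselines land here: 0 on this component
    mae = float(np.mean(np.abs(np.asarray(belief) - np.asarray(true_profile))))
    # Assumed mapping: MAE 0 earns the full 0.20, MAE >= 0.5 earns nothing.
    return base + 0.2 * max(0.0, 1.0 - mae / 0.5)
```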
Under the new grader, the gpt-5.4 teacher that had been "tied" with
heuristic now beats it by **+0.168 on average** and wins **30/30 episodes**
head-to-head.
## Algorithm Distillation
[Algorithm Distillation](https://arxiv.org/abs/2210.14215) was the rest of
the answer. We don't train the small model from scratch with GRPO, which
would need millions of examples for a reasoning task. Instead, we use a frontier
model (gpt-5.4 via Azure AI Foundry) as a teacher to play episodes and
write down its reasoning, then SFT-prime Qwen 2.5-3B on those trajectories.
The student learns the format AND the reasoning pattern in one shot. After
SFT, it can run on a free Colab T4 and inherit a meaningful fraction of the
teacher's inference skill.
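As a sketch of the data step, each teacher step becomes a plain
(observation → reasoning + action) chat pair for SFT. The field names
(`observation`, `reasoning`, `action`) are assumptions about the trajectory
schema, not the repo's documented format:

```python
from datasets import load_dataset

def to_sft_example(step):
    # Assumed JSONL fields; the real schema may differ.
    return {
        "messages": [
            {"role": "user", "content": step["observation"]},
            {"role": "assistant", "content": step["reasoning"] + "\n" + step["action"]},
        ]
    }

# One row per teacher step, ready for a chat-template SFT trainer.
ds = load_dataset("json", data_files="data/teacher_30ep.jsonl", split="train")
sft_ds = ds.map(to_sft_example, remove_columns=ds.column_names)
```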
## Headline numbers
All numbers below use the v2 grader. Heuristic and random baselines emit no
belief, so they score 0 on that component by design: the meta-RL skill is
inference, and only agents that attempt it get credit.
### Distilled Qwen 3B student β€” full eval across all 3 conditions
10 episodes per continuous condition and 5 episodes per discrete profile
(15 total). Sources:
- SFT v3 numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json)
- SFT v3 + GRPO refine numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json)
| Condition | Random | Heuristic | **SFT v3** | **+ GRPO refine** | belief_MAE (GRPO) |
|---|---|---|---|---|---|
| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.216 |
| **continuous OOD** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |
**Interpretation:**
- The student wins on **all three** conditions, with the largest margin
on the meta-RL test condition (continuous in-dist, +0.111).
- **`belief_MAE` 0.213 in-distribution (SFT v3) matches the gpt-5.4 teacher (0.196)**
to within 0.02: the inference skill transferred nearly intact via
SFT-prime distillation.
- The OOD margin (+0.081 for SFT v3, +0.104 after GRPO refine) on profiles the
agent never saw demonstrates real generalization, not memorization.
- Discrete-3 belief_MAE (0.415 for SFT v3) is weaker because the student was
trained on continuous profiles only. The action quality still wins (+0.052).
### Teacher (gpt-5.4) ceiling β€” 150-episode reeval
| Condition | Random | Heuristic | **gpt-5.4 Teacher** | belief_MAE |
|---|---|---|---|---|
| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* | 0.196 |
| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* | 0.214 |
The teacher beats heuristic by ~0.16 in both conditions, confirming that the
v2 grader cleanly distinguishes inference from reflex. The in-dist and OOD
margins land within ~0.01 of each other, so the teacher generalizes: the
"OOD" seeds sample new parameter regions of the same hidden-profile space,
not a separate distribution.
### Teacher belief inference quality
| Condition | Teacher belief_MAE | Constant `[0.5, 0.5, 0.5]` baseline |
|---|---|---|
| in-distribution | **0.196** | ~0.20 |
| OOD | **0.214** | ~0.21 |
The teacher's belief emission is **slightly better than the constant
baseline** on average. Two things to read into this:
1. **The inference task is partially ill-posed.** Three latent factors
feed each true belief dimension, but only one (e.g. `work_vitality_recovery`)
has a clean observational signature. Even a perfect inference engine
caps at MAE ~0.10-0.15 on this env.
2. **Final score is what matters more.** The teacher beats heuristic by
**+0.16 on final_score** even though belief_MAE is only marginally
better than baseline. Inference doesn't have to be perfect; it just
has to inform action choice. The action distribution differs
noticeably between the teacher (uses all 10 actions, varies by profile)
and heuristic (uses ~5, fixed priority list).
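A quick way to reproduce the constant-baseline row is to score a fixed
`[0.5, 0.5, 0.5]` guess against the true hidden profiles of a logged run.
A minimal sketch, assuming the trajectory JSONL stores the profile under a
`true_profile` key (the key name is a guess):

```python
import json
import numpy as np

# Collect the true hidden profiles from a teacher-trajectory run.
profiles = []
with open("data/teacher_30ep.jsonl") as f:
    for line in f:
        profiles.append(json.loads(line)["true_profile"])

# MAE a constant [0.5, 0.5, 0.5] guess earns against those profiles.
profiles = np.asarray(profiles, dtype=float)
print(f"constant-baseline MAE ~ {np.mean(np.abs(profiles - 0.5)):.3f}")
```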
### What "good" looks like for the student
- **belief_MAE ≤ 0.21** (matches teacher) → distillation transferred inference
- **final_score above 0.55** → inference + competent action, beats heuristic clearly
- **final_score 0.50-0.55** → modest beat, valid result
- **final_score < 0.50** → SFT didn't transfer enough; fall back to GRPO refine
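To apply these buckets mechanically to an eval run, something like the
following works; the `final_score` and `belief_mae` keys are assumptions
about the top-level schema of `eval_results_v2.json`:

```python
import json

with open("eval_results_v2.json") as f:
    res = json.load(f)

score, mae = res["final_score"], res["belief_mae"]
inference_ok = mae <= 0.21  # matches-teacher threshold from the list above
if score > 0.55:
    verdict = "clear beat: inference + competent action"
elif score >= 0.50:
    verdict = "modest beat, valid result"
else:
    verdict = "SFT didn't transfer enough; fall back to GRPO refine"
print(f"final_score={score:.3f}  belief_MAE={mae:.3f}  inference_transferred={inference_ok}")
print(verdict)
```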
## Why it's not higher
Two ceilings we hit:
**1. Some belief dimensions are partially unobservable.** The ground-truth
`work_pref` is derived from three latent factors (work_vitality_recovery,
progress_serenity_bonus, progress_reward_weight). The agent can observe the
first cleanly via vitality anomalies after work actions, but the other two
have weaker observational signatures. So even a perfect inference engine
caps around belief_MAE ~0.10-0.15 on this env.
**2. The grader reasonably weights crash-avoidance.** Even if you infer the
profile perfectly, you still need to keep meters above 0.10 to avoid
crash penalties. That puts a floor on how much "knowing the person" can
improve over heuristic-style play.
Both are deliberate features of the env, not bugs. We want a benchmark
where inference is real but bounded β€” otherwise it's not a benchmark.
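For intuition on the second ceiling, a minimal sketch of the crash floor; the
0.10 threshold comes from the text above, while the penalty size and per-step
application are assumptions:

```python
CRASH_THRESHOLD = 0.10  # from the text: meters must stay above 0.10

def crash_penalty(meters, per_meter_penalty=0.05):
    """Hypothetical per-step penalty for each meter below the crash line."""
    return per_meter_penalty * sum(1 for m in meters if m < CRASH_THRESHOLD)

print(crash_penalty([0.05, 0.4, 0.6, 0.8, 0.3]))  # one crashed meter -> 0.05
```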
## Reproducing
```bash
# Generate teacher trajectories (Azure OpenAI, ~$3 per 30 episodes)
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 \
    --output data/teacher_30ep.jsonl \
    --concurrency 3

# Validate teacher quality under the v2 grader
python scripts/reeval_teacher_trajectories.py \
    --jsonl data/teacher_30ep.jsonl

# Upload to HF Hub for the SFT job
python scripts/upload_teacher_data.py \
    --files data/teacher_30ep.jsonl \
    --repo InosLihka/rhythm-env-teacher-trajectories

# SFT-prime Qwen 2.5-3B on HF Jobs (a10g-large, ~$2-3, ~25 min)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
    -e TEACHER_FILES=teacher_30ep.jsonl \
    -e MODEL_REPO_SUFFIX=sft-v1 \
    -d scripts/sft_on_hf.py

# Eval the trained model
python training/inference_eval.py \
    --model_path InosLihka/rhythm-env-meta-trained-sft-v1 \
    --output_file results.json
```
## Plots
In the trained model repo at
`https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1/tree/main/plots`:
- `training_loss.png`: SFT loss curve
- `belief_accuracy.png`: student belief_MAE over training
- `final_scores.png`: student vs teacher vs heuristic vs random across all 3 conditions
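To pull any of them locally without cloning the repo (standard
`huggingface_hub` call; the paths are the ones listed above):

```python
from huggingface_hub import hf_hub_download

# Download one plot from the trained-model repo into the local HF cache.
path = hf_hub_download(
    repo_id="InosLihka/rhythm-env-meta-trained-sft-v1",
    filename="plots/final_scores.png",
)
print(path)
```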
## Cost
| Stage | Cost | Notes |
|---|---|---|
| Teacher rollouts (30 eps validation) | ~$3 | gpt-5.4 via Azure AI Foundry |
| SFT prime on HF Jobs (a10g-large) | ~$2 | ~25 min wall time |
| Eval | ~$0.50 | included in HF Jobs run |
| **Total for AD pipeline** | **~$5.50** | |
Compare the prior five GRPO iterations: ~$5.60 total, and no agent that beat
heuristic under the v1 grader. The cost is similar; the recipe choice is what
matters.