# Results

A trained 3B model that reads observations and infers a hidden personality:
not because we told it to, but because it learned the skill from a teacher.

## What's actually happening

Each episode, our agent watches a person live one week. Five life meters
drift up and down based on the actions it picks. The same actions hit
different people differently: the introvert crashes from socializing, the
extrovert thrives on it, the workaholic recovers from deep work. **The agent
never sees who it's helping.** It has to read the response patterns and infer.
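To make the setup concrete, here is a toy sketch of one env step. The meter names match the write-up, but the drift values, the action-to-meter mapping, and the profile-effect table are invented for illustration; they are not the real rhythm-env dynamics.

```python
# Toy sketch of one env step. Drift values, the action->meter mapping, and
# the profile effects are invented for illustration, NOT the real env.
PROFILE_EFFECTS = {
    "introvert":  {"SOCIALIZE": -0.10, "DEEP_WORK": +0.04},
    "extrovert":  {"SOCIALIZE": +0.08, "DEEP_WORK": -0.02},
    "workaholic": {"SOCIALIZE": -0.02, "DEEP_WORK": +0.09},
}
ACTION_TARGET = {"SOCIALIZE": "social", "DEEP_WORK": "progress"}

def step(meters, action, profile, base_drift=-0.02):
    """All meters drift down a little; the acted-on meter gets a base boost
    plus a hidden profile-specific modifier the agent never sees directly."""
    out = {k: max(0.0, min(1.0, v + base_drift)) for k, v in meters.items()}
    bonus = 0.10 + PROFILE_EFFECTS[profile].get(action, 0.0)
    target = ACTION_TARGET[action]
    out[target] = max(0.0, min(1.0, out[target] + bonus))
    return out

meters = dict.fromkeys(("vitality", "serenity", "cognition", "social", "progress"), 0.6)
after = step(meters, "SOCIALIZE", "introvert")
# Same action, different person: the introvert's social meter barely moves,
# while the extrovert's jumps. That anomaly is the only signal the agent gets.
```

The agent only ever observes the meter deltas, so "reading the person" means comparing observed anomalies against what a neutral profile would predict.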

We set out to train a small model to do this. The journey to "actually
beats the baseline" turned on one realization: **our grader didn't measure
the skill we wanted to teach.**

## The realization that fixed everything

Five iterations into training, the agent kept matching the heuristic
baseline (~0.59) but never beating it. We assumed the model was too weak.

Reading the actual model outputs proved otherwise. The model was reasoning
correctly:

> *"Last step's socialize gave V -0.12 (anomaly -0.06, much worse than
> neutral): high social drain, suggests low S. Morning DEEP_WORK earlier
> gave bonus cognition (anomaly +0.04) -> high M..."*

It was inferring the profile. But the inference didn't help its score,
because **the grader rewarded keeping meters healthy, not knowing the
person**. An agent that played safe (heuristic-style) and an agent that
genuinely inferred the profile both got rewarded for the same actions.

The fix: **add belief_accuracy as 20% of the grade.** Now an agent that
emits a belief close to the true hidden profile vector earns up to 0.20
extra. Heuristic baselines never emit a belief, so they score 0 on this
component by design. The grader now measures inference, not just reflex.
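A minimal sketch of the reweighted grade, assuming a linear 0.8/0.2 blend and a simple `1 - MAE` mapping from belief error to belief_accuracy. Both are assumptions; the real grader's exact formula may differ.

```python
# Sketch of a v2-style grade: 80% meter health, up to 0.20 for belief.
# The 0.8/0.2 blend and the 1 - MAE mapping are assumptions.
def grade_v2(meter_score, belief, true_profile):
    if belief is None:  # heuristic/random agents emit no belief -> 0 credit
        belief_acc = 0.0
    else:
        mae = sum(abs(b - t) for b, t in zip(belief, true_profile)) / len(true_profile)
        belief_acc = max(0.0, 1.0 - mae)
    return 0.8 * meter_score + 0.2 * belief_acc
```

The load-bearing property is the `belief is None` branch: baselines that never emit a belief cannot collect any of the 0.20, no matter how healthy they keep the meters.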

Under the new grader, the gpt-5.4 teacher that had been "tied" with
heuristic now beats it by **+0.168 on average** and wins **30/30 episodes**
head-to-head.

## Algorithm Distillation

[Algorithm Distillation](https://arxiv.org/abs/2210.14215) was the rest of
the answer. We don't train the small model from scratch with GRPO; that
needs millions of examples for a reasoning task. Instead, we use a frontier
model (gpt-5.4 via Azure AI Foundry) as a teacher to play episodes and
write down its reasoning, then SFT-prime Qwen 2.5-3B on those trajectories.

The student learns the format AND the reasoning pattern in one shot. After
SFT, it can run on a free Colab T4 and inherit a meaningful fraction of the
teacher's inference skill.
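The data prep for this step amounts to flattening teacher trajectories into chat-format SFT records. A sketch, with the field names (`steps`, `observation`, `reasoning`, `action`) assumed rather than taken from the real JSONL schema:

```python
import json

# Hypothetical trajectory -> SFT conversion. Field names are assumptions
# about the teacher JSONL, not the actual schema.
def to_sft_example(step: dict) -> dict:
    completion = f"{step['reasoning']}\nACTION: {step['action']}"
    return {
        "messages": [
            {"role": "user", "content": step["observation"]},
            {"role": "assistant", "content": completion},
        ]
    }

def convert_file(in_path: str, out_path: str) -> None:
    """One episode per input line; one SFT record per step on output."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            for step in json.loads(line)["steps"]:
                fout.write(json.dumps(to_sft_example(step)) + "\n")
```

Keeping the reasoning text in the assistant turn (rather than just the action) is what lets the student inherit the inference pattern, not only the action distribution.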

## Headline numbers

All numbers below use the v2 grader. Heuristic and random baselines emit
no belief, so they score 0 on that component (by design: the meta-RL skill
is inference, and only agents that attempt it get credit).

### Distilled Qwen 3B student: full eval across all 3 conditions

10 episodes per condition for continuous, 5 episodes per discrete profile
(15 total). Sources:
- SFT v3 numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json)
- SFT v3 + GRPO refine numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json)

| Condition | Random | Heuristic | **SFT v3** | **+ GRPO refine** | belief_MAE (GRPO) |
|---|---|---|---|---|---|
| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.216 |
| **continuous OOD** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |

**Interpretation:**
- The student wins on **all three** conditions, with the largest margin
  on the meta-RL test condition (continuous in-dist, +0.111).
- **`belief_MAE` 0.216 in-distribution matches the gpt-5.4 teacher (0.196)**
  to within 0.02: the inference skill transferred nearly intact via
  SFT-prime distillation.
- OOD margin (+0.081) on profiles the agent never saw demonstrates real
  generalization, not memorization.
- Discrete-3 belief_MAE (0.430) is weaker because the student was trained
  on continuous profiles only. The action quality still wins (+0.052).

### Teacher (gpt-5.4) ceiling: 150-episode reeval

| Condition | Random | Heuristic | **gpt-5.4 Teacher** | belief_MAE |
|---|---|---|---|---|
| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* | 0.196 |
| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* | 0.214 |

The teacher beats heuristic by ~0.16 across both conditions, confirming
that the v2 grader cleanly distinguishes inference from reflex.

The teacher also generalizes: its margin over heuristic is the same ~0.16
in-distribution and OOD, within ~0.01 of each other. This suggests the OOD
seeds sample different regions of the same hidden-profile parameter space
rather than a genuinely separate distribution.

### Teacher belief inference quality

| Condition | Teacher belief_MAE | Constant `[0.5, 0.5, 0.5]` baseline |
|---|---|---|
| in-distribution | **0.196** | ~0.20 |
| OOD | **0.214** | ~0.21 |

The teacher's belief emission is **slightly better than the constant
baseline** on average. Two things to read into this:

1. **The inference task is partially ill-posed.** Three latent factors
   feed each true belief dimension, but only one (e.g. `work_vitality_recovery`)
   has a clean observational signature. Even a perfect inference engine
   caps at MAE ~0.10-0.15 on this env.
2. **Final-score is what matters more.** The teacher beats heuristic by
   **+0.16 on final_score** even though belief_MAE is only marginally
   better than baseline. Inference doesn't have to be perfect; it just
   has to inform action choice. The action distribution differs
   noticeably between the teacher (uses all 10 actions, varies by profile)
   and heuristic (uses ~5, fixed priority list).
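The belief_MAE numbers above are plain per-dimension mean absolute error; a quick sandbox (with an invented hidden profile) shows how an informed guess separates from the constant baseline:

```python
# belief_MAE is mean absolute error per belief dimension.
def belief_mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

hidden = [0.2, 0.7, 0.5]                        # invented profile, for illustration
lazy = belief_mae([0.5, 0.5, 0.5], hidden)      # constant baseline, ~0.167 here
informed = belief_mae([0.3, 0.6, 0.5], hidden)  # partial inference, ~0.067 here
```

On any single profile an informed guess wins easily; averaged over the whole profile distribution, the gap shrinks toward the table's 0.196-vs-0.20 figures because some dimensions are barely observable.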

### What "good" looks like for the student

- **belief_MAE ≤ 0.21** (matches teacher) → distillation transferred inference
- **final_score above 0.55** → inference + competent action, beats heuristic clearly
- **final_score 0.50-0.55** → modest beat, valid result
- **final_score < 0.50** → SFT didn't transfer enough; fall back to GRPO refine

## Why it's not higher

Two ceilings we hit:

**1. Some belief dimensions are partially unobservable.** The ground-truth
`work_pref` is derived from three latent factors (work_vitality_recovery,
progress_serenity_bonus, progress_reward_weight). The agent can observe the
first cleanly via vitality anomalies after work actions, but the other two
have weaker observational signatures. So even a perfect inference engine
caps around belief_mae 0.10-0.15 on this env.

**2. The grader reasonably weights crash-avoidance.** Even if you infer the
profile perfectly, you still need to keep meters above 0.10 to avoid
crash penalties. That puts a floor on how much "knowing the person" can
improve over heuristic-style play.
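A sketch of the crash check described above; the exact threshold semantics (at-or-below versus strictly below 0.10) are an assumption:

```python
# Sketch of the crash floor: any meter at or below 0.10 draws a penalty,
# no matter how accurate the agent's belief is. The at-or-below semantics
# are an assumption.
def crash_penalized(meters, floor=0.10):
    return any(v <= floor for v in meters.values())
```

Perfect inference cannot rescue a run that lets a meter hit the floor, which is exactly what bounds the margin over safe heuristic play.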

Both are deliberate features of the env, not bugs. We want a benchmark
where inference is real but bounded; otherwise it's not a benchmark.

## Reproducing

```bash
# Generate teacher trajectories (Azure OpenAI, ~$3 per 30 episodes)
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 \
    --output data/teacher_30ep.jsonl \
    --concurrency 3

# Validate teacher quality under the v2 grader
python scripts/reeval_teacher_trajectories.py \
    --jsonl data/teacher_30ep.jsonl

# Upload to HF Hub for the SFT job
python scripts/upload_teacher_data.py \
    --files data/teacher_30ep.jsonl \
    --repo InosLihka/rhythm-env-teacher-trajectories

# SFT-prime Qwen 2.5-3B on HF Jobs (a10g-large, ~$2-3, ~25 min)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
    -e TEACHER_FILES=teacher_30ep.jsonl \
    -e MODEL_REPO_SUFFIX=sft-v1 \
    -d scripts/sft_on_hf.py

# Eval the trained model
python training/inference_eval.py \
    --model_path InosLihka/rhythm-env-meta-trained-sft-v1 \
    --output_file results.json
```

## Plots

In the trained model repo at
`https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1/tree/main/plots`:

- `training_loss.png`: SFT loss curve
- `belief_accuracy.png`: student belief_mae over training
- `final_scores.png`: student vs teacher vs heuristic vs random across all 3 conditions

## Cost

| Stage | Cost | Notes |
|---|---|---|
| Teacher rollouts (30 eps validation) | ~$3 | gpt-5.4 via Azure AI Foundry |
| SFT prime on HF Jobs (a10g-large) | ~$2 | ~25 min wall time |
| Eval | ~$0.50 | included in HF Jobs run |
| **Total for AD pipeline** | **~$5.50** | |

Compare the prior five GRPO iterations, which totaled ~$5.60 and produced
no agent that beat heuristic on the v1 grader. The cost is similar; the
recipe choice is what matters.