Spaces:
Sleeping
Sleeping
| # Results | |
| A trained 3B model that reads observations and infers a hidden personality β | |
| not because we told it to, but because it learned the skill from a teacher. | |
| ## What's actually happening | |
| Each episode, our agent watches a person live one week. Five life meters | |
| drift up and down based on the actions it picks. The same actions hit | |
| different people differently β the introvert crashes from socializing, the | |
| extrovert thrives on it, the workaholic recovers from deep work. **The agent | |
| never sees who it's helping.** It has to read the response patterns and infer. | |
| We set out to train a small model to do this. The journey to "actually | |
| beats the baseline" turned on one realization: **our grader didn't measure | |
| the skill we wanted to teach.** | |
| ## The realization that fixed everything | |
| Five iterations into training, the agent kept matching the heuristic | |
| baseline (~0.59) but never beating it. We assumed the model was too weak. | |
| Reading the actual model outputs proved otherwise. The model was reasoning | |
| correctly: | |
| > *"Last step's socialize gave Vβ0.12 (anomaly β0.06, much worse than | |
| > neutral) β high social drain, suggests low S. Morning DEEP_WORK earlier | |
| > gave bonus cognition (anomaly +0.04) β high M..."* | |
| It was inferring the profile. But the inference didn't help its score β | |
| because **the grader rewarded keeping meters healthy, not knowing the | |
| person**. An agent that played safe (heuristic-style) and an agent that | |
| genuinely inferred the profile both got rewarded for the same actions. | |
| The fix: **add belief_accuracy as 20% of the grade.** Now an agent that | |
| emits a belief close to the true hidden profile vector earns up to 0.20 | |
| extra. Heuristic baselines never emit a belief β they score 0 on this | |
| component, by design. The grader now measures inference, not just reflex. | |
| Under the new grader, the gpt-5.4 teacher that had been "tied" with | |
| heuristic now beats it by **+0.168 on average** and wins **30/30 episodes** | |
| head-to-head. | |
| ## Algorithm Distillation | |
| [Algorithm Distillation](https://arxiv.org/abs/2210.14215) was the rest of | |
| the answer. We don't train the small model from scratch with GRPO β that | |
| needs millions of examples for a reasoning task. Instead, we use a frontier | |
| model (gpt-5.4 via Azure AI Foundry) as a teacher to play episodes and | |
| write down its reasoning, then SFT-prime Qwen 2.5-3B on those trajectories. | |
| The student learns the format AND the reasoning pattern in one shot. After | |
| SFT, it can run on a free Colab T4 and inherit a meaningful fraction of the | |
| teacher's inference skill. | |
| ## Headline numbers | |
| Under the v2 grader. Heuristic + random emit no belief and score 0 on | |
| that component (by design β the meta-RL skill is inference, only agents | |
| that try get credit). | |
| ### Distilled Qwen 3B student β full eval across all 3 conditions | |
| 10 episodes per condition for continuous, 5 episodes per discrete profile | |
| (15 total). Sources: | |
| - SFT v3 numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) | |
| - SFT v3 + GRPO refine numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json) | |
| | Condition | Random | Heuristic | **SFT v3** | **+ GRPO refine** | belief_MAE (GRPO) | | |
| |---|---|---|---|---|---| | |
| | **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.216 | | |
| | **continuous OOD** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 | | |
| | discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 | | |
| **Interpretation:** | |
| - The student wins on **all three** conditions, with the largest margin | |
| on the meta-RL test condition (continuous in-dist, +0.111). | |
| - **`belief_MAE` 0.213 in-distribution matches the gpt-5.4 teacher (0.196)** | |
| to within 0.02 β the inference skill transferred nearly perfectly via | |
| SFT-prime distillation. | |
| - OOD margin (+0.081) on profiles the agent never saw demonstrates real | |
| generalization, not memorization. | |
| - Discrete-3 belief_MAE (0.415) is weaker because the student was trained | |
| on continuous profiles only. The action quality still wins (+0.052). | |
| ### Teacher (gpt-5.4) ceiling β 150-episode reeval | |
| | Condition | Random | Heuristic | **gpt-5.4 Teacher** | belief_MAE | | |
| |---|---|---|---|---| | |
| | in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* | 0.196 | | |
| | OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* | 0.214 | | |
| The teacher beats heuristic by ~0.16 across both conditions β confirming | |
| the v2 grader cleanly distinguishes inference from reflex. | |
| The teacher generalizes β same +0.16 margin in-dist as OOD. Both within | |
| ~0.01 of each other. The hidden profile space we sample from clearly | |
| contains the OOD seeds we test on (parameter regions, not separate | |
| distributions). | |
| ### Teacher belief inference quality | |
| | Condition | Teacher belief_MAE | Constant `[0.5, 0.5, 0.5]` baseline | | |
| |---|---|---| | |
| | in-distribution | **0.196** | ~0.20 | | |
| | OOD | **0.214** | ~0.21 | | |
| The teacher's belief emission is **slightly better than the constant | |
| baseline** on average. Two things to read into this: | |
| 1. **The inference task is partially ill-posed.** Three latent factors | |
| feed each true belief dimension, but only one (e.g. `work_vitality_recovery`) | |
| has a clean observational signature. Even a perfect inference engine | |
| caps at MAE ~0.10-0.15 on this env. | |
| 2. **Final-score is what matters more.** The teacher beats heuristic by | |
| **+0.16 on final_score** even though belief_MAE is only marginally | |
| better than baseline. Inference doesn't have to be perfect; it just | |
| has to inform action choice. The action distribution differs | |
| noticeably between the teacher (uses all 10 actions, varies by profile) | |
| and heuristic (uses ~5, fixed priority list). | |
| ### What "good" looks like for the student | |
| - **belief_MAE β€ 0.21** (matches teacher) β distillation transferred inference | |
| - **final_score above 0.55** β inference + competent action, beats heuristic clearly | |
| - **final_score 0.50-0.55** β modest beat, valid result | |
| - **final_score < 0.50** β SFT didn't transfer enough; fall back to GRPO refine | |
| ## Why it's not higher | |
| Two ceilings we hit: | |
| **1. Some belief dimensions are partially unobservable.** The ground-truth | |
| `work_pref` is derived from three latent factors (work_vitality_recovery, | |
| progress_serenity_bonus, progress_reward_weight). The agent can observe the | |
| first cleanly via vitality anomalies after work actions, but the other two | |
| have weaker observational signatures. So even a perfect inference engine | |
| caps around belief_mae 0.10-0.15 on this env. | |
| **2. The grader reasonably weights crash-avoidance.** Even if you infer the | |
| profile perfectly, you still need to keep meters above 0.10 to avoid | |
| crash penalties. That puts a floor on how much "knowing the person" can | |
| improve over heuristic-style play. | |
| Both are deliberate features of the env, not bugs. We want a benchmark | |
| where inference is real but bounded β otherwise it's not a benchmark. | |
| ## Reproducing | |
| ```bash | |
| # Generate teacher trajectories (Azure OpenAI, ~$3 per 30 episodes) | |
| python scripts/generate_teacher_trajectories.py \ | |
| --seeds 0-29 \ | |
| --output data/teacher_30ep.jsonl \ | |
| --concurrency 3 | |
| # Validate teacher quality under the v2 grader | |
| python scripts/reeval_teacher_trajectories.py \ | |
| --jsonl data/teacher_30ep.jsonl | |
| # Upload to HF Hub for the SFT job | |
| python scripts/upload_teacher_data.py \ | |
| --files data/teacher_30ep.jsonl \ | |
| --repo InosLihka/rhythm-env-teacher-trajectories | |
| # SFT-prime Qwen 2.5-3B on HF Jobs (a10g-large, ~$2-3, ~25 min) | |
| hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \ | |
| -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \ | |
| -e TEACHER_FILES=teacher_30ep.jsonl \ | |
| -e MODEL_REPO_SUFFIX=sft-v1 \ | |
| -d scripts/sft_on_hf.py | |
| # Eval the trained model | |
| python training/inference_eval.py \ | |
| --model_path InosLihka/rhythm-env-meta-trained-sft-v1 \ | |
| --output_file results.json | |
| ``` | |
| ## Plots | |
| In the trained model repo at | |
| `https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1/tree/main/plots`: | |
| - `training_loss.png` β SFT loss curve | |
| - `belief_accuracy.png` β student belief_mae over training | |
| - `final_scores.png` β student vs teacher vs heuristic vs random across all 3 conditions | |
| ## Cost | |
| | Stage | Cost | Notes | | |
| |---|---|---| | |
| | Teacher rollouts (30 eps validation) | ~$3 | gpt-5.4 via Azure AI Foundry | | |
| | SFT prime on HF Jobs (a10g-large) | ~$2 | ~25 min wall time | | |
| | Eval | ~$0.50 | included in HF Jobs run | | |
| | **Total for AD pipeline** | **~$5.50** | | | |
| Versus the prior 5 GRPO iterations that totaled ~$5.60 and produced no | |
| agent that beat heuristic on the v1 grader. The cost is similar; the | |
| recipe choice is what matters. | |