
# Results

A trained 3B model that reads observations and infers a hidden personality: not because we told it to, but because it learned the skill from a teacher.

## What's actually happening

Each episode, our agent watches a person live one week. Five life meters drift up and down based on the actions it picks. The same actions hit different people differently: the introvert crashes from socializing, the extrovert thrives on it, the workaholic recovers from deep work. The agent never sees who it's helping. It has to read the response patterns and infer.
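
The dynamics above can be sketched as a toy step function. This is a minimal illustration, not the real rhythm_env: the meter names, action effects, and magnitudes are assumptions, and only two of the five meters are shown.

```python
import numpy as np

def step(meters, action, profile, rng):
    """One illustrative environment step (toy sketch, not rhythm_env itself).

    meters:  dict of life meters in [0, 1] (two of five shown here)
    profile: hidden personality vector in [0, 1]^3, never shown to the agent
    """
    effects = {
        # SOCIALIZE helps extroverts (high profile[0]) and drains introverts.
        "SOCIALIZE": {"vitality": 0.10 * (2 * profile[0] - 1)},
        # DEEP_WORK restores the workaholic (high profile[1]), taxes others.
        "DEEP_WORK": {"cognition": 0.08 * (2 * profile[1] - 1)},
        "REST":      {"vitality": 0.05},
    }
    for meter, delta in effects.get(action, {}).items():
        noisy = meters[meter] + delta + rng.normal(0, 0.02)
        meters[meter] = float(np.clip(noisy, 0.0, 1.0))
    return meters
```

The same action moves the same meter in opposite directions depending on the hidden profile, which is what forces inference.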

We set out to train a small model to do this. The journey to "actually beats the baseline" turned on one realization: our grader didn't measure the skill we wanted to teach.

## The realization that fixed everything

Five iterations into training, the agent kept matching the heuristic baseline (~0.59) but never beating it. We assumed the model was too weak.

Reading the actual model outputs proved otherwise. The model was reasoning correctly:

> "Last step's socialize gave V −0.12 (anomaly −0.06, much worse than neutral): high social drain, suggests low S. Morning DEEP_WORK earlier gave bonus cognition (anomaly +0.04) → high M..."

It was inferring the profile. But the inference didn't help its score, because the grader rewarded keeping meters healthy, not knowing the person. An agent that played safe (heuristic-style) and an agent that genuinely inferred the profile both got rewarded for the same actions.

The fix: add belief_accuracy as 20% of the grade. Now an agent that emits a belief close to the true hidden profile vector earns up to 0.20 extra. Heuristic baselines never emit a belief, so they score 0 on this component by design. The grader now measures inference, not just reflex.
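
The scoring change can be sketched like this. It is an assumption-laden sketch, not the actual grader: we assume the meter component is already in [0, 1] and that belief accuracy is a clipped `1 - MAE`.

```python
def grade_v2(meter_score, belief, true_profile):
    """Sketch of the v2 grade: 80% meter health, 20% belief accuracy.

    Assumptions (not the exact grader): meter_score is in [0, 1];
    belief accuracy is 1 - MAE, clipped to [0, 1]; agents that emit
    no belief (belief=None, e.g. the heuristic baseline) score 0
    on the belief component.
    """
    if belief is None:
        belief_acc = 0.0
    else:
        mae = sum(abs(b - t) for b, t in zip(belief, true_profile)) / len(true_profile)
        belief_acc = max(0.0, 1.0 - mae)
    return 0.8 * meter_score + 0.2 * belief_acc
```

Under this shape, a perfect belief is worth exactly 0.20 over a belief-less baseline with the same meter play.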

Under the new grader, the gpt-5.4 teacher that had been "tied" with heuristic now beats it by +0.168 on average and wins 30/30 episodes head-to-head.

## Algorithm Distillation

Algorithm Distillation was the rest of the answer. We don't train the small model from scratch with GRPO; that needs millions of examples for a reasoning task. Instead, we use a frontier model (gpt-5.4 via Azure AI Foundry) as a teacher to play episodes and write down its reasoning, then SFT-prime Qwen 2.5-3B on those trajectories.

The student learns the format AND the reasoning pattern in one shot. After SFT, it can run on a free Colab T4 and inherit a meaningful fraction of the teacher's inference skill.
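
Concretely, distillation here means converting each teacher step into a supervised chat example. The field names below (`observation`, `reasoning`, `belief`, `action`) are assumptions about the trajectory JSONL schema, for illustration only.

```python
def to_sft_example(step_record):
    """Turn one teacher trajectory step into a chat-style SFT example.

    Hypothetical schema: the real JSONL field names may differ. The key
    point is that the assistant target contains the teacher's reasoning,
    its belief emission, and its action, so the student learns all three.
    """
    return {
        "messages": [
            {"role": "user", "content": step_record["observation"]},
            {"role": "assistant", "content": (
                f"{step_record['reasoning']}\n"
                f"belief: {step_record['belief']}\n"
                f"action: {step_record['action']}"
            )},
        ]
    }
```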

## Headline numbers

All numbers below are under the v2 grader. Heuristic and random baselines emit no belief and score 0 on that component (by design: the meta-RL skill is inference, so only agents that attempt it get credit).

### Distilled Qwen 3B student: full eval across all 3 conditions

10 episodes per condition for the continuous conditions, 5 episodes per discrete profile (15 total).

| Condition | Random | Heuristic | SFT v3 | + GRPO refine | belief_MAE (GRPO) |
|---|---|---|---|---|---|
| continuous in-distribution | 0.393 | 0.463 | 0.574 (+0.111) | 0.573 | 0.216 |
| continuous OOD | 0.393 | 0.455 | 0.536 (+0.081) | 0.559 (+0.104) | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 (+0.052) | 0.520 (+0.065) | 0.430 |

Interpretation:

- The student wins on all three conditions, with the largest margin on the meta-RL test condition (continuous in-distribution, +0.111).
- belief_MAE 0.216 in-distribution matches the gpt-5.4 teacher (0.196) to within 0.02: the inference skill transferred nearly perfectly via SFT-prime distillation.
- The OOD margin (+0.081) on profiles the agent never saw demonstrates real generalization, not memorization.
- Discrete-3 belief_MAE (0.430) is weaker because the student was trained on continuous profiles only. The action quality still wins (+0.052).
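
For concreteness, the belief_MAE column is just the mean absolute error between the emitted belief vector and the hidden profile, both in [0, 1] per dimension:

```python
def belief_mae(belief, true_profile):
    """Mean absolute error between an emitted belief vector and the
    hidden profile (the belief_MAE column in the tables above)."""
    assert len(belief) == len(true_profile)
    return sum(abs(b - t) for b, t in zip(belief, true_profile)) / len(true_profile)
```

For example, a constant guess of [0.5, 0.5, 0.5] against a true profile of [0.3, 0.6, 0.9] gives (0.2 + 0.1 + 0.4) / 3 ≈ 0.233.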

### Teacher (gpt-5.4) ceiling: 150-episode reeval

| Condition | Random | Heuristic | gpt-5.4 Teacher | belief_MAE |
|---|---|---|---|---|
| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | 0.611 (100/100 wins) | 0.196 |
| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | 0.621 (50/50 wins) | 0.214 |

The teacher beats heuristic by ~0.16 across both conditions, confirming the v2 grader cleanly distinguishes inference from reflex.

The teacher also generalizes: the margin over heuristic is the same ~0.16 in-distribution and OOD, within ~0.01 of each other. This is expected, since the OOD seeds come from different parameter regions of the same hidden profile space, not from a separate distribution.

### Teacher belief inference quality

| Condition | Teacher belief_MAE | Constant [0.5, 0.5, 0.5] baseline |
|---|---|---|
| in-distribution | 0.196 | ~0.20 |
| OOD | 0.214 | ~0.21 |

The teacher's belief emission is only marginally better than the constant baseline on average. Two things to read into this:

  1. The inference task is partially ill-posed. Three latent factors feed each true belief dimension, but only one (e.g. work_vitality_recovery) has a clean observational signature. Even a perfect inference engine caps at MAE ~0.10-0.15 on this env.
  2. Final-score is what matters more. The teacher beats heuristic by +0.16 on final_score even though belief_MAE is only marginally better than baseline. Inference doesn't have to be perfect; it just has to inform action choice. The action distribution differs noticeably between the teacher (uses all 10 actions, varies by profile) and heuristic (uses ~5, fixed priority list).

## What "good" looks like for the student

- belief_MAE ≤ 0.21 (matches teacher) → distillation transferred inference
- final_score above 0.55 → inference + competent action, beats heuristic clearly
- final_score 0.50-0.55 → modest beat, valid result
- final_score < 0.50 → SFT didn't transfer enough; fall back to GRPO refine
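
The checklist above can be encoded as a small gate function. This is our reading of the thresholds; how to treat edge cases (e.g. a final_score of exactly 0.55) is an interpretation, not something the text pins down.

```python
def judge_student(final_score, belief_mae):
    """Map eval numbers to the verdicts listed above.

    Threshold edge cases (exactly 0.55, exactly 0.21) are interpreted
    here, not specified by the original criteria.
    """
    verdicts = []
    if belief_mae <= 0.21:
        verdicts.append("distillation transferred inference")
    if final_score > 0.55:
        verdicts.append("beats heuristic clearly")
    elif final_score >= 0.50:
        verdicts.append("modest beat, valid result")
    else:
        verdicts.append("fall back to GRPO refine")
    return verdicts
```

On the headline in-distribution numbers (final_score 0.574, belief_MAE 0.216), this gate reports a clear beat while the belief criterion misses by 0.006.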

## Why it's not higher

Two ceilings we hit:

1. Some belief dimensions are partially unobservable. The ground-truth work_pref is derived from three latent factors (work_vitality_recovery, progress_serenity_bonus, progress_reward_weight). The agent can observe the first cleanly via vitality anomalies after work actions, but the other two have weaker observational signatures. So even a perfect inference engine caps around belief_MAE ~0.10-0.15 on this env.

2. The grader reasonably weights crash-avoidance. Even if you infer the profile perfectly, you still need to keep meters above 0.10 to avoid crash penalties. That puts a floor on how much "knowing the person" can improve over heuristic-style play.
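
That crash floor can be sketched as a penalty term. The 0.10 floor is from the text; the per-crash penalty size here is an assumption for illustration.

```python
def crash_penalty(meters, floor=0.10, penalty_per_crash=0.05):
    """Illustrative crash-avoidance term: each meter below the floor
    costs score, no matter how accurately the profile was inferred.
    The 0.10 floor comes from the text; the penalty size is assumed."""
    return penalty_per_crash * sum(1 for m in meters if m < floor)
```

Because this term depends only on meter levels, perfect inference cannot buy it back, which is exactly the floor on improvement described above.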

Both are deliberate features of the env, not bugs. We want a benchmark where inference is real but bounded; otherwise it's not a benchmark.

## Reproducing

```bash
# Generate teacher trajectories (Azure OpenAI, ~$3 per 30 episodes)
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 \
    --output data/teacher_30ep.jsonl \
    --concurrency 3

# Validate teacher quality under the v2 grader
python scripts/reeval_teacher_trajectories.py \
    --jsonl data/teacher_30ep.jsonl

# Upload to HF Hub for the SFT job
python scripts/upload_teacher_data.py \
    --files data/teacher_30ep.jsonl \
    --repo InosLihka/rhythm-env-teacher-trajectories

# SFT-prime Qwen 2.5-3B on HF Jobs (a10g-large, ~$2-3, ~25 min)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
    -e TEACHER_FILES=teacher_30ep.jsonl \
    -e MODEL_REPO_SUFFIX=sft-v1 \
    -d scripts/sft_on_hf.py

# Eval the trained model
python training/inference_eval.py \
    --model_path InosLihka/rhythm-env-meta-trained-sft-v1 \
    --output_file results.json
```

## Plots

In the trained model repo at https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1/tree/main/plots:

- `training_loss.png`: SFT loss curve
- `belief_accuracy.png`: student belief_MAE over training
- `final_scores.png`: student vs teacher vs heuristic vs random across all 3 conditions

## Cost

| Stage | Cost | Notes |
|---|---|---|
| Teacher rollouts (30 eps validation) | ~$3 | gpt-5.4 via Azure AI Foundry |
| SFT prime on HF Jobs (a10g-large) | ~$2 | ~25 min wall time |
| Eval | ~$0.50 | included in HF Jobs run |
| Total for AD pipeline | ~$5.50 | |

Compare the prior 5 GRPO iterations, which totaled ~$5.60 and produced no agent that beat heuristic on the v1 grader. The cost is similar; the recipe choice is what matters.