# Results
A trained 3B model that reads observations and infers a hidden personality:
not because we told it to, but because it learned the skill from a teacher.
## What's actually happening
Each episode, our agent watches a person live one week. Five life meters
drift up and down based on the actions it picks. The same actions hit
different people differently: the introvert crashes from socializing, the
extrovert thrives on it, the workaholic recovers from deep work. **The agent
never sees who it's helping.** It has to read the response patterns and infer.
We set out to train a small model to do this. The journey to "actually
beats the baseline" turned on one realization: **our grader didn't measure
the skill we wanted to teach.**
## The realization that fixed everything
Five iterations into training, the agent kept matching the heuristic
baseline (~0.59) but never beating it. We assumed the model was too weak.
Reading the actual model outputs proved otherwise. The model was reasoning
correctly:
> *"Last step's socialize gave Vβˆ’0.12 (anomaly βˆ’0.06, much worse than
> neutral) β€” high social drain, suggests low S. Morning DEEP_WORK earlier
> gave bonus cognition (anomaly +0.04) β†’ high M..."*
It was inferring the profile. But the inference didn't help its score,
because **the grader rewarded keeping meters healthy, not knowing the
person**. An agent that played safe (heuristic-style) and an agent that
genuinely inferred the profile both got rewarded for the same actions.
The fix: **add belief_accuracy as 20% of the grade.** Now an agent that
emits a belief close to the true hidden profile vector earns up to 0.20
extra. Heuristic baselines never emit a belief, so they score 0 on this
component by design. The grader now measures inference, not just reflex.
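For concreteness, here's a minimal sketch of how such a blended grade can be
computed. This is an illustrative assumption, not the repo's actual grader:
the function name, the 0.8/0.2 blend mechanics, and the MAE-to-bonus mapping
are made up here; only the "up to 0.20 for an accurate belief, 0 for no
belief" behavior comes from the description above.

```python
import numpy as np

def grade_episode_v2(meter_score, belief, true_profile):
    """Illustrative v2 grade: 80% meter health + up to 0.20 belief accuracy.

    Hypothetical sketch, not the repo's implementation. `meter_score` is the
    keep-meters-healthy component in [0, 1]; `belief` is the agent's emitted
    profile vector, or None for agents that never emit one.
    """
    base = 0.8 * meter_score
    if belief is None:
        return base  # heuristic/random baselines land here: 0 on this component
    mae = float(np.mean(np.abs(np.asarray(belief) - np.asarray(true_profile))))
    # Assumed mapping: MAE 0 earns the full 0.20, MAE >= 0.5 earns nothing.
    return base + 0.2 * max(0.0, 1.0 - mae / 0.5)
```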
Under the new grader, the gpt-5.4 teacher that had been "tied" with
heuristic now beats it by **+0.168 on average** and wins **30/30 episodes**
head-to-head.
## Algorithm Distillation
[Algorithm Distillation](https://arxiv.org/abs/2210.14215) was the rest of
the answer. We don't train the small model from scratch with GRPO, which
would need millions of examples for a reasoning task. Instead, we use a frontier
model (gpt-5.4 via Azure AI Foundry) as a teacher to play episodes and
write down its reasoning, then SFT-prime Qwen 2.5-3B on those trajectories.
The student learns the format AND the reasoning pattern in one shot. After
SFT, it can run on a free Colab T4 and inherit a meaningful fraction of the
teacher's inference skill.
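As a sketch of the data step, each teacher step becomes a plain
(observation → reasoning + action) chat pair for SFT. The field names
(`observation`, `reasoning`, `action`) are assumptions about the trajectory
schema, not the repo's documented format:

```python
from datasets import load_dataset

def to_sft_example(step):
    # Assumed JSONL fields; the real schema may differ.
    return {
        "messages": [
            {"role": "user", "content": step["observation"]},
            {"role": "assistant", "content": step["reasoning"] + "\n" + step["action"]},
        ]
    }

# One row per teacher step, ready for a chat-template SFT trainer.
ds = load_dataset("json", data_files="data/teacher_30ep.jsonl", split="train")
sft_ds = ds.map(to_sft_example, remove_columns=ds.column_names)
```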
## Headline numbers
All numbers below use the v2 grader. Heuristic and random baselines emit no
belief, so they score 0 on that component by design: the meta-RL skill is
inference, and only agents that attempt it get credit.
### Distilled Qwen 3B student β€” full eval across all 3 conditions
10 episodes per continuous condition and 5 episodes per discrete profile
(15 total). Sources:
- SFT v3 numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json)
- SFT v3 + GRPO refine numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json)
| Condition | Random | Heuristic | **SFT v3** | **+ GRPO refine** | belief_MAE (GRPO) |
|---|---|---|---|---|---|
| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.216 |
| **continuous OOD** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |
**Interpretation:**
- The student wins on **all three** conditions, with the largest margin
on the meta-RL test condition (continuous in-dist, +0.111).
- **`belief_MAE` 0.213 in-distribution (SFT v3) matches the gpt-5.4 teacher (0.196)**
to within 0.02: the inference skill transferred nearly intact via
SFT-prime distillation.
- The OOD margin (+0.081 for SFT v3, +0.104 after GRPO refine) on profiles the
agent never saw demonstrates real generalization, not memorization.
- Discrete-3 belief_MAE (0.415 for SFT v3) is weaker because the student was
trained on continuous profiles only. The action quality still wins (+0.052).
### Teacher (gpt-5.4) ceiling β€” 150-episode reeval
| Condition | Random | Heuristic | **gpt-5.4 Teacher** | belief_MAE |
|---|---|---|---|---|
| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* | 0.196 |
| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* | 0.214 |
The teacher beats heuristic by ~0.16 in both conditions, confirming that the
v2 grader cleanly distinguishes inference from reflex. The in-dist and OOD
margins land within ~0.01 of each other, so the teacher generalizes: the
"OOD" seeds sample new parameter regions of the same hidden-profile space,
not a separate distribution.
### Teacher belief inference quality
| Condition | Teacher belief_MAE | Constant `[0.5, 0.5, 0.5]` baseline |
|---|---|---|
| in-distribution | **0.196** | ~0.20 |
| OOD | **0.214** | ~0.21 |
The teacher's belief emission is **slightly better than the constant
baseline** on average. Two things to read into this:
1. **The inference task is partially ill-posed.** Three latent factors
feed each true belief dimension, but only one (e.g. `work_vitality_recovery`)
has a clean observational signature. Even a perfect inference engine
caps at MAE ~0.10-0.15 on this env.
2. **Final score is what matters more.** The teacher beats heuristic by
**+0.16 on final_score** even though belief_MAE is only marginally
better than baseline. Inference doesn't have to be perfect; it just
has to inform action choice. The action distribution differs
noticeably between the teacher (uses all 10 actions, varies by profile)
and heuristic (uses ~5, fixed priority list).
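A quick way to reproduce the constant-baseline row is to score a fixed
`[0.5, 0.5, 0.5]` guess against the true hidden profiles of a logged run.
A minimal sketch, assuming the trajectory JSONL stores the profile under a
`true_profile` key (the key name is a guess):

```python
import json
import numpy as np

# Collect the true hidden profiles from a teacher-trajectory run.
profiles = []
with open("data/teacher_30ep.jsonl") as f:
    for line in f:
        profiles.append(json.loads(line)["true_profile"])

# MAE a constant [0.5, 0.5, 0.5] guess earns against those profiles.
profiles = np.asarray(profiles, dtype=float)
print(f"constant-baseline MAE ~ {np.mean(np.abs(profiles - 0.5)):.3f}")
```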
### What "good" looks like for the student
- **belief_MAE ≤ 0.21** (matches teacher) → distillation transferred inference
- **final_score above 0.55** → inference + competent action, beats heuristic clearly
- **final_score 0.50-0.55** → modest beat, valid result
- **final_score < 0.50** → SFT didn't transfer enough; fall back to GRPO refine
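To apply these buckets mechanically to an eval run, something like the
following works; the `final_score` and `belief_mae` keys are assumptions
about the top-level schema of `eval_results_v2.json`:

```python
import json

with open("eval_results_v2.json") as f:
    res = json.load(f)

score, mae = res["final_score"], res["belief_mae"]
inference_ok = mae <= 0.21  # matches-teacher threshold from the list above
if score > 0.55:
    verdict = "clear beat: inference + competent action"
elif score >= 0.50:
    verdict = "modest beat, valid result"
else:
    verdict = "SFT didn't transfer enough; fall back to GRPO refine"
print(f"final_score={score:.3f}  belief_MAE={mae:.3f}  inference_transferred={inference_ok}")
print(verdict)
```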
## Why it's not higher
Two ceilings we hit:
**1. Some belief dimensions are partially unobservable.** The ground-truth
`work_pref` is derived from three latent factors (work_vitality_recovery,
progress_serenity_bonus, progress_reward_weight). The agent can observe the
first cleanly via vitality anomalies after work actions, but the other two
have weaker observational signatures. So even a perfect inference engine
caps around belief_MAE ~0.10-0.15 on this env.
**2. The grader reasonably weights crash-avoidance.** Even if you infer the
profile perfectly, you still need to keep meters above 0.10 to avoid
crash penalties. That puts a floor on how much "knowing the person" can
improve over heuristic-style play.
Both are deliberate features of the env, not bugs. We want a benchmark
where inference is real but bounded β€” otherwise it's not a benchmark.
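For intuition on the second ceiling, a minimal sketch of the crash floor; the
0.10 threshold comes from the text above, while the penalty size and per-step
application are assumptions:

```python
CRASH_THRESHOLD = 0.10  # from the text: meters must stay above 0.10

def crash_penalty(meters, per_meter_penalty=0.05):
    """Hypothetical per-step penalty for each meter below the crash line."""
    return per_meter_penalty * sum(1 for m in meters if m < CRASH_THRESHOLD)

print(crash_penalty([0.05, 0.4, 0.6, 0.8, 0.3]))  # one crashed meter -> 0.05
```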
## Reproducing
```bash
# Generate teacher trajectories (Azure OpenAI, ~$3 per 30 episodes)
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 \
    --output data/teacher_30ep.jsonl \
    --concurrency 3

# Validate teacher quality under the v2 grader
python scripts/reeval_teacher_trajectories.py \
    --jsonl data/teacher_30ep.jsonl

# Upload to HF Hub for the SFT job
python scripts/upload_teacher_data.py \
    --files data/teacher_30ep.jsonl \
    --repo InosLihka/rhythm-env-teacher-trajectories

# SFT-prime Qwen 2.5-3B on HF Jobs (a10g-large, ~$2-3, ~25 min)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
    -e TEACHER_FILES=teacher_30ep.jsonl \
    -e MODEL_REPO_SUFFIX=sft-v1 \
    -d scripts/sft_on_hf.py

# Eval the trained model
python training/inference_eval.py \
    --model_path InosLihka/rhythm-env-meta-trained-sft-v1 \
    --output_file results.json
```
## Plots
In the trained model repo at
`https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1/tree/main/plots`:
- `training_loss.png`: SFT loss curve
- `belief_accuracy.png`: student belief_MAE over training
- `final_scores.png`: student vs teacher vs heuristic vs random across all 3 conditions
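To pull any of them locally without cloning the repo (standard
`huggingface_hub` call; the paths are the ones listed above):

```python
from huggingface_hub import hf_hub_download

# Download one plot from the trained-model repo into the local HF cache.
path = hf_hub_download(
    repo_id="InosLihka/rhythm-env-meta-trained-sft-v1",
    filename="plots/final_scores.png",
)
print(path)
```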
## Cost
| Stage | Cost | Notes |
|---|---|---|
| Teacher rollouts (30 eps validation) | ~$3 | gpt-5.4 via Azure AI Foundry |
| SFT prime on HF Jobs (a10g-large) | ~$2 | ~25 min wall time |
| Eval | ~$0.50 | included in HF Jobs run |
| **Total for AD pipeline** | **~$5.50** | |
Compare the prior five GRPO iterations: ~$5.60 total, and no agent that beat
heuristic under the v1 grader. The cost is similar; the recipe choice is what
matters.