# Training Guide: RhythmEnv GRPO

> **Scope of this doc:** the **pure-GRPO-from-scratch** training stack
> exercised by [`training/RhythmEnv_GRPO_Training.ipynb`](../training/RhythmEnv_GRPO_Training.ipynb)
> (Iter 5 in [`iterations.md`](iterations.md)). The submission's **headline
> numbers** come from a different recipe, **Algorithm Distillation**
> (teacher trajectories → SFT), described in the [README](../README.md)
> under *Training: Algorithm Distillation*. The GRPO machinery here is
> still used in the final pipeline, but only as the *optional Stage 3
> refine on top of the SFT checkpoint*, not as the from-scratch trainer.
## What we're training

We train a Qwen 2.5-3B model (4-bit quantized + LoRA) to play one-week episodes in RhythmEnv. The agent sees 5 life meters, time of day, and a reward signal. It must infer the hidden personality profile from those signals and adapt its action selection accordingly.

The goal is not to teach the model the rules of the environment; a capable LLM already understands them from the prompt. The goal is to calibrate a small model to do online behavioral inference: read who you're helping from how the environment responds, not from what it tells you.

---
## Stack

| Component | Choice |
|---|---|
| Model | `unsloth/Qwen2.5-3B-Instruct` |
| Quantization | 4-bit NF4 via Unsloth |
| LoRA rank | 4 |
| Training algorithm | GRPO (TRL 0.22.2) |
| Hardware | Free Colab T4 (~3 hours for 500 steps) |
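A minimal sketch of that setup with Unsloth; `max_seq_length`, `lora_alpha`, and the `target_modules` list are illustrative values, not copied from the notebook:

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit NF4.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",
    max_seq_length=2048,   # assumed; the prompts here are short
    load_in_4bit=True,
)

# Attach rank-4 LoRA adapters. The target_modules list is a common choice
# for Qwen-style architectures, not necessarily the notebook's exact list.
model = FastLanguageModel.get_peft_model(
    model,
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```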
---
## Four-layer reward stack

Each training step scores four candidate completions per prompt across four reward functions:

| Layer | Function | Signal | Pass | Fail |
|---|---|---|---|---|
| 1 | `format_valid` | Is the output parseable as `ACTION + 3 belief digits`? | +1.0 | -2.0 |
| 2 | `action_legal` | Is it one of the 10 valid `ActionType` values? | +0.5 | -1.0 |
| 3 | `env_reward` | Real reward from stepping the environment | varies | -3.0 |
| 4 | `belief_accuracy` | Cosine-MAE between emitted belief vector and the hidden profile (the meta-RL signal) | +0.5 | -0.5 |
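Layers 1 and 2 could look like the sketch below as TRL reward functions, assuming plain-string completions (a conversational prompt format would add one unwrap step). The `PATTERN` regex and the partial `ACTION_NAMES` set are assumptions about the output grammar, not taken from the notebook:

```python
import re

# Assumed output grammar: an action name followed by three belief digits,
# e.g. "SLEEP 7 2 1". The notebook defines the real grammar.
PATTERN = re.compile(r"^\s*([A-Z_]+)\s+(\d)\s+(\d)\s+(\d)\s*$")

# Partial, assumed action vocabulary; the env defines 10 ActionType values,
# only three of which are named in this doc.
ACTION_NAMES = {"SLEEP", "MEDITATE", "SOCIALIZE"}

def format_valid(completions, **kwargs):
    """Layer 1: +1.0 for parseable output, -2.0 otherwise."""
    return [1.0 if PATTERN.match(c) else -2.0 for c in completions]

def action_legal(completions, **kwargs):
    """Layer 2: +0.5 if the parsed action is a legal ActionType, -1.0 otherwise."""
    out = []
    for c in completions:
        m = PATTERN.match(c)
        out.append(0.5 if m and m.group(1) in ACTION_NAMES else -1.0)
    return out
```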
`env_reward` uses seed-based episode replay: the dataset stores `seed`, `step_index`, and `action_history` alongside each prompt. The reward function reconstructs the exact episode state and steps the environment with the candidate action, so the reward cannot be fabricated.
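A sketch of that replay mechanism, reusing `PATTERN` and `ACTION_NAMES` from above and assuming a Gymnasium-style `RhythmEnv` with `reset(seed=...)` and a 5-tuple `step(...)`. TRL forwards extra dataset columns (`seed`, `action_history`, and so on) to reward functions as keyword arguments:

```python
def env_reward(completions, seed, action_history, **kwargs):
    """Layer 3: deterministic episode replay, then one real env step."""
    rewards = []
    for c, s, hist in zip(completions, seed, action_history):
        env = RhythmEnv()          # assumed env class name
        env.reset(seed=s)          # same seed -> same hidden profile + events
        for a in hist:             # fast-forward to the stored step
            env.step(a)
        m = PATTERN.match(c)
        if m is None or m.group(1) not in ACTION_NAMES:
            rewards.append(-3.0)   # can't step the env with an invalid action
            continue
        _, r, *_ = env.step(m.group(1))
        rewards.append(float(r))
    return rewards
```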
`belief_accuracy` is what the v2 grader was added to measure. The first five GRPO iterations under the v1 (three-layer) grader kept tying with the heuristic baseline because nothing rewarded the agent for *learning who it was helping*. Adding `belief_accuracy` makes that inference a first-class training signal. See [`iterations.md`](iterations.md) for the post-mortem.
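One plausible reading of that layer, treating the score as a thresholded MAE; the 0-9 digit scaling, the `hidden_profile_for_seed` helper, and the 0.25 threshold are all assumptions, and the table's actual Cosine-MAE blend lives in the notebook:

```python
import numpy as np

def belief_accuracy(completions, seed, **kwargs):
    """Layer 4: +0.5 if the emitted belief vector is close to the hidden profile."""
    rewards = []
    for c, s in zip(completions, seed):
        m = PATTERN.match(c)
        if m is None:
            rewards.append(-0.5)
            continue
        belief = np.array([int(d) for d in m.groups()[1:]]) / 9.0  # assumed 0-9 scaling
        profile = hidden_profile_for_seed(s)  # hypothetical lookup helper
        rewards.append(0.5 if np.abs(belief - profile).mean() < 0.25 else -0.5)
    return rewards
```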
---
## Key config choices

```python
GRPOConfig(
    beta=0.01,                 # KL penalty; the default 0.04 caused an explosion to kl=10731 at step 205
    max_completion_length=16,  # action names are <=15 chars; prevents verbose drift
    learning_rate=2e-4,
    num_generations=4,         # 4 candidates per prompt; enough variance for a GRPO signal
    max_steps=500,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)
```
`beta=0.01` is the critical fix from the first training run. The default value caused the policy to drift so far from the reference model that completion length jumped from 4 tokens to 368, saturating the max. `max_completion_length=16` provides a hard cap as a second safeguard.
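Wiring it all together is standard TRL; a sketch assuming the model, dataset, and the four reward functions shown above:

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    args=training_args,  # the GRPOConfig above
    train_dataset=dataset,
    reward_funcs=[format_valid, action_legal, env_reward, belief_accuracy],
)
trainer.train()
```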
---
## Dataset

Generated from 200 simulated episodes using a mixed strategy (heuristic + random actions) across all three profiles. Each sample is one step:

```python
{
    "prompt": [system_msg, user_observation],
    "seed": int,             # episode seed: deterministic profile + events
    "step_index": int,       # which step in the episode
    "action_history": list,  # actions taken before this step
}
```
The dataset gives the model exposure to all three profiles and a range of meter states. The mixed strategy (not pure heuristic) ensures the model sees suboptimal states and learns to recover from them.
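A sketch of that generation loop, under the same Gymnasium-style assumptions as above; `SYSTEM_MSG`, `render_observation`, and `heuristic_action` are hypothetical helpers, and the 0.3 exploration rate is illustrative:

```python
import random

def generate_dataset(num_episodes=200, eps=0.3):
    """Mixed strategy: follow the heuristic, but go random with probability eps."""
    samples = []
    for seed in range(num_episodes):
        env = RhythmEnv()
        obs, _ = env.reset(seed=seed)
        history, step, done = [], 0, False
        while not done:
            samples.append({
                "prompt": [SYSTEM_MSG, render_observation(obs)],
                "seed": seed,
                "step_index": step,
                "action_history": list(history),
            })
            action = (random.choice(sorted(ACTION_NAMES)) if random.random() < eps
                      else heuristic_action(obs))
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            history.append(action)
            step += 1
    return samples
```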
---
## Baselines

Established before training on 5 episodes × 3 profiles:

| Strategy | Introvert | Extrovert | Workaholic |
|---|---|---|---|
| Random | ~0.65 | ~0.70 | ~0.65 |
| Heuristic | ~0.78 | ~0.76 | ~0.82 |

The heuristic baseline uses observable rules only (sleep when vitality is low, meditate when serenity is low, socialise when connection drops). It cannot differentiate profiles.
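One possible body for the `heuristic_action` helper referenced earlier; the meter keys, thresholds, and fallback action are assumptions read off the description above:

```python
def heuristic_action(obs):
    """Observable rules only; identical behaviour for every profile."""
    if obs["vitality"] < 0.3:
        return "SLEEP"
    if obs["serenity"] < 0.3:
        return "MEDITATE"
    if obs["connection"] < 0.3:
        return "SOCIALIZE"
    return "WORK"  # assumed fallback when every meter is healthy
```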
A trained agent should beat the heuristic on at least 2 of 3 profiles, with qualitatively different action sequences per profile: the introvert's week should look nothing like the workaholic's.
---
## How to run

Open [`training/RhythmEnv_GRPO_Training.ipynb`](../training/RhythmEnv_GRPO_Training.ipynb) in Colab with a T4 GPU runtime.

Run cells in order:

1. Install dependencies
2. Clone repo from HF Space
3. Verify environment
4. Run baseline evaluation (saves `baseline_results`)
5. Generate dataset
6. Load model (Qwen 2.5-3B + LoRA)
7. Set up reward functions
8. Configure training (`beta=0.01`, `max_completion_length=16`)
9. Train (`trainer.train()`)
10. Save model
11. Generate training plots
12. Evaluate trained model
13. Generate comparison chart (`baseline_vs_trained.png`)
---
## Expected training behaviour

Healthy run: `completion_length` stays at 3-16 tokens throughout, KL stays below 1.0, and mean reward climbs from ~1.5 toward ~3.0 over 500 steps.

Warning signs: `completion_length` spiking above 50, `clipped_ratio` approaching 1.0, KL above 5.0. If any of these appear, the `beta=0.01` fix is not being applied.
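A quick post-hoc scan of `trainer.state.log_history` can flag these; the metric key names (`completion_length`, `kl`, `clipped_ratio`) vary across TRL versions, so verify them against your own logs:

```python
def check_run_health(log_history):
    """Flag the warning signs above in a finished run's logs."""
    for entry in log_history:
        step = entry.get("step")
        if entry.get("completion_length", 0) > 50:
            print(f"step {step}: completion_length spiking - check beta")
        if entry.get("kl", 0) > 5.0:
            print(f"step {step}: KL above 5.0 - policy drifting from reference")
        if entry.get("clipped_ratio", 0) > 0.9:
            print(f"step {step}: clipped_ratio approaching 1.0")

check_run_health(trainer.state.log_history)
```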
---
## Output artifacts

After a successful run, download these and commit them to the repo:

```
plots/training_loss.png        - loss curve across 500 steps
plots/reward_curve.png         - mean reward with ±1 std band
plots/baseline_vs_trained.png  - comparison bar chart (random / heuristic / trained)
plots/eval_results.json        - raw per-episode scores
```
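`reward_curve.png` can be produced straight from the trainer logs; a sketch assuming TRL's usual `reward` / `reward_std` log keys:

```python
import matplotlib.pyplot as plt

history = [e for e in trainer.state.log_history if "reward" in e]
steps = [e["step"] for e in history]
mean = [e["reward"] for e in history]
std = [e.get("reward_std", 0.0) for e in history]

plt.plot(steps, mean, label="mean reward")
plt.fill_between(steps,
                 [m - s for m, s in zip(mean, std)],
                 [m + s for m, s in zip(mean, std)],
                 alpha=0.3, label="±1 std")
plt.xlabel("step")
plt.ylabel("reward")
plt.legend()
plt.savefig("plots/reward_curve.png", dpi=150)
```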