
# Training Guide: RhythmEnv GRPO

Scope of this doc: the pure-GRPO-from-scratch training stack exercised by `training/RhythmEnv_GRPO_Training.ipynb` (Iter 5 in `iterations.md`). The submission's headline numbers come from a different recipe, Algorithm Distillation (teacher trajectories → SFT), described in the README under Training: Algorithm Distillation. The GRPO machinery here is still used in the final pipeline, but only as the optional Stage 3 refinement on top of the SFT checkpoint, not as the from-scratch trainer.

## What we're training

A Qwen 2.5-3B model (4-bit quantized + LoRA) to play one-week episodes in RhythmEnv. The agent sees 5 life meters, time of day, and a reward signal. It must infer the hidden personality profile from those signals and adapt its action selection accordingly.

The goal is not to teach the model the rules of the environment; a capable LLM already understands them from the prompt. The goal is to calibrate a small model to do online behavioral inference: read who you're helping from how the environment responds, not from what it tells you.


## Stack

| Component | Choice |
|---|---|
| Model | `unsloth/Qwen2.5-3B-Instruct` |
| Quantization | 4-bit NF4 via Unsloth |
| LoRA rank | 4 |
| Training algorithm | GRPO (TRL 0.22.2) |
| Hardware | Free Colab T4 (~3 hours for 500 steps) |

## Four-layer reward stack

Each training step scores four candidate completions per prompt across four reward functions:

| Layer | Function | Signal | Pass | Fail |
|---|---|---|---|---|
| 1 | `format_valid` | Is the output parseable as ACTION + 3 belief digits? | +1.0 | −2.0 |
| 2 | `action_legal` | Is it one of the 10 valid `ActionType` values? | +0.5 | −1.0 |
| 3 | `env_reward` | Real reward from stepping the environment | varies | −3.0 |
| 4 | `belief_accuracy` | Cosine-MAE between the emitted belief vector and the hidden profile (the meta-RL signal) | +0.5 | −0.5 |
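
As a concrete anchor for layer 1, a minimal reward function in TRL's expected shape might look like the sketch below. The completion format (an upper-case action name followed by three digits) and the regex are assumptions, not the notebook's actual parser.

```python
import re

# Minimal sketch of the layer-1 check. The "ACTION d d d" shape and the
# regex are assumptions; the notebook's real parser may differ.
COMPLETION_RE = re.compile(r"^\s*([A-Z_]+)\s+(\d)\s+(\d)\s+(\d)\s*$")

def format_valid(completions, **kwargs):
    """TRL-style reward function: +1.0 if parseable, -2.0 otherwise."""
    rewards = []
    for completion in completions:
        # With conversational prompts, TRL passes each completion as a list
        # of message dicts; otherwise it is a plain string.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        rewards.append(1.0 if COMPLETION_RE.match(text.strip()) else -2.0)
    return rewards
```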

`env_reward` uses seed-based episode replay: the dataset stores `seed`, `step_index`, and `action_history` alongside each prompt. The reward function reconstructs the exact episode state and steps the environment with the candidate action, so the reward cannot be fabricated.
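
A hedged sketch of that replay logic follows, assuming a Gym-style `RhythmEnv` (`reset(seed=...)`, `step(action)`) and a hypothetical `parse_action` helper; TRL forwards extra dataset columns such as `seed`, `step_index`, and `action_history` to reward functions as per-sample keyword-argument lists.

```python
# Hedged sketch of seed-based episode replay. RhythmEnv's exact API and the
# parse_action helper are assumptions; only the replay idea mirrors the doc.
def env_reward(completions, seed, step_index, action_history, **kwargs):
    """Layer 3: rebuild the episode from its seed, then step with the candidate action.
    (step_index is implied by len(action_history) in this sketch.)"""
    rewards = []
    for completion, ep_seed, history in zip(completions, seed, action_history):
        text = completion[0]["content"] if isinstance(completion, list) else completion
        env = RhythmEnv()                  # assumed Gym-style environment
        env.reset(seed=ep_seed)            # seed -> deterministic profile + scripted events
        for past_action in history:        # replay actions taken before this step
            env.step(past_action)
        action = parse_action(text)        # hypothetical: completion text -> ActionType or None
        if action is None:
            rewards.append(-3.0)           # unparseable actions get the layer-3 floor
            continue
        _, reward, *_ = env.step(action)
        rewards.append(float(reward))
    return rewards
```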

`belief_accuracy` is the signal the v2 grader was added to measure. The first five GRPO iterations under the v1 (three-layer) grader kept tying with the heuristic baseline because nothing rewarded the agent for learning who it was helping. Adding `belief_accuracy` makes inference a first-class training signal. See `iterations.md` for the post-mortem.


## Key config choices

```python
GRPOConfig(
    beta=0.01,                 # KL penalty; the default 0.04 caused an explosion to kl=10731 at step 205
    max_completion_length=16,  # action names are ≤15 chars; prevents verbose drift
    learning_rate=2e-4,
    num_generations=4,         # 4 candidates per prompt; enough variance for the GRPO signal
    max_steps=500,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)
```

`beta=0.01` is the critical fix from the first training run. The default value caused the policy to drift so far from the reference model that completion length jumped from 4 tokens to 368 tokens, saturating the max. `max_completion_length=16` provides a hard cap as a second safeguard.
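
For orientation, the pieces come together roughly as below. This is a sketch, assuming the `GRPOConfig` above is bound to `config` and that `model`, `tokenizer`, `train_dataset`, and the four reward functions are already defined in earlier cells.

```python
from trl import GRPOTrainer

# Sketch of the final wiring; names come from the surrounding sections, not
# from the notebook verbatim.
trainer = GRPOTrainer(
    model=model,
    args=config,   # the GRPOConfig shown above
    train_dataset=train_dataset,
    reward_funcs=[format_valid, action_legal, env_reward, belief_accuracy],
    processing_class=tokenizer,
)
trainer.train()
```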


## Dataset

Generated from 200 simulated episodes using a mixed strategy (heuristic + random actions) across all three profiles. Each sample is one step:

```python
{
    "prompt": [system_msg, user_observation],
    "seed": int,              # episode seed -> deterministic profile + events
    "step_index": int,        # which step in the episode
    "action_history": list,   # actions taken before this step
}
```

The dataset gives the model exposure to all three profiles and a range of meter states. A mixed strategy (not pure heuristic) ensures the model sees suboptimal states and learns to recover from them.
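
A rough sketch of the generation loop, assuming a Gym-style `RhythmEnv` plus hypothetical `SYSTEM_MSG`, `observation_to_prompt`, `heuristic_action`, and `ALL_ACTIONS` helpers (the notebook's actual cell may differ):

```python
import random

# Hedged sketch of dataset generation from mixed-strategy rollouts.
def build_dataset(num_episodes=200, heuristic_prob=0.5):
    samples = []
    for ep_seed in range(num_episodes):
        env = RhythmEnv()
        obs = env.reset(seed=ep_seed)
        history, done, step = [], False, 0
        while not done:
            samples.append({
                "prompt": [SYSTEM_MSG, observation_to_prompt(obs)],
                "seed": ep_seed,
                "step_index": step,
                "action_history": list(history),
            })
            # Mixed strategy: mostly heuristic, partly random, so the model
            # also sees suboptimal states it must learn to recover from.
            if random.random() < heuristic_prob:
                action = heuristic_action(obs)
            else:
                action = random.choice(ALL_ACTIONS)  # assumed list of the 10 ActionType values
            obs, reward, done, info = env.step(action)
            history.append(action)
            step += 1
    return samples
```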


## Baselines

Established before training on 5 episodes × 3 profiles:

| Strategy | Introvert | Extrovert | Workaholic |
|---|---|---|---|
| Random | ~0.65 | ~0.70 | ~0.65 |
| Heuristic | ~0.78 | ~0.76 | ~0.82 |

The heuristic baseline uses observable rules only (sleep when vitality is low, meditate when serenity is low, socialise when connection drops). It cannot differentiate profiles.
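
In code, the heuristic amounts to something like the sketch below; the meter names, the 0.3 thresholds, and the fallback action are assumptions, not the notebook's exact policy.

```python
# Sketch of the profile-blind heuristic described above.
def heuristic_action(obs):
    if obs["vitality"] < 0.3:
        return "SLEEP"
    if obs["serenity"] < 0.3:
        return "MEDITATE"
    if obs["connection"] < 0.3:
        return "SOCIALISE"
    return "WORK"  # hypothetical default when no meter is critically low
```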

A trained agent should beat the heuristic on at least 2 of 3 profiles, with qualitatively different action sequences per profile: the introvert's week should look nothing like the workaholic's.


## How to run

Open `training/RhythmEnv_GRPO_Training.ipynb` in Colab with a T4 GPU runtime.

Run cells in order:

1. Install dependencies
2. Clone repo from HF Space
3. Verify environment
4. Run baseline evaluation (saves `baseline_results`)
5. Generate dataset
6. Load model (Qwen 2.5-3B + LoRA)
7. Set up reward functions
8. Configure training (`beta=0.01`, `max_completion_length=16`)
9. Train (`trainer.train()`)
10. Save model
11. Generate training plots
12. Evaluate trained model
13. Generate comparison chart (`baseline_vs_trained.png`)

## Expected training behaviour

Healthy run: `completion_length` stays at 3–16 tokens throughout, KL stays below 1.0, and mean reward climbs from ~1.5 toward ~3.0 over 500 steps.

Warning signs: `completion_length` spiking above 50, `clipped_ratio` approaching 1.0, KL above 5.0. If any of these appear, the `beta=0.01` fix is not being applied.
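
One way to check a finished run is to scan `trainer.state.log_history`; the sketch below assumes metric keys named `kl` and `completion_length`, which vary across TRL versions, so adjust them to whatever your run actually logs.

```python
# Hedged post-run sanity check over trainer.state.log_history.
def check_run_health(trainer, kl_limit=5.0, length_limit=50):
    warnings = []
    for entry in trainer.state.log_history:
        step = entry.get("step")
        if entry.get("kl", 0.0) > kl_limit:
            warnings.append(f"step {step}: KL {entry['kl']:.1f} above {kl_limit}")
        if entry.get("completion_length", 0) > length_limit:
            warnings.append(f"step {step}: mean completion length {entry['completion_length']}")
    return warnings
```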


## Output artifacts

After a successful run, download these and commit them to the repo:

- `plots/training_loss.png`: loss curve across 500 steps
- `plots/reward_curve.png`: mean reward with ±1 std band
- `plots/baseline_vs_trained.png`: comparison bar chart (random / heuristic / trained)
- `plots/eval_results.json`: raw per-episode scores
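
The comparison chart takes only a few lines of matplotlib; this sketch assumes a hypothetical `scores` dict mapping each profile to per-strategy mean episode scores, not the notebook's actual plotting cell.

```python
import matplotlib.pyplot as plt

# Hypothetical sketch of the baseline_vs_trained.png chart; `scores`
# (profile -> {strategy: mean episode score}) is an assumption.
profiles = ["Introvert", "Extrovert", "Workaholic"]
strategies = ["random", "heuristic", "trained"]
width = 0.25

fig, ax = plt.subplots(figsize=(8, 4))
for i, strategy in enumerate(strategies):
    values = [scores[profile][strategy] for profile in profiles]
    ax.bar([x + i * width for x in range(len(profiles))], values, width, label=strategy)
ax.set_xticks([x + width for x in range(len(profiles))])
ax.set_xticklabels(profiles)
ax.set_ylabel("Mean episode score")
ax.legend()
fig.savefig("plots/baseline_vs_trained.png", dpi=150)
```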