# Training Guide: RhythmEnv GRPO

> **Scope of this doc:** the **pure-GRPO-from-scratch** training stack
> exercised by [`training/RhythmEnv_GRPO_Training.ipynb`](../training/RhythmEnv_GRPO_Training.ipynb)
> (Iter 5 in [`iterations.md`](iterations.md)). The submission's **headline
> numbers** come from a different recipe, **Algorithm Distillation**
> (teacher trajectories → SFT), described in the [README](../README.md)
> under *Training: Algorithm Distillation*. The GRPO machinery here is
> still used in the final pipeline, but only as the *optional Stage 3
> refine on top of the SFT checkpoint*, not as the from-scratch trainer.
## What we're training

We train a Qwen 2.5-3B model (4-bit quantized + LoRA) to play one-week episodes in RhythmEnv. The agent sees 5 life meters, time of day, and a reward signal. It must infer the hidden personality profile from those signals and adapt its action selection accordingly.

The goal is not to teach the model the rules of the environment; a capable LLM already understands them from the prompt. The goal is to calibrate a small model to do online behavioral inference: read who you're helping from how the environment responds, not from what it tells you.

---
## Stack

| Component | Choice |
|---|---|
| Model | `unsloth/Qwen2.5-3B-Instruct` |
| Quantization | 4-bit NF4 via Unsloth |
| LoRA rank | 4 |
| Training algorithm | GRPO (TRL 0.22.2) |
| Hardware | Free Colab T4 (~3 hours for 500 steps) |
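A minimal sketch of that setup with Unsloth; `max_seq_length`, `lora_alpha`, and the `target_modules` list are illustrative values, not copied from the notebook:

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit NF4.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",
    max_seq_length=2048,   # assumed; the prompts here are short
    load_in_4bit=True,
)

# Attach rank-4 LoRA adapters. The target_modules list is a common choice
# for Qwen-style architectures, not necessarily the notebook's exact list.
model = FastLanguageModel.get_peft_model(
    model,
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```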
---
## Four-layer reward stack

Each training step scores four candidate completions per prompt across four reward functions:

| Layer | Function | Signal | Pass | Fail |
|---|---|---|---|---|
| 1 | `format_valid` | Is the output parseable as `ACTION + 3 belief digits`? | +1.0 | -2.0 |
| 2 | `action_legal` | Is it one of the 10 valid `ActionType` values? | +0.5 | -1.0 |
| 3 | `env_reward` | Real reward from stepping the environment | varies | -3.0 |
| 4 | `belief_accuracy` | Cosine-MAE between emitted belief vector and the hidden profile (the meta-RL signal) | +0.5 | -0.5 |
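Layers 1 and 2 could look like the sketch below as TRL reward functions, assuming plain-string completions (a conversational prompt format would add one unwrap step). The `PATTERN` regex and the partial `ACTION_NAMES` set are assumptions about the output grammar, not taken from the notebook:

```python
import re

# Assumed output grammar: an action name followed by three belief digits,
# e.g. "SLEEP 7 2 1". The notebook defines the real grammar.
PATTERN = re.compile(r"^\s*([A-Z_]+)\s+(\d)\s+(\d)\s+(\d)\s*$")

# Partial, assumed action vocabulary; the env defines 10 ActionType values,
# only three of which are named in this doc.
ACTION_NAMES = {"SLEEP", "MEDITATE", "SOCIALIZE"}

def format_valid(completions, **kwargs):
    """Layer 1: +1.0 for parseable output, -2.0 otherwise."""
    return [1.0 if PATTERN.match(c) else -2.0 for c in completions]

def action_legal(completions, **kwargs):
    """Layer 2: +0.5 if the parsed action is a legal ActionType, -1.0 otherwise."""
    out = []
    for c in completions:
        m = PATTERN.match(c)
        out.append(0.5 if m and m.group(1) in ACTION_NAMES else -1.0)
    return out
```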
`env_reward` uses seed-based episode replay: the dataset stores `seed`, `step_index`, and `action_history` alongside each prompt. The reward function reconstructs the exact episode state and steps the environment with the candidate action, so the reward cannot be fabricated.
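A sketch of that replay mechanism, reusing `PATTERN` and `ACTION_NAMES` from above and assuming a Gymnasium-style `RhythmEnv` with `reset(seed=...)` and a 5-tuple `step(...)`. TRL forwards extra dataset columns (`seed`, `action_history`, and so on) to reward functions as keyword arguments:

```python
def env_reward(completions, seed, action_history, **kwargs):
    """Layer 3: deterministic episode replay, then one real env step."""
    rewards = []
    for c, s, hist in zip(completions, seed, action_history):
        env = RhythmEnv()          # assumed env class name
        env.reset(seed=s)          # same seed -> same hidden profile + events
        for a in hist:             # fast-forward to the stored step
            env.step(a)
        m = PATTERN.match(c)
        if m is None or m.group(1) not in ACTION_NAMES:
            rewards.append(-3.0)   # can't step the env with an invalid action
            continue
        _, r, *_ = env.step(m.group(1))
        rewards.append(float(r))
    return rewards
```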
`belief_accuracy` is what the v2 grader was added to measure. The first five GRPO iterations under the v1 (three-layer) grader kept tying with the heuristic baseline because nothing rewarded the agent for *learning who it was helping*. Adding `belief_accuracy` makes that inference a first-class training signal. See [`iterations.md`](iterations.md) for the post-mortem.
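One plausible reading of that layer, treating the score as a thresholded MAE; the 0-9 digit scaling, the `hidden_profile_for_seed` helper, and the 0.25 threshold are all assumptions, and the table's actual Cosine-MAE blend lives in the notebook:

```python
import numpy as np

def belief_accuracy(completions, seed, **kwargs):
    """Layer 4: +0.5 if the emitted belief vector is close to the hidden profile."""
    rewards = []
    for c, s in zip(completions, seed):
        m = PATTERN.match(c)
        if m is None:
            rewards.append(-0.5)
            continue
        belief = np.array([int(d) for d in m.groups()[1:]]) / 9.0  # assumed 0-9 scaling
        profile = hidden_profile_for_seed(s)  # hypothetical lookup helper
        rewards.append(0.5 if np.abs(belief - profile).mean() < 0.25 else -0.5)
    return rewards
```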
---
## Key config choices

```python
GRPOConfig(
    beta=0.01,                 # KL penalty; the default 0.04 caused an explosion to kl=10731 at step 205
    max_completion_length=16,  # action names are <=15 chars; prevents verbose drift
    learning_rate=2e-4,
    num_generations=4,         # 4 candidates per prompt; enough variance for a GRPO signal
    max_steps=500,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)
```
`beta=0.01` is the critical fix from the first training run. The default value caused the policy to drift so far from the reference model that completion length jumped from 4 tokens to 368, saturating the max. `max_completion_length=16` provides a hard cap as a second safeguard.
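Wiring it all together is standard TRL; a sketch assuming the model, dataset, and the four reward functions shown above:

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    args=training_args,  # the GRPOConfig above
    train_dataset=dataset,
    reward_funcs=[format_valid, action_legal, env_reward, belief_accuracy],
)
trainer.train()
```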
---
## Dataset

Generated from 200 simulated episodes using a mixed strategy (heuristic + random actions) across all three profiles. Each sample is one step:

```python
{
    "prompt": [system_msg, user_observation],
    "seed": int,             # episode seed: deterministic profile + events
    "step_index": int,       # which step in the episode
    "action_history": list,  # actions taken before this step
}
```
The dataset gives the model exposure to all three profiles and a range of meter states. The mixed strategy (not pure heuristic) ensures the model sees suboptimal states and learns to recover from them.
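A sketch of that generation loop, under the same Gymnasium-style assumptions as above; `SYSTEM_MSG`, `render_observation`, and `heuristic_action` are hypothetical helpers, and the 0.3 exploration rate is illustrative:

```python
import random

def generate_dataset(num_episodes=200, eps=0.3):
    """Mixed strategy: follow the heuristic, but go random with probability eps."""
    samples = []
    for seed in range(num_episodes):
        env = RhythmEnv()
        obs, _ = env.reset(seed=seed)
        history, step, done = [], 0, False
        while not done:
            samples.append({
                "prompt": [SYSTEM_MSG, render_observation(obs)],
                "seed": seed,
                "step_index": step,
                "action_history": list(history),
            })
            action = (random.choice(sorted(ACTION_NAMES)) if random.random() < eps
                      else heuristic_action(obs))
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            history.append(action)
            step += 1
    return samples
```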
---
## Baselines

Established before training on 5 episodes × 3 profiles:

| Strategy | Introvert | Extrovert | Workaholic |
|---|---|---|---|
| Random | ~0.65 | ~0.70 | ~0.65 |
| Heuristic | ~0.78 | ~0.76 | ~0.82 |

The heuristic baseline uses observable rules only (sleep when vitality is low, meditate when serenity is low, socialise when connection drops). It cannot differentiate profiles.
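One possible body for the `heuristic_action` helper referenced earlier; the meter keys, thresholds, and fallback action are assumptions read off the description above:

```python
def heuristic_action(obs):
    """Observable rules only; identical behaviour for every profile."""
    if obs["vitality"] < 0.3:
        return "SLEEP"
    if obs["serenity"] < 0.3:
        return "MEDITATE"
    if obs["connection"] < 0.3:
        return "SOCIALIZE"
    return "WORK"  # assumed fallback when every meter is healthy
```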
A trained agent should beat the heuristic on at least 2 of 3 profiles, with qualitatively different action sequences per profile: the introvert's week should look nothing like the workaholic's.
---
## How to run

Open [`training/RhythmEnv_GRPO_Training.ipynb`](../training/RhythmEnv_GRPO_Training.ipynb) in Colab with a T4 GPU runtime.

Run cells in order:

1. Install dependencies
2. Clone repo from HF Space
3. Verify environment
4. Run baseline evaluation (saves `baseline_results`)
5. Generate dataset
6. Load model (Qwen 2.5-3B + LoRA)
7. Set up reward functions
8. Configure training (`beta=0.01`, `max_completion_length=16`)
9. Train (`trainer.train()`)
10. Save model
11. Generate training plots
12. Evaluate trained model
13. Generate comparison chart (`baseline_vs_trained.png`)
---
## Expected training behaviour

Healthy run: `completion_length` stays at 3-16 tokens throughout, KL stays below 1.0, and mean reward climbs from ~1.5 toward ~3.0 over 500 steps.

Warning signs: `completion_length` spiking above 50, `clipped_ratio` approaching 1.0, KL above 5.0. If any of these appear, the `beta=0.01` fix is not being applied.
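A quick post-hoc scan of `trainer.state.log_history` can flag these; the metric key names (`completion_length`, `kl`, `clipped_ratio`) vary across TRL versions, so verify them against your own logs:

```python
def check_run_health(log_history):
    """Flag the warning signs above in a finished run's logs."""
    for entry in log_history:
        step = entry.get("step")
        if entry.get("completion_length", 0) > 50:
            print(f"step {step}: completion_length spiking - check beta")
        if entry.get("kl", 0) > 5.0:
            print(f"step {step}: KL above 5.0 - policy drifting from reference")
        if entry.get("clipped_ratio", 0) > 0.9:
            print(f"step {step}: clipped_ratio approaching 1.0")

check_run_health(trainer.state.log_history)
```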
---
## Output artifacts

After a successful run, download these and commit them to the repo:

```
plots/training_loss.png        - loss curve across 500 steps
plots/reward_curve.png         - mean reward with ±1 std band
plots/baseline_vs_trained.png  - comparison bar chart (random / heuristic / trained)
plots/eval_results.json        - raw per-episode scores
```
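`reward_curve.png` can be produced straight from the trainer logs; a sketch assuming TRL's usual `reward` / `reward_std` log keys:

```python
import matplotlib.pyplot as plt

history = [e for e in trainer.state.log_history if "reward" in e]
steps = [e["step"] for e in history]
mean = [e["reward"] for e in history]
std = [e.get("reward_std", 0.0) for e in history]

plt.plot(steps, mean, label="mean reward")
plt.fill_between(steps,
                 [m - s for m, s in zip(mean, std)],
                 [m + s for m, s in zip(mean, std)],
                 alpha=0.3, label="±1 std")
plt.xlabel("step")
plt.ylabel("reward")
plt.legend()
plt.savefig("plots/reward_curve.png", dpi=150)
```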