Spaces: InosLihka/rhythm_env
InosLihka Claude Sonnet 4.6 committed 13 days ago

Commit c67f463 (1 parent: 92808b9)

Add Run 3 training results: plots, training log, README update

- plots/: training_loss.png, reward_curve.png, baseline_vs_trained.png from 500-step GRPO run
- training_log.json: per-step metrics (loss, KL, rewards) for full run 3
- README: trained scores filled in (Introvert 0.617, Extrovert 0.725, Workaholic 0.539)
- README: added analysis of what agent learned vs. what still needs training (workaholic DEEP_WORK modifier discovery gap)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (2)

README.md +8 -6
training_log.json +0 -0

README.md CHANGED

@@ -166,13 +166,15 @@ Training uses **GRPO (Group Relative Policy Optimization)** with [Unsloth](https
 ![Baseline vs Trained](plots/baseline_vs_trained.png)
-| Profile | Random | Heuristic | Trained | Improvement |
-|---------|--------|-----------|---------|-------------|
-| Introvert Morning | 0.693 | 0.765 | — | — |
-| Extrovert Night Owl | 0.863 | 0.819 | — | — |
-| Workaholic Stoic | 0.602 | 0.761 | — | — |
-*Trained results pending current Colab run. A trained agent that discovers the hidden personality adapts its full weekly strategy per-profile — the introvert's week looks nothing like the extrovert's.*
+| Profile | Random | Heuristic | Trained (500 steps) | vs Heuristic |
+|---------|--------|-----------|---------------------|--------------|
+| Introvert Morning | 0.693 | 0.765 | **0.617** | -0.148 |
+| Extrovert Night Owl | 0.863 | 0.819 | **0.725** | -0.094 |
+| Workaholic Stoic | 0.602 | 0.761 | **0.539** | -0.222 |
+**What the agent learned**: Serenity and Connection management improved substantially — the introvert's serenity maxed at 1.0, the extrovert's connection reached 1.0. The model correctly learned the *sign* of personality differences (introvert avoids socialising; extrovert embraces it).
+**What still needs training**: The workaholic profile's key hidden modifier — that `DEEP_WORK` *recovers* vitality instead of draining it (+0.06 instead of -0.05) — was not discovered in 500 steps. Without this insight, the trained agent under-uses deep work for the workaholic. A longer run (1000+ steps) with higher DEEP_WORK sampling density in the workaholic training episodes is the clear next step.
 ## Setup Instructions
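
Reviewer note: the `vs Heuristic` column added in this hunk can be rechecked with a few lines of Python. The sketch below copies the scores from the README table and recomputes the deltas. The names `scores`, `deltas`, and `table` are illustrative and do not come from the repository. The extra "vs random" figure is not in the README; it is derived from the same table and shows that the 500-step model also scores below the random baseline on all three profiles.

```python
# Scores copied from the README table in this commit.
scores = {
    "Introvert Morning": {"random": 0.693, "heuristic": 0.765, "trained": 0.617},
    "Extrovert Night Owl": {"random": 0.863, "heuristic": 0.819, "trained": 0.725},
    "Workaholic Stoic": {"random": 0.602, "heuristic": 0.761, "trained": 0.539},
}


def deltas(row):
    """Return (trained - heuristic, trained - random), rounded to 3 decimals."""
    return (
        round(row["trained"] - row["heuristic"], 3),
        round(row["trained"] - row["random"], 3),
    )


# Hypothetical helper table: profile name -> (vs heuristic, vs random).
table = {profile: deltas(row) for profile, row in scores.items()}

for profile, (vs_heuristic, vs_random) in table.items():
    print(f"{profile:<20} vs heuristic {vs_heuristic:+.3f}   vs random {vs_random:+.3f}")
```

The recomputed "vs heuristic" values match the committed column (-0.148, -0.094, -0.222).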

training_log.json ADDED

The diff for this file is too large to render.
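
Reviewer note: since the log is not rendered here, a quick way to inspect it locally might look like the sketch below. The schema is an assumption: the commit message only says the file holds per-step metrics (loss, KL, rewards), so the code guesses at a JSON list of per-step objects with a `reward` field. The function names `load_log` and `reward_trend` and the `key` and `window` parameters are hypothetical; adjust them to the actual file layout.

```python
import json
from statistics import mean


def load_log(path="training_log.json"):
    """Load the training log.

    ASSUMPTION: the file is a JSON list of per-step dicts. The real
    schema is not visible in this commit view.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def reward_trend(records, window=50, key="reward"):
    """Compare the mean of `key` over the first and last `window` steps.

    ASSUMPTION: each record carries a numeric field named by `key`.
    """
    if not records:
        raise ValueError("empty log")
    window = min(window, len(records))
    first = mean(r[key] for r in records[:window])
    last = mean(r[key] for r in records[-window:])
    return {"first": first, "last": last, "change": last - first}


# Example usage (requires the file from this commit):
#   trend = reward_trend(load_log(), window=50)
#   print(trend)
```

Comparing window means instead of single steps is a deliberate choice, because per-step GRPO rewards tend to be noisy and one step says little about the trend.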