Add Run 3 training results: plots, training log, README update
- plots/: training_loss.png, reward_curve.png, baseline_vs_trained.png from 500-step GRPO run
- training_log.json: per-step metrics (loss, KL, rewards) for full run 3
- README: trained scores filled in (Introvert 0.617, Extrovert 0.725, Workaholic 0.539)
- README: added analysis of what agent learned vs. what still needs training (workaholic DEEP_WORK modifier discovery gap)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- README.md +8 -6
- training_log.json +0 -0
README.md
CHANGED
@@ -166,13 +166,15 @@ Training uses **GRPO (Group Relative Policy Optimization)** with [Unsloth](https
 
 
 
-| Profile | Random | Heuristic | Trained |
-|---------|--------|-----------|---------|-------------|
-| Introvert Morning | 0.693 | 0.765 |
-| Extrovert Night Owl | 0.863 | 0.819 |
-| Workaholic Stoic | 0.602 | 0.761 |
+| Profile | Random | Heuristic | Trained (500 steps) | vs Heuristic |
+|---------|--------|-----------|---------------------|--------------|
+| Introvert Morning | 0.693 | 0.765 | **0.617** | -0.148 |
+| Extrovert Night Owl | 0.863 | 0.819 | **0.725** | -0.094 |
+| Workaholic Stoic | 0.602 | 0.761 | **0.539** | -0.222 |
 
-*
+**What the agent learned**: Serenity and Connection management improved substantially: the introvert's serenity maxed at 1.0, the extrovert's connection reached 1.0. The model correctly learned the *sign* of personality differences (introvert avoids socialising; extrovert embraces it).
+
+**What still needs training**: The workaholic profile's key hidden modifier, that `DEEP_WORK` *recovers* vitality instead of draining it (+0.06 instead of -0.05), was not discovered in 500 steps. Without this insight, the trained agent under-uses deep work for the workaholic. A longer run (1000+ steps) with higher DEEP_WORK sampling density in the workaholic training episodes is the clear next step.
 
 ## Setup Instructions
 
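The "vs Heuristic" column added in this hunk is the trained score minus the heuristic score. A minimal Python sketch for rechecking that arithmetic, with the values copied from the table above; the names `SCORES` and `delta_vs_heuristic` are illustrative and do not come from the repository:

```python
# Scores copied from the README table in this diff:
# (random, heuristic, trained after 500 steps).
SCORES = {
    "Introvert Morning": (0.693, 0.765, 0.617),
    "Extrovert Night Owl": (0.863, 0.819, 0.725),
    "Workaholic Stoic": (0.602, 0.761, 0.539),
}


def delta_vs_heuristic(scores):
    """Return trained minus heuristic per profile, rounded to 3 decimals."""
    return {
        profile: round(trained - heuristic, 3)
        for profile, (_random, heuristic, trained) in scores.items()
    }


if __name__ == "__main__":
    for profile, delta in delta_vs_heuristic(SCORES).items():
        print(f"{profile}: {delta:+.3f}")
```

Rounding to three decimals matches the precision used in the table and hides binary floating-point noise in the subtraction.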
training_log.json
ADDED
The diff for this file is too large to render.
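Since the log itself is not rendered here, the following is only a sketch of how its per-step metrics (loss, KL, rewards, per the commit message) might be smoothed for inspection. The schema is an assumption: a JSON list of objects with a hypothetical `reward` key. Adjust the key name and structure to whatever the actual file contains.

```python
import json
from pathlib import Path


def load_log(path):
    """Load the per-step log. Assumes a JSON list of dicts (unverified schema)."""
    return json.loads(Path(path).read_text())


def moving_average(values, window):
    """Trailing mean over at most `window` most recent values at each step."""
    if window < 1:
        raise ValueError("window must be >= 1")
    out = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def smoothed_metric(records, key="reward", window=25):
    """Smooth one metric across steps. `key` is a hypothetical field name."""
    return moving_average([float(r[key]) for r in records], window)
```

For example, `smoothed_metric(load_log("training_log.json"), key="reward")` would give a smoothed reward series, provided the file matches the assumed schema.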