InosLihka Claude Sonnet 4.6 committed on
Commit c67f463 · 1 Parent(s): 92808b9

Add Run 3 training results: plots, training log, README update

- plots/: training_loss.png, reward_curve.png, baseline_vs_trained.png from 500-step GRPO run
- training_log.json: per-step metrics (loss, KL, rewards) for full run 3
- README: trained scores filled in (Introvert 0.617, Extrovert 0.725, Workaholic 0.539)
- README: added analysis of what agent learned vs. what still needs training (workaholic DEEP_WORK modifier discovery gap)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (2)
  1. README.md +8 -6
  2. training_log.json +0 -0
README.md CHANGED
@@ -166,13 +166,15 @@ Training uses **GRPO (Group Relative Policy Optimization)** with [Unsloth](https

  ![Baseline vs Trained](plots/baseline_vs_trained.png)

- | Profile | Random | Heuristic | Trained | Improvement |
- |---------|--------|-----------|---------|-------------|
- | Introvert Morning | 0.693 | 0.765 | - | - |
- | Extrovert Night Owl | 0.863 | 0.819 | - | - |
- | Workaholic Stoic | 0.602 | 0.761 | - | - |
+ | Profile | Random | Heuristic | Trained (500 steps) | vs Heuristic |
+ |---------|--------|-----------|---------------------|--------------|
+ | Introvert Morning | 0.693 | 0.765 | **0.617** | -0.148 |
+ | Extrovert Night Owl | 0.863 | 0.819 | **0.725** | -0.094 |
+ | Workaholic Stoic | 0.602 | 0.761 | **0.539** | -0.222 |

- *Trained results pending current Colab run. A trained agent that discovers the hidden personality adapts its full weekly strategy per-profile: the introvert's week looks nothing like the extrovert's.*
+ **What the agent learned**: Serenity and Connection management improved substantially: the introvert's serenity maxed at 1.0, the extrovert's connection reached 1.0. The model correctly learned the *sign* of personality differences (introvert avoids socialising; extrovert embraces it).
+
+ **What still needs training**: The workaholic profile's key hidden modifier, that `DEEP_WORK` *recovers* vitality instead of draining it (+0.06 instead of -0.05), was not discovered in 500 steps. Without this insight, the trained agent under-uses deep work for the workaholic. A longer run (1000+ steps) with higher `DEEP_WORK` sampling density in the workaholic training episodes is the clear next step.

  ## Setup Instructions

training_log.json ADDED
The diff for this file is too large to render. See raw diff
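
The "vs Heuristic" column added to the README table in this commit can be checked by hand. The sketch below recomputes it from the three score columns and, as an extra comparison that the README does not include, also computes the gap to the Random baseline. The dictionary layout and the names `SCORES` and `deltas` are illustrative assumptions, not part of the repository; only the numbers are taken from the table.

```python
# Scores copied from the README table in this commit.
# The dict layout and function name are hypothetical, for illustration only.
SCORES = {
    "Introvert Morning":   {"random": 0.693, "heuristic": 0.765, "trained": 0.617},
    "Extrovert Night Owl": {"random": 0.863, "heuristic": 0.819, "trained": 0.725},
    "Workaholic Stoic":    {"random": 0.602, "heuristic": 0.761, "trained": 0.539},
}


def deltas(scores):
    """Return trained-minus-baseline gaps per profile, rounded to 3 decimals."""
    return {
        profile: {
            "vs_heuristic": round(s["trained"] - s["heuristic"], 3),
            "vs_random": round(s["trained"] - s["random"], 3),
        }
        for profile, s in scores.items()
    }


if __name__ == "__main__":
    for profile, d in deltas(SCORES).items():
        print(
            f"{profile:<20} "
            f"vs heuristic {d['vs_heuristic']:+.3f}   "
            f"vs random {d['vs_random']:+.3f}"
        )
```

With the numbers in the table, the `vs_heuristic` values match the committed column, and every gap is negative against the Random baseline as well as the Heuristic one, which is worth keeping in mind when reading the "What the agent learned" paragraph.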