InosLihka Claude Sonnet 4.6 committed on
Commit
0bdfeaa
·
1 Parent(s): cc6473a

fix: reduce kl_coef to prevent training instability


KL divergence exploded to 10731 at step 205 in the first run, causing
the policy to drift and generate verbose 368-token outputs for the rest
of training. Two fixes: kl_coef=0.01 (from default 0.1) and
max_completion_length=16 (action names are ≤15 chars, no need for more).

Also moves blog_post.md into docs/ for cleaner repo layout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs/blog_post.md ADDED
@@ -0,0 +1,110 @@
+ # Teaching an AI to Know You (Without Asking)
+
+ Ask someone how they'd build a personal AI assistant, and they'll say: give it a personality quiz. A preferences form. Maybe a settings page where you pick "introvert" or "morning person" from a dropdown.
+
+ Sounds reasonable. But it's the wrong approach entirely.
+
+ Think about the people who actually know you well -- a close friend, a partner, a sibling. None of them sat you down with a questionnaire. They figured you out by *watching*. They noticed that you get irritable after too many social events. That you do your best thinking before noon. That skipping exercise makes you anxious by Wednesday.
+
+ They learned your hidden patterns through trial, error, and feedback. What if we could train an AI the same way?
+
+ ## The Trait Decomposition
+
+ Here's how I think about modeling humans. Start with **traits** -- atomic behavioral properties that describe how a person responds to activities:
+
+ - **Social energy cost** -- how much does socializing drain you physically?
+ - **Cognitive peak time** -- when does your brain work best? Morning? Evening?
+ - **Guilt sensitivity** -- does leisure make you feel guilty, or recharge you?
+ - **Work-peace coupling** -- does productivity give you calm, or just tire you out?
+ - **Stress tolerance** -- how far can your serenity drop before everything spirals?
+ - **Metabolic rate** -- how fast do you burn baseline energy?
+ - **Event resilience** -- how much does unexpected chaos throw you off?
+ - **Solo recharge** -- does alone time restore your inner peace?
+ - **Social warmth** -- does socializing give you serenity, or drain it?
+
+ No single trait defines a person. It's the *combination* that creates a personality. An introvert morning person has high social energy cost, an early cognitive peak, high guilt sensitivity, and strong solo recharge. An extrovert night owl has the opposite: low social cost, a late cognitive peak, and social time that actually gives them serenity.
+
+ Same list of traits. Different values. Different person.
+
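A minimal sketch of this decomposition as a flat record. Field names and trait values here are hypothetical illustrations, not the actual RhythmEnv internals:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Traits:
    """Hypothetical trait record -- names and values are illustrative."""
    social_energy_cost: float  # multiplier on vitality drain while socializing
    cognitive_peak: str        # "morning" or "evening"
    guilt_sensitivity: float
    work_peace_coupling: float
    stress_tolerance: float
    metabolic_rate: float
    event_resilience: float
    solo_recharge: float
    social_warmth: float       # negative: socializing drains serenity

# Same fields, different values -> a different person.
INTROVERT = Traits(3.0, "morning", 0.8, 0.5, 0.4, 1.0, 0.3, 0.9, -0.2)
EXTROVERT = Traits(0.5, "evening", 0.2, 0.3, 0.7, 1.0, 0.8, 0.1, 0.9)
```

The point of the flat record is that a "personality" is nothing more than one assignment of values to the shared trait schema.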
+ ## The "Good Day" Definition
+
+ Here's the part that took me a while to articulate: two people can do the exact same activities and have completely different days. Not because the activities are different, but because they *value different outcomes*.
+
+ In RhythmEnv, each person has hidden **reward weights** -- a definition of what a good day means to them:
+
+ - The introvert values **serenity** above all (60% weight). A day where they maintained inner peace and made some progress is a great day. Connection barely registers (10%).
+ - The extrovert values **connection** above all (75% weight). A day full of meaningful social interactions is a great day, even if they didn't get much work done.
+ - The workaholic values **progress** above all (70% weight). A day of deep productive work is a great day. Everything else is secondary.
+
+ The agent sees the same 5 meters. Takes the same 10 actions. Gets a scalar reward. But the reward is secretly computed using these hidden weights. Same action, same meter changes, completely different reward signal depending on who you're helping.
+
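The hidden weighting could look something like this sketch. The headline percentages (introvert 60% serenity / 10% connection, extrovert 75% connection, workaholic 70% progress) come from the list above; the remaining weights are invented fillers chosen so each profile sums to 1.0:

```python
# Hidden per-person reward weights over the five meters.
# Headline numbers from the post; filler weights are invented.
REWARD_WEIGHTS = {
    "introvert":  {"vitality": 0.05, "cognition": 0.05, "progress": 0.20,
                   "serenity": 0.60, "connection": 0.10},
    "extrovert":  {"vitality": 0.05, "cognition": 0.05, "progress": 0.10,
                   "serenity": 0.05, "connection": 0.75},
    "workaholic": {"vitality": 0.05, "cognition": 0.10, "progress": 0.70,
                   "serenity": 0.10, "connection": 0.05},
}

def reward(profile: str, deltas: dict) -> float:
    """Scalar reward = hidden weighted sum of this step's meter changes."""
    return sum(w * deltas.get(meter, 0.0)
               for meter, w in REWARD_WEIGHTS[profile].items())

# Identical meter changes, different people, different reward signal.
# A meditation-like step: big serenity gain, small connection dip --
# clearly positive for the introvert, net-negative for the extrovert.
same_step = {"serenity": 0.3, "progress": 0.1, "vitality": -0.1, "connection": -0.05}
```

With these numbers, `reward("introvert", same_step)` comes out well above `reward("extrovert", same_step)`, which is the whole trick: the observation is shared, the valuation is not.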
+ ## The Environment
+
+ RhythmEnv simulates one week in a person's life. 7 days, 4 time slots each, 28 decisions total. Each decision picks an activity: deep work, exercise, sleep, meditation, family time, socializing, and so on. Ten options.
+
+ Five life meters track the person's state -- picture them like fuel gauges on a dashboard:
+
+ - **Vitality** -- physical energy. Sleep and exercise fill it up. Work drains it.
+ - **Cognition** -- mental sharpness. Peaks in the morning for some, in the evening for others.
+ - **Progress** -- career momentum. Only goes up when you work.
+ - **Serenity** -- inner calm. Meditation and downtime help. Overwork kills it.
+ - **Connection** -- relationship health. Decays on its own every time slot. If you don't actively socialize, it quietly drops.
+
+ After every action, meters shift. The agent sees the new meter values and gets a reward. That reward is the hidden weighted sum of what changed -- and the weights are different for every person type.
+
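A toy version of one time-slot update. The effect magnitudes are invented; only the shape of the loop (apply the action's effects, clamp to the gauge range, decay connection every slot unless you socialized) follows the description above:

```python
# Invented per-action meter effects -- illustrative, not RhythmEnv's real table.
EFFECTS = {
    "DEEP_WORK": {"progress": +0.15, "vitality": -0.10, "serenity": -0.05},
    "SLEEP":     {"vitality": +0.30},
    "SOCIALIZE": {"connection": +0.20, "vitality": -0.05},
}
CONNECTION_DECAY = 0.03  # connection quietly drops every time slot

def step(meters: dict, action: str) -> dict:
    """Apply one activity's effects, clamping each gauge to [0, 1]."""
    new = dict(meters)
    for meter, delta in EFFECTS.get(action, {}).items():
        new[meter] = min(1.0, max(0.0, new[meter] + delta))
    if action != "SOCIALIZE":  # decay unless actively socializing
        new["connection"] = max(0.0, new["connection"] - CONNECTION_DECAY)
    return new
```

Running 28 of these steps back to back is the whole week; the agent only ever sees the meter values this function returns, never the effect table behind it.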
+ ## Why Identical Actions Produce Different Results
+
+ The trait modifiers change how actions physically affect the person, not just how rewards are computed. Here's what I mean:
+
+ Tell the introvert to socialize: their vitality drops 3x faster than normal. Their body physically rejects it. Tell the extrovert the same thing: barely any drain. They could socialize all day.
+
+ Tell the introvert to meditate: they get a bonus +0.10 serenity on top of the base effect. Alone time is their recharge mechanism. Tell the workaholic the same thing: their serenity *drops* by 0.10, because idle activities make them anxious.
+
+ Tell the workaholic to do deep work: they get +0.06 vitality recovery -- productive work energizes them. Tell the introvert to do deep work in the morning: their progress and cognition gains are doubled. Same action, different physiological response.
+
+ These aren't arbitrary. They're modeled after real behavioral patterns. The introvert's social drain, the workaholic's anxiety from idleness, the night owl's morning penalty -- these are things people recognize in themselves but rarely articulate.
+
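These modifiers can be sketched as a post-processing pass over an action's base effects. The 3x social drain, the +/-0.10 serenity shifts, the +0.06 vitality bonus, and the morning doubling come from the text; the function name, signature, and profile strings are my own illustration:

```python
def apply_modifiers(profile: str, action: str, slot: str, effects: dict) -> dict:
    """Reshape an action's base meter effects according to the hidden profile."""
    effects = dict(effects)
    if action == "SOCIALIZE" and profile == "introvert":
        # Socializing drains the introvert's vitality 3x faster than normal.
        effects["vitality"] = effects.get("vitality", 0.0) * 3
    if action == "MEDITATE":
        if profile == "introvert":    # alone time is their recharge mechanism
            effects["serenity"] = effects.get("serenity", 0.0) + 0.10
        elif profile == "workaholic":  # idle activities make them anxious
            effects["serenity"] = effects.get("serenity", 0.0) - 0.10
    if action == "DEEP_WORK":
        if profile == "workaholic":   # productive work energizes them
            effects["vitality"] = effects.get("vitality", 0.0) + 0.06
        if profile == "introvert" and slot == "morning":
            # Early cognitive peak: doubled progress and cognition gains.
            effects["progress"] = effects.get("progress", 0.0) * 2
            effects["cognition"] = effects.get("cognition", 0.0) * 2
    return effects
```

Because the pass mutates the *effects*, not the reward, two profiles diverge physically even before their hidden reward weights are applied.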
+ ## What the Agent Must Figure Out
+
+ The agent sees meters, time of day, and reward. It doesn't see:
+
+ - Which person type it's helping
+ - The trait values
+ - The reward weight decomposition
+
+ After a few actions, the patterns start showing. "I socialized and my vitality crashed -- this person drains from socializing." "I meditated and got a huge reward -- serenity must be heavily weighted." "Deep work in the morning gave double progress -- this person peaks early."
+
+ The trained agent should probe early, infer the person type by observing unexpected meter changes, then adapt its strategy for the rest of the week. An agent that discovers it's helping an introvert should meditate more and socialize less. An agent that discovers it's helping a workaholic should maximize productive hours and minimize idle time.
+
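A rule-based caricature of that probe-and-infer step. The trained model learns this implicitly from reward alone; the explicit rules and thresholds here are invented purely to make the inference pattern concrete:

```python
def guess_profile(observation: dict) -> str:
    """Toy inference from a single probe step. `observation` holds the
    action taken, the vitality change it caused, and the reward received."""
    action = observation["action"]
    dv = observation["vitality_delta"]
    r = observation["reward"]
    if action == "SOCIALIZE" and dv < -0.2:
        return "introvert"    # vitality crashed -> heavy social drain
    if action == "MEDITATE" and r > 1.0:
        return "introvert"    # huge reward -> serenity heavily weighted
    if action == "DEEP_WORK" and r > 1.0:
        return "workaholic"   # deep work pays off disproportionately
    if action == "SOCIALIZE" and r > 1.0:
        return "extrovert"    # connection heavily weighted
    return "unknown"          # keep probing
```

The interesting part of training is that nothing like `guess_profile` exists in the agent: the policy has to fold this inference into its action choices.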
+ ## The Training Pipeline
+
+ Training uses GRPO -- Group Relative Policy Optimization. For each game state, generate multiple candidate actions, score them all against the environment, update the model to prefer the ones that scored higher. The environment *is* the critic.
+
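Not the TRL internals -- just the group-relative idea in isolation, as a sketch: each candidate's advantage is its reward relative to the group's own mean and standard deviation, so the group replaces a learned critic:

```python
def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """GRPO's core trick: advantage = (reward - group mean) / group std.
    No value network -- the group of candidates is its own baseline."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Candidates scoring above the group mean get positive advantages (their log-probabilities are pushed up), those below get negative ones, and the advantages always center on zero.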
+ The per-step reward signal is strongly differentiated by profile. At the same starting state (morning, all meters at 0.7), the best action is completely different:
+
+ | Profile | Best Action | Reward | Worst Action | Reward |
+ |---------|-------------|--------|--------------|--------|
+ | Introvert | MEDITATE | +1.76 | SOCIALIZE | +0.03 |
+ | Extrovert | FAMILY_TIME | +2.63 | ME_TIME | -0.42 |
+ | Workaholic | DEEP_WORK | +1.57 | ME_TIME | -0.27 |
+
+ The model is Qwen 2.5-3B with 4-bit quantization and LoRA -- small enough to train on a free Colab T4.
+
+ ## What I'm Hoping To See
+
+ The heuristic baseline (fixed rules, no profile adaptation) scores around 0.76-0.82 depending on the profile. It treats everyone the same -- sleep at night, work in the morning, socialize when connection drops. It works despite the hidden dynamics, not because it understands them.
+
+ A trained agent that discovers the hidden personality should adapt its behavior per profile. The dream is not just higher scores -- it's qualitatively different action sequences for different people. The introvert's week should look nothing like the extrovert's week.
+
+ No questionnaire. No settings page. Just attention, inference, and adjustment.
+
+ That's what personal AI should feel like.
+
+ ---
+
+ **Links:**
+ - [Live Environment on HF Spaces](https://huggingface.co/spaces/InosLihka/rhythm_env)
+ - [Training Notebook (Colab)](training/RhythmEnv_GRPO_Training.ipynb)
+ - [Source Code & README](https://huggingface.co/spaces/InosLihka/rhythm_env)
+
+ *Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.*
training/RhythmEnv_GRPO_Training.ipynb CHANGED
@@ -224,46 +224,7 @@
  "execution_count": null,
  "metadata": {},
  "outputs": [],
- "source": [
- "from trl import GRPOConfig, GRPOTrainer\n",
- "\n",
- "MAX_STEPS = 500 # Increase to 1000 if time allows\n",
- "NUM_GENERATIONS = 4\n",
- "LEARNING_RATE = 2e-4\n",
- "\n",
- "max_prompt_length = 400\n",
- "max_completion_length = MAX_SEQ_LENGTH - max_prompt_length\n",
- "\n",
- "training_args = GRPOConfig(\n",
- " temperature=1.0,\n",
- " learning_rate=LEARNING_RATE,\n",
- " weight_decay=0.001,\n",
- " warmup_ratio=0.1,\n",
- " lr_scheduler_type=\"linear\",\n",
- " optim=\"adamw_8bit\",\n",
- " logging_steps=1,\n",
- " per_device_train_batch_size=1,\n",
- " gradient_accumulation_steps=4,\n",
- " num_generations=NUM_GENERATIONS,\n",
- " max_prompt_length=max_prompt_length,\n",
- " max_completion_length=max_completion_length,\n",
- " max_steps=MAX_STEPS,\n",
- " save_steps=100,\n",
- " report_to=REPORT_TO,\n",
- " output_dir=\"outputs/rhythmenv_trained\",\n",
- ")\n",
- "\n",
- "trainer = GRPOTrainer(\n",
- " model=model,\n",
- " processing_class=tokenizer,\n",
- " reward_funcs=reward_funcs,\n",
- " args=training_args,\n",
- " train_dataset=dataset,\n",
- ")\n",
- "\n",
- "print(f\"Training config: {MAX_STEPS} steps, {NUM_GENERATIONS} generations, lr={LEARNING_RATE}\")\n",
- "print(\"Starting training...\")"
- ]
  },
  {
  "cell_type": "code",
 
  "execution_count": null,
  "metadata": {},
  "outputs": [],
+ "source": [
+ "from trl import GRPOConfig, GRPOTrainer\n",
+ "\n",
+ "MAX_STEPS = 500 # Increase to 1000 if time allows\n",
+ "NUM_GENERATIONS = 4\n",
+ "LEARNING_RATE = 2e-4\n",
+ "\n",
+ "max_prompt_length = 400\n",
+ "max_completion_length = 16 # Action names are 3-15 chars — no need for more\n",
+ "\n",
+ "training_args = GRPOConfig(\n",
+ " temperature=1.0,\n",
+ " learning_rate=LEARNING_RATE,\n",
+ " kl_coef=0.01, # Default 0.1 caused KL explosion at step 205; 0.01 keeps drift in check\n",
+ " weight_decay=0.001,\n",
+ " warmup_ratio=0.1,\n",
+ " lr_scheduler_type=\"linear\",\n",
+ " optim=\"adamw_8bit\",\n",
+ " logging_steps=1,\n",
+ " per_device_train_batch_size=1,\n",
+ " gradient_accumulation_steps=4,\n",
+ " num_generations=NUM_GENERATIONS,\n",
+ " max_prompt_length=max_prompt_length,\n",
+ " max_completion_length=max_completion_length,\n",
+ " max_steps=MAX_STEPS,\n",
+ " save_steps=100,\n",
+ " report_to=REPORT_TO,\n",
+ " output_dir=\"outputs/rhythmenv_trained\",\n",
+ ")\n",
+ "\n",
+ "trainer = GRPOTrainer(\n",
+ " model=model,\n",
+ " processing_class=tokenizer,\n",
+ " reward_funcs=reward_funcs,\n",
+ " args=training_args,\n",
+ " train_dataset=dataset,\n",
+ ")\n",
+ "\n",
+ "print(f\"Training config: {MAX_STEPS} steps, {NUM_GENERATIONS} generations, lr={LEARNING_RATE}\")\n",
+ "print(f\" kl_coef=0.01 (reduced from default 0.1 to prevent KL explosion)\")\n",
+ "print(f\" max_completion_length=16 (action names only, no verbose outputs)\")\n",
+ "print(\"Starting training...\")"
+ ]
  },
  {
  "cell_type": "code",