InosLihka Claude Opus 4.7 (1M context) committed on
Commit 64d24b3
Parent: e21a960

iter3: align reward with grader + belief-first format + exploration shaping


ITER 2 FAILURE: the agent escaped single-action collapse only to settle into a
2-cycle loop (MEDITATE/EXERCISE alternation). final_score 0.22, worse than random.
Belief learning (+0.36) works but doesn't influence action choice: the belief was
emitted as a string AFTER the action, so the action couldn't condition on it.

Five runtime-level fixes for iter 3:

[1] Per-step reward grader-alignment (server/rhythm_environment.py:_compute_reward)
Add a profile-INDEPENDENT bias: +0.5*progress_delta + 0.4*connection_delta.
The profile-weighted reward still drives belief inference (it varies by profile),
but the agent is now ALWAYS penalized for ignoring progress and connection.
Closes the proxy/goal misalignment that let the M+E alternation win.
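A minimal sketch of the shaping term (it mirrors the _compute_reward change in the
diff below; the delta values here are invented for illustration, not from a run):

    # illustrative deltas only; real ones come from the env's meter updates
    deltas = {"progress": 0.10, "connection": 0.05}
    grader_bias = 0.5 * deltas["progress"] + 0.4 * deltas["connection"]  # = 0.07
    # added on top of the profile-weighted reward, whichever profile was sampled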

[2] Belief-first output format (training/reward_functions.py, dataset.py, inference.py)
Was: 'ACTION_NAME S M W' (belief was a post-hoc afterthought and didn't condition the action)
Now: 'S M W ACTION_NAME' (the causal LM generates the belief tokens FIRST, then the
action conditions on those belief tokens via attention).
The parser accepts both formats for backward compat. The system prompt is updated to
explain the belief-action coupling so the model learns to use the format.
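Expected parser behavior after this change, per the extract_action_and_belief docstring
in the diff below (belief digits map to d/9.0; values shown rounded):

    extract_action_and_belief("3 8 7 DEEP_WORK")  # -> action=DEEP_WORK, belief=[0.33, 0.89, 0.78], True
    extract_action_and_belief("DEEP_WORK 3 8 7")  # -> same result via the legacy after-action fallback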

[3] N-cycle penalty (training/reward_functions.py:env_reward)
Iter 2 had a 3-in-a-row penalty, but the agent found a 2-cycle loophole.
New rule: if the last 6 actions have <=2 unique values, -0.4. Closes the M-E
alternation specifically, plus any loop that cycles through at most two actions.
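The check itself, pulled out of env_reward and run on an invented iter-2-style history:

    history = ["MEDITATE", "EXERCISE", "MEDITATE", "EXERCISE", "MEDITATE"]
    chosen = "EXERCISE"
    last6 = history[-5:] + [chosen]
    penalized = len(set(last6)) <= 2   # True: only {MEDITATE, EXERCISE} in the window, so -0.4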

[4] New-action exploration bonus (training/reward_functions.py:env_reward)
+0.2 reward for taking an action that hasn't yet appeared in the current
episode (until 6 unique actions have been tried). Pushes the agent to PROBE
early in episodes, the canonical meta-RL exploration signal.
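Standalone version of the bonus rule, with an invented episode prefix:

    history = ["SLEEP", "DEEP_WORK"]
    chosen = "SOCIALIZE"
    seen = set(history)
    bonus = 0.2 if chosen not in seen and len(seen) < 6 else 0.0   # 0.2: first SOCIALIZE this episode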

[5] Sparse terminal reward (server/rhythm_environment.py:step)
At step 28, add (final_score - 0.5) * 5 to the per-step reward. Direct
supervision on the actual grader. The [-2.5, +2.5] range gives a strong
end-of-episode signal that overwhelms any local reward hack.
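For scale, plugging in iter 2's own grade versus a good episode:

    terminal_bonus = (0.22 - 0.5) * 5.0   # -1.4 (iter 2's actual final_score)
    terminal_bonus = (0.90 - 0.5) * 5.0   # +2.0 (a strong episode)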

Plus training config bumps:
- MAX_STEPS 400 -> 800 (iter 2 was still rising at step 400)
- NUM_GENERATIONS 4 -> 8 (lower advantage variance)
- LORA_RANK 8 -> 16 (more capacity for policy + belief integration)
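These are the FAST_MODE preset values; the diff shows the existing env-var override
pattern for MAX_STEPS (the other knobs presumably read the same way), so individual
runs can still bump them further:

    import os
    DEFAULTS = dict(MAX_STEPS=800, NUM_GENERATIONS=8, LORA_RANK=16)  # iter-3 FAST_MODE values
    MAX_STEPS = int(os.environ.get("MAX_STEPS", str(DEFAULTS["MAX_STEPS"])))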

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

inference.py CHANGED
@@ -80,24 +80,31 @@ people differently — you must INFER who you're helping from the rewards and
 meter changes you observe.
 
 Each step, output ONE LINE in this exact format:
-  ACTION_NAME S M W
+  S M W ACTION_NAME
 
-where ACTION_NAME is one of:
+First write your BELIEF as 3 digits 0-9, then the ACTION that fits:
+  S = social preference (0=hates social, 9=loves social)
+  M = morning preference (0=night owl, 9=morning person)
+  W = work preference (0=avoids work, 9=workaholic)
+
+ACTION choices:
   DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,
   FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH
 
-and S, M, W are single digits (0-9) representing your current belief:
-  S = social preference (0=hates being social, 9=loves being social)
-  M = morning preference (0=night owl, 9=morning person)
-  W = work preference (0=avoids work, 9=workaholic)
-
-Example: DEEP_WORK 3 8 7
+Example: 3 8 7 DEEP_WORK
 
-Tips:
-- Update your belief from rewards: SOCIALIZE giving big reward → raise S.
-- Early in the week, PROBE different actions to learn the person.
-- Late in the week, EXPLOIT what you've learned.
-- Watch for crashes: any meter under 0.1 → big penalty.
+Belief-action coupling guide:
+- High S: SOCIALIZE, FAMILY_TIME (extrovert boosts)
+- High M: DEEP_WORK in morning slots (morning-person bonus)
+- High W: DEEP_WORK, LEARN (workaholic energy)
+- Low S: MEDITATE, ME_TIME (introvert recharge)
+- Low M: DEEP_WORK in evening/night (night-owl bonus)
+
+Tactics:
+- Early week: PROBE varied actions to gather information.
+- Late week: EXPLOIT — pick actions matching your sharpened belief.
+- Don't repeat the same action; you'll get a repetition penalty.
+- Watch for crashes: any meter under 0.1 = big penalty.
 - Connection decays passively — actively maintain it.
 Respond with ONLY the format line, no other text.""")
 
@@ -212,7 +219,7 @@ Meters:
   Serenity: {obs.serenity:.2f}
   Connection: {obs.connection:.2f}{event_str}{history_str}
 
-Choose your action (format: ACTION S M W):""")
+Output belief then action (format: S M W ACTION_NAME):""")
 
     try:
         completion = llm_client.chat.completions.create(
scripts/train_on_hf.py CHANGED
@@ -57,13 +57,13 @@ PLOTS_DIR = "/tmp/rhythm_env/plots"
 FAST_MODE = os.environ.get("FAST_MODE", "0") == "1"
 
 if FAST_MODE:
-    # Smoke-train preset: 400 steps gives the policy room to escape iter-1 mode collapse
-    DEFAULTS = dict(MAX_STEPS=400, NUM_EPISODES=120, MAX_SAMPLES=1200,
-                    NUM_GENERATIONS=4, LORA_RANK=8, BETA=0.04,
+    # Iter 3 preset: 800 steps + 8 generations + LoRA 16 to escape mode collapse for real
+    DEFAULTS = dict(MAX_STEPS=800, NUM_EPISODES=200, MAX_SAMPLES=2000,
+                    NUM_GENERATIONS=8, LORA_RANK=16, BETA=0.04,
                     LEARNING_RATE=5e-5, EVAL_EPISODES=2)
 else:
-    DEFAULTS = dict(MAX_STEPS=1500, NUM_EPISODES=300, MAX_SAMPLES=3000,
-                    NUM_GENERATIONS=8, LORA_RANK=8, BETA=0.04,
+    DEFAULTS = dict(MAX_STEPS=2000, NUM_EPISODES=400, MAX_SAMPLES=4000,
+                    NUM_GENERATIONS=8, LORA_RANK=16, BETA=0.04,
                     LEARNING_RATE=5e-5, EVAL_EPISODES=5)
 
 MAX_STEPS = int(os.environ.get("MAX_STEPS", str(DEFAULTS["MAX_STEPS"])))
server/rhythm_environment.py CHANGED
@@ -507,6 +507,13 @@ class RhythmEnvironment(Environment):
         if done:
             final_score = self._grade_episode()
             reward_breakdown["final_score"] = round(final_score, 4)
+            # Iter 3 fix: sparse terminal reward — direct supervision on grader
+            # final_score is in [0, 1]; baseline-relative bonus gives strong signal
+            # for ending the episode well. Range: -2.5 (terrible) to +2.5 (perfect).
+            terminal_bonus = (final_score - 0.5) * 5.0
+            reward = max(-3.0, min(3.0, reward + terminal_bonus))
+            self._total_reward += terminal_bonus  # update tracking too
+            reward_breakdown["terminal_bonus"] = round(terminal_bonus, 4)
 
         # --- 14. Update state ---
         self._state.step_count = self._timestep
@@ -695,10 +702,20 @@ class RhythmEnvironment(Environment):
         self._vitality = max(0.0, self._vitality - vd)
 
     def _compute_reward(self, deltas: Dict[str, float]) -> float:
-        """Compute reward as hidden-weighted sum of meter deltas."""
+        """Compute reward as hidden-weighted sum + grader-aligned bias.
+
+        Iter 3 fix: Add a profile-INDEPENDENT bias term for progress and
+        connection. The original profile-weighted reward drives belief inference
+        (varies by profile), but allowed agents to game it by spamming recovery
+        actions if the sampled profile didn't weight progress/connection. The
+        bias makes the per-step reward correlate with the FINAL grader (which
+        weights progress 0.25 and connection 0.15).
+        """
         weights = self._profile["reward_weights"]
-        reward = sum(deltas[m] * weights[m] for m in METERS)
-        return reward * REWARD_SCALE
+        profile_reward = sum(deltas[m] * weights[m] for m in METERS) * REWARD_SCALE
+        # Grader-aligned bias: scaled so max bonus is ~0.1/step (manageable vs profile_reward)
+        grader_bias = 0.5 * deltas["progress"] + 0.4 * deltas["connection"]
+        return profile_reward + grader_bias
 
     def _grade_episode(self) -> float:
         """
training/dataset.py CHANGED
@@ -32,23 +32,29 @@ SYSTEM_PROMPT = (
     "different people differently — you must INFER who you're helping from the\n"
    "rewards and meter changes you observe.\n\n"
    "Each step, output ONE LINE in this exact format:\n"
-    "    ACTION_NAME S M W\n\n"
-    "where ACTION_NAME is one of:\n"
-    "    DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,\n"
-    "    FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH\n\n"
-    "and S, M, W are single digits (0-9) representing your current belief:\n"
+    "    S M W ACTION_NAME\n\n"
+    "First, write your belief about the person as 3 digits (0-9):\n"
     "  S = social preference (0=hates being social, 9=loves being social)\n"
     "  M = morning preference (0=night owl, 9=morning person)\n"
     "  W = work preference (0=avoids work, 9=workaholic)\n\n"
-    "Example: DEEP_WORK 3 8 7\n\n"
-    "Tips:\n"
-    "- Update your belief based on rewards: if SOCIALIZE gave a big positive reward,\n"
-    "  raise S; if it tanked vitality without a reward, lower S.\n"
-    "- Early in the week, PROBE different actions to learn the person.\n"
-    "- Late in the week, EXPLOIT what you've learned — pick actions matching\n"
-    "  the person's preferences (use your belief to guide).\n"
-    "- Watch for crashes: any meter under 0.1 → big penalty.\n"
-    "- Connection decays passively — actively maintain it.\n"
+    "Then choose the action that BEST FITS that belief, from:\n"
+    "    DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,\n"
+    "    FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH\n\n"
+    "Example: 3 8 7 DEEP_WORK (this person is moderately introverted, strongly\n"
+    "morning-oriented, fairly work-driven — so deep work in the morning fits)\n\n"
+    "Belief-action coupling guide:\n"
+    "- High S (extrovert): SOCIALIZE, FAMILY_TIME boost connection cheaply\n"
+    "- High M (morning person): DEEP_WORK in early slots gets bonus cognition\n"
+    "- High W (workaholic): DEEP_WORK, LEARN drive progress AND may energize\n"
+    "- Low S (introvert): MEDITATE, ME_TIME for solo recharge; avoid SOCIALIZE\n"
+    "- Low M (night owl): DEEP_WORK in evening/night slots\n\n"
+    "Tactics:\n"
+    "- Early in the week, PROBE varied actions to gather information.\n"
+    "- Update your belief from the rewards you see.\n"
+    "- Late in the week, EXPLOIT — pick actions matching your sharpened belief.\n"
+    "- Don't repeat the same action excessively; you'll get a repetition penalty.\n"
+    "- Watch for crashes: any meter under 0.1 = big penalty.\n"
+    "- Connection decays passively — actively maintain it via SOCIALIZE/FAMILY_TIME.\n"
     "Respond with ONLY the format line, no other text."
 )
 
@@ -96,7 +102,7 @@ def format_observation_prompt(obs, profile_hint: dict | None = None) -> str:
         f"{event_str}"
         f"{history_str}"
         f"{hint_str}\n\n"
-        f"Choose your action (format: ACTION S M W):"
+        f"Output your belief, then your action (format: S M W ACTION_NAME):"
     )
 
 
training/reward_functions.py CHANGED
@@ -2,20 +2,29 @@
 Reward functions for RhythmEnv GRPO training (meta-RL version).
 
 Four-layer reward stack:
-  1. format_valid — does the LLM output have a parseable action + belief format?
+  1. format_valid — does the LLM output have a parseable belief + action format?
   2. action_legal — is the action one of the 10 valid actions?
   3. env_reward — actual environment reward (seed-replay) for the chosen action
+     plus diversity/exploration shaping
   4. belief_accuracy — how close is the belief vector to the hidden profile's true vector?
 
-Action output format: "ACTION_NAME S M W"
-- ACTION_NAME: one of 10 valid actions
+ITER 3 FORMAT (belief-first): "S M W ACTION_NAME"
 - S, M, W: single digits 0-9 representing the agent's belief about the user
   S = social preference (0=hates social, 9=loves social)
   M = morning preference (0=night owl, 9=morning person)
   W = work preference (0=avoids work, 9=workaholic)
+- ACTION_NAME: one of 10 valid actions
 
-Example: "DEEP_WORK 3 8 7"
-  → action=DEEP_WORK, belief=[0.33, 0.89, 0.78]
+Example: "3 8 7 DEEP_WORK"
+  → belief=[0.33, 0.89, 0.78], action=DEEP_WORK
+
+Why belief-first: in causal LM generation, tokens generated EARLIER condition
+LATER tokens. With "S M W ACTION", the model commits to a belief first, and
+the action is then conditioned on that belief — making the belief functionally
+useful for action selection. The previous "ACTION S M W" order made belief a
+post-hoc afterthought that didn't influence behavior.
 
+The parser ALSO accepts the old "ACTION S M W" format for backward compatibility.
+
 Each function returns a list of floats (one per completion).
 """
@@ -58,7 +67,7 @@ def extract_action_and_belief(text: str) -> tuple[ActionType | None, list[float]
     if not parts:
         return None, list(DEFAULT_BELIEF), False
 
-    # Find action and its index in parts (try first token, then any token)
+    # Find action and its index in parts
     action: ActionType | None = None
     action_idx = -1
     for idx, p in enumerate(parts):
@@ -76,35 +85,53 @@ def extract_action_and_belief(text: str) -> tuple[ActionType | None, list[float]
         if action is not None:
             break
 
-    # Parse next 3 tokens AFTER the action as belief digits/floats
+    # Iter 3: parse belief from BEFORE the action (belief-first format).
+    # Falls back to AFTER the action (legacy format) if no digits found before.
+    def _parse_digit(token: str) -> float | None:
+        token = token.strip().rstrip(".")
+        if not token:
+            return None
+        try:
+            if len(token) == 1 and token.isdigit():
+                return int(token) / 9.0
+            val = float(token)
+            if val > 1.0:
+                val = val / 9.0
+            return max(0.0, min(1.0, val))
+        except (ValueError, IndexError):
+            return None
+
     belief: list[float] = []
     belief_provided = False
+
     if action_idx >= 0:
-        for i in range(3):
-            j = action_idx + 1 + i
-            if j < len(parts):
-                p = parts[j].strip().rstrip(".")
-                if not p:
-                    belief.append(0.5)
-                    continue
-                try:
-                    if len(p) == 1 and p.isdigit():
-                        belief.append(int(p) / 9.0)
-                        belief_provided = True
-                    else:
-                        val = float(p)
-                        if val > 1.0:
-                            val = val / 9.0  # interpret as 0-9 scale
-                        belief.append(max(0.0, min(1.0, val)))
-                        belief_provided = True
-                except (ValueError, IndexError):
-                    belief.append(0.5)
-            else:
-                belief.append(0.5)
-    else:
-        belief = list(DEFAULT_BELIEF)
-
-    if not belief:
+        # Try belief-first: 3 digits BEFORE the action
+        if action_idx >= 3:
+            cand = [_parse_digit(parts[action_idx - 3 + i]) for i in range(3)]
+            if all(c is not None for c in cand):
+                belief = cand  # type: ignore[assignment]
+                belief_provided = True
+
+        # If belief-first didn't work, try legacy after-action format
+        if not belief_provided:
+            after_belief: list[float] = []
+            after_provided = False
+            for i in range(3):
+                j = action_idx + 1 + i
+                if j < len(parts):
+                    d = _parse_digit(parts[j])
+                    if d is not None:
+                        after_belief.append(d)
+                        after_provided = True
+                    else:
+                        after_belief.append(0.5)
+                else:
+                    after_belief.append(0.5)
+            if after_provided:
+                belief = after_belief
+                belief_provided = True
+
+    if not belief or len(belief) != 3:
         belief = list(DEFAULT_BELIEF)
 
     return action, belief, belief_provided
@@ -213,13 +240,30 @@ def env_reward(
             env = _replay_env(ep_seed, ep_history, ep_mode)
             obs = env.step(RhythmAction(action_type=action_type))
             reward = obs.reward
-            # Iter 2 fix: explicit repetition penalty as TRAINING signal (env already
-            # dampens repeats but doesn't add explicit negative reward). Discourages
-            # the lazy "spam one action" mode collapse pattern.
+            chosen = action_type.value
+
+            # Iter 2 fix: explicit 3-in-a-row repetition penalty
             if ep_history and len(ep_history) >= 2:
-                recent = ep_history[-3:]  # last 3 actions
-                if recent.count(action_type.value) >= 2:  # this would make 3+ in a row
+                recent3 = ep_history[-3:]
+                if recent3.count(chosen) >= 2:  # this action would make 3+ in a row
                     reward -= 0.3
+
+            # Iter 3 fix: N-CYCLE penalty (catches the M-E-M-E-... loop iter 2 fell into)
+            # If last 6 actions (including this one) have <=2 unique values, apply penalty
+            if ep_history and len(ep_history) >= 5:
+                last6 = ep_history[-5:] + [chosen]
+                if len(set(last6)) <= 2:
+                    reward -= 0.4
+
+            # Iter 3 fix: NEW-ACTION exploration bonus
+            # If this action hasn't appeared yet in the current episode, +0.2.
+            # Strong incentive in early steps to TRY varied actions, fading as
+            # the action set grows. Stops once 6+ different actions tried.
+            if ep_history is not None:
+                seen = set(ep_history)
+                if chosen not in seen and len(seen) < 6:
+                    reward += 0.2
+
             scores.append(reward)
         except Exception:
             scores.append(-3.0)