InosLihka Claude Opus 4.7 (1M context) committed
Commit e21a960 · 1 parent: dc0186f

iter2: fix mode collapse + 3 deeper bugs from code review


ITER 1 FAILURE: mode collapse to 'EXERCISE 5 5 5' at step 100. Root cause:
format_valid (+1.0) and action_legal (+0.5) saturated for any valid output,
contributing zero to GRPO advantage. Once 4/4 completions per prompt were
identical, reward_std=0 and the policy froze.
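
For context, a minimal sketch of the group-relative advantage (standard GRPO
formulation, not TRL's exact code) showing why constant reward layers and
identical completions stall learning:

    import statistics

    # Each group holds the rewards of the 4 completions sampled for one prompt.
    def group_advantages(rewards: list[float]) -> list[float]:
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards)
        if std == 0.0:
            # Iter-1 failure mode: identical completions -> zero advantage everywhere.
            return [0.0] * len(rewards)
        return [(r - mean) / std for r in rewards]

    # Saturated layers (format +1.0, legal +0.5) add the same constant to every
    # completion, so they cancel against the group mean and contribute nothing.
    print(group_advantages([1.5, 1.5, 1.5, 1.5]))  # [0.0, 0.0, 0.0, 0.0] -> frozen policy
    print(group_advantages([1.5, 1.7, 2.1, 0.9]))  # nonzero only because env/belief vary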

8 changes applied (4 fixes from the initial diagnosis, 3 from the deep-review subagent, plus a FAST_MODE preset bump):

[1] sampling temperature 1.0 -> 1.5
Forces diverse rollouts per prompt and breaks the mode-collapse mechanism.
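
A toy illustration (logits are made up, not from the model) of how the higher
temperature flattens the per-token sampling distribution:

    import math

    def softmax(logits, temperature):
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [3.0, 1.5, 1.0]          # hypothetical action logits
    print(softmax(logits, 1.0))       # ~[0.74, 0.16, 0.10] -> near-greedy, repeats likely
    print(softmax(logits, 1.5))       # ~[0.61, 0.23, 0.16] -> flatter, rollouts diverge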

[2] reward_weights [0.3, 0.3, 1.0, 1.0] -> [0.05, 0.05, 1.5, 3.0]
Aggressively suppress saturated layers, amplify variable signals.
Belief weight bumped to 3.0 (CRITICAL-1) so emitting a belief is clearly
rewarded relative to the no-belief penalty.
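
A back-of-envelope check (per-layer scores below are made up) of how the new
weights shift the variance of the combined reward onto env_reward and
belief_accuracy:

    # Order matches reward_funcs: format_valid, action_legal, env_reward, belief_accuracy.
    old_w = [0.3, 0.3, 1.0, 1.0]
    new_w = [0.05, 0.05, 1.5, 3.0]

    # Four completions for one prompt: format/legal saturated, env/belief varying.
    layer_scores = [
        [1.0, 0.0, 0.8, 0.6],
        [1.0, 0.0, -0.5, 0.2],
        [1.0, 0.0, 1.2, 0.9],
        [1.0, 0.0, 0.1, 0.0],
    ]

    def totals(weights):
        return [sum(w * s for w, s in zip(weights, scores)) for scores in layer_scores]

    print(totals(old_w))  # constant format/legal offset cancels in the group mean anyway
    print(totals(new_w))  # spread between completions roughly doubles, driven by belief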

[3] action_legal: drop +0.5 for valid, return 0 instead
Removes another constant-reward source. Layer is now pure penalty for
malformed outputs.

[4] explicit repetition penalty in env_reward (-0.3 if the same action is seen 3+ times in a row)
Direct training signal against mode collapse. Env already does effect
dampening but doesn't add explicit negative reward for repetition.

[5] _grade_episode late_quality normalization fix (CRITICAL-2)
Per-step rewards are CLAMPED to [-3, +3] in step(), not [-1, +1] as the
old normalization assumed. The old grader saturated late_quality at 1.0 for any
mean_late >= +1, leaving it blind to the difference between a good and an
excellent late half. Fixed to use the actual reward range.
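
A quick worked check (mean_late values are illustrative) of the saturation the
fix removes:

    def old_late_quality(mean_late):  # assumed per-step rewards in [-1, +1]
        return max(0.0, min(1.0, (mean_late + 1.0) / 2.0))

    def new_late_quality(mean_late):  # actual [-3, +3] clamp range from step()
        return max(0.0, min(1.0, (mean_late + 3.0) / 6.0))

    for mean_late in (1.0, 2.0, 2.8):
        print(mean_late, old_late_quality(mean_late), new_late_quality(mean_late))
    # 1.0 -> old 1.00, new ~0.67   (old grader already saturated here)
    # 2.0 -> old 1.00, new ~0.83
    # 2.8 -> old 1.00, new ~0.97   (excellent late half finally scores higher)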

[6] hint_fraction default 0.15 -> 0.0 (MAJOR-3)
Eliminates train-eval distribution mismatch. Eval never shows hints, so
training with hints creates a fraction of training examples whose lessons
don't transfer. Set to 0 by default; can be re-enabled if eval also adds
hint visibility.
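
For reference, a hypothetical sketch (field names assumed, not the repo's
actual dataset builder) of what a nonzero hint_fraction does during dataset
construction, and why it mismatches a hint-free eval:

    import random

    def build_prompt(episode: dict, hint_fraction: float, rng: random.Random) -> str:
        # 'observation_text' and 'profile_hint' are assumed field names for illustration.
        prompt = episode["observation_text"]
        if rng.random() < hint_fraction:
            # Training-only information; eval never appends this line.
            prompt += "\nHINT: profile is " + str(episode["profile_hint"])
        return prompt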

[7] env_reward seed fallback hardening (MAJOR-1)
Replace 'i % 50' fallback with '(i * 17) ^ 0xBEEF' to break deterministic
seed clusters. Avoids the worst case where 4 completions in the same
position class get identical env_reward (and hence zero GRPO advantage).
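
A small check of the two fallbacks (indices chosen arbitrarily): any two prompt
indices 50 apart collide under the old fallback but not under the mixed one:

    for i in (3, 53, 103, 153):
        print(i, i % 50, (i * 17) ^ 0xBEEF)
    # i=3:   old 3, new 48860
    # i=53:  old 3, new 48490
    # i=103: old 3, new 47160
    # i=153: old 3, new 46278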

[8] FAST_MODE preset: MAX_STEPS 200 -> 400, samples 800 -> 1200
Iter 1 collapsed at step 100, leaving 100 stuck steps. Gives iter 2 more
room to recover from local optima.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

scripts/train_on_hf.py CHANGED
@@ -57,8 +57,8 @@ PLOTS_DIR = "/tmp/rhythm_env/plots"
 FAST_MODE = os.environ.get("FAST_MODE", "0") == "1"
 
 if FAST_MODE:
-    # Smoke-train preset: enough signal in 200 steps to decide go/no-go
-    DEFAULTS = dict(MAX_STEPS=200, NUM_EPISODES=80, MAX_SAMPLES=800,
+    # Smoke-train preset: 400 steps gives the policy room to escape iter-1 mode collapse
+    DEFAULTS = dict(MAX_STEPS=400, NUM_EPISODES=120, MAX_SAMPLES=1200,
                     NUM_GENERATIONS=4, LORA_RANK=8, BETA=0.04,
                     LEARNING_RATE=5e-5, EVAL_EPISODES=2)
 else:
server/rhythm_environment.py CHANGED
@@ -739,11 +739,13 @@ class RhythmEnvironment(Environment):
         if early and late:
             mean_early = sum(early) / len(early)
             mean_late = sum(late) / len(late)
-            # late_quality: rewards typically in [-1, 1] per step, normalize
-            late_quality = max(0.0, min(1.0, (mean_late + 1.0) / 2.0))
+            # Iter 2 fix: per-step rewards are CLAMPED to [-3, +3] in step(), not [-1, +1].
+            # Old normalization saturated late_quality at 1.0 for any mean_late >= +1,
+            # making the grader unable to distinguish good from excellent late-half.
+            late_quality = max(0.0, min(1.0, (mean_late + 3.0) / 6.0))
             gain = mean_late - mean_early
-            # gain typically in [-1, 1]; clip to [0, 1] (only positive counts)
-            gain_norm = max(0.0, min(1.0, gain))
+            # gain in [-6, +6]; normalize to [0, 1] (only positive gain counts)
+            gain_norm = max(0.0, min(1.0, gain / 3.0))
             adaptation_score = gain_norm * late_quality
         else:
             adaptation_score = 0.0
training/reward_functions.py CHANGED
@@ -142,13 +142,15 @@ def action_legal(completions, **kwargs) -> list[float]:
     Layer 2: Is the parsed action one of the 10 valid actions?
 
     All 10 actions are always legal in this env (no state-dependent validity).
-    +0.5 if legal, -1.0 if not parseable.
+    Iter 2 fix: returns 0 for valid (was +0.5) so this layer becomes pure penalty
+    for malformed outputs. The +0.5 was a constant reward that contributed zero
+    to GRPO advantage and helped trigger mode collapse in iter 1.
     """
     scores = []
     for completion in completions:
         response = completion[0]["content"] if isinstance(completion, list) else completion
         action = extract_action(response)
-        scores.append(0.5 if action is not None else -1.0)
+        scores.append(0.0 if action is not None else -1.0)
     return scores
 
 
@@ -201,14 +203,24 @@ def env_reward(
             ep_history = prompt_data.get("action_history", [])
             ep_mode = prompt_data.get("profile_mode", "continuous")
         else:
-            ep_seed = i % 50
+            # Iter 2 fix: mix index with prime to break deterministic seed clusters
+            # (avoids all completions in a position-class getting identical env_reward)
+            ep_seed = (i * 17) ^ 0xBEEF
            ep_history = []
            ep_mode = "continuous"
 
         try:
             env = _replay_env(ep_seed, ep_history, ep_mode)
             obs = env.step(RhythmAction(action_type=action_type))
-            scores.append(obs.reward)
+            reward = obs.reward
+            # Iter 2 fix: explicit repetition penalty as TRAINING signal (env already
+            # dampens repeats but doesn't add explicit negative reward). Discourages
+            # the lazy "spam one action" mode collapse pattern.
+            if ep_history and len(ep_history) >= 2:
+                recent = ep_history[-3:]  # last 3 actions
+                if recent.count(action_type.value) >= 2:  # this would make 3+ in a row
+                    reward -= 0.3
+            scores.append(reward)
         except Exception:
             scores.append(-3.0)
 
training/train.py CHANGED
@@ -41,8 +41,10 @@ def main():
                         help="KL penalty (TRL/DeepSeek default; raise to 0.1+ if KL diverges)")
     parser.add_argument("--lora_rank", type=int, default=8,
                         help="LoRA rank (8 = more capacity than original 4 for meta-RL)")
-    parser.add_argument("--hint_fraction", type=float, default=0.15,
-                        help="Fraction of dataset with profile hint visible (curriculum warmup)")
+    parser.add_argument("--hint_fraction", type=float, default=0.0,
+                        help="Fraction of dataset with profile hint visible. Default 0.0 (no hints) "
+                             "to eliminate train-eval distribution mismatch. Set >0 only if you ALSO "
+                             "show hints during eval.")
     parser.add_argument("--profile_mode", type=str, default="continuous",
                         choices=["continuous", "discrete"],
                         help="continuous = sampled per-episode (meta-RL); discrete = 3 hardcoded profiles")
@@ -144,13 +146,15 @@ def main():
     max_prompt_length = 600  # history + hint room
     max_completion_length = 32  # bumped from 20 to prevent silent truncation of belief digits
 
-    # reward_weights: scale per-layer to prevent format/legal (saturated near +1) from
-    # drowning out env_reward and belief_accuracy (the actual learning signals).
+    # reward_weights: aggressively suppress saturated format/legal layers, amplify
+    # the variable signals (env + belief). Iter 1 collapsed because format/legal at
+    # +1.0/+0.5 contributed zero to GRPO advantage; iter 2 makes belief the dominant
+    # learning signal.
     # Order MUST match reward_funcs in main(): format_valid, action_legal, env_reward, belief_accuracy
-    reward_weights = [0.3, 0.3, 1.0, 1.0]
+    reward_weights = [0.05, 0.05, 1.5, 3.0]
 
     training_args_kwargs = dict(
-        temperature=1.0,
+        temperature=1.5,  # bumped from 1.0 to force diverse rollouts and break mode collapse
         learning_rate=args.learning_rate,
         beta=args.beta,
         max_grad_norm=0.5,