Lgr54HFi committed
Commit e4d9588 · verified · 1 Parent(s): dda344d

Fix loss rebound: lower Muon LR (0.02→0.008), clamp ternary latents, steeper cosine decay

Root cause: Muon's Newton-Schulz (NS) orthogonal update plus momentum=0.95 pushed
ternary latent weights outside the STE clamp zone [-1, 1] after ~230 steps at
LR=0.02. The clamp-aware STE gradient is zero for weights outside [-1, 1], so
those weights become permanently dead: no gradient ever pulls them back. This
progressively degraded model capacity, causing loss to rebound from 5.43 back to 6.48.
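
A minimal sketch of this failure mode, assuming a BitNet-style clamp-then-round
STE; the class and names below are illustrative, not the actual chimera
BitLinear code:

import torch

class TernarySTE(torch.autograd.Function):
    # Illustrative clamp-aware straight-through estimator for ternary weights.
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        # Quantize the fp latent weight to {-1, 0, +1}.
        return torch.round(torch.clamp(w, -1.0, 1.0))

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Clamp-aware STE: gradient passes only where the latent weight sits
        # inside [-1, 1]. Outside that zone the gradient is exactly zero, so a
        # weight that momentum pushes to, say, 1.3 never receives a gradient
        # that could pull it back; it is permanently dead.
        mask = (w.abs() <= 1.0).to(grad_out.dtype)
        return grad_out * mask

# Usage in a BitLinear-style layer: w_q = TernarySTE.apply(latent_weight)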

Three fixes:

1. LR 0.02→0.008: Standard Muon LR is 0.02 for dense fp32 weights with
unbounded range. Ternary STE restricts the useful weight range to [-1,1]
and the gradient-active zone to the same interval. The per-step weight
perturbation must be proportionally smaller. 0.008 gives ~2.5x slower
convergence but prevents overshoot past the STE boundary.

2. Latent weight clamping to [-2, 2]: After every Muon 2D update, clamp
the latent weights to [-2, 2]. This is a safety net: weights that drift
past ±1 under accumulated momentum are pulled back toward the
gradient-active zone. The ±2 bound (not ±1) tolerates slight overshoot
that the STE forward's clamp+round still maps correctly to {-1, 0, +1}
(see the sketch after this list).

3. Cosine min_ratio 0.01→0.05: The old schedule kept LR near peak for
too long. With ternary weights, you want to reach a low-LR fine-tuning
regime faster. At 5% of peak (0.008 * 0.05 = 0.0004), the per-step
update is small enough to fine-tune within the ternary basin without
escaping it.
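
A minimal sketch of fixes 2 and 3 together, using hypothetical helper names;
the real hooks live in chimera/training/loops.py and chimera_turbo.apply,
which this sketch does not reproduce:

import math
import torch

LATENT_CLAMP = 2.0   # fix 2: safety net, slightly wider than the +/-1 gradient-active zone
MIN_RATIO = 0.05     # fix 3: LR floor at 5% of peak

def clamp_ternary_latents(model):
    # Fix 2 (illustrative): after each Muon step, bound the 2D latent weight
    # matrices to [-2, 2] so momentum overshoot cannot carry them arbitrarily
    # far past the +/-1 STE boundary.
    with torch.no_grad():
        for p in model.parameters():
            if p.ndim == 2:          # Muon updates the 2D weight matrices
                p.clamp_(-LATENT_CLAMP, LATENT_CLAMP)

def cosine_lr(step, total_steps, peak_lr=0.008, warmup=100, min_ratio=MIN_RATIO):
    # Fix 3 (illustrative): linear warmup, then cosine decay from peak_lr down
    # to min_ratio * peak_lr (0.008 * 0.05 = 0.0004 at the end of training).
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

Per step the order matters: optimizer.step() first, then
clamp_ternary_latents(model), then write cosine_lr(step, total_steps) into each
param group's "lr" before the next step.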

Files changed (1)
  1. chimera/training/loops.py +6 -9
chimera/training/loops.py CHANGED
@@ -53,15 +53,12 @@ def train_standard_loop(args, model, config, loader, compute_loss, optimizer, us
 def train_hyper_loop(args, model, config, dataset, initial_seq, grow, unfreezer):
     use_compile = getattr(args, "compile", False)
 
-    # Muon needs higher LR than AdamW: NS orthogonalization normalizes
-    # update direction, so LR controls step SIZE not direction stability.
-    # 0.02 is the standard Muon LR; CLI default 1.5e-3 was for AdamW.
-    # Warmup shortened: NS already provides early stability.
-    #
-    # MTP DISABLED (mtp_heads=0): lm_head (256->200073) costs 4x the entire
-    # 28-layer stack. Each MTP head doubles that. At loss=13 the model can't
-    # predict token+1, so token+2 is noise. Re-enable once loss < 5.
-    muon_lr = max(args.lr, 0.02)
+    # Muon LR for ternary BitLinear: standard Muon uses 0.02 for dense fp32/bf16
+    # weights, but ternary STE has a much narrower useful weight range [-1, 1].
+    # The NS unit-orthogonal update + momentum accumulation causes overshoot
+    # past step ~230, pushing weights outside the STE clamp zone (zero gradient).
+    # Optimal for ternary: 0.008 peak with aggressive cosine decay.
+    muon_lr = 0.008
     muon_warmup = min(args.warmup, 100)
     model, optimizer, scheduler, extras = chimera_turbo.apply(
         model,