Hopper: The Optimizer That Learns Parallelism 2x Faster Than Adam

Community Article Published February 6, 2026

Intro: Speeding Up Intelligence

A 0.5B model trained with a modified optimizer learned parallel reasoning at step 200 that Adam failed to learn even at step 400. This shouldn't be possible - smaller checkpoint, same model, just a different optimizer. But it happened.

I introduce Hopper: a modified Muon optimizer designed for RL stability. The headline result? Hopper at Step 200 ties with Adam at Step 400.

But the way they tie tells a deeper story about machine intelligence:

  • Adam (Step 400): Solved the "Sister's Age" arithmetic problem but failed the tricky "Towel Drying" reasoning problem (assuming twice as many towels take twice as long to dry).
  • Hopper (Step 200): Failed the arithmetic (almost had it 😭) but solved the Towel Question correctly.

Hopper didn't just learn faster; it learned differently. By using orthogonal updates, it broke the model's reliance on linear heuristics ("more towels = more time") 2x faster than Adam, effectively accelerating the emergence of complex reasoning.

The Test That Changed Everything

I evaluated on GSM8K math problems. Two questions told the story:

Q1: "When I was 6, my sister was half my age. Now I'm 70. How old is my sister?"

  • Adam (step 400): ✅ Correctly calculated 67
  • Hopper (step 200): ❌ Made arithmetic error (got 3)

Q2: "3 towels dry in 1 hour. How long for 6 towels?"

  • Adam (step 400): ❌ Answered "2 hours" (linear scaling)
  • Hopper (step 200): ✅ Answered "1 hour" (understood parallelism!)

Same model. Same training data. Different optimizer. Adam learned arithmetic faster. Hopper learned reasoning faster.


Figure 1-1: Adam Eval at cp-400


Figure 1-2: Hopper Eval at cp-200


Figure 1-3: Final score card

Here is the story of the ablations, the failures, and the physics behind Hopper.


The Problem: Muon’s "Hollow Shell" in RL

Muon is incredible for pre-training, but in RL fine-tuning (specifically DAPO/GRPO on small models), I observed a consistent failure mode I call the "Entropy Floor": entropy refuses to collapse at the end of training.

  • The Symptom: Standard Muon maintains a high entropy floor. The model never "settles" on a format or answer.
  • The Result: "Mutant Chicken" hallucinations. The model invents logical absurdities (e.g., "chickens with 2 heads") just to satisfy the rigid geometric constraints of the optimizer without actually converging on the truth.
  • The Turbulence: While I never hit hard NaNs, the training was plagued by KL spikes and gradient explosions that produced garbage outputs.

I needed to understand why.


The Ablation Study

I ran a rigorous comparison of three baselines to isolate the variables: SGD, Adam, and Muon.

1. The "Metrics Lie" Discovery (SGD)

  • Observation: SGD's training curves (KL/Loss) looked surprisingly similar to Adam's.
  • Reality: On the hard evaluation set, SGD finished last.
  • Lesson: In RL, a smooth loss curve doesn't mean the model is smart; it often just means the model is safely outputting mediocrity.


Figure 2-1: SGD (red) earned better rewards during training than Muon (grey), but scored worse on eval


Figure 2-2: SGD (red) and Adam (brown) have similar entropy profiles, unlike Muon (grey)

2. The Hypothesis: Variance + Orthogonality

This led to a component analysis:

  • Variance Term: The "Safety Belt." Adam has it; SGD doesn't. This explains Adam's stability.
  • Orthogonal Updates: The "Structure." Muon has it; Adam doesn't. This explains why Muon performed better in eval than SGD.

We needed both.
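To make the component analysis concrete, here is a minimal NumPy sketch of the two ingredients in isolation: Adam's second-moment "safety belt" and a raw SGD step for contrast. The function names and hyperparameters are mine, chosen for illustration only:

```python
import numpy as np

def sgd_update(g, lr=1e-2):
    # No safety belt, no structure: follow the raw gradient.
    return -lr * g

def variance_normalized_update(g, v, beta2=0.999, eps=1e-8, lr=1e-2):
    # Adam-style "safety belt": divide by a running estimate of the
    # gradient's second moment, so outlier coordinates are damped.
    v = beta2 * v + (1 - beta2) * g * g
    return -lr * g / (np.sqrt(v) + eps), v

# A spiky gradient: the safety belt shrinks the outlier coordinate
# relative to the rest, while SGD passes the spike straight through.
g = np.array([0.01, 0.01, 10.0])
v = np.zeros(3)
upd, v = variance_normalized_update(g, v)
```

Muon contributes the missing third ingredient, the orthogonal "structure," via Newton-Schulz iterations on the update matrix.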


The Fix: Building Hopper

I developed Hopper through systematic debugging of Muon's RL failures.

Stage 1: Variance Normalization (Stabilization)

First, I added Adam's second moment to Muon's orthogonal updates:

  • ✅ Eliminated gradient explosions
  • ✅ Removed KL spikes
  • ❌ But entropy floor remained
  • ❌ Model hallucinated (e.g., inventing "pigs" 🐷 in chicken-cow problems)

This configuration (variance + ns_steps=5) mirrors work others were independently exploring. I call it "AdaMuon" to acknowledge the convergent discovery.

Diagnosis: Full orthogonalization (5 Newton-Schulz iterations) creates rigid geometric constraints. The model would rather hallucinate impossible solutions than violate the optimizer's demands.

Stage 2: Minimal Orthogonalization (The Breakthrough)

The key question: Does the degree of orthogonalization matter?

I reduced Newton-Schulz iterations from 5 to 1:

  • Concept: "Lazy orthogonality" - a soft stretch, not rigid perfection
  • Result: Model became "honest"
    • ✅ No more hallucinations
    • ✅ Stable KL divergence
    • ✅ Fast learning (2x faster than Adam)
    • ✅ Creative reasoning (discovered parallelism!)

Hopper = Variance normalization + ns_steps=1

The ns_steps=1 insight appears unique to this work and is what distinguishes Hopper from AdaMuon variants.
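As a sketch of the recipe (not the released code, which may differ), a Hopper-style step for a square weight matrix might look like this: variance-normalize the gradient first, then apply a single Newton-Schulz pass. Names and hyperparameters here are my own choices; the quintic coefficients follow the public Muon reference implementation:

```python
import numpy as np

def hopper_step(w, grad, state, lr=1e-3, beta2=0.999, eps=1e-8, ns_steps=1):
    """One Hopper-style update for a square 2-D weight matrix:
    Adam's second-moment "safety belt" followed by lazy orthogonalization.
    Setting ns_steps=5 would recover AdaMuon-like behavior."""
    # Safety belt: per-coordinate variance normalization.
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    g = grad / (np.sqrt(state["v"]) + eps)
    # Lazy orthogonality: quintic Newton-Schulz, one soft stretch only.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = g / (np.linalg.norm(g) + 1e-7)   # scale so singular values <= 1
    for _ in range(ns_steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return w - lr * X
```

The `state` dict plays the role of the optimizer's per-parameter buffer (initialize `state["v"]` to zeros of the weight's shape); a real implementation would also handle non-square matrices and bias-correct the second moment.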


Figure 3: "Lazy orthogonality" made training more stable

What ns_steps Actually Does

Think of Newton-Schulz iterations like stretching a rubber band:

  • ns_steps=5 (Original Muon): Stretch until perfectly taut

    • Result: Rigid structure, can't adapt
    • In RL: Model can't "give up" on complex solutions even when simple ones work
  • ns_steps=1 (Hopper): One gentle stretch

    • Result: Structured but flexible
    • In RL: Model explores alternatives but can still converge
  • ns_steps=0 (Adam): No stretching

    • Result: No structure, pure gradient following
    • In RL: Converges fast but might miss creative solutions
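The rubber-band analogy maps onto a short routine. Here is a NumPy sketch of the quintic Newton-Schulz iteration (coefficients as in the public Muon reference implementation): each pass pushes the update matrix's singular values toward 1, so more steps mean a tauter, more orthogonal update:

```python
import numpy as np

def newton_schulz(G, steps):
    """Approximately orthogonalize G with `steps` quintic Newton-Schulz
    iterations (a, b, c are the coefficients used by Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # scale so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:                        # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

With `steps=1` the singular values are only partially stretched toward 1 (the "soft stretch"); with `steps=5` they cluster tightly around 1, which is the rigid regime described above.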


Figure 4: Newton-Schulz, 1 vs 5 steps


The Catch: Early Stopping Required

Here's the twist: Hopper's best checkpoint is NOT the final one.

  • Step 200: Peak performance (4/5 hard problems) ✅
  • Step 400: Lost "smartness" (3/5; back to claiming the towels take 2 hours) 🥀


Figure 5: Hopper lost its edge at cp-400

Why? The constant orthogonal rotation prevents full convergence. The model stays in "exploration mode" even when it should commit.

Solution: Two-phase training:

  1. Steps 0-200: Hopper (fast exploration)
  2. Steps 200-400: Switch to Adam (convergence)

This hybrid approach might combine Hopper's creative reasoning with Adam's disciplined refinement. (Testing in progress!)
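The two-phase idea can be sketched as a simple schedule. This is a hypothetical driver, not the in-progress training code; in a real loop you would instantiate a fresh optimizer (with reset state) at the switch point:

```python
def build_schedule(total_steps=400, switch_step=200):
    """Two-phase optimizer annealing: Hopper explores early, then Adam
    takes over to collapse entropy and lock in convergence."""
    return ["hopper" if s < switch_step else "adam" for s in range(total_steps)]

# Usage: consult the schedule each step and rebuild the optimizer on change.
schedule = build_schedule()
```

The `switch_step=200` default matches the checkpoint where Hopper peaked in these runs; in practice it would be a tuning knob.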


The Verdict: The "Best of Both Worlds" Strategy

Hopper is not a magic wand; it is a specialized tool.

  • The Win: It accelerates the "Aha!" moments in reasoning (breaking linear heuristics) significantly faster than Adam.
  • The Trade-off: Because it constantly rotates the weights, it resists "total convergence." If you run it too long, the model may drift and become clumsy on simple tasks (like arithmetic).

My Recommendation: Use Hopper for the first 50% of training to discover structural features and reasoning paths quickly. Then, Early Stop or switch to Adam (optimizer annealing) to "collapse" the entropy and lock in the discipline for the final mile.

Hopper is live. The code is simple: ns_steps=1 + variance normalization. Happy experimenting. 🐇


What This Means

We've been using Adam for everything since 2014. It's reliable, fast, and well-understood. But maybe it's also limiting us.

If a tiny 0.5B model can learn parallel reasoning with one optimizer that Adam misses, what else are we missing? What capabilities require exploration strategies we haven't tried?

Hopper isn't perfect - it needs early stopping, it's finicky, it hallucinates pigs if you run it too long. But it proves that alternative optimization strategies can discover solutions the default approach misses.

The code is simple: ns_steps=1 + variance normalization. The implications might not be. Code and checkpoints can be found 👉🏻 here.

Try it. Break it. Build something better. 🐰
