Field Notes: The Dilemma of Training Reasoning with Muon

Community Article · Published January 26, 2026

Reward Hacking, Structural Collapse, and the search for a stable signal in RL.

By Jen Wei

Everyone is talking about Muon right now. The promise is intoxicating: an optimizer that orthogonalizes gradients to train LLMs faster and more efficiently than AdamW. But almost all the conversation is about pre-training—slamming massive datasets into base models to teach them facts.

I wanted to know: Can Muon handle the delicate, high-variance world of Reinforcement Learning (RL)?

Specifically, I wanted to see if I could take a tiny model—Qwen 2.5-0.5B-Instruct—and teach it to reason better using GRPO (Group Relative Policy Optimization) on a "lunch money" budget (Google Colab, ~$10).

The answer? Yes, but it fights you every step of the way.

Here are my field notes from the bleeding edge of experimental optimization.


The Hypothesis: Harnessing the Chaos

Muon is effectively a "high-velocity" optimizer. By orthogonalizing the update matrix (driving all of its singular values toward 1 via a Newton-Schulz iteration), it discards the magnitude of the gradient and keeps only its direction.

  • In Pre-training: This helps the model escape local minima and learn features rapidly.
  • In RL (Hypothesis): My theory was that this "high entropy" approach might help a small model explore reasoning paths that Adam (which is more conservative) might miss.
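The orthogonalization step at the heart of Muon can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's implementation: the quintic coefficients below are the commonly published ones (assumed here), and the iteration only approximately pushes singular values toward 1.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a 2D gradient matrix, Muon-style.

    Quintic Newton-Schulz iteration; coefficients are the widely cited
    ones and are an assumption of this sketch.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius normalization bounds the spectral norm by 1, which the
    # iteration needs to avoid diverging.
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:  # iterate on the "wide" orientation for cheaper matmuls
        x = x.T
    for _ in range(steps):
        a_mat = x @ x.T
        x = a * x + (b * a_mat + c * (a_mat @ a_mat)) @ x
    return x.T if transposed else x
```

Whatever the raw gradient's scale, the output has all singular values in a narrow band around 1 — which is exactly why Muon "ignores magnitude and keeps direction."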

I set up a constrained environment to test this:

  • Model: Qwen 2.5-0.5B-Instruct
  • Stack: TRL (GRPO Trainer) + Custom "Indie" Muon
  • Hardware: Single T4/A100 (Colab)
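For reference, the stack above wires together roughly like this. A hedged configuration sketch, not the article's exact script: hyperparameters and the placeholder reward are illustrative, the interfaces follow current TRL documentation and may differ across versions, and running it requires a GPU plus a model download.

```python
# Sketch of the GRPO training stack described above (pip install trl datasets).
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def reward_fn(completions, **kwargs):
    # Placeholder reward: 1.0 if the completion contains a \boxed{...} answer.
    return [1.0 if "\\boxed{" in c else 0.0 for c in completions]

train_dataset = Dataset.from_dict(
    {"prompt": ["Lisa runs 3 miles a day for 5 days. How far in total?"]}
)

config = GRPOConfig(
    output_dir="qwen-grpo-muon",
    learning_rate=1e-5,           # the "stability pocket" LR found later
    num_generations=4,            # group size for GRPO's relative advantages
    max_completion_length=256,
    per_device_train_batch_size=4,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_fn,
    args=config,
    train_dataset=train_dataset,
)
# trainer.train()
# The custom Muon-SGD hybrid can be swapped in through the Trainer's
# `optimizers=(optimizer, scheduler)` hook when constructing the trainer.
```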

The "Four Horsemen" of Failure

If you are used to the smooth, downward-sloping loss curves of Supervised Fine-Tuning (SFT), Muon-RL will give you a heart attack. We encountered four distinct failure modes.

1. The "Lying" Loss Curve

In GRPO, the loss curve is a lie. It oscillates wildly around zero. A "good" run can look like a seismograph in an earthquake.

  • Lesson: Stop watching the loss. Watch the Reward Mean. If the reward is climbing, the model is learning, even if the loss looks like it's exploding.

2. The "Wack" Temperature Trap

Early on, I thought the model was broken because it was outputting gibberish.

  • The Mistake: I was evaluating at Temperature=1.0.
  • The Reality: RL training needs high entropy to explore. But evaluation needs focus. When I switched to Greedy Decoding (Temperature=0), the "drunk" model instantly became a genius.
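The effect is easy to see on raw logits, with no model involved. Sampling at Temperature=1.0 keeps the next-token distribution flat (high entropy, good for exploration during RL), while lowering the temperature concentrates mass on the argmax — the token greedy decoding would take anyway. A small stand-alone illustration with hypothetical logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means a more focused distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.5, 1.0, 0.5]                      # hypothetical next-token logits
explore = softmax_with_temperature(logits, 1.0)    # training-time sampling
focused = softmax_with_temperature(logits, 0.2)    # near-greedy evaluation
```

Note that the argmax token never changes between the two settings; only how often the alternatives get sampled does. That is why the "drunk" model at Temperature=1.0 and the "genius" under greedy decoding are the same weights.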

3. Structural Collapse (The Blind Cliff)

This was the most fascinating failure. To evaluate the model, we require the answer to be in a specific format: \boxed{answer}.

  • Adam is a "Listener." It respects small gradients that say "keep the box."
  • Muon is a "Shouter." It orthogonalizes the update, effectively normalizing small "format" gradients to be just as loud as massive "logic" gradients.

The Consequence: Muon’s violent updates often smashed the delicate "syntax weights." The model would output the correct answer (7) but forget the box. The reward function would return 0.0. The model went "blind" to its own success.
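A reward function in the spirit of the one described — a sketch under assumptions, since the article's actual implementation isn't shown. It makes the "blind cliff" concrete: a completion with the right answer but a smashed format scores exactly 0.0.

```python
import re

# Matches \boxed{...} with a simple, non-nested body (an assumption of
# this sketch; real answers with nested braces would need a fuller parser).
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def boxed_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 only if the completion contains \\boxed{<gold_answer>}."""
    match = BOXED_RE.search(completion)
    if match is None:
        return 0.0  # correct reasoning, broken syntax: the model goes "blind"
    return 1.0 if match.group(1).strip() == gold_answer else 0.0
```

For example, `boxed_reward("so the total is \\boxed{7}", "7")` returns 1.0, while `boxed_reward("The answer is 7", "7")` returns 0.0 — which is precisely what happened when Muon smashed the syntax weights.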

4. The "Indie Muon" Hybrid

To run this on a budget, I couldn't use the standard Muon implementation (which pairs Muon with AdamW for embeddings). AdamW's optimizer states consume too much VRAM. Instead, I wrote a custom Muon-SGD Hybrid:

  • Hidden Layers: Muon (Newton-Schulz)
  • Embeddings: SGD + Momentum
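The hybrid can be sketched as a single optimizer that routes parameters by shape and name. Everything here is an assumption of the sketch, not the article's code: the "embed"-in-name heuristic for finding vocabulary layers, the quintic Newton-Schulz coefficients, and the bare-bones momentum handling.

```python
import torch

def zeropower_newton_schulz(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update (quintic coefficients assumed)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if transposed else x

class MuonSGDHybrid(torch.optim.Optimizer):
    """Muon-style step for 2D hidden weights, plain SGD+momentum elsewhere."""

    def __init__(self, named_params, lr=1e-5, momentum=0.9):
        muon, sgd = [], []
        for name, p in named_params:
            if not p.requires_grad:
                continue
            # Heuristic: embedding layers and non-2D tensors get SGD+momentum.
            (sgd if ("embed" in name or p.ndim != 2) else muon).append(p)
        super().__init__(
            [{"params": muon, "use_muon": True},
             {"params": sgd, "use_muon": False}],
            dict(lr=lr, momentum=momentum),
        )

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum" not in state:
                    state["momentum"] = torch.zeros_like(p)
                buf = state["momentum"]
                buf.mul_(group["momentum"]).add_(p.grad)
                update = zeropower_newton_schulz(buf) if group["use_muon"] else buf
                p.add_(update, alpha=-group["lr"])
```

The split is visible in the update rule itself: the Muon group's update always lands with near-unit singular values, while the SGD group's update scales with the raw gradient — the "split personality" described next.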

The Result: A model with "split personalities." The hidden layers (Muon) learned to reason diligently. The embeddings (SGD) moved in slow motion. This mismatch likely contributed to the formatting issues—the vocabulary layer was too stiff to adapt to the new syntax constraints.


The Result: The "Diligent Student"

Despite the chaos, we found a "Stability Pocket" using a low LR (1e-5) and conservative scaling. The result was a fundamental behavioral shift.

Base Model (System 1):

Q: "Lisa runs 3 miles..." A: "The answer is 25." (Lazy guess, often wrong).

Muon-RL Model (System 2):

A: "Let's calculate the miles for the first 5 days... Now add day 6..."

The model learned to be "Diligent." Even when it got the final logic trap wrong, it showed its work. It adopted a structured, bulleted list format naturally, without being prompted to do so.

Conclusion: A Tool for Creation, Not Adjustment?

Literature suggests Muon is best for "creating new capabilities" (pre-training) rather than "adjusting existing ones" (alignment). My experiments confirm this: Muon is a blunt instrument.

It is violent, chaotic, and deaf to subtlety. But if you can dampen the wind just enough, it clears away the local minima and pushes the model toward rigorous, structured reasoning faster than I expected.


Update / Follow-up:
After publishing these field notes, I ran one more constrained experiment with a lower LR and more conservative scaling.

Interestingly, the model recovered formatting consistency (\boxed{}) and produced correct, step-by-step solutions on a small evaluation set. This suggests there are narrow stability pockets where Muon + GRPO can hold both structure and correctness — though the basin still seems fragile.

This doesn’t contradict the failure modes above, but adds nuance: the issue seems to be retention and damping, not the absence of learning. Sharing in case this is useful to others experimenting in this space.
