dongseokmotif Claude Sonnet 4.6

wyldecat github-actions[bot] committed 25 days ago

Commit

e8e2c81

unverified ·

1 Parent(s): 313d56a

feat: extend QK-Clip to support MLA (MuonClip Algorithm 1) [skip-build] (#28)

* feat: extend QK-Clip to support MLA (MuonClip Algorithm 1) [skip-build]

Add MLA support to qk_clip.py following the MuonClip spec (docs/muon-clip.md):
- parse_qk_layer: recognize 'wq_b' and 'wkv_b' MLA weight names
- QKClipInfo: add is_mla, qk_nope_head_dim, qk_rope_head_dim, v_head_dim fields
- get_qk_clip_info: branch on is_mla flag in clip_config
- compute_scales: use kv_stride (qk_nope + v_head_dim) as effective head dim for wkv_b
- qk_clip: simplify signature to (p, scales, info); vectorize MLA sub-region
  scaling via tensor reshape instead of Python per-head loops:
    wq_b: q_nope rows → √γ, q_pe rows → γ
    wkv_b: k_nope rows → √γ, v rows unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: ruff format (pre-commit fix)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add built binary [skip-build]

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: WyldeCat <skan1543@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
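
The asymmetric factors in the commit message (√γ on the q_nope and k_nope rows, γ on the q_pe rows, nothing on the shared rotary key) can be checked with a short derivation. The notation here is an assumption made for illustration and does not appear in the diff: q^C and k^C stand for the per-head non-rotary parts (q_nope, k_nope), q^R for the per-head rotary query (q_pe), k^R for the rotary key shared across heads, τ for the configured threshold, and S^h_max for the largest observed logit of head h, with γ_h taken as min(1, τ / S^h_max).

```latex
\begin{aligned}
s_h  &= q_h^{C} \cdot k_h^{C} + q_h^{R} \cdot k^{R}, \qquad
\gamma_h = \min\!\left(1, \frac{\tau}{S^{h}_{\max}}\right) \\
s_h' &= \left(\sqrt{\gamma_h}\, q_h^{C}\right) \cdot \left(\sqrt{\gamma_h}\, k_h^{C}\right)
      + \left(\gamma_h\, q_h^{R}\right) \cdot k^{R}
      = \gamma_h\, s_h
\end{aligned}
```

Because k^R is shared by every head, it cannot absorb a per-head factor, so the rotary query takes the full γ_h while the non-rotary pair splits it evenly. Both terms of the logit then shrink by the same γ_h.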

This view is limited to 50 files because it contains too many changes.

Files changed (50)

build/torch210-cxx11-cu126-x86_64-linux/_ops.py +3 -3
build/torch210-cxx11-cu126-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} +1 -1
build/torch210-cxx11-cu126-x86_64-linux/cpu_offload.py +10 -13
build/torch210-cxx11-cu126-x86_64-linux/muon.py +8 -4
build/torch210-cxx11-cu126-x86_64-linux/newton_schulz.py +23 -19
build/torch210-cxx11-cu126-x86_64-linux/qk_clip.py +87 -24
build/torch210-cxx11-cu128-x86_64-linux/_ops.py +3 -3
build/torch210-cxx11-cu128-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} +1 -1
build/torch210-cxx11-cu128-x86_64-linux/cpu_offload.py +10 -13
build/torch210-cxx11-cu128-x86_64-linux/muon.py +8 -4
build/torch210-cxx11-cu128-x86_64-linux/newton_schulz.py +23 -19
build/torch210-cxx11-cu128-x86_64-linux/qk_clip.py +87 -24
build/torch210-cxx11-cu130-x86_64-linux/_ops.py +3 -3
build/torch210-cxx11-cu130-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} +1 -1
build/torch210-cxx11-cu130-x86_64-linux/cpu_offload.py +10 -13
build/torch210-cxx11-cu130-x86_64-linux/muon.py +8 -4
build/torch210-cxx11-cu130-x86_64-linux/newton_schulz.py +23 -19
build/torch210-cxx11-cu130-x86_64-linux/qk_clip.py +87 -24
build/torch210-cxx11-rocm70-x86_64-linux/_ops.py +3 -3
build/torch210-cxx11-rocm70-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} +1 -1
build/torch210-cxx11-rocm70-x86_64-linux/cpu_offload.py +10 -13
build/torch210-cxx11-rocm70-x86_64-linux/muon.py +8 -4
build/torch210-cxx11-rocm70-x86_64-linux/newton_schulz.py +23 -19
build/torch210-cxx11-rocm70-x86_64-linux/qk_clip.py +87 -24
build/torch210-cxx11-rocm71-x86_64-linux/_ops.py +3 -3
build/torch210-cxx11-rocm71-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} +1 -1
build/torch210-cxx11-rocm71-x86_64-linux/cpu_offload.py +10 -13
build/torch210-cxx11-rocm71-x86_64-linux/muon.py +8 -4
build/torch210-cxx11-rocm71-x86_64-linux/newton_schulz.py +23 -19
build/torch210-cxx11-rocm71-x86_64-linux/qk_clip.py +87 -24
build/torch28-cxx11-cu126-x86_64-linux/_ops.py +3 -3
build/torch28-cxx11-cu126-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} +1 -1
build/torch28-cxx11-cu126-x86_64-linux/cpu_offload.py +10 -13
build/torch28-cxx11-cu126-x86_64-linux/muon.py +8 -4
build/torch28-cxx11-cu126-x86_64-linux/newton_schulz.py +23 -19
build/torch28-cxx11-cu126-x86_64-linux/qk_clip.py +87 -24
build/torch28-cxx11-cu128-x86_64-linux/_ops.py +3 -3
build/torch28-cxx11-cu128-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} +1 -1
build/torch28-cxx11-cu128-x86_64-linux/cpu_offload.py +10 -13
build/torch28-cxx11-cu128-x86_64-linux/muon.py +8 -4
build/torch28-cxx11-cu128-x86_64-linux/newton_schulz.py +23 -19
build/torch28-cxx11-cu128-x86_64-linux/qk_clip.py +87 -24
build/torch28-cxx11-cu129-x86_64-linux/_ops.py +3 -3
build/torch28-cxx11-cu129-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} +1 -1
build/torch28-cxx11-cu129-x86_64-linux/cpu_offload.py +10 -13
build/torch28-cxx11-cu129-x86_64-linux/muon.py +8 -4
build/torch28-cxx11-cu129-x86_64-linux/newton_schulz.py +23 -19
build/torch28-cxx11-cu129-x86_64-linux/qk_clip.py +87 -24
build/torch28-cxx11-rocm63-x86_64-linux/_ops.py +3 -3
build/torch28-cxx11-rocm63-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} +1 -1

build/torch210-cxx11-cu126-x86_64-linux/_ops.py CHANGED

@@ -1,9 +1,9 @@
 import torch
-from . import _optimizer_38f9b8e_dirty
-ops = torch.ops._optimizer_38f9b8e_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"_optimizer_38f9b8e_dirty::{op_name}"

 import torch
+from . import _optimizer_8d53b78_dirty
+ops = torch.ops._optimizer_8d53b78_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
+    return f"_optimizer_8d53b78_dirty::{op_name}"

build/torch210-cxx11-cu126-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cb6163428ce86500d61c2b765eecd7eb6f31c092066278e1d1af7a0848dc5126
 size 1940944

 version https://git-lfs.github.com/spec/v1
+oid sha256:075fc73dbb2750aed7598cc3e13b593b6b1e7a78a78491e1b852fbd2a9af8f8d
 size 1940944

build/torch210-cxx11-cu126-x86_64-linux/cpu_offload.py CHANGED

@@ -93,10 +93,7 @@ class CPUOffloadPool:
                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
-            cpu_flat = torch.empty(off,
-                                   dtype=dtype,
-                                   device="cpu",
-                                   pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
@@ -140,8 +137,7 @@ class CPUOffloadPool:
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
-                    cpu_flat[off:off + n].copy_(local.reshape(-1),
-                                                non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
@@ -159,8 +155,10 @@ class CPUOffloadPool:
                 )
         if not self._logged:
-            logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
-                        offloaded_bytes / (1024**2))
     # ------------------------------------------------------------------
     def reload(self):
@@ -198,12 +196,11 @@ class CPUOffloadPool:
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
-                local.reshape(-1).copy_(cpu_flat[off:off + n],
-                                        non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
-            logger.info("[CPUOffload] Reloaded %.2f MB (CPU → GPU)",
-                        reloaded_bytes / (1024**2))
-            self._logged = True

                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
+            cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
+                    cpu_flat[off : off + n].copy_(local.reshape(-1), non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
                 )
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
+                offloaded_bytes / (1024**2),
+            )
     # ------------------------------------------------------------------
     def reload(self):
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
+                local.reshape(-1).copy_(cpu_flat[off : off + n], non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Reloaded %.2f MB (CPU → GPU)", reloaded_bytes / (1024**2)
+            )

build/torch210-cxx11-cu126-x86_64-linux/muon.py CHANGED

@@ -360,7 +360,7 @@ class Muon(torch.optim.Optimizer):
             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
-                qk_clip(p, scales_full, qk_clip_state.head_dim)
     def distributed_muon(
         self,
@@ -407,7 +407,7 @@ class Muon(torch.optim.Optimizer):
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
-                    qk_clip(p, scales_full, qk_clip_state.head_dim)
             if not dtensor_params:
                 return
@@ -1050,12 +1050,16 @@ class Muon(torch.optim.Optimizer):
     def state_dict(self) -> dict:
         if self.cpu_offload:
-            raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
-            raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
+                qk_clip(p, scales_full, qk_clip_state)
     def distributed_muon(
         self,
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
+                    qk_clip(p, scales_full, qk_clip_state)
             if not dtensor_params:
                 return
     def state_dict(self) -> dict:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save."
+            )
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load."
+            )
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

build/torch210-cxx11-cu126-x86_64-linux/newton_schulz.py CHANGED

@@ -32,27 +32,30 @@ def _optimal_quintic(l, u, max_iter=1000):
     E = inf
     for _ in range(max_iter):
         old_E = E
-        LHS = np.array([
-            [l, l**3, l**5, 1],
-            [q, q**3, q**5, -1],
-            [r, r**3, r**5, 1],
-            [u, u**3, u**5, -1],
-        ])
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
-            raise ValueError(f"_optimal_quintic: non-finite solve result "
-                             f"a={a}, b={b}, c={c}, E={E}")
         q, r = np.sqrt(
-            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) /
-            (10 * c))
         if not np.all(np.isfinite([q, r])):
-            raise ValueError(
-                f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
-            f"_optimal_quintic: did not converge after {max_iter} iterations")
     return float(a), float(b), float(c)
@@ -111,10 +114,9 @@ def _optimal_composition(l, num_iters, safety_factor_eps=0, cushion=0):
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
-_coeffs_list = _optimal_composition(l=1e-3,
-                                    num_iters=10,
-                                    safety_factor_eps=1e-2,
-                                    cushion=0.02)
 # This code is adapted from:
@@ -148,7 +150,8 @@ def _zeropower_via_newtonschulz5(G, steps):
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
@@ -183,7 +186,8 @@ def _zeropower_via_newtonschulz5_batched(G, steps):
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

     E = inf
     for _ in range(max_iter):
         old_E = E
+        LHS = np.array(
+            [
+                [l, l**3, l**5, 1],
+                [q, q**3, q**5, -1],
+                [r, r**3, r**5, 1],
+                [u, u**3, u**5, -1],
+            ]
+        )
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
+            raise ValueError(
+                f"_optimal_quintic: non-finite solve result a={a}, b={b}, c={c}, E={E}"
+            )
         q, r = np.sqrt(
+            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) / (10 * c)
+        )
         if not np.all(np.isfinite([q, r])):
+            raise ValueError(f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
+            f"_optimal_quintic: did not converge after {max_iter} iterations"
+        )
     return float(a), float(b), float(c)
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
+_coeffs_list = _optimal_composition(
+    l=1e-3, num_iters=10, safety_factor_eps=1e-2, cushion=0.02
+)
 # This code is adapted from:
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))
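
The `hs` expression in both Newton-Schulz functions builds a coefficient schedule of exactly `steps` entries: it truncates the precomputed list when `steps` is small and repeats the last triple when `steps` is larger than the list. A minimal sketch of that behaviour follows. The helper name `build_schedule` and the coefficient values are placeholders invented for this illustration; they are not the values produced by `_optimal_composition`.

```python
from itertools import repeat


def build_schedule(coeffs, steps):
    """Return `steps` (a, b, c) triples taken from `coeffs`.

    Mirrors the `hs = _coeffs_list[:steps] + list(repeat(...))` expression:
    if `steps` is at most len(coeffs) the list is truncated, otherwise the
    last triple is repeated to fill the remaining steps. `repeat` with a
    zero or negative count yields nothing, so no branch is needed.
    """
    return coeffs[:steps] + list(repeat(coeffs[-1], steps - len(coeffs)))


# Placeholder triples, NOT real Newton-Schulz coefficients.
coeffs = [(3.0, -4.0, 2.0), (2.5, -3.0, 1.5), (2.0, -2.0, 1.0)]

short = build_schedule(coeffs, 2)  # truncated to the first two triples
long = build_schedule(coeffs, 5)   # last triple repeated twice

print(len(short), len(long))
```

Repeating the final triple is a reasonable default under the assumption that the last precomputed step is the one tuned for singular values already close to 1.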

build/torch210-cxx11-cu126-x86_64-linux/qk_clip.py CHANGED

@@ -13,7 +13,11 @@ logger = logging.getLogger(__name__)
 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
-    ('wq', 'wk', 'q_proj', 'k_proj') and return (kind, layer_index).
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
@@ -23,6 +27,8 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
@@ -37,7 +43,7 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
             layer_idx = int(part)
             break
-    if kind in ('wq', 'wk', 'q_proj', 'k_proj'):
         return kind, layer_idx
     return None, -1
@@ -46,18 +52,26 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
-    kind: str | None  # 'wq'/'q_proj' or 'wk'/'k_proj' or None
     indices: list[int]  # which heads to consider for clipping
-    head_dim: int  # from config
     threshold: float  # from config
     logit: torch.Tensor | None
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
@@ -70,31 +84,48 @@ def get_qk_clip_info(clip_config, n, qk_logits):
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
-        indices_key = 'q_indices' if 'q' in kind else 'k_indices'
-        indices = clip_config.get(indices_key, []) or []
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
-    return QKClipInfo(
-        kind=kind,
-        indices=indices,
-        head_dim=head_dim,
-        threshold=threshold,
-        logit=logit,
-    )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
-    Returns scales tensor if any head exceeds threshold, else None.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
@@ -118,18 +149,50 @@ def compute_scales(p, qk_clip_state):
     if not head_scales:
         return None
-    H_global = p.shape[0] // head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
-def qk_clip(p, scales, head_dim):
-    """Apply per-head scaling to a Q/K projection weight matrix."""
-    if isinstance(p, torch.nn.Parameter):
-        W = p.data.view(-1, head_dim, p.data.shape[1])
-        W.mul_(scales.view(-1, 1, 1))
-    else:
-        W = p.view(-1, head_dim, p.shape[1])
-        W.mul_(scales.view(-1, 1, 1))

 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
+    and return (kind, layer_index).
+    Supported kinds:
+        MHA/GQA: 'wq', 'wk', 'q_proj', 'k_proj'
+        MLA:     'wq_b' (Q up-proj), 'wkv_b' (KV up-proj)
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
+        'model.1.attn.wq_b.weight'    -> ('wq_b', 1)
+        'model.0.attn.wkv_b.weight'   -> ('wkv_b', 0)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
             layer_idx = int(part)
             break
+    if kind in ('wq', 'wk', 'q_proj', 'k_proj', 'wq_b', 'wkv_b'):
         return kind, layer_idx
     return None, -1
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
+    kind: str | None  # 'wq'/'q_proj'/'wq_b' or 'wk'/'k_proj'/'wkv_b' or None
     indices: list[int]  # which heads to consider for clipping
+    head_dim: int  # from config (qk_head_dim for MLA wq_b)
     threshold: float  # from config
     logit: torch.Tensor | None
+    # MLA-specific fields
+    is_mla: bool = False
+    qk_nope_head_dim: int = 0
+    qk_rope_head_dim: int = 0
+    v_head_dim: int = 0
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
+            MHA/GQA keys: head_dim, threshold, q_indices, k_indices
+            MLA extra keys: is_mla=True, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
+    is_mla = clip_config.get('is_mla', False)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
+        if kind in ('wq_b', 'wq', 'q_proj'):
+            indices = clip_config.get('q_indices', []) or []
+        elif kind in ('wkv_b', 'wk', 'k_proj'):
+            indices = clip_config.get('k_indices', []) or []
+    if is_mla:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+            is_mla=True,
+            qk_nope_head_dim=clip_config['qk_nope_head_dim'],
+            qk_rope_head_dim=clip_config['qk_rope_head_dim'],
+            v_head_dim=clip_config['v_head_dim'],
+        )
+    else:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+        )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
+    Returns scales tensor (√γ per head) if any head exceeds threshold, else None.
+    For MLA wkv_b, effective row stride is qk_nope_head_dim + v_head_dim.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
     if not head_scales:
         return None
+    # For MLA wkv_b, each KV head spans qk_nope_head_dim + v_head_dim rows
+    if qk_clip_state.is_mla and kind == 'wkv_b':
+        effective_head_dim = qk_clip_state.qk_nope_head_dim + qk_clip_state.v_head_dim
+    else:
+        effective_head_dim = head_dim
+    H_global = p.shape[0] // effective_head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
+def qk_clip(p, scales, info):
+    """Apply per-head scaling to a Q/K projection weight matrix.
+    Args:
+        p: Parameter (nn.Parameter or raw tensor).
+        scales: [n_heads] tensor, each element = √γ_h.
+        info: QKClipInfo with kind, head_dim, and MLA sub-head dimensions.
+    MLA sub-region scaling per Algorithm 1 (MuonClip):
+        wq_b: q_nope rows → √γ,  q_pe rows → γ
+        wkv_b: k_nope rows → √γ, v rows → unchanged
+    """
+    W = p.data if isinstance(p, torch.nn.Parameter) else p
+    if not info.is_mla:
+        # MHA/GQA: uniform √γ applied to all rows in each head
+        W.view(-1, info.head_dim, W.shape[1]).mul_(scales.view(-1, 1, 1))
+        return
+    # MLA: vectorized sub-region scaling within each head
+    if info.kind == 'wq_b':
+        qk_nope = info.qk_nope_head_dim
+        qk_head_dim = qk_nope + info.qk_rope_head_dim
+        W_3d = W.view(-1, qk_head_dim, W.shape[1])  # [H, qk_head_dim, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # q_nope → √γ
+        W_3d[:, qk_nope:, :].mul_((scales * scales).view(-1, 1,
+                                                         1))  # q_pe   → γ
+    elif info.kind == 'wkv_b':
+        qk_nope = info.qk_nope_head_dim
+        kv_stride = qk_nope + info.v_head_dim
+        W_3d = W.view(-1, kv_stride, W.shape[1])  # [H, kv_stride, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # k_nope → √γ
+        # v rows: not touched (k_R shared rotary unchanged)
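
The MLA branch of `qk_clip` above can be illustrated without PyTorch. The sketch below is a plain-Python analogue that uses nested lists in place of tensors and explicit loops in place of the reshape, so it shows the row layout rather than the vectorized implementation. The function name `clip_mla_rows`, the toy dimensions, and the choice to pass γ directly (the code in the diff passes √γ in `scales` and squares it for the q_pe rows) are all assumptions made for this example.

```python
import math


def clip_mla_rows(weight, gammas, nope_dim, tail_dim, kind):
    """Scale a row-major weight matrix in place, head by head.

    weight: list of rows, n_heads * (nope_dim + tail_dim) rows in total.
    gammas: one clipping factor per head (1.0 leaves the head untouched).
    kind:   'wq_b' (tail rows are q_pe) or 'wkv_b' (tail rows are v).
    """
    stride = nope_dim + tail_dim  # rows owned by one head
    if len(weight) != len(gammas) * stride:
        raise ValueError("row count must equal n_heads * (nope_dim + tail_dim)")
    for h, gamma in enumerate(gammas):
        root = math.sqrt(gamma)
        for r in range(stride):
            if r < nope_dim:
                factor = root   # q_nope / k_nope rows -> sqrt(gamma)
            elif kind == "wq_b":
                factor = gamma  # q_pe rows -> gamma
            else:
                factor = 1.0    # v rows -> unchanged
            row = weight[h * stride + r]
            for c in range(len(row)):
                row[c] *= factor
    return weight


# Toy dimensions, chosen only to keep the example readable.
n_heads, nope_dim, rope_dim, v_dim, in_dim = 2, 2, 1, 1, 2
gammas = [0.25, 1.0]  # head 0 exceeded the threshold, head 1 did not

wq_b = [[1.0] * in_dim for _ in range(n_heads * (nope_dim + rope_dim))]
wkv_b = [[1.0] * in_dim for _ in range(n_heads * (nope_dim + v_dim))]

clip_mla_rows(wq_b, gammas, nope_dim, rope_dim, "wq_b")
clip_mla_rows(wkv_b, gammas, nope_dim, v_dim, "wkv_b")

print(wq_b[:3])   # head 0 of wq_b: two q_nope rows, then one q_pe row
print(wkv_b[:3])  # head 0 of wkv_b: two k_nope rows, then one v row
```

The stride used for `wkv_b` is `nope_dim + v_dim`, which matches the `effective_head_dim` that `compute_scales` uses when it derives the head count from `p.shape[0]`.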

build/torch210-cxx11-cu128-x86_64-linux/_ops.py CHANGED

@@ -1,9 +1,9 @@
 import torch
-from . import _optimizer_38f9b8e_dirty
-ops = torch.ops._optimizer_38f9b8e_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"_optimizer_38f9b8e_dirty::{op_name}"

 import torch
+from . import _optimizer_8d53b78_dirty
+ops = torch.ops._optimizer_8d53b78_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
+    return f"_optimizer_8d53b78_dirty::{op_name}"

build/torch210-cxx11-cu128-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:017323d479e8fbd3ed1f550f95fc4ba9f2e304dbe9351c0eaa75543ebe775e18
 size 2004144

 version https://git-lfs.github.com/spec/v1
+oid sha256:2af397ae01c8c01ee0e879f6812bd9df55d152afbcc6713f5c1987d5bce7793b
 size 2004144

build/torch210-cxx11-cu128-x86_64-linux/cpu_offload.py CHANGED

@@ -93,10 +93,7 @@ class CPUOffloadPool:
                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
-            cpu_flat = torch.empty(off,
-                                   dtype=dtype,
-                                   device="cpu",
-                                   pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
@@ -140,8 +137,7 @@ class CPUOffloadPool:
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
-                    cpu_flat[off:off + n].copy_(local.reshape(-1),
-                                                non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
@@ -159,8 +155,10 @@ class CPUOffloadPool:
                 )
         if not self._logged:
-            logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
-                        offloaded_bytes / (1024**2))
     # ------------------------------------------------------------------
     def reload(self):
@@ -198,12 +196,11 @@ class CPUOffloadPool:
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
-                local.reshape(-1).copy_(cpu_flat[off:off + n],
-                                        non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
-            logger.info("[CPUOffload] Reloaded %.2f MB (CPU → GPU)",
-                        reloaded_bytes / (1024**2))
-            self._logged = True

                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
+            cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
+                    cpu_flat[off : off + n].copy_(local.reshape(-1), non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
                 )
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
+                offloaded_bytes / (1024**2),
+            )
     # ------------------------------------------------------------------
     def reload(self):
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
+                local.reshape(-1).copy_(cpu_flat[off : off + n], non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Reloaded %.2f MB (CPU → GPU)", reloaded_bytes / (1024**2)
+            )

build/torch210-cxx11-cu128-x86_64-linux/muon.py CHANGED

@@ -360,7 +360,7 @@ class Muon(torch.optim.Optimizer):
             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
-                qk_clip(p, scales_full, qk_clip_state.head_dim)
     def distributed_muon(
         self,
@@ -407,7 +407,7 @@ class Muon(torch.optim.Optimizer):
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
-                    qk_clip(p, scales_full, qk_clip_state.head_dim)
             if not dtensor_params:
                 return
@@ -1050,12 +1050,16 @@ class Muon(torch.optim.Optimizer):
     def state_dict(self) -> dict:
         if self.cpu_offload:
-            raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
-            raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
+                qk_clip(p, scales_full, qk_clip_state)
     def distributed_muon(
         self,
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
+                    qk_clip(p, scales_full, qk_clip_state)
             if not dtensor_params:
                 return
     def state_dict(self) -> dict:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save."
+            )
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load."
+            )
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

build/torch210-cxx11-cu128-x86_64-linux/newton_schulz.py CHANGED

@@ -32,27 +32,30 @@ def _optimal_quintic(l, u, max_iter=1000):
     E = inf
     for _ in range(max_iter):
         old_E = E
-        LHS = np.array([
-            [l, l**3, l**5, 1],
-            [q, q**3, q**5, -1],
-            [r, r**3, r**5, 1],
-            [u, u**3, u**5, -1],
-        ])
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
-            raise ValueError(f"_optimal_quintic: non-finite solve result "
-                             f"a={a}, b={b}, c={c}, E={E}")
         q, r = np.sqrt(
-            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) /
-            (10 * c))
         if not np.all(np.isfinite([q, r])):
-            raise ValueError(
-                f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
-            f"_optimal_quintic: did not converge after {max_iter} iterations")
     return float(a), float(b), float(c)
@@ -111,10 +114,9 @@ def _optimal_composition(l, num_iters, safety_factor_eps=0, cushion=0):
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
-_coeffs_list = _optimal_composition(l=1e-3,
-                                    num_iters=10,
-                                    safety_factor_eps=1e-2,
-                                    cushion=0.02)
 # This code is adapted from:
@@ -148,7 +150,8 @@ def _zeropower_via_newtonschulz5(G, steps):
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
@@ -183,7 +186,8 @@ def _zeropower_via_newtonschulz5_batched(G, steps):
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

     E = inf
     for _ in range(max_iter):
         old_E = E
+        LHS = np.array(
+            [
+                [l, l**3, l**5, 1],
+                [q, q**3, q**5, -1],
+                [r, r**3, r**5, 1],
+                [u, u**3, u**5, -1],
+            ]
+        )
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
+            raise ValueError(
+                f"_optimal_quintic: non-finite solve result a={a}, b={b}, c={c}, E={E}"
+            )
         q, r = np.sqrt(
+            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) / (10 * c)
+        )
         if not np.all(np.isfinite([q, r])):
+            raise ValueError(f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
+            f"_optimal_quintic: did not converge after {max_iter} iterations"
+        )
     return float(a), float(b), float(c)
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
+_coeffs_list = _optimal_composition(
+    l=1e-3, num_iters=10, safety_factor_eps=1e-2, cushion=0.02
+)
 # This code is adapted from:
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))
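The `hs` expression reformatted in both Newton-Schulz variants truncates the precomputed coefficient list when `steps` is small and repeats its last entry when `steps` exceeds the list length. A minimal sketch of that behaviour follows; the helper name `pad_schedule` and the coefficient triples are made up for illustration and are not the values produced by `_optimal_composition`.

```python
from itertools import repeat


def pad_schedule(coeffs, steps):
    """Return exactly `steps` coefficient triples (hypothetical helper).

    Mirrors the `hs = _coeffs_list[:steps] + list(repeat(...))` expression:
    slicing handles steps <= len(coeffs); repeat() pads the tail otherwise.
    repeat() with a non-positive count yields nothing, so no guard is needed.
    """
    return coeffs[:steps] + list(repeat(coeffs[-1], steps - len(coeffs)))


# Illustrative (a, b, c) triples only; not real Polar Express coefficients.
coeffs = [(4.0, -6.0, 2.0), (3.0, -4.0, 1.5), (2.0, -1.5, 0.5)]

short = pad_schedule(coeffs, 2)  # truncated to the first two triples
long = pad_schedule(coeffs, 5)   # last triple repeated twice
```

With this shape, callers can request more iterations than were precomputed without an index error; the extra iterations simply reuse the final triple.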

build/torch210-cxx11-cu128-x86_64-linux/qk_clip.py CHANGED

@@ -13,7 +13,11 @@ logger = logging.getLogger(__name__)
 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
-    ('wq', 'wk', 'q_proj', 'k_proj') and return (kind, layer_index).
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
@@ -23,6 +27,8 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
@@ -37,7 +43,7 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
             layer_idx = int(part)
             break
-    if kind in ('wq', 'wk', 'q_proj', 'k_proj'):
         return kind, layer_idx
     return None, -1
@@ -46,18 +52,26 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
-    kind: str | None  # 'wq'/'q_proj' or 'wk'/'k_proj' or None
     indices: list[int]  # which heads to consider for clipping
-    head_dim: int  # from config
     threshold: float  # from config
     logit: torch.Tensor | None
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
@@ -70,31 +84,48 @@ def get_qk_clip_info(clip_config, n, qk_logits):
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
-        indices_key = 'q_indices' if 'q' in kind else 'k_indices'
-        indices = clip_config.get(indices_key, []) or []
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
-    return QKClipInfo(
-        kind=kind,
-        indices=indices,
-        head_dim=head_dim,
-        threshold=threshold,
-        logit=logit,
-    )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
-    Returns scales tensor if any head exceeds threshold, else None.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
@@ -118,18 +149,50 @@ def compute_scales(p, qk_clip_state):
     if not head_scales:
         return None
-    H_global = p.shape[0] // head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
-def qk_clip(p, scales, head_dim):
-    """Apply per-head scaling to a Q/K projection weight matrix."""
-    if isinstance(p, torch.nn.Parameter):
-        W = p.data.view(-1, head_dim, p.data.shape[1])
-        W.mul_(scales.view(-1, 1, 1))
-    else:
-        W = p.view(-1, head_dim, p.shape[1])
-        W.mul_(scales.view(-1, 1, 1))

 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
+    and return (kind, layer_index).
+    Supported kinds:
+        MHA/GQA: 'wq', 'wk', 'q_proj', 'k_proj'
+        MLA:     'wq_b' (Q up-proj), 'wkv_b' (KV up-proj)
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
+        'model.1.attn.wq_b.weight'    -> ('wq_b', 1)
+        'model.0.attn.wkv_b.weight'   -> ('wkv_b', 0)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
             layer_idx = int(part)
             break
+    if kind in ('wq', 'wk', 'q_proj', 'k_proj', 'wq_b', 'wkv_b'):
         return kind, layer_idx
     return None, -1
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
+    kind: str | None  # 'wq'/'q_proj'/'wq_b' or 'wk'/'k_proj'/'wkv_b' or None
     indices: list[int]  # which heads to consider for clipping
+    head_dim: int  # from config (qk_head_dim for MLA wq_b)
     threshold: float  # from config
     logit: torch.Tensor | None
+    # MLA-specific fields
+    is_mla: bool = False
+    qk_nope_head_dim: int = 0
+    qk_rope_head_dim: int = 0
+    v_head_dim: int = 0
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
+            MHA/GQA keys: head_dim, threshold, q_indices, k_indices
+            MLA extra keys: is_mla=True, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
+    is_mla = clip_config.get('is_mla', False)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
+        if kind in ('wq_b', 'wq', 'q_proj'):
+            indices = clip_config.get('q_indices', []) or []
+        elif kind in ('wkv_b', 'wk', 'k_proj'):
+            indices = clip_config.get('k_indices', []) or []
+    if is_mla:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+            is_mla=True,
+            qk_nope_head_dim=clip_config['qk_nope_head_dim'],
+            qk_rope_head_dim=clip_config['qk_rope_head_dim'],
+            v_head_dim=clip_config['v_head_dim'],
+        )
+    else:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+        )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
+    Returns scales tensor (√γ per head) if any head exceeds threshold, else None.
+    For MLA wkv_b, effective row stride is qk_nope_head_dim + v_head_dim.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
     if not head_scales:
         return None
+    # For MLA wkv_b, each KV head spans qk_nope_head_dim + v_head_dim rows
+    if qk_clip_state.is_mla and kind == 'wkv_b':
+        effective_head_dim = qk_clip_state.qk_nope_head_dim + qk_clip_state.v_head_dim
+    else:
+        effective_head_dim = head_dim
+    H_global = p.shape[0] // effective_head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
+def qk_clip(p, scales, info):
+    """Apply per-head scaling to a Q/K projection weight matrix.
+    Args:
+        p: Parameter (nn.Parameter or raw tensor).
+        scales: [n_heads] tensor, each element = √γ_h.
+        info: QKClipInfo with kind, head_dim, and MLA sub-head dimensions.
+    MLA sub-region scaling per Algorithm 1 (MuonClip):
+        wq_b: q_nope rows → √γ,  q_pe rows → γ
+        wkv_b: k_nope rows → √γ, v rows → unchanged
+    """
+    W = p.data if isinstance(p, torch.nn.Parameter) else p
+    if not info.is_mla:
+        # MHA/GQA: uniform √γ applied to all rows in each head
+        W.view(-1, info.head_dim, W.shape[1]).mul_(scales.view(-1, 1, 1))
+        return
+    # MLA: vectorized sub-region scaling within each head
+    if info.kind == 'wq_b':
+        qk_nope = info.qk_nope_head_dim
+        qk_head_dim = qk_nope + info.qk_rope_head_dim
+        W_3d = W.view(-1, qk_head_dim, W.shape[1])  # [H, qk_head_dim, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # q_nope → √γ
+        W_3d[:, qk_nope:, :].mul_((scales * scales).view(-1, 1,
+                                                         1))  # q_pe   → γ
+    elif info.kind == 'wkv_b':
+        qk_nope = info.qk_nope_head_dim
+        kv_stride = qk_nope + info.v_head_dim
+        W_3d = W.view(-1, kv_stride, W.shape[1])  # [H, kv_stride, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # k_nope → √γ
+        # v rows: not touched (k_R shared rotary unchanged)
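The asymmetric factors in the MLA branch (√γ on `q_nope` and `k_nope`, γ on `q_pe`, nothing on the shared rotary key) can be checked with plain arithmetic. The sketch below assumes a single head whose pre-softmax score is the sum of a content term and a rotary term, and uses made-up vectors and a hypothetical γ; it is an illustration of the scaling rule described in the docstring, not the repository's code.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mla_logit(q_nope, q_pe, k_nope, k_pe):
    # Assumed form of one head's score: content part plus rotary part.
    return dot(q_nope, k_nope) + dot(q_pe, k_pe)


gamma = 0.25             # hypothetical per-head clip factor in (0, 1]
root = math.sqrt(gamma)  # the value stored per head in `scales`

# Toy projections for one token pair; k_pe stands in for the shared rotary key.
q_nope, q_pe = [1.0, 2.0], [3.0]
k_nope, k_pe = [0.5, -1.0], [2.0]

before = mla_logit(q_nope, q_pe, k_nope, k_pe)
after = mla_logit(
    [root * x for x in q_nope],  # wq_b: q_nope rows scaled by sqrt(gamma)
    [gamma * x for x in q_pe],   # wq_b: q_pe rows scaled by gamma
    [root * x for x in k_nope],  # wkv_b: k_nope rows scaled by sqrt(gamma)
    k_pe,                        # shared rotary key left untouched
)

# Both terms shrink by gamma, so the whole logit does.
assert math.isclose(after, gamma * before)
```

Because the rotary key is not rescaled, the query side has to absorb the full factor γ in the rotary term, while the content term splits it evenly as √γ · √γ.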

build/torch210-cxx11-cu130-x86_64-linux/_ops.py CHANGED

@@ -1,9 +1,9 @@
 import torch
-from . import _optimizer_38f9b8e_dirty
-ops = torch.ops._optimizer_38f9b8e_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"_optimizer_38f9b8e_dirty::{op_name}"

 import torch
+from . import _optimizer_8d53b78_dirty
+ops = torch.ops._optimizer_8d53b78_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
+    return f"_optimizer_8d53b78_dirty::{op_name}"

build/torch210-cxx11-cu130-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:783a161f2d28e4244226c9d6e59ac33f74f7a79aad17c06e8ce027dd6182e03c
 size 2004728

 version https://git-lfs.github.com/spec/v1
+oid sha256:45eef069a7caa85678cd1e05f0c60c5cfbc676dc93a1bcb31e55eb34730aa469
 size 2004728

build/torch210-cxx11-cu130-x86_64-linux/cpu_offload.py CHANGED

@@ -93,10 +93,7 @@ class CPUOffloadPool:
                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
-            cpu_flat = torch.empty(off,
-                                   dtype=dtype,
-                                   device="cpu",
-                                   pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
@@ -140,8 +137,7 @@ class CPUOffloadPool:
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
-                    cpu_flat[off:off + n].copy_(local.reshape(-1),
-                                                non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
@@ -159,8 +155,10 @@ class CPUOffloadPool:
                 )
         if not self._logged:
-            logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
-                        offloaded_bytes / (1024**2))
     # ------------------------------------------------------------------
     def reload(self):
@@ -198,12 +196,11 @@ class CPUOffloadPool:
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
-                local.reshape(-1).copy_(cpu_flat[off:off + n],
-                                        non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
-            logger.info("[CPUOffload] Reloaded %.2f MB (CPU → GPU)",
-                        reloaded_bytes / (1024**2))
-            self._logged = True

                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
+            cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
+                    cpu_flat[off : off + n].copy_(local.reshape(-1), non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
                 )
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
+                offloaded_bytes / (1024**2),
+            )
     # ------------------------------------------------------------------
     def reload(self):
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
+                local.reshape(-1).copy_(cpu_flat[off : off + n], non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Reloaded %.2f MB (CPU → GPU)", reloaded_bytes / (1024**2)
+            )

build/torch210-cxx11-cu130-x86_64-linux/muon.py CHANGED

@@ -360,7 +360,7 @@ class Muon(torch.optim.Optimizer):
             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
-                qk_clip(p, scales_full, qk_clip_state.head_dim)
     def distributed_muon(
         self,
@@ -407,7 +407,7 @@ class Muon(torch.optim.Optimizer):
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
-                    qk_clip(p, scales_full, qk_clip_state.head_dim)
             if not dtensor_params:
                 return
@@ -1050,12 +1050,16 @@ class Muon(torch.optim.Optimizer):
     def state_dict(self) -> dict:
         if self.cpu_offload:
-            raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
-            raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
+                qk_clip(p, scales_full, qk_clip_state)
     def distributed_muon(
         self,
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
+                    qk_clip(p, scales_full, qk_clip_state)
             if not dtensor_params:
                 return
     def state_dict(self) -> dict:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save."
+            )
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load."
+            )
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

build/torch210-cxx11-cu130-x86_64-linux/newton_schulz.py CHANGED

@@ -32,27 +32,30 @@ def _optimal_quintic(l, u, max_iter=1000):
     E = inf
     for _ in range(max_iter):
         old_E = E
-        LHS = np.array([
-            [l, l**3, l**5, 1],
-            [q, q**3, q**5, -1],
-            [r, r**3, r**5, 1],
-            [u, u**3, u**5, -1],
-        ])
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
-            raise ValueError(f"_optimal_quintic: non-finite solve result "
-                             f"a={a}, b={b}, c={c}, E={E}")
         q, r = np.sqrt(
-            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) /
-            (10 * c))
         if not np.all(np.isfinite([q, r])):
-            raise ValueError(
-                f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
-            f"_optimal_quintic: did not converge after {max_iter} iterations")
     return float(a), float(b), float(c)
@@ -111,10 +114,9 @@ def _optimal_composition(l, num_iters, safety_factor_eps=0, cushion=0):
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
-_coeffs_list = _optimal_composition(l=1e-3,
-                                    num_iters=10,
-                                    safety_factor_eps=1e-2,
-                                    cushion=0.02)
 # This code is adapted from:
@@ -148,7 +150,8 @@ def _zeropower_via_newtonschulz5(G, steps):
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
@@ -183,7 +186,8 @@ def _zeropower_via_newtonschulz5_batched(G, steps):
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

     E = inf
     for _ in range(max_iter):
         old_E = E
+        LHS = np.array(
+            [
+                [l, l**3, l**5, 1],
+                [q, q**3, q**5, -1],
+                [r, r**3, r**5, 1],
+                [u, u**3, u**5, -1],
+            ]
+        )
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
+            raise ValueError(
+                f"_optimal_quintic: non-finite solve result a={a}, b={b}, c={c}, E={E}"
+            )
         q, r = np.sqrt(
+            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) / (10 * c)
+        )
         if not np.all(np.isfinite([q, r])):
+            raise ValueError(f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
+            f"_optimal_quintic: did not converge after {max_iter} iterations"
+        )
     return float(a), float(b), float(c)
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
+_coeffs_list = _optimal_composition(
+    l=1e-3, num_iters=10, safety_factor_eps=1e-2, cushion=0.02
+)
 # This code is adapted from:
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

build/torch210-cxx11-cu130-x86_64-linux/qk_clip.py CHANGED

@@ -13,7 +13,11 @@ logger = logging.getLogger(__name__)
 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
-    ('wq', 'wk', 'q_proj', 'k_proj') and return (kind, layer_index).
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
@@ -23,6 +27,8 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
@@ -37,7 +43,7 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
             layer_idx = int(part)
             break
-    if kind in ('wq', 'wk', 'q_proj', 'k_proj'):
         return kind, layer_idx
     return None, -1
@@ -46,18 +52,26 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
-    kind: str | None  # 'wq'/'q_proj' or 'wk'/'k_proj' or None
     indices: list[int]  # which heads to consider for clipping
-    head_dim: int  # from config
     threshold: float  # from config
     logit: torch.Tensor | None
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
@@ -70,31 +84,48 @@ def get_qk_clip_info(clip_config, n, qk_logits):
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
-        indices_key = 'q_indices' if 'q' in kind else 'k_indices'
-        indices = clip_config.get(indices_key, []) or []
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
-    return QKClipInfo(
-        kind=kind,
-        indices=indices,
-        head_dim=head_dim,
-        threshold=threshold,
-        logit=logit,
-    )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
-    Returns scales tensor if any head exceeds threshold, else None.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
@@ -118,18 +149,50 @@ def compute_scales(p, qk_clip_state):
     if not head_scales:
         return None
-    H_global = p.shape[0] // head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
-def qk_clip(p, scales, head_dim):
-    """Apply per-head scaling to a Q/K projection weight matrix."""
-    if isinstance(p, torch.nn.Parameter):
-        W = p.data.view(-1, head_dim, p.data.shape[1])
-        W.mul_(scales.view(-1, 1, 1))
-    else:
-        W = p.view(-1, head_dim, p.shape[1])
-        W.mul_(scales.view(-1, 1, 1))

 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
+    and return (kind, layer_index).
+    Supported kinds:
+        MHA/GQA: 'wq', 'wk', 'q_proj', 'k_proj'
+        MLA:     'wq_b' (Q up-proj), 'wkv_b' (KV up-proj)
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
+        'model.1.attn.wq_b.weight'    -> ('wq_b', 1)
+        'model.0.attn.wkv_b.weight'   -> ('wkv_b', 0)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
             layer_idx = int(part)
             break
+    if kind in ('wq', 'wk', 'q_proj', 'k_proj', 'wq_b', 'wkv_b'):
         return kind, layer_idx
     return None, -1
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
+    kind: str | None  # 'wq'/'q_proj'/'wq_b' or 'wk'/'k_proj'/'wkv_b' or None
     indices: list[int]  # which heads to consider for clipping
+    head_dim: int  # from config (qk_head_dim for MLA wq_b)
     threshold: float  # from config
     logit: torch.Tensor | None
+    # MLA-specific fields
+    is_mla: bool = False
+    qk_nope_head_dim: int = 0
+    qk_rope_head_dim: int = 0
+    v_head_dim: int = 0
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
+            MHA/GQA keys: head_dim, threshold, q_indices, k_indices
+            MLA extra keys: is_mla=True, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
+    is_mla = clip_config.get('is_mla', False)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
+        if kind in ('wq_b', 'wq', 'q_proj'):
+            indices = clip_config.get('q_indices', []) or []
+        elif kind in ('wkv_b', 'wk', 'k_proj'):
+            indices = clip_config.get('k_indices', []) or []
+    if is_mla:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+            is_mla=True,
+            qk_nope_head_dim=clip_config['qk_nope_head_dim'],
+            qk_rope_head_dim=clip_config['qk_rope_head_dim'],
+            v_head_dim=clip_config['v_head_dim'],
+        )
+    else:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+        )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
+    Returns scales tensor (√γ per head) if any head exceeds threshold, else None.
+    For MLA wkv_b, effective row stride is qk_nope_head_dim + v_head_dim.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
     if not head_scales:
         return None
+    # For MLA wkv_b, each KV head spans qk_nope_head_dim + v_head_dim rows
+    if qk_clip_state.is_mla and kind == 'wkv_b':
+        effective_head_dim = qk_clip_state.qk_nope_head_dim + qk_clip_state.v_head_dim
+    else:
+        effective_head_dim = head_dim
+    H_global = p.shape[0] // effective_head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
+def qk_clip(p, scales, info):
+    """Apply per-head scaling to a Q/K projection weight matrix.
+    Args:
+        p: Parameter (nn.Parameter or raw tensor).
+        scales: [n_heads] tensor, each element = √γ_h.
+        info: QKClipInfo with kind, head_dim, and MLA sub-head dimensions.
+    MLA sub-region scaling per Algorithm 1 (MuonClip):
+        wq_b: q_nope rows → √γ,  q_pe rows → γ
+        wkv_b: k_nope rows → √γ, v rows → unchanged
+    """
+    W = p.data if isinstance(p, torch.nn.Parameter) else p
+    if not info.is_mla:
+        # MHA/GQA: uniform √γ applied to all rows in each head
+        W.view(-1, info.head_dim, W.shape[1]).mul_(scales.view(-1, 1, 1))
+        return
+    # MLA: vectorized sub-region scaling within each head
+    if info.kind == 'wq_b':
+        qk_nope = info.qk_nope_head_dim
+        qk_head_dim = qk_nope + info.qk_rope_head_dim
+        W_3d = W.view(-1, qk_head_dim, W.shape[1])  # [H, qk_head_dim, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # q_nope → √γ
+        W_3d[:, qk_nope:, :].mul_((scales * scales).view(-1, 1,
+                                                         1))  # q_pe   → γ
+    elif info.kind == 'wkv_b':
+        qk_nope = info.qk_nope_head_dim
+        kv_stride = qk_nope + info.v_head_dim
+        W_3d = W.view(-1, kv_stride, W.shape[1])  # [H, kv_stride, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # k_nope → √γ
+        # v rows: left unscaled; the shared rotary key k_R is not produced by wkv_b and stays unchanged
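The sub-region factors in the `qk_clip` docstring (q_nope → √γ, q_pe → γ, k_nope → √γ, k_R unchanged) can be checked on a single head. The sketch below is illustrative only: it uses plain Python lists on activations rather than weight rows (scaling a weight row scales the matching activation linearly), and the helper names `dot`, `mla_logit` and `clip_head` are hypothetical, not part of `qk_clip.py`. It assumes the per-head logit is the sum of a content term and a rotary term.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def mla_logit(q_nope, q_pe, k_nope, k_rope):
    """Assumed per-head logit: content part plus rotary part."""
    return dot(q_nope, k_nope) + dot(q_pe, k_rope)


def clip_head(q_nope, q_pe, k_nope, gamma):
    """Apply the docstring's factors; k_rope is shared and left untouched."""
    s = math.sqrt(gamma)
    return (
        [s * x for x in q_nope],      # q_nope -> sqrt(gamma)
        [gamma * x for x in q_pe],    # q_pe   -> gamma
        [s * x for x in k_nope],      # k_nope -> sqrt(gamma)
    )


gamma = 0.25
q_nope, q_pe = [1.0, 2.0], [3.0]
k_nope, k_rope = [0.5, -1.0], [2.0]

before = mla_logit(q_nope, q_pe, k_nope, k_rope)
q_nope2, q_pe2, k_nope2 = clip_head(q_nope, q_pe, k_nope, gamma)
after = mla_logit(q_nope2, q_pe2, k_nope2, k_rope)

# Both terms pick up exactly one factor of gamma, so the whole logit does too.
print(before, after)
```

Because k_R is shared across heads it cannot take a per-head factor, which is why the rotary query rows take the full γ instead of √γ.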

build/torch210-cxx11-rocm70-x86_64-linux/_ops.py CHANGED

@@ -1,9 +1,9 @@
 import torch
-from . import _optimizer_38f9b8e_dirty
-ops = torch.ops._optimizer_38f9b8e_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"_optimizer_38f9b8e_dirty::{op_name}"

 import torch
+from . import _optimizer_8d53b78_dirty
+ops = torch.ops._optimizer_8d53b78_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
+    return f"_optimizer_8d53b78_dirty::{op_name}"

build/torch210-cxx11-rocm70-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8ec2fcc8a9dc8a1e4aa4e925eaee33613177873e474e8d627bf844dae80f5f8b
 size 1866400

 version https://git-lfs.github.com/spec/v1
+oid sha256:055206c495ecade2fe4b5427db34f0a48152174e79808cbe1ce7d7ca86d32396
 size 1866400

build/torch210-cxx11-rocm70-x86_64-linux/cpu_offload.py CHANGED

@@ -93,10 +93,7 @@ class CPUOffloadPool:
                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
-            cpu_flat = torch.empty(off,
-                                   dtype=dtype,
-                                   device="cpu",
-                                   pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
@@ -140,8 +137,7 @@ class CPUOffloadPool:
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
-                    cpu_flat[off:off + n].copy_(local.reshape(-1),
-                                                non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
@@ -159,8 +155,10 @@ class CPUOffloadPool:
                 )
         if not self._logged:
-            logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
-                        offloaded_bytes / (1024**2))
     # ------------------------------------------------------------------
     def reload(self):
@@ -198,12 +196,11 @@ class CPUOffloadPool:
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
-                local.reshape(-1).copy_(cpu_flat[off:off + n],
-                                        non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
-            logger.info("[CPUOffload] Reloaded %.2f MB (CPU → GPU)",
-                        reloaded_bytes / (1024**2))
-            self._logged = True

                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
+            cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
+                    cpu_flat[off : off + n].copy_(local.reshape(-1), non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
                 )
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
+                offloaded_bytes / (1024**2),
+            )
     # ------------------------------------------------------------------
     def reload(self):
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
+                local.reshape(-1).copy_(cpu_flat[off : off + n], non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Reloaded %.2f MB (CPU → GPU)", reloaded_bytes / (1024**2)
+            )
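The offload hunks above track each managed tensor as an `(off, n)` pair inside one flat CPU buffer. A minimal sketch of that bookkeeping, using plain lists in place of pinned tensors (the `pack` and `unpack` names are hypothetical and only illustrate the offset arithmetic, not the actual `CPUOffloadPool` API):

```python
def pack(chunks):
    """Concatenate chunks into one flat buffer and record (offset, length) pairs."""
    flat, offsets, off = [], [], 0
    for chunk in chunks:
        n = len(chunk)
        offsets.append((off, n))
        flat.extend(chunk)
        off += n
    return flat, offsets


def unpack(flat, offsets):
    """Recover each chunk by slicing flat[off : off + n]."""
    return [flat[off : off + n] for off, n in offsets]


chunks = [[1, 2, 3], [4], [5, 6]]
flat, offsets = pack(chunks)
restored = unpack(flat, offsets)
print(offsets)
```

In the real pool the slices are copied with `copy_(..., non_blocking=True)`; the sketch only covers how the offsets line up.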

build/torch210-cxx11-rocm70-x86_64-linux/muon.py CHANGED

@@ -360,7 +360,7 @@ class Muon(torch.optim.Optimizer):
             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
-                qk_clip(p, scales_full, qk_clip_state.head_dim)
     def distributed_muon(
         self,
@@ -407,7 +407,7 @@ class Muon(torch.optim.Optimizer):
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
-                    qk_clip(p, scales_full, qk_clip_state.head_dim)
             if not dtensor_params:
                 return
@@ -1050,12 +1050,16 @@ class Muon(torch.optim.Optimizer):
     def state_dict(self) -> dict:
         if self.cpu_offload:
-            raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
-            raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
+                qk_clip(p, scales_full, qk_clip_state)
     def distributed_muon(
         self,
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
+                    qk_clip(p, scales_full, qk_clip_state)
             if not dtensor_params:
                 return
     def state_dict(self) -> dict:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save."
+            )
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load."
+            )
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

build/torch210-cxx11-rocm70-x86_64-linux/newton_schulz.py CHANGED

@@ -32,27 +32,30 @@ def _optimal_quintic(l, u, max_iter=1000):
     E = inf
     for _ in range(max_iter):
         old_E = E
-        LHS = np.array([
-            [l, l**3, l**5, 1],
-            [q, q**3, q**5, -1],
-            [r, r**3, r**5, 1],
-            [u, u**3, u**5, -1],
-        ])
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
-            raise ValueError(f"_optimal_quintic: non-finite solve result "
-                             f"a={a}, b={b}, c={c}, E={E}")
         q, r = np.sqrt(
-            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) /
-            (10 * c))
         if not np.all(np.isfinite([q, r])):
-            raise ValueError(
-                f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
-            f"_optimal_quintic: did not converge after {max_iter} iterations")
     return float(a), float(b), float(c)
@@ -111,10 +114,9 @@ def _optimal_composition(l, num_iters, safety_factor_eps=0, cushion=0):
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
-_coeffs_list = _optimal_composition(l=1e-3,
-                                    num_iters=10,
-                                    safety_factor_eps=1e-2,
-                                    cushion=0.02)
 # This code is adapted from:
@@ -148,7 +150,8 @@ def _zeropower_via_newtonschulz5(G, steps):
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
@@ -183,7 +186,8 @@ def _zeropower_via_newtonschulz5_batched(G, steps):
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

     E = inf
     for _ in range(max_iter):
         old_E = E
+        LHS = np.array(
+            [
+                [l, l**3, l**5, 1],
+                [q, q**3, q**5, -1],
+                [r, r**3, r**5, 1],
+                [u, u**3, u**5, -1],
+            ]
+        )
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
+            raise ValueError(
+                f"_optimal_quintic: non-finite solve result a={a}, b={b}, c={c}, E={E}"
+            )
         q, r = np.sqrt(
+            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) / (10 * c)
+        )
         if not np.all(np.isfinite([q, r])):
+            raise ValueError(f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
+            f"_optimal_quintic: did not converge after {max_iter} iterations"
+        )
     return float(a), float(b), float(c)
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
+_coeffs_list = _optimal_composition(
+    l=1e-3, num_iters=10, safety_factor_eps=1e-2, cushion=0.02
+)
 # This code is adapted from:
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))
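The `hs = _coeffs_list[:steps] + list(repeat(...))` expression truncates the precomputed coefficient list when `steps` is small and pads it with the last entry when `steps` is larger. A small standalone sketch of that behaviour, with placeholder coefficient tuples (the values and the `schedule` helper are made up for illustration):

```python
from itertools import repeat


def schedule(coeffs, steps):
    """Return exactly `steps` coefficient tuples, reusing the last one as padding."""
    # repeat(x, n) yields nothing when n <= 0, so truncation needs no special case.
    return coeffs[:steps] + list(repeat(coeffs[-1], steps - len(coeffs)))


coeffs = [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (3.0, 3.0, 3.0)]  # placeholder values

short = schedule(coeffs, 2)   # fewer steps than coefficients: truncated
padded = schedule(coeffs, 5)  # more steps than coefficients: last entry repeated
print(len(short), len(padded))
```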

build/torch210-cxx11-rocm70-x86_64-linux/qk_clip.py CHANGED

@@ -13,7 +13,11 @@ logger = logging.getLogger(__name__)
 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
-    ('wq', 'wk', 'q_proj', 'k_proj') and return (kind, layer_index).
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
@@ -23,6 +27,8 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
@@ -37,7 +43,7 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
             layer_idx = int(part)
             break
-    if kind in ('wq', 'wk', 'q_proj', 'k_proj'):
         return kind, layer_idx
     return None, -1
@@ -46,18 +52,26 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
-    kind: str | None  # 'wq'/'q_proj' or 'wk'/'k_proj' or None
     indices: list[int]  # which heads to consider for clipping
-    head_dim: int  # from config
     threshold: float  # from config
     logit: torch.Tensor | None
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
@@ -70,31 +84,48 @@ def get_qk_clip_info(clip_config, n, qk_logits):
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
-        indices_key = 'q_indices' if 'q' in kind else 'k_indices'
-        indices = clip_config.get(indices_key, []) or []
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
-    return QKClipInfo(
-        kind=kind,
-        indices=indices,
-        head_dim=head_dim,
-        threshold=threshold,
-        logit=logit,
-    )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
-    Returns scales tensor if any head exceeds threshold, else None.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
@@ -118,18 +149,50 @@ def compute_scales(p, qk_clip_state):
     if not head_scales:
         return None
-    H_global = p.shape[0] // head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
-def qk_clip(p, scales, head_dim):
-    """Apply per-head scaling to a Q/K projection weight matrix."""
-    if isinstance(p, torch.nn.Parameter):
-        W = p.data.view(-1, head_dim, p.data.shape[1])
-        W.mul_(scales.view(-1, 1, 1))
-    else:
-        W = p.view(-1, head_dim, p.shape[1])
-        W.mul_(scales.view(-1, 1, 1))

 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
+    and return (kind, layer_index).
+    Supported kinds:
+        MHA/GQA: 'wq', 'wk', 'q_proj', 'k_proj'
+        MLA:     'wq_b' (Q up-proj), 'wkv_b' (KV up-proj)
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
+        'model.1.attn.wq_b.weight'    -> ('wq_b', 1)
+        'model.0.attn.wkv_b.weight'   -> ('wkv_b', 0)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
             layer_idx = int(part)
             break
+    if kind in ('wq', 'wk', 'q_proj', 'k_proj', 'wq_b', 'wkv_b'):
         return kind, layer_idx
     return None, -1
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
+    kind: str | None  # 'wq'/'q_proj'/'wq_b' or 'wk'/'k_proj'/'wkv_b' or None
     indices: list[int]  # which heads to consider for clipping
+    head_dim: int  # from config (qk_head_dim for MLA wq_b)
     threshold: float  # from config
     logit: torch.Tensor | None
+    # MLA-specific fields
+    is_mla: bool = False
+    qk_nope_head_dim: int = 0
+    qk_rope_head_dim: int = 0
+    v_head_dim: int = 0
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
+            MHA/GQA keys: head_dim, threshold, q_indices, k_indices
+            MLA extra keys: is_mla=True, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
+    is_mla = clip_config.get('is_mla', False)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
+        if kind in ('wq_b', 'wq', 'q_proj'):
+            indices = clip_config.get('q_indices', []) or []
+        elif kind in ('wkv_b', 'wk', 'k_proj'):
+            indices = clip_config.get('k_indices', []) or []
+    if is_mla:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+            is_mla=True,
+            qk_nope_head_dim=clip_config['qk_nope_head_dim'],
+            qk_rope_head_dim=clip_config['qk_rope_head_dim'],
+            v_head_dim=clip_config['v_head_dim'],
+        )
+    else:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+        )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
+    Returns scales tensor (√γ per head) if any head exceeds threshold, else None.
+    For MLA wkv_b, effective row stride is qk_nope_head_dim + v_head_dim.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
     if not head_scales:
         return None
+    # For MLA wkv_b, each KV head spans qk_nope_head_dim + v_head_dim rows
+    if qk_clip_state.is_mla and kind == 'wkv_b':
+        effective_head_dim = qk_clip_state.qk_nope_head_dim + qk_clip_state.v_head_dim
+    else:
+        effective_head_dim = head_dim
+    H_global = p.shape[0] // effective_head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
+def qk_clip(p, scales, info):
+    """Apply per-head scaling to a Q/K projection weight matrix.
+    Args:
+        p: Parameter (nn.Parameter or raw tensor).
+        scales: [n_heads] tensor, each element = √γ_h.
+        info: QKClipInfo with kind, head_dim, and MLA sub-head dimensions.
+    MLA sub-region scaling per Algorithm 1 (MuonClip):
+        wq_b: q_nope rows → √γ,  q_pe rows → γ
+        wkv_b: k_nope rows → √γ, v rows → unchanged
+    """
+    W = p.data if isinstance(p, torch.nn.Parameter) else p
+    if not info.is_mla:
+        # MHA/GQA: uniform √γ applied to all rows in each head
+        W.view(-1, info.head_dim, W.shape[1]).mul_(scales.view(-1, 1, 1))
+        return
+    # MLA: vectorized sub-region scaling within each head
+    if info.kind == 'wq_b':
+        qk_nope = info.qk_nope_head_dim
+        qk_head_dim = qk_nope + info.qk_rope_head_dim
+        W_3d = W.view(-1, qk_head_dim, W.shape[1])  # [H, qk_head_dim, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # q_nope → √γ
+        W_3d[:, qk_nope:, :].mul_((scales * scales).view(-1, 1,
+                                                         1))  # q_pe   → γ
+    elif info.kind == 'wkv_b':
+        qk_nope = info.qk_nope_head_dim
+        kv_stride = qk_nope + info.v_head_dim
+        W_3d = W.view(-1, kv_stride, W.shape[1])  # [H, kv_stride, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # k_nope → √γ
+        # v rows: left unscaled; the shared rotary key k_R is not produced by wkv_b and stays unchanged
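The `effective_head_dim` branch in `compute_scales` matters because a `wkv_b` weight stacks `qk_nope_head_dim + v_head_dim` rows per head, which generally differs from the query head dimension. The sketch below mirrors that head-count logic with plain Python lists; `build_scales` and the dimensions used are hypothetical, chosen only to show how the wrong stride miscounts heads.

```python
def build_scales(n_rows, head_dim, head_scales, is_mla=False, kind=None,
                 qk_nope_head_dim=0, v_head_dim=0):
    """Return one scale per head, defaulting to 1.0 for heads that are not clipped."""
    if is_mla and kind == "wkv_b":
        # Each KV head spans the k_nope rows followed by the v rows.
        effective_head_dim = qk_nope_head_dim + v_head_dim
    else:
        effective_head_dim = head_dim
    n_heads = n_rows // effective_head_dim
    scales = [1.0] * n_heads
    for head_idx, scale in head_scales.items():
        scales[head_idx] = scale
    return scales


# Assumed example shapes: 4 heads, qk_nope=128, rope=64, v=128.
n_rows = 4 * (128 + 128)          # rows of a wkv_b weight
naive_heads = n_rows // (128 + 64)  # using the query head_dim would miscount
scales = build_scales(n_rows, 128 + 64, {2: 0.5}, is_mla=True, kind="wkv_b",
                      qk_nope_head_dim=128, v_head_dim=128)
print(naive_heads, scales)
```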

build/torch210-cxx11-rocm71-x86_64-linux/_ops.py CHANGED

@@ -1,9 +1,9 @@
 import torch
-from . import _optimizer_38f9b8e_dirty
-ops = torch.ops._optimizer_38f9b8e_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"_optimizer_38f9b8e_dirty::{op_name}"

 import torch
+from . import _optimizer_8d53b78_dirty
+ops = torch.ops._optimizer_8d53b78_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
+    return f"_optimizer_8d53b78_dirty::{op_name}"

build/torch210-cxx11-rocm71-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:18373b2e448071735ce724008122f179dd814986925c9cf0fc03f32201b2b1fa
 size 1866112

 version https://git-lfs.github.com/spec/v1
+oid sha256:315ff09ffa88ec806cb8abe49edb2ca6951e9ac34be3d3e10f159093f9576ee0
 size 1866112

build/torch210-cxx11-rocm71-x86_64-linux/cpu_offload.py CHANGED

@@ -93,10 +93,7 @@ class CPUOffloadPool:
                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
-            cpu_flat = torch.empty(off,
-                                   dtype=dtype,
-                                   device="cpu",
-                                   pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
@@ -140,8 +137,7 @@ class CPUOffloadPool:
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
-                    cpu_flat[off:off + n].copy_(local.reshape(-1),
-                                                non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
@@ -159,8 +155,10 @@ class CPUOffloadPool:
                 )
         if not self._logged:
-            logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
-                        offloaded_bytes / (1024**2))
     # ------------------------------------------------------------------
     def reload(self):
@@ -198,12 +196,11 @@ class CPUOffloadPool:
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
-                local.reshape(-1).copy_(cpu_flat[off:off + n],
-                                        non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
-            logger.info("[CPUOffload] Reloaded %.2f MB (CPU → GPU)",
-                        reloaded_bytes / (1024**2))
-            self._logged = True

                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
+            cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
+                    cpu_flat[off : off + n].copy_(local.reshape(-1), non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
                 )
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
+                offloaded_bytes / (1024**2),
+            )
     # ------------------------------------------------------------------
     def reload(self):
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
+                local.reshape(-1).copy_(cpu_flat[off : off + n], non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Reloaded %.2f MB (CPU → GPU)", reloaded_bytes / (1024**2)
+            )

build/torch210-cxx11-rocm71-x86_64-linux/muon.py CHANGED

@@ -360,7 +360,7 @@ class Muon(torch.optim.Optimizer):
             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
-                qk_clip(p, scales_full, qk_clip_state.head_dim)
     def distributed_muon(
         self,
@@ -407,7 +407,7 @@ class Muon(torch.optim.Optimizer):
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
-                    qk_clip(p, scales_full, qk_clip_state.head_dim)
             if not dtensor_params:
                 return
@@ -1050,12 +1050,16 @@ class Muon(torch.optim.Optimizer):
     def state_dict(self) -> dict:
         if self.cpu_offload:
-            raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
-            raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
+                qk_clip(p, scales_full, qk_clip_state)
     def distributed_muon(
         self,
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
+                    qk_clip(p, scales_full, qk_clip_state)
             if not dtensor_params:
                 return
     def state_dict(self) -> dict:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save."
+            )
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load."
+            )
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

build/torch210-cxx11-rocm71-x86_64-linux/newton_schulz.py CHANGED

@@ -32,27 +32,30 @@ def _optimal_quintic(l, u, max_iter=1000):
     E = inf
     for _ in range(max_iter):
         old_E = E
-        LHS = np.array([
-            [l, l**3, l**5, 1],
-            [q, q**3, q**5, -1],
-            [r, r**3, r**5, 1],
-            [u, u**3, u**5, -1],
-        ])
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
-            raise ValueError(f"_optimal_quintic: non-finite solve result "
-                             f"a={a}, b={b}, c={c}, E={E}")
         q, r = np.sqrt(
-            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) /
-            (10 * c))
         if not np.all(np.isfinite([q, r])):
-            raise ValueError(
-                f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
-            f"_optimal_quintic: did not converge after {max_iter} iterations")
     return float(a), float(b), float(c)
@@ -111,10 +114,9 @@ def _optimal_composition(l, num_iters, safety_factor_eps=0, cushion=0):
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
-_coeffs_list = _optimal_composition(l=1e-3,
-                                    num_iters=10,
-                                    safety_factor_eps=1e-2,
-                                    cushion=0.02)
 # This code is adapted from:
@@ -148,7 +150,8 @@ def _zeropower_via_newtonschulz5(G, steps):
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
@@ -183,7 +186,8 @@ def _zeropower_via_newtonschulz5_batched(G, steps):
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

     E = inf
     for _ in range(max_iter):
         old_E = E
+        LHS = np.array(
+            [
+                [l, l**3, l**5, 1],
+                [q, q**3, q**5, -1],
+                [r, r**3, r**5, 1],
+                [u, u**3, u**5, -1],
+            ]
+        )
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
+            raise ValueError(
+                f"_optimal_quintic: non-finite solve result a={a}, b={b}, c={c}, E={E}"
+            )
         q, r = np.sqrt(
+            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) / (10 * c)
+        )
         if not np.all(np.isfinite([q, r])):
+            raise ValueError(f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
+            f"_optimal_quintic: did not converge after {max_iter} iterations"
+        )
     return float(a), float(b), float(c)
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
+_coeffs_list = _optimal_composition(
+    l=1e-3, num_iters=10, safety_factor_eps=1e-2, cushion=0.02
+)
 # This code is adapted from:
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))
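The node update `q, r = np.sqrt((-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) / (10 * c))` in `_optimal_quintic` appears to place the interior nodes at the stationary points of the odd quintic. A short derivation under that reading, where $p$ is the polynomial with the solved coefficients $a, b, c$:

```latex
\begin{aligned}
p(x)  &= a x + b x^{3} + c x^{5}, \\
p'(x) &= a + 3 b x^{2} + 5 c x^{4} = 0
        \quad\Longrightarrow\quad 5 c\, t^{2} + 3 b\, t + a = 0, \qquad t = x^{2}, \\
t_{\pm} &= \frac{-3 b \pm \sqrt{9 b^{2} - 20\, a c}}{10\, c}, \qquad
q = \sqrt{t_{-}}, \quad r = \sqrt{t_{+}} .
\end{aligned}
```

The two signs correspond to the `np.array([-1, 1])` factor in the code, and the finiteness check that follows guards against a negative discriminant or a negative $t_{\pm}$.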

build/torch210-cxx11-rocm71-x86_64-linux/qk_clip.py CHANGED

@@ -13,7 +13,11 @@ logger = logging.getLogger(__name__)
 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
-    ('wq', 'wk', 'q_proj', 'k_proj') and return (kind, layer_index).
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
@@ -23,6 +27,8 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
@@ -37,7 +43,7 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
             layer_idx = int(part)
             break
-    if kind in ('wq', 'wk', 'q_proj', 'k_proj'):
         return kind, layer_idx
     return None, -1
@@ -46,18 +52,26 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
-    kind: str | None  # 'wq'/'q_proj' or 'wk'/'k_proj' or None
     indices: list[int]  # which heads to consider for clipping
-    head_dim: int  # from config
     threshold: float  # from config
     logit: torch.Tensor | None
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
@@ -70,31 +84,48 @@ def get_qk_clip_info(clip_config, n, qk_logits):
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
-        indices_key = 'q_indices' if 'q' in kind else 'k_indices'
-        indices = clip_config.get(indices_key, []) or []
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
-    return QKClipInfo(
-        kind=kind,
-        indices=indices,
-        head_dim=head_dim,
-        threshold=threshold,
-        logit=logit,
-    )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
-    Returns scales tensor if any head exceeds threshold, else None.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
@@ -118,18 +149,50 @@ def compute_scales(p, qk_clip_state):
     if not head_scales:
         return None
-    H_global = p.shape[0] // head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
-def qk_clip(p, scales, head_dim):
-    """Apply per-head scaling to a Q/K projection weight matrix."""
-    if isinstance(p, torch.nn.Parameter):
-        W = p.data.view(-1, head_dim, p.data.shape[1])
-        W.mul_(scales.view(-1, 1, 1))
-    else:
-        W = p.view(-1, head_dim, p.shape[1])
-        W.mul_(scales.view(-1, 1, 1))

 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
+    and return (kind, layer_index).
+    Supported kinds:
+        MHA/GQA: 'wq', 'wk', 'q_proj', 'k_proj'
+        MLA:     'wq_b' (Q up-proj), 'wkv_b' (KV up-proj)
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
+        'model.1.attn.wq_b.weight'    -> ('wq_b', 1)
+        'model.0.attn.wkv_b.weight'   -> ('wkv_b', 0)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
             layer_idx = int(part)
             break
+    if kind in ('wq', 'wk', 'q_proj', 'k_proj', 'wq_b', 'wkv_b'):
         return kind, layer_idx
     return None, -1
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
+    kind: str | None  # 'wq'/'q_proj'/'wq_b' or 'wk'/'k_proj'/'wkv_b' or None
     indices: list[int]  # which heads to consider for clipping
+    head_dim: int  # from config (qk_head_dim for MLA wq_b)
     threshold: float  # from config
     logit: torch.Tensor | None
+    # MLA-specific fields
+    is_mla: bool = False
+    qk_nope_head_dim: int = 0
+    qk_rope_head_dim: int = 0
+    v_head_dim: int = 0
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
+            MHA/GQA keys: head_dim, threshold, q_indices, k_indices
+            MLA extra keys: is_mla=True, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
+    is_mla = clip_config.get('is_mla', False)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
+        if kind in ('wq_b', 'wq', 'q_proj'):
+            indices = clip_config.get('q_indices', []) or []
+        elif kind in ('wkv_b', 'wk', 'k_proj'):
+            indices = clip_config.get('k_indices', []) or []
+    if is_mla:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+            is_mla=True,
+            qk_nope_head_dim=clip_config['qk_nope_head_dim'],
+            qk_rope_head_dim=clip_config['qk_rope_head_dim'],
+            v_head_dim=clip_config['v_head_dim'],
+        )
+    else:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+        )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
+    Returns scales tensor (√γ per head) if any head exceeds threshold, else None.
+    For MLA wkv_b, effective row stride is qk_nope_head_dim + v_head_dim.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
     if not head_scales:
         return None
+    # For MLA wkv_b, each KV head spans qk_nope_head_dim + v_head_dim rows
+    if qk_clip_state.is_mla and kind == 'wkv_b':
+        effective_head_dim = qk_clip_state.qk_nope_head_dim + qk_clip_state.v_head_dim
+    else:
+        effective_head_dim = head_dim
+    H_global = p.shape[0] // effective_head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
+def qk_clip(p, scales, info):
+    """Apply per-head scaling to a Q/K projection weight matrix.
+    Args:
+        p: Parameter (nn.Parameter or raw tensor).
+        scales: [n_heads] tensor, each element = √γ_h.
+        info: QKClipInfo with kind, head_dim, and MLA sub-head dimensions.
+    MLA sub-region scaling per Algorithm 1 (MuonClip):
+        wq_b: q_nope rows → √γ,  q_pe rows → γ
+        wkv_b: k_nope rows → √γ, v rows → unchanged
+    """
+    W = p.data if isinstance(p, torch.nn.Parameter) else p
+    if not info.is_mla:
+        # MHA/GQA: uniform √γ applied to all rows in each head
+        W.view(-1, info.head_dim, W.shape[1]).mul_(scales.view(-1, 1, 1))
+        return
+    # MLA: vectorized sub-region scaling within each head
+    if info.kind == 'wq_b':
+        qk_nope = info.qk_nope_head_dim
+        qk_head_dim = qk_nope + info.qk_rope_head_dim
+        W_3d = W.view(-1, qk_head_dim, W.shape[1])  # [H, qk_head_dim, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # q_nope → √γ
+        W_3d[:, qk_nope:, :].mul_((scales * scales).view(-1, 1,
+                                                         1))  # q_pe   → γ
+    elif info.kind == 'wkv_b':
+        qk_nope = info.qk_nope_head_dim
+        kv_stride = qk_nope + info.v_head_dim
+        W_3d = W.view(-1, kv_stride, W.shape[1])  # [H, kv_stride, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # k_nope → √γ
+        # v rows: not touched (k_R shared rotary unchanged)
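The sub-region layout that the new `qk_clip` relies on can be sketched without tensors. The helper below, `mla_row_factors`, is a hypothetical name introduced only for illustration. It lists the multiplier each output row would receive, assuming rows are laid out head by head as `[nope | rope]` for `wq_b` and `[k_nope | v]` for `wkv_b`, which is what the `view(-1, stride, in_dim)` calls in the diff imply.

```python
def mla_row_factors(kind, scales, qk_nope, qk_rope, v_dim):
    """Per-row multipliers for an MLA up-projection weight (illustrative only).

    scales holds one sqrt(gamma) value per head, as returned by compute_scales.
    """
    factors = []
    for s in scales:
        if kind == "wq_b":
            # q_nope rows get sqrt(gamma); q_pe rows get gamma = sqrt(gamma)**2.
            factors += [s] * qk_nope + [s * s] * qk_rope
        elif kind == "wkv_b":
            # k_nope rows get sqrt(gamma); v rows are left at 1.0.
            factors += [s] * qk_nope + [1.0] * v_dim
        else:
            raise ValueError(f"unsupported kind: {kind}")
    return factors


# Two heads: head 0 is clipped with sqrt(gamma) = 0.5, head 1 is untouched.
q_factors = mla_row_factors("wq_b", [0.5, 1.0], qk_nope=2, qk_rope=1, v_dim=3)
kv_factors = mla_row_factors("wkv_b", [0.5, 1.0], qk_nope=2, qk_rope=1, v_dim=3)
```

The list lengths also show why `compute_scales` needs a different stride for `wkv_b`: each KV head spans `qk_nope + v_dim` rows, not `qk_nope + qk_rope`.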

build/torch28-cxx11-cu126-x86_64-linux/_ops.py CHANGED

@@ -1,9 +1,9 @@
 import torch
-from . import _optimizer_38f9b8e_dirty
-ops = torch.ops._optimizer_38f9b8e_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"_optimizer_38f9b8e_dirty::{op_name}"

 import torch
+from . import _optimizer_8d53b78_dirty
+ops = torch.ops._optimizer_8d53b78_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
+    return f"_optimizer_8d53b78_dirty::{op_name}"

build/torch28-cxx11-cu126-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d2db9c7fb764a1fae1872779bc9ffac2aff18d14a238111d6b8b53b7d3dfa0d3
 size 1936664

 version https://git-lfs.github.com/spec/v1
+oid sha256:a9a7c1beffbad405ef7d6f46f44cf9c6671d119e04a340b54c8f4c8f9d699caf
 size 1936664

build/torch28-cxx11-cu126-x86_64-linux/cpu_offload.py CHANGED

@@ -93,10 +93,7 @@ class CPUOffloadPool:
                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
-            cpu_flat = torch.empty(off,
-                                   dtype=dtype,
-                                   device="cpu",
-                                   pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
@@ -140,8 +137,7 @@ class CPUOffloadPool:
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
-                    cpu_flat[off:off + n].copy_(local.reshape(-1),
-                                                non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
@@ -159,8 +155,10 @@ class CPUOffloadPool:
                 )
         if not self._logged:
-            logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
-                        offloaded_bytes / (1024**2))
     # ------------------------------------------------------------------
     def reload(self):
@@ -198,12 +196,11 @@ class CPUOffloadPool:
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
-                local.reshape(-1).copy_(cpu_flat[off:off + n],
-                                        non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
-            logger.info("[CPUOffload] Reloaded %.2f MB (CPU → GPU)",
-                        reloaded_bytes / (1024**2))
-            self._logged = True

                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
+            cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
+                    cpu_flat[off : off + n].copy_(local.reshape(-1), non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
                 )
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
+                offloaded_bytes / (1024**2),
+            )
     # ------------------------------------------------------------------
     def reload(self):
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
+                local.reshape(-1).copy_(cpu_flat[off : off + n], non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Reloaded %.2f MB (CPU → GPU)", reloaded_bytes / (1024**2)
+            )
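The offload and reload hunks both index a single flat buffer through recorded `(off, n)` pairs. A minimal sketch of that bookkeeping follows, using plain Python lists in place of the pinned CPU tensor and the managed parameters. The `pack` and `unpack` names are hypothetical and do not appear in `cpu_offload.py`.

```python
def pack(chunks):
    """Concatenate chunks into one flat list, recording (offset, length) per chunk."""
    flat, offsets, off = [], [], 0
    for chunk in chunks:
        n = len(chunk)
        offsets.append((off, n))  # same (off, n) bookkeeping as in the diff
        flat.extend(chunk)
        off += n
    return flat, offsets


def unpack(flat, offsets):
    """Recover each chunk by slicing flat[off : off + n]."""
    return [flat[off : off + n] for off, n in offsets]


chunks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
flat, offsets = pack(chunks)
restored = unpack(flat, offsets)
```

In the real pool the slices are copied with `copy_(..., non_blocking=True)` rather than rebuilt, but the offset arithmetic is the same.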

build/torch28-cxx11-cu126-x86_64-linux/muon.py CHANGED

@@ -360,7 +360,7 @@ class Muon(torch.optim.Optimizer):
             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
-                qk_clip(p, scales_full, qk_clip_state.head_dim)
     def distributed_muon(
         self,
@@ -407,7 +407,7 @@ class Muon(torch.optim.Optimizer):
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
-                    qk_clip(p, scales_full, qk_clip_state.head_dim)
             if not dtensor_params:
                 return
@@ -1050,12 +1050,16 @@ class Muon(torch.optim.Optimizer):
     def state_dict(self) -> dict:
         if self.cpu_offload:
-            raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
-            raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
+                qk_clip(p, scales_full, qk_clip_state)
     def distributed_muon(
         self,
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
+                    qk_clip(p, scales_full, qk_clip_state)
             if not dtensor_params:
                 return
     def state_dict(self) -> dict:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save."
+            )
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load."
+            )
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

build/torch28-cxx11-cu126-x86_64-linux/newton_schulz.py CHANGED

@@ -32,27 +32,30 @@ def _optimal_quintic(l, u, max_iter=1000):
     E = inf
     for _ in range(max_iter):
         old_E = E
-        LHS = np.array([
-            [l, l**3, l**5, 1],
-            [q, q**3, q**5, -1],
-            [r, r**3, r**5, 1],
-            [u, u**3, u**5, -1],
-        ])
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
-            raise ValueError(f"_optimal_quintic: non-finite solve result "
-                             f"a={a}, b={b}, c={c}, E={E}")
         q, r = np.sqrt(
-            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) /
-            (10 * c))
         if not np.all(np.isfinite([q, r])):
-            raise ValueError(
-                f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
-            f"_optimal_quintic: did not converge after {max_iter} iterations")
     return float(a), float(b), float(c)
@@ -111,10 +114,9 @@ def _optimal_composition(l, num_iters, safety_factor_eps=0, cushion=0):
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
-_coeffs_list = _optimal_composition(l=1e-3,
-                                    num_iters=10,
-                                    safety_factor_eps=1e-2,
-                                    cushion=0.02)
 # This code is adapted from:
@@ -148,7 +150,8 @@ def _zeropower_via_newtonschulz5(G, steps):
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
@@ -183,7 +186,8 @@ def _zeropower_via_newtonschulz5_batched(G, steps):
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

     E = inf
     for _ in range(max_iter):
         old_E = E
+        LHS = np.array(
+            [
+                [l, l**3, l**5, 1],
+                [q, q**3, q**5, -1],
+                [r, r**3, r**5, 1],
+                [u, u**3, u**5, -1],
+            ]
+        )
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
+            raise ValueError(
+                f"_optimal_quintic: non-finite solve result a={a}, b={b}, c={c}, E={E}"
+            )
         q, r = np.sqrt(
+            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) / (10 * c)
+        )
         if not np.all(np.isfinite([q, r])):
+            raise ValueError(f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
+            f"_optimal_quintic: did not converge after {max_iter} iterations"
+        )
     return float(a), float(b), float(c)
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
+_coeffs_list = _optimal_composition(
+    l=1e-3, num_iters=10, safety_factor_eps=1e-2, cushion=0.02
+)
 # This code is adapted from:
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))
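The node update in `_optimal_quintic` can be read as locating the interior extrema of the odd quintic p(x) = a·x + b·x³ + c·x⁵. Setting p'(x) = a + 3b·x² + 5c·x⁴ to zero and solving the quadratic in x² gives x² = (−3b ± √(9b² − 20ac)) / (10c), which is the expression in the diff. The sketch below checks this numerically for arbitrarily chosen coefficients. The helper names and the sample values are assumptions for illustration, not coefficients the solver would produce.

```python
from math import sqrt


def interior_extrema(a, b, c):
    """Positive roots of p'(x) for p(x) = a*x + b*x**3 + c*x**5."""
    disc = 9 * b**2 - 20 * a * c
    y_lo = (-3 * b - sqrt(disc)) / (10 * c)  # smaller root in x**2
    y_hi = (-3 * b + sqrt(disc)) / (10 * c)  # larger root in x**2
    return sqrt(y_lo), sqrt(y_hi)


def dp(x, a, b, c):
    """Derivative p'(x) = a + 3*b*x**2 + 5*c*x**4."""
    return a + 3 * b * x**2 + 5 * c * x**4


# Arbitrary coefficients with a positive discriminant and positive roots.
a, b, c = 3.0, -4.0, 1.0
q, r = interior_extrema(a, b, c)
```

If the discriminant or either root in x² were negative, `sqrt` would fail here; the source handles the NumPy equivalent by raising `ValueError` on non-finite `q` or `r`.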

build/torch28-cxx11-cu126-x86_64-linux/qk_clip.py CHANGED

@@ -13,7 +13,11 @@ logger = logging.getLogger(__name__)
 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
-    ('wq', 'wk', 'q_proj', 'k_proj') and return (kind, layer_index).
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
@@ -23,6 +27,8 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
@@ -37,7 +43,7 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
             layer_idx = int(part)
             break
-    if kind in ('wq', 'wk', 'q_proj', 'k_proj'):
         return kind, layer_idx
     return None, -1
@@ -46,18 +52,26 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
-    kind: str | None  # 'wq'/'q_proj' or 'wk'/'k_proj' or None
     indices: list[int]  # which heads to consider for clipping
-    head_dim: int  # from config
     threshold: float  # from config
     logit: torch.Tensor | None
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
@@ -70,31 +84,48 @@ def get_qk_clip_info(clip_config, n, qk_logits):
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
-        indices_key = 'q_indices' if 'q' in kind else 'k_indices'
-        indices = clip_config.get(indices_key, []) or []
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
-    return QKClipInfo(
-        kind=kind,
-        indices=indices,
-        head_dim=head_dim,
-        threshold=threshold,
-        logit=logit,
-    )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
-    Returns scales tensor if any head exceeds threshold, else None.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
@@ -118,18 +149,50 @@ def compute_scales(p, qk_clip_state):
     if not head_scales:
         return None
-    H_global = p.shape[0] // head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
-def qk_clip(p, scales, head_dim):
-    """Apply per-head scaling to a Q/K projection weight matrix."""
-    if isinstance(p, torch.nn.Parameter):
-        W = p.data.view(-1, head_dim, p.data.shape[1])
-        W.mul_(scales.view(-1, 1, 1))
-    else:
-        W = p.view(-1, head_dim, p.shape[1])
-        W.mul_(scales.view(-1, 1, 1))

 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
+    and return (kind, layer_index).
+    Supported kinds:
+        MHA/GQA: 'wq', 'wk', 'q_proj', 'k_proj'
+        MLA:     'wq_b' (Q up-proj), 'wkv_b' (KV up-proj)
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
+        'model.1.attn.wq_b.weight'    -> ('wq_b', 1)
+        'model.0.attn.wkv_b.weight'   -> ('wkv_b', 0)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
             layer_idx = int(part)
             break
+    if kind in ('wq', 'wk', 'q_proj', 'k_proj', 'wq_b', 'wkv_b'):
         return kind, layer_idx
     return None, -1
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
+    kind: str | None  # 'wq'/'q_proj'/'wq_b' or 'wk'/'k_proj'/'wkv_b' or None
     indices: list[int]  # which heads to consider for clipping
+    head_dim: int  # from config (qk_head_dim for MLA wq_b)
     threshold: float  # from config
     logit: torch.Tensor | None
+    # MLA-specific fields
+    is_mla: bool = False
+    qk_nope_head_dim: int = 0
+    qk_rope_head_dim: int = 0
+    v_head_dim: int = 0
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
+            MHA/GQA keys: head_dim, threshold, q_indices, k_indices
+            MLA extra keys: is_mla=True, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
+    is_mla = clip_config.get('is_mla', False)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
+        if kind in ('wq_b', 'wq', 'q_proj'):
+            indices = clip_config.get('q_indices', []) or []
+        elif kind in ('wkv_b', 'wk', 'k_proj'):
+            indices = clip_config.get('k_indices', []) or []
+    if is_mla:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+            is_mla=True,
+            qk_nope_head_dim=clip_config['qk_nope_head_dim'],
+            qk_rope_head_dim=clip_config['qk_rope_head_dim'],
+            v_head_dim=clip_config['v_head_dim'],
+        )
+    else:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+        )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
+    Returns scales tensor (√γ per head) if any head exceeds threshold, else None.
+    For MLA wkv_b, effective row stride is qk_nope_head_dim + v_head_dim.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
     if not head_scales:
         return None
+    # For MLA wkv_b, each KV head spans qk_nope_head_dim + v_head_dim rows
+    if qk_clip_state.is_mla and kind == 'wkv_b':
+        effective_head_dim = qk_clip_state.qk_nope_head_dim + qk_clip_state.v_head_dim
+    else:
+        effective_head_dim = head_dim
+    H_global = p.shape[0] // effective_head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
+def qk_clip(p, scales, info):
+    """Apply per-head scaling to a Q/K projection weight matrix.
+    Args:
+        p: Parameter (nn.Parameter or raw tensor).
+        scales: [n_heads] tensor, each element = √γ_h.
+        info: QKClipInfo with kind, head_dim, and MLA sub-head dimensions.
+    MLA sub-region scaling per Algorithm 1 (MuonClip):
+        wq_b: q_nope rows → √γ,  q_pe rows → γ
+        wkv_b: k_nope rows → √γ, v rows → unchanged
+    """
+    W = p.data if isinstance(p, torch.nn.Parameter) else p
+    if not info.is_mla:
+        # MHA/GQA: uniform √γ applied to all rows in each head
+        W.view(-1, info.head_dim, W.shape[1]).mul_(scales.view(-1, 1, 1))
+        return
+    # MLA: vectorized sub-region scaling within each head
+    if info.kind == 'wq_b':
+        qk_nope = info.qk_nope_head_dim
+        qk_head_dim = qk_nope + info.qk_rope_head_dim
+        W_3d = W.view(-1, qk_head_dim, W.shape[1])  # [H, qk_head_dim, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # q_nope → √γ
+        W_3d[:, qk_nope:, :].mul_((scales * scales).view(-1, 1,
+                                                         1))  # q_pe   → γ
+    elif info.kind == 'wkv_b':
+        qk_nope = info.qk_nope_head_dim
+        kv_stride = qk_nope + info.v_head_dim
+        W_3d = W.view(-1, kv_stride, W.shape[1])  # [H, kv_stride, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # k_nope → √γ
+        # v rows: not touched (k_R shared rotary unchanged)
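One way to see why `q_pe` rows take γ while `q_nope` and `k_nope` rows take √γ: the rotary key is left unscaled, so the query side must carry the full factor on the rope term for the whole logit to shrink by γ. The check below is a sketch under the assumption that a head's logit is `q_nope·k_nope + q_pe·k_pe`. All vectors and the value of `gamma` are made up for illustration.

```python
from math import isclose, sqrt


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


# Made-up per-head query/key pieces (nope = non-rotary, pe = rotary).
q_nope, q_pe = [0.5, -1.0], [2.0, 0.25]
k_nope, k_pe = [1.5, 0.5], [1.0, 4.0]

gamma = 0.64
s = sqrt(gamma)  # the per-head value that compute_scales returns

before = dot(q_nope, k_nope) + dot(q_pe, k_pe)

# Scaling as listed in the qk_clip docstring; k_pe is left as it is.
after = dot([s * x for x in q_nope], [s * x for x in k_nope]) + dot(
    [gamma * x for x in q_pe], k_pe
)
```

Both terms pick up exactly one factor of γ, so `after` equals `gamma * before` up to floating-point rounding.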

build/torch28-cxx11-cu128-x86_64-linux/_ops.py CHANGED

@@ -1,9 +1,9 @@
 import torch
-from . import _optimizer_38f9b8e_dirty
-ops = torch.ops._optimizer_38f9b8e_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"_optimizer_38f9b8e_dirty::{op_name}"

 import torch
+from . import _optimizer_8d53b78_dirty
+ops = torch.ops._optimizer_8d53b78_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
+    return f"_optimizer_8d53b78_dirty::{op_name}"

build/torch28-cxx11-cu128-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b1ddfe7e38a9213d5dede8052c81b78eca952aef122d4da919950ff504dc3908
 size 1999872

 version https://git-lfs.github.com/spec/v1
+oid sha256:090f5a44cdfa4554147159cc36bb7e8ee9dba1ffb1fea4825aa838461fdaddf9
 size 1999872

build/torch28-cxx11-cu128-x86_64-linux/cpu_offload.py CHANGED

@@ -93,10 +93,7 @@ class CPUOffloadPool:
                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
-            cpu_flat = torch.empty(off,
-                                   dtype=dtype,
-                                   device="cpu",
-                                   pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
@@ -140,8 +137,7 @@ class CPUOffloadPool:
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
-                    cpu_flat[off:off + n].copy_(local.reshape(-1),
-                                                non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
@@ -159,8 +155,10 @@ class CPUOffloadPool:
                 )
         if not self._logged:
-            logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
-                        offloaded_bytes / (1024**2))
     # ------------------------------------------------------------------
     def reload(self):
@@ -198,12 +196,11 @@ class CPUOffloadPool:
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
-                local.reshape(-1).copy_(cpu_flat[off:off + n],
-                                        non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
-            logger.info("[CPUOffload] Reloaded %.2f MB (CPU → GPU)",
-                        reloaded_bytes / (1024**2))
-            self._logged = True

                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
+            cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
+                    cpu_flat[off : off + n].copy_(local.reshape(-1), non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
                 )
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
+                offloaded_bytes / (1024**2),
+            )
     # ------------------------------------------------------------------
     def reload(self):
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
+                local.reshape(-1).copy_(cpu_flat[off : off + n], non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Reloaded %.2f MB (CPU → GPU)", reloaded_bytes / (1024**2)
+            )
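
The hunks above pack every managed tensor of one dtype into a single flat pinned CPU buffer and record an `(off, n)` pair per tensor, so offload and reload are slice copies in opposite directions. A minimal sketch of that bookkeeping, with plain Python lists standing in for tensors (the names `pack`, `offload`, and `reload` are illustrative and are not part of `CPUOffloadPool`):

```python
def pack(sizes):
    """Return per-tensor (offset, length) pairs and the total flat length."""
    offsets, off = [], 0
    for n in sizes:
        offsets.append((off, n))
        off += n
    return offsets, off


def offload(tensors, flat, offsets):
    """Copy each tensor into its slice of the flat buffer."""
    for t, (off, n) in zip(tensors, offsets):
        flat[off:off + n] = t


def reload(tensors, flat, offsets):
    """Copy each slice of the flat buffer back into its tensor, in place."""
    for t, (off, n) in zip(tensors, offsets):
        t[:] = flat[off:off + n]


# Lists stand in for flattened GPU tensors of the same dtype.
tensors = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
offsets, total = pack([len(t) for t in tensors])
flat = [0.0] * total
offload(tensors, flat, offsets)
```

In the real pool the slice copies are `copy_(..., non_blocking=True)` calls on a pinned buffer; the sketch only shows why a single `(off, n)` table is enough to drive both directions.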

build/torch28-cxx11-cu128-x86_64-linux/muon.py CHANGED

@@ -360,7 +360,7 @@ class Muon(torch.optim.Optimizer):
             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
-                qk_clip(p, scales_full, qk_clip_state.head_dim)
     def distributed_muon(
         self,
@@ -407,7 +407,7 @@ class Muon(torch.optim.Optimizer):
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
-                    qk_clip(p, scales_full, qk_clip_state.head_dim)
             if not dtensor_params:
                 return
@@ -1050,12 +1050,16 @@ class Muon(torch.optim.Optimizer):
     def state_dict(self) -> dict:
         if self.cpu_offload:
-            raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
-            raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
+                qk_clip(p, scales_full, qk_clip_state)
     def distributed_muon(
         self,
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
+                    qk_clip(p, scales_full, qk_clip_state)
             if not dtensor_params:
                 return
     def state_dict(self) -> dict:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save."
+            )
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load."
+            )
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that
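
The guards above mean a checkpoint can only be saved or loaded while CPU offload is off. A minimal sketch of the call order, using a stand-in class because the real `Muon` constructor and the body of `turn_off_cpu_offload()` are not shown in this diff (`OffloadGuardDemo` and its fields are hypothetical):

```python
class OffloadGuardDemo:
    """Hypothetical stand-in that mirrors only the guard shown in the diff."""

    def __init__(self):
        self.cpu_offload = True
        self._state = {"step": 3}

    def turn_off_cpu_offload(self):
        # Assumption: the real method also brings optimizer state back to GPU.
        self.cpu_offload = False

    def state_dict(self):
        if self.cpu_offload:
            raise RuntimeError(
                "state_dict() requires turn_off_cpu_offload() before checkpoint save."
            )
        return dict(self._state)


opt = OffloadGuardDemo()

# Saving while offload is still on hits the guard.
try:
    opt.state_dict()
    saved_early = True
except RuntimeError:
    saved_early = False

# Turning offload off first makes the save succeed.
opt.turn_off_cpu_offload()
checkpoint = opt.state_dict()
```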

build/torch28-cxx11-cu128-x86_64-linux/newton_schulz.py CHANGED

@@ -32,27 +32,30 @@ def _optimal_quintic(l, u, max_iter=1000):
     E = inf
     for _ in range(max_iter):
         old_E = E
-        LHS = np.array([
-            [l, l**3, l**5, 1],
-            [q, q**3, q**5, -1],
-            [r, r**3, r**5, 1],
-            [u, u**3, u**5, -1],
-        ])
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
-            raise ValueError(f"_optimal_quintic: non-finite solve result "
-                             f"a={a}, b={b}, c={c}, E={E}")
         q, r = np.sqrt(
-            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) /
-            (10 * c))
         if not np.all(np.isfinite([q, r])):
-            raise ValueError(
-                f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
-            f"_optimal_quintic: did not converge after {max_iter} iterations")
     return float(a), float(b), float(c)
@@ -111,10 +114,9 @@ def _optimal_composition(l, num_iters, safety_factor_eps=0, cushion=0):
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
-_coeffs_list = _optimal_composition(l=1e-3,
-                                    num_iters=10,
-                                    safety_factor_eps=1e-2,
-                                    cushion=0.02)
 # This code is adapted from:
@@ -148,7 +150,8 @@ def _zeropower_via_newtonschulz5(G, steps):
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
@@ -183,7 +186,8 @@ def _zeropower_via_newtonschulz5_batched(G, steps):
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

     E = inf
     for _ in range(max_iter):
         old_E = E
+        LHS = np.array(
+            [
+                [l, l**3, l**5, 1],
+                [q, q**3, q**5, -1],
+                [r, r**3, r**5, 1],
+                [u, u**3, u**5, -1],
+            ]
+        )
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
+            raise ValueError(
+                f"_optimal_quintic: non-finite solve result a={a}, b={b}, c={c}, E={E}"
+            )
         q, r = np.sqrt(
+            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) / (10 * c)
+        )
         if not np.all(np.isfinite([q, r])):
+            raise ValueError(f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
+            f"_optimal_quintic: did not converge after {max_iter} iterations"
+        )
     return float(a), float(b), float(c)
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
+_coeffs_list = _optimal_composition(
+    l=1e-3, num_iters=10, safety_factor_eps=1e-2, cushion=0.02
+)
 # This code is adapted from:
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))
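
Both Newton-Schulz variants build their per-step coefficient schedule `hs` the same way: take the first `steps` entries of `_coeffs_list` and, if more steps are requested than coefficients exist, repeat the last triple. A minimal sketch of that padding rule (the `schedule` helper and the placeholder triples are illustrative, not values produced by `_optimal_composition`):

```python
from itertools import repeat


def schedule(coeffs, steps):
    """First `steps` coefficient triples, repeating the last one if needed."""
    # repeat() with a non-positive count yields nothing, so truncation works too.
    return coeffs[:steps] + list(repeat(coeffs[-1], steps - len(coeffs)))


# Placeholder (a, b, c) triples for illustration only.
coeffs = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

short = schedule(coeffs, 2)   # truncated to the first two triples
padded = schedule(coeffs, 5)  # last triple repeated twice
```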

build/torch28-cxx11-cu128-x86_64-linux/qk_clip.py CHANGED

@@ -13,7 +13,11 @@ logger = logging.getLogger(__name__)
 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
-    ('wq', 'wk', 'q_proj', 'k_proj') and return (kind, layer_index).
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
@@ -23,6 +27,8 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
@@ -37,7 +43,7 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
             layer_idx = int(part)
             break
-    if kind in ('wq', 'wk', 'q_proj', 'k_proj'):
         return kind, layer_idx
     return None, -1
@@ -46,18 +52,26 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
-    kind: str | None  # 'wq'/'q_proj' or 'wk'/'k_proj' or None
     indices: list[int]  # which heads to consider for clipping
-    head_dim: int  # from config
     threshold: float  # from config
     logit: torch.Tensor | None
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
@@ -70,31 +84,48 @@ def get_qk_clip_info(clip_config, n, qk_logits):
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
-        indices_key = 'q_indices' if 'q' in kind else 'k_indices'
-        indices = clip_config.get(indices_key, []) or []
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
-    return QKClipInfo(
-        kind=kind,
-        indices=indices,
-        head_dim=head_dim,
-        threshold=threshold,
-        logit=logit,
-    )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
-    Returns scales tensor if any head exceeds threshold, else None.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
@@ -118,18 +149,50 @@ def compute_scales(p, qk_clip_state):
     if not head_scales:
         return None
-    H_global = p.shape[0] // head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
-def qk_clip(p, scales, head_dim):
-    """Apply per-head scaling to a Q/K projection weight matrix."""
-    if isinstance(p, torch.nn.Parameter):
-        W = p.data.view(-1, head_dim, p.data.shape[1])
-        W.mul_(scales.view(-1, 1, 1))
-    else:
-        W = p.view(-1, head_dim, p.shape[1])
-        W.mul_(scales.view(-1, 1, 1))

 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
+    and return (kind, layer_index).
+    Supported kinds:
+        MHA/GQA: 'wq', 'wk', 'q_proj', 'k_proj'
+        MLA:     'wq_b' (Q up-proj), 'wkv_b' (KV up-proj)
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
+        'model.1.attn.wq_b.weight'    -> ('wq_b', 1)
+        'model.0.attn.wkv_b.weight'   -> ('wkv_b', 0)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
             layer_idx = int(part)
             break
+    if kind in ('wq', 'wk', 'q_proj', 'k_proj', 'wq_b', 'wkv_b'):
         return kind, layer_idx
     return None, -1
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
+    kind: str | None  # 'wq'/'q_proj'/'wq_b' or 'wk'/'k_proj'/'wkv_b' or None
     indices: list[int]  # which heads to consider for clipping
+    head_dim: int  # from config (qk_head_dim for MLA wq_b)
     threshold: float  # from config
     logit: torch.Tensor | None
+    # MLA-specific fields
+    is_mla: bool = False
+    qk_nope_head_dim: int = 0
+    qk_rope_head_dim: int = 0
+    v_head_dim: int = 0
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
+            MHA/GQA keys: head_dim, threshold, q_indices, k_indices
+            MLA extra keys: is_mla=True, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
+    is_mla = clip_config.get('is_mla', False)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
+        if kind in ('wq_b', 'wq', 'q_proj'):
+            indices = clip_config.get('q_indices', []) or []
+        elif kind in ('wkv_b', 'wk', 'k_proj'):
+            indices = clip_config.get('k_indices', []) or []
+    if is_mla:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+            is_mla=True,
+            qk_nope_head_dim=clip_config['qk_nope_head_dim'],
+            qk_rope_head_dim=clip_config['qk_rope_head_dim'],
+            v_head_dim=clip_config['v_head_dim'],
+        )
+    else:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+        )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
+    Returns scales tensor (√γ per head) if any head exceeds threshold, else None.
+    For MLA wkv_b, effective row stride is qk_nope_head_dim + v_head_dim.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
     if not head_scales:
         return None
+    # For MLA wkv_b, each KV head spans qk_nope_head_dim + v_head_dim rows
+    if qk_clip_state.is_mla and kind == 'wkv_b':
+        effective_head_dim = qk_clip_state.qk_nope_head_dim + qk_clip_state.v_head_dim
+    else:
+        effective_head_dim = head_dim
+    H_global = p.shape[0] // effective_head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
+def qk_clip(p, scales, info):
+    """Apply per-head scaling to a Q/K projection weight matrix.
+    Args:
+        p: Parameter (nn.Parameter or raw tensor).
+        scales: [n_heads] tensor, each element = √γ_h.
+        info: QKClipInfo with kind, head_dim, and MLA sub-head dimensions.
+    MLA sub-region scaling per Algorithm 1 (MuonClip):
+        wq_b: q_nope rows → √γ,  q_pe rows → γ
+        wkv_b: k_nope rows → √γ, v rows → unchanged
+    """
+    W = p.data if isinstance(p, torch.nn.Parameter) else p
+    if not info.is_mla:
+        # MHA/GQA: uniform √γ applied to all rows in each head
+        W.view(-1, info.head_dim, W.shape[1]).mul_(scales.view(-1, 1, 1))
+        return
+    # MLA: vectorized sub-region scaling within each head
+    if info.kind == 'wq_b':
+        qk_nope = info.qk_nope_head_dim
+        qk_head_dim = qk_nope + info.qk_rope_head_dim
+        W_3d = W.view(-1, qk_head_dim, W.shape[1])  # [H, qk_head_dim, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # q_nope → √γ
+        W_3d[:, qk_nope:, :].mul_((scales * scales).view(-1, 1, 1))  # q_pe → γ
+    elif info.kind == 'wkv_b':
+        qk_nope = info.qk_nope_head_dim
+        kv_stride = qk_nope + info.v_head_dim
+        W_3d = W.view(-1, kv_stride, W.shape[1])  # [H, kv_stride, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # k_nope → √γ
+        # v rows: not touched (k_R shared rotary unchanged)
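
The split between √γ and γ follows from how an MLA logit is formed: it is the sum of a no-position term `q_nope · k_nope` and a rotary term `q_pe · k_R`, and the shared rotary key `k_R` is left unscaled. Scaling `q_nope` and `k_nope` by √γ each and `q_pe` by γ therefore scales the whole logit by γ. A small dependency-free check of that arithmetic (helper names are illustrative, and plain lists stand in for the projected vectors of one head):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mla_logit(q_nope, q_pe, k_nope, k_rope):
    """Pre-softmax logit for one head: no-position term plus rotary term."""
    return dot(q_nope, k_nope) + dot(q_pe, k_rope)


def clip_head(q_nope, q_pe, k_nope, gamma):
    """Scale q_nope and k_nope by sqrt(gamma), and q_pe by gamma."""
    s = math.sqrt(gamma)
    return (
        [s * x for x in q_nope],
        [gamma * x for x in q_pe],
        [s * x for x in k_nope],
    )


q_nope, q_pe = [1.0, 2.0], [3.0]
k_nope, k_rope = [4.0, 5.0], [6.0]  # k_rope is shared and never scaled
gamma = 0.25

before = mla_logit(q_nope, q_pe, k_nope, k_rope)
q_nope2, q_pe2, k_nope2 = clip_head(q_nope, q_pe, k_nope, gamma)
after = mla_logit(q_nope2, q_pe2, k_nope2, k_rope)
```

Since `compute_scales` stores √γ per head, `qk_clip` obtains γ for the `q_pe` rows by squaring the scale tensor, which is what `scales * scales` does in the `wq_b` branch.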

build/torch28-cxx11-cu129-x86_64-linux/_ops.py CHANGED

@@ -1,9 +1,9 @@
 import torch
-from . import _optimizer_38f9b8e_dirty
-ops = torch.ops._optimizer_38f9b8e_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"_optimizer_38f9b8e_dirty::{op_name}"

 import torch
+from . import _optimizer_8d53b78_dirty
+ops = torch.ops._optimizer_8d53b78_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
+    return f"_optimizer_8d53b78_dirty::{op_name}"

build/torch28-cxx11-cu129-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:db68ba26f1b022f56a5ab4e6e0204bf26df8922750f32f21be0ad76e2674b717
 size 1999872

 version https://git-lfs.github.com/spec/v1
+oid sha256:46baa92bf8f5ec5913df4081a01f662049fda475eb01bc7ed0f6154755fa88d5
 size 1999872

build/torch28-cxx11-cu129-x86_64-linux/cpu_offload.py CHANGED

@@ -93,10 +93,7 @@ class CPUOffloadPool:
                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
-            cpu_flat = torch.empty(off,
-                                   dtype=dtype,
-                                   device="cpu",
-                                   pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
@@ -140,8 +137,7 @@ class CPUOffloadPool:
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
-                    cpu_flat[off:off + n].copy_(local.reshape(-1),
-                                                non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
@@ -159,8 +155,10 @@ class CPUOffloadPool:
                 )
         if not self._logged:
-            logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
-                        offloaded_bytes / (1024**2))
     # ------------------------------------------------------------------
     def reload(self):
@@ -198,12 +196,11 @@ class CPUOffloadPool:
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
-                local.reshape(-1).copy_(cpu_flat[off:off + n],
-                                        non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
-            logger.info("[CPUOffload] Reloaded %.2f MB (CPU → GPU)",
-                        reloaded_bytes / (1024**2))
-            self._logged = True

                 indices.append(idx)
                 offsets.append((off, n))
                 off += n
+            cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
             self._groups[dtype] = {
                 "indices": indices,
                 "offsets": offsets,
                 for i, mgd_idx in enumerate(indices):
                     local = self._local(self._managed[mgd_idx])
                     off, n = offsets[i]
+                    cpu_flat[off : off + n].copy_(local.reshape(-1), non_blocking=True)
                 offloaded_bytes += grp["total"] * cpu_flat.element_size()
                 )
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
+                offloaded_bytes / (1024**2),
+            )
     # ------------------------------------------------------------------
     def reload(self):
             for i, mgd_idx in enumerate(indices):
                 local = self._local(self._managed[mgd_idx])
                 off, n = offsets[i]
+                local.reshape(-1).copy_(cpu_flat[off : off + n], non_blocking=True)
             reloaded_bytes += grp["total"] * cpu_flat.element_size()
         if not self._logged:
+            logger.info(
+                "[CPUOffload] Reloaded %.2f MB (CPU → GPU)", reloaded_bytes / (1024**2)
+            )

build/torch28-cxx11-cu129-x86_64-linux/muon.py CHANGED

@@ -360,7 +360,7 @@ class Muon(torch.optim.Optimizer):
             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
-                qk_clip(p, scales_full, qk_clip_state.head_dim)
     def distributed_muon(
         self,
@@ -407,7 +407,7 @@ class Muon(torch.optim.Optimizer):
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
-                    qk_clip(p, scales_full, qk_clip_state.head_dim)
             if not dtensor_params:
                 return
@@ -1050,12 +1050,16 @@ class Muon(torch.optim.Optimizer):
     def state_dict(self) -> dict:
         if self.cpu_offload:
-            raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
-            raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

             scales_full = compute_scales(
                 p, qk_clip_state) if qk_clip_state is not None else None
             if scales_full is not None:
+                qk_clip(p, scales_full, qk_clip_state)
     def distributed_muon(
         self,
                 scales_full = compute_scales(
                     p, qk_clip_state) if qk_clip_state is not None else None
                 if scales_full is not None:
+                    qk_clip(p, scales_full, qk_clip_state)
             if not dtensor_params:
                 return
     def state_dict(self) -> dict:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save."
+            )
         return super().state_dict()
     def load_state_dict(self, state_dict: dict) -> None:
         if self.cpu_offload:
+            raise RuntimeError(
+                "Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load."
+            )
         super().load_state_dict(state_dict)
         # Invalidate adamw.py's module-level tensor caches so that

build/torch28-cxx11-cu129-x86_64-linux/newton_schulz.py CHANGED

@@ -32,27 +32,30 @@ def _optimal_quintic(l, u, max_iter=1000):
     E = inf
     for _ in range(max_iter):
         old_E = E
-        LHS = np.array([
-            [l, l**3, l**5, 1],
-            [q, q**3, q**5, -1],
-            [r, r**3, r**5, 1],
-            [u, u**3, u**5, -1],
-        ])
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
-            raise ValueError(f"_optimal_quintic: non-finite solve result "
-                             f"a={a}, b={b}, c={c}, E={E}")
         q, r = np.sqrt(
-            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) /
-            (10 * c))
         if not np.all(np.isfinite([q, r])):
-            raise ValueError(
-                f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
-            f"_optimal_quintic: did not converge after {max_iter} iterations")
     return float(a), float(b), float(c)
@@ -111,10 +114,9 @@ def _optimal_composition(l, num_iters, safety_factor_eps=0, cushion=0):
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
-_coeffs_list = _optimal_composition(l=1e-3,
-                                    num_iters=10,
-                                    safety_factor_eps=1e-2,
-                                    cushion=0.02)
 # This code is adapted from:
@@ -148,7 +150,8 @@ def _zeropower_via_newtonschulz5(G, steps):
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
@@ -183,7 +186,8 @@ def _zeropower_via_newtonschulz5_batched(G, steps):
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
-        repeat(_coeffs_list[-1], steps - len(_coeffs_list)))
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

     E = inf
     for _ in range(max_iter):
         old_E = E
+        LHS = np.array(
+            [
+                [l, l**3, l**5, 1],
+                [q, q**3, q**5, -1],
+                [r, r**3, r**5, 1],
+                [u, u**3, u**5, -1],
+            ]
+        )
         a, b, c, E = np.linalg.solve(LHS, np.ones(4))
         if not np.all(np.isfinite([a, b, c, E])):
+            raise ValueError(
+                f"_optimal_quintic: non-finite solve result a={a}, b={b}, c={c}, E={E}"
+            )
         q, r = np.sqrt(
+            (-3 * b + np.array([-1, 1]) * sqrt(9 * b**2 - 20 * a * c)) / (10 * c)
+        )
         if not np.all(np.isfinite([q, r])):
+            raise ValueError(f"_optimal_quintic: non-finite node update q={q}, r={r}")
         if abs(old_E - E) <= 1e-15:
             break
     else:
         raise RuntimeError(
+            f"_optimal_quintic: did not converge after {max_iter} iterations"
+        )
     return float(a), float(b), float(c)
 #   - Polar Express: analytically optimal per step, adapting to the shrinking
 #     singular-value interval [l, u] as iterations progress; converges all
 #     singular values to 1, producing the exact polar factor UV^T.
+_coeffs_list = _optimal_composition(
+    l=1e-3, num_iters=10, safety_factor_eps=1e-2, cushion=0.02
+)
 # This code is adapted from:
     X = X / (X.norm() + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     buf1 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     buf2 = torch.empty(X.size(0), X.size(0), dtype=X.dtype, device=X.device)
     # Perform the NS iterations
     X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
     hs = _coeffs_list[:steps] + list(
+        repeat(_coeffs_list[-1], steps - len(_coeffs_list))
+    )
     for a, b, c in hs:
         buf1 = torch.bmm(X, X.transpose(-2, -1))
         buf2 = torch.bmm(buf1, buf1.transpose(-2, -1))

build/torch28-cxx11-cu129-x86_64-linux/qk_clip.py CHANGED

@@ -13,7 +13,11 @@ logger = logging.getLogger(__name__)
 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
-    ('wq', 'wk', 'q_proj', 'k_proj') and return (kind, layer_index).
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
@@ -23,6 +27,8 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
@@ -37,7 +43,7 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
             layer_idx = int(part)
             break
-    if kind in ('wq', 'wk', 'q_proj', 'k_proj'):
         return kind, layer_idx
     return None, -1
@@ -46,18 +52,26 @@ def parse_qk_layer(name: str) -> tuple[str | None, int]:
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
-    kind: str | None  # 'wq'/'q_proj' or 'wk'/'k_proj' or None
     indices: list[int]  # which heads to consider for clipping
-    head_dim: int  # from config
     threshold: float  # from config
     logit: torch.Tensor | None
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
@@ -70,31 +84,48 @@ def get_qk_clip_info(clip_config, n, qk_logits):
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
-        indices_key = 'q_indices' if 'q' in kind else 'k_indices'
-        indices = clip_config.get(indices_key, []) or []
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
-    return QKClipInfo(
-        kind=kind,
-        indices=indices,
-        head_dim=head_dim,
-        threshold=threshold,
-        logit=logit,
-    )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
-    Returns scales tensor if any head exceeds threshold, else None.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
@@ -118,18 +149,50 @@ def compute_scales(p, qk_clip_state):
     if not head_scales:
         return None
-    H_global = p.shape[0] // head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
-def qk_clip(p, scales, head_dim):
-    """Apply per-head scaling to a Q/K projection weight matrix."""
-    if isinstance(p, torch.nn.Parameter):
-        W = p.data.view(-1, head_dim, p.data.shape[1])
-        W.mul_(scales.view(-1, 1, 1))
-    else:
-        W = p.view(-1, head_dim, p.shape[1])
-        W.mul_(scales.view(-1, 1, 1))

 def parse_qk_layer(name: str) -> tuple[str | None, int]:
     """
     Parse a parameter name to check if it is a query/key projection layer
+    and return (kind, layer_index).
+    Supported kinds:
+        MHA/GQA: 'wq', 'wk', 'q_proj', 'k_proj'
+        MLA:     'wq_b' (Q up-proj), 'wkv_b' (KV up-proj)
     Returns:
         (kind, layer_idx) or (None, -1) if not matched.
         'model.5.attn.wk.weight'      -> ('wk', 5)
         'model.2.attn.q_proj.weight'  -> ('q_proj', 2)
         'model.7.attn.k_proj.weight'  -> ('k_proj', 7)
+        'model.1.attn.wq_b.weight'    -> ('wq_b', 1)
+        'model.0.attn.wkv_b.weight'   -> ('wkv_b', 0)
         'model.4.attn.v_proj.weight'  -> (None, -1)
     """
     parts = normalize_fqn(name).split('.')
             layer_idx = int(part)
             break
+    if kind in ('wq', 'wk', 'q_proj', 'k_proj', 'wq_b', 'wkv_b'):
         return kind, layer_idx
     return None, -1
 @dataclass
 class QKClipInfo:
     """Per-parameter dynamic info computed from config + runtime logits."""
+    kind: str | None  # 'wq'/'q_proj'/'wq_b' or 'wk'/'k_proj'/'wkv_b' or None
     indices: list[int]  # which heads to consider for clipping
+    head_dim: int  # from config (qk_head_dim for MLA wq_b)
     threshold: float  # from config
     logit: torch.Tensor | None
+    # MLA-specific fields
+    is_mla: bool = False
+    qk_nope_head_dim: int = 0
+    qk_rope_head_dim: int = 0
+    v_head_dim: int = 0
 def get_qk_clip_info(clip_config, n, qk_logits):
     """Extract QK clipping info for a named parameter.
     Args:
         clip_config: QK clipping configuration dict (or None).
+            MHA/GQA keys: head_dim, threshold, q_indices, k_indices
+            MLA extra keys: is_mla=True, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
         n: Parameter name string.
         qk_logits: Dict mapping layer indices to logit tensors (or None).
     head_dim = clip_config.get('head_dim')
     threshold = clip_config.get('threshold')
     kind, layer_idx = parse_qk_layer(n)
+    is_mla = clip_config.get('is_mla', False)
     logit, indices = None, []
     if qk_logits is not None and kind is not None:
         logit = qk_logits[layer_idx]
         if isinstance(logit, DTensor):
             # In TP settings, qk_logits may be DTensor
             # We convert it to full tensor here for simplicity
             logit = logit.full_tensor()
+        if kind in ('wq_b', 'wq', 'q_proj'):
+            indices = clip_config.get('q_indices', []) or []
+        elif kind in ('wkv_b', 'wk', 'k_proj'):
+            indices = clip_config.get('k_indices', []) or []
+    if is_mla:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+            is_mla=True,
+            qk_nope_head_dim=clip_config['qk_nope_head_dim'],
+            qk_rope_head_dim=clip_config['qk_rope_head_dim'],
+            v_head_dim=clip_config['v_head_dim'],
+        )
+    else:
+        return QKClipInfo(
+            kind=kind,
+            indices=indices,
+            head_dim=head_dim,
+            threshold=threshold,
+            logit=logit,
+        )
 def compute_scales(p, qk_clip_state):
     """Compute per-head scaling factors for QK clipping.
+    Returns scales tensor (√γ per head) if any head exceeds threshold, else None.
+    For MLA wkv_b, effective row stride is qk_nope_head_dim + v_head_dim.
     """
     kind = qk_clip_state.kind
     indices = qk_clip_state.indices
     if not head_scales:
         return None
+    # For MLA wkv_b, each KV head spans qk_nope_head_dim + v_head_dim rows
+    if qk_clip_state.is_mla and kind == 'wkv_b':
+        effective_head_dim = qk_clip_state.qk_nope_head_dim + qk_clip_state.v_head_dim
+    else:
+        effective_head_dim = head_dim
+    H_global = p.shape[0] // effective_head_dim
     scales_full = torch.ones(H_global, device=p.data.device)
     for head_idx, scale in head_scales.items():
         scales_full[head_idx] = scale
     return scales_full
+def qk_clip(p, scales, info):
+    """Apply per-head scaling to a Q/K projection weight matrix.
+    Args:
+        p: Parameter (nn.Parameter or raw tensor).
+        scales: [n_heads] tensor, each element = √γ_h.
+        info: QKClipInfo with kind, head_dim, and MLA sub-head dimensions.
+    MLA sub-region scaling per Algorithm 1 (MuonClip):
+        wq_b: q_nope rows → √γ,  q_pe rows → γ
+        wkv_b: k_nope rows → √γ, v rows → unchanged
+    """
+    W = p.data if isinstance(p, torch.nn.Parameter) else p
+    if not info.is_mla:
+        # MHA/GQA: uniform √γ applied to all rows in each head
+        W.view(-1, info.head_dim, W.shape[1]).mul_(scales.view(-1, 1, 1))
+        return
+    # MLA: vectorized sub-region scaling within each head
+    if info.kind == 'wq_b':
+        qk_nope = info.qk_nope_head_dim
+        qk_head_dim = qk_nope + info.qk_rope_head_dim
+        W_3d = W.view(-1, qk_head_dim, W.shape[1])  # [H, qk_head_dim, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # q_nope → √γ
+        W_3d[:, qk_nope:, :].mul_((scales * scales).view(-1, 1,
+                                                         1))  # q_pe   → γ
+    elif info.kind == 'wkv_b':
+        qk_nope = info.qk_nope_head_dim
+        kv_stride = qk_nope + info.v_head_dim
+        W_3d = W.view(-1, kv_stride, W.shape[1])  # [H, kv_stride, in_dim]
+        W_3d[:, :qk_nope, :].mul_(scales.view(-1, 1, 1))  # k_nope → √γ
+        # v rows are left untouched; the shared rotary key k_R is not part of wkv_b and is also unchanged
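
The sub-region factors in the qk_clip docstring (q_nope → √γ, q_pe → γ, k_nope → √γ, shared rotary key unchanged) can be sanity-checked numerically. The sketch below is not taken from the commit: it assumes the usual MLA logit decomposition (a no-position term plus a rotary term), uses plain Python lists instead of weight matrices, and every name in it (dot, mla_logit, gamma, k_rope) is hypothetical. Because the projections are linear, scaling rows of wq_b or wkv_b scales the corresponding slice of the projected vector by the same factor, so scaling the vectors directly is equivalent for this check.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def mla_logit(q_nope, q_pe, k_nope, k_rope):
    """Per-head MLA attention logit (softmax scale omitted):
    no-position term plus rotary term."""
    return dot(q_nope, k_nope) + dot(q_pe, k_rope)


# Hypothetical clip factor for one head, e.g. threshold / max_logit.
gamma = 0.25
sqrt_gamma = math.sqrt(gamma)

# Toy per-head vectors (illustrative values only).
q_nope = [1.0, -2.0, 0.5]
q_pe = [3.0, 1.0]
k_nope = [0.5, 4.0, -1.0]
k_rope = [2.0, -1.0]  # shared rotary key, never rescaled

before = mla_logit(q_nope, q_pe, k_nope, k_rope)

# Apply the sub-region factors listed in the qk_clip docstring.
q_nope_c = [sqrt_gamma * x for x in q_nope]  # wq_b, q_nope rows: sqrt(gamma)
q_pe_c = [gamma * x for x in q_pe]           # wq_b, q_pe rows:   gamma
k_nope_c = [sqrt_gamma * x for x in k_nope]  # wkv_b, k_nope rows: sqrt(gamma)

after = mla_logit(q_nope_c, q_pe_c, k_nope_c, k_rope)

# Both terms pick up exactly one factor of gamma, so the whole logit does.
assert math.isclose(after, gamma * before)
```

The rotary term receives the full γ on the query side because the shared rotary key cannot be rescaled per head; the no-position term splits γ evenly as √γ on each side. The value rows do not enter the logit, which is consistent with leaving them unscaled.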

build/torch28-cxx11-rocm63-x86_64-linux/_ops.py CHANGED

@@ -1,9 +1,9 @@
 import torch
-from . import _optimizer_38f9b8e_dirty
-ops = torch.ops._optimizer_38f9b8e_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"_optimizer_38f9b8e_dirty::{op_name}"

 import torch
+from . import _optimizer_8d53b78_dirty
+ops = torch.ops._optimizer_8d53b78_dirty
 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
+    return f"_optimizer_8d53b78_dirty::{op_name}"

build/torch28-cxx11-rocm63-x86_64-linux/{_optimizer_38f9b8e_dirty.abi3.so → _optimizer_8d53b78_dirty.abi3.so} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5a6a8788f055b22d594330fc06487ae2c6eeb2b64e0ab0132b68036a78560cf6
 size 1865080

 version https://git-lfs.github.com/spec/v1
+oid sha256:bcf5b8838dfaf6e81fdbd52ff4638ca76abaa678f7c2cbd81cf03dc72f9cd5d2
 size 1865080