Kernels
wyldecat Claude Opus 4.6 (1M context) github-actions[bot] committed on
Commit
05a75f1
·
unverified ·
1 Parent(s): b61425a

Replace cpu_offload constructor param with turn_on/turn_off API (#26)

Browse files

* Replace cpu_offload constructor param with turn_on/turn_off API [skip-build]

Remove the cpu_offload boolean from Muon.__init__ and add explicit
turn_on_cpu_offload() / turn_off_cpu_offload() methods instead.
state_dict and load_state_dict now require offload to be disabled
first (RuntimeError). Preserves AdamW tensor cache invalidation
on load_state_dict. Adds test_toggle_correctness test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: skip empty FSDP shards in CPU offload and add storage validation

Empty FSDP shards (storage size=0) were being tracked by CPUOffloadPool,
causing double-free errors on resize_(0) during offload. This led to
hangs on ranks 1-7 while rank 0 succeeded.

- Skip tensors with empty storage in track() to avoid double-free
- Add storage size validation in offload() and reload() with RuntimeError
- Add logging for turn_on/turn_off_cpu_offload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add built binary [skip-build]

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. build/torch210-cxx11-cu126-x86_64-linux/_ops.py +3 -3
  2. build/torch210-cxx11-cu126-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  3. build/torch210-cxx11-cu126-x86_64-linux/cpu_offload.py +31 -10
  4. build/torch210-cxx11-cu126-x86_64-linux/distributed/utils.py +16 -12
  5. build/torch210-cxx11-cu126-x86_64-linux/muon.py +44 -48
  6. build/torch210-cxx11-cu128-x86_64-linux/_ops.py +3 -3
  7. build/torch210-cxx11-cu128-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  8. build/torch210-cxx11-cu128-x86_64-linux/cpu_offload.py +31 -10
  9. build/torch210-cxx11-cu128-x86_64-linux/distributed/utils.py +16 -12
  10. build/torch210-cxx11-cu128-x86_64-linux/muon.py +44 -48
  11. build/torch210-cxx11-cu130-x86_64-linux/_ops.py +3 -3
  12. build/torch210-cxx11-cu130-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  13. build/torch210-cxx11-cu130-x86_64-linux/cpu_offload.py +31 -10
  14. build/torch210-cxx11-cu130-x86_64-linux/distributed/utils.py +16 -12
  15. build/torch210-cxx11-cu130-x86_64-linux/muon.py +44 -48
  16. build/torch210-cxx11-rocm70-x86_64-linux/_ops.py +3 -3
  17. build/torch210-cxx11-rocm70-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  18. build/torch210-cxx11-rocm70-x86_64-linux/cpu_offload.py +31 -10
  19. build/torch210-cxx11-rocm70-x86_64-linux/distributed/utils.py +16 -12
  20. build/torch210-cxx11-rocm70-x86_64-linux/muon.py +44 -48
  21. build/torch210-cxx11-rocm71-x86_64-linux/_ops.py +3 -3
  22. build/torch210-cxx11-rocm71-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  23. build/torch210-cxx11-rocm71-x86_64-linux/cpu_offload.py +31 -10
  24. build/torch210-cxx11-rocm71-x86_64-linux/distributed/utils.py +16 -12
  25. build/torch210-cxx11-rocm71-x86_64-linux/muon.py +44 -48
  26. build/torch28-cxx11-cu126-x86_64-linux/_ops.py +3 -3
  27. build/torch28-cxx11-cu126-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  28. build/torch28-cxx11-cu126-x86_64-linux/cpu_offload.py +31 -10
  29. build/torch28-cxx11-cu126-x86_64-linux/distributed/utils.py +16 -12
  30. build/torch28-cxx11-cu126-x86_64-linux/muon.py +44 -48
  31. build/torch28-cxx11-cu128-x86_64-linux/_ops.py +3 -3
  32. build/torch28-cxx11-cu128-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  33. build/torch28-cxx11-cu128-x86_64-linux/cpu_offload.py +31 -10
  34. build/torch28-cxx11-cu128-x86_64-linux/distributed/utils.py +16 -12
  35. build/torch28-cxx11-cu128-x86_64-linux/muon.py +44 -48
  36. build/torch28-cxx11-cu129-x86_64-linux/_ops.py +3 -3
  37. build/torch28-cxx11-cu129-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  38. build/torch28-cxx11-cu129-x86_64-linux/cpu_offload.py +31 -10
  39. build/torch28-cxx11-cu129-x86_64-linux/distributed/utils.py +16 -12
  40. build/torch28-cxx11-cu129-x86_64-linux/muon.py +44 -48
  41. build/torch28-cxx11-rocm63-x86_64-linux/_ops.py +3 -3
  42. build/torch28-cxx11-rocm63-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  43. build/torch28-cxx11-rocm63-x86_64-linux/cpu_offload.py +31 -10
  44. build/torch28-cxx11-rocm63-x86_64-linux/distributed/utils.py +16 -12
  45. build/torch28-cxx11-rocm63-x86_64-linux/muon.py +44 -48
  46. build/torch28-cxx11-rocm64-x86_64-linux/_ops.py +3 -3
  47. build/torch28-cxx11-rocm64-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} +1 -1
  48. build/torch28-cxx11-rocm64-x86_64-linux/cpu_offload.py +31 -10
  49. build/torch28-cxx11-rocm64-x86_64-linux/distributed/utils.py +16 -12
  50. build/torch28-cxx11-rocm64-x86_64-linux/muon.py +44 -48
build/torch210-cxx11-cu126-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch210-cxx11-cu126-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:90ace47a61519aefe759810c803789e7f91e6949ca0b04fc177e311709976334
3
  size 1940944
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7be82307f66be4bb841072ecdb3d105dc73bc9ee9ca21b1ce33bddc24113f4d1
3
  size 1940944
build/torch210-cxx11-cu126-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch210-cxx11-cu126-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch210-cxx11-cu126-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()
 
 
 
 
build/torch210-cxx11-cu128-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch210-cxx11-cu128-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1abfa69cd254e0000246a074c0bfa53c2e72bb53cc5fa8216275295cd021c57a
3
  size 2004144
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7fa10afe7c505f69ccf7a98aca116b6d551b9577ecce2dab2559c6c3b433be20
3
  size 2004144
build/torch210-cxx11-cu128-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch210-cxx11-cu128-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch210-cxx11-cu128-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()
 
 
 
 
build/torch210-cxx11-cu130-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch210-cxx11-cu130-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6869cfabdf45c7092d251846b3099287f8bccd5c5ebe7edf1a5fd21436324349
3
  size 2004728
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f72217eb59ce93935f593b0fdfc7f3bfc4e05f18ad9d5384c2325b27ad7ff136
3
  size 2004728
build/torch210-cxx11-cu130-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch210-cxx11-cu130-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch210-cxx11-cu130-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()
 
 
 
 
build/torch210-cxx11-rocm70-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch210-cxx11-rocm70-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0102e10121a43f6d5d59a23f2c0e21d88469cc4597d84f7d48b64b0fabfeacdb
3
  size 1866400
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:df63b7dafe62fa5e910de1123729cbe3496015e2c0110785d9bd510bf65c2eaa
3
  size 1866400
build/torch210-cxx11-rocm70-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch210-cxx11-rocm70-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch210-cxx11-rocm70-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()
 
 
 
 
build/torch210-cxx11-rocm71-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch210-cxx11-rocm71-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f41709878a4def27b12f4f9a4f5b767027fb33141e775f64ad04d434fcbe33d9
3
  size 1866112
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7215ee4575fa44f482a98e563c5af2d60089e36d32fe8a3dcffe3fb5f587300f
3
  size 1866112
build/torch210-cxx11-rocm71-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch210-cxx11-rocm71-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch210-cxx11-rocm71-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()
 
 
 
 
build/torch28-cxx11-cu126-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch28-cxx11-cu126-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:08e3ee2f567d7a89ba34a82429c2f47cdb17d53ef66dc7d5751cabeafa01ce0f
3
  size 1936664
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:37bcda6973440cdeb880e411dbaf12220cef0bab18299b4922b6a504ab109b42
3
  size 1936664
build/torch28-cxx11-cu126-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch28-cxx11-cu126-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch28-cxx11-cu126-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()
 
 
 
 
build/torch28-cxx11-cu128-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch28-cxx11-cu128-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e08baad750646c67f23c6e7c4d0e1b7266eeffed3bbb730729ba8f37e120a2b1
3
  size 1999872
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e9cd35014d41034ed35fbad31c19c80e7b3977cea889a865eb12db705678bb29
3
  size 1999872
build/torch28-cxx11-cu128-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch28-cxx11-cu128-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch28-cxx11-cu128-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()
 
 
 
 
build/torch28-cxx11-cu129-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch28-cxx11-cu129-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c485caa290f4b43e49db4ceafe25f0d0840dcdd61d02a5aecfa78d8f9acc9b60
3
  size 1999872
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:64072037c62afbfefe92a35b026794f2bd406fddef38bf58d318d3bae7652a29
3
  size 1999872
build/torch28-cxx11-cu129-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch28-cxx11-cu129-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch28-cxx11-cu129-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()
 
 
 
 
build/torch28-cxx11-rocm63-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch28-cxx11-rocm63-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9a57395ef49976af61778f127cfdeace6a4c35b491b9903e48b1cd7199ee217c
3
  size 1865080
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:65135fe756ed97f2bd21fefda883b6a7b90179ebd7c0a882673239daf9d9aa6a
3
  size 1865080
build/torch28-cxx11-rocm63-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch28-cxx11-rocm63-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch28-cxx11-rocm63-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()
 
 
 
 
build/torch28-cxx11-rocm64-x86_64-linux/_ops.py CHANGED
@@ -1,9 +1,9 @@
1
  import torch
2
- from . import _optimizer_5b58933_dirty
3
- ops = torch.ops._optimizer_5b58933_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
- return f"_optimizer_5b58933_dirty::{op_name}"
 
1
  import torch
2
+ from . import _optimizer_b68ea5b_dirty
3
+ ops = torch.ops._optimizer_b68ea5b_dirty
4
 
5
  def add_op_namespace_prefix(op_name: str):
6
  """
7
  Prefix op by namespace.
8
  """
9
+ return f"_optimizer_b68ea5b_dirty::{op_name}"
build/torch28-cxx11-rocm64-x86_64-linux/{_optimizer_5b58933_dirty.abi3.so → _optimizer_b68ea5b_dirty.abi3.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:fd57f2197a2107ad920abbce3e2c986b79c76cb864f693f53bd389b26b763902
3
  size 1865168
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c562a3b7e6f3032ff56473531e9e08fceb2c86f8804080330c896ab8f0dd32af
3
  size 1865168
build/torch28-cxx11-rocm64-x86_64-linux/cpu_offload.py CHANGED
@@ -68,7 +68,11 @@ class CPUOffloadPool:
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
- self._storage_nbytes[tid] = local.untyped_storage().size()
 
 
 
 
72
  self._managed.append(tensor)
73
 
74
  # ------------------------------------------------------------------
@@ -89,7 +93,10 @@ class CPUOffloadPool:
89
  indices.append(idx)
90
  offsets.append((off, n))
91
  off += n
92
- cpu_flat = torch.empty(off, dtype=dtype, device="cpu", pin_memory=True)
 
 
 
93
  self._groups[dtype] = {
94
  "indices": indices,
95
  "offsets": offsets,
@@ -117,8 +124,7 @@ class CPUOffloadPool:
117
  self._ensure_stream()
118
 
119
  # Offload stream waits for compute to finish.
120
- compute_event = torch.cuda.current_stream(
121
- self._device).record_event()
122
  self._offload_stream.wait_event(compute_event)
123
 
124
  offloaded_bytes = 0
@@ -134,15 +140,23 @@ class CPUOffloadPool:
134
  for i, mgd_idx in enumerate(indices):
135
  local = self._local(self._managed[mgd_idx])
136
  off, n = offsets[i]
137
- cpu_flat[off:off + n].copy_(
138
- local.reshape(-1), non_blocking=True)
139
 
140
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
141
 
142
  # Wait for all D2H copies to land, then free GPU storage.
143
  self._offload_stream.synchronize()
144
  for t in self._managed:
145
- self._local(t).untyped_storage().resize_(0)
 
 
 
 
 
 
 
 
146
 
147
  if not self._logged:
148
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
@@ -165,7 +179,14 @@ class CPUOffloadPool:
165
  # Re-allocate all GPU storages first.
166
  for t in self._managed:
167
  local = self._local(t)
168
- local.untyped_storage().resize_(self._storage_nbytes[id(t)])
 
 
 
 
 
 
 
169
 
170
  # Per-tensor H2D copies from CPU flat buffer slices.
171
  # non_blocking=True with pinned source allows DMA overlap.
@@ -177,8 +198,8 @@ class CPUOffloadPool:
177
  for i, mgd_idx in enumerate(indices):
178
  local = self._local(self._managed[mgd_idx])
179
  off, n = offsets[i]
180
- local.reshape(-1).copy_(
181
- cpu_flat[off:off + n], non_blocking=True)
182
 
183
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
184
 
 
68
  local = self._local(tensor)
69
  if self._device is None:
70
  self._device = local.device
71
+ storage = local.untyped_storage()
72
+ # Skip tensors with empty storage (e.g. empty FSDP shards)
73
+ if storage.size() == 0:
74
+ return
75
+ self._storage_nbytes[tid] = storage.size()
76
  self._managed.append(tensor)
77
 
78
  # ------------------------------------------------------------------
 
93
  indices.append(idx)
94
  offsets.append((off, n))
95
  off += n
96
+ cpu_flat = torch.empty(off,
97
+ dtype=dtype,
98
+ device="cpu",
99
+ pin_memory=True)
100
  self._groups[dtype] = {
101
  "indices": indices,
102
  "offsets": offsets,
 
124
  self._ensure_stream()
125
 
126
  # Offload stream waits for compute to finish.
127
+ compute_event = torch.cuda.current_stream(self._device).record_event()
 
128
  self._offload_stream.wait_event(compute_event)
129
 
130
  offloaded_bytes = 0
 
140
  for i, mgd_idx in enumerate(indices):
141
  local = self._local(self._managed[mgd_idx])
142
  off, n = offsets[i]
143
+ cpu_flat[off:off + n].copy_(local.reshape(-1),
144
+ non_blocking=True)
145
 
146
  offloaded_bytes += grp["total"] * cpu_flat.element_size()
147
 
148
  # Wait for all D2H copies to land, then free GPU storage.
149
  self._offload_stream.synchronize()
150
  for t in self._managed:
151
+ storage = self._local(t).untyped_storage()
152
+ if storage.size() != 0:
153
+ storage.resize_(0)
154
+ else:
155
+ raise RuntimeError(
156
+ f"Tensor storage is already freed (size=0) before offload. "
157
+ f"This indicates a double-free or external interference. "
158
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
159
+ )
160
 
161
  if not self._logged:
162
  logger.info("[CPUOffload] Offloaded %.2f MB (GPU → CPU)",
 
179
  # Re-allocate all GPU storages first.
180
  for t in self._managed:
181
  local = self._local(t)
182
+ storage = local.untyped_storage()
183
+ if storage.size() != 0:
184
+ raise RuntimeError(
185
+ f"Storage should have been freed (size=0) before reload, "
186
+ f"but got size={storage.size()}. "
187
+ f"Tensor shape: {t.shape}, dtype: {t.dtype}"
188
+ )
189
+ storage.resize_(self._storage_nbytes[id(t)])
190
 
191
  # Per-tensor H2D copies from CPU flat buffer slices.
192
  # non_blocking=True with pinned source allows DMA overlap.
 
198
  for i, mgd_idx in enumerate(indices):
199
  local = self._local(self._managed[mgd_idx])
200
  off, n = offsets[i]
201
+ local.reshape(-1).copy_(cpu_flat[off:off + n],
202
+ non_blocking=True)
203
 
204
  reloaded_bytes += grp["total"] * cpu_flat.element_size()
205
 
build/torch28-cxx11-rocm64-x86_64-linux/distributed/utils.py CHANGED
@@ -163,9 +163,10 @@ def construct_shard_mesh(
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
- # This avoids a non-collective dist.new_group() call, which would
167
- # deadlock when only a subset of ranks call this function (e.g. expert
168
- # DTensors on a TP submesh where ranks 0-3 and 4-7 call separately).
 
169
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
170
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
171
  if key not in _ranks_to_dist_cache:
@@ -207,22 +208,25 @@ def construct_shard_mesh(
207
  assert len(shard_placements) == len(set(shard_placements))
208
 
209
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
210
- # All ranks must call dist.new_group in the same order, even though each
211
- # rank only joins one group.
 
 
212
  def _cache_key(t: torch.Tensor) -> tuple:
213
  return (*t.shape, *t.flatten().tolist())
214
 
215
  my_key = None
216
  for sm in shard_meshes:
217
- key = _cache_key(sm)
218
  if (my_rank == sm).any().item():
 
219
  assert my_key is None, "Rank appears in multiple shard groups"
220
  my_key = key
221
- if key not in _ranks_to_dist_cache:
222
- pg = dist.new_group(sm.flatten().tolist())
223
- _ranks_to_dist_cache[key] = (
224
- DeviceMesh(device_type="cuda", mesh=sm),
225
- pg,
226
- )
 
227
 
228
  return (*_ranks_to_dist_cache[my_key], shard_placements)
 
163
  assert mesh.mesh.device.type == 'cpu'
164
 
165
  # -- Fast path: 1D all-shard mesh → reuse existing PG. ----------------
166
+ # Reuses the mesh's existing ProcessGroup directly, avoiding the
167
+ # overhead of dist.new_group(). The standard path below also handles
168
+ # subset calls safely via use_local_synchronization=True, but this
169
+ # fast path is still beneficial for the common 1D shard case.
170
  if mesh.ndim == 1 and len(placements) == 1 and _is_shard(placements[0]):
171
  key = (*mesh.mesh.shape, *mesh.mesh.flatten().tolist())
172
  if key not in _ranks_to_dist_cache:
 
208
  assert len(shard_placements) == len(set(shard_placements))
209
 
210
  # -- Step 4: Create/retrieve ProcessGroup for current rank's sub-mesh. --
211
+ # Each rank only creates the group it belongs to, using
212
+ # use_local_synchronization=True so that only group members need to
213
+ # coordinate. This avoids deadlocks when different PP stages call
214
+ # construct_shard_mesh for different parameters.
215
  def _cache_key(t: torch.Tensor) -> tuple:
216
  return (*t.shape, *t.flatten().tolist())
217
 
218
  my_key = None
219
  for sm in shard_meshes:
 
220
  if (my_rank == sm).any().item():
221
+ key = _cache_key(sm)
222
  assert my_key is None, "Rank appears in multiple shard groups"
223
  my_key = key
224
+ if key not in _ranks_to_dist_cache:
225
+ pg = dist.new_group(sm.flatten().tolist(),
226
+ use_local_synchronization=True)
227
+ _ranks_to_dist_cache[key] = (
228
+ DeviceMesh(device_type="cuda", mesh=sm),
229
+ pg,
230
+ )
231
 
232
  return (*_ranks_to_dist_cache[my_key], shard_placements)
build/torch28-cxx11-rocm64-x86_64-linux/muon.py CHANGED
@@ -8,7 +8,7 @@ import torch.distributed as dist
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
- from .adamw import step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
@@ -207,8 +207,7 @@ class Muon(torch.optim.Optimizer):
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
- expert_keys=None,
211
- cpu_offload=False):
212
  defaults = dict(
213
  lr=lr,
214
  weight_decay=weight_decay,
@@ -229,7 +228,6 @@ class Muon(torch.optim.Optimizer):
229
  if param_group.get("use_muon", None) is None:
230
  raise ValueError(
231
  error_message.format(idx=_idx) + instruction_code)
232
-
233
  super().__init__(params, defaults)
234
 
235
  self.debug = debug
@@ -243,8 +241,8 @@ class Muon(torch.optim.Optimizer):
243
  self.chunk_size = chunk_size
244
  self.use_distributed_muon = use_distributed_muon
245
  self.expert_keys = expert_keys
246
- self.cpu_offload = cpu_offload
247
- self._cpu_offload_pool = CPUOffloadPool() if cpu_offload else None
248
  self._offload_initialized = False
249
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
250
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
@@ -1008,6 +1006,8 @@ class Muon(torch.optim.Optimizer):
1008
  # D2H: offload optimizer states to CPU after computation.
1009
  if self.cpu_offload:
1010
  if not self._offload_initialized:
 
 
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
@@ -1015,54 +1015,50 @@ class Muon(torch.optim.Optimizer):
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
- # Checkpoint support for cpu_offload
1019
  # ------------------------------------------------------------------
1020
 
1021
- def state_dict(self) -> dict:
1022
- """Return optimizer state dict, reloading offloaded states first.
1023
-
1024
- When ``cpu_offload=True``, optimizer state tensors have their GPU
1025
- storage freed (``resize_(0)``) between steps. We reload them,
1026
- snapshot the state dict, then re-offload so the optimizer stays
1027
- in the expected post-step state. The returned dict holds cloned
1028
- tensors so they remain valid after the re-offload frees the
1029
- originals' GPU storage.
1030
- """
1031
- if self.cpu_offload and self._offload_initialized:
 
 
 
 
 
 
 
 
 
1032
  self._cpu_offload_pool.reload()
1033
  torch.cuda.current_stream().synchronize()
1034
- sd = super().state_dict()
1035
- if self.cpu_offload and self._offload_initialized:
1036
- # Clone state tensors so the returned dict survives re-offload
1037
- # (which frees GPU storage on the originals via resize_(0)).
1038
- for k in sd["state"]:
1039
- sd["state"][k] = {
1040
- sk: sv.clone() if isinstance(sv, torch.Tensor) else sv
1041
- for sk, sv in sd["state"][k].items()
1042
- }
1043
- self._cpu_offload_pool.offload()
1044
- return sd
1045
 
1046
- def load_state_dict(self, state_dict: dict) -> None:
1047
- """Load optimizer state dict, then offload states if needed.
 
1048
 
1049
- After ``super().load_state_dict()`` populates GPU tensors, we
1050
- re-register them with the offload pool and offload to CPU so the
1051
- optimizer is in the same post-step state (GPU storage freed).
1052
- """
1053
- # If states were offloaded, reload first so storage sizes are
1054
- # correct for super().load_state_dict() to overwrite.
1055
- if self.cpu_offload and self._offload_initialized:
1056
- self._cpu_offload_pool.reload()
1057
- torch.cuda.current_stream().synchronize()
1058
 
 
 
 
1059
  super().load_state_dict(state_dict)
1060
 
1061
- if self.cpu_offload:
1062
- # Re-create the offload pool since state tensors may be new
1063
- # objects after load_state_dict.
1064
- self._cpu_offload_pool = CPUOffloadPool()
1065
- self._offload_initialized = False
1066
- self._register_states_for_offload()
1067
- self._offload_initialized = True
1068
- self._cpu_offload_pool.offload()
 
8
  from torch.distributed.tensor import DTensor, Replicate, Shard
9
  from torch.profiler import record_function
10
 
11
+ from .adamw import _placement_cache, _tensor_cache, step_adamw
12
  from .async_utils import run_pipeline
13
  from .core import (_muon_state, adjust_lr_for_muon, batch_pre_ortho,
14
  get_default_muon_param_groups, is_expert_param, update_p)
 
207
  warmup_step=5,
208
  chunk_size=-1,
209
  use_distributed_muon=False,
210
+ expert_keys=None):
 
211
  defaults = dict(
212
  lr=lr,
213
  weight_decay=weight_decay,
 
228
  if param_group.get("use_muon", None) is None:
229
  raise ValueError(
230
  error_message.format(idx=_idx) + instruction_code)
 
231
  super().__init__(params, defaults)
232
 
233
  self.debug = debug
 
241
  self.chunk_size = chunk_size
242
  self.use_distributed_muon = use_distributed_muon
243
  self.expert_keys = expert_keys
244
+ self.cpu_offload = False
245
+ self._cpu_offload_pool: CPUOffloadPool | None = None
246
  self._offload_initialized = False
247
  self._parallel_cache: dict[tuple[str, ...], dict] = {}
248
  self._expert_expand_cache: dict[tuple[int, ...], dict] = {}
 
1006
  # D2H: offload optimizer states to CPU after computation.
1007
  if self.cpu_offload:
1008
  if not self._offload_initialized:
1009
+ if self._cpu_offload_pool is None:
1010
+ self._cpu_offload_pool = CPUOffloadPool()
1011
  self._register_states_for_offload()
1012
  self._offload_initialized = True
1013
  self._cpu_offload_pool.offload()
 
1015
  return loss
1016
 
1017
  # ------------------------------------------------------------------
1018
+ # CPU offload public helpers
1019
  # ------------------------------------------------------------------
1020
 
1021
+ def turn_on_cpu_offload(self):
1022
+ """Enable CPU offload for optimizer states."""
1023
+ if self.cpu_offload:
1024
+ return
1025
+ logger.info("[Muon] turn_on_cpu_offload")
1026
+ self.cpu_offload = True
1027
+ if not self.state:
1028
+ return
1029
+ self._cpu_offload_pool = CPUOffloadPool()
1030
+ self._offload_initialized = False
1031
+ self._register_states_for_offload()
1032
+ self._offload_initialized = True
1033
+ self._cpu_offload_pool.offload()
1034
+
1035
+ def turn_off_cpu_offload(self):
1036
+ """Disable CPU offload and keep optimizer states resident on GPU."""
1037
+ if not self.cpu_offload:
1038
+ return
1039
+ logger.info("[Muon] turn_off_cpu_offload")
1040
+ if self._offload_initialized:
1041
  self._cpu_offload_pool.reload()
1042
  torch.cuda.current_stream().synchronize()
1043
+ self._cpu_offload_pool = None
1044
+ self._offload_initialized = False
1045
+ self.cpu_offload = False
 
 
 
 
 
 
 
 
1046
 
1047
+ # ------------------------------------------------------------------
1048
+ # Checkpoint support for cpu_offload
1049
+ # ------------------------------------------------------------------
1050
 
1051
+ def state_dict(self) -> dict:
1052
+ if self.cpu_offload:
1053
+ raise RuntimeError("Muon.state_dict() requires turn_off_cpu_offload() before checkpoint save.")
1054
+ return super().state_dict()
 
 
 
 
 
1055
 
1056
+ def load_state_dict(self, state_dict: dict) -> None:
1057
+ if self.cpu_offload:
1058
+ raise RuntimeError("Muon.load_state_dict() requires turn_off_cpu_offload() before checkpoint load.")
1059
  super().load_state_dict(state_dict)
1060
 
1061
+ # Invalidate adamw.py's module-level tensor caches so that
1062
+ # the next step rebuilds them with the newly loaded state tensors.
1063
+ _placement_cache.clear()
1064
+ _tensor_cache.clear()