| Lmod has detected the following error: The following module(s) are unknown: |
| "buildenv-gcccuda/12.1.1-gcc12.3.0" |
| |
| Please check the spelling or version number. Also try "module spider ..." |
| It is also possible your cache file is out-of-date; it may help to try: |
| $ module --ignore_cache load "buildenv-gcccuda/12.1.1-gcc12.3.0" |
| |
| Also make sure that all modulefiles written in TCL start with the string |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'training': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information |
| warnings.warn(msg, UserWarning) |
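The Hydra warning above has a one-line fix: add `_self_` to the defaults list of the `training` config so composition order is explicit. A minimal sketch, assuming a config file roughly like the following (the file path and the other defaults entries are hypothetical):

```yaml
# configurations/training.yaml (hypothetical path and entries)
defaults:
  - dataset: base      # illustrative pre-existing entry
  - _self_             # compose this file's own keys last (Hydra >= 1.1)
```

Placing `_self_` last means keys defined directly in `training.yaml` override the composed defaults, which matches Hydra 1.1+ default behavior and silences the warning.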
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/__init__.py:40: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead. |
| warnings.warn( |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights. |
| warnings.warn(msg) |
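The torchvision warnings above come from legacy calls of the form `alexnet(pretrained=True)` (triggered here inside lpips); the fix is to pass a weights enum instead. A pure-Python sketch of the shim torchvision applies, with a stand-in enum so it runs without torchvision installed (the real enum lives in `torchvision.models`):

```python
from enum import Enum

class AlexNet_Weights(Enum):
    # stand-in for torchvision.models.AlexNet_Weights; DEFAULT is an
    # alias of the same value, as in torchvision
    IMAGENET1K_V1 = "alexnet-imagenet1k-v1"
    DEFAULT = "alexnet-imagenet1k-v1"

def alexnet(weights=None, pretrained=None):
    # mirrors the legacy shim: a boolean `pretrained` is mapped onto
    # the weights enum (this path is what emits the UserWarning)
    if pretrained is not None:
        weights = AlexNet_Weights.IMAGENET1K_V1 if pretrained else None
    return weights  # real torchvision would now load these weights

legacy = alexnet(pretrained=True)                   # deprecated spelling
modern = alexnet(weights=AlexNet_Weights.DEFAULT)   # preferred spelling
```

Both spellings resolve to the same weights today, but only the enum form survives the deprecation.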
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lpips/lpips.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md |
| self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False) |
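The FutureWarning above is about `torch.load` defaulting to full pickle deserialization, which can execute arbitrary code from a crafted checkpoint; the remedy in lpips-style code is `torch.load(model_path, map_location='cpu', weights_only=True)`. A stdlib-only sketch of why the restriction matters, using a `pickle.Unpickler` that refuses globals in the same spirit that `weights_only=True` refuses arbitrary objects:

```python
import io
import pickle

class SafeUnpickler(pickle.Unpickler):
    # analogous in spirit to torch.load(weights_only=True): refuse to
    # resolve any global reference, so a crafted pickle cannot trigger
    # code execution during load
    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

# plain containers deserialize fine (no global lookups are needed)
state = SafeUnpickler(io.BytesIO(pickle.dumps({"w": [1.0, 2.0]}))).load()

# anything referencing a global object is rejected
blocked = False
try:
    SafeUnpickler(io.BytesIO(pickle.dumps(pickle.UnpicklingError))).load()
except pickle.UnpicklingError:
    blocked = True
```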
Outputs will be saved to: /proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_b
Will load checkpoint from /proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_a_128/checkpoints/epoch0_step2000.ckpt
Executing task: training out of ['training']
| [2026-03-11 17:16:54,188][pytorch_lightning.utilities.rank_zero][INFO] - Using 16bit Automatic Mixed Precision (AMP) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/amp.py:54: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead. |
| [2026-03-11 17:16:54,343][pytorch_lightning.utilities.rank_zero][INFO] - GPU available: True (cuda), used: True |
| [2026-03-11 17:16:54,343][pytorch_lightning.utilities.rank_zero][INFO] - TPU available: False, using: 0 TPU cores |
| [2026-03-11 17:16:54,344][pytorch_lightning.utilities.rank_zero][INFO] - IPU available: False, using: 0 IPUs |
| [2026-03-11 17:16:54,344][pytorch_lightning.utilities.rank_zero][INFO] - HPU available: False, using: 0 HPUs |
| [2026-03-11 17:16:54,344][pytorch_lightning.utilities.rank_zero][INFO] - `Trainer(limit_val_batches=1)` was configured so 1 batch will be used. |
| /proj/cvl/users/x_fahkh2/WorldMem_Repro/experiments/exp_base.py:74: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md |
| ckpt = torch.load(checkpoint_path, map_location=torch.device('cpu')) |
| [2026-03-11 17:17:04,462][pytorch_lightning.utilities.rank_zero][INFO] - Model weights loaded. |
| [2026-03-11 17:17:11,506][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8 |
[2026-03-11 17:17:11,588][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
[2026-03-11 17:17:11,762][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
[2026-03-11 17:17:12,114][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
[2026-03-11 17:17:12,621][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
[2026-03-11 17:17:12,833][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
[2026-03-11 17:17:12,891][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
[2026-03-11 17:17:12,999][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
| [2026-03-11 17:17:16,428][pytorch_lightning.utilities.rank_zero][INFO] - ---------------------------------------------------------------------------------------------------- |
| distributed_backend=nccl |
| All distributed processes registered. Starting with 8 processes |
| ---------------------------------------------------------------------------------------------------- |
|
|
| wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id m7wkey5k. |
| wandb: Tracking run with wandb version 0.17.9 |
| wandb: W&B syncing is set to `offline` in this directory. |
| wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing. |
[2026-03-11 17:17:33,534][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
[2026-03-11 17:17:33,535][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
[2026-03-11 17:17:33,535][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
[2026-03-11 17:17:33,535][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
[2026-03-11 17:17:33,535][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
[2026-03-11 17:17:33,535][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
[2026-03-11 17:17:33,535][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
[2026-03-11 17:17:33,535][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
| INFO: |
| | Name | Type | Params |
| --------------------------------------------------------------------------------- |
| 0 | diffusion_model | DiffusionMamba | 609 M |
| 1 | validation_lpips_model | LearnedPerceptualImagePatchSimilarity | 2.5 M |
| 2 | vae | AutoencoderKL | 229 M |
| 3 | mamba_memory | BiMambaMemory | 4.5 M |
| 4 | pose_prediction_model | PosePredictionNet | 200 K |
| --------------------------------------------------------------------------------- |
| 609 M Trainable params |
| 236 M Non-trainable params |
| 846 M Total params |
| 3,384.157 Total estimated model params size (MB) |
| [2026-03-11 17:17:34,487][lightning.pytorch.callbacks.model_summary][INFO] - |
| | Name | Type | Params |
| --------------------------------------------------------------------------------- |
| 0 | diffusion_model | DiffusionMamba | 609 M |
| 1 | validation_lpips_model | LearnedPerceptualImagePatchSimilarity | 2.5 M |
| 2 | vae | AutoencoderKL | 229 M |
| 3 | mamba_memory | BiMambaMemory | 4.5 M |
| 4 | pose_prediction_model | PosePredictionNet | 200 K |
| --------------------------------------------------------------------------------- |
| 609 M Trainable params |
| 236 M Non-trainable params |
| 846 M Total params |
| 3,384.157 Total estimated model params size (MB) |
[2026-03-11 17:17:36,764][lightning.pytorch.trainer.connectors.signal_connector][INFO] - SLURM auto-requeueing enabled. Setting signal handlers.
| /proj/cvl/users/x_fahkh2/WorldMem_Repro/algorithms/worldmem/models/mamba_memory.py:173: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead. |
| with torch.cuda.amp.autocast(enabled=False): |
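The autocast FutureWarning above (and the GradScaler deprecation earlier in this log) are mechanical migrations from the `torch.cuda.amp` namespace to `torch.amp`, using the replacement spellings the warnings themselves give:

```diff
- with torch.cuda.amp.autocast(enabled=False):
+ with torch.amp.autocast('cuda', enabled=False):

- scaler = torch.cuda.amp.GradScaler()
+ scaler = torch.amp.GradScaler('cuda')
```

The `mamba_memory.py:173` change is in this repository's own code; the GradScaler call sits inside the installed Lightning plugin, so that one goes away with a Lightning upgrade rather than a local edit.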
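The FutureWarning above comes from the deprecated `torch.cuda.amp.autocast` spelling at `mamba_memory.py:173`. A minimal sketch of the migration the warning itself suggests (the tensor math is illustrative only, not from the codebase):

```python
import torch

x = torch.randn(4, 4)

# Old, deprecated spelling (what mamba_memory.py:173 currently uses):
#   with torch.cuda.amp.autocast(enabled=False):
# New spelling: the device type becomes the first argument.
with torch.amp.autocast('cuda', enabled=False):
    y = x @ x  # autocast is disabled, so this stays in full precision

print(y.dtype)  # -> torch.float32
```

The behavior is unchanged; only the import path moves, so this is a one-line fix that silences all eight per-rank copies of the warning.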
| Traceback (most recent call last): |
| File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/multiprocessing/util.py", line 300, in _run_finalizers |
| finalizer() |
| File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/multiprocessing/util.py", line 224, in __call__ |
| res = self._callback(*self._args, **self._kwargs) |
| File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/multiprocessing/util.py", line 133, in _remove_temp_dir |
| rmtree(tempdir) |
| File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/shutil.py", line 725, in rmtree |
| _rmtree_safe_fd(fd, path, onerror) |
| File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd |
| onerror(os.unlink, fullname, sys.exc_info()) |
| File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd |
| os.unlink(entry.name, dir_fd=topfd) |
| OSError: [Errno 16] Device or resource busy: '.nfsd1b7c7079d9ccf2500000257' |
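The repeated `OSError: [Errno 16]` entries are an NFS artifact: unlinking a still-open file on NFS silently renames it to a `.nfsXXXX` placeholder, and `shutil.rmtree` then fails on the busy placeholder while `multiprocessing` finalizers clean up their temp directories. Since the errors fire in finalizers, they are cleanup noise rather than the training failure itself. A minimal stdlib sketch of a cleanup helper that tolerates such busy files (the helper name is ours, not from the codebase; Python 3.12+ prefers the `onexc` parameter over `onerror`):

```python
import errno
import os
import shutil
import tempfile

def rmtree_tolerant(path):
    """Remove a directory tree, skipping NFS '.nfsXXXX' silly-rename
    files that raise EBUSY because another process still holds them open."""
    def onerror(func, p, exc_info):
        exc = exc_info[1]
        if isinstance(exc, OSError) and exc.errno == errno.EBUSY:
            return  # leave the busy file behind; NFS reclaims it on close
        raise exc
    shutil.rmtree(path, onerror=onerror)

# Usage on an ordinary directory (nothing busy, so it is fully removed):
d = tempfile.mkdtemp()
open(os.path.join(d, "scratch.txt"), "w").close()
rmtree_tolerant(d)
print(os.path.exists(d))  # -> False
```

Pointing `TMPDIR` at node-local storage (e.g. a SLURM-provided scratch directory) instead of the NFS-mounted project area avoids the problem entirely, since silly-rename only happens on NFS.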
| Error executing job with overrides: ['+name=train_stage_b_mamba', 'algorithm=df_video_mamba3stage', '+customized_load=true', '+seperate_load=false', 'experiment.num_nodes=1', 'load=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_a_128/checkpoints/epoch0_step2000.ckpt', 'dataset.save_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/datasets/minecraft', 'dataset.n_frames=200', '+dataset.n_frames_valid=200', '+dataset.angle_range=110', '+dataset.pos_range=2', '+dataset.wo_updown=false', '+dataset.customized_validation=true', '+dataset.add_timestamp_embedding=true', '+dataset.use_explicit_memory_frames=false', 'algorithm.training_stage=stage_b_diffusion_frozen_memory', 'algorithm.use_mamba_memory_pipeline=true', 'algorithm.use_oracle_pose_eval=false', 'algorithm.enable_memory_noise_curriculum=false', '+algorithm.require_pose_prediction=false', '+algorithm.use_memory_attention=false', '+algorithm.relative_embedding=false', '+algorithm.memory_retrieval_topk=32', 'algorithm.diff_window_size=8', 'algorithm.memory_condition_length=0', 'algorithm.context_frames=100', '+algorithm.n_tokens=8', 'experiment.training.batch_size=8', 'experiment.training.checkpointing.every_n_train_steps=2500', 'experiment.training.max_steps=30000', 'experiment.validation.val_every_n_step=2500', '+output_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_b/'] |
| [rank1]: Traceback (most recent call last): |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/runpy.py", line 196, in _run_module_as_main |
| [rank1]: return _run_code(code, main_globals, None, |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/runpy.py", line 86, in _run_code |
| [rank1]: exec(code, run_globals) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/main.py", line 202, in <module> |
| [rank1]: run() |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main |
| [rank1]: _run_hydra( |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra |
| [rank1]: _run_app( |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app |
| [rank1]: run_and_report( |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report |
| [rank1]: raise ex |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report |
| [rank1]: return func() |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda> |
| [rank1]: lambda: hydra.run( |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run |
| [rank1]: _ = ret.return_value |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value |
| [rank1]: raise self._return_value |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job |
| [rank1]: ret.return_value = task_function(task_cfg) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/main.py", line 198, in run |
| [rank1]: run_local(cfg) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/main.py", line 122, in run_local |
| [rank1]: experiment.exec_task(task) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/experiments/exp_base.py", line 172, in exec_task |
| [rank1]: getattr(self, task)() |
| [rank1]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/experiments/exp_base.py", line 357, in training |
| [rank1]: trainer.fit( |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit |
| [rank1]: call._call_and_handle_interrupt( |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt |
| [rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch |
| [rank1]: return function(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl |
| [rank1]: self._run(model, ckpt_path=ckpt_path) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run |
| [rank1]: results = self._run_stage() |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage |
| [rank1]: self.fit_loop.run() |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run |
| [rank1]: self.advance() |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance |
| [rank1]: self.epoch_loop.run(self._data_fetcher) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run |
| [rank1]: self.advance(data_fetcher) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 240, in advance |
| [rank1]: batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 187, in run |
| [rank1]: self._optimizer_step(batch_idx, closure) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 265, in _optimizer_step |
| [rank1]: call._call_lightning_module_hook( |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook |
| [rank1]: output = fn(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/algorithms/worldmem/df_base.py", line 65, in optimizer_step |
| [rank1]: optimizer.step(closure=optimizer_closure) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 151, in step |
| [rank1]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 265, in optimizer_step |
| [rank1]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 230, in optimizer_step |
| [rank1]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/amp.py", line 77, in optimizer_step |
| [rank1]: closure_result = closure() |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__ |
| [rank1]: self._result = self.closure(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context |
| [rank1]: return func(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 126, in closure |
| [rank1]: step_output = self._step_fn() |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 315, in _training_step |
| [rank1]: training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values()) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook |
| [rank1]: output = fn(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 381, in training_step |
| [rank1]: return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 633, in __call__ |
| [rank1]: wrapper_output = wrapper_module(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl |
| [rank1]: return self._call_impl(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl |
| [rank1]: return forward_call(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward |
| [rank1]: else self._run_ddp_forward(*inputs, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward |
| [rank1]: return self.module(*inputs, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl |
| [rank1]: return self._call_impl(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl |
| [rank1]: return forward_call(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 626, in wrapped_forward |
| [rank1]: out = method(*_args, **_kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/algorithms/worldmem/df_video_mamba3stage.py", line 1070, in training_step |
| [rank1]: self.log("training/loss", total_loss.cpu(), prog_bar=True, sync_dist=True) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 503, in log |
| [rank1]: results.log( |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 407, in log |
| [rank1]: self.update_metrics(key, value, batch_size) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 421, in update_metrics |
| [rank1]: result_metric.forward(value, batch_size) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 267, in forward |
| [rank1]: self.update(value, batch_size) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchmetrics/metric.py", line 400, in wrapped_func |
| [rank1]: raise err |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchmetrics/metric.py", line 390, in wrapped_func |
| [rank1]: update(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 225, in update |
| [rank1]: self._forward_cache = self.meta.sync(value.clone()) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 332, in reduce |
| [rank1]: return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available |
| [rank1]: return _sync_ddp(result, group=group, reduce_op=reduce_op) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp |
| [rank1]: torch.distributed.all_reduce(result, op=op, group=group, async_op=False) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper |
| [rank1]: return func(*args, **kwargs) |
| [rank1]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce |
| [rank1]: work = group.allreduce([tensor], opts) |
| [rank1]: RuntimeError: No backend type associated with device type cpu |
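The root cause is visible at the top of this stack: `df_video_mamba3stage.py:1070` calls `self.log("training/loss", total_loss.cpu(), prog_bar=True, sync_dist=True)`. With `sync_dist=True`, Lightning all-reduces the logged value through the DDP process group, and the NCCL backend only operates on CUDA tensors, so a CPU tensor raises `RuntimeError: No backend type associated with device type cpu`. The fix is to log the loss on its original device (drop the `.cpu()` call) and let Lightning move it after the reduction. A small sketch of the device rule (the helper `device_for_sync_logging` is illustrative, not a Lightning API):

```python
# Illustrative rule for the crash above: with sync_dist=True, the logged
# tensor is all-reduced through the process-group backend. NCCL has no CPU
# implementation, so a metric moved to CPU before self.log() cannot be
# reduced and the call raises the RuntimeError seen in the traceback.
def device_for_sync_logging(tensor_device: str, backend: str = "nccl") -> str:
    """Return the device a metric must live on before a sync_dist log."""
    if backend == "nccl" and tensor_device == "cpu":
        # NCCL only reduces CUDA tensors; keep the metric on the GPU.
        return "cuda"
    # Gloo (and reductions with sync_dist=False) can handle CPU tensors.
    return tensor_device
```

Assuming `total_loss` is computed on the GPU, the logging call becomes `self.log("training/loss", total_loss, prog_bar=True, sync_dist=True)`.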
| [finalizer traceback identical to the one above, repeated for '.nfscdf3fa4d4bb3289f00000274' and '.nfs3e78cc3f6f29816000000275'] |
| Error executing job with overrides: ['+name=train_stage_b_mamba', 'algorithm=df_video_mamba3stage', '+customized_load=true', '+seperate_load=false', 'experiment.num_nodes=1', 'load=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_a_128/checkpoints/epoch0_step2000.ckpt', 'dataset.save_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/datasets/minecraft', 'dataset.n_frames=200', '+dataset.n_frames_valid=200', '+dataset.angle_range=110', '+dataset.pos_range=2', '+dataset.wo_updown=false', '+dataset.customized_validation=true', '+dataset.add_timestamp_embedding=true', '+dataset.use_explicit_memory_frames=false', 'algorithm.training_stage=stage_b_diffusion_frozen_memory', 'algorithm.use_mamba_memory_pipeline=true', 'algorithm.use_oracle_pose_eval=false', 'algorithm.enable_memory_noise_curriculum=false', '+algorithm.require_pose_prediction=false', '+algorithm.use_memory_attention=false', '+algorithm.relative_embedding=false', '+algorithm.memory_retrieval_topk=32', 'algorithm.diff_window_size=8', 'algorithm.memory_condition_length=0', 'algorithm.context_frames=100', '+algorithm.n_tokens=8', 'experiment.training.batch_size=8', 'experiment.training.checkpointing.every_n_train_steps=2500', 'experiment.training.max_steps=30000', 'experiment.validation.val_every_n_step=2500', '+output_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_b/'] |
| [rank3]: [traceback identical to rank1 above, ending in the same RuntimeError: No backend type associated with device type cpu] |
| [finalizer traceback identical to the one above, repeated for '.nfs492972e3c8459f7600000276'] |
| [rank4]: [traceback identical to rank1 above; log truncated mid-traceback] |
| [rank4]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 265, in optimizer_step |
| [rank4]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 230, in optimizer_step |
| [rank4]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/amp.py", line 77, in optimizer_step |
| [rank4]: closure_result = closure() |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__ |
| [rank4]: self._result = self.closure(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context |
| [rank4]: return func(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 126, in closure |
| [rank4]: step_output = self._step_fn() |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 315, in _training_step |
| [rank4]: training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values()) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook |
| [rank4]: output = fn(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 381, in training_step |
| [rank4]: return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 633, in __call__ |
| [rank4]: wrapper_output = wrapper_module(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl |
| [rank4]: return self._call_impl(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl |
| [rank4]: return forward_call(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward |
| [rank4]: else self._run_ddp_forward(*inputs, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward |
| [rank4]: return self.module(*inputs, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl |
| [rank4]: return self._call_impl(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl |
| [rank4]: return forward_call(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 626, in wrapped_forward |
| [rank4]: out = method(*_args, **_kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/algorithms/worldmem/df_video_mamba3stage.py", line 1070, in training_step |
| [rank4]: self.log("training/loss", total_loss.cpu(), prog_bar=True, sync_dist=True) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 503, in log |
| [rank4]: results.log( |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 407, in log |
| [rank4]: self.update_metrics(key, value, batch_size) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 421, in update_metrics |
| [rank4]: result_metric.forward(value, batch_size) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 267, in forward |
| [rank4]: self.update(value, batch_size) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchmetrics/metric.py", line 400, in wrapped_func |
| [rank4]: raise err |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchmetrics/metric.py", line 390, in wrapped_func |
| [rank4]: update(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 225, in update |
| [rank4]: self._forward_cache = self.meta.sync(value.clone()) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 332, in reduce |
| [rank4]: return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available |
| [rank4]: return _sync_ddp(result, group=group, reduce_op=reduce_op) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp |
| [rank4]: torch.distributed.all_reduce(result, op=op, group=group, async_op=False) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper |
| [rank4]: return func(*args, **kwargs) |
| [rank4]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce |
| [rank4]: work = group.allreduce([tensor], opts) |
| [rank4]: RuntimeError: No backend type associated with device type cpu |
| Error executing job with overrides: ['+name=train_stage_b_mamba', 'algorithm=df_video_mamba3stage', '+customized_load=true', '+seperate_load=false', 'experiment.num_nodes=1', 'load=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_a_128/checkpoints/epoch0_step2000.ckpt', 'dataset.save_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/datasets/minecraft', 'dataset.n_frames=200', '+dataset.n_frames_valid=200', '+dataset.angle_range=110', '+dataset.pos_range=2', '+dataset.wo_updown=false', '+dataset.customized_validation=true', '+dataset.add_timestamp_embedding=true', '+dataset.use_explicit_memory_frames=false', 'algorithm.training_stage=stage_b_diffusion_frozen_memory', 'algorithm.use_mamba_memory_pipeline=true', 'algorithm.use_oracle_pose_eval=false', 'algorithm.enable_memory_noise_curriculum=false', '+algorithm.require_pose_prediction=false', '+algorithm.use_memory_attention=false', '+algorithm.relative_embedding=false', '+algorithm.memory_retrieval_topk=32', 'algorithm.diff_window_size=8', 'algorithm.memory_condition_length=0', 'algorithm.context_frames=100', '+algorithm.n_tokens=8', 'experiment.training.batch_size=8', 'experiment.training.checkpointing.every_n_train_steps=2500', 'experiment.training.max_steps=30000', 'experiment.validation.val_every_n_step=2500', '+output_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_b/'] |
| [rank0]: Traceback (most recent call last): |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/runpy.py", line 196, in _run_module_as_main |
| [rank0]: return _run_code(code, main_globals, None, |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/runpy.py", line 86, in _run_code |
| [rank0]: exec(code, run_globals) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/main.py", line 202, in <module> |
| [rank0]: run() |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main |
| [rank0]: _run_hydra( |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra |
| [rank0]: _run_app( |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app |
| [rank0]: run_and_report( |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report |
| [rank0]: raise ex |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report |
| [rank0]: return func() |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda> |
| [rank0]: lambda: hydra.run( |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run |
| [rank0]: _ = ret.return_value |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value |
| [rank0]: raise self._return_value |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job |
| [rank0]: ret.return_value = task_function(task_cfg) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/main.py", line 198, in run |
| [rank0]: run_local(cfg) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/main.py", line 122, in run_local |
| [rank0]: experiment.exec_task(task) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/experiments/exp_base.py", line 172, in exec_task |
| [rank0]: getattr(self, task)() |
| [rank0]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/experiments/exp_base.py", line 357, in training |
| [rank0]: trainer.fit( |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit |
| [rank0]: call._call_and_handle_interrupt( |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt |
| [rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch |
| [rank0]: return function(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl |
| [rank0]: self._run(model, ckpt_path=ckpt_path) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run |
| [rank0]: results = self._run_stage() |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage |
| [rank0]: self.fit_loop.run() |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run |
| [rank0]: self.advance() |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance |
| [rank0]: self.epoch_loop.run(self._data_fetcher) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run |
| [rank0]: self.advance(data_fetcher) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 240, in advance |
| [rank0]: batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 187, in run |
| [rank0]: self._optimizer_step(batch_idx, closure) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 265, in _optimizer_step |
| [rank0]: call._call_lightning_module_hook( |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook |
| [rank0]: output = fn(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/algorithms/worldmem/df_base.py", line 65, in optimizer_step |
| [rank0]: optimizer.step(closure=optimizer_closure) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 151, in step |
| [rank0]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 265, in optimizer_step |
| [rank0]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 230, in optimizer_step |
| [rank0]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/amp.py", line 77, in optimizer_step |
| [rank0]: closure_result = closure() |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__ |
| [rank0]: self._result = self.closure(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context |
| [rank0]: return func(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 126, in closure |
| [rank0]: step_output = self._step_fn() |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 315, in _training_step |
| [rank0]: training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values()) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook |
| [rank0]: output = fn(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 381, in training_step |
| [rank0]: return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 633, in __call__ |
| [rank0]: wrapper_output = wrapper_module(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl |
| [rank0]: return self._call_impl(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl |
| [rank0]: return forward_call(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward |
| [rank0]: else self._run_ddp_forward(*inputs, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward |
| [rank0]: return self.module(*inputs, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl |
| [rank0]: return self._call_impl(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl |
| [rank0]: return forward_call(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 626, in wrapped_forward |
| [rank0]: out = method(*_args, **_kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/algorithms/worldmem/df_video_mamba3stage.py", line 1070, in training_step |
| [rank0]: self.log("training/loss", total_loss.cpu(), prog_bar=True, sync_dist=True) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 503, in log |
| [rank0]: results.log( |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 407, in log |
| [rank0]: self.update_metrics(key, value, batch_size) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 421, in update_metrics |
| [rank0]: result_metric.forward(value, batch_size) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 267, in forward |
| [rank0]: self.update(value, batch_size) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchmetrics/metric.py", line 400, in wrapped_func |
| [rank0]: raise err |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchmetrics/metric.py", line 390, in wrapped_func |
| [rank0]: update(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 225, in update |
| [rank0]: self._forward_cache = self.meta.sync(value.clone()) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 332, in reduce |
| [rank0]: return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available |
| [rank0]: return _sync_ddp(result, group=group, reduce_op=reduce_op) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp |
| [rank0]: torch.distributed.all_reduce(result, op=op, group=group, async_op=False) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper |
| [rank0]: return func(*args, **kwargs) |
| [rank0]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce |
| [rank0]: work = group.allreduce([tensor], opts) |
| [rank0]: RuntimeError: No backend type associated with device type cpu |
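The traceback is the same on every rank: `self.log("training/loss", total_loss.cpu(), prog_bar=True, sync_dist=True)` at `df_video_mamba3stage.py:1070` hands a CPU tensor to the `sync_dist` all_reduce, and the process group here is NCCL-only, which registers no backend for CPU tensors. A minimal sketch of the likely fix, keeping the loss on its CUDA device and letting Lightning move the value for display itself (the `log_loss` helper name is hypothetical, introduced only for illustration):

```python
def log_loss(pl_module, total_loss):
    """Sketch of a corrected logging call for the crash above.

    With sync_dist=True, Lightning all_reduces the logged tensor across
    ranks. Under a NCCL-only process group that reduction must run on a
    CUDA tensor, so the .cpu() call in the original line is what raises
    "No backend type associated with device type cpu".
    """
    # Before (crashes under NCCL):
    #   pl_module.log("training/loss", total_loss.cpu(),
    #                 prog_bar=True, sync_dist=True)
    # After: pass the tensor as-is; Lightning detaches and moves it to
    # the host only after the cross-rank reduction has completed.
    pl_module.log("training/loss", total_loss, prog_bar=True, sync_dist=True)
```

An alternative, if CPU-side reductions are genuinely needed, is to initialize the process group with a composite backend string such as `"cuda:nccl,cpu:gloo"` so CPU tensors reduce over Gloo; but for a scalar training loss, simply dropping `.cpu()` is the smaller change.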
| Error executing job with overrides: ['+name=train_stage_b_mamba', 'algorithm=df_video_mamba3stage', '+customized_load=true', '+seperate_load=false', 'experiment.num_nodes=1', 'load=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_a_128/checkpoints/epoch0_step2000.ckpt', 'dataset.save_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/datasets/minecraft', 'dataset.n_frames=200', '+dataset.n_frames_valid=200', '+dataset.angle_range=110', '+dataset.pos_range=2', '+dataset.wo_updown=false', '+dataset.customized_validation=true', '+dataset.add_timestamp_embedding=true', '+dataset.use_explicit_memory_frames=false', 'algorithm.training_stage=stage_b_diffusion_frozen_memory', 'algorithm.use_mamba_memory_pipeline=true', 'algorithm.use_oracle_pose_eval=false', 'algorithm.enable_memory_noise_curriculum=false', '+algorithm.require_pose_prediction=false', '+algorithm.use_memory_attention=false', '+algorithm.relative_embedding=false', '+algorithm.memory_retrieval_topk=32', 'algorithm.diff_window_size=8', 'algorithm.memory_condition_length=0', 'algorithm.context_frames=100', '+algorithm.n_tokens=8', 'experiment.training.batch_size=8', 'experiment.training.checkpointing.every_n_train_steps=2500', 'experiment.training.max_steps=30000', 'experiment.validation.val_every_n_step=2500', '+output_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_b/'] |
| [rank5]: self._optimizer_step(batch_idx, closure) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 265, in _optimizer_step |
| [rank5]: call._call_lightning_module_hook( |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook |
| [rank5]: output = fn(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/algorithms/worldmem/df_base.py", line 65, in optimizer_step |
| [rank5]: optimizer.step(closure=optimizer_closure) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 151, in step |
| [rank5]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 265, in optimizer_step |
| [rank5]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 230, in optimizer_step |
| [rank5]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/amp.py", line 77, in optimizer_step |
| [rank5]: closure_result = closure() |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__ |
| [rank5]: self._result = self.closure(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context |
| [rank5]: return func(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 126, in closure |
| [rank5]: step_output = self._step_fn() |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 315, in _training_step |
| [rank5]: training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values()) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook |
| [rank5]: output = fn(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 381, in training_step |
| [rank5]: return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 633, in __call__ |
| [rank5]: wrapper_output = wrapper_module(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl |
| [rank5]: return self._call_impl(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl |
| [rank5]: return forward_call(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward |
| [rank5]: else self._run_ddp_forward(*inputs, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward |
| [rank5]: return self.module(*inputs, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl |
| [rank5]: return self._call_impl(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl |
| [rank5]: return forward_call(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 626, in wrapped_forward |
| [rank5]: out = method(*_args, **_kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/WorldMem_Repro/algorithms/worldmem/df_video_mamba3stage.py", line 1070, in training_step |
| [rank5]: self.log("training/loss", total_loss.cpu(), prog_bar=True, sync_dist=True) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 503, in log |
| [rank5]: results.log( |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 407, in log |
| [rank5]: self.update_metrics(key, value, batch_size) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 421, in update_metrics |
| [rank5]: result_metric.forward(value, batch_size) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 267, in forward |
| [rank5]: self.update(value, batch_size) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchmetrics/metric.py", line 400, in wrapped_func |
| [rank5]: raise err |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchmetrics/metric.py", line 390, in wrapped_func |
| [rank5]: update(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 225, in update |
| [rank5]: self._forward_cache = self.meta.sync(value.clone()) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 332, in reduce |
| [rank5]: return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available |
| [rank5]: return _sync_ddp(result, group=group, reduce_op=reduce_op) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp |
| [rank5]: torch.distributed.all_reduce(result, op=op, group=group, async_op=False) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper |
| [rank5]: return func(*args, **kwargs) |
| [rank5]: File "/proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce |
| [rank5]: work = group.allreduce([tensor], opts) |
| [rank5]: RuntimeError: No backend type associated with device type cpu |
| Error executing job with overrides: ['+name=train_stage_b_mamba', 'algorithm=df_video_mamba3stage', '+customized_load=true', '+seperate_load=false', 'experiment.num_nodes=1', 'load=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_a_128/checkpoints/epoch0_step2000.ckpt', 'dataset.save_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/datasets/minecraft', 'dataset.n_frames=200', '+dataset.n_frames_valid=200', '+dataset.angle_range=110', '+dataset.pos_range=2', '+dataset.wo_updown=false', '+dataset.customized_validation=true', '+dataset.add_timestamp_embedding=true', '+dataset.use_explicit_memory_frames=false', 'algorithm.training_stage=stage_b_diffusion_frozen_memory', 'algorithm.use_mamba_memory_pipeline=true', 'algorithm.use_oracle_pose_eval=false', 'algorithm.enable_memory_noise_curriculum=false', '+algorithm.require_pose_prediction=false', '+algorithm.use_memory_attention=false', '+algorithm.relative_embedding=false', '+algorithm.memory_retrieval_topk=32', 'algorithm.diff_window_size=8', 'algorithm.memory_condition_length=0', 'algorithm.context_frames=100', '+algorithm.n_tokens=8', 'experiment.training.batch_size=8', 'experiment.training.checkpointing.every_n_train_steps=2500', 'experiment.training.max_steps=30000', 'experiment.validation.val_every_n_step=2500', '+output_dir=/proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_b/'] |
| [rank2]: RuntimeError: No backend type associated with device type cpu |
| wandb: You can sync this run to the cloud by running: |
| wandb: wandb sync /proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_b/wandb/offline-run-20260311_171725-m7wkey5k |
| wandb: Find logs at: ./checkpoints/bimamba_stage_b/wandb/offline-run-20260311_171725-m7wkey5k/logs |
| wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information. |
| srun: error: node109: tasks 1,3-4: Exited with exit code 1 |
| srun: Terminating StepId=7368.0 |
| [2026-03-11T17:18:08.842] error: *** STEP 7368.0 ON node109 CANCELLED AT 2026-03-11T17:18:08 DUE TO TASK FAILURE *** |
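A likely diagnosis of the RuntimeError above, based only on the traceback (not verified against this repository): with `sync_dist=True`, Lightning all-reduces the logged value through `torch.distributed`, and under the NCCL backend that collective only accepts CUDA tensors, so logging `total_loss.cpu()` fails with "No backend type associated with device type cpu". A minimal sketch of the fix at `df_video_mamba3stage.py:1070` is to log the loss on its original device; `compute_loss` here is a hypothetical helper standing in for the module's actual loss computation:

```python
def training_step(self, batch, batch_idx):
    # Hypothetical stand-in for the module's real loss computation;
    # assumed to return a CUDA tensor when training on GPU.
    total_loss = self.compute_loss(batch)

    # Do NOT call .cpu() on the value when sync_dist=True: the NCCL
    # all-reduce behind sync_dist only supports CUDA tensors, and a CPU
    # tensor triggers "No backend type associated with device type cpu".
    self.log("training/loss", total_loss, prog_bar=True, sync_dist=True)
    return total_loss
```

An alternative, if the value must be reduced on CPU, would be to initialize the process group with a backend that supports CPU tensors (e.g. `"cpu:gloo,cuda:nccl"`), but dropping the `.cpu()` call is the smaller change.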
| srun: error: node109: tasks 5,7: Terminated |
| srun: error: node109: task 0: Terminated |
| srun: error: node109: task 2: Terminated |
| srun: error: node109: task 6: Terminated |
| srun: Force Terminated StepId=7368.0 |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'training': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information |
| warnings.warn(msg, UserWarning) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/fabric/__init__.py:40: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead. |
| warnings.warn( |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights. |
| warnings.warn(msg) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lpips/lpips.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md |
| self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lpips/lpips.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md |
| self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead. |
| warnings.warn( |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights. |
| warnings.warn(msg) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lpips/lpips.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md |
| self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead. |
| warnings.warn( |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights. |
| warnings.warn(msg) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lpips/lpips.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md |
| self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lpips/lpips.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md |
| self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead. |
| warnings.warn( |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights. |
| warnings.warn(msg) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lpips/lpips.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md |
| self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lpips/lpips.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md |
| self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False) |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead. |
| warnings.warn( |
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights. |
| warnings.warn(msg) |
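Editor's note: the lpips FutureWarning above can be resolved by opting into the restricted loader. A self-contained sketch (the demo path and module below are illustrative, not taken from lpips; assumes a torch version that accepts the flag):

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)
torch.save(net.state_dict(), "/tmp/demo_weights.pt")

# weights_only=True restricts unpickling to tensors and primitive containers,
# which avoids the arbitrary-code-execution risk the warning describes.
state = torch.load("/tmp/demo_weights.pt", map_location="cpu", weights_only=True)
net.load_state_dict(state, strict=False)
```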
| /proj/cvl/users/x_fahkh2/envs/worldmem/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/amp.py:54: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead. |
| Outputs will be saved to: /proj/cvl/users/x_fahkh2/WorldMem_Repro/checkpoints/bimamba_stage_b |
| Executing task: training out of ['training'] |
| [2026-03-11 17:18:42,584][pytorch_lightning.utilities.rank_zero][INFO] - Using 16bit Automatic Mixed Precision (AMP) |
| [2026-03-11 17:18:42,726][pytorch_lightning.utilities.rank_zero][INFO] - GPU available: True (cuda), used: True |
| [2026-03-11 17:18:42,726][pytorch_lightning.utilities.rank_zero][INFO] - TPU available: False, using: 0 TPU cores |
| [2026-03-11 17:18:42,726][pytorch_lightning.utilities.rank_zero][INFO] - IPU available: False, using: 0 IPUs |
| [2026-03-11 17:18:42,726][pytorch_lightning.utilities.rank_zero][INFO] - HPU available: False, using: 0 HPUs |
| [2026-03-11 17:18:42,727][pytorch_lightning.utilities.rank_zero][INFO] - `Trainer(limit_val_batches=1)` was configured so 1 batch will be used. |
| [2026-03-11 17:18:48,831][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8 |
| [2026-03-11 17:18:48,998][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8 |
| [2026-03-11 17:18:49,098][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8 |
| [2026-03-11 17:18:49,593][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8 |
| [2026-03-11 17:18:49,616][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8 |
| [2026-03-11 17:18:49,636][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8 |
| [2026-03-11 17:18:49,653][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8 |
| [2026-03-11 17:18:50,914][lightning.fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8 |
| [2026-03-11 17:18:54,154][pytorch_lightning.utilities.rank_zero][INFO] - ---------------------------------------------------------------------------------------------------- |
| distributed_backend=nccl |
| All distributed processes registered. Starting with 8 processes |
| ---------------------------------------------------------------------------------------------------- |
|
|
| wandb: WARNING Tried to auto resume run with id m7wkey5k but id stage_b_offline is set. |
| wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id stage_b_offline. |
| wandb: Tracking run with wandb version 0.17.9 |
| wandb: W&B syncing is set to `offline` in this directory. |
| wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing. |
| [2026-03-11 17:19:09,989][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] |
| [2026-03-11 17:19:09,990][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] |
| [2026-03-11 17:19:09,990][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] |
| [2026-03-11 17:19:09,990][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] |
| [2026-03-11 17:19:09,990][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] |
| [2026-03-11 17:19:09,990][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] |
| [2026-03-11 17:19:09,990][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] |
| [2026-03-11 17:19:09,990][lightning.pytorch.accelerators.cuda][INFO] - LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] |
| [2026-03-11 17:19:12,611][lightning.pytorch.callbacks.model_summary][INFO] - |
| | Name | Type | Params |
| --------------------------------------------------------------------------------- |
| 0 | diffusion_model | DiffusionMamba | 609 M |
| 1 | validation_lpips_model | LearnedPerceptualImagePatchSimilarity | 2.5 M |
| 2 | vae | AutoencoderKL | 229 M |
| 3 | mamba_memory | BiMambaMemory | 4.5 M |
| 4 | pose_prediction_model | PosePredictionNet | 200 K |
| --------------------------------------------------------------------------------- |
| 609 M Trainable params |
| 236 M Non-trainable params |
| 846 M Total params |
| 3,384.157 Total estimated model params size (MB) |
| [2026-03-11 17:19:13,418][lightning.pytorch.trainer.connectors.signal_connector][INFO] - SLURM auto-requeueing enabled. Setting signal handlers. |
|
|