W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793]
W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] *****************************************
W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] *****************************************
Time grid: t_c=0.75, steps (1→t_c)=100, (t_c→0)=50
Total number of images that will be sampled: 40192
sampling: 0%| | 0/157 [00:00
    sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==========================================================
sample_from_checkpoint_ddp.py FAILED
----------------------------------------------------------
Failures:
----------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-03-25_18:03:18
  host      : 24c964746905d416ce09d045f9a06f23-taskrole1-0
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 538676)
  error_file:
  traceback : Signal 9 (SIGKILL) received by PID 538676
==========================================================
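
Reading the report above: a negative exit code means the worker was terminated by a signal rather than exiting on its own, so `-9` corresponds to signal 9 (SIGKILL). SIGKILL cannot be caught or handled by the process, which may be why no Python traceback was recorded for rank 1. On Linux a SIGKILL is often, though not always, sent by the kernel out-of-memory killer when host RAM is exhausted; the log alone does not confirm that here. The snippet below is a minimal sketch of decoding such an exit code. The helper name `describe_exit_code` is hypothetical and is not part of torchrun.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Describe a child-process exit code (hypothetical helper, not a torchrun API)."""
    if exitcode < 0:
        # A negative value means the process was terminated by signal number -exitcode.
        signum = -exitcode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # The exit code reported for rank 1 in the log above.
    print(describe_exit_code(-9))
```

If the OOM killer is suspected, the kernel log on the host (for example via `dmesg`) is the usual place to check, and reducing the per-process batch size or the number of DataLoader workers is a common first mitigation to try.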