| W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] |
| W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] ***************************************** |
| W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
| W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] ***************************************** |
| Time grid: t_c=0.75, steps (1→t_c)=100, (t_c→0)=50 |
| Total number of images that will be sampled: 40192 |
|
sampling: 0%| | 0/157 [00:00<?, ?it/s]
| [rank2]:[W325 16:56:59.533804859 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id. |
| [rank3]:[W325 16:57:00.799344818 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id. |
| [rank1]:[W325 16:57:00.847229448 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id. |
| [rank0]:[W325 16:57:01.326116049 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id. |
|
sampling: 1%| | 1/157 [00:39<1:43:36, 39.85s/it]
sampling: 1%| | 2/157 [01:17<1:39:23, 38.48s/it]
sampling: 2%| | 3/157 [01:54<1:37:23, 37.94s/it]
sampling: 3%| | 4/157 [02:32<1:36:11, 37.72s/it]
sampling: 3%| | 5/157 [03:09<1:35:18, 37.62s/it]
sampling: 4%| | 6/157 [03:46<1:34:33, 37.57s/it]
sampling: 4%| | 7/157 [04:24<1:33:43, 37.49s/it]
sampling: 5%| | 8/157 [05:01<1:32:54, 37.42s/it]
sampling: 6%| | 9/157 [05:38<1:32:15, 37.40s/it]
sampling: 6%| | 10/157 [06:16<1:31:34, 37.38s/it]
sampling: 7%| | 11/157 [06:53<1:30:56, 37.37s/it]
sampling: 8%| | 12/157 [07:30<1:30:15, 37.34s/it]
sampling: 8%| | 13/157 [08:08<1:29:40, 37.36s/it]
sampling: 9%| | 14/157 [08:45<1:29:02, 37.36s/it]
sampling: 10%| | 15/157 [09:22<1:28:22, 37.34s/it]
sampling: 10%| | 16/157 [10:00<1:27:50, 37.38s/it]
sampling: 11%| | 17/157 [10:37<1:27:14, 37.39s/it]
sampling: 11%| | 18/157 [11:15<1:26:35, 37.38s/it]
sampling: 12%| | 19/157 [11:52<1:25:55, 37.36s/it]
sampling: 13%| | 20/157 [12:29<1:25:18, 37.36s/it]
sampling: 13%| | 21/157 [13:06<1:24:10, 37.14s/it]
sampling: 14%| | 22/157 [13:43<1:23:41, 37.20s/it]
sampling: 15%| | 23/157 [14:21<1:23:10, 37.24s/it]
sampling: 15%| | 24/157 [14:58<1:22:38, 37.29s/it]
sampling: 16%| | 25/157 [15:35<1:22:02, 37.29s/it]
sampling: 17%| | 26/157 [16:13<1:21:27, 37.31s/it]
sampling: 17%| | 27/157 [16:50<1:20:50, 37.31s/it]
sampling: 18%| | 28/157 [17:27<1:20:14, 37.32s/it]
sampling: 18%| | 29/157 [18:05<1:19:40, 37.35s/it]
sampling: 19%| | 30/157 [18:42<1:18:40, 37.17s/it]
sampling: 20%| | 31/157 [19:19<1:18:07, 37.21s/it]
sampling: 20%| | 32/157 [19:56<1:17:37, 37.26s/it]
sampling: 21%| | 33/157 [20:34<1:17:07, 37.32s/it]
sampling: 22%| | 34/157 [21:11<1:16:31, 37.33s/it]
sampling: 22%| | 35/157 [21:48<1:15:56, 37.35s/it]
sampling: 23%| | 36/157 [22:26<1:15:18, 37.34s/it]
sampling: 24%| | 37/157 [23:03<1:14:49, 37.41s/it]
sampling: 24%| | 38/157 [23:41<1:14:15, 37.44s/it]
sampling: 25%| | 39/157 [24:18<1:13:35, 37.42s/it]
sampling: 25%| | 40/157 [24:55<1:12:53, 37.38s/it]
sampling: 26%| | 41/157 [25:33<1:12:13, 37.36s/it]
sampling: 27%| | 42/157 [26:10<1:11:38, 37.38s/it]
sampling: 27%| | 43/157 [26:48<1:11:03, 37.40s/it]
sampling: 28%| | 44/157 [27:26<1:11:07, 37.77s/it]
sampling: 29%| | 45/157 [28:05<1:10:49, 37.95s/it]
sampling: 29%| | 46/157 [28:43<1:10:38, 38.18s/it]
sampling: 30%| | 47/157 [29:22<1:10:12, 38.30s/it]
sampling: 31%| | 48/157 [30:01<1:09:53, 38.47s/it]
sampling: 31%| | 49/157 [30:39<1:09:06, 38.39s/it]
sampling: 32%| | 50/157 [31:17<1:08:21, 38.34s/it]
sampling: 32%| | 51/157 [31:56<1:07:51, 38.41s/it]
sampling: 33%| | 52/157 [32:34<1:07:10, 38.39s/it]
sampling: 34%| | 53/157 [33:12<1:06:21, 38.29s/it]
sampling: 34%| | 54/157 [33:50<1:05:40, 38.26s/it]
sampling: 35%| | 55/157 [34:29<1:05:07, 38.31s/it]
sampling: 36%| | 56/157 [35:07<1:04:29, 38.32s/it]
sampling: 36%| | 57/157 [35:45<1:03:51, 38.31s/it]
sampling: 37%| | 58/157 [36:22<1:02:12, 37.70s/it]
sampling: 38%| | 59/157 [37:00<1:02:01, 37.97s/it]
sampling: 38%| | 60/157 [37:38<1:01:28, 38.02s/it]
sampling: 39%| | 61/157 [38:17<1:00:55, 38.08s/it]
sampling: 39%| | 62/157 [38:55<1:00:20, 38.11s/it]
sampling: 40%| | 63/157 [39:33<59:43, 38.12s/it]
sampling: 41%| | 64/157 [40:13<59:55, 38.66s/it]
sampling: 41%| | 65/157 [40:50<58:41, 38.28s/it]
sampling: 42%| | 66/157 [41:28<57:39, 38.02s/it]
sampling: 43%| | 67/157 [42:05<56:46, 37.85s/it]
sampling: 43%| | 68/157 [42:43<55:57, 37.73s/it]
sampling: 44%| | 69/157 [43:20<55:12, 37.64s/it]
sampling: 45%| | 70/157 [43:58<54:30, 37.59s/it]
sampling: 45%| | 71/157 [44:35<53:47, 37.53s/it]
sampling: 46%| | 72/157 [45:12<53:06, 37.49s/it]
sampling: 46%| | 73/157 [45:50<52:26, 37.46s/it]
sampling: 47%| | 74/157 [46:27<51:49, 37.46s/it]
sampling: 48%| | 75/157 [47:05<51:10, 37.44s/it]
sampling: 48%| | 76/157 [47:42<50:32, 37.43s/it]
sampling: 49%| | 77/157 [48:19<49:56, 37.46s/it]
sampling: 50%| | 78/157 [48:57<49:19, 37.46s/it]
sampling: 50%| | 79/157 [49:34<48:41, 37.45s/it]
sampling: 51%| | 80/157 [50:12<48:03, 37.45s/it]
sampling: 52%| | 81/157 [50:49<47:27, 37.47s/it]
sampling: 52%| | 82/157 [51:27<46:49, 37.47s/it]
sampling: 53%| | 83/157 [52:04<46:11, 37.45s/it]
sampling: 54%| | 84/157 [52:42<45:33, 37.45s/it]
sampling: 54%| | 85/157 [53:19<44:57, 37.47s/it]
sampling: 55%| | 86/157 [53:57<44:19, 37.46s/it]
sampling: 55%| | 87/157 [54:34<43:40, 37.44s/it]
sampling: 56%| | 88/157 [55:12<43:04, 37.46s/it]
sampling: 57%| | 89/157 [55:49<42:26, 37.45s/it]
sampling: 57%| | 90/157 [56:26<41:48, 37.44s/it]
sampling: 58%| | 91/157 [57:04<41:14, 37.49s/it]
sampling: 59%| | 92/157 [57:41<40:36, 37.48s/it]
sampling: 59%| | 93/157 [58:19<39:59, 37.49s/it]
sampling: 60%| | 94/157 [58:56<39:19, 37.45s/it]
sampling: 61%| | 95/157 [59:34<38:40, 37.43s/it]
sampling: 61%| | 96/157 [1:00:11<38:04, 37.46s/it]
sampling: 62%| | 97/157 [1:00:49<37:27, 37.46s/it]
sampling: 62%| | 98/157 [1:01:26<36:49, 37.46s/it]
sampling: 63%| | 99/157 [1:02:04<36:12, 37.45s/it]
sampling: 64%| | 100/157 [1:02:41<35:34, 37.45s/it]
sampling: 64%| | 101/157 [1:03:18<34:56, 37.43s/it]
sampling: 65%| | 102/157 [1:03:56<34:18, 37.43s/it]
sampling: 66%| | 103/157 [1:04:33<33:41, 37.44s/it]
sampling: 66%| | 104/157 [1:05:11<33:04, 37.45s/it]
sampling: 67%| | 105/157 [1:05:48<32:27, 37.44s/it]
sampling: 68%| | 106/157 [1:06:26<31:49, 37.44s/it]
| W0325 18:03:18.209000 538513 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 538675 closing signal SIGTERM |
| W0325 18:03:18.212000 538513 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 538677 closing signal SIGTERM |
| W0325 18:03:18.212000 538513 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 538678 closing signal SIGTERM |
| E0325 18:03:18.554000 538513 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 1 (pid: 538676) of binary: /gemini/space/zhaozy/guzhenyu/envs/envs/SiT/bin/python |
| Traceback (most recent call last): |
| File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/bin/torchrun", line 33, in <module> |
| sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')()) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper |
| return f(*args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^ |
| File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main |
| run(args) |
| File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run |
| elastic_launch( |
| File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ |
| return launch_agent(self._config, self._entrypoint, list(args)) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent |
| raise ChildFailedError( |
| torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
| ========================================================== |
| sample_from_checkpoint_ddp.py FAILED |
| |
| Failures: |
| <NO_OTHER_FAILURES> |
| |
| Root Cause (first observed failure): |
| [0]: |
| time : 2026-03-25_18:03:18 |
| host : 24c964746905d416ce09d045f9a06f23-taskrole1-0 |
| rank : 1 (local_rank: 1) |
| exitcode : -9 (pid: 538676) |
| error_file: <N/A> |
| traceback : Signal 9 (SIGKILL) received by PID 538676 |
| ========================================================== |
|
|