jsflow / REG /samples_0.75_new.log
W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793]
W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] *****************************************
W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 16:55:16.395000 538513 site-packages/torch/distributed/run.py:793] *****************************************
Time grid: t_c=0.75, steps (1→t_c)=100, (t_c→0)=50
Total number of images that will be sampled: 40192
sampling: 0% | 0/157 [00:00<?, ?it/s]
[rank2]:[W325 16:56:59.533804859 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank3]:[W325 16:57:00.799344818 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[W325 16:57:00.847229448 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W325 16:57:01.326116049 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
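Note on the four barrier warnings above: the message itself names two remedies, passing device_ids to barrier() or passing a device_id to init_process_group(). The sampling script's own setup code is not shown in this log, so the following is only a minimal sketch of how that binding might look. The helper names local_rank_from_env and init_distributed are hypothetical; LOCAL_RANK is the environment variable that torchrun sets for each worker.

```python
import os


def local_rank_from_env(env=None) -> int:
    """Read the per-node rank that torchrun exports as LOCAL_RANK (0 if unset)."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0"))


def init_distributed() -> int:
    """Hypothetical setup helper: bind this rank to one GPU before any barrier."""
    # Imported lazily so the helper above can be used without a GPU build of torch.
    import torch
    import torch.distributed as dist

    local_rank = local_rank_from_env()
    torch.cuda.set_device(local_rank)
    # Passing device_id tells the process group which GPU this rank owns,
    # which is what the ProcessGroupNCCL warning asks for.
    dist.init_process_group(
        backend="nccl",
        device_id=torch.device("cuda", local_rank),
    )
    # Alternatively (or additionally), name the device on the barrier itself.
    dist.barrier(device_ids=[local_rank])
    return local_rank
```

This assumes one process per GPU on a single node, which matches the rank-to-GPU pairs printed in the warnings (rank 0 on GPU 0 through rank 3 on GPU 3).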
sampling: 1% | 1/157 [00:39<1:43:36, 39.85s/it]
sampling: 1% | 2/157 [01:17<1:39:23, 38.48s/it]
sampling: 2% | 3/157 [01:54<1:37:23, 37.94s/it]
sampling: 3% | 4/157 [02:32<1:36:11, 37.72s/it]
sampling: 3% | 5/157 [03:09<1:35:18, 37.62s/it]
sampling: 4% | 6/157 [03:46<1:34:33, 37.57s/it]
sampling: 4% | 7/157 [04:24<1:33:43, 37.49s/it]
sampling: 5% | 8/157 [05:01<1:32:54, 37.42s/it]
sampling: 6% | 9/157 [05:38<1:32:15, 37.40s/it]
sampling: 6% | 10/157 [06:16<1:31:34, 37.38s/it]
sampling: 7% | 11/157 [06:53<1:30:56, 37.37s/it]
sampling: 8% | 12/157 [07:30<1:30:15, 37.34s/it]
sampling: 8% | 13/157 [08:08<1:29:40, 37.36s/it]
sampling: 9% | 14/157 [08:45<1:29:02, 37.36s/it]
sampling: 10% | 15/157 [09:22<1:28:22, 37.34s/it]
sampling: 10% | 16/157 [10:00<1:27:50, 37.38s/it]
sampling: 11% | 17/157 [10:37<1:27:14, 37.39s/it]
sampling: 11% | 18/157 [11:15<1:26:35, 37.38s/it]
sampling: 12% | 19/157 [11:52<1:25:55, 37.36s/it]
sampling: 13% | 20/157 [12:29<1:25:18, 37.36s/it]
sampling: 13% | 21/157 [13:06<1:24:10, 37.14s/it]
sampling: 14% | 22/157 [13:43<1:23:41, 37.20s/it]
sampling: 15% | 23/157 [14:21<1:23:10, 37.24s/it]
sampling: 15% | 24/157 [14:58<1:22:38, 37.29s/it]
sampling: 16% | 25/157 [15:35<1:22:02, 37.29s/it]
sampling: 17% | 26/157 [16:13<1:21:27, 37.31s/it]
sampling: 17% | 27/157 [16:50<1:20:50, 37.31s/it]
sampling: 18% | 28/157 [17:27<1:20:14, 37.32s/it]
sampling: 18% | 29/157 [18:05<1:19:40, 37.35s/it]
sampling: 19% | 30/157 [18:42<1:18:40, 37.17s/it]
sampling: 20% | 31/157 [19:19<1:18:07, 37.21s/it]
sampling: 20% | 32/157 [19:56<1:17:37, 37.26s/it]
sampling: 21% | 33/157 [20:34<1:17:07, 37.32s/it]
sampling: 22% | 34/157 [21:11<1:16:31, 37.33s/it]
sampling: 22% | 35/157 [21:48<1:15:56, 37.35s/it]
sampling: 23% | 36/157 [22:26<1:15:18, 37.34s/it]
sampling: 24% | 37/157 [23:03<1:14:49, 37.41s/it]
sampling: 24% | 38/157 [23:41<1:14:15, 37.44s/it]
sampling: 25% | 39/157 [24:18<1:13:35, 37.42s/it]
sampling: 25% | 40/157 [24:55<1:12:53, 37.38s/it]
sampling: 26% | 41/157 [25:33<1:12:13, 37.36s/it]
sampling: 27% | 42/157 [26:10<1:11:38, 37.38s/it]
sampling: 27% | 43/157 [26:48<1:11:03, 37.40s/it]
sampling: 28% | 44/157 [27:26<1:11:07, 37.77s/it]
sampling: 29% | 45/157 [28:05<1:10:49, 37.95s/it]
sampling: 29% | 46/157 [28:43<1:10:38, 38.18s/it]
sampling: 30% | 47/157 [29:22<1:10:12, 38.30s/it]
sampling: 31% | 48/157 [30:01<1:09:53, 38.47s/it]
sampling: 31% | 49/157 [30:39<1:09:06, 38.39s/it]
sampling: 32% | 50/157 [31:17<1:08:21, 38.34s/it]
sampling: 32% | 51/157 [31:56<1:07:51, 38.41s/it]
sampling: 33% | 52/157 [32:34<1:07:10, 38.39s/it]
sampling: 34% | 53/157 [33:12<1:06:21, 38.29s/it]
sampling: 34% | 54/157 [33:50<1:05:40, 38.26s/it]
sampling: 35% | 55/157 [34:29<1:05:07, 38.31s/it]
sampling: 36% | 56/157 [35:07<1:04:29, 38.32s/it]
sampling: 36% | 57/157 [35:45<1:03:51, 38.31s/it]
sampling: 37% | 58/157 [36:22<1:02:12, 37.70s/it]
sampling: 38% | 59/157 [37:00<1:02:01, 37.97s/it]
sampling: 38% | 60/157 [37:38<1:01:28, 38.02s/it]
sampling: 39% | 61/157 [38:17<1:00:55, 38.08s/it]
sampling: 39% | 62/157 [38:55<1:00:20, 38.11s/it]
sampling: 40% | 63/157 [39:33<59:43, 38.12s/it]
sampling: 41% | 64/157 [40:13<59:55, 38.66s/it]
sampling: 41% | 65/157 [40:50<58:41, 38.28s/it]
sampling: 42% | 66/157 [41:28<57:39, 38.02s/it]
sampling: 43% | 67/157 [42:05<56:46, 37.85s/it]
sampling: 43% | 68/157 [42:43<55:57, 37.73s/it]
sampling: 44% | 69/157 [43:20<55:12, 37.64s/it]
sampling: 45% | 70/157 [43:58<54:30, 37.59s/it]
sampling: 45% | 71/157 [44:35<53:47, 37.53s/it]
sampling: 46% | 72/157 [45:12<53:06, 37.49s/it]
sampling: 46% | 73/157 [45:50<52:26, 37.46s/it]
sampling: 47% | 74/157 [46:27<51:49, 37.46s/it]
sampling: 48% | 75/157 [47:05<51:10, 37.44s/it]
sampling: 48% | 76/157 [47:42<50:32, 37.43s/it]
sampling: 49% | 77/157 [48:19<49:56, 37.46s/it]
sampling: 50% | 78/157 [48:57<49:19, 37.46s/it]
sampling: 50% | 79/157 [49:34<48:41, 37.45s/it]
sampling: 51% | 80/157 [50:12<48:03, 37.45s/it]
sampling: 52% | 81/157 [50:49<47:27, 37.47s/it]
sampling: 52% | 82/157 [51:27<46:49, 37.47s/it]
sampling: 53% | 83/157 [52:04<46:11, 37.45s/it]
sampling: 54% | 84/157 [52:42<45:33, 37.45s/it]
sampling: 54% | 85/157 [53:19<44:57, 37.47s/it]
sampling: 55% | 86/157 [53:57<44:19, 37.46s/it]
sampling: 55% | 87/157 [54:34<43:40, 37.44s/it]
sampling: 56% | 88/157 [55:12<43:04, 37.46s/it]
sampling: 57% | 89/157 [55:49<42:26, 37.45s/it]
sampling: 57% | 90/157 [56:26<41:48, 37.44s/it]
sampling: 58% | 91/157 [57:04<41:14, 37.49s/it]
sampling: 59% | 92/157 [57:41<40:36, 37.48s/it]
sampling: 59% | 93/157 [58:19<39:59, 37.49s/it]
sampling: 60% | 94/157 [58:56<39:19, 37.45s/it]
sampling: 61% | 95/157 [59:34<38:40, 37.43s/it]
sampling: 61% | 96/157 [1:00:11<38:04, 37.46s/it]
sampling: 62% | 97/157 [1:00:49<37:27, 37.46s/it]
sampling: 62% | 98/157 [1:01:26<36:49, 37.46s/it]
sampling: 63% | 99/157 [1:02:04<36:12, 37.45s/it]
sampling: 64% | 100/157 [1:02:41<35:34, 37.45s/it]
sampling: 64% | 101/157 [1:03:18<34:56, 37.43s/it]
sampling: 65% | 102/157 [1:03:56<34:18, 37.43s/it]
sampling: 66% | 103/157 [1:04:33<33:41, 37.44s/it]
sampling: 66% | 104/157 [1:05:11<33:04, 37.45s/it]
sampling: 67% | 105/157 [1:05:48<32:27, 37.44s/it]
sampling: 68% | 106/157 [1:06:26<31:49, 37.44s/it]
W0325 18:03:18.209000 538513 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 538675 closing signal SIGTERM
W0325 18:03:18.212000 538513 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 538677 closing signal SIGTERM
W0325 18:03:18.212000 538513 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 538678 closing signal SIGTERM
E0325 18:03:18.554000 538513 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 1 (pid: 538676) of binary: /gemini/space/zhaozy/guzhenyu/envs/envs/SiT/bin/python
Traceback (most recent call last):
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==========================================================
sample_from_checkpoint_ddp.py FAILED
----------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-03-25_18:03:18
  host      : 24c964746905d416ce09d045f9a06f23-taskrole1-0
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 538676)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 538676
==========================================================
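Note on the exit code in the summary above: the report pairs exitcode -9 with "Signal 9 (SIGKILL)", which follows the POSIX convention used by Python's subprocess tooling, where a negative exit code is the negated number of the signal that ended the process. The small helper below (describe_exitcode is a hypothetical name, not part of torchrun) sketches that decoding. The log does not record who sent the signal; the kernel out-of-memory killer or a cluster scheduler are common sources, but that is an assumption to check against dmesg or the job system, not something this log confirms.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Turn a worker exit code into a short human-readable description."""
    if exitcode < 0:
        # Negative codes mean "killed by signal number -exitcode".
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = f"signal {-exitcode}"
        return f"terminated by {name}"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"


print(describe_exitcode(-9))  # terminated by SIGKILL (on Linux)
```

Because SIGKILL cannot be caught, rank 1 had no chance to write an error file or traceback, which is consistent with error_file being <N/A> in the report.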