jsflow/REG/samples_0.25_new.log
W0408 11:05:05.680000 100112 site-packages/torch/distributed/run.py:793]
W0408 11:05:05.680000 100112 site-packages/torch/distributed/run.py:793] *****************************************
W0408 11:05:05.680000 100112 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0408 11:05:05.680000 100112 site-packages/torch/distributed/run.py:793] *****************************************
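torchrun's default of one OpenMP thread per rank can be overridden at launch time. Below is a minimal relaunch sketch in Python; the entrypoint name sample.py and the thread count of 8 are placeholders, since the log does not show the actual command line or CPU budget.

# Hedged sketch: relaunch with an explicit OMP_NUM_THREADS instead of
# torchrun's default of 1. "sample.py" and the value 8 are placeholders.
import os
import subprocess

env = dict(os.environ, OMP_NUM_THREADS="8")  # threads per rank
subprocess.run(
    ["torchrun", "--nproc_per_node=4", "sample.py"],  # 4 ranks, matching this run
    env=env,
    check=True,
)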
Time grid: t_c=0.25, steps (1β†’t_c)=100, (t_cβ†’0)=2
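The line above describes a two-phase schedule: 100 uniform steps from t=1 down to the cut t_c=0.25, then 2 steps from t_c down to 0. A minimal sketch of how such a grid could be built, assuming torch.linspace and these variable names (the repo's actual construction is not shown in this log):

import torch

t_c = 0.25
coarse = torch.linspace(1.0, t_c, steps=100 + 1)  # 100 intervals on [t_c, 1]
fine = torch.linspace(t_c, 0.0, steps=2 + 1)      # 2 intervals on [0, t_c]
t_grid = torch.cat([coarse, fine[1:]])            # drop the duplicated t_c
assert t_grid.numel() == 103                      # 102 intervals in total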
Total number of images that will be sampled: 5120
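The 5120 total is consistent with the run geometry visible elsewhere in this log: 4 ranks (the NCCL warnings below) and 20 sampling iterations (the progress bar), which implies a per-GPU batch of 64. A quick check, with the batch size inferred rather than logged:

world_size = 4                                     # ranks 0-3 appear below
iterations = 20                                    # "0/20" in the progress bar
per_gpu_batch = 5120 // (world_size * iterations)  # -> 64 images per GPU per step (inferred)
assert per_gpu_batch * world_size * iterations == 5120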
sampling:   0%|          | 0/20 [00:00<?, ?it/s]
[rank3]:[W408 11:06:31.576422087 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank0]:[W408 11:06:31.621528015 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank2]:[W408 11:06:31.627182760 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank1]:[W408 11:06:32.966218444 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
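The warning itself names both fixes: pass device_ids to barrier(), or give init_process_group() a device_id so the rank-to-GPU mapping is known up front. A minimal sketch of the two options, using the standard torchrun LOCAL_RANK convention (this repo's actual setup code is not in the log):

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Option 1: bind the process group to a device at initialization...
dist.init_process_group("nccl", device_id=torch.device(f"cuda:{local_rank}"))
# Option 2: ...or name the device explicitly at each barrier.
dist.barrier(device_ids=[local_rank])

dist.destroy_process_group()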
sampling:   5%|β–Œ         | 1/20 [00:28<09:00, 28.44s/it]
sampling:  10%|β–ˆ         | 2/20 [00:54<08:11, 27.28s/it]
sampling:  15%|β–ˆβ–Œ        | 3/20 [01:21<07:37, 26.89s/it]
sampling:  20%|β–ˆβ–ˆ        | 4/20 [01:47<07:07, 26.74s/it]
sampling:  25%|β–ˆβ–ˆβ–Œ       | 5/20 [02:14<06:39, 26.61s/it]
sampling:  30%|β–ˆβ–ˆβ–ˆ       | 6/20 [02:40<06:12, 26.57s/it]
sampling:  35%|β–ˆβ–ˆβ–ˆβ–Œ      | 7/20 [03:07<05:45, 26.58s/it]
sampling:  40%|β–ˆβ–ˆβ–ˆβ–ˆ      | 8/20 [03:33<05:18, 26.57s/it]
sampling:  45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 9/20 [04:00<04:51, 26.53s/it]
sampling:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 10/20 [04:26<04:24, 26.48s/it]
sampling:  55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 11/20 [04:53<03:58, 26.51s/it]
W0408 11:11:04.741000 100112 site-packages/torch/distributed/elastic/agent/server/api.py:704] Received 15 death signal, shutting down workers
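Up to the shutdown, throughput was steady at ~26.5 s/it. With the per-GPU batch of 64 inferred earlier, that works out to roughly 9.7 images/s across the node; an estimate, not a logged figure:

sec_per_it = 26.51                 # average from the final progress line
images_per_it = 64 * 4             # inferred per-GPU batch x 4 ranks
print(f"{images_per_it / sec_per_it:.1f} images/s")  # -> 9.7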
W0408 11:11:04.746000 100112 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 100158 closing signal SIGTERM
W0408 11:11:04.748000 100112 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 100159 closing signal SIGTERM
W0408 11:11:04.749000 100112 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 100160 closing signal SIGTERM
W0408 11:11:04.749000 100112 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 100161 closing signal SIGTERM
Traceback (most recent call last):
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 855, in _invoke_run
time.sleep(monitor_interval)
File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 100112 got signal: 15
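The traceback is not a crash in the sampling code: the launcher (PID 100112) received SIGTERM from outside, for example a scheduler kill or a manual interrupt, at the 11/20 mark, and torchrun's elastic agent converted the signal into a Python exception so it could send SIGTERM to the four workers first (the closing-signal lines above). A simplified sketch of that mechanism, not the actual torch source:

import os
import signal

class SignalException(Exception):
    """Raised in the main thread when a termination signal arrives."""
    def __init__(self, msg: str, sigval: int):
        super().__init__(msg)
        self.sigval = sigval

def _terminate_process_handler(signum, frame):
    # Turn the asynchronous signal into a synchronous exception so the
    # agent's monitor loop (the time.sleep in the traceback) can unwind
    # and terminate its workers cleanly.
    raise SignalException(f"Process {os.getpid()} got signal: {signum}", sigval=signum)

signal.signal(signal.SIGTERM, _terminate_process_handler)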