Hanrui / nohup.out
/workspace/miniconda3/envs/dflash/bin/python3: can't open file '/workspace/hanrui/ ': [Errno 2] No such file or directory
E0317 16:57:14.100000 140364991186752 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 14058) of binary: /workspace/miniconda3/envs/dflash/bin/python3
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-03-17_16:57:14
  host      : job-006ce80a7c47-20260302193512-5dcd4c9bbd-gfjsn
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 14058)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
usage: run.py [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
              [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT]
              [--rdzv-id RDZV_ID] [--rdzv-conf RDZV_CONF] [--standalone]
              [--max-restarts MAX_RESTARTS]
              [--monitor-interval MONITOR_INTERVAL]
              [--start-method {spawn,fork,forkserver}] [--role ROLE] [-m]
              [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
              [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER]
              [--node-rank NODE_RANK] [--master-addr MASTER_ADDR]
              [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
              [--logs-specs LOGS_SPECS]
              training_script ...
run.py: error: the following arguments are required: training_script, training_script_args