| /workspace/miniconda3/envs/dflash/bin/python3: can't open file '/workspace/hanrui/ ': [Errno 2] No such file or directory |
| E0317 16:57:14.100000 140364991186752 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 14058) of binary: /workspace/miniconda3/envs/dflash/bin/python3 |
| Traceback (most recent call last): |
| File "<frozen runpy>", line 198, in _run_module_as_main |
| File "<frozen runpy>", line 88, in _run_code |
| File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 905, in <module> |
| main() |
| File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper |
| return f(*args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main |
| run(args) |
| File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run |
| elastic_launch( |
| File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__ |
| return launch_agent(self._config, self._entrypoint, list(args)) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent |
| raise ChildFailedError( |
| torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
| ============================================================ |
| FAILED |
| ------------------------------------------------------------ |
| Failures: |
| <NO_OTHER_FAILURES> |
| ------------------------------------------------------------ |
| Root Cause (first observed failure): |
| [0]: |
| time : 2026-03-17_16:57:14 |
| host : job-006ce80a7c47-20260302193512-5dcd4c9bbd-gfjsn |
| rank : 0 (local_rank: 0) |
| exitcode : 2 (pid: 14058) |
| error_file: <N/A> |
| traceback : To enable traceback see: https: |
| ============================================================ |
| usage: run.py [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE] |
| [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT] |
| [--rdzv-id RDZV_ID] [--rdzv-conf RDZV_CONF] [--standalone] |
| [--max-restarts MAX_RESTARTS] |
| [--monitor-interval MONITOR_INTERVAL] |
| [--start-method {spawn,fork,forkserver}] [--role ROLE] [-m] |
| [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS] |
| [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER] |
| [--node-rank NODE_RANK] [--master-addr MASTER_ADDR] |
| [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR] |
| [--logs-specs LOGS_SPECS] |
| training_script ... |
| run.py: error: the following arguments are required: training_script, training_script_args |