One shot bootstrap for 2 Sparks not working

by easonchow0419 - opened 8 days ago

nvidia@aitopatom-c1bf:$ ./bootstrap_dsv4_spark.sh --head-host spark-a --worker-host spark-b --head-qsfp-ip 10.0.0.1 --worker-qsfp-ip 10.0.0.2 --qsfp-ifname enp1s0f1np1
[19:31:53] [1/9] SSH reachability check...
[19:31:54] ok — both Sparks reachable
[19:31:54] [2/9] Ensuring pastapaul/DeepSeek-V4-Flash-W4A16-FP8 is cached on both Sparks (143 GiB)...
[aitopatom-c1bf] already cached at /home/nvidia/.cache/huggingface/hub/models--pastapaul--DeepSeek-V4-Flash-W4A16-FP8
[aitopatom-cbca] already cached at /home/nvidia/.cache/huggingface/hub/models--pastapaul--DeepSeek-V4-Flash-W4A16-FP8
[19:31:55] [3/9] Configuring QSFP /30 on both Sparks...
[19:31:55] verifying connectivity...
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.333 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.843 ms

--- 10.0.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1015ms
rtt min/avg/max/mdev = 0.333/0.588/0.843/0.255 ms
[19:31:57] ok — QSFP up, < 1 ms RTT
[19:31:57] [4-5/9] Building vllm-w4a16-dsv4:exp on spark-a from jasl/vllm@ds4-sm120-experimental + cherry-pick + packed_modules patch...
[19:31:57] (~25-40 min on a Spark; image ships to worker via docker save | scp | docker load)
Cloning into 'spark-vllm-docker'...
Downloading flashinfer_cubin-0.6.11-py3-none-any.whl...
######################################################################## 100.0%
Downloading flashinfer_jit_cache-0.6.11-cp39-abi3-manylinux_2_28_aarch64.whl...
######################################################################## 100.0%
Downloading flashinfer_python-0.6.11-py3-none-any.whl...
######################################################################## 100.0%
Recorded flashinfer commit hash: ef983122
FlashInfer wheels ready.
Rebuilding vLLM wheels (--vllm-ref specified)...
vLLM build command: docker build --target vllm-export --output type=local,dest=./wheels --progress=plain --build-arg BUILD_JOBS=16 --build-arg TORCH_CUDA_ARCH_LIST=12.1a --build-arg FLASHINFER_CUDA_ARCH_LIST=12.1a --build-arg VLLM_REF=ds4-sm120-experimental --build-arg CACHEBUST_VLLM=1778844754 .
#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 15.65kB done
#1 DONE 0.0s

#2 resolve image config for docker-image://docker.io/docker/dockerfile:1.6
#2 DONE 1.6s

#3 docker-image://docker.io/docker/dockerfile:1.6@sha256:ac85f380a63b13dfcefa89046420e1781752bab202122f8f50032edf31be0021
#3 CACHED

#4 [internal] load metadata for docker.io/nvidia/cuda:13.2.0-devel-ubuntu24.04
#4 DONE 1.4s

#5 [internal] load .dockerignore
#5 transferring context: 2B done
#5 DONE 0.0s

#6 [base 1/5] FROM docker.io/nvidia/cuda:13.2.0-devel-ubuntu24.04@sha256:f9492f2eea77fbc3d0c14fa8738f35946b42da72917bf5959d284ca39b4f209a
#6 DONE 0.0s

#7 [base 5/5] RUN git clone -b dgxspark-3node-ring https://github.com/zyang-dev/nccl.git && cd nccl && make -j 16 src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121" && make pkg.debian.build && apt install -y --no-install-recommends --allow-downgrades ./build/pkg/deb/*.deb
#7 CACHED

#8 [base 2/5] RUN apt update && apt install -y --no-install-recommends curl vim cmake build-essential ninja-build libcudnn9-cuda-13 libcudnn9-dev-cuda-13 python3-dev python3-pip git wget libibverbs1 libibverbs-dev rdma-core ccache devscripts debhelper fakeroot && rm -rf /var/lib/apt/lists/* && pip install uv
#8 CACHED

#9 [base 3/5] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv uv pip install torch==2.11.0 torchvision torchaudio triton --index-url https://download.pytorch.org/whl/cu130 && uv pip install nvidia-nvshmem-cu13 "apache-tvm-ffi<0.2" filelock pynvml requests tqdm
#9 CACHED

#10 [base 4/5] WORKDIR /workspace/vllm
#10 CACHED

#11 [vllm-builder 1/11] WORKDIR /workspace/vllm
#11 CACHED

#12 [internal] load build context
#12 transferring context: 1.96kB 0.1s done
#12 DONE 0.1s

#13 [vllm-builder 3/11] WORKDIR /workspace/vllm/vllm
#13 CACHED

#14 [vllm-builder 4/11] RUN if [ -n "" ]; then git config --global user.email "builder@example.com"; git config --global user.name "Docker Builder"; echo "Applying PRs: "; for pr in ; do echo "Fetching and merging PR #$pr..."; git fetch origin pull/${pr}/head:pr-${pr}; git merge pr-${pr} --no-edit; done; fi
#14 CACHED

#15 [vllm-builder 5/11] COPY kylesayrs-deepseek-ct.patch /tmp/kylesayrs-deepseek-ct.patch
#15 ERROR: failed to calculate checksum of ref a1db2961-a383-481a-a9d8-959d28d74c2e::nsf3gybaao6i1tje326dxpqe1: "/kylesayrs-deepseek-ct.patch": not found

#16 [vllm-builder 2/11] RUN --mount=type=cache,id=repo-cache-v3,target=/repo-cache cd /repo-cache && if [ ! -d "vllm" ]; then echo "Cache miss: Cloning vLLM from scratch..." && git clone --recursive https://github.com/jasl/vllm.git; if [ "ds4-sm120-experimental" != "main" ]; then cd vllm && git checkout ds4-sm120-experimental; fi; else echo "Cache hit: Fetching updates..." && cd vllm && git fetch origin && git fetch origin --tags --force && (git checkout --detach origin/ds4-sm120-experimental 2>/dev/null || git checkout ds4-sm120-experimental) && git submodule update --init --recursive && git clean -fdx && git gc --auto; fi && cp -a /repo-cache/vllm /workspace/vllm/
#16 0.138 Cache hit: Fetching updates...
#16 CANCELED

[vllm-builder 5/11] COPY kylesayrs-deepseek-ct.patch /tmp/kylesayrs-deepseek-ct.patch:

ERROR: failed to build: failed to solve: failed to compute cache key: failed to calculate checksum of ref a1db2961-a383-481a-a9d8-959d28d74c2e::nsf3gybaao6i1tje326dxpqe1: "/kylesayrs-deepseek-ct.patch": not found
vLLM build failed — restoring previous wheels...

pastapaul

Canada Quant Labs org 7 days ago

Thanks for the report — this is our bug, not yours. The Dockerfile started COPYing kylesayrs-deepseek-ct.patch into the build context after I migrated away from a live cherry-pick (Kyle force-pushed his branch out of history a week ago), but I never updated the bootstrap script to curl the patch file alongside the Dockerfile. So the script was guaranteed to fail at the COPY step on any fresh run. Sorry about that.
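The failure described above (a Dockerfile COPY whose source file was never fetched into the build context) could in principle be caught before docker build starts. Below is a minimal sketch of such a pre-flight check. It is not what the bootstrap script does: the function name missing_copy_sources is hypothetical, and it only handles simple single-line "COPY src... dest" instructions, skipping --from= stage copies, flags, and glob patterns.

```shell
# Print every COPY source in DOCKERFILE that is absent from CONTEXT_DIR.
# Exit status is 1 if anything is missing, 0 otherwise.
# The body runs in a subshell so "set -f" and the variables stay local.
missing_copy_sources() (
  dockerfile=$1
  context=$2
  missing=0
  set -f   # do not glob-expand words when a line is split below
  while IFS= read -r line; do
    case $line in
      COPY\ *--from=*) continue ;;   # copied from another stage, not the context
      COPY\ *) ;;
      *) continue ;;
    esac
    set -- $line   # $1 is COPY, the last word is the destination
    shift
    while [ "$#" -gt 1 ]; do
      case $1 in
        --*|*[*?]*) ;;   # skip flags and glob patterns
        *)
          if [ ! -e "$context/$1" ]; then
            printf '%s\n' "$1"
            missing=1
          fi
          ;;
      esac
      shift
    done
  done <"$dockerfile"
  exit "$missing"
)
```

Run from the directory that holds the Dockerfile, a call such as `missing_copy_sources Dockerfile . || echo "fetch the files listed above first"` would have printed kylesayrs-deepseek-ct.patch for the build in this report.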
Fixed on main in 4c64828. Just re-fetch and run again:
curl -fsSLO https://raw.githubusercontent.com/pasta-paul/dsv4-flash-w4a16-fp8/main/scripts/bootstrap_dsv4_spark.sh
chmod +x bootstrap_dsv4_spark.sh
./bootstrap_dsv4_spark.sh --head-host spark-a --worker-host spark-b ...
The same commit also adds three things: an unconditional huggingface-cli download, replacing a lax cache-present check that could pass on half-cached models; a build-metadata.yaml dump to /tmp/dsv4-spark-build-metadata-*.yaml on success; and dual-node diagnostics (last 300 lines of logs, env, nvidia-smi, and dmesg from both nodes) on any failure during boot. If you hit something else, paste the /tmp/...metadata.yaml plus whatever the failure dump prints and I can take it from there.
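To find the metadata dump for pasting, something like the sketch below may help. It assumes only the file name pattern quoted above; the helper name newest_metadata is hypothetical, and the format of the file's contents is not assumed.

```shell
# Print the newest file in DIR whose name starts with PREFIX and ends in .yaml.
# Returns 1 if no such file exists.
newest_metadata() {
  dir=${1:-/tmp}
  prefix=${2:-dsv4-spark-build-metadata-}
  # ls -t sorts newest first; the error from an unmatched glob is discarded
  newest=$(ls -t "$dir"/"$prefix"*.yaml 2>/dev/null | head -n 1)
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

With no arguments it looks in /tmp, so `cat "$(newest_metadata)"` would print the most recent dump once a successful run has written one.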
