Hugging Face Artifact Upload

Large DecodeShare artifacts should live outside Git history. The intended repository is:

Zishan-Shao/decodeshare

The file docs/artifact_manifest.tsv lists large local files and suggested paths under artifacts/ in the Hugging Face repository.

Suggested upload pattern

Install and authenticate the Hugging Face CLI:

pip install -U "huggingface_hub[hf_transfer]"
hf auth login

For a full upload from the original workspace:

cd /path/to/decodeshare
hf upload Zishan-Shao/decodeshare Hype1/results/acts artifacts/Hype1/results/acts
hf upload Zishan-Shao/decodeshare patch_back/results artifacts/patch_back/results

Downstream outputs

downstream/outputs contains several .pt files. The profiling caches can be larger than Hugging Face Hub's 50 GB single-file limit, so upload them as ordered 10 GiB parts rather than as raw files.

The uploaded layout keeps the original run directories. Files ending in .pt.part-000, .pt.part-001, and so on are split chunks that should be concatenated in lexical order to recover the original .pt file.
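The split-and-concatenate round trip can be rehearsed on a small file before touching the real caches. The following is a minimal sketch, assuming GNU coreutils (`split -d` and the `2M` size suffix) and using a hypothetical scratch directory and file name (`demo.pt`) that are not part of the repository:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory and file name, used only for this demo.
DEMO_DIR="$(mktemp -d)"
ORIG="$DEMO_DIR/demo.pt"

# Create a 5 MiB stand-in for a large .pt file.
head -c $((5 * 1024 * 1024)) /dev/urandom > "$ORIG"

# Same flags as the staging flow, with a smaller chunk size:
# produces demo.pt.part-000, demo.pt.part-001, demo.pt.part-002.
split -b 2M -d -a 3 "$ORIG" "$ORIG.part-"

# The shell expands the glob in lexical order, which matches the chunk order
# because the numeric suffixes are zero-padded to a fixed width.
cat "$ORIG".part-* > "$DEMO_DIR/rebuilt.pt"

# cmp exits non-zero (and set -e aborts the script) if the files differ.
cmp "$ORIG" "$DEMO_DIR/rebuilt.pt"

PART_COUNT="$(find "$DEMO_DIR" -name 'demo.pt.part-*' | wc -l | tr -d ' ')"
echo "round trip OK with $PART_COUNT parts"
```

The fixed-width suffix (`-a 3`) is what makes lexical order safe: without zero padding, a part numbered 10 would sort before part 2.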

Example staging flow:

SRC_ROOT=/path/to/decodeshare
STAGE=/path/to/decodeshare_hf_downstream_split
rm -rf "${STAGE:?}"
mkdir -p "$STAGE/artifacts/downstream/outputs"

mkdir -p "$STAGE/artifacts/downstream/outputs/llama2_r0.2_baseline"
# Hard-link the small file instead of copying it. SRC_ROOT and STAGE must be on the same filesystem; use cp otherwise.
ln "$SRC_ROOT/downstream/outputs/llama2_r0.2_baseline/meta_llama_Llama_2_7b_chat_hf_whitening_only_keep0p8_baseline.pt" \
  "$STAGE/artifacts/downstream/outputs/llama2_r0.2_baseline/"
split -b 10G -d -a 3 \
  "$SRC_ROOT/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt" \
  "$STAGE/artifacts/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt.part-"

mkdir -p "$STAGE/artifacts/downstream/outputs/llama2_r0.2_decodeshare_a2"
ln "$SRC_ROOT/downstream/outputs/llama2_r0.2_decodeshare_a2/meta_llama_Llama_2_7b_chat_hf_whitening_only_keep0p8_decodeshare_a2p0.pt" \
  "$STAGE/artifacts/downstream/outputs/llama2_r0.2_decodeshare_a2/"
split -b 10G -d -a 3 \
  "$SRC_ROOT/downstream/outputs/llama2_r0.2_decodeshare_a2/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt" \
  "$STAGE/artifacts/downstream/outputs/llama2_r0.2_decodeshare_a2/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt.part-"

mkdir -p "$STAGE/artifacts/downstream/outputs/svdllm_whiten_r0.2"
ln "$SRC_ROOT/downstream/outputs/svdllm_whiten_r0.2/meta_llama_Llama_2_7b_chat_hf_whitening_only_0.8.pt" \
  "$STAGE/artifacts/downstream/outputs/svdllm_whiten_r0.2/"
split -b 10G -d -a 3 \
  "$SRC_ROOT/downstream/outputs/svdllm_whiten_r0.2/meta_llama_Llama_2_7b_chat_hf_profiling_wikitext2_128_0.pt" \
  "$STAGE/artifacts/downstream/outputs/svdllm_whiten_r0.2/meta_llama_Llama_2_7b_chat_hf_profiling_wikitext2_128_0.pt.part-"

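# Sanity check: this should print nothing, because no staged file may exceed the Hub's single-file limit.
find "$STAGE" -type f -size +50G -print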
hf upload-large-folder Zishan-Shao/decodeshare "$STAGE" \
  --repo-type model \
  --num-workers 4 \
  --no-bars

Reassemble a split file after download:

cat artifacts/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt.part-* \
  > artifacts/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt
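To rebuild every split file in one pass rather than one command per file, a loop over the first part of each file is one option. This is a sketch only: `reassemble_parts` is a hypothetical helper, not something shipped with DecodeShare, and it assumes bash and the three-digit `.part-NNN` suffix used above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# reassemble_parts ROOT: rebuild every FILE from FILE.part-000, FILE.part-001, ...
# found under ROOT. Hypothetical helper, not part of the DecodeShare repository.
reassemble_parts() {
  local root="$1" first target
  # Each split file has exactly one part-000, so use it to enumerate targets.
  find "$root" -type f -name '*.part-000' -print0 |
    while IFS= read -r -d '' first; do
      target="${first%.part-000}"
      # The glob expands in lexical order, which is the original chunk order.
      cat "$target".part-[0-9][0-9][0-9] > "$target"
      echo "rebuilt $target"
    done
}

# Usage (after downloading the repository contents):
#   reassemble_parts artifacts/downstream/outputs
```

The parts are left in place after reassembly, so delete them separately once the rebuilt file has been checked.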

Notes

  • The current GitHub branch excludes .npy, .npz, .pt, .bin, and related large binary formats.
  • The largest local files are downstream .pt profiling and whitening outputs; the profiling outputs need the split workflow above.
  • The manifest is intentionally broad and includes local experimental archives. Inspect it before uploading everything.
  • If you want this to be a dataset repository rather than a model repository, create or switch to a Hugging Face Dataset repo, add --repo-type dataset to the hf upload commands, and replace --repo-type model with --repo-type dataset in the hf upload-large-folder command.