	# Hugging Face Artifact Upload

	Large DecodeShare artifacts should live outside Git history. The intended
	Hugging Face repository is:

	```text
	Zishan-Shao/decodeshare
	```

	The file `docs/artifact_manifest.tsv` lists large local files and suggested
	paths under `artifacts/` in the Hugging Face repository.
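
A quick way to get oriented in the manifest before uploading is to print its first row field by field and count the remaining rows. This is a minimal sketch: `summarize_manifest` is a hypothetical helper, and it assumes only that the file is tab-separated and that its first row is a header. The actual column layout is not specified here.

```bash
# summarize_manifest FILE
# Hypothetical helper: print each field of the first row on its own line,
# then the number of rows that follow it. Assumes a tab-separated file whose
# first row is a header.
summarize_manifest() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) printf "field %d: %s\n", i, $i; next }
    END     { printf "rows after first: %d\n", (NR > 0 ? NR - 1 : 0) }
  ' "$1"
}
```

From the repository root, `summarize_manifest docs/artifact_manifest.tsv` would then show which columns are available before you decide what to upload.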

	## Suggested upload pattern

	Install and authenticate the Hugging Face CLI:

	```bash
	pip install -U "huggingface_hub[hf_transfer]"
	hf auth login
	```
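
Before a long upload it can help to confirm that the CLI is on `PATH` and that a token is stored. The sketch below is one way to do that. `hf_preflight` is a hypothetical helper name, and exporting `HF_HUB_ENABLE_HF_TRANSFER=1` rests on the assumption that your installed `huggingface_hub` version only uses the `hf_transfer` backend when that variable is set.

```bash
# hf_preflight
# Hypothetical helper, not part of huggingface_hub: fail early if the CLI is
# missing or no account is logged in.
hf_preflight() {
  if ! command -v hf >/dev/null 2>&1; then
    echo "hf CLI not found; install huggingface_hub first" >&2
    return 1
  fi
  # Assumption: installing the hf_transfer extra does not enable it by itself.
  export HF_HUB_ENABLE_HF_TRANSFER=1
  # Prints the logged-in account, or exits non-zero if no token is stored.
  hf auth whoami
}
```

Running `hf_preflight && echo ready` before the upload commands gives an early failure instead of an error partway through a transfer.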

	For a full upload from the original workspace:

	```bash
	cd /path/to/decodeshare
	hf upload Zishan-Shao/decodeshare Hype1/results/acts artifacts/Hype1/results/acts
	hf upload Zishan-Shao/decodeshare patch_back/results artifacts/patch_back/results
	```
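
Both commands above map a local directory to the same path under `artifacts/`. A small wrapper can make that mapping explicit and let you review the command first. This is a sketch only: `upload_artifact_dir` and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```bash
REPO_ID="Zishan-Shao/decodeshare"

# upload_artifact_dir REL_DIR
# Hypothetical helper: upload REL_DIR (relative to the current directory) to
# artifacts/REL_DIR in the repository. With DRY_RUN=1 it only prints the
# command it would run.
upload_artifact_dir() {
  local rel="$1"
  if [ ! -d "$rel" ]; then
    echo "skip: $rel is not a directory" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "hf upload $REPO_ID $rel artifacts/$rel"
  else
    hf upload "$REPO_ID" "$rel" "artifacts/$rel"
  fi
}
```

For example, `DRY_RUN=1 upload_artifact_dir Hype1/results/acts` prints the first command from the block above without contacting the Hub.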

	## Downstream outputs

	`downstream/outputs` contains several `.pt` files. The profiling caches can
	exceed the Hugging Face Hub's 50 GB single-file limit, so split them into
	ordered 10 GiB parts before uploading instead of uploading the raw files.

	The uploaded layout keeps the original run directories. Files ending in
	`.pt.part-000`, `.pt.part-001`, and so on are split chunks that should be
	concatenated in lexical order to recover the original `.pt` file.
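
The split and reassembly round trip can be tried on a small throwaway file first. The sketch below assumes GNU `split` (the `-d` numeric-suffix flag is not available in every implementation) and uses a hypothetical `demo.pt` file with 1 MiB parts in place of the real 10 GiB parts. The zero-padded three-digit suffixes are what make the shell glob sort the parts in the right order.

```bash
# Work in a scratch directory so nothing real is touched.
WORK=$(mktemp -d)

# Stand-in for a large .pt file: 2,500,000 random bytes.
head -c 2500000 /dev/urandom > "$WORK/demo.pt"

# Split into 1 MiB parts named demo.pt.part-000, demo.pt.part-001, ...
split -b 1M -d -a 3 "$WORK/demo.pt" "$WORK/demo.pt.part-"

# The glob expands in lexical order, which matches the numeric order here.
cat "$WORK"/demo.pt.part-* > "$WORK/demo.restored.pt"

# cmp exits 0 only if the two files are byte-for-byte identical.
cmp "$WORK/demo.pt" "$WORK/demo.restored.pt" && echo "round trip OK"
```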

	Example staging flow:

	```bash
	SRC_ROOT=/path/to/decodeshare
	STAGE=/path/to/decodeshare_hf_downstream_split
	# Start from an empty staging tree; ${STAGE:?} aborts if STAGE is unset or empty.
	rm -rf "${STAGE:?}"
	mkdir -p "$STAGE/artifacts/downstream/outputs"

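	# Hard-link the whitening output as is (SRC_ROOT and STAGE must share a
	# filesystem for `ln`), then split the profiling cache into 10 GiB parts.
	mkdir -p "$STAGE/artifacts/downstream/outputs/llama2_r0.2_baseline"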
	ln "$SRC_ROOT/downstream/outputs/llama2_r0.2_baseline/meta_llama_Llama_2_7b_chat_hf_whitening_only_keep0p8_baseline.pt" \
	"$STAGE/artifacts/downstream/outputs/llama2_r0.2_baseline/"
	split -b 10G -d -a 3 \
	"$SRC_ROOT/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt" \
	"$STAGE/artifacts/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt.part-"

	mkdir -p "$STAGE/artifacts/downstream/outputs/llama2_r0.2_decodeshare_a2"
	ln "$SRC_ROOT/downstream/outputs/llama2_r0.2_decodeshare_a2/meta_llama_Llama_2_7b_chat_hf_whitening_only_keep0p8_decodeshare_a2p0.pt" \
	"$STAGE/artifacts/downstream/outputs/llama2_r0.2_decodeshare_a2/"
	split -b 10G -d -a 3 \
	"$SRC_ROOT/downstream/outputs/llama2_r0.2_decodeshare_a2/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt" \
	"$STAGE/artifacts/downstream/outputs/llama2_r0.2_decodeshare_a2/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt.part-"

	mkdir -p "$STAGE/artifacts/downstream/outputs/svdllm_whiten_r0.2"
	ln "$SRC_ROOT/downstream/outputs/svdllm_whiten_r0.2/meta_llama_Llama_2_7b_chat_hf_whitening_only_0.8.pt" \
	"$STAGE/artifacts/downstream/outputs/svdllm_whiten_r0.2/"
	split -b 10G -d -a 3 \
	"$SRC_ROOT/downstream/outputs/svdllm_whiten_r0.2/meta_llama_Llama_2_7b_chat_hf_profiling_wikitext2_128_0.pt" \
	"$STAGE/artifacts/downstream/outputs/svdllm_whiten_r0.2/meta_llama_Llama_2_7b_chat_hf_profiling_wikitext2_128_0.pt.part-"

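	# Sanity check: list any staged file still larger than 50 GiB (GNU find's
	# `G` unit). This should print nothing.
	find "$STAGE" -type f -size +50G -print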
	hf upload-large-folder Zishan-Shao/decodeshare "$STAGE" \
	--repo-type model \
	--num-workers 4 \
	--no-bars
	```
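
The three stanzas in the staging flow apply one rule: link a file that fits under the limit, split one that does not. That rule can be sketched as a single function. `stage_file` is a hypothetical helper. Its default threshold of 50 × 10^9 bytes is an assumption (the more conservative reading of "50 GB"), and the `cp` fallback for when hard-linking fails is an addition that the flow above does not include. GNU `split` is assumed.

```bash
# stage_file SRC DEST_DIR [MAX_BYTES] [PART_BYTES]
# Hypothetical helper: place SRC in DEST_DIR, either as a hard link or as
# numbered parts if it is larger than MAX_BYTES.
stage_file() {
  local src="$1" dest_dir="$2"
  local max_bytes="${3:-50000000000}"    # assumed limit: 50 * 10^9 bytes
  local part_bytes="${4:-10737418240}"   # 10 GiB, as in the flow above
  local name size
  name=$(basename "$src")
  size=$(( $(wc -c < "$src") ))          # file size in bytes
  mkdir -p "$dest_dir"
  if [ "$size" -gt "$max_bytes" ]; then
    # Too large for one upload: write NAME.part-000, NAME.part-001, ...
    split -b "$part_bytes" -d -a 3 "$src" "$dest_dir/$name.part-"
  else
    # Hard link when possible; copy if SRC is on another filesystem.
    ln "$src" "$dest_dir/$name" 2>/dev/null || cp "$src" "$dest_dir/$name"
  fi
}
```

Calling `stage_file` on every `.pt` file under a run directory would produce the same layout as the explicit commands, without hard-coding which files need splitting.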

	Reassemble a split file after download:

	```bash
	cat artifacts/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt.part-* \
	> artifacts/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt
	```
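
To rebuild every split file under a downloaded tree in one pass, the same `cat` can be driven by `find`. This is a minimal sketch: `reassemble_all` is a hypothetical helper, and it assumes every split file has a `.part-000` chunk and that all of its chunks sit in the same directory.

```bash
# reassemble_all ROOT
# Hypothetical helper: for each *.part-000 under ROOT, concatenate all
# sibling parts in lexical order into the original file name.
reassemble_all() {
  local root="$1" first base
  find "$root" -type f -name '*.part-000' -print0 |
    while IFS= read -r -d '' first; do
      base="${first%.part-000}"        # strip the suffix to get the target path
      cat "$base".part-* > "$base"     # the glob sorts part-000, part-001, ...
      echo "rebuilt $base"
    done
}
```

For example, `reassemble_all artifacts/downstream/outputs` would rebuild the profiling caches for all three run directories. The parts are left in place, so delete them afterwards if disk space matters.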

	## Notes

	- The current GitHub branch excludes `.npy`, `.npz`, `.pt`, `.bin`, and related
	large binary formats.
	- The largest local files are downstream `.pt` profiling and whitening outputs;
	the profiling outputs need the split workflow above.
	- The manifest is intentionally broad and includes local experimental archives.
	Inspect it before uploading everything.
	- To host the artifacts in a dataset repository rather than a model
	repository, create or switch to a Hugging Face dataset repo, add
	`--repo-type dataset` to the `hf upload` commands, and change
	`--repo-type model` to `--repo-type dataset` in the
	`hf upload-large-folder` command.
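
For the dataset-repository case in the last note, one way to keep the repo type consistent across commands is to set it once and build each command from it. The sketch below only prints the commands for review. The `REPO_TYPE`, `upload_cmd`, and `large_cmd` names are hypothetical, and the `STAGE` fallback is the placeholder path from the staging flow.

```bash
REPO_ID="Zishan-Shao/decodeshare"
REPO_TYPE="dataset"   # use "model" to match the commands earlier in this file

# Build each command as an array so the flag is set in one place.
upload_cmd=(hf upload "$REPO_ID" patch_back/results artifacts/patch_back/results
            --repo-type "$REPO_TYPE")
large_cmd=(hf upload-large-folder "$REPO_ID"
           "${STAGE:-/path/to/decodeshare_hf_downstream_split}"
           --repo-type "$REPO_TYPE" --num-workers 4 --no-bars)

# Print for review only; nothing is uploaded here.
printf '%s\n' "${upload_cmd[*]}" "${large_cmd[*]}"
```

Once the dataset repo exists, running `"${upload_cmd[@]}"` or `"${large_cmd[@]}"` executes the printed command as shown.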