Zishan-Shao committed
Commit fdfb84b · verified · 1 parent: 87e8ca3

Upload docs/HUGGINGFACE_UPLOAD.md with huggingface_hub

Files changed (1):
  1. docs/HUGGINGFACE_UPLOAD.md +51 -6
docs/HUGGINGFACE_UPLOAD.md CHANGED
@@ -22,17 +22,62 @@ hf auth login
For a full upload from the original workspace:

```bash
- cd
hf upload Zishan-Shao/decodeshare Hype1/results/acts artifacts/Hype1/results/acts
- hf upload Zishan-Shao/decodeshare downstream/outputs artifacts/downstream/outputs
hf upload Zishan-Shao/decodeshare patch_back/results artifacts/patch_back/results
```

- For a smaller first release, upload only the most reusable artifacts:

```bash
- hf upload Zishan-Shao/decodeshare Hype1/results/acts artifacts/Hype1/results/acts
- hf upload Zishan-Shao/decodeshare patch_back/results artifacts/patch_back/results
```

## Notes
@@ -40,7 +85,7 @@ hf upload Zishan-Shao/decodeshare patch_back/results artifacts/patch_back/result
- The current GitHub branch excludes `.npy`, `.npz`, `.pt`, `.bin`, and related
  large binary formats.
- The largest local files are downstream `.pt` profiling and whitening outputs;
- decide whether those are necessary before uploading the full manifest.
- The manifest is intentionally broad and includes local experimental archives.
  Inspect it before uploading everything.
- If you want this to be a dataset repository rather than a model repository,
 
For a full upload from the original workspace:

```bash
+ cd /path/to/decodeshare
hf upload Zishan-Shao/decodeshare Hype1/results/acts artifacts/Hype1/results/acts
hf upload Zishan-Shao/decodeshare patch_back/results artifacts/patch_back/results
```

+ ## Downstream outputs
+
+ `downstream/outputs` contains several `.pt` files. The profiling caches can be
+ larger than Hugging Face Hub's 50 GB single-file limit, so upload them as
+ ordered 10 GiB parts rather than as raw files.
+
+ The uploaded layout keeps the original run directories. Files ending in
+ `.pt.part-000`, `.pt.part-001`, and so on are split chunks that should be
+ concatenated in lexical order to recover the original `.pt` file.
+
+ Example staging flow:

```bash
+ SRC_ROOT=/path/to/decodeshare
+ STAGE=/path/to/decodeshare_hf_downstream_split
+ rm -rf "$STAGE"
+ mkdir -p "$STAGE/artifacts/downstream/outputs"
+
+ mkdir -p "$STAGE/artifacts/downstream/outputs/llama2_r0.2_baseline"
+ ln "$SRC_ROOT/downstream/outputs/llama2_r0.2_baseline/meta_llama_Llama_2_7b_chat_hf_whitening_only_keep0p8_baseline.pt" \
+   "$STAGE/artifacts/downstream/outputs/llama2_r0.2_baseline/"
+ split -b 10G -d -a 3 \
+   "$SRC_ROOT/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt" \
+   "$STAGE/artifacts/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt.part-"
+
+ mkdir -p "$STAGE/artifacts/downstream/outputs/llama2_r0.2_decodeshare_a2"
+ ln "$SRC_ROOT/downstream/outputs/llama2_r0.2_decodeshare_a2/meta_llama_Llama_2_7b_chat_hf_whitening_only_keep0p8_decodeshare_a2p0.pt" \
+   "$STAGE/artifacts/downstream/outputs/llama2_r0.2_decodeshare_a2/"
+ split -b 10G -d -a 3 \
+   "$SRC_ROOT/downstream/outputs/llama2_r0.2_decodeshare_a2/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt" \
+   "$STAGE/artifacts/downstream/outputs/llama2_r0.2_decodeshare_a2/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt.part-"
+
+ mkdir -p "$STAGE/artifacts/downstream/outputs/svdllm_whiten_r0.2"
+ ln "$SRC_ROOT/downstream/outputs/svdllm_whiten_r0.2/meta_llama_Llama_2_7b_chat_hf_whitening_only_0.8.pt" \
+   "$STAGE/artifacts/downstream/outputs/svdllm_whiten_r0.2/"
+ split -b 10G -d -a 3 \
+   "$SRC_ROOT/downstream/outputs/svdllm_whiten_r0.2/meta_llama_Llama_2_7b_chat_hf_profiling_wikitext2_128_0.pt" \
+   "$STAGE/artifacts/downstream/outputs/svdllm_whiten_r0.2/meta_llama_Llama_2_7b_chat_hf_profiling_wikitext2_128_0.pt.part-"
+
+ find "$STAGE" -type f -size +50G -print
+ hf upload-large-folder Zishan-Shao/decodeshare "$STAGE" \
+   --repo-type model \
+   --num-workers 4 \
+   --no-bars
+ ```
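The staging flow above hard-links files with `ln` instead of copying them, so the staging tree costs essentially no extra disk space. A minimal sanity check of that property, on throwaway placeholder paths (none of these names are the repository's real artifacts; GNU `stat` assumed):

```shell
# Sketch: confirm that `ln` staging shares data blocks rather than copying.
# All paths below are throwaway placeholders created in a temp directory.
set -e
workdir=$(mktemp -d)
mkdir -p "$workdir/stage"
echo "dummy checkpoint bytes" > "$workdir/model.pt"

# Hard-link into the staging tree, as the staging flow does.
ln "$workdir/model.pt" "$workdir/stage/model.pt"

# Both names resolve to the same inode, so no bytes were duplicated.
src_inode=$(stat -c %i "$workdir/model.pt")
stage_inode=$(stat -c %i "$workdir/stage/model.pt")
echo "source inode: $src_inode, staged inode: $stage_inode"
```

Because the staged names are hard links, `rm -rf "$STAGE"` afterwards removes only the extra directory entries, not the original files.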
75
+
76
+ Reassemble a split file after download:
77
+
78
+ ```bash
79
+ cat artifacts/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt.part-* \
80
+ > artifacts/downstream/outputs/llama2_r0.2_baseline/meta-llama_Llama-2-7b-chat-hf_profiling___calib_mix_jsonl_128_0.pt
81
  ```
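The split/reassemble scheme can be verified end to end on a throwaway file before trusting it with multi-gigabyte checkpoints. In this sketch the filenames are placeholders and a 256 KiB part size stands in for the 10 GiB used above:

```shell
# Sketch: round-trip a dummy file through split + cat and compare byte-for-byte.
# Placeholder names; small parts stand in for the real 10 GiB parts.
set -e
workdir=$(mktemp -d)
cd "$workdir"

head -c 1048576 /dev/urandom > original.pt   # 1 MiB dummy "checkpoint"

# Same suffix flags as the staging flow: numeric (-d), three digits wide (-a 3),
# so lexical order of the part names equals numeric order.
split -b 256K -d -a 3 original.pt original.pt.part-

# Concatenate the parts in lexical glob order to reassemble the original.
cat original.pt.part-* > reassembled.pt

cmp original.pt reassembled.pt && echo "round trip OK"
```

The zero-padded suffixes are what make the `part-*` glob safe: `part-000` through `part-009` sort correctly before `part-010`, which unpadded suffixes would not.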

## Notes

- The current GitHub branch excludes `.npy`, `.npz`, `.pt`, `.bin`, and related
  large binary formats.
- The largest local files are downstream `.pt` profiling and whitening outputs;
+ the profiling outputs need the split workflow above.
- The manifest is intentionally broad and includes local experimental archives.
  Inspect it before uploading everything.
- If you want this to be a dataset repository rather than a model repository,