hirann committed · Commit bbc88e8 · verified · 1 Parent(s): 119ef98

Upload scripts/hpc/HANDOFF.md with huggingface_hub

  1. scripts/hpc/HANDOFF.md +181 -0
scripts/hpc/HANDOFF.md ADDED
@@ -0,0 +1,181 @@
# ImmunoOrg 2.0: Supercomputer Run Handoff (4-stage pipeline)

Hi! Thanks for running this. The whole thing is **two commands**, and the cluster
does the rest unattended. Total wall-clock: **~3-4 hours** for the full 4-stage
pipeline on a single A100/H100, or **~2.5-3 hours** if you skip SFT (per the
stage timings below).

What the pipeline produces:
- A trained LoRA defender (Qwen2.5-7B by default, configurable up to 14B/32B)
- 6+ evidence PNG charts (loss curves, baseline-vs-trained comparisons)
- A reusable training dataset on the HF Hub
- All artifacts auto-pushed to my HF account

---

## What you'll need

- **HF write token** (the sender will give you one; it looks like `hf_xxx...`).
- **GPU**: A100 / H100 / V100 (32GB+). If you have multiple, even better.
- **SLURM** (most US clusters). PBS/Torque also works; see "Non-SLURM" below.
- **Internet on the GPU node** for the model download. Most clusters allow this. If not,
  see "Air-gapped" below.

---

## Steps (literal copy-paste)

```bash
# 1. Clone the repo (~3 sec)
git clone https://github.com/Charannoo/immunoorg.git
cd immunoorg

# 2. One-time env setup (~5-8 min, downloads PyTorch + Unsloth + TRL + flash-attn)
bash scripts/hpc/setup_env.sh

# 3. Export the HF token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# 4. Submit the entire 4-stage pipeline (returns immediately with 5 job IDs, one per stage 0-4)
bash scripts/hpc/run_all.sh
```

That's it. SLURM runs the five jobs (stages 0 through 4) in dependency order: each
stage waits for the previous one via `--dependency=afterok:`, so a stage starts
only if the one before it exited successfully. When stage 4 finishes, every
artifact is on the HF Hub and the sender can pull it from there.
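
For readers unfamiliar with SLURM dependencies, the chaining described above can be sketched as follows. This is a minimal illustration, not the contents of `run_all.sh`: the function names and the injectable `submit` callable are hypothetical. The flags themselves are standard SLURM: `sbatch --parsable` prints only the job ID, and `--dependency=afterok:<id>` holds a job until job `<id>` finishes with exit code 0.

```python
import subprocess
from typing import Callable, List, Sequence


def sbatch_submit(cmd: Sequence[str]) -> str:
    """Run sbatch and return the job ID.

    With --parsable, sbatch prints 'jobid' or 'jobid;cluster'.
    """
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return out.strip().split(";")[0]


def submit_chain(
    scripts: Sequence[str],
    submit: Callable[[Sequence[str]], str] = sbatch_submit,
) -> List[str]:
    """Submit job scripts in order, each one depending on the previous job."""
    job_ids: List[str] = []
    for script in scripts:
        cmd = ["sbatch", "--parsable"]
        if job_ids:
            # Start only if the previous job exited successfully.
            cmd.append(f"--dependency=afterok:{job_ids[-1]}")
        cmd.append(script)
        job_ids.append(submit(cmd))
    return job_ids
```

Passing a fake `submit` function makes it possible to inspect the generated commands on a machine without SLURM.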

---

## What the pipeline actually does

| Stage | Job | Resources | Time | What it produces |
| ---: | --- | --- | ---: | --- |
| 0 | datasets | CPU only, 32G RAM | ~25 min | 1700+ scenarios + 200 heuristic trajectories + SFT data + GRPO prompt set, pushed to `<user>/immunoorg-grpo-dataset` |
| 1 | SFT warm-start | 1 GPU, 64G RAM | ~25 min | LoRA adapter trained on heuristic trajectories so the model already speaks the env's JSON format before GRPO starts |
| 2 | GRPO training | 1+ GPU, 96G RAM | ~90-120 min | Final LoRA adapter, `evidence_grpo_training.png` (loss + per-reward curves) |
| 3 | evaluation | 1 GPU, 64G RAM | ~30 min | 100 episodes per family × 3 policies (random/heuristic/trained), produces `evidence_eval_per_family.png` and `evidence_eval_summary.png` |
| 4 | push artifacts | CPU only | ~10 min | Pushes adapter + 6+ PNGs + raw logs to `<user>/immunoorg-grpo-defender` model repo |
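
As a sanity check on the wall-clock estimate, the per-stage times in the table can be summed. This is plain arithmetic on the table's approximate figures. Queue wait time is not included, and the helper name is only illustrative.

```python
# Approximate per-stage run times in minutes (low, high), copied from the table above.
STAGE_MINUTES = {
    0: (25, 25),    # datasets
    1: (25, 25),    # SFT warm-start
    2: (90, 120),   # GRPO training
    3: (30, 30),    # evaluation
    4: (10, 10),    # push artifacts
}


def total_minutes(skip=()):
    """Return (low, high) total minutes, leaving out any stage numbers in `skip`."""
    kept = [t for stage, t in STAGE_MINUTES.items() if stage not in skip]
    return sum(lo for lo, _ in kept), sum(hi for _, hi in kept)


full = total_minutes()
no_sft = total_minutes(skip={1})
print(f"full pipeline: {full[0]}-{full[1]} min")    # full pipeline: 180-210 min
print(f"without SFT:   {no_sft[0]}-{no_sft[1]} min")  # without SFT:   155-185 min
```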

You can watch live with:

```bash
squeue -u "$USER"            # job states
tail -f logs/stage*-*.out    # live training log
```

---

## Customising

### Want to use multiple GPUs (recommended if you have them)?

```bash
bash scripts/hpc/run_all.sh --multigpu 4
```

Stage 2 (GRPO) then runs data-parallel across 4 GPUs via `accelerate launch`,
which cuts its run time from roughly 90 min to roughly 25 min.
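
The two timings quoted above imply a speedup of about 3.6x on 4 GPUs, or roughly 90% parallel efficiency. The arithmetic uses only those two figures, and the function name is illustrative:

```python
def speedup_and_efficiency(t_single_min, t_multi_min, n_gpus):
    """Speedup = T_1 / T_n; parallel efficiency = speedup / n."""
    speedup = t_single_min / t_multi_min
    return speedup, speedup / n_gpus


speedup, efficiency = speedup_and_efficiency(90, 25, 4)
print(f"{speedup:.1f}x speedup, {efficiency:.0%} efficiency")  # 3.6x speedup, 90% efficiency
```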

### Want a bigger model (14B / 32B)?

Override the model before submitting:

```bash
export IMMUNOORG_MODEL="Qwen/Qwen2.5-14B-Instruct"   # needs A100 80GB or 2x A100 40GB
# or
export IMMUNOORG_MODEL="Qwen/Qwen2.5-32B-Instruct"   # needs 2x A100 80GB or 4x A100 40GB

bash scripts/hpc/run_all.sh --multigpu 2
```

### Skip SFT (saves ~30 min, slightly weaker results)

```bash
bash scripts/hpc/run_all.sh --skip-sft
```

### Custom partition / queue names

If your partitions aren't called `gpu` and `cpu`:

```bash
bash scripts/hpc/run_all.sh --partition gpu-a100 --partition-cpu compute
```

Or set env vars: `IMMUNOORG_PARTITION=gpu-a100 IMMUNOORG_PARTITION_CPU=compute`.

### Push to a different HF account

```bash
export HF_PUSH_REPO="your-username/immunoorg-defender"
export HF_DATASET_REPO="your-username/immunoorg-dataset"
bash scripts/hpc/run_all.sh
```
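
To make the override mechanism concrete, here is a minimal sketch of how a stage script could resolve these environment variables. The variable names come from this document. Everything else is an assumption: the real scripts may read them differently, the exact default model ID (with the `-Instruct` suffix) is a guess based on the 7B default mentioned earlier, and `resolve_settings` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Partition defaults ("gpu"/"cpu") and the 7B model family are stated in this document.
# The exact default model ID below is an assumption.
DEFAULTS = {
    "IMMUNOORG_MODEL": "Qwen/Qwen2.5-7B-Instruct",
    "IMMUNOORG_PARTITION": "gpu",
    "IMMUNOORG_PARTITION_CPU": "cpu",
}
# No default is assumed for these; they stay None unless exported.
OPTIONAL = ("HF_PUSH_REPO", "HF_DATASET_REPO")


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Merge exported variables over the defaults; empty strings count as unset."""
    env = os.environ if env is None else env
    settings = {name: env.get(name) or default for name, default in DEFAULTS.items()}
    settings.update({name: env.get(name) or None for name in OPTIONAL})
    return settings
```

For example, `resolve_settings({"IMMUNOORG_MODEL": "Qwen/Qwen2.5-14B-Instruct"})` keeps the default partitions and changes only the model.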

### Common partition names by cluster

| Cluster | GPU partition | CPU partition |
| --- | --- | --- |
| TACC (Frontera/Lonestar) | `rtx`, `v100` | `normal`, `development` |
| NCSA Delta | `gpuA100x4`, `gpuA40x4` | `cpu` |
| NERSC Perlmutter | `gpu`, `gpu_a100` | `regular_milan_ss11` |
| Most universities | `gpu`, `a100`, `h100` | `cpu`, `compute`, `general` |

Run `sinfo -o '%P %G %D'` if you're not sure; it lists each partition with its
generic resources (GPUs) and node count.
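
If the `sinfo` listing is long, a few lines of Python can pick out the GPU partitions. The sketch below parses the three-column output of `sinfo -o '%P %G %D'`. The sample text is made up for illustration (partition names and counts are not from any real cluster), and `gpu_partitions` is a hypothetical helper.

```python
from typing import Dict, List

# Illustrative output only; the default partition is marked with '*'.
SAMPLE = """\
PARTITION GRES NODES
cpu* (null) 120
gpu-a100 gpu:a100:4 16
gpu-v100 gpu:v100:2 8
"""


def gpu_partitions(sinfo_output: str) -> List[Dict[str, object]]:
    """Return the rows whose GRES column mentions a GPU."""
    rows = []
    for line in sinfo_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        partition, gres, nodes = parts
        if "gpu" in gres.lower():
            rows.append({
                "partition": partition.rstrip("*"),  # drop the default-partition marker
                "gres": gres,
                "nodes": int(nodes),
            })
    return rows


for row in gpu_partitions(SAMPLE):
    print(row["partition"], row["gres"], row["nodes"])
```

On a real cluster, the input would come from `subprocess.run(["sinfo", "-o", "%P %G %D"], capture_output=True, text=True).stdout`.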

---

## Troubleshooting

### `sbatch: error: Invalid partition specified`
→ `sinfo -o '%P %G'` shows the real partition names. Pass `--partition <name>`.

### Out of memory on GPU
→ Smaller model: `export IMMUNOORG_MODEL="Qwen/Qwen2.5-3B-Instruct"`.
→ Smaller batch: `export IMMUNOORG_GRPO_BATCH_SIZE=2 IMMUNOORG_SFT_BATCH_SIZE=2`.

### "RuntimeError: bf16 requires Ampere or newer"
→ V100 (Volta) detected. The pipeline falls back to fp16 automatically, so this should just work.
If it doesn't, edit `scripts/hpc/pipeline/02_grpo_train.py` and force `bf16=False, fp16=True`.
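
For reference, the fallback logic can be sketched in a few lines. This is not the code in `02_grpo_train.py`, whose detection logic is not shown here, and the function names are hypothetical. It relies on two facts: bf16 needs CUDA compute capability 8.0 or higher (Ampere, e.g. A100), and the V100 is 7.0.

```python
def pick_precision(compute_capability_major: int) -> dict:
    """Choose bf16 on Ampere or newer (major >= 8), otherwise fp16."""
    use_bf16 = compute_capability_major >= 8
    return {"bf16": use_bf16, "fp16": not use_bf16}


def detect_precision() -> dict:
    """Query the first visible GPU via PyTorch.

    torch is imported lazily so pick_precision() stays usable without it.
    """
    import torch

    major, _minor = torch.cuda.get_device_capability(0)
    return pick_precision(major)


print(pick_precision(7))  # V100: {'bf16': False, 'fp16': True}
print(pick_precision(8))  # A100: {'bf16': True, 'fp16': False}
```

The resulting dictionary has the same two keys as the manual override mentioned above.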

### Stage 0 / 4 want a CPU partition but the cluster only has GPU
→ Submit them to GPU too: `bash scripts/hpc/run_all.sh --partition-cpu gpu`.

### Air-gapped GPU node (no internet)
1. On the login node:
   ```bash
   bash scripts/hpc/setup_env.sh
   source .venv-hpc/bin/activate
   # Loading the model and tokenizer once downloads them into the HF cache
   python -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
   AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-7B-Instruct'); \
   AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')"
   ```
2. The model is now in `~/.cache/huggingface/`. SLURM jobs reuse the cache, provided the compute nodes mount the same home directory.
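
A lighter-weight variant of step 1, sketched under assumptions: `huggingface_hub.snapshot_download` fetches the repo files into the same cache without loading the weights into memory, which is kinder to a shared login node. Setting `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` tells the libraries not to attempt network calls on the compute node. Whether the pipeline scripts need these switches, or already set them, is not stated in this document. Both helper names are hypothetical.

```python
import os


def prefetch(repo_id: str = "Qwen/Qwen2.5-7B-Instruct") -> str:
    """Download a model repo into the local HF cache (run on the login node).

    Returns the local snapshot path.
    """
    from huggingface_hub import snapshot_download  # lazy import: only needed here

    return snapshot_download(repo_id=repo_id)


def offline_env(base=None) -> dict:
    """Return a copy of the environment with the HF offline switches set."""
    env = dict(os.environ if base is None else base)
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return env
```

`offline_env()` can be passed as the `env` argument of `subprocess.run` when launching a stage by hand. Exporting the same two variables in the shell has the same effect.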

### Non-SLURM cluster (PBS/Torque, LSF, single-node interactive)
You can still run each stage manually inside an interactive GPU shell:

```bash
# Get an interactive GPU shell first (cluster-specific, e.g. PBS):
qsub -I -l select=1:ngpus=1:walltime=04:00:00

# Then:
source .venv-hpc/bin/activate
export HF_TOKEN="..."
python scripts/hpc/pipeline/00_generate_datasets.py
python scripts/hpc/pipeline/01_sft_warmstart.py
python scripts/hpc/pipeline/02_grpo_train.py
python scripts/hpc/pipeline/03_evaluate.py
python scripts/hpc/pipeline/04_push_artifacts.py
```
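
When running the stages by hand, it helps to stop at the first failure, which is what `afterok` does under SLURM. A minimal sketch follows; the script paths are the ones listed above, while `run_stages` and its injectable `run` callable are hypothetical and not part of the repo.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

STAGES = [
    "scripts/hpc/pipeline/00_generate_datasets.py",
    "scripts/hpc/pipeline/01_sft_warmstart.py",
    "scripts/hpc/pipeline/02_grpo_train.py",
    "scripts/hpc/pipeline/03_evaluate.py",
    "scripts/hpc/pipeline/04_push_artifacts.py",
]


def _run(cmd: List[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode


def run_stages(
    stages: Sequence[str] = STAGES,
    run: Callable[[List[str]], int] = _run,
) -> List[str]:
    """Run the stages in order and stop at the first non-zero exit code.

    Returns the list of stages that completed successfully.
    """
    done: List[str] = []
    for script in stages:
        if run([sys.executable, script]) != 0:
            break
        done.append(script)
    return done
```

Calling `run_stages()` from the repo root inside the activated environment runs all five scripts. A shorter list resumes from a later stage, for example `run_stages(STAGES[2:])`.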

---

## When it's done

Just message back:
- The five SLURM job IDs (`run_all.sh` prints them)
- Confirmation that the model and dataset URLs above (`<user>/immunoorg-grpo-defender` and `<user>/immunoorg-grpo-dataset`) contain the artifacts
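
One way to do the confirmation without clicking through the Hub UI is to list the repo files and compare them against the chart names from the stage table. This is a sketch under assumptions: the document does not give a complete artifact list, so only the three named PNGs are checked, and both helper functions are hypothetical. `HfApi.list_repo_files` is the `huggingface_hub` call that returns the file paths in a repo.

```python
from typing import Iterable, List

# PNG names taken from the stage table; the full artifact list is not specified here.
EXPECTED_PNGS = [
    "evidence_grpo_training.png",
    "evidence_eval_per_family.png",
    "evidence_eval_summary.png",
]


def missing_artifacts(
    repo_files: Iterable[str],
    expected: Iterable[str] = EXPECTED_PNGS,
) -> List[str]:
    """Return the expected file names not found in the listing (matched by basename)."""
    present = {path.rsplit("/", 1)[-1] for path in repo_files}
    return [name for name in expected if name not in present]


def list_model_repo(repo_id: str) -> List[str]:
    """List the files in a model repo on the Hub (needs network access)."""
    from huggingface_hub import HfApi  # lazy import: only needed for the live check

    return HfApi().list_repo_files(repo_id)
```

An empty result from `missing_artifacts(list_model_repo("<user>/immunoorg-grpo-defender"))`, with `<user>` replaced by the real account name, means all three charts were pushed.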

The sender then pulls everything from those URLs and re-deploys the HF Space.

Thanks again; this run is the missing piece for the hackathon submission. 🙏