# ImmunoOrg 2.0 – Supercomputer Run Handoff (4-stage pipeline)

Hi! Thanks for running this. The whole thing is **two commands** and the cluster
does the rest unattended. Total wall-clock: **~3-4 hours** for the full 4-stage
pipeline on a single A100/H100 (skipping SFT shaves off roughly 25-30 minutes).

What the pipeline produces:
- A trained LoRA defender (Qwen2.5-7B by default, configurable up to 14B/32B)
- 6+ evidence PNG charts (loss curves, baseline-vs-trained comparisons)
- A reusable training dataset on the HF Hub
- All artifacts auto-pushed to my HF account

---

## What you'll need

- **HF write token** (the sender will give you one; it looks like `hf_xxx...`).
- **GPU**: A100 / H100 / V100 (32GB+). If you have multiple, even better.
- **SLURM** (most US clusters). PBS/Torque also works; see "Non-SLURM" below.
- **Internet on GPU node** for model download. Most clusters allow this. If not,
  see "Air-gapped" below.

---

## Steps (literal copy-paste)

```bash
# 1. Clone the repo (~3 sec)
git clone https://github.com/Charannoo/immunoorg.git
cd immunoorg

# 2. One-time env setup (~5-8 min, downloads PyTorch + Unsloth + TRL + flash-attn)
bash scripts/hpc/setup_env.sh

# 3. Export the HF token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# 4. Submit the entire 4-stage pipeline (returns immediately with 5 job IDs)
bash scripts/hpc/run_all.sh
```

That's it. SLURM will run the five jobs (stages 0-4) in dependency order (each
stage waits for the previous via `--dependency=afterok:`). When stage 4 finishes,
every artifact is on the HF Hub and the sender can pull it from there.
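For the curious, the chaining pattern looks roughly like this dry-run sketch (the `.sbatch` filenames and the placeholder job ID below are illustrative, not the repo's actual names):

```shell
prev=""
count=0
for stage in 00_generate_datasets 01_sft_warmstart 02_grpo_train 03_evaluate 04_push_artifacts; do
  dep=""
  if [ -n "$prev" ]; then
    dep="--dependency=afterok:$prev"   # only start if the previous job exited 0
  fi
  # Dry run: echo the command instead of submitting it.
  echo sbatch --parsable $dep "scripts/hpc/pipeline/$stage.sbatch"
  prev="12345"   # placeholder; the real script captures the ID that sbatch --parsable prints
  count=$((count + 1))
done
```

With `--parsable`, `sbatch` prints just the numeric job ID, which gets fed into the next submission's `afterok` dependency.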

---

## What the pipeline actually does

| Stage | Job | Resources | Time | What it produces |
| ---: | --- | --- | ---: | --- |
| 0 | datasets | CPU only, 32G RAM | ~25 min | 1700+ scenarios + 200 heuristic trajectories + SFT data + GRPO prompt set, pushed to `<user>/immunoorg-grpo-dataset` |
| 1 | SFT warm-start | 1 GPU, 64G RAM | ~25 min | LoRA adapter trained on heuristic trajectories so the model already speaks the env's JSON format before GRPO starts |
| 2 | GRPO training | 1+ GPU, 96G RAM | ~90-120 min | Final LoRA adapter, `evidence_grpo_training.png` (loss + per-reward curves) |
| 3 | evaluation | 1 GPU, 64G RAM | ~30 min | 100 episodes per family × 3 policies (random/heuristic/trained), produces `evidence_eval_per_family.png` and `evidence_eval_summary.png` |
| 4 | push artifacts | CPU only | ~10 min | Pushes adapter + 6+ PNGs + raw logs to `<user>/immunoorg-grpo-defender` model repo |

You can watch live with:

```bash
squeue -u $USER                       # job states
tail -f logs/stage*-*.out             # live training log
```

---

## Customising

### Want to use multiple GPUs (recommended if you have them)?

```bash
bash scripts/hpc/run_all.sh --multigpu 4
```

Stage 2 (GRPO) will be data-parallel across 4 GPUs via `accelerate launch`.
Roughly cuts stage 2 time from 90 min to 25 min.
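The ~25 min figure is consistent with near-linear data-parallel scaling plus a few minutes of fixed overhead (the overhead number below is an assumption, just to show the arithmetic):

```python
# Back-of-envelope check of the quoted stage-2 speedup.
single_gpu_min = 90   # stage 2 on one GPU (from the table above)
gpus = 4
overhead_min = 3      # assumed startup + gradient-sync cost
estimate = single_gpu_min / gpus + overhead_min
print(f"~{estimate:.0f} min on {gpus} GPUs")  # close to the ~25 min quoted above
```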

### Want a bigger model (14B / 32B)?

Override before submitting:

```bash
export IMMUNOORG_MODEL="Qwen/Qwen2.5-14B-Instruct"      # needs A100 80GB or 2x A100 40GB
# or
export IMMUNOORG_MODEL="Qwen/Qwen2.5-32B-Instruct"      # needs 2x A100 80GB or 4x A100 40GB

bash scripts/hpc/run_all.sh --multigpu 2
```

### Skip SFT (saves ~30 min, slightly weaker results)

```bash
bash scripts/hpc/run_all.sh --skip-sft
```

### Custom partition / queue names

If your partitions aren't named `gpu` and `cpu`:

```bash
bash scripts/hpc/run_all.sh --partition gpu-a100 --partition-cpu compute
```

Or set env vars: `IMMUNOORG_PARTITION=gpu-a100 IMMUNOORG_PARTITION_CPU=compute`.

### Push to a different HF account

```bash
export HF_PUSH_REPO="your-username/immunoorg-defender"
export HF_DATASET_REPO="your-username/immunoorg-dataset"
bash scripts/hpc/run_all.sh
```

### Common partition names by cluster

| Cluster | GPU partition | CPU partition |
| --- | --- | --- |
| TACC (Frontera/Lonestar) | `rtx`, `v100` | `normal`, `development` |
| NCSA Delta | `gpuA100x4`, `gpuA40x4` | `cpu` |
| NERSC Perlmutter | `gpu`, `gpu_a100` | `regular_milan_ss11` |
| Most universities | `gpu`, `a100`, `h100` | `cpu`, `compute`, `general` |

Run `sinfo -o '%P %G %D'` if you're not sure.

---

## Troubleshooting

### `sbatch: error: Invalid partition specified`
→ `sinfo -o '%P %G'` shows real partition names. Pass `--partition <name>`.

### Out of memory on GPU
→ Smaller model: `export IMMUNOORG_MODEL="Qwen/Qwen2.5-3B-Instruct"`.
→ Smaller batch: `export IMMUNOORG_GRPO_BATCH_SIZE=2 IMMUNOORG_SFT_BATCH_SIZE=2`.

### "RuntimeError: bf16 requires Ampere or newer"
→ V100 (Volta) detected. The pipeline auto-falls back to fp16 and should just work.
   If it doesn't, edit `scripts/hpc/pipeline/02_grpo_train.py` and force `bf16=False, fp16=True`.
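The fallback amounts to a compute-capability check; a minimal sketch (a hypothetical helper, not the actual code in `02_grpo_train.py`):

```python
def pick_precision(compute_capability):
    """bf16 needs Ampere (SM 8.x) or newer; older GPUs fall back to fp16."""
    major, _minor = compute_capability
    if major >= 8:
        return {"bf16": True, "fp16": False}
    return {"bf16": False, "fp16": True}

# On a real node you'd pass in torch.cuda.get_device_capability(0).
print(pick_precision((8, 0)))  # A100 (Ampere) -> bf16
print(pick_precision((7, 0)))  # V100 (Volta)  -> fp16
```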

### Stages 0 / 4 want a CPU partition but the cluster only has GPU
→ Submit them to the GPU partition too: `bash scripts/hpc/run_all.sh --partition-cpu gpu`.

### Air-gapped GPU node (no internet)
1. On the login node:
   ```bash
   bash scripts/hpc/setup_env.sh
   source .venv-hpc/bin/activate
   python -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
              AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-7B-Instruct'); \
              AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')"
   ```
2. The model is now in `~/.cache/huggingface/`. SLURM jobs reuse the cache.
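To be extra sure the compute jobs never try the network, you can also export Hugging Face's standard offline switches before submitting (note stage 4's Hub push still needs internet, e.g. run it from the login node):

```shell
# Force transformers / huggingface_hub to read only from the local cache.
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
echo "offline mode: HF_HUB_OFFLINE=$HF_HUB_OFFLINE"
```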

### Non-SLURM cluster (PBS/Torque, LSF, single-node interactive)
You can still run each stage manually inside an interactive GPU shell:

```bash
# Get an interactive GPU shell first (cluster-specific, e.g. PBS):
qsub -I -l select=1:ngpus=1:walltime=04:00:00

# Then:
source .venv-hpc/bin/activate
export HF_TOKEN="..."
python scripts/hpc/pipeline/00_generate_datasets.py
python scripts/hpc/pipeline/01_sft_warmstart.py
python scripts/hpc/pipeline/02_grpo_train.py
python scripts/hpc/pipeline/03_evaluate.py
python scripts/hpc/pipeline/04_push_artifacts.py
```

---

## When it's done

Just message back:
- The five SLURM job IDs (`run_all.sh` prints them)
- Confirmation that the model + dataset URLs above contain the artifacts

Sender pulls everything from those URLs and re-deploys the HF Space.

Thanks again – this run is the missing piece for the hackathon submission. 🙏