# ImmunoOrg 2.0: Supercomputer Run Handoff (4-stage pipeline)
Hi! Thanks for running this. The whole thing is **four short copy-paste
commands** and the cluster does the rest unattended. Total wall-clock: **~3-4 hours**
for the full 4-stage pipeline on a single A100/H100, or **~1-1.5 hours** if you skip SFT.
What the pipeline produces:
- A trained LoRA defender (Qwen2.5-7B by default, configurable up to 14B/32B)
- 6+ evidence PNG charts (loss curves, baseline-vs-trained comparisons)
- A reusable training dataset on the HF Hub
- All artifacts auto-pushed to my HF account
---
## What you'll need
- **HF write token** (the sender will give you one; it looks like `hf_xxx...`).
- **GPU**: A100 / H100 / V100 (32GB+). If you have multiple, even better.
- **SLURM** (most US clusters). PBS/Torque also works; see "Non-SLURM" below.
- **Internet on GPU node** for model download. Most clusters allow this. If not,
see "Air-gapped" below.
---
## Steps (literal copy-paste)
```bash
# 1. Clone the repo (~3 sec)
git clone https://github.com/Charannoo/immunoorg.git
cd immunoorg
# 2. One-time env setup (~5-8 min, downloads PyTorch + Unsloth + TRL + flash-attn)
bash scripts/hpc/setup_env.sh
# 3. Export the HF token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# 4. Submit the entire 4-stage pipeline (returns immediately with 5 job IDs)
bash scripts/hpc/run_all.sh
```
That's it. SLURM will run all 4 stages in dependency order (each stage waits
for the previous via `--dependency=afterok:`). When stage 4 finishes, every
artifact is on the HF Hub and the sender can pull it from there.
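For reference, the dependency chaining that `run_all.sh` performs can be sketched as a dry run like the one below. This is illustrative only: the stage script names and exact `sbatch` flags are assumptions, not the repo's actual internals.

```shell
# Build (but don't execute) the sbatch command for a stage, chaining it
# to the previous stage's job ID. --parsable makes sbatch print only the
# job ID, which is what gets threaded into the next stage's dependency.
submit_cmd() {
  local script="$1" prev_jid="$2"
  if [ -n "$prev_jid" ]; then
    echo "sbatch --parsable --dependency=afterok:$prev_jid $script"
  else
    echo "sbatch --parsable $script"
  fi
}

# Dry-run demo with a made-up job ID:
submit_cmd stage1_sft.sbatch 12345
# -> sbatch --parsable --dependency=afterok:12345 stage1_sft.sbatch
```

With `afterok`, a stage starts only if the previous job exits successfully, so a failed stage halts the rest of the chain instead of wasting GPU hours.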
---
## What the pipeline actually does
| Stage | Job | Resources | Time | What it produces |
| ---: | --- | --- | ---: | --- |
| 0 | datasets | CPU only, 32G RAM | ~25 min | 1700+ scenarios + 200 heuristic trajectories + SFT data + GRPO prompt set, pushed to `<user>/immunoorg-grpo-dataset` |
| 1 | SFT warm-start | 1 GPU, 64G RAM | ~25 min | LoRA adapter trained on heuristic trajectories so the model already speaks the env's JSON format before GRPO starts |
| 2 | GRPO training | 1+ GPU, 96G RAM | ~90-120 min | Final LoRA adapter, `evidence_grpo_training.png` (loss + per-reward curves) |
| 3 | evaluation | 1 GPU, 64G RAM | ~30 min | 100 episodes per family × 3 policies (random/heuristic/trained), produces `evidence_eval_per_family.png` and `evidence_eval_summary.png` |
| 4 | push artifacts | CPU only | ~10 min | Pushes adapter + 6+ PNGs + raw logs to `<user>/immunoorg-grpo-defender` model repo |
You can watch live with:
```bash
squeue -u $USER # job states
tail -f logs/stage*-*.out # live training log
```
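If you want a one-line summary instead of the full `squeue` table, a tiny helper like this works (`-h` drops the header and `%T` is squeue's job-state format field):

```shell
# Count jobs per state. Feed it one state name per line, e.g. the output
# of: squeue -u "$USER" -h -o '%T'
count_states() {
  sort | uniq -c | sort -rn
}

# Demo on canned input (on a cluster, pipe squeue output in instead):
printf 'RUNNING\nPENDING\nRUNNING\n' | count_states
```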
---
## Customising
### Want to use multiple GPUs (recommended if you have them)?
```bash
bash scripts/hpc/run_all.sh --multigpu 4
```
Stage 2 (GRPO) will be data-parallel across 4 GPUs via `accelerate launch`.
Roughly cuts stage 2 time from 90 min to 25 min.
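Concretely, the multi-GPU path presumably reduces to an invocation along these lines (a sketch; the real launcher flags used inside the job script may differ):

```shell
# Illustrative only: data-parallel GRPO across 4 GPUs on one node.
accelerate launch --num_processes 4 scripts/hpc/pipeline/02_grpo_train.py
```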
### Want a bigger model (14B / 32B)?
Override before submitting:
```bash
export IMMUNOORG_MODEL="Qwen/Qwen2.5-14B-Instruct" # needs A100 80GB or 2x A100 40GB
# or
export IMMUNOORG_MODEL="Qwen/Qwen2.5-32B-Instruct" # needs 2x A100 80GB or 4x A100 40GB
bash scripts/hpc/run_all.sh --multigpu 2
```
### Skip SFT (saves ~30 min, slightly weaker results)
```bash
bash scripts/hpc/run_all.sh --skip-sft
```
### Custom partition / queue names
If your partition isn't called `gpu` and `cpu`:
```bash
bash scripts/hpc/run_all.sh --partition gpu-a100 --partition-cpu compute
```
Or set env vars: `IMMUNOORG_PARTITION=gpu-a100 IMMUNOORG_PARTITION_CPU=compute`.
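The flag/env-var precedence can be pictured with a hypothetical resolver like this (the actual logic in `run_all.sh` may differ; `gpu` as the final default is an assumption based on the text above):

```shell
# Resolve the GPU partition: CLI flag beats env var beats default "gpu".
resolve_partition() {
  local flag_value="$1"
  echo "${flag_value:-${IMMUNOORG_PARTITION:-gpu}}"
}

resolve_partition gpu-a100   # flag wins: prints gpu-a100
IMMUNOORG_PARTITION=compute
resolve_partition ""         # env var wins: prints compute
```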
### Push to a different HF account
```bash
export HF_PUSH_REPO="your-username/immunoorg-defender"
export HF_DATASET_REPO="your-username/immunoorg-dataset"
bash scripts/hpc/run_all.sh
```
### Common partition names by cluster
| Cluster | GPU partition | CPU partition |
| --- | --- | --- |
| TACC (Frontera/Lonestar) | `rtx`, `v100` | `normal`, `development` |
| NCSA Delta | `gpuA100x4`, `gpuA40x4` | `cpu` |
| NERSC Perlmutter | `gpu`, `gpu_a100` | `regular_milan_ss11` |
| Most universities | `gpu`, `a100`, `h100` | `cpu`, `compute`, `general` |
Run `sinfo -o '%P %G %D'` if you're not sure.
---
## Troubleshooting
### `sbatch: error: Invalid partition specified`
→ `sinfo -o '%P %G'` shows the real partition names. Pass `--partition <name>`.
### Out of memory on GPU
→ Smaller model: `export IMMUNOORG_MODEL="Qwen/Qwen2.5-3B-Instruct"`.
→ Smaller batch: `export IMMUNOORG_GRPO_BATCH_SIZE=2 IMMUNOORG_SFT_BATCH_SIZE=2`.
### "RuntimeError: bf16 requires Ampere or newer"
→ V100 (Volta) detected. The pipeline auto-falls back to fp16, so it should just work.
If it doesn't, edit `scripts/hpc/pipeline/02_grpo_train.py` and force `bf16=False, fp16=True`.
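The fallback decision itself is simple. Here is a hypothetical standalone version of the logic (the real `02_grpo_train.py` may implement it differently):

```python
def pick_precision(capability_major: int) -> dict:
    """bf16 needs Ampere or newer (compute capability major >= 8);
    older GPUs such as V100 (capability 7.0) fall back to fp16."""
    if capability_major >= 8:
        return {"bf16": True, "fp16": False}
    return {"bf16": False, "fp16": True}

# On a real node you would pass torch.cuda.get_device_capability()[0]:
print(pick_precision(7))   # V100 -> {'bf16': False, 'fp16': True}
```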
### Stage 0 / 4 want a CPU partition but cluster only has GPU
→ Submit them to the GPU partition too: `bash scripts/hpc/run_all.sh --partition-cpu gpu`.
### Air-gapped GPU node (no internet)
1. On the login node:
```bash
bash scripts/hpc/setup_env.sh
source .venv-hpc/bin/activate
python -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-7B-Instruct'); \
AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')"
```
2. The model is now in `~/.cache/huggingface/`. SLURM jobs reuse the cache.
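To be extra safe, you can also force offline mode in the job environment so nothing tries to reach the Hub at runtime; these environment variables are honored by `huggingface_hub` and `transformers`:

```shell
# Make transformers/huggingface_hub read only from the local cache.
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```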
### Non-SLURM cluster (PBS/Torque, LSF, single-node interactive)
You can still run each stage manually inside an interactive GPU shell:
```bash
# Get an interactive GPU shell first (cluster-specific, e.g. PBS):
qsub -I -l select=1:ngpus=1:walltime=04:00:00
# Then:
source .venv-hpc/bin/activate
export HF_TOKEN="..."
python scripts/hpc/pipeline/00_generate_datasets.py
python scripts/hpc/pipeline/01_sft_warmstart.py
python scripts/hpc/pipeline/02_grpo_train.py
python scripts/hpc/pipeline/03_evaluate.py
python scripts/hpc/pipeline/04_push_artifacts.py
```
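If your PBS site supports batch submission, the interactive steps above can also be wrapped in a job script along these lines. This is a sketch only: resource-directive syntax varies by site, so check your cluster's docs before submitting.

```shell
#!/bin/bash
#PBS -N immunoorg-pipeline
#PBS -l select=1:ngpus=1:mem=96gb
#PBS -l walltime=04:00:00

cd "$PBS_O_WORKDIR"              # PBS starts jobs in $HOME by default
source .venv-hpc/bin/activate
export HF_TOKEN="hf_xxx..."      # placeholder - substitute the real token

# Run the stages in order; stop at the first failure.
for stage in 00_generate_datasets 01_sft_warmstart 02_grpo_train \
             03_evaluate 04_push_artifacts; do
  python "scripts/hpc/pipeline/${stage}.py" || exit 1
done
```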
---
## When it's done
Just message back:
- The five SLURM job IDs (`run_all.sh` prints them)
- Confirmation that the model + dataset URLs above contain the artifacts
Sender pulls everything from those URLs and re-deploys the HF Space.
Thanks again; this run is the missing piece for the hackathon submission.