# HPC training pipeline

Turnkey five-stage pipeline (stages 0-4) for running real GRPO training (with
optional SFT warm-start) on an HPC cluster, then auto-pushing every artifact
(model, dataset, evidence PNGs, raw logs) to the Hugging Face Hub.

## Layout

```
scripts/hpc/
├── HANDOFF.md            ← copy-paste-ready instructions for the cluster operator
├── README.md             ← this file (reference for repo viewers)
├── setup_env.sh          ← login-node env setup (uv, no conda, no sudo)
├── run_all.sh            ← submit all five stages with SLURM dependencies
├── pipeline/
│   ├── 00_generate_datasets.py   stage 0 — datasets (CPU)
│   ├── 01_sft_warmstart.py       stage 1 — SFT warm-start (1 GPU)
│   ├── 02_grpo_train.py          stage 2 — GRPO training (1+ GPUs)
│   ├── 03_evaluate.py            stage 3 — eval baseline vs trained (1 GPU)
│   └── 04_push_artifacts.py      stage 4 — push to HF Hub (CPU)
└── slurm/
    ├── 00_datasets.sbatch
    ├── 01_sft.sbatch
    ├── 02_grpo.sbatch
    ├── 03_eval.sbatch
    └── 04_push.sbatch
```

## Quick reference

```bash
# One-time on login node
bash scripts/hpc/setup_env.sh

# Submit everything (returns immediately with 5 job IDs)
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
bash scripts/hpc/run_all.sh

# Watch
squeue -u $USER
tail -f logs/stage*-*.out

# Common overrides
bash scripts/hpc/run_all.sh --multigpu 4                    # data-parallel GRPO
bash scripts/hpc/run_all.sh --skip-sft                      # GRPO only
bash scripts/hpc/run_all.sh --partition gpu-a100            # custom partition
export IMMUNOORG_MODEL=Qwen/Qwen2.5-14B-Instruct            # bigger model
export HF_PUSH_REPO=your-org/your-defender-repo             # custom destination
```

## What the pipeline guarantees

- **Stage 0 → 1 → 2 → 3 → 4** runs in order via `--dependency=afterok:`.
  If any stage fails, downstream stages are cancelled automatically.
- **Idempotent**: re-running stages 1-4 reuses cached datasets / adapters.
- **Falls back gracefully**: no Unsloth → plain HF; no flash-attn → SDPA;
  no GPU → CPU mode (slow but works).
- **Auto-detects multi-GPU**: if SLURM gives stage 2 more than 1 GPU, it
  launches via `accelerate launch` for data-parallel GRPO.

## Reading the artifacts after a run

When stage 4 finishes:

| Artifact | Where to find it |
| --- | --- |
| Trained LoRA adapter | `https://huggingface.co/$HF_PUSH_REPO` |
| `evidence_grpo_training.png` | committed to the model repo; also in the repo root on the cluster |
| `evidence_eval_*.png` | same locations as above |
| `evaluation_results.json` | model repo, full per-policy / per-family numbers |
| `grpo_log.json` | model repo, raw `trainer.state.log_history` |
| Training datasets | `https://huggingface.co/datasets/$HF_DATASET_REPO` |

To pull just the PNGs back to your dev machine:

```bash
huggingface-cli download $HF_PUSH_REPO evidence_grpo_training.png \
    evidence_eval_per_family.png evidence_eval_summary.png \
    --local-dir . --local-dir-use-symlinks False

git add evidence_*.png
git commit -m "evidence: add real GRPO + eval curves from HPC run"
git push origin master
```

That commit is what closes the "Improvement in Rewards" judging gap.

## Default training config (overridable per-stage)

| Stage | Model | Dataset | Time on 1× A100 |
| --- | --- | --- | ---: |
| 0 datasets | (CPU only) | builds 1700+ scenarios + 200 trajectories | ~25 min |
| 1 SFT | Qwen2.5-7B + LoRA r=16 | sft_warmstart.jsonl.gz (~2000 samples) | ~25 min |
| 2 GRPO | Qwen2.5-7B + LoRA (warm-started) | 500 elite-mix prompts × 4 generations | ~90-120 min |
| 3 eval | trained adapter | 100 episodes/family × 3 policies | ~30 min |
| 4 push | (CPU only) | adapter + 6+ PNGs + raw logs | ~10 min |

**Total: ~3-4 hours on a single A100. ~1-1.5 hours with `--multigpu 4`.**
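As one hypothetical example of a per-stage override (the model name and time limit below are illustrative, not defaults), stage 2 can be resubmitted on its own with a smaller model:

```bash
# Illustrative per-stage override: rerun only GRPO with a smaller model.
# IMMUNOORG_MODEL is the override hook listed under "Quick reference";
# the --time value is an example, not a recommendation.
export IMMUNOORG_MODEL=Qwen/Qwen2.5-1.5B-Instruct
sbatch --time=00:45:00 scripts/hpc/slurm/02_grpo.sbatch
```

Because stages 1-4 are idempotent, a resubmitted stage reuses the cached datasets and adapters already on disk.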