# HPC training pipeline
Turnkey 4-stage pipeline for running real GRPO training (with optional SFT warm-start) on an HPC cluster, then auto-pushing every artifact (model, dataset, evidence PNGs, raw logs) to the Hugging Face Hub.
## Layout

```
scripts/hpc/
├── HANDOFF.md      ← copy-paste-ready instructions for the cluster operator
├── README.md       ← this file (reference for repo viewers)
├── setup_env.sh    ← login-node env setup (uv, no conda, no sudo)
├── run_all.sh      ← submit all 4 stages with SLURM dependencies
├── pipeline/
│   ├── 00_generate_datasets.py   ← stage 0: datasets (CPU)
│   ├── 01_sft_warmstart.py       ← stage 1: SFT warm-start (1 GPU)
│   ├── 02_grpo_train.py          ← stage 2: GRPO training (1+ GPUs)
│   ├── 03_evaluate.py            ← stage 3: eval baseline vs trained (1 GPU)
│   └── 04_push_artifacts.py      ← stage 4: push to HF Hub (CPU)
└── slurm/
    ├── 00_datasets.sbatch
    ├── 01_sft.sbatch
    ├── 02_grpo.sbatch
    ├── 03_eval.sbatch
    └── 04_push.sbatch
```
## Quick reference

```bash
# One-time on login node
bash scripts/hpc/setup_env.sh

# Submit everything (returns immediately with 5 job IDs)
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
bash scripts/hpc/run_all.sh

# Watch
squeue -u $USER
tail -f logs/stage*-*.out

# Common overrides
bash scripts/hpc/run_all.sh --multigpu 4          # data-parallel GRPO
bash scripts/hpc/run_all.sh --skip-sft            # GRPO only
bash scripts/hpc/run_all.sh --partition gpu-a100  # custom partition
export IMMUNOORG_MODEL=Qwen/Qwen2.5-14B-Instruct  # bigger model
export HF_PUSH_REPO=your-org/your-defender-repo   # custom destination
```
## What the pipeline guarantees

- Stages 0 → 1 → 2 → 3 → 4 run in order via `--dependency=afterok:<jobid>`. If any stage fails, downstream stages are cancelled automatically.
- Idempotent: re-running stages 1-4 reuses cached datasets / adapters.
- Falls back gracefully: no Unsloth → plain HF; no flash-attn → SDPA; no GPU → CPU mode (slow but works).
- Auto-detects multi-GPU: if SLURM gives stage 2 more than one GPU, it launches via `accelerate launch` for data-parallel GRPO.
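The dependency chaining works by feeding each submitted job's ID into the next `sbatch` call. A minimal illustrative sketch (not the actual `run_all.sh`; the `submit` helper, the `DRY_RUN` switch, and the fake job-id scheme are assumptions for demonstration):

```bash
#!/usr/bin/env bash
# Illustrative sketch of SLURM dependency chaining (NOT the real run_all.sh).
# DRY_RUN=1 (the default here) prints the sbatch commands instead of
# submitting them, so the chaining logic can be exercised off-cluster.
set -euo pipefail

submit() {  # usage: submit <sbatch-script> [jobid-to-wait-for]
  local script=$1 dep=${2:-} depflag=""
  [ -n "$dep" ] && depflag="--dependency=afterok:$dep"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: sbatch $depflag $script" >&2
    # Fake, deterministic job id derived from the stage-number prefix.
    echo $(( 1000 + 10#$(basename "$script" | cut -c1-2) ))
  else
    sbatch --parsable $depflag "$script"  # --parsable prints only the job id
  fi
}

j0=$(submit slurm/00_datasets.sbatch)
j1=$(submit slurm/01_sft.sbatch  "$j0")
j2=$(submit slurm/02_grpo.sbatch "$j1")
j3=$(submit slurm/03_eval.sbatch "$j2")
j4=$(submit slurm/04_push.sbatch "$j3")
echo "submitted: $j0 $j1 $j2 $j3 $j4"
```

`afterok` is what gives the failure behaviour: a dependent job only becomes eligible once its predecessor exits successfully.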
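The launcher choice for stage 2 can be sketched as below (a hedged sketch: `pick_launcher` is a hypothetical name, and the exact SLURM variable consulted is an assumption; SLURM exports the granted GPU count in variables such as `SLURM_GPUS_ON_NODE`):

```bash
#!/usr/bin/env bash
# Hedged sketch of stage 2's launcher selection (names are illustrative).
set -euo pipefail

pick_launcher() {  # usage: pick_launcher <num-gpus>
  if [ "$1" -gt 1 ]; then
    # Data-parallel GRPO: one process per GPU via accelerate.
    echo "accelerate launch --num_processes $1"
  else
    echo "python"
  fi
}

one=$(pick_launcher 1)   # single-GPU case
four=$(pick_launcher 4)  # multi-GPU case
echo "$one pipeline/02_grpo_train.py"
echo "$four pipeline/02_grpo_train.py"
```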
## Reading the artifacts after a run

When stage 4 finishes:
| Artifact | Where to find it |
|---|---|
| Trained LoRA adapter | https://huggingface.co/$HF_PUSH_REPO |
| `evidence_grpo_training.png` | committed to the model repo, also in repo root on the cluster |
| `evidence_eval_*.png` | same |
| `evaluation_results.json` | model repo, full per-policy / per-family numbers |
| `grpo_log.json` | model repo, raw `trainer.state.log_history` |
| Training datasets | https://huggingface.co/datasets/$HF_DATASET_REPO |
To pull just the PNGs back to your dev machine:

```bash
huggingface-cli download $HF_PUSH_REPO evidence_grpo_training.png \
    evidence_eval_per_family.png evidence_eval_summary.png \
    --local-dir . --local-dir-use-symlinks False

git add evidence_*.png
git commit -m "evidence: add real GRPO + eval curves from HPC run"
git push origin master
```
That commit is what closes the "Improvement in Rewards" judging gap.
## Default training config (overridable per-stage)
| Stage | Model | Dataset | Time on 1× A100 |
|---|---|---|---|
| 0 datasets | (CPU only) | builds 1700+ scenarios + 200 trajectories | ~25 min |
| 1 SFT | Qwen2.5-7B + LoRA r=16 | `sft_warmstart.jsonl.gz` (~2000 samples) | ~25 min |
| 2 GRPO | Qwen2.5-7B + LoRA (warm-started) | 500 elite-mix prompts × 4 generations | ~90-120 min |
| 3 eval | trained adapter | 100 episodes/family × 3 policies | ~30 min |
| 4 push | (CPU only) | adapter + 6+ PNGs + raw logs | ~10 min |
Total: ~3-4 hours on a single A100; ~1-1.5 hours with `--multigpu 4`.