| ---
|
| title: CERNenv Trainer
|
| emoji: ⚛️
|
| colorFrom: indigo
|
| colorTo: pink
|
| sdk: docker
|
| suggested_hardware: a100x4
|
| suggested_storage: medium
|
| pinned: false
|
| license: bsd-3-clause
|
| short_description: GRPO trainer for CERNenv (Unsloth + LoRA, A100)
|
| ---
|
|
|
| # CERNenv Trainer (Hugging Face Space, A100)
|
|
|
| Fine-tunes a small instruction-tuned LLM (Large Language Model) to act as
|
| an LHC (Large Hadron Collider) physicist inside the **CERNenv** OpenEnv
|
| environment using **GRPO** (Group-Relative Policy Optimization),
|
| **Unsloth**, and **LoRA** (Low-Rank Adaptation).
|
|
|
| ## Hardware
|
|
|
| - Recommended: **4× A100 (`a100x4`, 320 GB VRAM, ~$10/hr)**
|
| - Single GPU also supported: `a100-large` (slower, fewer episodes recommended)
|
| - Minimum: T4 / L4 (use the Colab notebook fallback)
|
|
|
| ### Budget guidance (~$27 envelope, the default for this hackathon run)
|
|
|
| A 1500-episode GRPO run with `MODEL_NAME=unsloth/Qwen2.5-3B-Instruct`,
|
| `NUM_GENERATIONS=8`, `MAX_STEPS=18` typically lands as follows:
|
|
|
| | Hardware | $/hr | Wall-clock | Cost (1× run) | Complete runs in $27 |
|
| | ------------ | ----- | ---------- | ------------- | -------------------- |
|
| | `a100x4` | ~$10 | ~1.5–2 h | ~$15–20 | 1 |
|
| | `a100-large` | ~$4 | ~2.5–3 h | ~$10–12 | 2 |
|
| | `l40sx4` | ~$8 | ~2 h | ~$16 | 1 |
|
|
|
| `a100x4` gets the trained adapters and evidence into your hands fastest; the
|
| multi-GPU launcher (`accelerate launch --num_processes 4`) is already wired
|
| into `_build_training_cmd`. If you want a safety margin in case anything
|
| needs a re-run, drop to `a100-large`: wall-clock is roughly 1.5× longer but
|
| cost is roughly 40% lower, leaving budget for two complete attempts.
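The cost column is just hourly rate times wall-clock time. A minimal Python sketch of that arithmetic, using the approximate figures quoted in the table; the helper names (`cost_range`, `runs_within_budget`) and the choice to budget against the worst-case wall-clock are illustrative assumptions, not part of the trainer:

```python
import math

BUDGET_USD = 27.0  # hackathon envelope quoted in the heading above

# name -> (approx. $/hr, (min hours, max hours)), copied from the table
HARDWARE = {
    "a100x4": (10.0, (1.5, 2.0)),
    "a100-large": (4.0, (2.5, 3.0)),
    "l40sx4": (8.0, (2.0, 2.0)),
}


def cost_range(rate_per_hr, hours):
    """Return (min, max) cost in USD for a single run."""
    min_h, max_h = hours
    return rate_per_hr * min_h, rate_per_hr * max_h


def runs_within_budget(rate_per_hr, hours, budget=BUDGET_USD):
    """Complete runs (first attempt included) that fit at worst-case cost."""
    _, worst = cost_range(rate_per_hr, hours)
    return math.floor(budget / worst)


for name, (rate, hours) in HARDWARE.items():
    low, high = cost_range(rate, hours)
    runs = runs_within_budget(rate, hours)
    print(f"{name}: ${low:.0f}-{high:.0f} per run, {runs} complete run(s)")
```

Budgeting against the upper end of the wall-clock range is the conservative choice: a run that overshoots its estimate still fits.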
|
|
|
| ## Required Space secrets
|
| | Secret | Purpose |
|
| | --- | --- |
|
| | `HF_TOKEN` | Hugging Face token with `write` access for model push |
|
| | `HF_USERNAME` | Hub username, used as the default model-repo owner |
|
|
|
| ## Optional environment variables
|
| | Variable | Default | Notes |
|
| | --- | --- | --- |
|
| | `MODEL_NAME` | `unsloth/Qwen2.5-3B-Instruct` | Any chat model Unsloth supports |
|
| | `TOTAL_EPISODES` | `1500` | Total rollouts (prompts × generations per prompt) |
|
| | `DIFFICULTY` | `easy` | Starting tier when `CURRICULUM=1`; static tier when `CURRICULUM=0` |
|
| | `CURRICULUM` | `1` | `1` enables the easy→medium→hard prompt ramp and the adaptive eval tier |
|
| | `CURRICULUM_PROMOTE` | `0.55` | Held-out success rate that promotes the eval tier one step |
|
| | `CURRICULUM_DEMOTE` | `0.10` | Rolling success rate that demotes the eval tier one step |
|
| | `MAX_STEPS` | `18` | Max steps per episode |
|
| | `NUM_GENERATIONS` | `8` | GRPO group size (larger groups give a less noisy group-relative baseline but cost more rollouts per prompt) |
|
| | `NUM_GPUS` | auto-detected | `accelerate launch --num_processes` value |
|
| | `CHECKPOINT_EVAL_STEPS` | `25` | Run a held-out eval every N updates |
|
| | `CHECKPOINT_EVAL_EPISODES` | `8` | Episodes per mid-training eval |
|
| | `EVAL_EPISODES` | `32` | Episodes for pre/post eval (more episodes give more statistical power) |
|
| | `OUTPUT_DIR` | `runs/unsloth-grpo` | LoRA adapter output |
|
| | `EVIDENCE_DIR` | `evidence` | Where curves, CSVs, plots are written |
|
| | `PUSH_REPO` | `${HF_USERNAME}/cernenv-grpo-qwen2.5-3b` | Hub repo for adapters + evidence |
|
| | `AUTOSTART` | `0` | Set to `1` to start training on Space boot |
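The `CURRICULUM_PROMOTE` and `CURRICULUM_DEMOTE` thresholds can be read as a simple one-step tier update. The sketch below is a hypothetical illustration, not the trainer's actual code: `load_thresholds` and `next_tier` are invented names, and the comparison operators (`>=` to promote, `<=` to demote, promotion checked first) are assumptions.

```python
import os

TIERS = ("easy", "medium", "hard")  # tier names from the CURRICULUM row


def load_thresholds(env=os.environ):
    """Read the two thresholds, falling back to the documented defaults."""
    promote = float(env.get("CURRICULUM_PROMOTE", "0.55"))
    demote = float(env.get("CURRICULUM_DEMOTE", "0.10"))
    return promote, demote


def next_tier(current, heldout_success, rolling_success,
              promote=0.55, demote=0.10):
    """Move the eval tier at most one step, clamped to the ends of TIERS.

    Assumption: promotion uses the held-out success rate and demotion the
    rolling success rate, as the table describes; the exact comparisons
    in the real trainer may differ.
    """
    i = TIERS.index(current)
    if heldout_success >= promote:
        i = min(i + 1, len(TIERS) - 1)
    elif rolling_success <= demote:
        i = max(i - 1, 0)
    return TIERS[i]
```

For example, under these assumptions `next_tier("easy", 0.60, 0.50)` moves to `"medium"`, while a tier already at `"hard"` stays there however high the success rate.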
|
|
|
| ## How to use
|
|
|
| This Space exposes a tiny FastAPI control panel:
|
| - `GET /` — status + run info + **live training-progress evidence** (curves, before/after metrics, plots)
|
| - `POST /train` — start / restart a training run
|
| - `GET /logs?tail=N` — live tail of `training.log`
|
| - `GET /metrics` — pre / post / Δ metrics JSON
|
| - `GET /evidence` — list of evidence artifacts on disk
|
| - `GET /evidence/{name}` — download an artifact (`training_curve.png`, `training_log.csv`, etc.)
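The endpoints above can also be driven from a script. A minimal standard-library sketch, assuming the Space is publicly reachable and needs no auth header; the base URL in `demo` is a placeholder, and `endpoint_url`, `call`, and `demo` are illustrative helpers rather than anything shipped with the Space:

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen


def endpoint_url(base_url, path, **query):
    """Join the Space base URL with an endpoint path and optional query."""
    url = urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))
    return f"{url}?{urlencode(query)}" if query else url


def call(base_url, path, method="GET", timeout=30, **query):
    """Send one request and return the response body as text."""
    req = Request(endpoint_url(base_url, path, **query), method=method)
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def demo(base_url="https://example-user-cernenv-trainer.hf.space"):
    """Start a run, tail the log, then read the metrics (placeholder URL)."""
    print(call(base_url, "/train", method="POST"))
    print(call(base_url, "/logs", tail=50))
    print(json.loads(call(base_url, "/metrics")))
```

`demo` is defined but never called, so importing or running the snippet makes no network requests until you call it with your own Space URL.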
|
|
|
| ### Training-progress evidence saved (and pushed to Hub)
|
| - `training_log.csv` — per-step reward, loss, KL, lr, grad-norm
|
| - `training_curve.png` — reward + loss vs step
|
| - `reward_components.csv` — per-rollout terminal vs shaping reward, plus
|
| discovery / mass / channel / parsed-action rates per logging step.
|
| This is the "watch individual reward function columns" view recommended
|
| in the hackathon FAQ — it makes verifier hacks visible (rising mean
|
| reward without rising mass/channel correctness ⇒ red flag).
|
| - `reward_components.png` — 2-panel plot rendered from the above CSV
|
| - `checkpoint_evals.csv` — held-out eval every `CHECKPOINT_EVAL_STEPS` updates
|
| - `checkpoint_progression.png` — mean reward + success/mass/channel accuracy vs step
|
| - `pre_eval.jsonl` / `post_eval.jsonl` — full per-episode rollouts before vs after
|
| - `before_after_summary.png` — pre/post bar chart with Δ annotations
|
| - `reward_distribution.png` — pre vs post reward histogram
|
| - `before_after_metrics.json` — machine-readable metrics + deltas
|
| - `sample_trajectories.md` — cherry-picked pre vs post agent traces
|
| - `curriculum_state.json` — adaptive-curriculum tier/promotion log
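The red-flag rule for `reward_components.csv` (mean reward rising while mass/channel correctness does not) can be checked mechanically. The sketch below is one possible heuristic under stated assumptions: the column names `mean_reward`, `mass_rate`, and `channel_rate` are hypothetical (check the real CSV header), and "rising" is approximated by comparing the second half of the run to the first half.

```python
import csv
from statistics import mean


def _trend(values):
    """Mean of the second half minus mean of the first half."""
    half = len(values) // 2
    return mean(values[half:]) - mean(values[:half])


def looks_like_verifier_hack(rows, reward_col="mean_reward",
                             check_cols=("mass_rate", "channel_rate"),
                             min_gain=0.05):
    """Flag runs where reward climbs but no correctness column does.

    `rows` is a list of dicts such as csv.DictReader yields. Column names
    are assumptions and must be adapted to the actual file.
    """
    if len(rows) < 4:
        return False  # too few logging steps to call a trend
    reward_up = _trend([float(r[reward_col]) for r in rows]) > min_gain
    correctness_up = any(
        _trend([float(r[col]) for r in rows]) > 0 for col in check_cols
    )
    return reward_up and not correctness_up


def load_rows(path="evidence/reward_components.csv"):
    """Load the evidence CSV into a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

A flagged run is a prompt to inspect `sample_trajectories.md` by hand, not proof of reward hacking; a half-versus-half comparison is deliberately crude.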
|
|
|
| Click **"Start training"** in the UI, or set `AUTOSTART=1` in the Space variables to kick off immediately on boot.
|
|
|
| When training finishes, the LoRA adapters and the evidence artifacts are pushed to `PUSH_REPO`.
|
|
|
| ## Local equivalent
|
|
|
| The same training run is reproducible locally with:
|
|
|
| ```bash
|
| # single GPU (with curriculum)
|
| PYTHONPATH=. python -m training.training_unsloth \
|
|   --model_name unsloth/Qwen2.5-3B-Instruct \
|
|   --difficulty easy --curriculum --total_episodes 1500 --max_steps 18 \
|
|   --num_generations 8 --output_dir runs/unsloth-grpo \
|
|   --evidence_dir evidence
|
|
|
| # multi-GPU (e.g. 4× A100, with curriculum)
|
| PYTHONPATH=. accelerate launch --num_processes 4 --mixed_precision bf16 \
|
|   -m training.training_unsloth \
|
|   --difficulty easy --curriculum \
|
|   --total_episodes 1500 --num_generations 8 \
|
|   --output_dir runs/unsloth-grpo --evidence_dir evidence
|
| ```
|
|
|