	---
	title: CERNenv Trainer
	emoji: ⚛️
	colorFrom: indigo
	colorTo: pink
	sdk: docker
	suggested_hardware: a100x4
	suggested_storage: medium
	pinned: false
	license: bsd-3-clause
	short_description: GRPO trainer for CERNenv (Unsloth + LoRA, A100)
	---

	# CERNenv Trainer (Hugging Face Space, A100)

	Fine-tunes a small instruction-tuned LLM (Large Language Model) to act as
	an LHC (Large Hadron Collider) physicist inside the CERNenv OpenEnv
	environment using GRPO (Group-Relative Policy Optimization),
	Unsloth, and LoRA (Low-Rank Adaptation).
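
For readers new to GRPO, the core idea is that each prompt is sampled several times and every rollout is scored relative to its own group rather than against a learned value function. A minimal sketch of that group-relative advantage, assuming the common mean/std normalization; the function name and the epsilon are illustrative, and the trainer's actual implementation may differ:

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts of the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division finite when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt sampled NUM_GENERATIONS=8 times, with made-up rewards.
group = [0.0, 0.0, 1.0, 0.0, 0.5, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(group)
```

Rollouts that beat their group's mean get a positive advantage and the rest get a negative one; a group where every rollout scores the same contributes no learning signal.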

	## Hardware

	- Recommended: 4× A100 (`a100x4`, 320 GB VRAM, ~$10/hr)
	- Single GPU also supported: `a100-large` (slower, fewer episodes recommended)
	- Minimum: T4 / L4 (use the Colab notebook fallback)

	### Budget guidance (~$27 envelope, the default for this hackathon run)

	A 1500-episode GRPO run with `MODEL_NAME=unsloth/Qwen2.5-3B-Instruct`,
	`NUM_GENERATIONS=8`, `MAX_STEPS=18` typically lands as follows:

	| Hardware | $/hr | Wall-clock | Cost (1× run) | Left from $27 after 1 run |
	| ------------ | ----- | ---------- | ------------- | --------------- |
	| `a100x4` | ~$10 | ~1.5–2 h | ~$15–20 | ~$7–12 (partial re-run only) |
	| `a100-large` | ~$4 | ~2.5–3 h | ~$10–12 | ~$15–17 (1 full re-run) |
	| `l40sx4` | ~$8 | ~2 h | ~$16 | ~$11 (partial re-run only) |

	`a100x4` gets the trained adapters + evidence into your hands fastest; the
	multi-GPU launcher (`accelerate launch --num_processes 4`) is already wired
	into `_build_training_cmd`. If you want a safety margin in case anything
	needs a re-run, drop to `a100-large`: by the table above, wall-clock is
	roughly 1.5× longer but cost is about 35–40% lower, which leaves budget
	for two complete attempts within $27.
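
The budget arithmetic is simply hourly rate × wall-clock hours, compared against the $27 envelope. A small sketch using the approximate rates and durations from the table above (actual pricing and run times will vary, and the helper names are illustrative):

```python
BUDGET_USD = 27.0

# name -> (approx. $/hr, (min hours, max hours)), copied from the table above.
HARDWARE = {
    "a100x4": (10.0, (1.5, 2.0)),
    "a100-large": (4.0, (2.5, 3.0)),
    "l40sx4": (8.0, (2.0, 2.0)),
}


def run_cost(rate_per_hour, hours):
    """Return the (best-case, worst-case) cost of one run in USD."""
    lo, hi = hours
    return rate_per_hour * lo, rate_per_hour * hi


def full_runs_within(budget, worst_case_cost):
    """How many complete runs fit in the budget at worst-case cost."""
    return int(budget // worst_case_cost)


for name, (rate, hours) in HARDWARE.items():
    lo, hi = run_cost(rate, hours)
    runs = full_runs_within(BUDGET_USD, hi)
    print(f"{name}: ${lo:.0f}-{hi:.0f} per run, {runs} full run(s) at worst-case cost")
```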

	## Required Space secrets
	| Secret | Purpose |
	| --- | --- |
	| `HF_TOKEN` | Hugging Face token with `write` access for model push |
	| `HF_USERNAME` | Hub username, used as the default model-repo owner |

	## Optional environment variables
	| Variable | Default | Notes |
	| --- | --- | --- |
	| `MODEL_NAME` | `unsloth/Qwen2.5-3B-Instruct` | Any chat model Unsloth supports |
	| `TOTAL_EPISODES` | `1500` | Total rollouts (prompts × generations per prompt) |
	| `DIFFICULTY` | `easy` | Starting tier when `CURRICULUM=1`; static tier when `CURRICULUM=0` |
	| `CURRICULUM` | `1` | `1` enables easy→medium→hard prompt-ramp + adaptive eval-tier |
	| `CURRICULUM_PROMOTE` | `0.55` | Held-out success rate that promotes the eval tier one step |
	| `CURRICULUM_DEMOTE` | `0.10` | Rolling success rate that demotes the eval tier one step |
	| `MAX_STEPS` | `18` | Max steps per episode |
	| `NUM_GENERATIONS` | `8` | GRPO group size (larger groups give a steadier reward baseline but cost more compute per prompt) |
	| `NUM_GPUS` | auto-detected | `accelerate launch --num_processes` value |
	| `CHECKPOINT_EVAL_STEPS` | `25` | Run a held-out eval every N updates |
	| `CHECKPOINT_EVAL_EPISODES` | `8` | Episodes per mid-training eval |
	| `EVAL_EPISODES` | `32` | Episodes for each of the pre- and post-training evals (more episodes give more statistical power) |
	| `OUTPUT_DIR` | `runs/unsloth-grpo` | LoRA adapter output |
	| `EVIDENCE_DIR` | `evidence` | Where curves, CSVs, plots are written |
	| `PUSH_REPO` | `${HF_USERNAME}/cernenv-grpo-qwen2.5-3b` | Hub repo for adapters + evidence |
	| `AUTOSTART` | `0` | Set to `1` to start training on Space boot |
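
The `CURRICULUM_PROMOTE` / `CURRICULUM_DEMOTE` pair describes a threshold rule over the three tiers. A minimal sketch of how such a rule could work, assuming inclusive comparisons, one-step moves, and that promotion is checked before demotion; `next_tier` is a hypothetical helper, not the trainer's actual code:

```python
import os

TIERS = ["easy", "medium", "hard"]


def _env_float(name, default):
    """Read a float from the environment, falling back to the documented default."""
    return float(os.environ.get(name, default))


def next_tier(current, heldout_success, rolling_success, promote=None, demote=None):
    """Move the eval tier at most one step up or down."""
    if promote is None:
        promote = _env_float("CURRICULUM_PROMOTE", 0.55)
    if demote is None:
        demote = _env_float("CURRICULUM_DEMOTE", 0.10)

    i = TIERS.index(current)
    if heldout_success >= promote:
        i = min(i + 1, len(TIERS) - 1)  # already at "hard": stay there
    elif rolling_success <= demote:
        i = max(i - 1, 0)  # already at "easy": stay there
    return TIERS[i]
```

Keeping the two thresholds far apart (0.55 vs 0.10 by default) leaves a wide band in which the tier does not change, which avoids flapping between tiers on noisy evals.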

	## How to use

	This Space exposes a tiny FastAPI control panel:
	- `GET /` — status + run info + live training-progress evidence (curves, before/after metrics, plots)
	- `POST /train` — start / restart a training run
	- `GET /logs?tail=N` — live tail of `training.log`
	- `GET /metrics` — pre / post / Δ metrics JSON
	- `GET /evidence` — list of evidence artifacts on disk
	- `GET /evidence/{name}` — download an artifact (`training_curve.png`, `training_log.csv`, etc.)
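
The endpoints above can be driven from a script as well as from the UI. A minimal client sketch using only the standard library; `SPACE_URL` is a placeholder, and the response shapes (JSON for `/metrics`, plain text for `/logs`) are assumptions inferred from the endpoint descriptions rather than a documented contract:

```python
import json
import urllib.parse
import urllib.request

SPACE_URL = "https://example-space.hf.space"  # placeholder: replace with your Space URL


def endpoint(path, **query):
    """Build a full URL for a control-panel route, with optional query parameters."""
    url = SPACE_URL.rstrip("/") + path
    if query:
        url += "?" + urllib.parse.urlencode(query)
    return url


def start_training(timeout=30):
    """POST /train and return the HTTP status code."""
    req = urllib.request.Request(endpoint("/train"), data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def tail_logs(n=50, timeout=30):
    """GET /logs?tail=N and return the body as text (assumed plain text)."""
    with urllib.request.urlopen(endpoint("/logs", tail=n), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def get_metrics(timeout=30):
    """GET /metrics and parse the body as JSON (assumed JSON)."""
    with urllib.request.urlopen(endpoint("/metrics"), timeout=timeout) as resp:
        return json.load(resp)
```

A typical session would call `start_training()` once, poll `tail_logs(100)` while the run is in progress, and read `get_metrics()` at the end. A private Space would likely also need an authorization header on each request.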

	### Training-progress evidence saved (and pushed to Hub)
	- `training_log.csv` — per-step reward, loss, KL, lr, grad-norm
	- `training_curve.png` — reward + loss vs step
	- `reward_components.csv` — per-rollout terminal vs shaping reward, plus
	  discovery / mass / channel / parsed-action rates per logging step.
	  This is the "watch individual reward function columns" view recommended
	  in the hackathon FAQ: it makes verifier hacks visible (mean reward
	  rising while mass/channel correctness stays flat is a red flag).
	- `reward_components.png` — 2-panel plot rendered from the above CSV
	- `checkpoint_evals.csv` — held-out eval every `CHECKPOINT_EVAL_STEPS` updates
	- `checkpoint_progression.png` — mean reward + success/mass/channel accuracy vs step
	- `pre_eval.jsonl` / `post_eval.jsonl` — full per-episode rollouts before vs after
	- `before_after_summary.png` — pre/post bar chart with Δ annotations
	- `reward_distribution.png` — pre vs post reward histogram
	- `before_after_metrics.json` — machine-readable metrics + deltas
	- `sample_trajectories.md` — cherry-picked pre vs post agent traces
	- `curriculum_state.json` — adaptive-curriculum tier/promotion log
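
The red-flag rule described for `reward_components.csv` can be checked mechanically. A minimal sketch, assuming the CSV has columns named `mean_reward`, `mass_rate`, and `channel_rate` (hypothetical names; check the header of your actual file) and comparing the first and last thirds of the run:

```python
import csv
import io
from statistics import mean


def _trend(values):
    """Mean of the last third of the run minus mean of the first third."""
    k = max(1, len(values) // 3)
    return mean(values[-k:]) - mean(values[:k])


def looks_like_reward_hacking(rows, min_gain=0.05):
    """Flag runs where reward rises but neither correctness rate does."""
    reward_gain = _trend([float(r["mean_reward"]) for r in rows])
    mass_gain = _trend([float(r["mass_rate"]) for r in rows])
    channel_gain = _trend([float(r["channel_rate"]) for r in rows])
    return reward_gain > min_gain and max(mass_gain, channel_gain) < min_gain


# Made-up data: reward climbs while both correctness rates stay flat.
SAMPLE = """step,mean_reward,mass_rate,channel_rate
1,0.10,0.20,0.20
2,0.30,0.20,0.20
3,0.60,0.20,0.20
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
suspicious = looks_like_reward_hacking(rows)
```

On a real run, the rows would come from `csv.DictReader` over the file in `EVIDENCE_DIR`; the `min_gain` threshold is arbitrary and worth tuning to the noise level of your evals.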

	Click "Start training" in the UI, or set `AUTOSTART=1` in the Space variables to kick off immediately on boot.

	When training finishes, the LoRA adapters and the evidence artifacts listed above are pushed to `PUSH_REPO`.

	## Local equivalent

	The same training run is reproducible locally with:

	```bash
	# single GPU (with curriculum)
	PYTHONPATH=. python -m training.training_unsloth \
	--model_name unsloth/Qwen2.5-3B-Instruct \
	--difficulty easy --curriculum --total_episodes 1500 --max_steps 18 \
	--num_generations 8 --output_dir runs/unsloth-grpo \
	--evidence_dir evidence

	# multi-GPU (e.g. 4× A100, with curriculum)
	PYTHONPATH=. accelerate launch --num_processes 4 --mixed_precision bf16 \
	-m training.training_unsloth \
	--difficulty easy --curriculum \
	--total_episodes 1500 --num_generations 8 \
	--output_dir runs/unsloth-grpo --evidence_dir evidence
	```
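
The two invocations above differ mainly in the launcher prefix. A hypothetical helper in the spirit of `_build_training_cmd` (the real function is not shown in this README, so the signature and defaults here are assumptions) could assemble either form from the GPU count:

```python
import shlex


def build_training_cmd(
    num_gpus,
    total_episodes=1500,
    num_generations=8,
    difficulty="easy",
    curriculum=True,
    output_dir="runs/unsloth-grpo",
    evidence_dir="evidence",
):
    """Return the training command as an argument list."""
    module = ["-m", "training.training_unsloth"]
    if num_gpus > 1:
        # Multi-GPU: one process per GPU via accelerate.
        launcher = ["accelerate", "launch", "--num_processes", str(num_gpus),
                    "--mixed_precision", "bf16"] + module
    else:
        launcher = ["python"] + module

    args = [
        "--difficulty", difficulty,
        "--total_episodes", str(total_episodes),
        "--num_generations", str(num_generations),
        "--output_dir", output_dir,
        "--evidence_dir", evidence_dir,
    ]
    if curriculum:
        args.append("--curriculum")
    return launcher + args


print(shlex.join(build_training_cmd(4)))
```

Returning a list rather than a string lets the caller pass it straight to `subprocess.run` without shell quoting issues; `PYTHONPATH=.` would be supplied through the `env` argument.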