---
title: CERNenv Trainer
emoji: ⚛️
colorFrom: indigo
colorTo: pink
sdk: docker
suggested_hardware: a100x4
suggested_storage: medium
pinned: false
license: bsd-3-clause
short_description: GRPO trainer for CERNenv (Unsloth + LoRA, A100)
---

# CERNenv Trainer (Hugging Face Space, A100)

Fine-tunes a small instruction-tuned LLM (Large Language Model) to act as an LHC (Large Hadron Collider) physicist inside the CERNenv OpenEnv environment using GRPO (Group-Relative Policy Optimization), Unsloth, and LoRA (Low-Rank Adaptation).
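Concretely, GRPO samples a group of completions per prompt (`NUM_GENERATIONS` below) and scores each one relative to its own group, so no separate value network is needed. A minimal sketch of that group-relative advantage, with made-up reward values:

```python
import numpy as np

# Hypothetical rewards for one GRPO group of 8 completions of the same
# prompt (matching this Space's NUM_GENERATIONS=8 default).
group_rewards = np.array([0.9, 0.1, 0.4, 0.0, 0.7, 0.2, 0.5, 0.3])

# Group-relative advantage: each completion is scored against its own
# group's mean and spread, not against an absolute baseline.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

print(advantages.round(2))  # positive = better than the group average
```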

## Hardware

- Recommended: 4× A100 (`a100x4`, 320 GB VRAM, ~$10/hr)
- Single GPU also supported: `a100-large` (slower, fewer episodes recommended)
- Minimum: T4 / L4 (use the Colab notebook fallback)

## Budget guidance (~$27 envelope, the default for this hackathon run)

A 1500-episode GRPO run with `MODEL_NAME=unsloth/Qwen2.5-3B-Instruct`, `NUM_GENERATIONS=8`, and `MAX_STEPS=18` typically lands as follows:

| Hardware | $/hr | Wall-clock | Cost (1× run) | Headroom in $27 |
|---|---|---|---|---|
| `a100x4` | ~$10 | ~1.5–2 h | ~$15–20 | 1 re-run |
| `a100-large` | ~$4 | ~2.5–3 h | ~$10–12 | 2+ re-runs |
| `l40sx4` | ~$8 | ~2 h | ~$16 | 1 re-run |

`a100x4` gets the trained adapters + evidence into your hands fastest; the multi-GPU launcher (`accelerate launch --num_processes 4`) is already wired up in `_build_training_cmd`. If you want extra safety margin in case anything needs a re-run, drop to `a100-large`: wall-clock is ~2× longer, but cost is ~50% lower, leaving you with budget for two complete attempts.
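For orientation, here is a minimal sketch of how a `_build_training_cmd`-style helper can pick between the single- and multi-GPU launchers (illustrative only; `_detect_gpus` is a hypothetical helper, and the repo's actual flags may differ):

```python
import os
import shutil

def _detect_gpus() -> int:
    # Hypothetical helper: count visible CUDA devices via torch if present.
    try:
        import torch
        return torch.cuda.device_count() or 1
    except ImportError:
        return 1

def _build_training_cmd() -> list[str]:
    # Sketch: honor NUM_GPUS if set, otherwise auto-detect (see the table below).
    num_gpus = int(os.environ.get("NUM_GPUS", "0")) or _detect_gpus()
    module = ["-m", "training.training_unsloth"]
    if num_gpus > 1 and shutil.which("accelerate"):
        # Multi-GPU path: the accelerate launcher mentioned above.
        return ["accelerate", "launch", "--num_processes", str(num_gpus),
                "--mixed_precision", "bf16", *module]
    return ["python", *module]  # single-GPU fallback
```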

## Required Space secrets

| Secret | Purpose |
|---|---|
| `HF_TOKEN` | Hugging Face token with write access for model push |
| `HF_USERNAME` | Hub username, used as the default model-repo owner |

## Optional environment variables

| Variable | Default | Notes |
|---|---|---|
| `MODEL_NAME` | `unsloth/Qwen2.5-3B-Instruct` | Any chat model Unsloth supports |
| `TOTAL_EPISODES` | `1500` | Total rollouts (prompts × generations) |
| `DIFFICULTY` | `easy` | Starting tier when `CURRICULUM=1`; static tier when `CURRICULUM=0` |
| `CURRICULUM` | `1` | `1` enables the easy→medium→hard prompt ramp + adaptive eval tier (sketched below) |
| `CURRICULUM_PROMOTE` | `0.55` | Held-out success rate that promotes the eval tier one step |
| `CURRICULUM_DEMOTE` | `0.10` | Rolling success rate that demotes the eval tier one step |
| `MAX_STEPS` | `18` | Max steps per episode |
| `NUM_GENERATIONS` | `8` | GRPO group size (bigger = better signal) |
| `NUM_GPUS` | auto-detected | `accelerate launch --num_processes` value |
| `CHECKPOINT_EVAL_STEPS` | `25` | Run a held-out eval every N updates |
| `CHECKPOINT_EVAL_EPISODES` | `8` | Episodes per mid-training eval |
| `EVAL_EPISODES` | `32` | Episodes for pre/post eval (statistical power) |
| `OUTPUT_DIR` | `runs/unsloth-grpo` | LoRA adapter output |
| `EVIDENCE_DIR` | `evidence` | Where curves, CSVs, and plots are written |
| `PUSH_REPO` | `${HF_USERNAME}/cernenv-grpo-qwen2.5-3b` | Hub repo for adapters + evidence |
| `AUTOSTART` | `0` | Set to `1` to start training on Space boot |
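The adaptive curriculum controlled by `CURRICULUM`, `CURRICULUM_PROMOTE`, and `CURRICULUM_DEMOTE` reduces to a small tier state machine. A minimal sketch, assuming the easy→medium→hard tier order above (function and argument names are illustrative):

```python
TIERS = ["easy", "medium", "hard"]

def update_eval_tier(tier: str, heldout_success: float, rolling_success: float,
                     promote: float = 0.55, demote: float = 0.10) -> str:
    """Promote on strong held-out success; demote when the rolling rate collapses."""
    i = TIERS.index(tier)
    if heldout_success >= promote and i < len(TIERS) - 1:
        return TIERS[i + 1]   # e.g. easy -> medium
    if rolling_success < demote and i > 0:
        return TIERS[i - 1]   # back off one tier
    return tier

# update_eval_tier("easy", heldout_success=0.60, rolling_success=0.45) -> "medium"
```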

## How to use

This Space exposes a tiny FastAPI control panel:

- `GET /` — status + run info + live training-progress evidence (curves, before/after metrics, plots)
- `POST /train` — start / restart a training run
- `GET /logs?tail=N` — live tail of `training.log`
- `GET /metrics` — pre/post/Δ metrics as JSON
- `GET /evidence` — list of evidence artifacts on disk
- `GET /evidence/{name}` — download an artifact (`training_curve.png`, `training_log.csv`, etc.)
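The endpoints can be driven with plain HTTP from any client; a quick sketch using `requests` (the Space URL and token are placeholders, and the auth header is only needed for private Spaces):

```python
import requests

BASE = "https://your-username-cernenv-trainer.hf.space"  # placeholder Space URL
HEADERS = {"Authorization": "Bearer hf_..."}             # only for private Spaces

requests.post(f"{BASE}/train", headers=HEADERS)          # start / restart a run
print(requests.get(f"{BASE}/logs", params={"tail": 50}, headers=HEADERS).text)
print(requests.get(f"{BASE}/metrics", headers=HEADERS).json())

# Download one of the artifacts listed by GET /evidence
resp = requests.get(f"{BASE}/evidence/training_curve.png", headers=HEADERS)
with open("training_curve.png", "wb") as f:
    f.write(resp.content)
```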

## Training-progress evidence saved (and pushed to the Hub)

- `training_log.csv` — per-step reward, loss, KL, learning rate, and grad norm
- `training_curve.png` — reward + loss vs step
- `reward_components.csv` — per-rollout terminal vs shaping reward, plus discovery/mass/channel/parsed-action rates per logging step. This is the "watch individual reward function columns" view recommended in the hackathon FAQ — it makes verifier hacks visible (rising mean reward without rising mass/channel correctness ⇒ red flag; see the sketch after this list)
- `reward_components.png` — 2-panel plot rendered from the above CSV
- `checkpoint_evals.csv` — held-out eval every `CHECKPOINT_EVAL_STEPS` updates
- `checkpoint_progression.png` — mean reward + success/mass/channel accuracy vs step
- `pre_eval.jsonl` / `post_eval.jsonl` — full per-episode rollouts before vs after
- `before_after_summary.png` — pre/post bar chart with Δ annotations
- `reward_distribution.png` — pre vs post reward histogram
- `before_after_metrics.json` — machine-readable metrics + deltas
- `sample_trajectories.md` — cherry-picked pre vs post agent traces
- `curriculum_state.json` — adaptive-curriculum tier/promotion log
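Referring back to the `reward_components.csv` item, the verifier-hack check takes only a few lines of pandas. A sketch, assuming hypothetical column names (`terminal_reward`, `mass_rate`, `channel_rate`); check your CSV header first:

```python
import pandas as pd

df = pd.read_csv("evidence/reward_components.csv")  # column names assumed, not verified

early, recent = df.head(20), df.tail(20)
reward_up  = recent["terminal_reward"].mean() > early["terminal_reward"].mean()
mass_up    = recent["mass_rate"].mean()       > early["mass_rate"].mean()
channel_up = recent["channel_rate"].mean()    > early["channel_rate"].mean()

if reward_up and not (mass_up or channel_up):
    # Mean reward rising while correctness rates stay flat: the red flag above.
    print("Possible verifier hack: reward up without mass/channel gains.")
else:
    print("Reward and correctness are moving together.")
```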

Click "Start training" in the UI, or set AUTOSTART=1 in the Space variables to kick off immediately on boot.

When training finishes, the LoRA adapters are pushed to PUSH_REPO.
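Afterwards the adapters can be pulled back down and applied to the base model; a minimal sketch with `transformers` + `peft` (the repo name is a placeholder, and this assumes the adapters were pushed in the standard PEFT format):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL   = "unsloth/Qwen2.5-3B-Instruct"            # same base as training
ADAPTER_REPO = "your-username/cernenv-grpo-qwen2.5-3b"  # your PUSH_REPO

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER_REPO)  # attach the LoRA adapters
```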

## Local equivalent

The same training run is reproducible locally with:

```bash
# single GPU (with curriculum)
PYTHONPATH=. python -m training.training_unsloth \
  --model_name unsloth/Qwen2.5-3B-Instruct \
  --difficulty easy --curriculum --total_episodes 1500 --max_steps 18 \
  --num_generations 8 --output_dir runs/unsloth-grpo \
  --evidence_dir evidence

# multi-GPU (e.g. 4× A100, with curriculum)
PYTHONPATH=. accelerate launch --num_processes 4 --mixed_precision bf16 \
  -m training.training_unsloth \
  --difficulty easy --curriculum \
  --total_episodes 1500 --num_generations 8 \
  --output_dir runs/unsloth-grpo --evidence_dir evidence
```