---
title: CERNenv Trainer
emoji: ⚛️
colorFrom: indigo
colorTo: pink
sdk: docker
suggested_hardware: a100x4
suggested_storage: medium
pinned: false
license: bsd-3-clause
short_description: GRPO trainer for CERNenv (Unsloth + LoRA, A100)
---
# CERNenv Trainer (Hugging Face Space, A100)
Fine-tunes a small instruction-tuned LLM (Large Language Model) to act as an LHC (Large Hadron Collider) physicist inside the CERNenv OpenEnv environment using GRPO (Group-Relative Policy Optimization), Unsloth, and LoRA (Low-Rank Adaptation).
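For orientation, GRPO's core trick is to sample a group of completions per prompt and score each completion against its own group rather than a learned value function. A minimal illustrative sketch of that advantage computation (conceptual only, not the trainer's actual code):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within one GRPO group (all rollouts of one prompt).

    Completions above the group mean get a positive advantage and are
    reinforced; those below the mean get a negative advantage.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# With NUM_GENERATIONS=8, each prompt yields 8 rollouts -> 8 advantages
print(group_relative_advantages(np.array([0.0, 0.2, 0.9, 0.1, 0.4, 0.0, 0.7, 0.3])))
```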
## Hardware
- Recommended: 4× A100 (`a100x4`, 320 GB VRAM, ~$10/hr)
- Single GPU also supported: `a100-large` (slower; fewer episodes recommended)
- Minimum: T4 / L4 (use the Colab notebook fallback)
## Budget guidance (~$27 envelope, the default for this hackathon run)
A 1500-episode GRPO run with `MODEL_NAME=unsloth/Qwen2.5-3B-Instruct`, `NUM_GENERATIONS=8`, and `MAX_STEPS=18` typically lands as follows:
| Hardware | $/hr | Wall-clock | Cost (1× run) | Headroom in $27 |
|---|---|---|---|---|
| `a100x4` | ~$10 | ~1.5–2 h | ~$15–20 | 1 re-run |
| `a100-large` | ~$4 | ~2.5–3 h | ~$10–12 | 2+ re-runs |
| `l40sx4` | ~$8 | ~2 h | ~$16 | 1 re-run |
`a100x4` gets the trained adapters + evidence into your hands fastest; the multi-GPU launcher (`accelerate launch --num_processes 4`) is already wired in `_build_training_cmd`. If you want extra safety margin in case anything needs a re-run, drop to `a100-large` — wall-clock is ~2× longer but cost is ~50% lower, leaving you with budget for two complete attempts.
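`_build_training_cmd` itself lives in the Space's app code; below is a minimal sketch of the kind of dispatch it performs. Every name and default here is an assumption for illustration, not the actual implementation.

```python
import os
import shutil

def build_training_cmd_sketch() -> list[str]:
    """Hypothetical launcher dispatch: multi-GPU via accelerate, else plain python."""
    num_gpus = int(os.environ.get("NUM_GPUS", "1"))  # the real code auto-detects GPUs
    if num_gpus > 1 and shutil.which("accelerate"):
        prefix = ["accelerate", "launch", "--num_processes", str(num_gpus),
                  "--mixed_precision", "bf16"]
    else:
        prefix = ["python"]
    return prefix + ["-m", "training.training_unsloth",
                     "--total_episodes", os.environ.get("TOTAL_EPISODES", "1500")]
```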
## Required Space secrets
| Secret | Purpose |
|---|---|
| `HF_TOKEN` | Hugging Face token with write access for model push |
| `HF_USERNAME` | Hub username, used as the default model-repo owner |
## Optional environment variables
| Variable | Default | Notes |
|---|---|---|
| `MODEL_NAME` | `unsloth/Qwen2.5-3B-Instruct` | Any chat model Unsloth supports |
| `TOTAL_EPISODES` | `1500` | Total rollouts (prompts × generations) |
| `DIFFICULTY` | `easy` | Starting tier when `CURRICULUM=1`; static tier when `CURRICULUM=0` |
| `CURRICULUM` | `1` | `1` enables the easy→medium→hard prompt ramp + adaptive eval tier |
| `CURRICULUM_PROMOTE` | `0.55` | Held-out success rate that promotes the eval tier one step |
| `CURRICULUM_DEMOTE` | `0.10` | Rolling success rate that demotes the eval tier one step |
| `MAX_STEPS` | `18` | Max steps per episode |
| `NUM_GENERATIONS` | `8` | GRPO group size (bigger = better signal) |
| `NUM_GPUS` | auto-detected | `accelerate launch --num_processes` value |
| `CHECKPOINT_EVAL_STEPS` | `25` | Run a held-out eval every N updates |
| `CHECKPOINT_EVAL_EPISODES` | `8` | Episodes per mid-training eval |
| `EVAL_EPISODES` | `32` | Episodes for pre/post eval (statistical power) |
| `OUTPUT_DIR` | `runs/unsloth-grpo` | LoRA adapter output |
| `EVIDENCE_DIR` | `evidence` | Where curves, CSVs, and plots are written |
| `PUSH_REPO` | `${HF_USERNAME}/cernenv-grpo-qwen2.5-3b` | Hub repo for adapters + evidence |
| `AUTOSTART` | `0` | Set to `1` to start training on Space boot |
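These are plain environment variables, so the training entrypoint can resolve them with stdlib calls. A sketch of how such a config might be assembled (the parsing below mirrors the table's defaults but is illustrative, not the Space's actual code):

```python
import os

# Illustrative only: keys and defaults mirror the table above.
config = {
    "model_name": os.environ.get("MODEL_NAME", "unsloth/Qwen2.5-3B-Instruct"),
    "total_episodes": int(os.environ.get("TOTAL_EPISODES", "1500")),
    "difficulty": os.environ.get("DIFFICULTY", "easy"),
    "curriculum": os.environ.get("CURRICULUM", "1") == "1",
    "curriculum_promote": float(os.environ.get("CURRICULUM_PROMOTE", "0.55")),
    "curriculum_demote": float(os.environ.get("CURRICULUM_DEMOTE", "0.10")),
    "max_steps": int(os.environ.get("MAX_STEPS", "18")),
    "num_generations": int(os.environ.get("NUM_GENERATIONS", "8")),
}
```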
## How to use
This Space exposes a tiny FastAPI control panel:
- `GET /` — status + run info + live training-progress evidence (curves, before/after metrics, plots)
- `POST /train` — start / restart a training run
- `GET /logs?tail=N` — live tail of `training.log`
- `GET /metrics` — pre / post / Δ metrics JSON
- `GET /evidence` — list of evidence artifacts on disk
- `GET /evidence/{name}` — download an artifact (`training_curve.png`, `training_log.csv`, etc.)
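Under the hood this is a handful of FastAPI routes. A minimal sketch of two of them (handler bodies are placeholders, not the Space's real implementation):

```python
from pathlib import Path
from fastapi import FastAPI

app = FastAPI()
LOG_FILE = Path("training.log")  # assumed log location

@app.post("/train")
def start_training() -> dict:
    """Start / restart a run; the real handler spawns the training process."""
    return {"status": "started"}

@app.get("/logs")
def tail_logs(tail: int = 100) -> dict:
    """Return the last `tail` lines of training.log (GET /logs?tail=N)."""
    lines = LOG_FILE.read_text().splitlines() if LOG_FILE.exists() else []
    return {"lines": lines[-tail:]}
```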
## Training-progress evidence saved (and pushed to Hub)
- `training_log.csv` — per-step reward, loss, KL, lr, grad-norm
- `training_curve.png` — reward + loss vs step
- `reward_components.csv` — per-rollout terminal vs shaping reward, plus discovery / mass / channel / parsed-action rates per logging step. This is the "watch individual reward function columns" view recommended in the hackathon FAQ — it makes verifier hacks visible (rising mean reward without rising mass/channel correctness ⇒ red flag; see the sketch after this list)
- `reward_components.png` — 2-panel plot rendered from the above CSV
- `checkpoint_evals.csv` — held-out eval every `CHECKPOINT_EVAL_STEPS` updates
- `checkpoint_progression.png` — mean reward + success/mass/channel accuracy vs step
- `pre_eval.jsonl` / `post_eval.jsonl` — full per-episode rollouts before vs after
- `before_after_summary.png` — pre/post bar chart with Δ annotations
- `reward_distribution.png` — pre vs post reward histogram
- `before_after_metrics.json` — machine-readable metrics + deltas
- `sample_trajectories.md` — cherry-picked pre vs post agent traces
- `curriculum_state.json` — adaptive-curriculum tier/promotion log
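One way to run that red-flag check yourself, assuming `reward_components.csv` carries columns along the lines of `mean_reward` and `channel_correct_rate` (these column names are guesses; check the actual CSV header):

```python
import pandas as pd

# Hypothetical column names; adjust to the real CSV header.
df = pd.read_csv("evidence/reward_components.csv")
half = len(df) // 2  # compare first half of training vs second half

reward_up = df["mean_reward"].iloc[half:].mean() > df["mean_reward"].iloc[:half].mean()
channel_up = (df["channel_correct_rate"].iloc[half:].mean()
              > df["channel_correct_rate"].iloc[:half].mean())

if reward_up and not channel_up:
    print("Red flag: reward rising without channel correctness -> possible verifier hack")
```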
Click "Start training" in the UI, or set AUTOSTART=1 in the Space variables to kick off immediately on boot.
When training finishes, the LoRA adapters are pushed to PUSH_REPO.
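That push can be reproduced with the standard `huggingface_hub` upload API. A sketch of an equivalent end-of-run push (paths and the repo fallback mirror the defaults above, but this is not necessarily the Space's exact code):

```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
repo_id = os.environ.get(
    "PUSH_REPO", f"{os.environ['HF_USERNAME']}/cernenv-grpo-qwen2.5-3b"
)
api.create_repo(repo_id, exist_ok=True)

# Upload the LoRA adapters and the evidence artifacts to the same repo.
api.upload_folder(folder_path="runs/unsloth-grpo", repo_id=repo_id)
api.upload_folder(folder_path="evidence", repo_id=repo_id, path_in_repo="evidence")
```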
## Local equivalent
The same training run is reproducible locally with:
```bash
# single GPU (with curriculum)
PYTHONPATH=. python -m training.training_unsloth \
  --model_name unsloth/Qwen2.5-3B-Instruct \
  --difficulty easy --curriculum --total_episodes 1500 --max_steps 18 \
  --num_generations 8 --output_dir runs/unsloth-grpo \
  --evidence_dir evidence

# multi-GPU (e.g. 4× A100, with curriculum)
PYTHONPATH=. accelerate launch --num_processes 4 --mixed_precision bf16 \
  -m training.training_unsloth \
  --difficulty easy --curriculum \
  --total_episodes 1500 --num_generations 8 \
  --output_dir runs/unsloth-grpo --evidence_dir evidence
```
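Either way, the resulting adapters in `runs/unsloth-grpo` (or the pushed Hub repo) can then be loaded for inference with standard `peft`/`transformers` calls, assuming they were saved in PEFT format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the trained LoRA adapters on top.
base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-3B-Instruct", device_map="auto"
)
model = PeftModel.from_pretrained(base, "runs/unsloth-grpo")  # or the PUSH_REPO id
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-Instruct")
```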