metadata
title: CERNenv Trainer
emoji: ⚛️
colorFrom: indigo
colorTo: pink
sdk: docker
suggested_hardware: a100x4
suggested_storage: medium
pinned: false
license: bsd-3-clause
short_description: GRPO trainer for CERNenv (Unsloth + LoRA, A100)

CERNenv Trainer (Hugging Face Space, A100)

Fine-tunes a small instruction-tuned LLM (Large Language Model) to act as an LHC (Large Hadron Collider) physicist inside the CERNenv OpenEnv environment using GRPO (Group-Relative Policy Optimization), Unsloth, and LoRA (Low-Rank Adaptation).
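GRPO drops PPO's learned value function. Instead, it scores each rollout against the other rollouts sampled for the same prompt. The sketch below shows that group-relative advantage in plain Python, assuming the usual formulation: subtract the group mean, then divide by the group standard deviation. It illustrates the idea and is not this trainer's code. Implementations differ on whether they use the population or sample standard deviation and on the epsilon value.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Advantage of each rollout relative to its own group.

    `rewards` holds the scalar rewards of the G rollouts sampled for one prompt
    (G plays the role of NUM_GENERATIONS). Each advantage is the reward's z-score
    within the group; `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of 4 rollouts where two episodes succeeded and two failed.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]
# Identical rewards carry no learning signal: every advantage is zero.
print(group_relative_advantages([0.5, 0.5, 0.5]))  # [0.0, 0.0, 0.0]
```

A larger group gives a less noisy estimate of the group mean. This is why a bigger NUM_GENERATIONS gives a better training signal, at the cost of more rollouts per prompt.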

Hardware

  • Recommended: 4× A100 (a100x4, 320 GB VRAM, ~$10/hr)
  • Single GPU also supported: a100-large (slower, fewer episodes recommended)
  • Minimum: T4 / L4 (use the Colab notebook fallback)

Required Space secrets

| Secret | Purpose |
|---|---|
| HF_TOKEN | Hugging Face token with write access, used to push the model to the Hub |
| HF_USERNAME | Hub username, used as the default model-repo owner |

Optional environment variables

| Variable | Default | Notes |
|---|---|---|
| MODEL_NAME | unsloth/Qwen2.5-3B-Instruct | Any chat model Unsloth supports |
| TOTAL_EPISODES | 1500 | Total rollouts (prompts × generations) |
| DIFFICULTY | easy | One of easy, medium, hard |
| MAX_STEPS | 18 | Maximum steps per episode |
| NUM_GENERATIONS | 8 | GRPO group size; larger groups give a better signal at higher compute cost |
| NUM_GPUS | auto-detected | Passed to accelerate launch as --num_processes |
| CHECKPOINT_EVAL_STEPS | 25 | Run a held-out eval every N updates |
| CHECKPOINT_EVAL_EPISODES | 8 | Episodes per mid-training eval |
| EVAL_EPISODES | 32 | Episodes for the pre/post eval; more episodes give more statistical power |
| OUTPUT_DIR | runs/unsloth-grpo | LoRA adapter output directory |
| EVIDENCE_DIR | evidence | Where curves, CSVs, and plots are written |
| PUSH_REPO | ${HF_USERNAME}/cernenv-grpo-qwen2.5-3b | Hub repo for adapters and evidence |
| AUTOSTART | 0 | Set to 1 to start training when the Space boots |
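The secrets and variables above can be read into a single typed config at startup. The following is a minimal sketch, assuming the Space reads them from os.environ and validates the two required secrets up front. The `TrainerConfig` and `load_config` names are hypothetical. `num_prompts` assumes the trainer derives the number of distinct prompts by integer division of TOTAL_EPISODES by NUM_GENERATIONS. With the defaults, 1500 is not a multiple of 8, so the real trainer may round differently. NUM_GPUS is left out because it only affects how the run is launched.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TrainerConfig:  # hypothetical name
    model_name: str
    total_episodes: int
    difficulty: str
    max_steps: int
    num_generations: int
    checkpoint_eval_steps: int
    checkpoint_eval_episodes: int
    eval_episodes: int
    output_dir: str
    evidence_dir: str
    push_repo: str
    autostart: bool

    @property
    def num_prompts(self) -> int:
        # Assumption: each prompt is sampled NUM_GENERATIONS times, so the number
        # of distinct prompts is TOTAL_EPISODES // NUM_GENERATIONS.
        return self.total_episodes // self.num_generations


def load_config(env: Mapping[str, str] = os.environ) -> TrainerConfig:
    """Build a TrainerConfig from Space secrets/variables, applying the README defaults."""
    username = env.get("HF_USERNAME", "")
    if not env.get("HF_TOKEN") or not username:
        raise RuntimeError("Set HF_TOKEN and HF_USERNAME as Space secrets")

    difficulty = env.get("DIFFICULTY", "easy")
    if difficulty not in {"easy", "medium", "hard"}:
        raise ValueError(f"DIFFICULTY must be easy, medium or hard, got {difficulty!r}")

    def int_var(name: str, default: int) -> int:
        return int(env.get(name, default))

    return TrainerConfig(
        model_name=env.get("MODEL_NAME", "unsloth/Qwen2.5-3B-Instruct"),
        total_episodes=int_var("TOTAL_EPISODES", 1500),
        difficulty=difficulty,
        max_steps=int_var("MAX_STEPS", 18),
        num_generations=int_var("NUM_GENERATIONS", 8),
        checkpoint_eval_steps=int_var("CHECKPOINT_EVAL_STEPS", 25),
        checkpoint_eval_episodes=int_var("CHECKPOINT_EVAL_EPISODES", 8),
        eval_episodes=int_var("EVAL_EPISODES", 32),
        output_dir=env.get("OUTPUT_DIR", "runs/unsloth-grpo"),
        evidence_dir=env.get("EVIDENCE_DIR", "evidence"),
        # PUSH_REPO defaults to ${HF_USERNAME}/cernenv-grpo-qwen2.5-3b.
        push_repo=env.get("PUSH_REPO", f"{username}/cernenv-grpo-qwen2.5-3b"),
        autostart=env.get("AUTOSTART", "0") == "1",
    )


cfg = load_config({"HF_TOKEN": "hf_xxx", "HF_USERNAME": "alice", "DIFFICULTY": "medium"})
print(cfg.push_repo, cfg.num_prompts)  # alice/cernenv-grpo-qwen2.5-3b 187
```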

How to use

This Space exposes a tiny FastAPI control panel:

  • GET / — status + run info + live training-progress evidence (curves, before/after metrics, plots)
  • POST /train — start / restart a training run
  • GET /logs?tail=N — live tail of training.log
  • GET /metrics — pre / post / Δ metrics JSON
  • GET /evidence — list of evidence artifacts on disk
  • GET /evidence/{name} — download an artifact (training_curve.png, training_log.csv, etc.)
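These endpoints can be driven from a script as well as from the browser. Below is a minimal standard-library client, sketched under a few assumptions. The base URL is a placeholder; replace it with your Space's *.hf.space address. The sketch also assumes that /metrics returns JSON, that /logs returns plain text, and that a private Space accepts a Hugging Face token as a Bearer header. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "https://YOUR-SPACE.hf.space"  # hypothetical placeholder


def endpoint(base_url: str, path: str, **query: object) -> str:
    """Join the Space base URL, an endpoint path, and optional query parameters."""
    url = base_url.rstrip("/") + path
    if query:
        url += "?" + urllib.parse.urlencode(query)
    return url


def _open(url: str, method: str = "GET", token: Optional[str] = None):
    req = urllib.request.Request(url, method=method)
    if token:  # assumption: only needed for private Spaces
        req.add_header("Authorization", f"Bearer {token}")
    return urllib.request.urlopen(req, timeout=30)


def get_metrics(base_url: str = BASE_URL, token: Optional[str] = None) -> dict:
    # GET /metrics: pre / post / delta metrics as JSON.
    with _open(endpoint(base_url, "/metrics"), token=token) as resp:
        return json.load(resp)


def tail_logs(base_url: str = BASE_URL, n: int = 50, token: Optional[str] = None) -> str:
    # GET /logs?tail=N: last N lines of training.log (assumed to be plain text).
    with _open(endpoint(base_url, "/logs", tail=n), token=token) as resp:
        return resp.read().decode("utf-8")


def start_training(base_url: str = BASE_URL, token: Optional[str] = None) -> int:
    # POST /train starts or restarts a run; only the HTTP status is returned here.
    with _open(endpoint(base_url, "/train"), method="POST", token=token) as resp:
        return resp.status
```

For example, once BASE_URL points at a running Space, `print(tail_logs(n=100))` would show the last 100 lines of training.log.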

Training-progress evidence saved (and pushed to the Hub)

  • training_log.csv — per-step reward, loss, KL, lr, grad-norm
  • training_curve.png — reward + loss vs step
  • checkpoint_evals.csv — held-out eval every CHECKPOINT_EVAL_STEPS updates
  • checkpoint_progression.png — mean reward + success/mass/channel accuracy vs step
  • pre_eval.jsonl / post_eval.jsonl — full per-episode rollouts before vs after
  • before_after_summary.png — pre/post bar chart with Δ annotations
  • reward_distribution.png — pre vs post reward histogram
  • before_after_metrics.json — machine-readable metrics + deltas
  • sample_trajectories.md — cherry-picked pre vs post agent traces
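The pre/post deltas can also be recomputed from the raw rollouts. The sketch below compares the mean of one per-episode field across pre_eval.jsonl and post_eval.jsonl. It assumes each JSONL line is one episode with a numeric `reward` field. That schema is hypothetical, so check an actual line of the file and pass the real key. The sketch does not reproduce the other metrics in before_after_metrics.json.

```python
import json
from statistics import mean
from typing import Iterable


def field_mean(lines: Iterable[str], key: str = "reward") -> float:
    """Mean of `key` over a JSONL stream with one episode object per line."""
    values = [float(json.loads(line)[key]) for line in lines if line.strip()]
    if not values:
        raise ValueError("no episodes found")
    return mean(values)


def before_after(pre_lines: Iterable[str], post_lines: Iterable[str], key: str = "reward") -> dict:
    """Return the pre/post means of `key` and the post - pre delta."""
    pre, post = field_mean(pre_lines, key), field_mean(post_lines, key)
    return {"pre": pre, "post": post, "delta": post - pre}


# Usage against the evidence directory (the "reward" key is an assumption):
# with open("evidence/pre_eval.jsonl") as pre, open("evidence/post_eval.jsonl") as post:
#     print(before_after(pre, post))
```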

Click "Start training" in the UI, or set AUTOSTART=1 in the Space variables to kick off immediately on boot.

When training finishes, the LoRA adapters and the training-progress evidence are pushed to PUSH_REPO.
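One way to perform that push is with the huggingface_hub client. The sketch below uses the real `HfApi.create_repo` and `HfApi.upload_folder` methods. The helper names and the repository layout (adapters at the repo root, evidence under an `evidence/` folder) are assumptions, not necessarily what this Space does.

```python
from typing import Optional


def upload_plan(output_dir: str, evidence_dir: str) -> list:
    """Hypothetical layout: LoRA adapters at the repo root, evidence under evidence/."""
    return [
        {"folder_path": output_dir, "path_in_repo": None},  # None uploads to the repo root
        {"folder_path": evidence_dir, "path_in_repo": "evidence"},
    ]


def push_run(repo_id: str, token: Optional[str] = None,
             output_dir: str = "runs/unsloth-grpo", evidence_dir: str = "evidence") -> None:
    # Imported here so upload_plan can be used without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    for spec in upload_plan(output_dir, evidence_dir):
        api.upload_folder(repo_id=repo_id, repo_type="model",
                          commit_message="Upload CERNenv GRPO run", **spec)


# push_run("alice/cernenv-grpo-qwen2.5-3b", token="hf_...")  # hypothetical repo and token
```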

Local equivalent

The same training run is reproducible locally with:

# single GPU
PYTHONPATH=. python -m training.training_unsloth \
  --model_name unsloth/Qwen2.5-3B-Instruct \
  --difficulty easy --total_episodes 1500 --max_steps 18 \
  --num_generations 8 --output_dir runs/unsloth-grpo \
  --evidence_dir evidence

# multi-GPU (e.g. 4× A100)
PYTHONPATH=. accelerate launch --num_processes 4 --mixed_precision bf16 \
  -m training.training_unsloth \
  --model_name unsloth/Qwen2.5-3B-Instruct \
  --difficulty easy --total_episodes 1500 --max_steps 18 \
  --num_generations 8 \
  --output_dir runs/unsloth-grpo --evidence_dir evidence
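The two commands differ only in the launcher, so a Python wrapper can choose between them based on NUM_GPUS. The sketch below relies on some assumptions. Auto-detection falls back to `torch.cuda.device_count()` when PyTorch is importable and to 1 otherwise, since the Space's actual detection logic is not documented here. The names `build_command`, `detect_num_gpus`, and `launch` are hypothetical. The launcher flags are only those used in the commands above.

```python
import os
import subprocess
import sys
from typing import Mapping, Sequence

MODULE = "training.training_unsloth"


def detect_num_gpus(env: Mapping[str, str] = os.environ) -> int:
    """NUM_GPUS if set, else the visible CUDA device count (assumed detection method)."""
    if env.get("NUM_GPUS"):
        return int(env["NUM_GPUS"])
    try:
        import torch  # optional: only used for auto-detection
        return max(torch.cuda.device_count(), 1)
    except ImportError:
        return 1


def build_command(num_gpus: int, train_args: Sequence[str]) -> list:
    """argv for a single-process run, or an accelerate multi-process bf16 run."""
    if num_gpus > 1:
        return ["accelerate", "launch", "--num_processes", str(num_gpus),
                "--mixed_precision", "bf16", "-m", MODULE, *train_args]
    return [sys.executable, "-m", MODULE, *train_args]


def launch(train_args: Sequence[str]) -> None:
    # PYTHONPATH=. makes the local `training` package importable, as in the shell commands.
    env = {**os.environ, "PYTHONPATH": "."}
    subprocess.run(build_command(detect_num_gpus(), train_args), env=env, check=True)


# launch(["--total_episodes", "1500", "--num_generations", "8",
#         "--output_dir", "runs/unsloth-grpo", "--evidence_dir", "evidence"])
```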