Commit e0148c3 · anugrah55 · verified · 1 Parent(s): 698bd51

Update trainer Space

Files changed (5)
  1. Dockerfile +31 -31
  2. README.md +120 -91
  3. __init__.py +0 -0
  4. app.py +673 -0
  5. requirements.txt +24 -0
Dockerfile CHANGED
@@ -1,31 +1,31 @@
1
- # CERNenv trainer Space (Docker, A100)
2
- FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
3
-
4
- ENV DEBIAN_FRONTEND=noninteractive \
5
- PYTHONUNBUFFERED=1 \
6
- PIP_NO_CACHE_DIR=1 \
7
- HF_HOME=/home/user/.cache/huggingface \
8
- TRANSFORMERS_CACHE=/home/user/.cache/huggingface/transformers \
9
- PYTHONPATH=/home/user/app
10
-
11
- RUN apt-get update && apt-get install -y --no-install-recommends \
12
- python3.11 python3.11-venv python3.11-dev python3-pip \
13
- git curl ca-certificates build-essential \
14
- && rm -rf /var/lib/apt/lists/* \
15
- && ln -sf /usr/bin/python3.11 /usr/local/bin/python \
16
- && ln -sf /usr/bin/python3.11 /usr/local/bin/python3
17
-
18
- RUN useradd -ms /bin/bash user
19
- USER user
20
- ENV PATH="/home/user/.local/bin:${PATH}"
21
- WORKDIR /home/user/app
22
-
23
- COPY --chown=user:user space/training/requirements.txt /tmp/requirements.txt
24
- RUN python -m pip install --upgrade pip && \
25
- python -m pip install --user -r /tmp/requirements.txt
26
-
27
- COPY --chown=user:user . /home/user/app
28
-
29
- EXPOSE 7860
30
-
31
- CMD ["python", "-m", "uvicorn", "space.training.app:app", "--host", "0.0.0.0", "--port", "7860"]
 
1
+ # CERNenv trainer Space (Docker, A100)
2
+ FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
3
+
4
+ ENV DEBIAN_FRONTEND=noninteractive \
5
+ PYTHONUNBUFFERED=1 \
6
+ PIP_NO_CACHE_DIR=1 \
7
+ HF_HOME=/home/user/.cache/huggingface \
8
+ TRANSFORMERS_CACHE=/home/user/.cache/huggingface/transformers \
9
+ PYTHONPATH=/home/user/app
10
+
11
+ RUN apt-get update && apt-get install -y --no-install-recommends \
12
+ python3.11 python3.11-venv python3.11-dev python3-pip \
13
+ git curl ca-certificates build-essential \
14
+ && rm -rf /var/lib/apt/lists/* \
15
+ && ln -sf /usr/bin/python3.11 /usr/local/bin/python \
16
+ && ln -sf /usr/bin/python3.11 /usr/local/bin/python3
17
+
18
+ RUN useradd -ms /bin/bash user
19
+ USER user
20
+ ENV PATH="/home/user/.local/bin:${PATH}"
21
+ WORKDIR /home/user/app
22
+
23
+ COPY --chown=user:user space/training/requirements.txt /home/user/app/space-training-requirements.txt
24
+ RUN python -m pip install --upgrade pip && \
25
+ python -m pip install --user -r /home/user/app/space-training-requirements.txt
26
+
27
+ COPY --chown=user:user . /home/user/app
28
+
29
+ EXPOSE 7860
30
+
31
+ CMD ["python", "-m", "uvicorn", "space.training.app:app", "--host", "0.0.0.0", "--port", "7860"]
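With the container's `CMD` serving uvicorn on port 7860, a run can be watched from outside by polling the control panel. The sketch below is a hedged illustration and not part of this commit. It assumes the app is reachable at `http://localhost:7860` (for example via a local `docker run -p 7860:7860`), and that `GET /status` returns JSON with a `status` field taking the values `idle`, `running`, `finished`, or `failed`, as the `RunState` class in `app.py` suggests. The helper names `fetch_json` and `wait_for_run` are hypothetical.

```python
import json
import time
from typing import Any, Callable, Dict
from urllib.request import Request, urlopen

# Assumption: the Space (or a local container) is reachable at this address.
BASE_URL = "http://localhost:7860"


def fetch_json(method: str, url: str) -> Dict[str, Any]:
    """Issue one HTTP request and decode the JSON response body."""
    request = Request(url, method=method)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_run(
    fetch: Callable[[str, str], Dict[str, Any]] = fetch_json,
    base_url: str = BASE_URL,
    poll_seconds: float = 30.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll GET /status until the run is no longer 'running'.

    Returns the last observed status, or 'timeout' after max_polls polls.
    """
    for _ in range(max_polls):
        status = fetch("GET", f"{base_url}/status")["status"]
        if status != "running":
            return status
        sleep(poll_seconds)
    return "timeout"
```

Calling `wait_for_run()` after triggering `POST /train` blocks until the pipeline leaves the `running` state. The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a live server.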
README.md CHANGED
@@ -1,91 +1,120 @@
1
- ---
2
- title: CERNenv Trainer
3
- emoji: ⚛️
4
- colorFrom: indigo
5
- colorTo: pink
6
- sdk: docker
7
- suggested_hardware: a100x4
8
- suggested_storage: medium
9
- pinned: false
10
- license: bsd-3-clause
11
- short_description: GRPO trainer for CERNenv (Unsloth + LoRA, A100)
12
- ---
13
-
14
- # CERNenv Trainer (Hugging Face Space, A100)
15
-
16
- Fine-tunes a small instruction-tuned LLM (Large Language Model) to act as
17
- an LHC (Large Hadron Collider) physicist inside the **CERNenv** OpenEnv
18
- environment using **GRPO** (Group-Relative Policy Optimization),
19
- **Unsloth**, and **LoRA** (Low-Rank Adaptation).
20
-
21
- ## Hardware
22
- - Recommended: **4× A100 (`a100x4`, 320 GB VRAM, ~$10/hr)**
23
- - Single GPU also supported: `a100-large` (slower, fewer episodes recommended)
24
- - Minimum: T4 / L4 (use the Colab notebook fallback)
25
-
26
- ## Required Space secrets
27
- | Secret | Purpose |
28
- | --- | --- |
29
- | `HF_TOKEN` | Hugging Face token with `write` access for model push |
30
- | `HF_USERNAME` | Hub username, used as the default model-repo owner |
31
-
32
- ## Optional environment variables
33
- | Variable | Default | Notes |
34
- | --- | --- | --- |
35
- | `MODEL_NAME` | `unsloth/Qwen2.5-3B-Instruct` | Any chat model Unsloth supports |
36
- | `TOTAL_EPISODES` | `1500` | Prompts × generations rollouts |
37
- | `DIFFICULTY` | `easy` | `easy` / `medium` / `hard` |
38
- | `MAX_STEPS` | `18` | Max steps per episode |
39
- | `NUM_GENERATIONS` | `8` | GRPO group size (bigger = better signal) |
40
- | `NUM_GPUS` | auto-detected | `accelerate launch --num_processes` value |
41
- | `CHECKPOINT_EVAL_STEPS` | `25` | Run a held-out eval every N updates |
42
- | `CHECKPOINT_EVAL_EPISODES` | `8` | Episodes per mid-training eval |
43
- | `EVAL_EPISODES` | `32` | Episodes for pre/post eval (statistical power) |
44
- | `OUTPUT_DIR` | `runs/unsloth-grpo` | LoRA adapter output |
45
- | `EVIDENCE_DIR` | `evidence` | Where curves, CSVs, plots are written |
46
- | `PUSH_REPO` | `${HF_USERNAME}/cernenv-grpo-qwen2.5-3b` | Hub repo for adapters + evidence |
47
- | `AUTOSTART` | `0` | Set to `1` to start training on Space boot |
48
-
49
- ## How to use
50
-
51
- This Space exposes a tiny FastAPI control panel:
52
- - `GET /` status + run info + **live training-progress evidence** (curves, before/after metrics, plots)
53
- - `POST /train` start / restart a training run
54
- - `GET /logs?tail=N` live tail of `training.log`
55
- - `GET /metrics` pre / post / Δ metrics JSON
56
- - `GET /evidence` list of evidence artifacts on disk
57
- - `GET /evidence/{name}` download an artifact (`training_curve.png`, `training_log.csv`, etc.)
58
-
59
- ### Training-progress evidence saved (and pushed to Hub)
60
- - `training_log.csv` per-step reward, loss, KL, lr, grad-norm
61
- - `training_curve.png` reward + loss vs step
62
- - `checkpoint_evals.csv` held-out eval every `CHECKPOINT_EVAL_STEPS` updates
63
- - `checkpoint_progression.png` mean reward + success/mass/channel accuracy vs step
64
- - `pre_eval.jsonl` / `post_eval.jsonl` full per-episode rollouts before vs after
65
- - `before_after_summary.png` pre/post bar chart with Δ annotations
66
- - `reward_distribution.png` pre vs post reward histogram
67
- - `before_after_metrics.json` machine-readable metrics + deltas
68
- - `sample_trajectories.md` cherry-picked pre vs post agent traces
69
-
70
- Click **"Start training"** in the UI, or set `AUTOSTART=1` in the Space variables to kick off immediately on boot.
71
-
72
- When training finishes, the LoRA adapters are pushed to `PUSH_REPO`.
73
-
74
- ## Local equivalent
75
-
76
- The same training run is reproducible locally with:
77
-
78
- ```bash
79
- # single GPU
80
- PYTHONPATH=. python -m training.training_unsloth \
81
- --model_name unsloth/Qwen2.5-3B-Instruct \
82
- --difficulty easy --total_episodes 1500 --max_steps 18 \
83
- --num_generations 8 --output_dir runs/unsloth-grpo \
84
- --evidence_dir evidence
85
-
86
- # multi-GPU (e.g. A100)
87
- PYTHONPATH=. accelerate launch --num_processes 4 --mixed_precision bf16 \
88
- -m training.training_unsloth \
89
- --total_episodes 1500 --num_generations 8 \
90
- --output_dir runs/unsloth-grpo --evidence_dir evidence
91
- ```
 
1
+ ---
2
+ title: CERNenv Trainer
3
+ emoji: ⚛️
4
+ colorFrom: indigo
5
+ colorTo: pink
6
+ sdk: docker
7
+ suggested_hardware: a100x4
8
+ suggested_storage: medium
9
+ pinned: false
10
+ license: bsd-3-clause
11
+ short_description: GRPO trainer for CERNenv (Unsloth + LoRA, A100)
12
+ ---
13
+
14
+ # CERNenv Trainer (Hugging Face Space, A100)
15
+
16
+ Fine-tunes a small instruction-tuned LLM (Large Language Model) to act as
17
+ an LHC (Large Hadron Collider) physicist inside the **CERNenv** OpenEnv
18
+ environment using **GRPO** (Group-Relative Policy Optimization),
19
+ **Unsloth**, and **LoRA** (Low-Rank Adaptation).
20
+
21
+ ## Hardware
22
+
23
+ - Recommended: **4× A100 (`a100x4`, 320 GB VRAM, ~$10/hr)**
24
+ - Single GPU also supported: `a100-large` (slower, fewer episodes recommended)
25
+ - Minimum: T4 / L4 (use the Colab notebook fallback)
26
+
27
+ ### Budget guidance (~$27 envelope, the default for this hackathon run)
28
+
29
+ A 1500-episode GRPO run with `MODEL_NAME=unsloth/Qwen2.5-3B-Instruct`,
30
+ `NUM_GENERATIONS=8`, `MAX_STEPS=18` typically lands as follows:
31
+
32
+ | Hardware | $/hr | Wall-clock | Cost (1× run) | Headroom in $27 |
33
+ | ------------ | ----- | ---------- | ------------- | --------------- |
34
+ | `a100x4` | ~$10 | ~1.5–2 h | ~$15–20 | no full re-run |
35
+ | `a100-large` | ~$4 | ~2.5–3 h | ~$10–12 | 1 re-run |
36
+ | `l40sx4` | ~$8 | ~2 h | ~$16 | no full re-run |
37
+
38
+ `a100x4` gets the trained adapters + evidence into your hands fastest; the
39
+ multi-GPU launcher (`accelerate launch --num_processes 4`) is already wired
40
+ in `_build_training_cmd`. If you want extra safety margin in case anything
41
+ needs a re-run, drop to `a100-large`: wall-clock is ~1.5× longer but cost
42
+ is ~40% lower, leaving you with budget for two complete attempts.
43
+
44
+ ## Required Space secrets
45
+ | Secret | Purpose |
46
+ | --- | --- |
47
+ | `HF_TOKEN` | Hugging Face token with `write` access for model push |
48
+ | `HF_USERNAME` | Hub username, used as the default model-repo owner |
49
+
50
+ ## Optional environment variables
51
+ | Variable | Default | Notes |
52
+ | --- | --- | --- |
53
+ | `MODEL_NAME` | `unsloth/Qwen2.5-3B-Instruct` | Any chat model Unsloth supports |
54
+ | `TOTAL_EPISODES` | `1500` | Prompts × generations rollouts |
55
+ | `DIFFICULTY` | `easy` | Starting tier when `CURRICULUM=1`; static tier when `CURRICULUM=0` |
56
+ | `CURRICULUM` | `1` | `1` enables easy→medium→hard prompt-ramp + adaptive eval-tier |
57
+ | `CURRICULUM_PROMOTE` | `0.55` | Held-out success rate that promotes the eval tier one step |
58
+ | `CURRICULUM_DEMOTE` | `0.10` | Rolling success rate that demotes the eval tier one step |
59
+ | `MAX_STEPS` | `18` | Max steps per episode |
60
+ | `NUM_GENERATIONS` | `8` | GRPO group size (bigger = better signal) |
61
+ | `NUM_GPUS` | auto-detected | `accelerate launch --num_processes` value |
62
+ | `CHECKPOINT_EVAL_STEPS` | `25` | Run a held-out eval every N updates |
63
+ | `CHECKPOINT_EVAL_EPISODES` | `8` | Episodes per mid-training eval |
64
+ | `EVAL_EPISODES` | `32` | Episodes for pre/post eval (statistical power) |
65
+ | `OUTPUT_DIR` | `runs/unsloth-grpo` | LoRA adapter output |
66
+ | `EVIDENCE_DIR` | `evidence` | Where curves, CSVs, plots are written |
67
+ | `PUSH_REPO` | `${HF_USERNAME}/cernenv-grpo-qwen2.5-3b` | Hub repo for adapters + evidence |
68
+ | `AUTOSTART` | `0` | Set to `1` to start training on Space boot |
69
+
70
+ ## How to use
71
+
72
+ This Space exposes a tiny FastAPI control panel:
73
+ - `GET /` — status + run info + **live training-progress evidence** (curves, before/after metrics, plots)
74
+ - `POST /train` — start / restart a training run
75
+ - `GET /logs?tail=N` — live tail of `training.log`
76
+ - `GET /metrics` — pre / post / Δ metrics JSON
77
+ - `GET /evidence` — list of evidence artifacts on disk
78
+ - `GET /evidence/{name}` — download an artifact (`training_curve.png`, `training_log.csv`, etc.)
79
+
80
+ ### Training-progress evidence saved (and pushed to Hub)
81
+ - `training_log.csv` — per-step reward, loss, KL, lr, grad-norm
82
+ - `training_curve.png` — reward + loss vs step
83
+ - `reward_components.csv` — per-rollout terminal vs shaping reward, plus
84
+ discovery / mass / channel / parsed-action rates per logging step.
85
+ This is the "watch individual reward function columns" view recommended
86
+ in the hackathon FAQ — it makes verifier hacks visible (rising mean
87
+ reward without rising mass/channel correctness is a red flag).
88
+ - `reward_components.png` — 2-panel plot rendered from the above CSV
89
+ - `checkpoint_evals.csv` — held-out eval every `CHECKPOINT_EVAL_STEPS` updates
90
+ - `checkpoint_progression.png` — mean reward + success/mass/channel accuracy vs step
91
+ - `pre_eval.jsonl` / `post_eval.jsonl` — full per-episode rollouts before vs after
92
+ - `before_after_summary.png` — pre/post bar chart with Δ annotations
93
+ - `reward_distribution.png` — pre vs post reward histogram
94
+ - `before_after_metrics.json` — machine-readable metrics + deltas
95
+ - `sample_trajectories.md` — cherry-picked pre vs post agent traces
96
+ - `curriculum_state.json` — adaptive-curriculum tier/promotion log
97
+
98
+ Click **"Start training"** in the UI, or set `AUTOSTART=1` in the Space variables to kick off immediately on boot.
99
+
100
+ When training finishes, the LoRA adapters are pushed to `PUSH_REPO`.
101
+
102
+ ## Local equivalent
103
+
104
+ The same training run is reproducible locally with:
105
+
106
+ ```bash
107
+ # single GPU (with curriculum)
108
+ PYTHONPATH=. python -m training.training_unsloth \
109
+ --model_name unsloth/Qwen2.5-3B-Instruct \
110
+ --difficulty easy --curriculum --total_episodes 1500 --max_steps 18 \
111
+ --num_generations 8 --output_dir runs/unsloth-grpo \
112
+ --evidence_dir evidence
113
+
114
+ # multi-GPU (e.g. 4× A100, with curriculum)
115
+ PYTHONPATH=. accelerate launch --num_processes 4 --mixed_precision bf16 \
116
+ -m training.training_unsloth \
117
+ --difficulty easy --curriculum \
118
+ --total_episodes 1500 --num_generations 8 \
119
+ --output_dir runs/unsloth-grpo --evidence_dir evidence
120
+ ```
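The `CURRICULUM_PROMOTE` and `CURRICULUM_DEMOTE` thresholds in the README table can be read as a simple tier-stepping rule. The following is a minimal sketch under stated assumptions, not the logic in `training.training_unsloth` (which this diff does not show). It assumes three ordered tiers, inclusive comparisons, and a single success-rate input, whereas the README distinguishes a held-out rate for promotion from a rolling rate for demotion. `TIERS` and `next_tier` are hypothetical names.

```python
from typing import Tuple

# Assumption: tiers are ordered from easiest to hardest, as in the README.
TIERS: Tuple[str, ...] = ("easy", "medium", "hard")


def next_tier(
    current: str,
    success_rate: float,
    promote: float = 0.55,  # README default for CURRICULUM_PROMOTE
    demote: float = 0.10,   # README default for CURRICULUM_DEMOTE
) -> str:
    """Return the tier to use after observing `success_rate` on `current`.

    Moves at most one step per call and clamps at the ends of TIERS.
    """
    index = TIERS.index(current)
    if success_rate >= promote:
        index = min(index + 1, len(TIERS) - 1)
    elif success_rate <= demote:
        index = max(index - 1, 0)
    return TIERS[index]
```

With the default thresholds, a success rate between 0.10 and 0.55 leaves the tier unchanged, which gives the curriculum a wide hold band and avoids flapping between tiers on noisy evals.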
__init__.py ADDED
File without changes
app.py ADDED
@@ -0,0 +1,673 @@
1
+ """FastAPI control panel for the CERNenv trainer Space.
2
+
3
+ Endpoints:
4
+ GET / → status page (HTML)
5
+ GET /status → JSON status of the current training run
6
+ GET /metrics → JSON snapshot of reward / success rate
7
+ GET /logs → tail of the training log
8
+ POST /train → start (or restart) a training run
9
+ GET /health → liveness probe
10
+
11
+ Designed to run on a Hugging Face Space with `sdk: docker`. Heavy training
12
+ work runs in a background thread so the HTTP server stays responsive.
13
+ """
14
+
15
+ from __future__ import annotations
16
+
17
+ import json
18
+ import logging
19
+ import os
20
+ import subprocess
21
+ import sys
22
+ import threading
23
+ import time
24
+ from datetime import datetime, timezone
25
+ from pathlib import Path
26
+ from typing import Any, Dict, Optional
27
+
28
+ from fastapi import FastAPI, HTTPException
29
+ from fastapi.responses import FileResponse, HTMLResponse, JSONResponse, PlainTextResponse
30
+ from fastapi.staticfiles import StaticFiles
31
+
32
+
33
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
34
+ logger = logging.getLogger(__name__)
35
+
36
+
37
+ def _resolve_repo_root() -> Path:
38
+ env_root = os.environ.get("CERNENV_ROOT")
39
+ candidates = []
40
+ if env_root:
41
+ candidates.append(Path(env_root))
42
+ candidates.extend([
43
+ Path("/home/user/app"),
44
+ Path(__file__).resolve().parent.parent.parent,
45
+ ])
46
+ for p in candidates:
47
+ try:
48
+ if p.exists():
49
+ return p.resolve()
50
+ except OSError:
51
+ continue
52
+ return candidates[-1].resolve()
53
+
54
+
55
+ REPO_ROOT = _resolve_repo_root()
56
+ LOG_DIR = REPO_ROOT / "training" / "runs"
57
+ try:
58
+ LOG_DIR.mkdir(parents=True, exist_ok=True)
59
+ except OSError as exc: # pragma: no cover - read-only filesystem fallback
60
+ logger.warning("could not create %s (%s); using /tmp", LOG_DIR, exc)
61
+ LOG_DIR = Path("/tmp/cernenv-runs")
62
+ LOG_DIR.mkdir(parents=True, exist_ok=True)
63
+ LOG_FILE = LOG_DIR / "training.log"
64
+ EVIDENCE_DIR = REPO_ROOT / "evidence"
65
+ try:
66
+ EVIDENCE_DIR.mkdir(parents=True, exist_ok=True)
67
+ except OSError: # pragma: no cover
68
+ EVIDENCE_DIR = Path("/tmp/cernenv-evidence")
69
+ EVIDENCE_DIR.mkdir(parents=True, exist_ok=True)
70
+ METRICS_FILE = EVIDENCE_DIR / "before_after_metrics.json"
71
+
72
+
73
+ def _env(name: str, default: str) -> str:
74
+ return os.environ.get(name, default)
75
+
76
+
77
+ def _detect_gpus() -> int:
78
+ try:
79
+ import torch # type: ignore
80
+ if torch.cuda.is_available():
81
+ return torch.cuda.device_count()
82
+ except Exception:
83
+ pass
84
+ try:
85
+ out = subprocess.run(
86
+ ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
87
+ capture_output=True, text=True, timeout=5,
88
+ )
89
+ return len([l for l in out.stdout.splitlines() if l.strip()])
90
+ except Exception:
91
+ return 0
92
+
93
+
94
+ _NUM_GPUS = _detect_gpus()
95
+
96
+
97
+ CONFIG = {
98
+ "model_name": _env("MODEL_NAME", "unsloth/Qwen2.5-3B-Instruct"),
99
+ "difficulty": _env("DIFFICULTY", "easy"),
100
+ "curriculum": _env("CURRICULUM", "1") == "1",
101
+ "curriculum_promote": float(_env("CURRICULUM_PROMOTE", "0.55")),
102
+ "curriculum_demote": float(_env("CURRICULUM_DEMOTE", "0.10")),
103
+ "total_episodes": int(_env("TOTAL_EPISODES", "1500")),
104
+ "max_steps": int(_env("MAX_STEPS", "18")),
105
+ "num_generations": int(_env("NUM_GENERATIONS", "8")),
106
+ "checkpoint_eval_steps": int(_env("CHECKPOINT_EVAL_STEPS", "25")),
107
+ "checkpoint_eval_episodes": int(_env("CHECKPOINT_EVAL_EPISODES", "8")),
108
+ "eval_episodes": int(_env("EVAL_EPISODES", "32")),
109
+ "output_dir": _env("OUTPUT_DIR", "runs/unsloth-grpo"),
110
+ "evidence_dir": _env("EVIDENCE_DIR", "evidence"),
111
+ "num_gpus": int(_env("NUM_GPUS", str(_NUM_GPUS or 1))),
112
+ "hf_username": _env("HF_USERNAME", "anugrah55"),
113
+ "push_repo": _env(
114
+ "PUSH_REPO",
115
+ f"{_env('HF_USERNAME', 'anugrah55')}/cernenv-grpo-qwen2.5-3b",
116
+ ),
117
+ "autostart": _env("AUTOSTART", "0") == "1",
118
+ }
119
+
120
+
121
+ # ── Run state ────────────────────────────────────────────────────────────
122
+
123
+
124
+ class RunState:
125
+ def __init__(self) -> None:
126
+ self.lock = threading.Lock()
127
+ self.thread: Optional[threading.Thread] = None
128
+ self.process: Optional[subprocess.Popen] = None
129
+ self.status: str = "idle" # idle | running | finished | failed
130
+ self.started_at: Optional[str] = None
131
+ self.finished_at: Optional[str] = None
132
+ self.last_error: Optional[str] = None
133
+ self.last_config: Dict[str, Any] = {}
134
+
135
+ def to_dict(self) -> Dict[str, Any]:
136
+ with self.lock:
137
+ return {
138
+ "status": self.status,
139
+ "started_at": self.started_at,
140
+ "finished_at": self.finished_at,
141
+ "last_error": self.last_error,
142
+ "last_config": self.last_config,
143
+ }
144
+
145
+
146
+ STATE = RunState()
147
+
148
+
149
+ # ── Training pipeline ────────────────────────────────────────────────────
150
+
151
+
152
+ def _stream_subprocess(cmd: list[str], log_handle) -> int:
153
+ log_handle.write(f"\n$ {' '.join(cmd)}\n")
154
+ log_handle.flush()
155
+ proc = subprocess.Popen(
156
+ cmd,
157
+ cwd=str(REPO_ROOT),
158
+ stdout=subprocess.PIPE,
159
+ stderr=subprocess.STDOUT,
160
+ bufsize=1,
161
+ text=True,
162
+ env={**os.environ, "PYTHONPATH": str(REPO_ROOT)},
163
+ )
164
+ STATE.process = proc
165
+ assert proc.stdout is not None
166
+ for line in proc.stdout:
167
+ log_handle.write(line)
168
+ log_handle.flush()
169
+ rc = proc.wait()
170
+ log_handle.write(f"[exit code {rc}]\n")
171
+ log_handle.flush()
172
+ STATE.process = None
173
+ return rc
174
+
175
+
176
+ def _build_training_cmd(config: Dict[str, Any]) -> list[str]:
177
+ """Compose the training launcher (single-GPU python or multi-GPU accelerate)."""
178
+ base = [
179
+ "-m", "training.training_unsloth",
180
+ "--model_name", config["model_name"],
181
+ "--difficulty", config["difficulty"],
182
+ "--total_episodes", str(config["total_episodes"]),
183
+ "--max_steps", str(config["max_steps"]),
184
+ "--num_generations", str(config["num_generations"]),
185
+ "--checkpoint_eval_steps", str(config["checkpoint_eval_steps"]),
186
+ "--checkpoint_eval_episodes", str(config["checkpoint_eval_episodes"]),
187
+ "--output_dir", config["output_dir"],
188
+ "--evidence_dir", config["evidence_dir"],
189
+ ]
190
+ if config.get("curriculum"):
191
+ base.extend([
192
+ "--curriculum",
193
+ "--curriculum_promote", str(config["curriculum_promote"]),
194
+ "--curriculum_demote", str(config["curriculum_demote"]),
195
+ ])
196
+ n = max(int(config.get("num_gpus", 1)), 1)
197
+ if n > 1:
198
+ return ["accelerate", "launch", "--num_processes", str(n), "--mixed_precision", "bf16"] + base
199
+ return [sys.executable] + base
200
+
201
+
202
+ def _push_evidence_to_hub(*, evidence_dir: Path, repo_id: str, log) -> None:
203
+ """Upload the entire evidence/ directory to the model repo."""
204
+ token = os.environ.get("HF_TOKEN")
205
+ if not token:
206
+ log.write("\n[skip] HF_TOKEN not set — evidence not pushed\n")
207
+ log.flush()
208
+ return
209
+ try:
210
+ from huggingface_hub import HfApi
211
+ api = HfApi(token=token)
212
+ api.upload_folder(
213
+ folder_path=str(evidence_dir),
214
+ repo_id=repo_id,
215
+ repo_type="model",
216
+ path_in_repo="evidence",
217
+ commit_message="Upload CERNenv training evidence (curves, evals, plots)",
218
+ )
219
+ log.write(f"\n[ok] uploaded evidence/ → https://huggingface.co/{repo_id}/tree/main/evidence\n")
220
+ log.flush()
221
+ except Exception as exc:
222
+ log.write(f"\n[warn] evidence push failed: {exc}\n")
223
+ log.flush()
224
+
225
+
226
+ def _training_pipeline(config: Dict[str, Any]) -> None:
227
+ started = datetime.now(timezone.utc).isoformat()
228
+ with STATE.lock:
229
+ STATE.status = "running"
230
+ STATE.started_at = started
231
+ STATE.finished_at = None
232
+ STATE.last_error = None
233
+ STATE.last_config = dict(config)
234
+
235
+ evidence_dir = Path(config["evidence_dir"]).resolve()
236
+ evidence_dir.mkdir(parents=True, exist_ok=True)
237
+
238
+ LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
239
+ with open(LOG_FILE, "a") as log:
240
+ log.write(f"\n=== Training started {started} ===\n")
241
+ log.write(json.dumps(config, indent=2) + "\n")
242
+ log.flush()
243
+ try:
244
+ output_dir = config["output_dir"]
245
+ difficulty = config["difficulty"]
246
+ max_steps = str(config["max_steps"])
247
+ eval_episodes = str(config["eval_episodes"])
248
+ model_name = config["model_name"]
249
+ push_repo = config["push_repo"]
250
+ evidence_str = config["evidence_dir"]
251
+ pre_jsonl = f"{evidence_str}/pre_eval.jsonl"
252
+ post_jsonl = f"{evidence_str}/post_eval.jsonl"
253
+
254
+ log.write("\n--- baseline sanity check (random / heuristic / oracle) ---\n")
255
+ log.flush()
256
+ for agent in ("random", "heuristic", "oracle"):
257
+ _stream_subprocess(
258
+ [
259
+ sys.executable, "-m", "scripts.run_agent",
260
+ "--agent", agent, "--difficulty", difficulty,
261
+ "--episodes", "3", "--quiet",
262
+ ],
263
+ log,
264
+ )
265
+
266
+ log.write(f"\n--- pre-train evaluation ({eval_episodes} eps) ---\n")
267
+ log.flush()
268
+ rc = _stream_subprocess(
269
+ [
270
+ sys.executable, "-m", "training.evaluate",
271
+ "--model_name", model_name,
272
+ "--difficulty", difficulty,
273
+ "--episodes", eval_episodes,
274
+ "--max_steps", max_steps,
275
+ "--tag", "pre_train",
276
+ "--out", pre_jsonl,
277
+ ],
278
+ log,
279
+ )
280
+ if rc != 0:
281
+ # don't abort — we still want training + post-eval evidence.
282
+ log.write(f"\n[warn] pre-train eval failed (rc={rc}); continuing without baseline\n")
283
+ log.flush()
284
+
285
+ log.write(f"\n--- GRPO training ({config['num_gpus']} GPU process(es)) ---\n")
286
+ log.flush()
287
+ rc = _stream_subprocess(_build_training_cmd(config), log)
288
+ if rc != 0:
289
+ raise RuntimeError(f"training failed (rc={rc})")
290
+
291
+ # ── LoRA save-and-reload smoke test ─────────────────────
292
+ # Hackathon FAQ Q9: "Do not upcast a 4-bit model to 16-bit
293
+ # and then merge the LoRA weights naively" — the canonical
294
+ # cause of a broken push. Before we burn time on the full
295
+ # post-train evaluation (32 eps), do a 2-episode cold-load
296
+ # rollout against the saved adapters. If that fails, abort
297
+ # immediately so we surface a save problem, not a 30-min
298
+ # eval timeout.
299
+ log.write(
300
+ f"\n--- adapter save/reload smoke test "
301
+ f"(loading {output_dir} cold-start, 2 eps) ---\n"
302
+ )
303
+ log.flush()
304
+ rc = _stream_subprocess(
305
+ [
306
+ sys.executable, "-m", "training.evaluate",
307
+ "--model_name", model_name,
308
+ "--adapter_dir", output_dir,
309
+ "--difficulty", difficulty,
310
+ "--episodes", "2",
311
+ "--max_steps", max_steps,
312
+ "--tag", "smoke",
313
+ "--out", f"{evidence_str}/smoke_eval.jsonl",
314
+ ],
315
+ log,
316
+ )
317
+ if rc != 0:
318
+ raise RuntimeError(
319
+ f"adapter smoke test failed (rc={rc}); refusing to push "
320
+ f"unloadable adapters to the Hub. Inspect {output_dir} and "
321
+ "verify adapter_config.json + adapter_model.safetensors exist."
322
+ )
323
+
324
+ log.write(f"\n--- post-train evaluation ({eval_episodes} eps) ---\n")
325
+ log.flush()
326
+ rc = _stream_subprocess(
327
+ [
328
+ sys.executable, "-m", "training.evaluate",
329
+ "--model_name", model_name,
330
+ "--adapter_dir", output_dir,
331
+ "--difficulty", difficulty,
332
+ "--episodes", eval_episodes,
333
+ "--max_steps", max_steps,
334
+ "--tag", "post_train",
335
+ "--out", post_jsonl,
336
+ ],
337
+ log,
338
+ )
339
+ if rc != 0:
340
+ log.write(f"\n[warn] post-train eval failed (rc={rc}); evidence will be partial\n")
341
+ log.flush()
342
+
343
+ log.write("\n--- evidence: before/after summary, distribution, trajectories ---\n")
344
+ log.flush()
345
+ try:
346
+ from training.evidence import (
347
+ EvidencePaths,
348
+ render_before_after,
349
+ render_sample_trajectories,
350
+ render_training_curve,
351
+ render_reward_components,
352
+ render_checkpoint_progression,
353
+ )
354
+ paths = EvidencePaths(root=Path(evidence_str))
355
+ paths.ensure()
356
+ metrics = render_before_after(
357
+ pre_jsonl=Path(pre_jsonl),
358
+ post_jsonl=Path(post_jsonl),
359
+ summary_png=paths.before_after_summary_png,
360
+ distribution_png=paths.reward_distribution_png,
361
+ metrics_json=paths.before_after_metrics_json,
362
+ )
363
+ render_sample_trajectories(
364
+ pre_jsonl=Path(pre_jsonl),
365
+ post_jsonl=Path(post_jsonl),
366
+ md_path=paths.sample_trajectories_md,
367
+ )
368
+ render_training_curve(paths.training_log_csv, paths.training_curve_png)
369
+ render_reward_components(
370
+ paths.reward_components_csv, paths.reward_components_png,
371
+ )
372
+ render_checkpoint_progression(
373
+ paths.checkpoint_evals_csv, paths.checkpoint_progression_png,
374
+ )
375
+ log.write(json.dumps(metrics, indent=2) + "\n")
376
+ log.flush()
377
+ except Exception as exc:
378
+ log.write(f"[warn] evidence rendering failed: {exc}\n")
379
+ log.flush()
380
+
381
+ if os.environ.get("HF_TOKEN"):
382
+ log.write("\n--- push adapters to Hub ---\n")
383
+ log.flush()
384
+ _stream_subprocess(
385
+ [
386
+ sys.executable, "-m", "scripts.push_to_hub", "model",
387
+ "--adapter_dir", output_dir,
388
+ "--repo_id", push_repo,
389
+ "--base_model", model_name,
390
+ ],
391
+ log,
392
+ )
393
+ _push_evidence_to_hub(
394
+ evidence_dir=evidence_dir,
395
+ repo_id=push_repo,
396
+ log=log,
397
+ )
398
+ else:
399
+ log.write("\n[skip] HF_TOKEN not set — not pushing to Hub\n")
400
+ log.flush()
401
+
402
+ with STATE.lock:
403
+ STATE.status = "finished"
404
+ except Exception as exc:
405
+ logger.exception("training pipeline failed")
406
+ with STATE.lock:
407
+ STATE.status = "failed"
408
+ STATE.last_error = str(exc)
409
+ finally:
410
+ finished = datetime.now(timezone.utc).isoformat()
411
+ log.write(f"\n=== Training ended {finished} ===\n")
412
+ log.flush()
413
+ with STATE.lock:
414
+ STATE.finished_at = finished
415
+
416
+
417
+ def _start_training(config: Dict[str, Any]) -> None:
418
+ with STATE.lock:
419
+ if STATE.status == "running":
420
+ raise RuntimeError("a training run is already in progress")
421
+ STATE.thread = threading.Thread(
422
+ target=_training_pipeline,
423
+ args=(config,),
424
+ name="cernenv-trainer",
425
+ daemon=True,
426
+ )
427
+ STATE.thread.start()
428
+
429
+
430
+ # ── FastAPI app ──────────────────────────────────────────────────────────
431
+
432
+
433
+ app = FastAPI(title="CERNenv Trainer", version="0.1.0")
434
+
435
+
436
+ _HTML = """\
437
+ <!doctype html>
438
+ <html lang=en>
439
+ <head>
440
+ <meta charset=utf-8>
441
+ <title>CERNenv Trainer</title>
442
+ <style>
443
+ body { font-family: ui-sans-serif, system-ui, sans-serif; margin: 2rem auto;
444
+ max-width: 1000px; color:#111; padding: 0 1rem; line-height:1.5 }
445
+ h1 { margin-bottom: 0 }
446
+ h2 { margin-top: 2rem; border-bottom:1px solid #eee; padding-bottom:.25rem }
447
+ .muted { color:#666 }
448
+ pre { background:#0e1116; color:#e6edf3; padding:1rem; border-radius:6px;
449
+ overflow-x:auto; max-height:40vh; font-size:.85em }
450
+ button { font-size:1rem; padding:.6rem 1rem; border-radius:6px; border:1px solid #888;
451
+ background:#fff; cursor:pointer; margin-right:.4rem }
452
+ .pill { display:inline-block; padding:.1rem .55rem; border-radius:999px;
453
+ background:#eef; color:#225; font-size:.85em }
454
+ .ok { background:#dfd; color:#272 }
455
+ .fail { background:#fdd; color:#822 }
456
+ .run { background:#fdf6d8; color:#774 }
457
+ table { border-collapse:collapse; margin:.5rem 0 }
458
+ td, th { padding:.25rem .8rem .25rem 0; vertical-align: top; text-align:left }
459
+ th { color:#444; font-weight:600 }
460
+ .grid { display:grid; grid-template-columns:1fr 1fr; gap:1rem }
461
+ .card { border:1px solid #e5e7eb; border-radius:8px; padding:.75rem; background:#fafafa }
462
+ .card img { max-width:100%; border-radius:4px }
463
+ .delta-pos { color:#15803d; font-weight:600 }
464
+ .delta-neg { color:#b91c1c; font-weight:600 }
465
+ code { background:#f4f4f4; padding:.05rem .35rem; border-radius:4px }
466
+ a { color:#1d4ed8 }
467
+ </style>
468
+ </head>
469
+ <body>
470
+ <h1>⚛️ CERNenv Trainer</h1>
471
+ <p class=muted>GRPO + Unsloth + LoRA on the CERNenv LHC discovery environment. Multi-GPU on Hugging Face Spaces.</p>
472
+
473
+ <h2>Run status</h2>
474
+ <p>Status: <span id=status class=pill>?</span></p>
475
+ <table id=meta></table>
476
+ <p>
477
+ <button onclick="startRun()">▶ Start training</button>
478
+ <button onclick="refresh()">↻ Refresh</button>
479
+ <a href="/evidence" target=_blank><button>📁 Evidence index</button></a>
480
+ <a href="/docs" target=_blank><button>🛠 API</button></a>
481
+ </p>
482
+
483
+ <h2>Training-progress evidence</h2>
484
+ <p class=muted>Auto-updated as training runs. All artifacts are also saved to <code>evidence/</code> and pushed to the model repo on the Hub.</p>
485
+ <div class=grid>
486
+ <div class=card><b>Per-step training curve</b><br>
487
+ <img id=curve src="/evidence/training_curve.png" onerror="this.style.display='none'">
488
+ <div id=curve_missing class=muted style="display:none">(not yet — waiting for first GRPO step)</div>
489
+ </div>
490
+ <div class=card><b>Reward components (terminal vs shaping)</b><br>
491
+ <img id=components src="/evidence/reward_components.png" onerror="this.style.display='none'">
492
+ <div id=components_missing class=muted style="display:none">(populated after a few rollouts; useful for spotting verifier hacks)</div>
493
+ </div>
494
+ <div class=card><b>Mid-training checkpoint progression</b><br>
495
+ <img id=ckpt src="/evidence/checkpoint_progression.png" onerror="this.style.display='none'">
496
+ <div id=ckpt_missing class=muted style="display:none">(not yet — waiting for first checkpoint eval)</div>
497
+ </div>
498
+ <div class=card><b>Before vs after summary</b><br>
499
+ <img id=summary src="/evidence/before_after_summary.png" onerror="this.style.display='none'">
500
+ <div id=summary_missing class=muted style="display:none">(generated after post-train eval)</div>
501
+ </div>
502
+ <div class=card><b>Reward distribution: pre vs post</b><br>
503
+ <img id=dist src="/evidence/reward_distribution.png" onerror="this.style.display='none'">
504
+ <div id=dist_missing class=muted style="display:none">(generated after post-train eval)</div>
505
+ </div>
506
+ </div>
507
+
508
+ <h2>Before / after metrics</h2>
509
+ <table id=metrics_table>
510
+ <tr><th>metric</th><th>pre</th><th>post</th><th>Δ</th></tr>
511
+ </table>
512
+
513
+ <h2>Live logs (tail)</h2>
514
+ <pre id=logs>loading…</pre>
515
+
516
+ <script>
517
+ function fmt(v) {
518
+ if (v == null) return '–';
519
+ if (typeof v === 'number') return v.toFixed(3);
520
+ return v;
521
+ }
522
+ function fmtDelta(d) {
523
+ if (d == null || isNaN(d)) return '–';
524
+ const sign = d >= 0 ? '+' : '';
525
+ const cls = d >= 0 ? 'delta-pos' : 'delta-neg';
526
+ return `<span class="${cls}">${sign}${d.toFixed(3)}</span>`;
527
+ }
528
+
529
+ async function refresh() {
530
+ // status
531
+ const s = await fetch('/status').then(r => r.json());
532
+ const pill = document.getElementById('status');
533
+ pill.textContent = s.status;
534
+ pill.className = 'pill ' + ({idle:'',running:'run',finished:'ok',failed:'fail'}[s.status] || '');
535
+
536
+ const meta = document.getElementById('meta');
537
+ meta.innerHTML = '';
538
+ const obj = {
539
+ started_at: s.started_at, finished_at: s.finished_at, error: s.last_error,
540
+ ...(s.last_config || {}),
541
+ };
542
+ for (const [k, v] of Object.entries(obj)) {
543
+ if (v == null || v === '') continue;
544
+ const tr = document.createElement('tr');
545
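+ tr.innerHTML = '<td><b></b></td><td><code></code></td>'; tr.querySelector('b').textContent = k; tr.querySelector('code').textContent = String(v); // textContent keeps values such as "<module>" in a traceback from being parsed as HTML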
546
+ meta.appendChild(tr);
547
+ }
548
+
549
+ // metrics
550
+ const m = await fetch('/metrics').then(r => r.json()).catch(() => ({pre:null, post:null}));
551
+ const tbody = document.getElementById('metrics_table');
552
+ tbody.innerHTML = '<tr><th>metric</th><th>pre</th><th>post</th><th>Δ</th></tr>';
553
+ const fields = ['mean_reward', 'success_rate', 'mass_acc', 'channel_acc', 'median_reward'];
554
+ for (const f of fields) {
555
+ const pre = m.pre && m.pre[f];
556
+ const post = m.post && m.post[f];
557
+ const delta = m.delta && m.delta[f];
558
+ const tr = document.createElement('tr');
559
+ tr.innerHTML = `<td><code>${f}</code></td><td>${fmt(pre)}</td><td>${fmt(post)}</td><td>${fmtDelta(delta)}</td>`;
560
+ tbody.appendChild(tr);
561
+ }
562
+
563
+ // bust caches on plots
564
+ const bust = '?t=' + Date.now();
565
+ for (const [imgId, missingId] of [
566
+ ['curve', 'curve_missing'],
567
+ ['components', 'components_missing'],
568
+ ['ckpt', 'ckpt_missing'],
569
+ ['summary', 'summary_missing'],
570
+ ['dist', 'dist_missing'],
571
+ ]) {
572
+ const img = document.getElementById(imgId);
573
+ const miss = document.getElementById(missingId);
574
+ const baseSrc = img.getAttribute('src').split('?')[0];
575
+ const probe = new Image();
576
+ probe.onload = () => { img.src = baseSrc + bust; img.style.display=''; miss.style.display='none'; };
577
+ probe.onerror = () => { img.style.display='none'; miss.style.display=''; };
578
+ probe.src = baseSrc + bust;
579
+ }
580
+
581
+ const logs = await fetch('/logs?tail=200').then(r => r.text());
582
+ document.getElementById('logs').textContent = logs || '(no logs yet)';
583
+ }
584
+ async function startRun() {
585
+ const r = await fetch('/train', {method:'POST'});
586
+ if (!r.ok) alert((await r.json()).detail || 'failed');
587
+ setTimeout(refresh, 500);
588
+ }
589
+ refresh();
590
+ setInterval(refresh, 5000);
591
+ </script>
592
+ </body>
593
+ </html>
594
+ """
595
+
596
+
597
+ @app.get("/", response_class=HTMLResponse)
598
+ def index() -> HTMLResponse:
599
+     return HTMLResponse(_HTML)
600
+
601
+
602
+ @app.get("/health")
603
+ def health() -> Dict[str, str]:
604
+     return {"status": "ok"}
605
+
606
+
607
+ @app.get("/status")
608
+ def status() -> JSONResponse:
609
+     return JSONResponse(STATE.to_dict())
610
+
611
+
612
+ @app.get("/metrics")
613
+ def metrics() -> JSONResponse:
614
+     if METRICS_FILE.exists():
615
+         try:
616
+             return JSONResponse(json.loads(METRICS_FILE.read_text()))
617
+         except Exception:
618
+             return JSONResponse({"error": "metrics file unreadable"}, status_code=500)
619
+     return JSONResponse({"pre": None, "post": None, "delta": None})
620
+
621
+
622
+ @app.get("/evidence")
623
+ def evidence_index() -> JSONResponse:
624
+     """List every evidence artifact currently on disk."""
625
+     files = []
626
+     if EVIDENCE_DIR.exists():
627
+         for p in sorted(EVIDENCE_DIR.iterdir()):
628
+             if p.is_file():
629
+                 files.append({
630
+                     "name": p.name,
631
+                     "size": p.stat().st_size,
632
+                     "url": f"/evidence/{p.name}",
633
+                 })
634
+     return JSONResponse({"dir": str(EVIDENCE_DIR), "files": files})
635
+
636
+
637
+ @app.get("/evidence/{name}")
638
+ def evidence_file(name: str):
639
+     """Serve a single evidence artifact (PNG/CSV/JSON/MD) by filename."""
640
+     if "/" in name or ".." in name:
641
+         raise HTTPException(status_code=400, detail="invalid name")
642
+     target = EVIDENCE_DIR / name
643
+     if not target.exists() or not target.is_file():
644
+         raise HTTPException(status_code=404, detail=f"{name} not found")
645
+     return FileResponse(target)
646
+
647
+
648
+ @app.get("/logs", response_class=PlainTextResponse)
649
+ def logs(tail: int = 400) -> PlainTextResponse:
650
+     if not LOG_FILE.exists():
651
+         return PlainTextResponse("")
652
+     text = LOG_FILE.read_text()
653
+     lines = text.splitlines()
654
+     return PlainTextResponse("\n".join(lines[-max(tail, 1):]))
655
+
656
+
657
+ @app.post("/train")
658
+ def train() -> JSONResponse:
659
+     try:
660
+         _start_training(dict(CONFIG))
661
+     except RuntimeError as exc:
662
+         raise HTTPException(status_code=409, detail=str(exc))
663
+     return JSONResponse({"status": "started", "config": CONFIG})
664
+
665
+
666
+ @app.on_event("startup")
667
+ def _maybe_autostart() -> None:
668
+     if CONFIG["autostart"]:
669
+         try:
670
+             _start_training(dict(CONFIG))
671
+             logger.info("autostarted training run")
672
+         except RuntimeError as exc:
673
+             logger.warning("autostart skipped: %s", exc)
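
Note on the `/logs` handler above: it reads the whole log file into memory and then slices off the last `tail` lines. For long training runs, a bounded `collections.deque` returns the same tail while holding at most `n` lines at a time. The following is a minimal sketch under that assumption; `tail_lines` is a hypothetical helper and is not part of `app.py`.

```python
from collections import deque
from pathlib import Path


def tail_lines(path: Path, n: int = 400) -> str:
    """Return the last n lines of a text file, or "" if the file is missing.

    Hypothetical helper: mirrors the slicing in the /logs handler, but streams
    the file so only the final n lines are kept in memory.
    """
    if not path.exists():
        return ""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        # A deque with maxlen discards the oldest line once it is full.
        last = deque(fh, maxlen=max(n, 1))
    # Drop each trailing newline, then re-join, matching splitlines() + join.
    return "\n".join(line.rstrip("\n") for line in last)
```

The handler could then return `PlainTextResponse(tail_lines(LOG_FILE, tail))`. The file is still scanned from the start, so this bounds memory use rather than read time.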
requirements.txt ADDED
@@ -0,0 +1,24 @@
1
+ --extra-index-url https://download.pytorch.org/whl/cu124
2
+ # Strategy: pin torch exactly (so we get the right CUDA wheel) and unsloth (which
3
+ # constrains the rest of the stack, e.g. trl, transformers, peft, transitively
4
+ # via its package metadata). The ranges further down are guard rails rather than
5
+ # exact pins: exact hand-pins kept producing import-time errors from version skew.
6
+ torch==2.6.0
7
+ torchvision==0.21.0
8
+ torchaudio==2.6.0
9
+ unsloth==2026.4.8
10
+ # Pin transformers to 4.x: the 5.x series dropped many legacy model classes
11
+ # (e.g. BloomPreTrainedModel) that current peft / trl still reference, which
12
+ # broke imports at runtime.
13
+ transformers>=4.51.3,<5.0
14
+ trl>=0.18.2,<=0.24.0,!=0.19.0
15
+ peft>=0.18.0,<0.20
16
+ xformers
17
+ matplotlib>=3.8.0
18
+ numpy>=1.24.0
19
+ scipy>=1.10.0
20
+ pydantic>=2.0.0
21
+ fastapi>=0.110.0
22
+ uvicorn>=0.27.0
23
+ huggingface_hub>=0.24.0
24
+ openenv-core[core]>=0.2.3
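
Because most entries above are ranges rather than exact pins, it can help to log what the resolver actually installed before training starts. The sketch below uses only the standard library's `importlib.metadata`; the helper names `installed_versions` and `major_version` are hypothetical, and the check on `transformers` simply mirrors the `<5.0` bound in this file.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def major_version(version: str) -> int:
    """Leading integer of a dotted version string, e.g. '4.51.3' gives 4."""
    return int(version.split(".", 1)[0])


if __name__ == "__main__":
    versions = installed_versions(["torch", "unsloth", "transformers", "trl", "peft"])
    for name, ver in versions.items():
        print(f"{name:14s} {ver or 'not installed'}")
    tf = versions["transformers"]
    # Assumption: fail fast if the resolver ignored the <5.0 bound above.
    if tf is not None and major_version(tf) >= 5:
        raise SystemExit("transformers 5.x detected; requirements.txt expects <5.0")
```

Running this once at container start would put the resolved versions in the build or runtime logs, which makes version-skew failures easier to trace.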