Don Rishabh Claude Opus 4.7 (1M context) committed on
Commit 3724e90 · 1 Parent(s): 7dafc94

trackio: post-hoc replay of train_metrics.jsonl into a HF Space dashboard

Adds training/replay_to_trackio.py — fetches each adapter repo's
train_metrics.jsonl from the hub and streams every step into a
Trackio run (with hyperparameters in config). Auto-deploys the
dashboard as a Space; runs become filterable/comparable.
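
As a rough sketch of the replay step described above, assuming each JSONL row is a flat dict with an optional `step` key: parse the lines, skip malformed ones, and keep only numeric or boolean values for logging. `parse_rows` and `scalar_metrics` are illustrative helpers rather than functions in the script, and the sample rows are made up.

```python
import json
from typing import Dict, List


def parse_rows(text: str) -> List[Dict]:
    """Parse JSONL text, skipping blank and malformed lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return rows


def scalar_metrics(row: Dict) -> Dict:
    """Keep int/float/bool values; drop the step key, None values, and strings."""
    return {
        k: v for k, v in row.items()
        if k != "step" and isinstance(v, (int, float, bool))
    }


# Made-up sample: two valid rows, one blank line, one malformed line.
sample = "\n".join([
    '{"step": 1, "reward": 0.25, "note": "warmup"}',
    "",
    "not json",
    '{"step": 2, "reward": 0.5, "loss": null}',
])
rows = parse_rows(sample)
cleaned = [(r.get("step"), scalar_metrics(r)) for r in rows]
print(cleaned)
```

Each `(step, metrics)` pair is what would then be handed to the tracker, one call per row.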

Live at: https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio

Currently 3 runs populated (the 3 single-turn runs that have
finished); the multi-step + Llama-self runs will land once their
HF Jobs finish — re-run the script then.

Run names + groups:
qwen_to_qwen_baseline (single-turn, control)
qwen_to_llama_thinkoff_hero (single-turn, the headline)
qwen_to_llama_thinkon (single-turn, A/B variant)
qwen_to_llama_multistep (multi-turn) — pending
llama_to_llama_self (single-turn) — pending
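
Once the pending runs finish, they can be replayed on their own through the script's `--only` flag. A minimal sketch of how that comma-separated filter selects runs, mirroring the logic in `main()`; `RUN_NAMES` and `select_runs` are illustrative names, not part of the script.

```python
from typing import List, Optional

# Run names as listed in the commit message.
RUN_NAMES = [
    "qwen_to_qwen_baseline",
    "qwen_to_llama_thinkoff_hero",
    "qwen_to_llama_thinkon",
    "qwen_to_llama_multistep",
    "llama_to_llama_self",
]


def select_runs(names: List[str], only: Optional[str]) -> List[str]:
    """Return all names when `only` is empty, else the listed subset in original order."""
    if not only:
        return list(names)
    wanted = {x.strip() for x in only.split(",") if x.strip()}
    return [n for n in names if n in wanted]


# Whitespace around the commas is tolerated.
pending = select_runs(RUN_NAMES, "qwen_to_llama_multistep, llama_to_llama_self")
print(pending)
```

Names that do not match any known run are silently ignored, so a typo in `--only` results in fewer runs being replayed rather than an error.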

README updated with the dashboard link in the Links section + the
replay script in the training pipeline file table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2)
  1. README.md +2 -0
  2. training/replay_to_trackio.py +213 -0
README.md CHANGED
@@ -21,6 +21,7 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 
  - 🌍 **Env (this Space):** https://huggingface.co/spaces/rishabh16196/prompt_golf_env
  - 🎛️ **Live demo (Gradio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-demo
+ - 📊 **Training dashboard (Trackio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio
  - 🐙 **GitHub mirror:** https://github.com/rishabh16196/prompt_golf_env
  - 📝 **Blog post:** [`BLOG_POST.md`](./BLOG_POST.md)
  - 📓 **Colab training notebook:** [`notebooks/prompt_golf_train_minimal.ipynb`](./notebooks/prompt_golf_train_minimal.ipynb)
@@ -44,6 +45,7 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
  | [`training/eval_before_after.py`](./training/eval_before_after.py) | base + trained-adapter eval harness |
  | [`training/profile_baseline.py`](./training/profile_baseline.py) | per-task target-capability profiler |
  | [`training/build_before_after_csv.py`](./training/build_before_after_csv.py) | merge eval JSONLs into the demo CSV |
+ | [`training/replay_to_trackio.py`](./training/replay_to_trackio.py) | post-hoc replay of `train_metrics.jsonl` into the Trackio dashboard Space |
  | [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_train_multistep.sh`](./training/hf_job_train_multistep.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
 
  ---
training/replay_to_trackio.py ADDED
@@ -0,0 +1,213 @@
+ """
+ Replay our existing train_metrics.jsonl files into a Trackio dashboard
+ hosted on a HuggingFace Space.
+
+ Each trained-adapter repo on the Hub has a `train_metrics.jsonl` with
+ per-step metrics (reward, raw_task_score, avg_tokens, loss, ...).
+ We log every row into a separate Trackio run, configured with that
+ adapter's hyperparameters and the resulting demo-CSV summary numbers.
+
+ After running, the dashboard is live at:
+     https://huggingface.co/spaces/<TRACKIO_SPACE>
+
+ Run:
+     python training/replay_to_trackio.py --space-id rishabh16196/prompt-golf-trackio
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import io
+ import json
+ import os
+ import urllib.request
+ from typing import Dict, List, Optional
+
+ import trackio
+
+
+ # Per-adapter metadata: how the run was configured + which hub repo
+ # holds its train_metrics.jsonl. Add more entries as runs land.
+ RUNS = [
+     {
+         "name": "qwen_to_qwen_baseline",
+         "group": "single-turn",
+         "repo": "rishabh16196/prompt-golf-grpo-1.5b",
+         "config": {
+             "agent_model": "Qwen/Qwen3-1.7B",
+             "target_model": "Qwen/Qwen3-1.7B",
+             "judge_model": "Qwen/Qwen3-8B",
+             "thinking": False,
+             "turn_limit": 1,
+             "max_steps": 500,
+             "num_generations": 8,
+             "lr": 5e-6,
+             "beta": 0.04,
+             "task_bank_size": 87,
+             "story": "same-family control (weak target)",
+         },
+     },
+     {
+         "name": "qwen_to_llama_thinkoff_hero",
+         "group": "single-turn",
+         "repo": "rishabh16196/prompt-golf-qwen-to-llama-nothink",
+         "config": {
+             "agent_model": "Qwen/Qwen3-1.7B",
+             "target_model": "meta-llama/Llama-3.2-3B-Instruct",
+             "judge_model": "Qwen/Qwen3-8B",
+             "thinking": False,
+             "turn_limit": 1,
+             "max_steps": 500,
+             "num_generations": 8,
+             "lr": 5e-6,
+             "beta": 0.04,
+             "task_bank_size": 90,
+             "story": "cross-family hero",
+         },
+     },
+     {
+         "name": "qwen_to_llama_thinkon",
+         "group": "single-turn",
+         "repo": "rishabh16196/prompt-golf-qwen-to-llama",
+         "config": {
+             "agent_model": "Qwen/Qwen3-1.7B",
+             "target_model": "meta-llama/Llama-3.2-3B-Instruct",
+             "judge_model": "Qwen/Qwen3-8B",
+             "thinking": True,
+             "turn_limit": 1,
+             "max_steps": 500,
+             "num_generations": 8,
+             "lr": 5e-6,
+             "beta": 0.04,
+             "task_bank_size": 90,
+             "story": "cross-family thinking-ON A/B variant",
+         },
+     },
+     {
+         "name": "qwen_to_llama_multistep",
+         "group": "multi-turn",
+         "repo": "rishabh16196/prompt-golf-multistep-llama",
+         "config": {
+             "agent_model": "Qwen/Qwen3-1.7B",
+             "target_model": "meta-llama/Llama-3.2-3B-Instruct",
+             "judge_model": "Qwen/Qwen3-8B",
+             "thinking": False,
+             "turn_limit": 3,
+             "max_steps": 150,
+             "num_generations": 4,
+             "lr": 2e-6,
+             "beta": 0.04,
+             "warmstart_from": "rishabh16196/prompt-golf-qwen-to-llama-nothink",
+             "story": "trajectory-level GRPO, warmstarted",
+         },
+     },
+     {
+         "name": "llama_to_llama_self",
+         "group": "single-turn",
+         "repo": "rishabh16196/prompt-golf-llama-self",
+         "config": {
+             "agent_model": "meta-llama/Llama-3.2-3B-Instruct",
+             "target_model": "meta-llama/Llama-3.2-3B-Instruct",
+             "judge_model": "Qwen/Qwen3-8B",
+             "thinking": False,
+             "turn_limit": 1,
+             "max_steps": 500,
+             "num_generations": 8,
+             "lr": 5e-6,
+             "beta": 0.04,
+             "task_bank_size": 90,
+             "story": "self-improvement: Llama writes prompts for Llama",
+         },
+     },
+ ]
+
+
+ def fetch_jsonl(repo: str, path: str = "train_metrics.jsonl") -> Optional[List[Dict]]:
+     """Pull a JSONL file from a Hub model repo. Returns None if missing."""
+     url = f"https://huggingface.co/{repo}/resolve/main/{path}"
+     headers = {}
+     token = os.environ.get("HF_TOKEN")
+     if token:
+         headers["Authorization"] = f"Bearer {token}"
+     try:
+         req = urllib.request.Request(url, headers=headers)
+         with urllib.request.urlopen(req, timeout=60) as r:
+             text = r.read().decode("utf-8")
+     except Exception as e:
+         print(f" [skip] {repo}/{path}: {e}", flush=True)
+         return None
+     rows = []
+     for line in text.splitlines():
+         line = line.strip()
+         if not line:
+             continue
+         try:
+             rows.append(json.loads(line))
+         except json.JSONDecodeError:
+             pass
+     return rows
+
+
+ def replay_run(run_meta: Dict, project: str, space_id: str) -> bool:
+     """Push one run's per-step metrics into Trackio. Returns True if logged."""
+     rows = fetch_jsonl(run_meta["repo"])
+     if not rows:
+         return False
+
+     print(f" [{run_meta['name']}] {len(rows)} steps from {run_meta['repo']}",
+           flush=True)
+     run = trackio.init(
+         project=project,
+         name=run_meta["name"],
+         group=run_meta.get("group"),
+         space_id=space_id,
+         config=run_meta.get("config", {}),
+         resume="never",
+     )
+     for row in rows:
+         # Prefer explicit step; fall back to position
+         step = row.get("step")
+         # Strip the step from the metric dict so it isn't re-logged as a metric
+         metrics = {k: v for k, v in row.items() if k != "step" and v is not None}
+         # Coerce to scalar where possible
+         clean = {}
+         for k, v in metrics.items():
+             if isinstance(v, (int, float, bool)):
+                 clean[k] = v
+             elif isinstance(v, str):
+                 # Skip free-form strings to keep dashboard charts clean
+                 continue
+         if clean:
+             trackio.log(clean, step=step)
+     trackio.finish()
+     return True
+
+
+ def main() -> None:
+     p = argparse.ArgumentParser()
+     p.add_argument("--space-id", default="rishabh16196/prompt-golf-trackio",
+                    help="HF Space to host the Trackio dashboard.")
+     p.add_argument("--project", default="prompt-golf",
+                    help="Project name within the Trackio dashboard.")
+     p.add_argument("--only", default=None,
+                    help="Comma-separated run names to replay (default: all).")
+     args = p.parse_args()
+
+     target_runs = RUNS
+     if args.only:
+         wanted = {x.strip() for x in args.only.split(",") if x.strip()}
+         target_runs = [r for r in RUNS if r["name"] in wanted]
+     print(f"Replaying {len(target_runs)} runs to "
+           f"https://huggingface.co/spaces/{args.space_id}", flush=True)
+
+     n_logged = 0
+     for r in target_runs:
+         if replay_run(r, project=args.project, space_id=args.space_id):
+             n_logged += 1
+
+     print(f"\nDone. {n_logged}/{len(target_runs)} runs replayed.", flush=True)
+     print(f"Dashboard: https://huggingface.co/spaces/{args.space_id}", flush=True)
+
+
+ if __name__ == "__main__":
+     main()
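
The request that `fetch_jsonl` sends can be checked without touching the network. The sketch below rebuilds the same URL and optional bearer header; `build_request` is a hypothetical helper introduced here for illustration, and `hf_dummy` is a placeholder token, not a real credential.

```python
import urllib.request
from typing import Optional


def build_request(repo: str, path: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a request for a file on a Hub repo's main branch."""
    url = f"https://huggingface.co/{repo}/resolve/main/{path}"
    headers = {}
    if token:
        # Only needed for private or gated repos.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


req = build_request(
    "rishabh16196/prompt-golf-grpo-1.5b",
    "train_metrics.jsonl",
    token="hf_dummy",  # placeholder value for illustration only
)
print(req.full_url)
print(req.get_header("Authorization"))
```

Splitting request construction from the network call this way is one option for unit-testing the URL and auth handling; the script itself keeps both steps inside `fetch_jsonl`.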