Commit e44cdee by ritishshrirao · 1 Parent(s): d822755

Update training config, add checkpointing on HF

README.md CHANGED
@@ -180,7 +180,7 @@ For a standalone Linux server or SSH box, there is also a wrapper script that ac
180
  VENV_PATH="$HOME/arl" \
181
  INSTALL_TRAIN_DEPS=1 \
182
  TRAIN_ENV_CONFIG_PATH="config/shared_config.json" \
183
- TRAIN_SELF_PLAY_CONFIG_PATH="config/self_play_training_hf_a10g_smoke.json" \
184
  TRAIN_SELF_PLAY_OUTPUT_DIR="artifacts/self_play_server" \
185
  bash scripts/train_self_play_standalone.sh
186
  ```
@@ -194,12 +194,12 @@ Useful overrides for the standalone script:
194
 
195
  The training config also supports `"model_topology": "dual"|"shared"`, `"phase_schedule": "generator_answerer"|"answerer_generator_answerer"`, `"tuning_mode": "full"|"lora"`, and `"canonical_graph_mode": "generate"|"fixed"` so you can switch between two-model vs single-model self-play, full fine-tuning vs LoRA adapters, and whether canonical graph structure is generated each round or kept fixed while training question/answer behavior.
196
 
197
- ### Hugging Face Job A10G Run (Separate From The Space)
198
 
199
- For a short verification run (enough to confirm W&B logging before scaling up), use:
200
 
201
  ```bash
202
- osint-env train-self-play --config config/shared_config.json --train-config config/self_play_training_hf_a10g_smoke.json
203
  ```
204
 
205
  This config:
@@ -207,8 +207,8 @@ This config:
207
  - uses `Qwen/Qwen2.5-0.5B-Instruct`
208
  - enables W&B reporting (`wandb_enabled: true`)
209
  - uses `pipeline_mode: "swarm_v2"` with `canonical_graph_mode: "fixed"` to keep canonical graph candidates stable while training question/answer behavior
210
- - keeps training intentionally short (`rounds=2`, `max_steps=50` per phase)
211
- - uses full fine-tuning plus fused AdamW, bf16/tf32, larger generation batches, and extra dataloader workers to make better use of an A10G
212
 
213
  To enable canonical graph generation during swarm_v2 training, switch `"canonical_graph_mode"` to `"generate"` in the training config.
214
 
@@ -220,25 +220,29 @@ osint-env-launch-hf-job \
220
  --job-image "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel" \
221
  --repo-url "https://github.com/your-org/meta-knowledge-graph.git" \
222
  --repo-ref "main" \
223
- --flavor "a10g-small" \
224
  --env-config "config/shared_config.json" \
225
- --train-config "config/self_play_training_hf_a10g_smoke.json" \
226
  --output-bucket "your-hf-bucket" \
227
  --wait
228
  ```
229
 
230
- The launcher talks to the Hugging Face Jobs API through `huggingface_hub`, so the Space can remain on CPU while the training job runs on separate A10G compute.
231
 
232
  Optional Space startup wiring still exists if you want it:
233
 
234
  1. Keep the Space on CPU if it is serving inference/UI only.
235
- 2. Set `RUN_SELF_PLAY_TRAINING=1` only if you intentionally want startup-time training inside the Space container.
236
  3. Optional overrides:
237
- - `TRAIN_SELF_PLAY_CONFIG_PATH` (default: `config/self_play_training_hf_a10g_smoke.json`)
 
238
  - `TRAIN_ENV_CONFIG_PATH` (default: `config/shared_config.json`)
239
  - `TRAIN_SELF_PLAY_OUTPUT_DIR` to override where artifacts land
240
  - `RUN_SELF_PLAY_DRY_RUN=1` to test startup wiring without GRPO updates
241
  - `RUN_SELF_PLAY_BACKGROUND=1` to keep the API up while startup-time training runs
242
  - `OSINT_TRAIN_STRICT_ASSERTS=1` to fail fast when reward variance, KL, loss, grad norms, or parameter updates stay zero
243
 
244
  W&B run naming is controlled by `wandb_run_name_prefix` and will emit phase-specific runs like `...-r001-generator` and `...-r001-answerer`.
 
180
  VENV_PATH="$HOME/arl" \
181
  INSTALL_TRAIN_DEPS=1 \
182
  TRAIN_ENV_CONFIG_PATH="config/shared_config.json" \
183
+ TRAIN_SELF_PLAY_CONFIG_PATH="config/self_play_training_hf_l40s_full.json" \
184
  TRAIN_SELF_PLAY_OUTPUT_DIR="artifacts/self_play_server" \
185
  bash scripts/train_self_play_standalone.sh
186
  ```
 
194
 
195
  The training config also supports `"model_topology": "dual"|"shared"`, `"phase_schedule": "generator_answerer"|"answerer_generator_answerer"`, `"tuning_mode": "full"|"lora"`, and `"canonical_graph_mode": "generate"|"fixed"` so you can switch between two-model vs single-model self-play, full fine-tuning vs LoRA adapters, and whether canonical graph structure is generated each round or kept fixed while training question/answer behavior.
196
 
197
+ ### Hugging Face Job L40S Run (Separate From The Space)
198
 
199
+ For a budgeted full fine-tuning run on `l40s` hardware, use:
200
 
201
  ```bash
202
+ osint-env train-self-play --config config/shared_config.json --train-config config/self_play_training_hf_l40s_full.json
203
  ```
204
 
205
  This config:
 
207
  - uses `Qwen/Qwen2.5-0.5B-Instruct`
208
  - enables W&B reporting (`wandb_enabled: true`)
209
  - uses `pipeline_mode: "swarm_v2"` with `canonical_graph_mode: "fixed"` to keep canonical graph candidates stable while training question/answer behavior
210
+ - keeps the VRAM-heavy settings aligned with the smoke config while extending runtime (`rounds=6`, `max_steps=120` per phase)
211
+ - uses full fine-tuning plus fused AdamW, bf16/tf32, larger generation batches, and extra dataloader workers to make better use of an L40S
212
 
213
  To enable canonical graph generation during swarm_v2 training, switch `"canonical_graph_mode"` to `"generate"` in the training config.
214
 
 
220
  --job-image "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel" \
221
  --repo-url "https://github.com/your-org/meta-knowledge-graph.git" \
222
  --repo-ref "main" \
223
+ --flavor "l40s" \
224
  --env-config "config/shared_config.json" \
225
+ --train-config "config/self_play_training_hf_l40s_full.json" \
226
  --output-bucket "your-hf-bucket" \
227
  --wait
228
  ```
229
 
230
+ The launcher talks to the Hugging Face Jobs API through `huggingface_hub`, so the Space can remain on CPU while the training job runs on separate L40S compute.
231
 
232
  Optional Space startup wiring still exists if you want it:
233
 
234
  1. Keep the Space on CPU if it is serving inference/UI only.
235
+ 2. The startup script now defaults to running self-play on boot with the full L40S config.
236
  3. Optional overrides:
237
+ - `RUN_SELF_PLAY_TRAINING=0` to disable startup-time training
238
+ - `TRAIN_SELF_PLAY_CONFIG_PATH` (default: `config/self_play_training_hf_l40s_full.json`)
239
  - `TRAIN_ENV_CONFIG_PATH` (default: `config/shared_config.json`)
240
  - `TRAIN_SELF_PLAY_OUTPUT_DIR` to override where artifacts land
241
  - `RUN_SELF_PLAY_DRY_RUN=1` to test startup wiring without GRPO updates
242
  - `RUN_SELF_PLAY_BACKGROUND=1` to keep the API up while startup-time training runs
243
+ - `OSINT_HF_CHECKPOINT_REPO_ID` to force uploads into a specific HF model repo
244
+ - `OSINT_HF_CHECKPOINT_REPO_TYPE` to switch repo type (`model` by default)
245
+ - `OSINT_HF_CHECKPOINT_REPO_PRIVATE=0` to create/update a public checkpoint repo
246
  - `OSINT_TRAIN_STRICT_ASSERTS=1` to fail fast when reward variance, KL, loss, grad norms, or parameter updates stay zero
247
 
248
  W&B run naming is controlled by `wandb_run_name_prefix` and will emit phase-specific runs like `...-r001-generator` and `...-r001-answerer`.
config/self_play_training_hf_l40s_full.json ADDED
@@ -0,0 +1,97 @@
+ {
+   "rounds": 6,
+   "output_dir": "artifacts/self_play_hf_l40s_full",
+   "dry_run": false,
+   "wandb_enabled": true,
+   "wandb_project": "osint-self-play-train",
+   "wandb_entity": "",
+   "wandb_run_name_prefix": "qwen25-05b-instruct-l40s-full",
+   "pipeline_mode": "swarm_v2",
+   "canonical_graph_mode": "fixed",
+   "model_topology": "shared",
+   "phase_schedule": "generator_answerer",
+   "tuning_mode": "full",
+   "shared_model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+   "seed_tasks_per_round": 16,
+   "generated_tasks_per_round": 24,
+   "generator_prompts_per_round": 24,
+   "max_graph_context_nodes": 24,
+   "max_graph_context_edges": 24,
+   "max_support_edges": 6,
+   "answerer_judge_max_new_tokens": 32,
+   "generated_task_max_new_tokens": 640,
+   "post_training_eval_questions": 24,
+   "post_training_eval_answer_max_new_tokens": 128,
+   "generator_phase": {
+     "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+     "learning_rate": 5e-06,
+     "max_steps": 120,
+     "per_device_train_batch_size": 4,
+     "gradient_accumulation_steps": 1,
+     "num_generations": 4,
+     "max_completion_length": 384,
+     "max_prompt_length": 768,
+     "generation_batch_size": 16,
+     "temperature": 0.9,
+     "top_p": 0.95,
+     "repetition_penalty": 1.1,
+     "beta": 0.01,
+     "epsilon": 0.2,
+     "num_iterations": 1,
+     "loss_type": "dapo",
+     "scale_rewards": "group",
+     "logging_steps": 5,
+     "save_steps": 30,
+     "save_total_limit": 4,
+     "optim": "adamw_torch_fused",
+     "bf16": true,
+     "tf32": true,
+     "gradient_checkpointing": false,
+     "dataloader_num_workers": 4,
+     "dataloader_persistent_workers": true,
+     "dataloader_prefetch_factor": 4,
+     "output_subdir": "generator_train",
+     "use_vllm": false,
+     "vllm_mode": "colocate"
+   },
+   "answerer_phase": {
+     "model_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
+     "learning_rate": 3e-06,
+     "max_steps": 120,
+     "per_device_train_batch_size": 4,
+     "gradient_accumulation_steps": 1,
+     "num_generations": 4,
+     "max_completion_length": 256,
+     "max_prompt_length": 768,
+     "generation_batch_size": 16,
+     "temperature": 0.7,
+     "top_p": 0.95,
+     "repetition_penalty": 1.1,
+     "beta": 0.01,
+     "epsilon": 0.2,
+     "num_iterations": 1,
+     "loss_type": "dapo",
+     "scale_rewards": "group",
+     "logging_steps": 5,
+     "save_steps": 30,
+     "save_total_limit": 4,
+     "optim": "adamw_torch_fused",
+     "bf16": true,
+     "tf32": true,
+     "gradient_checkpointing": false,
+     "dataloader_num_workers": 4,
+     "dataloader_persistent_workers": true,
+     "dataloader_prefetch_factor": 4,
+     "output_subdir": "answerer_train",
+     "use_vllm": false,
+     "vllm_mode": "colocate"
+   },
+   "lora": {
+     "r": 8,
+     "alpha": 16,
+     "dropout": 0.05,
+     "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
+     "bias": "none",
+     "task_type": "CAUSAL_LM"
+   }
+ }
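
Note on the training budget in this config: the sketch below works out the step and checkpoint counts implied by the values above. It assumes that each phase runs exactly `max_steps` optimizer steps per round, that the `generator_answerer` schedule runs one generator phase and one answerer phase per round, and that a checkpoint is written every `save_steps` steps. None of these is confirmed by the diff, and the helper names `phase_budget` and `total_optimizer_steps` are hypothetical.

```python
import json

# Excerpt of config/self_play_training_hf_l40s_full.json (values copied from the diff).
CONFIG_EXCERPT = json.loads(
    """
    {
      "rounds": 6,
      "generator_phase": {"max_steps": 120, "save_steps": 30, "save_total_limit": 4},
      "answerer_phase": {"max_steps": 120, "save_steps": 30, "save_total_limit": 4}
    }
    """
)


def phase_budget(phase: dict) -> dict:
    """Steps and checkpoint counts for one phase (assumes one save every save_steps)."""
    written = phase["max_steps"] // phase["save_steps"]
    return {
        "optimizer_steps": phase["max_steps"],
        "checkpoints_written": written,
        "checkpoints_kept": min(written, phase["save_total_limit"]),
    }


def total_optimizer_steps(config: dict, phases=("generator_phase", "answerer_phase")) -> int:
    """Total steps over all rounds (assumes every listed phase runs once per round)."""
    return config["rounds"] * sum(config[name]["max_steps"] for name in phases)


print(phase_budget(CONFIG_EXCERPT["generator_phase"]))
print(total_optimizer_steps(CONFIG_EXCERPT))
```

Under those assumptions the run totals 6 × (120 + 120) = 1440 optimizer steps, compared with 2 × (50 + 50) = 200 for the `rounds=2`, `max_steps=50` smoke settings quoted in the old README text.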
docs/adversarial_self_play.md CHANGED
@@ -122,6 +122,8 @@ Per round and phase you will now find:
122
  - `self_play_summary.json`: top-level run summary.
123
  - `post_training_evaluation.json`: generated-question evaluation written after training.
124
 
125
  ## Compute Mode
126
 
127
  When compute is available:
@@ -142,7 +144,7 @@ Example:
142
  VENV_PATH="$HOME/arl" \
143
  INSTALL_TRAIN_DEPS=1 \
144
  TRAIN_ENV_CONFIG_PATH="config/shared_config.json" \
145
- TRAIN_SELF_PLAY_CONFIG_PATH="config/self_play_training_hf_a10g_smoke.json" \
146
  TRAIN_SELF_PLAY_OUTPUT_DIR="artifacts/self_play_server" \
147
  bash scripts/train_self_play_standalone.sh
148
  ```
 
122
  - `self_play_summary.json`: top-level run summary.
123
  - `post_training_evaluation.json`: generated-question evaluation written after training.
124
 
125
+ If `HF_TOKEN` is available, the trainer can also mirror phase folders and summary artifacts to a Hugging Face repo. By default it derives a repo on the same account as the Space using `SPACE_ID`/`HF_SPACE_ID` and a `-checkpoints` suffix. You can override that with `OSINT_HF_CHECKPOINT_REPO_ID`.
126
+
127
  ## Compute Mode
128
 
129
  When compute is available:
 
144
  VENV_PATH="$HOME/arl" \
145
  INSTALL_TRAIN_DEPS=1 \
146
  TRAIN_ENV_CONFIG_PATH="config/shared_config.json" \
147
+ TRAIN_SELF_PLAY_CONFIG_PATH="config/self_play_training_hf_l40s_full.json" \
148
  TRAIN_SELF_PLAY_OUTPUT_DIR="artifacts/self_play_server" \
149
  bash scripts/train_self_play_standalone.sh
150
  ```
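
Note on the default checkpoint repo described in the added docs paragraph: the sketch below restates the derivation from `_default_hf_checkpoint_repo_id` in `self_play.py`, taking the environment as a plain mapping so it can be exercised without touching `os.environ`. The function names `slugify` and `derive_checkpoint_repo_id` and the example Space ID `acme/OSINT Env` are illustrative, not part of the commit.

```python
import re
from typing import Mapping


def slugify(value: str) -> str:
    """Lowercase, replace runs of disallowed characters with '-', trim '-' and '.'."""
    token = re.sub(r"[^a-zA-Z0-9._-]+", "-", value.strip().lower())
    return re.sub(r"-{2,}", "-", token).strip("-.")


def derive_checkpoint_repo_id(env: Mapping[str, str]) -> str:
    """Return the checkpoint repo ID, or '' when nothing can be derived."""
    explicit = env.get("OSINT_HF_CHECKPOINT_REPO_ID", "").strip()
    if explicit:
        return explicit  # an explicit override always wins

    space_id = (env.get("SPACE_ID") or env.get("HF_SPACE_ID") or "").strip()
    if "/" not in space_id:
        return ""  # not running in a Space, so there is no owner to derive from

    owner, _, space_name = space_id.partition("/")
    suffix = env.get("OSINT_HF_CHECKPOINT_REPO_SUFFIX", "-checkpoints").strip() or "-checkpoints"
    repo_name = slugify(f"{space_name}{suffix}") or "osint-self-play-checkpoints"
    return f"{owner}/{repo_name}"


print(derive_checkpoint_repo_id({"SPACE_ID": "acme/OSINT Env"}))
```

Only the repo name is slugified; the owner segment is passed through unchanged, so the derived repo lands on the same account as the Space.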
scripts/space_start.sh CHANGED
@@ -9,9 +9,9 @@ _is_true() {
9
  }
10
 
11
  ENV_CONFIG_PATH="${TRAIN_ENV_CONFIG_PATH:-config/shared_config.json}"
12
- TRAIN_CONFIG_PATH="${TRAIN_SELF_PLAY_CONFIG_PATH:-config/self_play_training_hf_a10g_smoke.json}"
13
  TRAIN_OUTPUT_DIR="${TRAIN_SELF_PLAY_OUTPUT_DIR:-}"
14
- RUN_FLAG="${RUN_SELF_PLAY_TRAINING:-0}"
15
  DRY_RUN_FLAG="${RUN_SELF_PLAY_DRY_RUN:-0}"
16
  BACKGROUND_FLAG="${RUN_SELF_PLAY_BACKGROUND:-1}"
17
 
@@ -44,6 +44,9 @@ if _is_true "$RUN_FLAG"; then
44
  if [ -n "${TRAIN_OUTPUT_DIR}" ]; then
45
  echo "[space_start] Train output dir: ${TRAIN_OUTPUT_DIR}"
46
  fi
47
  if _is_true "$BACKGROUND_FLAG"; then
48
  echo "[space_start] Launching self-play in background so the Space API can stay online."
49
  _train_self_play &
 
9
  }
10
 
11
  ENV_CONFIG_PATH="${TRAIN_ENV_CONFIG_PATH:-config/shared_config.json}"
12
+ TRAIN_CONFIG_PATH="${TRAIN_SELF_PLAY_CONFIG_PATH:-config/self_play_training_hf_l40s_full.json}"
13
  TRAIN_OUTPUT_DIR="${TRAIN_SELF_PLAY_OUTPUT_DIR:-}"
14
+ RUN_FLAG="${RUN_SELF_PLAY_TRAINING:-1}"
15
  DRY_RUN_FLAG="${RUN_SELF_PLAY_DRY_RUN:-0}"
16
  BACKGROUND_FLAG="${RUN_SELF_PLAY_BACKGROUND:-1}"
17
 
 
44
  if [ -n "${TRAIN_OUTPUT_DIR}" ]; then
45
  echo "[space_start] Train output dir: ${TRAIN_OUTPUT_DIR}"
46
  fi
47
+ if [ -n "${OSINT_HF_CHECKPOINT_REPO_ID:-}" ]; then
48
+ echo "[space_start] HF checkpoint repo: ${OSINT_HF_CHECKPOINT_REPO_ID}"
49
+ fi
50
  if _is_true "$BACKGROUND_FLAG"; then
51
  echo "[space_start] Launching self-play in background so the Space API can stay online."
52
  _train_self_play &
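
Note on the `RUN_FLAG` default flip: with `${RUN_SELF_PLAY_TRAINING:-1}`, a Space that never set the variable now trains on boot. The Python sketch below models that resolution. It assumes the shell helper `_is_true` accepts the same tokens as `_is_true_env` in `self_play.py` (the body of `_is_true` is not shown in this diff), and `should_train_on_boot` is a hypothetical name.

```python
from typing import Mapping

# Assumption: same truthy tokens as _is_true_env in self_play.py.
TRUTHY = {"1", "true", "yes", "y", "on"}


def should_train_on_boot(env: Mapping[str, str], default: str = "1") -> bool:
    """Model `${RUN_SELF_PLAY_TRAINING:-default}` followed by a truthiness check."""
    # The shell `:-` form falls back to the default when the variable is unset OR empty.
    raw = env.get("RUN_SELF_PLAY_TRAINING") or default
    return raw.strip().lower() in TRUTHY


print(should_train_on_boot({}))               # new default of 1
print(should_train_on_boot({}, default="0"))  # previous default of 0
print(should_train_on_boot({"RUN_SELF_PLAY_TRAINING": "0"}))
```

A CPU-only Space that serves just the API or UI would therefore need `RUN_SELF_PLAY_TRAINING=0` set explicitly after this commit.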
src/osint_env/training/hf_jobs.py CHANGED
@@ -265,10 +265,10 @@ def build_parser() -> argparse.ArgumentParser:
265
  )
266
  parser.add_argument(
267
  "--train-config",
268
- default=os.getenv("TRAIN_SELF_PLAY_CONFIG_PATH", "config/self_play_training_hf_a10g_smoke.json"),
269
  help="Training config path inside the training image or checked-out repo.",
270
  )
271
- parser.add_argument("--flavor", default=os.getenv("HF_JOB_FLAVOR", "a10g-small"))
272
  parser.add_argument("--timeout", default=os.getenv("HF_JOB_TIMEOUT", "8h"))
273
  parser.add_argument("--namespace", default=os.getenv("HF_JOB_NAMESPACE", ""))
274
  parser.add_argument("--run-name", default=os.getenv("HF_JOB_RUN_NAME", "osint-self-play-job"))
 
265
  )
266
  parser.add_argument(
267
  "--train-config",
268
+ default=os.getenv("TRAIN_SELF_PLAY_CONFIG_PATH", "config/self_play_training_hf_l40s_full.json"),
269
  help="Training config path inside the training image or checked-out repo.",
270
  )
271
+ parser.add_argument("--flavor", default=os.getenv("HF_JOB_FLAVOR", "l40s"))
272
  parser.add_argument("--timeout", default=os.getenv("HF_JOB_TIMEOUT", "8h"))
273
  parser.add_argument("--namespace", default=os.getenv("HF_JOB_NAMESPACE", ""))
274
  parser.add_argument("--run-name", default=os.getenv("HF_JOB_RUN_NAME", "osint-self-play-job"))
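
Note on default precedence in `build_parser`: because each default is read from the environment when the parser is built, the effective order is command-line flag, then environment variable, then the hard-coded value. The sketch below reproduces only the two arguments changed in this hunk. The `environ` parameter is an addition for testability and is not part of the real function.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(environ: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Reduced parser with the two defaults touched by this commit."""
    env = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train-config",
        default=env.get("TRAIN_SELF_PLAY_CONFIG_PATH", "config/self_play_training_hf_l40s_full.json"),
    )
    parser.add_argument("--flavor", default=env.get("HF_JOB_FLAVOR", "l40s"))
    return parser


# Hard-coded default, environment override, then an explicit flag.
print(build_parser({}).parse_args([]).flavor)
print(build_parser({"HF_JOB_FLAVOR": "a10g-small"}).parse_args([]).flavor)
print(build_parser({"HF_JOB_FLAVOR": "a10g-small"}).parse_args(["--flavor", "l40s"]).flavor)
```

Unlike the shell `${VAR:-default}` form used in `space_start.sh`, `os.getenv` returns an empty string when the variable is set but empty, so an empty `HF_JOB_FLAVOR` would yield an empty flavor rather than `l40s`.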
src/osint_env/training/rewards.py CHANGED
@@ -953,7 +953,7 @@ class GeneratorRewardFunction:
953
  context_pressure = self._context_pressure_score(validation_result)
954
  parl_parallel, parl_finish = self._parl_scores(candidate)
955
  hardness_component = max(0.0, min(1.0, (hardness + 0.4) / 1.4))
956
- consistency_component = max(
957
  0.0,
958
  min(
959
  1.0,
 
953
  context_pressure = self._context_pressure_score(validation_result)
954
  parl_parallel, parl_finish = self._parl_scores(candidate)
955
  hardness_component = max(0.0, min(1.0, (hardness + 0.4) / 1.4))
956
+ consistency_component = max(
957
  0.0,
958
  min(
959
  1.0,
src/osint_env/training/self_play.py CHANGED
@@ -3,6 +3,7 @@ from __future__ import annotations
3
  import inspect
4
  import json
5
  import os
 
6
  from dataclasses import dataclass
7
  from pathlib import Path
8
  import random
@@ -46,6 +47,111 @@ class _RoundArtifacts:
46
  generated_tasks_path: str
47
 
48
 
49
 
50
  def _require_training_stack() -> tuple[Any, Any, Any]:
51
  try:
@@ -1489,6 +1595,11 @@ def _run_adversarial_self_play_swarm_v2(
1489
 
1490
  run_dir = Path(training_config.output_dir)
1491
  run_dir.mkdir(parents=True, exist_ok=True)
1492
 
1493
  env = OSINTEnvironment(env_config, llm=build_llm_client(env_config.llm))
1494
  seed_tasks = list(env.tasks)
@@ -1566,6 +1677,11 @@ def _run_adversarial_self_play_swarm_v2(
1566
  report_to=answerer_pre_report_to,
1567
  run_name=answerer_pre_run_name,
1568
  )
1569
  answerer_model = str(answerer_pre_train_result["model_path"])
1570
  if topology == "shared":
1571
  generator_model = answerer_model
@@ -1614,6 +1730,11 @@ def _run_adversarial_self_play_swarm_v2(
1614
  report_to=generator_report_to,
1615
  run_name=generator_run_name,
1616
  )
1617
  generator_model = str(generator_train_result["model_path"])
1618
  if topology == "shared":
1619
  answerer_model = generator_model
@@ -1719,6 +1840,11 @@ def _run_adversarial_self_play_swarm_v2(
1719
  report_to=answerer_report_to,
1720
  run_name=answerer_run_name,
1721
  )
1722
  answerer_model = str(answerer_train_result["model_path"])
1723
  if topology == "shared":
1724
  generator_model = answerer_model
@@ -1790,6 +1916,8 @@ def _run_adversarial_self_play_swarm_v2(
1790
  summary_path = run_dir / "self_play_summary.json"
1791
  summary_path.write_text(json.dumps(final_payload, indent=2, sort_keys=True), encoding="utf-8")
1792
  final_payload["summary_path"] = str(summary_path)
1793
  return final_payload
1794
 
1795
 
@@ -1813,6 +1941,11 @@ def run_adversarial_self_play(
1813
 
1814
  run_dir = Path(training_config.output_dir)
1815
  run_dir.mkdir(parents=True, exist_ok=True)
1816
 
1817
  env = OSINTEnvironment(env_config, llm=build_llm_client(env_config.llm))
1818
  seed_tasks = list(env.tasks)
@@ -1878,6 +2011,11 @@ def run_adversarial_self_play(
1878
  report_to=answerer_pre_report_to,
1879
  run_name=answerer_pre_run_name,
1880
  )
1881
  answerer_model = str(answerer_pre_train_result["model_path"])
1882
  if topology == "shared":
1883
  generator_model = answerer_model
@@ -1921,6 +2059,11 @@ def run_adversarial_self_play(
1921
  report_to=generator_report_to,
1922
  run_name=generator_run_name,
1923
  )
1924
  generator_model = str(generator_train_result["model_path"])
1925
  if topology == "shared":
1926
  answerer_model = generator_model
@@ -1993,6 +2136,11 @@ def run_adversarial_self_play(
1993
  report_to=answerer_report_to,
1994
  run_name=answerer_run_name,
1995
  )
1996
  answerer_model = str(answerer_train_result["model_path"])
1997
  if topology == "shared":
1998
  generator_model = answerer_model
@@ -2067,5 +2215,7 @@ def run_adversarial_self_play(
2067
  summary_path = run_dir / "self_play_summary.json"
2068
  summary_path.write_text(json.dumps(final_payload, indent=2, sort_keys=True), encoding="utf-8")
2069
  final_payload["summary_path"] = str(summary_path)
2070
 
2071
  return final_payload
 
3
  import inspect
4
  import json
5
  import os
6
+ import re
7
  from dataclasses import dataclass
8
  from pathlib import Path
9
  import random
 
47
  generated_tasks_path: str
48
 
49
 
+ def _is_true_env(value: str | None) -> bool:
+     token = str(value or "").strip().lower()
+     return token in {"1", "true", "yes", "y", "on"}
+
+
+ def _resolve_hf_upload_token() -> str:
+     for env_name in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
+         token = str(os.getenv(env_name, "")).strip()
+         if token:
+             return token
+     return ""
+
+
+ def _slugify_hf_repo_name(value: str) -> str:
+     token = re.sub(r"[^a-zA-Z0-9._-]+", "-", str(value).strip().lower())
+     token = re.sub(r"-{2,}", "-", token).strip("-.")
+     return token
+
+
+ def _default_hf_checkpoint_repo_id(run_dir: Path) -> str:
+     explicit = str(os.getenv("OSINT_HF_CHECKPOINT_REPO_ID", "")).strip()
+     if explicit:
+         return explicit
+
+     space_id = str(os.getenv("SPACE_ID") or os.getenv("HF_SPACE_ID") or "").strip()
+     if "/" not in space_id:
+         return ""
+
+     owner, _, space_name = space_id.partition("/")
+     suffix = str(os.getenv("OSINT_HF_CHECKPOINT_REPO_SUFFIX", "-checkpoints")).strip() or "-checkpoints"
+     repo_name = _slugify_hf_repo_name(f"{space_name}{suffix}") or "osint-self-play-checkpoints"
+     return f"{owner}/{repo_name}"
+
+
+ def _hf_checkpoint_repo_prefix(run_dir: Path) -> str:
+     explicit = str(os.getenv("OSINT_HF_CHECKPOINT_PATH_PREFIX", "")).strip().strip("/")
+     if explicit:
+         return explicit
+     return _slugify_hf_repo_name(run_dir.name) or "self-play"
+
+
+ def _hf_relative_repo_path(local_path: Path, run_dir: Path) -> str:
+     prefix = _hf_checkpoint_repo_prefix(run_dir)
+     try:
+         relative = local_path.relative_to(run_dir).as_posix()
+     except ValueError:
+         relative = local_path.name
+     return f"{prefix}/{relative}".strip("/")
+
+
+ def _maybe_upload_folder_to_hf(local_dir: Path, run_dir: Path, commit_message: str) -> None:
+     repo_id = _default_hf_checkpoint_repo_id(run_dir)
+     token = _resolve_hf_upload_token()
+     if not repo_id or not token or not local_dir.exists():
+         return
+
+     try:
+         from huggingface_hub import HfApi
+     except ImportError:
+         print("[self_play][hf_upload] huggingface_hub missing; skipping checkpoint upload.")
+         return
+
+     repo_type = str(os.getenv("OSINT_HF_CHECKPOINT_REPO_TYPE", "model")).strip() or "model"
+     private = _is_true_env(os.getenv("OSINT_HF_CHECKPOINT_REPO_PRIVATE", "1"))
+     path_in_repo = _hf_relative_repo_path(local_dir, run_dir)
+     api = HfApi(token=token)
+     api.create_repo(repo_id=repo_id, repo_type=repo_type, private=private, exist_ok=True)
+     api.upload_folder(
+         folder_path=str(local_dir),
+         repo_id=repo_id,
+         repo_type=repo_type,
+         path_in_repo=path_in_repo,
+         commit_message=commit_message,
+         ignore_patterns=["*.pyc", "__pycache__", ".DS_Store"],
+     )
+     print(f"[self_play][hf_upload] uploaded {local_dir} -> {repo_type}:{repo_id}/{path_in_repo}")
+
+
+ def _maybe_upload_file_to_hf(local_file: Path, run_dir: Path, commit_message: str) -> None:
+     repo_id = _default_hf_checkpoint_repo_id(run_dir)
+     token = _resolve_hf_upload_token()
+     if not repo_id or not token or not local_file.exists():
+         return
+
+     try:
+         from huggingface_hub import HfApi
+     except ImportError:
+         print("[self_play][hf_upload] huggingface_hub missing; skipping artifact upload.")
+         return
+
+     repo_type = str(os.getenv("OSINT_HF_CHECKPOINT_REPO_TYPE", "model")).strip() or "model"
+     private = _is_true_env(os.getenv("OSINT_HF_CHECKPOINT_REPO_PRIVATE", "1"))
+     path_in_repo = _hf_relative_repo_path(local_file, run_dir)
+     api = HfApi(token=token)
+     api.create_repo(repo_id=repo_id, repo_type=repo_type, private=private, exist_ok=True)
+     api.upload_file(
+         path_or_fileobj=str(local_file),
+         repo_id=repo_id,
+         repo_type=repo_type,
+         path_in_repo=path_in_repo,
+         commit_message=commit_message,
+     )
+     print(f"[self_play][hf_upload] uploaded {local_file} -> {repo_type}:{repo_id}/{path_in_repo}")
+
+
155
 
156
  def _require_training_stack() -> tuple[Any, Any, Any]:
157
  try:
 
1595
 
1596
  run_dir = Path(training_config.output_dir)
1597
  run_dir.mkdir(parents=True, exist_ok=True)
1598
+ checkpoint_repo_id = _default_hf_checkpoint_repo_id(run_dir)
1599
+ if checkpoint_repo_id and _resolve_hf_upload_token():
1600
+ print(f"[self_play][hf_upload] checkpoint uploads enabled -> {checkpoint_repo_id}")
1601
+ else:
1602
+ print("[self_play][hf_upload] checkpoint uploads disabled; set HF token and/or OSINT_HF_CHECKPOINT_REPO_ID.")
1603
 
1604
  env = OSINTEnvironment(env_config, llm=build_llm_client(env_config.llm))
1605
  seed_tasks = list(env.tasks)
 
1677
  report_to=answerer_pre_report_to,
1678
  run_name=answerer_pre_run_name,
1679
  )
1680
+ _maybe_upload_folder_to_hf(
1681
+ round_dir / f"{training_config.answerer_phase.output_subdir}_pre",
1682
+ run_dir,
1683
+ f"Upload answerer-pre checkpoints for round {round_index:03d}",
1684
+ )
1685
  answerer_model = str(answerer_pre_train_result["model_path"])
1686
  if topology == "shared":
1687
  generator_model = answerer_model
 
1730
  report_to=generator_report_to,
1731
  run_name=generator_run_name,
1732
  )
1733
+ _maybe_upload_folder_to_hf(
1734
+ round_dir / training_config.generator_phase.output_subdir,
1735
+ run_dir,
1736
+ f"Upload generator checkpoints for round {round_index:03d}",
1737
+ )
1738
  generator_model = str(generator_train_result["model_path"])
1739
  if topology == "shared":
1740
  answerer_model = generator_model
 
1840
  report_to=answerer_report_to,
1841
  run_name=answerer_run_name,
1842
  )
1843
+ _maybe_upload_folder_to_hf(
1844
+ round_dir / training_config.answerer_phase.output_subdir,
1845
+ run_dir,
1846
+ f"Upload answerer checkpoints for round {round_index:03d}",
1847
+ )
1848
  answerer_model = str(answerer_train_result["model_path"])
1849
  if topology == "shared":
1850
  generator_model = answerer_model
 
1916
  summary_path = run_dir / "self_play_summary.json"
1917
  summary_path.write_text(json.dumps(final_payload, indent=2, sort_keys=True), encoding="utf-8")
1918
  final_payload["summary_path"] = str(summary_path)
1919
+ _maybe_upload_file_to_hf(summary_path, run_dir, "Upload self-play summary")
1920
+ _maybe_upload_file_to_hf(run_dir / "post_training_evaluation.json", run_dir, "Upload post-training evaluation")
1921
  return final_payload
1922
 
1923
 
 
1941
 
1942
  run_dir = Path(training_config.output_dir)
1943
  run_dir.mkdir(parents=True, exist_ok=True)
1944
+ checkpoint_repo_id = _default_hf_checkpoint_repo_id(run_dir)
1945
+ if checkpoint_repo_id and _resolve_hf_upload_token():
1946
+ print(f"[self_play][hf_upload] checkpoint uploads enabled -> {checkpoint_repo_id}")
1947
+ else:
1948
+ print("[self_play][hf_upload] checkpoint uploads disabled; set HF token and/or OSINT_HF_CHECKPOINT_REPO_ID.")
1949
 
1950
  env = OSINTEnvironment(env_config, llm=build_llm_client(env_config.llm))
1951
  seed_tasks = list(env.tasks)
 
2011
  report_to=answerer_pre_report_to,
2012
  run_name=answerer_pre_run_name,
2013
  )
2014
+ _maybe_upload_folder_to_hf(
2015
+ round_dir / f"{training_config.answerer_phase.output_subdir}_pre",
2016
+ run_dir,
2017
+ f"Upload answerer-pre checkpoints for round {round_index:03d}",
2018
+ )
2019
  answerer_model = str(answerer_pre_train_result["model_path"])
2020
  if topology == "shared":
2021
  generator_model = answerer_model
 
2059
  report_to=generator_report_to,
2060
  run_name=generator_run_name,
2061
  )
2062
+ _maybe_upload_folder_to_hf(
2063
+ round_dir / training_config.generator_phase.output_subdir,
2064
+ run_dir,
2065
+ f"Upload generator checkpoints for round {round_index:03d}",
2066
+ )
2067
  generator_model = str(generator_train_result["model_path"])
2068
  if topology == "shared":
2069
  answerer_model = generator_model
 
2136
  report_to=answerer_report_to,
2137
  run_name=answerer_run_name,
2138
  )
2139
+ _maybe_upload_folder_to_hf(
2140
+ round_dir / training_config.answerer_phase.output_subdir,
2141
+ run_dir,
2142
+ f"Upload answerer checkpoints for round {round_index:03d}",
2143
+ )
2144
  answerer_model = str(answerer_train_result["model_path"])
2145
  if topology == "shared":
2146
  generator_model = answerer_model
 
2215
  summary_path = run_dir / "self_play_summary.json"
2216
  summary_path.write_text(json.dumps(final_payload, indent=2, sort_keys=True), encoding="utf-8")
2217
  final_payload["summary_path"] = str(summary_path)
2218
+ _maybe_upload_file_to_hf(summary_path, run_dir, "Upload self-play summary")
2219
+ _maybe_upload_file_to_hf(run_dir / "post_training_evaluation.json", run_dir, "Upload post-training evaluation")
2220
 
2221
  return final_payload
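
Note on where uploaded artifacts land in the checkpoint repo: the sketch below mirrors `_hf_checkpoint_repo_prefix` and `_hf_relative_repo_path` from this commit, with the `OSINT_HF_CHECKPOINT_PATH_PREFIX` override passed as an argument instead of being read from the environment. The function name `repo_path` and the `round_001` directory name are illustrative assumptions, since the diff does not show how `round_dir` is named.

```python
import re
from pathlib import Path


def slugify(value: str) -> str:
    """Same normalisation as _slugify_hf_repo_name in the commit."""
    token = re.sub(r"[^a-zA-Z0-9._-]+", "-", value.strip().lower())
    return re.sub(r"-{2,}", "-", token).strip("-.")


def repo_path(local_path: Path, run_dir: Path, prefix_override: str = "") -> str:
    """Map a local artifact path to its path inside the checkpoint repo."""
    prefix = prefix_override.strip().strip("/") or slugify(run_dir.name) or "self-play"
    try:
        relative = local_path.relative_to(run_dir).as_posix()
    except ValueError:
        # Paths outside the run directory are flattened to their file name.
        relative = local_path.name
    return f"{prefix}/{relative}".strip("/")


run_dir = Path("artifacts/self_play_hf_l40s_full")
# "round_001" is a hypothetical round directory name.
print(repo_path(run_dir / "round_001" / "generator_train", run_dir))
print(repo_path(run_dir / "self_play_summary.json", run_dir))
print(repo_path(Path("/tmp/other.json"), run_dir, prefix_override="runs/2024"))
```

Because the prefix defaults to the slugified run directory name, separate runs with different `output_dir` values can share one checkpoint repo without overwriting each other's files.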