Humanlearning committed on
Commit
0e7f59c
·
1 Parent(s): 7d32451

feat: enhance reward configuration management with new logging functions, add parallel Modal training guidelines to documentation, and improve reward config hashing for deterministic behavior

.agents/skills/cybersecurity-owasp-trainer/SKILL.md CHANGED
@@ -89,6 +89,70 @@ Stop or roll back if reward rises while sampled traces show deny-all patches, ha
 
Stop or downgrade to local-dev only if Modal training/eval shows runtime scenario compilation, cache misses in required mode, or cache hit rate below the configured target.

+ ## Parallel Modal Runs
+
+ Parallel GRPO runs are allowed, but they must not share mutable experiment
+ identity or mutate shared caches while another job is training.
+
+ Before launching another run:
+
+ 1. Check active Modal apps:
+
+ ```bash
+ uv run --extra modal modal app list
+ ```
+
+ 2. Inspect any active `CyberSecurity_OWASP` app before starting another job:
+
+ ```bash
+ uv run --extra modal modal app logs <app-id>
+ ```
+
+ 3. Use both detach layers for long jobs:
+
+ ```bash
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
+   --max-steps 300 \
+   --dataset-size 64 \
+   --num-generations 8 \
+   --max-completion-length 768 \
+   --difficulty 0 \
+   --trace-log-every 10 \
+   --seed-start 10000 \
+   --detach
+ ```
+
+ The Modal CLI `--detach` keeps the remote function alive after the local
+ entrypoint disconnects. The launcher `--detach` prevents the parent Modal
+ function from waiting on the spawned GPU call. Use both; using only the script
+ flag can let Modal stop the run when the local client exits.
+
+ For concurrent experiments:
+
+ - Assign every run a distinct `--seed-start` range, normally at least 10,000
+   seeds apart.
+ - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`.
+ - Do not run `prepare-cache --cache-force` while any training job is active.
+ - Leave `--push-to-hub` off unless every job has a unique `--output-repo-id`.
+ - Keep Trackio run names unique. The launcher timestamp normally handles this;
+   set `RUN_NAME` only when it is globally unique.
+ - Use the same Trackio Space/project for comparison, but never reuse a run
+   name.
+ - Treat `CyberSecurity_OWASP-model-cache` and
+   `CyberSecurity_OWASP-scenario-cache` as shared read-mostly volumes during
+   training. Run checkpoints and artifacts must live under the run-specific
+   output directory.
+ - For clean comparisons, keep model, difficulty, dataset size, generation
+   length, reward config, and cache version fixed; vary only `seed-start` or the
+   one hyperparameter being tested.
+
+ On Windows, if Modal startup fails with a Unicode `charmap` encoding error,
+ rerun the command with UTF-8 enabled:
+
+ ```powershell
+ $env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
+ ```
+
## TRL, OpenEnv, And Unsloth Guidance

- Use TRL GRPO for verifier-driven rewards. Keep multiple independent reward functions for logging and diagnosis.
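
The dual-detach rule above reduces, on the launcher side, to Modal's two call styles. A minimal sketch of the branch the launcher `--detach` flag selects (the pattern is taken from the baseline-mode code this commit adds to `scripts/modal_train_grpo.py`; `train_fn` is a stand-in name for the decorated Modal function):

```python
# Sketch: what the launcher's --detach flag toggles. spawn() fires and forgets;
# remote() blocks the parent until the GPU call finishes.
def launch(train_fn, detach: bool, **kwargs):
    if detach:
        call = train_fn.spawn(**kwargs)  # parent does not wait on the GPU call
        print(f"Spawned Modal call: {call.object_id}")
    else:
        result = train_fn.remote(**kwargs)  # parent waits for the result
        print(f"Result: {result}")
```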
AGENTS.md CHANGED
@@ -1079,6 +1079,62 @@ Then scale gradually:
 
For high-volume rollouts, prefer local Docker or Uvicorn over remote HF Spaces because local WebSocket sessions reduce latency and avoid Space limits.

+ ### Parallel Modal training runs
+
+ Parallel Modal GRPO runs are allowed only when they do not overwrite each
+ other's evidence, checkpoints, scenario assignments, or Hub outputs.
+
+ Before launching another run:
+
+ 1. Check active Modal apps:
+
+ ```bash
+ uv run --extra modal modal app list
+ ```
+
+ 2. If a `CyberSecurity_OWASP` app is active, inspect it before launching:
+
+ ```bash
+ uv run --extra modal modal app logs <app-id>
+ ```
+
+ 3. Use Modal CLI-level detach and the launcher detach flag together; otherwise
+   the spawned GPU function may stop when the local entrypoint exits:
+
+ ```bash
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
+   --max-steps 300 \
+   --dataset-size 64 \
+   --num-generations 8 \
+   --max-completion-length 768 \
+   --difficulty 0 \
+   --trace-log-every 10 \
+   --seed-start 10000 \
+   --detach
+ ```
+
+ When running jobs in parallel:
+
+ - Give every run a distinct `--seed-start` range, spaced by at least 10,000
+   seeds unless a smaller controlled comparison is intentional. (A helper sketch
+   for picking the values follows this section.)
+ - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
+   scenarios in the training hot path.
+ - Do not run `prepare-cache --cache-force` while any training job is active.
+   Scenario-cache writes can invalidate or race training resets.
+ - Leave `--push-to-hub` off for parallel experiments unless each run has a
+   unique `--output-repo-id`.
+ - Keep run names unique. The launcher timestamp normally handles this; set an
+   explicit `RUN_NAME` only when it is globally unique.
+ - Use different Trackio run names but the same Trackio Space so reward,
+   throughput, GPU utilization, invalid-action rate, and success metrics remain
+   comparable.
+ - Treat the shared Modal volumes as shared infrastructure: model cache and
+   scenario cache should be read-only during parallel training; run/checkpoint
+   outputs must live under each run's unique output directory.
+ - If the goal is a clean reward comparison, keep model, difficulty,
+   `dataset-size`, `num-generations`, `max-completion-length`, and reward config
+   fixed, changing only `seed-start` or the one hyperparameter being tested.
+
---

## README requirements
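
To make the 10,000-seed spacing rule above mechanical when launching a batch of parallel runs, a small helper along these lines works (hypothetical; not part of this commit):

```python
SEED_SPACING = 10_000  # minimum separation between parallel runs' seed ranges

def seed_starts(num_runs: int, base: int = 10_000) -> list[int]:
    """Return one --seed-start per run so the seed ranges never overlap."""
    return [base + index * SEED_SPACING for index in range(num_runs)]

print(seed_starts(3))  # [10000, 20000, 30000]
```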
README.md CHANGED
@@ -211,6 +211,20 @@ uv run python scripts/track_pytest.py tests
 
Evaluation summaries saved through `training.eval_before_after.save_eval_summary(...)`, Modal smoke runs, and GRPO training configs all initialize Trackio runs with CyberSecurity_OWASP run names.

+ Training, baseline, and smoke runs also log the effective reward config at step
+ 0. In Trackio, open **Media & Tables** and select the `reward_config` table to
+ see the actual values for each reward key, including stage-specific values,
+ caps, thresholds, terminate flags, and descriptions. Scalar metrics under
+ `reward_config/<key>/<field>` expose the same numeric values for plotting and
+ filtering, for example `reward_config/policy_inspected/value` and
+ `reward_config/shaping_weight/resolved`.
+
+ Each run config includes `reward_config_id`, `reward_config_hash`,
+ `reward_config_source`, `reward_mode`, and `reward_stage`. For manual ablations,
+ compare runs with the same scenario/model settings and different
+ `reward_config_hash` values to see which reward weights produced each training
+ curve.
+
## Modal Ephemeral Runs

Modal Labs support is kept in a separate launcher script so the local OpenEnv server and core training scaffold stay unchanged.
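
As a concrete illustration of the `reward_config/<key>/<field>` naming described in that hunk, a minimal sketch of how flattened config rows can map to scalar metric names (an approximation of `reward_config_scalar_metrics` in `training/trackio_utils.py`, not its exact code; the row values are illustrative):

```python
def scalar_metrics(rows: list[dict]) -> dict[str, float]:
    """Map flattened reward-config rows to reward_config/<key>/<field> scalars."""
    metrics: dict[str, float] = {}
    for row in rows:
        for field in ("value", "stage_value", "resolved", "cap", "threshold"):
            value = row.get(field)
            # Blank placeholders ("") and bools are not plottable scalars.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                metrics[f"reward_config/{row['key']}/{field}"] = float(value)
    return metrics

rows = [{"key": "policy_inspected", "value": 0.30, "stage_value": "", "resolved": 0.30}]
print(scalar_metrics(rows))
# {'reward_config/policy_inspected/value': 0.3, 'reward_config/policy_inspected/resolved': 0.3}
```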
@@ -307,6 +321,59 @@ container starts. Scalar Trackio metrics still log every reward callback, while
sample trace tables and Trace objects are throttled by `--trace-log-every`
(`1` restores every-callback logging, `0` disables trace artifacts).

+ ### Parallel Modal GRPO Runs
+
+ Parallel Modal GRPO runs are safe when each run has its own seed range, run
+ name, and output target, and the shared cache volumes remain read-only.
+ Before launching another job, check what is already active:
+
+ ```bash
+ uv run --extra modal modal app list
+ uv run --extra modal modal app logs <app-id>
+ ```
+
+ Launch long-running parallel jobs with both Modal CLI detach and the launcher
+ detach flag. The CLI-level `--detach` keeps the remote function alive after the
+ local entrypoint exits; the launcher `--detach` prevents the parent Modal
+ function from waiting on the GPU call.
+
+ ```bash
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
+   --max-steps 300 \
+   --dataset-size 64 \
+   --num-generations 8 \
+   --max-completion-length 768 \
+   --difficulty 0 \
+   --trace-log-every 10 \
+   --seed-start 10000 \
+   --detach
+ ```
+
+ For multiple concurrent experiments:
+
+ - Use a unique `--seed-start` range for every run, normally spaced by at least
+   10,000 seeds.
+ - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
+   scenarios during training.
+ - Do not run `prepare-cache --cache-force` while training jobs are active.
+ - Keep `--push-to-hub` disabled unless each run has a unique
+   `--output-repo-id`.
+ - Let the launcher generate unique timestamped Trackio run names, or set an
+   explicit `RUN_NAME` only when it is globally unique.
+ - Use the same Trackio Space/project for comparable metrics, but never reuse a
+   run name.
+ - Treat `CyberSecurity_OWASP-model-cache` and
+   `CyberSecurity_OWASP-scenario-cache` as shared read-mostly infrastructure
+   during training. Run outputs and checkpoints should stay under each run's
+   unique output directory.
+
+ If a Windows shell fails with a Unicode `charmap` encoding error during Modal
+ startup, rerun with UTF-8 enabled for that command:
+
+ ```powershell
+ $env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
+ ```
+
If running from a public repository and you do not want Modal to package the
local workspace, use public source mode:

reward_config.py CHANGED
@@ -2,6 +2,8 @@
 
from __future__ import annotations

+ import hashlib
+ import json
import os
from dataclasses import dataclass
from pathlib import Path
@@ -88,6 +90,106 @@ def load_reward_settings(path: str | Path | None = None) -> RewardSettings:
    return settings


+ def flatten_reward_config(
+     settings: RewardSettings | None = None,
+ ) -> list[dict[str, Any]]:
+     """Return display-friendly reward config rows for tracking dashboards."""
+
+     settings = settings or load_reward_settings()
+     rows: list[dict[str, Any]] = []
+     for key in sorted(settings.raw):
+         entry = settings.raw[key]
+         if not isinstance(entry, dict):
+             continue
+         has_resolved_value = "value" in entry or settings.stage in entry
+         rows.append(
+             {
+                 "key": key,
+                 "value": _empty_if_missing(entry.get("value")),
+                 "stage_value": _empty_if_missing(entry.get(settings.stage)),
+                 "resolved": settings.value(key, 0.0) if has_resolved_value else "",
+                 "cap": _empty_if_missing(entry.get("cap")),
+                 "threshold": _empty_if_missing(
+                     entry.get("threshold", entry.get("threshold_lines"))
+                 ),
+                 "severe_threshold": _empty_if_missing(
+                     entry.get("severe_threshold", entry.get("severe_threshold_lines"))
+                 ),
+                 "terminate": bool(entry.get("terminate", False)),
+                 "description": str(entry.get("description", "")),
+             }
+         )
+     return rows
+
+
+ def reward_config_hash(settings: RewardSettings | None = None) -> str:
+     """Return a deterministic hash for the effective reward configuration."""
+
+     settings = settings or load_reward_settings()
+     payload = {
+         "mode": settings.mode,
+         "training_mode": settings.training_mode,
+         "stage": settings.stage,
+         "shaping_weight": settings.shaping_weight,
+         "raw": _strip_descriptions(settings.raw),
+     }
+     encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
+     return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
+
+
+ def reward_config_summary(settings: RewardSettings | None = None) -> dict[str, Any]:
+     """Return reward config identity and flattened rows for run metadata."""
+
+     settings = settings or load_reward_settings()
+     config_hash = reward_config_hash(settings)
+     source = Path(settings.source_path)
+     return {
+         "reward_config_id": (
+             f"{source.stem}-{settings.mode}-{settings.stage}-{config_hash[:12]}"
+         ),
+         "reward_config_hash": config_hash,
+         "reward_config_source": str(source),
+         "reward_config_source_name": source.name,
+         "reward_mode": settings.mode,
+         "reward_training_mode": settings.training_mode,
+         "reward_stage": settings.stage,
+         "reward_shaping_weight": settings.shaping_weight,
+         "reward_entries": flatten_reward_config(settings),
+     }
+
+
+ def reward_config_run_config(settings: RewardSettings | None = None) -> dict[str, Any]:
+     """Return compact reward config fields safe to store in Trackio run config."""
+
+     summary = reward_config_summary(settings)
+     reward_values = {
+         str(row["key"]): {
+             key: value
+             for key, value in row.items()
+             if key != "key" and value != ""
+         }
+         for row in summary["reward_entries"]
+     }
+     config = {
+         "reward_config_id": summary["reward_config_id"],
+         "reward_config_hash": summary["reward_config_hash"],
+         "reward_config_source": summary["reward_config_source"],
+         "reward_config_source_name": summary["reward_config_source_name"],
+         "reward_mode": summary["reward_mode"],
+         "reward_training_mode": summary["reward_training_mode"],
+         "reward_stage": summary["reward_stage"],
+         "reward_shaping_weight": summary["reward_shaping_weight"],
+         "reward_config_values": reward_values,
+         "reward_config_values_json": json.dumps(reward_values, sort_keys=True),
+     }
+     for reward_key, values in reward_values.items():
+         safe_reward_key = _config_key_safe(reward_key)
+         for field, value in values.items():
+             if isinstance(value, (int, float, bool)):
+                 config[f"reward_config__{safe_reward_key}__{field}"] = value
+     return config
+
+
def validate_reward_settings(settings: RewardSettings) -> None:
    if settings.mode not in REWARD_MODES:
        raise ValueError("reward.mode must be dense_train or sparse_eval")
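
The determinism in `reward_config_hash` comes from canonical JSON encoding (sorted keys, fixed separators) fed into SHA-256. A standalone sketch of just that property, with an illustrative payload shaped like the one the function builds:

```python
import hashlib
import json

# Illustrative payload; the real one carries mode, training_mode, stage,
# shaping_weight, and the description-stripped raw config.
payload = {"mode": "dense_train", "stage": "middle", "raw": {"step_penalty": {"middle": -0.01}}}

encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
digest = hashlib.sha256(encoded.encode("utf-8")).hexdigest()

# Key order in the source dict does not matter; the digest is byte-identical.
reordered = {"raw": {"step_penalty": {"middle": -0.01}}, "stage": "middle", "mode": "dense_train"}
assert hashlib.sha256(
    json.dumps(reordered, sort_keys=True, separators=(",", ":"), default=str).encode("utf-8")
).hexdigest() == digest
print(len(digest))  # 64 hex chars, as asserted in tests/test_reward_config.py
```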
@@ -103,6 +205,26 @@ def validate_reward_settings(settings: RewardSettings) -> None:
            raise ValueError(f"reward.{key}.description is required")


+ def _empty_if_missing(value: Any) -> Any:
+     return "" if value is None else value
+
+
+ def _strip_descriptions(value: Any) -> Any:
+     if isinstance(value, dict):
+         return {
+             str(key): _strip_descriptions(item)
+             for key, item in value.items()
+             if key != "description"
+         }
+     if isinstance(value, list):
+         return [_strip_descriptions(item) for item in value]
+     return value
+
+
+ def _config_key_safe(value: str) -> str:
+     return "".join(char if char.isalnum() or char == "_" else "_" for char in value).strip("_")
+
+
def compute_token_penalty(
    completion_tokens: int,
    settings: RewardSettings | None = None,
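
Behavior of the three helpers added above, doctest-style (the module path matches the imports used elsewhere in this commit; the example values are illustrative):

```python
from CyberSecurity_OWASP.reward_config import (  # private helpers from this commit
    _config_key_safe,
    _empty_if_missing,
    _strip_descriptions,
)

assert _empty_if_missing(None) == ""  # missing fields render as blanks in tables
assert _empty_if_missing(0.3) == 0.3
assert _config_key_safe("token.penalty") == "token_penalty"  # safe Trackio config keys
assert _strip_descriptions(
    {"step_penalty": {"value": -0.01, "description": "per-step cost"}}
) == {"step_penalty": {"value": -0.01}}  # prose never affects the hash
```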
scripts/modal_ephemeral_train.py CHANGED
@@ -131,6 +131,7 @@ def run_ephemeral_smoke(
    _configure_scenario_cache_env(required=True)
    from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
    from CyberSecurity_OWASP.config import load_scenario_authoring_config
+     from CyberSecurity_OWASP.reward_config import load_reward_settings
    from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
        CybersecurityOwaspEnvironment,
    )
@@ -140,7 +141,9 @@ def run_ephemeral_smoke(
        aggregate_episode_metrics,
        episode_record_from_state,
        log_episode_batch,
+         log_reward_config,
        log_trackio_metrics,
+         reward_config_trackio_config,
        trace_table_rows,
        trackio_run,
    )
@@ -162,10 +165,13 @@ def run_ephemeral_smoke(
 
    baseline = []
    oracle = []
+     reward_settings = load_reward_settings()
+     reward_tracking_config = reward_config_trackio_config(reward_settings)
    run_context = {
        "algo": "modal_ephemeral_smoke",
        "reward_version": "reward_v2",
        "env_version": "0.1.0",
+         **reward_tracking_config,
    }

    for offset in range(episodes):
@@ -274,6 +280,7 @@ def run_ephemeral_smoke(
        "tracking_trace_rows": trace_table_rows(episode_records),
        "baseline": baseline,
        "oracle": oracle,
+         **reward_tracking_config,
    }
    with trackio_run(
        run_name=run_name,
@@ -284,9 +291,11 @@ def run_ephemeral_smoke(
            "episodes": episodes,
            "seed_start": seed_start,
            "mode": "smoke",
+             **reward_tracking_config,
        },
        group="smoke",
    ):
+         log_reward_config(reward_settings, step=0)
        logged_metrics = log_episode_batch(episode_records, step=0)
        log_trackio_metrics(
            {
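
Condensed, the wiring these five hunks add is: load the settings once, derive the Trackio identity fields, merge them into every config payload, and emit the step-0 table. A sketch of that ordering (assuming `trackio_run` accepts this keyword subset; the real calls above pass additional arguments):

```python
from CyberSecurity_OWASP.reward_config import load_reward_settings
from training.trackio_utils import (
    log_reward_config,
    reward_config_trackio_config,
    trackio_run,
)

reward_settings = load_reward_settings()
reward_tracking_config = reward_config_trackio_config(reward_settings)

with trackio_run(
    run_name="smoke-demo",  # hypothetical run name
    config={"mode": "smoke", **reward_tracking_config},
    group="smoke",
):
    # Step-0 logging makes the run identifiable by reward_config_id/hash from the start.
    log_reward_config(reward_settings, step=0)
```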
scripts/modal_train_grpo.py CHANGED
@@ -16,6 +16,7 @@ Example:
 
from __future__ import annotations

+ import json
import os
import pathlib
import subprocess
@@ -45,6 +46,7 @@ PROJECT_ROOT = pathlib.Path(__file__).resolve().parents[1]
PUBLIC_REPO_URL = "https://github.com/humandotlearning/CyberSecurity_OWASP.git"
PUBLIC_REPO_BRANCH = "master"
DEFAULT_GEMMA_MODEL = "unsloth/gemma-4-E2B-it"
+ GRPO_TRAINING_TIMEOUT_SECONDS = 24 * 60 * 60
_IMAGE_NOTICE_PRINTED = False

@@ -120,6 +122,56 @@ def _validate_vllm_config(*, use_vllm: bool, vllm_gpu_memory_utilization: float)
        )


+ def _extract_first_json_object(text: str) -> dict[str, Any] | None:
+     stripped = text.strip()
+     candidates = [stripped]
+     if "```" in stripped:
+         for part in stripped.split("```"):
+             part = part.strip()
+             if part.startswith("json"):
+                 part = part[4:].strip()
+             candidates.append(part)
+
+     for candidate in candidates:
+         try:
+             loaded = json.loads(candidate)
+         except Exception:
+             continue
+         if isinstance(loaded, dict):
+             return loaded
+
+     start = stripped.find("{")
+     while start >= 0:
+         depth = 0
+         in_string = False
+         escaped = False
+         for index in range(start, len(stripped)):
+             char = stripped[index]
+             if in_string:
+                 if escaped:
+                     escaped = False
+                 elif char == "\\":
+                     escaped = True
+                 elif char == '"':
+                     in_string = False
+                 continue
+             if char == '"':
+                 in_string = True
+             elif char == "{":
+                 depth += 1
+             elif char == "}":
+                 depth -= 1
+                 if depth == 0:
+                     try:
+                         loaded = json.loads(stripped[start : index + 1])
+                     except Exception:
+                         break
+                     if isinstance(loaded, dict):
+                         return loaded
+         start = stripped.find("{", start + 1)
+     return None
+
+
def _configure_modal_cache_env() -> dict[str, str]:
    values = {
        "HF_HOME": str(HF_HOME_DIR),
@@ -410,7 +462,6 @@ def verify_modal_scenario_cache_for_training(
    from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
        CybersecurityOwaspEnvironment,
    )
-     from CyberSecurity_OWASP.reward_config import compute_token_penalty
    from CyberSecurity_OWASP.server.curriculum import CurriculumController
    from CyberSecurity_OWASP.server.scenario_cache import ScenarioCache

@@ -514,6 +565,436 @@ def check_training_imports() -> dict[str, str]:
    volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume, SCENARIO_CACHE_DIR: scenario_cache_volume},
    secrets=secrets,
)
+ def run_cybersecurity_owasp_baseline(
+     max_steps: int = 50,
+     dataset_size: int = 1,
+     difficulty: int = 0,
+     split: str = "train",
+     model_name: str = DEFAULT_GEMMA_MODEL,
+     max_seq_length: int = 4096,
+     max_completion_length: int = 768,
+     trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
+     trackio_project: str = "CyberSecurity_OWASP-grpo",
+     num_generations: int = 1,
+     trace_log_every: int = 1,
+     seed_start: int = 0,
+     git_sha: str = "nogit",
+     run_name: str = "baseline",
+     source_mode: str = "local",
+     repo_url: str = PUBLIC_REPO_URL,
+     repo_branch: str = PUBLIC_REPO_BRANCH,
+ ) -> dict[str, str | int | float]:
+     import statistics
+     import time
+
+     import torch
+     from huggingface_hub import snapshot_download, whoami
+     from unsloth import FastVisionModel
+     import transformers.utils.hub as transformers_hub
+
+     from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
+     from CyberSecurity_OWASP.config import load_scenario_authoring_config
+     from CyberSecurity_OWASP.reward_config import load_reward_settings
+     from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
+         CybersecurityOwaspEnvironment,
+     )
+     from CyberSecurity_OWASP.server.curriculum import CurriculumController
+     from CyberSecurity_OWASP.server.scenario_cache import ScenarioCache
+     from training.trackio_utils import (
+         aggregate_episode_metrics,
+         episode_record_from_state,
+         log_reward_config,
+         log_trace_table,
+         log_trackio_metrics,
+         reward_config_trackio_config,
+         trackio_run,
+     )
+
+     model_name = _ensure_gemma4_model(model_name)
+     if int(num_generations) != 1:
+         raise ValueError("Baseline mode runs the untrained model with --num-generations 1.")
+
+     cache_env = _configure_modal_cache_env()
+     scenario_cache_env = _configure_scenario_cache_env(required=True)
+     transformers_hub.TRANSFORMERS_CACHE = cache_env["HF_HUB_CACHE"]
+     hf_token = os.environ.get("HF_TOKEN")
+     if not hf_token:
+         raise RuntimeError(f"HF_TOKEN is missing from the Modal secret {SECRET_NAME}.")
+     try:
+         whoami(token=hf_token)
+     except Exception as exc:
+         raise RuntimeError("HF_TOKEN could not be validated before baseline run.") from exc
+
+     os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
+     os.environ["TRACKIO_PROJECT"] = trackio_project
+     reward_settings = load_reward_settings()
+     reward_tracking_config = reward_config_trackio_config(reward_settings)
+     run_name = run_name or "baseline"
+     output_dir = RUNS_DIR / run_name
+     output_dir.mkdir(parents=True, exist_ok=True)
+
+     try:
+         cache_volume.reload()
+         print(f"Reloaded Modal model cache volume: {CACHE_VOLUME_NAME}")
+     except Exception as exc:
+         print(f"Model cache volume reload skipped: {exc!r}")
+     try:
+         scenario_cache_volume.reload()
+         print(f"Reloaded Modal scenario cache volume: {SCENARIO_CACHE_VOLUME_NAME}")
+     except Exception as exc:
+         print(f"Scenario cache volume reload skipped: {exc!r}")
+
+     settings = load_scenario_authoring_config()
+     scenario_profile = CurriculumController(settings=settings).select_profile(
+         seed=seed_start,
+         split=split,
+         requested_difficulty=difficulty,
+     )
+     resolved_difficulty = int(scenario_profile["difficulty"])
+     scenario_cache = ScenarioCache(SCENARIO_CACHE_DIR, settings=settings)
+     coverage = scenario_cache.assert_coverage(
+         split=split,
+         difficulty=resolved_difficulty,
+     )
+     entries = scenario_cache.validated_entries(
+         split=split,
+         difficulty=resolved_difficulty,
+     ) or scenario_cache.validated_entries(split=split)
+     if not entries:
+         raise RuntimeError(f"No validated scenario cache entries found for split={split!r}.")
+
+     print(f"Baseline run name: {run_name}")
+     print(f"Source mode: {source_mode}")
+     if source_mode == "public":
+         print(f"Installed CyberSecurity_OWASP from public repo: {repo_url}@{repo_branch}")
+     else:
+         print("Packaged local CyberSecurity_OWASP repo.")
+     print(f"Trackio Space: {trackio_space_id}")
+     print(f"Trackio Project: {trackio_project}")
+     print(f"Reward config: {reward_tracking_config['reward_config_id']}")
+     print(f"Reward config hash: {reward_tracking_config['reward_config_hash']}")
+     print(f"Scenario cache dir: {scenario_cache_env['CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR']}")
+     print(f"Scenario cache coverage: {coverage}")
+     print(
+         "Baseline generation config: "
+         f"episodes={dataset_size}, max_episode_steps={max_steps}, "
+         f"num_generations={num_generations}, max_completion_length={max_completion_length}, "
+         f"trace_log_every={trace_log_every}"
+     )
+
+     expected_model_cache = _hf_model_cache_path(model_name)
+     print(f"Expected HF model cache path: {expected_model_cache}")
+     print(f"Model cache hit before load: {expected_model_cache.exists()}")
+     try:
+         snapshot_path = snapshot_download(
+             repo_id=model_name,
+             cache_dir=str(HF_HUB_CACHE_DIR),
+             token=hf_token,
+         )
+         print(f"Model snapshot ready: {snapshot_path}")
+         cache_volume.commit()
+     except Exception as exc:
+         print(f"Explicit model snapshot prefetch failed; loading directly. Error: {exc!r}")
+
+     model_api = FastVisionModel
+     model, tokenizer = model_api.from_pretrained(
+         model_name=model_name,
+         max_seq_length=max_seq_length,
+         load_in_4bit=False,
+         fast_inference=False,
+         cache_dir=str(HF_HUB_CACHE_DIR),
+         token=hf_token,
+     )
+     if hasattr(model_api, "for_inference"):
+         model_api.for_inference(model)
+     model.eval()
+     cache_volume.commit()
+     device = next(model.parameters()).device
+     text_tokenizer = getattr(tokenizer, "tokenizer", tokenizer)
+
+     def render_prompt(observation, actions: list[dict[str, Any]]) -> str:
+         recent_actions = actions[-8:]
+         return (
+             "You are the untrained baseline model for a defensive local AppSec "
+             "repair environment. Use only the listed local tools. Return exactly "
+             "one JSON object and no markdown.\n\n"
+             f"{observation.scenario_prompt}\n\n"
+             f"Current phase: {observation.phase}\n"
+             f"Available actions: {observation.available_actions}\n"
+             f"Last tool result: {observation.last_tool_result}\n"
+             f"Recent actions: {json.dumps(recent_actions, sort_keys=True)}\n\n"
+             'Required format: {"tool_name":"inspect_policy_graph","arguments":{}}'
+         )
+
+     def generate_action_text(prompt: str) -> tuple[str, list[int], list[int]]:
+         messages = [{"role": "user", "content": prompt}]
+         prompt_text = prompt
+         for candidate in (tokenizer, text_tokenizer):
+             if hasattr(candidate, "apply_chat_template"):
+                 try:
+                     prompt_text = candidate.apply_chat_template(
+                         messages,
+                         tokenize=False,
+                         add_generation_prompt=True,
+                     )
+                     break
+                 except Exception:
+                     prompt_text = prompt
+         encode = tokenizer
+         try:
+             inputs = encode(
+                 prompt_text,
+                 return_tensors="pt",
+                 truncation=True,
+                 max_length=max_seq_length,
+             )
+         except Exception:
+             inputs = text_tokenizer(
+                 prompt_text,
+                 return_tensors="pt",
+                 truncation=True,
+                 max_length=max_seq_length,
+             )
+         if hasattr(inputs, "to"):
+             inputs = inputs.to(device)
+         else:
+             inputs = {
+                 key: value.to(device) if hasattr(value, "to") else value
+                 for key, value in inputs.items()
+             }
+         input_ids = inputs.get("input_ids")
+         input_len = int(input_ids.shape[-1]) if input_ids is not None else 0
+         pad_token_id = getattr(text_tokenizer, "pad_token_id", None)
+         if pad_token_id is None:
+             pad_token_id = getattr(text_tokenizer, "eos_token_id", None)
+         with torch.inference_mode():
+             generated = model.generate(
+                 **inputs,
+                 max_new_tokens=max_completion_length,
+                 do_sample=False,
+                 pad_token_id=pad_token_id,
+             )
+         output_ids = generated[0]
+         completion_ids = output_ids[input_len:]
+         decode = getattr(text_tokenizer, "decode", None) or getattr(tokenizer, "decode")
+         text = decode(completion_ids, skip_special_tokens=True)
+         prompt_ids = (
+             [int(item) for item in input_ids[0].detach().cpu().tolist()]
+             if input_ids is not None
+             else []
+         )
+         return text, prompt_ids, [int(item) for item in completion_ids.detach().cpu().tolist()]
+
+     def action_from_completion(raw_text: str) -> tuple[CyberSecurityOWASPAction, str | None]:
+         loaded = _extract_first_json_object(raw_text)
+         if loaded is None:
+             return CyberSecurityOWASPAction(tool_name="noop", arguments={}), "invalid_json"
+         arguments = loaded.get("arguments", {})
+         if not isinstance(arguments, dict):
+             arguments = {}
+         payload = {
+             "tool_name": loaded.get("tool_name", "noop"),
+             "arguments": arguments,
+         }
+         try:
+             return CyberSecurityOWASPAction(**payload), None
+         except Exception as exc:
+             return (
+                 CyberSecurityOWASPAction(tool_name="noop", arguments={}),
+                 f"invalid_action_schema: {exc}",
+             )
+
+     episode_records: list[dict[str, Any]] = []
+     raw_traces: list[dict[str, Any]] = []
+     invalid_model_outputs = 0
+     generation_started = time.monotonic()
+     config = {
+         "base_model": model_name,
+         "algo": "baseline",
+         "difficulty": difficulty,
+         "split": split,
+         "max_episode_steps": max_steps,
+         "dataset_size": dataset_size,
+         "num_generations": num_generations,
+         "max_completion_length": max_completion_length,
+         "git_sha": git_sha,
+         **reward_tracking_config,
+     }
+
+     with trackio_run(
+         run_name=run_name,
+         run_type="baseline",
+         config=config,
+         project=trackio_project,
+         space_id=trackio_space_id,
+         group="baseline",
+         auto_log_gpu=True,
+     ):
+         log_reward_config(reward_settings, step=0)
+         for episode_index in range(max(1, int(dataset_size))):
+             entry = entries[(seed_start + episode_index) % len(entries)]
+             env = CybersecurityOwaspEnvironment()
+             try:
+                 observation = env.reset(
+                     seed=int(entry["seed"]),
+                     split=str(entry["split"]),
+                     difficulty=int(entry["difficulty"]),
+                 )
+                 env.state.max_steps = int(max_steps)
+                 actions: list[dict[str, Any]] = []
+                 model_steps: list[dict[str, Any]] = []
+                 prompt_token_count = 0
+                 completion_token_count = 0
+
+                 for step_index in range(int(max_steps)):
+                     if observation.done:
+                         break
+                     prompt = render_prompt(observation, actions)
+                     raw_text, prompt_ids, completion_ids = generate_action_text(prompt)
+                     prompt_token_count += len(prompt_ids)
+                     completion_token_count += len(completion_ids)
+                     action, invalid_reason = action_from_completion(raw_text)
+                     if invalid_reason:
+                         invalid_model_outputs += 1
+                     observation = env.step(action)
+                     action_dump = action.model_dump()
+                     actions.append(action_dump)
+                     model_steps.append(
+                         {
+                             "step": step_index + 1,
+                             "raw_completion": raw_text,
+                             "action": action_dump,
+                             "invalid_model_output": invalid_reason,
+                             "observation_message": observation.message,
+                             "reward": observation.reward,
+                             "done": observation.done,
+                         }
+                     )
+
+                 env.state.completion_tokens = completion_token_count
+                 env.state.metrics["prompt_tokens"] = prompt_token_count
+                 env.state.metrics["completion_tokens"] = completion_token_count
+                 final_observation = observation.model_dump()
+                 record = episode_record_from_state(
+                     env.state,
+                     run_context={
+                         "base_model": model_name,
+                         "algo": "baseline",
+                         "reward_version": "reward_v2",
+                         "env_version": "0.1.0",
+                         **reward_tracking_config,
+                     },
+                     final_observation=final_observation,
+                 )
+                 record.update(
+                     {
+                         "reward_total": float(env.state.accumulated_reward),
+                         "success": bool(env.state.success),
+                         "episode_length": int(env.state.step_count),
+                         "invalid_model_output_count": sum(
+                             1 for item in model_steps if item["invalid_model_output"]
+                         ),
+                         "prompt_tokens": prompt_token_count,
+                         "completion_tokens": completion_token_count,
+                     }
+                 )
+                 episode_records.append(record)
+                 raw_traces.append(
+                     {
+                         "episode_index": episode_index,
+                         "task_id": env.state.task_id,
+                         "seed": env.state.seed,
+                         "split": env.state.split,
+                         "difficulty": env.state.difficulty,
+                         "domain": env.state.domain,
+                         "bug_family": env.state.bug_family,
+                         "steps": model_steps,
+                     }
+                 )
+             finally:
+                 env.close()
+
+             metrics = aggregate_episode_metrics(episode_records)
+             metrics.update(
+                 {
+                     "baseline/episode_count": float(len(episode_records)),
+                     "baseline/reward_total_mean": statistics.mean(
+                         float(item.get("reward_total", 0.0)) for item in episode_records
+                     ),
+                     "baseline/success_rate": statistics.mean(
+                         1.0 if item.get("success") else 0.0 for item in episode_records
+                     ),
+                     "baseline/invalid_model_output_rate": invalid_model_outputs
+                     / max(1.0, sum(float(item.get("episode_length", 0)) for item in episode_records)),
+                     "baseline/num_generations": float(num_generations),
+                     "baseline/max_episode_steps": float(max_steps),
+                     "baseline/max_completion_length": float(max_completion_length),
+                 }
+             )
+             log_trackio_metrics(metrics, step=episode_index + 1)
+             if trace_log_every > 0 and (
+                 episode_index == 0 or (episode_index + 1) % trace_log_every == 0
+             ):
+                 log_trace_table(
+                     [episode_records[-1]],
+                     table_name="baseline_traces",
+                     step=episode_index + 1,
+                 )
+
+     elapsed_s = time.monotonic() - generation_started
+     summary = {
+         "run_name": run_name,
+         "trackio_space_id": trackio_space_id,
+         "trackio_project": trackio_project,
+         "model_name": model_name,
+         "dataset_size": len(episode_records),
+         "max_episode_steps": int(max_steps),
+         "difficulty": int(difficulty),
+         "split": split,
+         "num_generations": int(num_generations),
+         "max_completion_length": int(max_completion_length),
+         "mean_reward": (
+             statistics.mean(float(item.get("reward_total", 0.0)) for item in episode_records)
+             if episode_records
+             else 0.0
+         ),
+         "success_rate": (
+             statistics.mean(1.0 if item.get("success") else 0.0 for item in episode_records)
+             if episode_records
+             else 0.0
+         ),
+         "invalid_model_output_count": int(invalid_model_outputs),
+         "elapsed_s": elapsed_s,
+         **reward_tracking_config,
+     }
+     artifact_path = output_dir / "baseline_rollouts.json"
+     artifact_path.write_text(
+         json.dumps(
+             {
+                 "summary": summary,
+                 "episodes": episode_records,
+                 "raw_traces": raw_traces,
+             },
+             indent=2,
+             sort_keys=True,
+             default=str,
+         ),
+         encoding="utf-8",
+     )
+     volume.commit()
+     cache_volume.commit()
+     scenario_cache_volume.commit()
+     print(f"Baseline artifact saved to {artifact_path}")
+     return {**summary, "artifact_path": str(artifact_path)}
+
+
+ @app.function(
+     image=training_image,
+     gpu="L4",
+     timeout=GRPO_TRAINING_TIMEOUT_SECONDS,
+     volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume, SCENARIO_CACHE_DIR: scenario_cache_volume},
+     secrets=secrets,
+ )
def train_cybersecurity_owasp_grpo(
    env_repo_id: str = "",
    output_repo_id: str = "",
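
One behavioral note on the baseline loop above: `action_from_completion` is a closure inside the function, so its contract is easiest to state as expected input/output pairs (illustrative expectations, not importable asserts):

```python
# One JSON object per completion, or the step degrades to a scored "noop".
good = '{"tool_name": "inspect_policy_graph", "arguments": {}}'
bad = "Let me think about which tool to call first."

# action_from_completion(good) -> (CyberSecurityOWASPAction(tool_name="inspect_policy_graph",
#                                   arguments={}), None)
# action_from_completion(bad)  -> (CyberSecurityOWASPAction(tool_name="noop",
#                                   arguments={}), "invalid_json")
# Every non-None reason increments invalid_model_outputs, which feeds
# baseline/invalid_model_output_rate in Trackio.
```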
@@ -580,16 +1061,21 @@ def train_cybersecurity_owasp_grpo(
    from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
        CybersecurityOwaspEnvironment,
    )
-     from CyberSecurity_OWASP.reward_config import compute_token_penalty
+     from CyberSecurity_OWASP.reward_config import (
+         compute_token_penalty,
+         load_reward_settings,
+     )
    from CyberSecurity_OWASP.server.curriculum import CurriculumController
    from CyberSecurity_OWASP.server.scenario_cache import ScenarioCache
    from training.trackio_utils import (
        aggregate_episode_metrics,
        episode_record_from_state,
        episode_trace_fingerprint,
+         log_reward_config,
        log_gpu_metrics,
        log_trace_table,
        log_trackio_metrics,
+         reward_config_trackio_config,
        train_metric_aliases,
    )
    from training.grpo_curriculum import (
@@ -625,6 +1111,8 @@ def train_cybersecurity_owasp_grpo(
    os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
    os.environ["TRACKIO_PROJECT"] = trackio_project
    os.environ.setdefault("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
+     reward_settings = load_reward_settings()
+     reward_tracking_config = reward_config_trackio_config(reward_settings)

    model_slug = model_name.replace("/", "-")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
@@ -761,6 +1249,10 @@ def train_cybersecurity_owasp_grpo(
                    "scenario_group_id": self.scenario_group_id,
                    "scenario_assignment": dict(self.scenario_assignment),
                    "scenario_prompt_length": len(obs.scenario_prompt),
+                     "reward_config_id": reward_tracking_config["reward_config_id"],
+                     "reward_config_hash": reward_tracking_config["reward_config_hash"],
+                     "reward_stage": reward_tracking_config["reward_stage"],
+                     "reward_mode": reward_tracking_config["reward_mode"],
                }
            )
            return obs.scenario_prompt
@@ -1012,6 +1504,7 @@ def train_cybersecurity_owasp_grpo(
                "algo": "grpo",
                "reward_version": "reward_v2",
                "env_version": "0.1.0",
+                 **reward_tracking_config,
            },
        )
        record.update(
@@ -1116,6 +1609,10 @@ def train_cybersecurity_owasp_grpo(
                "trace_fingerprint": fingerprint,
                "num_generations": num_generations,
                "run_name": run_name,
+                 "reward_config_id": reward_tracking_config["reward_config_id"],
+                 "reward_config_hash": reward_tracking_config["reward_config_hash"],
+                 "reward_stage": reward_tracking_config["reward_stage"],
+                 "reward_mode": reward_tracking_config["reward_mode"],
            }
        )
        try:
@@ -1148,6 +1645,7 @@ def train_cybersecurity_owasp_grpo(
    class TrackioSystemMetricsCallback(TrainerCallback):
        def on_train_begin(self, args, state, control, **kwargs):
            try:
+                 reward_summary = log_reward_config(reward_settings, step=int(state.global_step or 0))
                metrics = log_gpu_metrics(step=int(state.global_step or 0))
                log_trackio_metrics(
                    {
@@ -1160,8 +1658,13 @@ def train_cybersecurity_owasp_grpo(
                    },
                    step=int(state.global_step or 0),
                )
+                 print(
+                     "Trackio reward config logged: "
+                     f"{reward_summary['reward_config_id']} "
+                     f"({reward_summary['reward_config_hash']})"
+                 )
            except Exception as exc:
-                 print(f"Trackio GPU metrics initialization skipped: {exc!r}")
+                 print(f"Trackio initialization metrics skipped: {exc!r}")
                return control
            if metrics:
                system_summary = ", ".join(
@@ -1199,6 +1702,8 @@ def train_cybersecurity_owasp_grpo(
    print(f"Trackio Project: {trackio_project}")
    print(f"Output repo: {output_repo_id}")
    print(f"Run name: {run_name}")
+     print(f"Reward config: {reward_tracking_config['reward_config_id']}")
+     print(f"Reward config hash: {reward_tracking_config['reward_config_hash']}")
    print(f"Model cache volume: {CACHE_VOLUME_NAME}")
    print(f"Scenario cache volume: {SCENARIO_CACHE_VOLUME_NAME}")
    print(f"Scenario cache dir: {scenario_cache_env['CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR']}")
@@ -1451,6 +1956,7 @@ def train_cybersecurity_owasp_grpo(
        "push_to_hub": push_to_hub,
        "scenario_cache_volume": SCENARIO_CACHE_VOLUME_NAME,
        "scenario_cache_mode": "require",
+         **reward_tracking_config,
    }

@@ -1506,8 +2012,46 @@ def main(
        result = check_training_imports.remote()
        print(result)
        return
+     if mode == "baseline":
+         if int(num_generations) != 1:
+             raise ValueError("baseline mode expects --num-generations 1.")
+         trace_log_every = max(0, int(trace_log_every))
+         run_name = run_name or "baseline"
+         preflight = verify_modal_scenario_cache_for_training.remote(
+             split=split,
+             difficulty=difficulty,
+             dataset_size=dataset_size,
+             seed_start=seed_start,
+         )
+         print(f"CPU scenario cache preflight passed: {preflight}")
+         kwargs = dict(
+             max_steps=max_steps,
+             dataset_size=dataset_size,
+             difficulty=difficulty,
+             split=split,
+             model_name=model_name,
+             max_seq_length=max_seq_length,
+             max_completion_length=max_completion_length,
+             trackio_space_id=trackio_space_id,
+             trackio_project=trackio_project,
+             num_generations=num_generations,
+             trace_log_every=trace_log_every,
+             seed_start=seed_start,
+             git_sha=git_sha,
+             run_name=run_name,
+             source_mode=source_mode,
+             repo_url=repo_url,
+             repo_branch=repo_branch,
+         )
+         if detach:
+             call = run_cybersecurity_owasp_baseline.spawn(**kwargs)
+             print(f"Spawned Modal baseline call: {call.object_id}")
+         else:
+             result = run_cybersecurity_owasp_baseline.remote(**kwargs)
+             print(f"Baseline result: {result}")
+         return
    if mode != "train":
-         raise ValueError("mode must be 'prepare-cache', 'train', or 'config'")
+         raise ValueError("mode must be 'prepare-cache', 'train', 'baseline', or 'config'")

    (
        resolved_gradient_accumulation_steps,
tests/test_reward_config.py CHANGED
@@ -4,7 +4,11 @@ import pytest
 
from CyberSecurity_OWASP.reward_config import (
    compute_token_penalty,
+     flatten_reward_config,
    load_reward_settings,
+     reward_config_hash,
+     reward_config_run_config,
+     reward_config_summary,
)

@@ -32,6 +36,38 @@ def test_reward_config_env_overrides(monkeypatch):
    assert compute_token_penalty(850, settings) == -0.5


+ def test_reward_config_hash_and_flattened_values_are_deterministic(monkeypatch):
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "middle")
+
+     settings = load_reward_settings()
+     first_hash = reward_config_hash(settings)
+     second_hash = reward_config_hash(load_reward_settings())
+     summary = reward_config_summary(settings)
+     run_config = reward_config_run_config(settings)
+     rows = {row["key"]: row for row in flatten_reward_config(settings)}
+
+     assert first_hash == second_hash
+     assert len(first_hash) == 64
+     assert summary["reward_config_hash"] == first_hash
+     assert summary["reward_config_id"].endswith(first_hash[:12])
+     assert run_config["reward_config_hash"] == first_hash
+     assert run_config["reward_mode"] == "dense_train"
+     assert run_config["reward_stage"] == "middle"
+     assert run_config["reward_config_values"]["policy_inspected"]["value"] == 0.30
+     assert run_config["reward_config_values"]["shaping_weight"]["stage_value"] == 0.7
+     assert run_config["reward_config__policy_inspected__value"] == 0.30
+     assert run_config["reward_config__shaping_weight__stage_value"] == 0.7
+     assert "policy_inspected" in run_config["reward_config_values_json"]
+     assert rows["policy_inspected"]["value"] == 0.30
+     assert rows["shaping_weight"]["stage_value"] == 0.7
+     assert rows["shaping_weight"]["resolved"] == 0.7
+     assert rows["step_penalty"]["stage_value"] == -0.01
+     assert rows["oversized_patch"]["threshold"] == 80
+     assert rows["oversized_patch"]["severe_threshold"] == 180
+     assert rows["hidden_file_probe"]["terminate"] is True
+
+
def test_reward_config_rejects_missing_descriptions(monkeypatch):
    config_path = Path("outputs/test_reward_config_bad.yaml")
    config_path.parent.mkdir(parents=True, exist_ok=True)
tests/test_trackio_utils.py CHANGED
@@ -1,14 +1,20 @@
import json
+ import sys
+ import types

from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
+ from CyberSecurity_OWASP.reward_config import load_reward_settings
from training.trackio_utils import (
    CANONICAL_TRACKIO_SIGNALS,
    DERIVED_TRACKIO_METRICS,
+     REWARD_CONFIG_TABLE_COLUMNS,
    aggregate_episode_metrics,
    episode_record_from_state,
    episode_trace_fingerprint,
    episode_to_trace_row,
    episode_to_tracking_fields,
+     log_reward_config,
+     reward_config_scalar_metrics,
)

from .helpers import apply_secure_patch, make_env, secure_invoice_source, submit_valid_finding
@@ -129,3 +135,69 @@ def test_trace_fingerprint_ignores_episode_id_but_tracks_action_changes():
    assert episode_trace_fingerprint(base_record) == episode_trace_fingerprint(token_only_reward_change)
    assert episode_trace_fingerprint(base_record) != episode_trace_fingerprint(changed_trace)
    assert episode_trace_fingerprint(base_record) != episode_trace_fingerprint(different_scenario)
+
+
+ def test_log_reward_config_emits_scalar_values_and_table(monkeypatch):
+     logged: list[tuple[dict, int | None]] = []
+
+     class FakeTable:
+         def __init__(self, *, columns, data=None, rows=None, allow_mixed_types=False):
+             self.columns = columns
+             self.rows = data if data is not None else rows
+             self.data = self.rows
+             self.allow_mixed_types = allow_mixed_types
+
+     fake_trackio = types.SimpleNamespace(config={}, Table=FakeTable)
+
+     def fake_log(payload, step=None):
+         logged.append((payload, step))
+
+     fake_trackio.log = fake_log
+     monkeypatch.setitem(sys.modules, "trackio", fake_trackio)
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "early")
+
+     settings = load_reward_settings()
+     summary = log_reward_config(settings, step=0)
+
+     assert fake_trackio.config["reward_config_hash"] == summary["reward_config_hash"]
+     assert fake_trackio.config["reward_config_values"]["policy_inspected"]["value"] == 0.30
+     assert fake_trackio.config["reward_config__policy_inspected__value"] == 0.30
+     scalar_payload = next(payload for payload, _step in logged if "reward_config/policy_inspected/value" in payload)
+     assert scalar_payload["reward_config/policy_inspected/value"] == 0.30
+     assert scalar_payload["reward_config/shaping_weight/resolved"] == 1.0
+     assert scalar_payload["reward_config/invalid_action/value"] == -0.20
+     assert scalar_payload["reward_config/progressive_cap/value"] == 5.0
+     assert scalar_payload["reward_config/oversized_patch/severe_value"] == -1.0
+
+     table = next(payload["reward_config"] for payload, _step in logged if "reward_config" in payload)
+     assert table.columns == list(REWARD_CONFIG_TABLE_COLUMNS)
+     assert table.allow_mixed_types is True
+     rows = {row[0]: row for row in table.rows}
+     assert rows["policy_inspected"][1] == 0.30
+     assert rows["shaping_weight"][2] == 1.0
+     assert rows["hidden_file_probe"][6] is True
+
+     logged_text = json.dumps(
+         {
+             "summary": summary,
+             "scalar_payload": scalar_payload,
+             "table_rows": table.rows,
+         },
+         sort_keys=True,
+         default=str,
+     )
+     assert "owner_invoice_id" not in logged_text
+     assert "foreign_invoice_id" not in logged_text
+
+
+ def test_reward_config_scalar_metrics_uses_stage_resolved_values(monkeypatch):
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "late")
+
+     metrics = reward_config_scalar_metrics(load_reward_settings())
+
+     assert metrics["reward_config/shaping_weight/resolved"] == 0.4
+     assert metrics["reward_config/shaping_weight/stage_value"] == 0.4
+     assert metrics["reward_config/step_penalty/stage_value"] == -0.02
+     assert metrics["reward_config/token_penalty/target_tokens"] == 350.0
1
  import json
2
+ import sys
3
+ import types
4
 
5
  from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
6
+ from CyberSecurity_OWASP.reward_config import load_reward_settings
7
  from training.trackio_utils import (
8
  CANONICAL_TRACKIO_SIGNALS,
9
  DERIVED_TRACKIO_METRICS,
10
+ REWARD_CONFIG_TABLE_COLUMNS,
11
  aggregate_episode_metrics,
12
  episode_record_from_state,
13
  episode_trace_fingerprint,
14
  episode_to_trace_row,
15
  episode_to_tracking_fields,
16
+ log_reward_config,
17
+ reward_config_scalar_metrics,
18
  )
19
 
20
  from .helpers import apply_secure_patch, make_env, secure_invoice_source, submit_valid_finding
 
135
  assert episode_trace_fingerprint(base_record) == episode_trace_fingerprint(token_only_reward_change)
136
  assert episode_trace_fingerprint(base_record) != episode_trace_fingerprint(changed_trace)
137
  assert episode_trace_fingerprint(base_record) != episode_trace_fingerprint(different_scenario)
138
+
139
+
140
+ def test_log_reward_config_emits_scalar_values_and_table(monkeypatch):
141
+ logged: list[tuple[dict, int | None]] = []
142
+
143
+ class FakeTable:
144
+ def __init__(self, *, columns, data=None, rows=None, allow_mixed_types=False):
145
+ self.columns = columns
146
+ self.rows = data if data is not None else rows
147
+ self.data = self.rows
148
+ self.allow_mixed_types = allow_mixed_types
149
+
150
+ fake_trackio = types.SimpleNamespace(config={}, Table=FakeTable)
151
+
152
+ def fake_log(payload, step=None):
153
+ logged.append((payload, step))
154
+
155
+ fake_trackio.log = fake_log
156
+ monkeypatch.setitem(sys.modules, "trackio", fake_trackio)
157
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
158
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "early")
159
+
160
+ settings = load_reward_settings()
161
+ summary = log_reward_config(settings, step=0)
162
+
163
+ assert fake_trackio.config["reward_config_hash"] == summary["reward_config_hash"]
164
+ assert fake_trackio.config["reward_config_values"]["policy_inspected"]["value"] == 0.30
165
+ assert fake_trackio.config["reward_config__policy_inspected__value"] == 0.30
166
+ scalar_payload = next(payload for payload, _step in logged if "reward_config/policy_inspected/value" in payload)
167
+ assert scalar_payload["reward_config/policy_inspected/value"] == 0.30
168
+ assert scalar_payload["reward_config/shaping_weight/resolved"] == 1.0
169
+ assert scalar_payload["reward_config/invalid_action/value"] == -0.20
170
+ assert scalar_payload["reward_config/progressive_cap/value"] == 5.0
171
+ assert scalar_payload["reward_config/oversized_patch/severe_value"] == -1.0
172
+
173
+ table = next(payload["reward_config"] for payload, _step in logged if "reward_config" in payload)
174
+ assert table.columns == list(REWARD_CONFIG_TABLE_COLUMNS)
175
+ assert table.allow_mixed_types is True
176
+ rows = {row[0]: row for row in table.rows}
177
+ assert rows["policy_inspected"][1] == 0.30
178
+ assert rows["shaping_weight"][2] == 1.0
179
+ assert rows["hidden_file_probe"][6] is True
180
+
181
+ logged_text = json.dumps(
182
+ {
183
+ "summary": summary,
184
+ "scalar_payload": scalar_payload,
185
+ "table_rows": table.rows,
186
+ },
187
+ sort_keys=True,
188
+ default=str,
189
+ )
190
+ assert "owner_invoice_id" not in logged_text
191
+ assert "foreign_invoice_id" not in logged_text
192
+
193
+
194
+ def test_reward_config_scalar_metrics_uses_stage_resolved_values(monkeypatch):
195
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
196
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "late")
197
+
198
+ metrics = reward_config_scalar_metrics(load_reward_settings())
199
+
200
+ assert metrics["reward_config/shaping_weight/resolved"] == 0.4
201
+ assert metrics["reward_config/shaping_weight/stage_value"] == 0.4
202
+ assert metrics["reward_config/step_penalty/stage_value"] == -0.02
203
+ assert metrics["reward_config/token_penalty/target_tokens"] == 350.0
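The tests above hinge on planting a fake `trackio` module in `sys.modules` before the module under test performs its lazy import (via `_load_trackio`). A minimal standalone sketch of that mechanism, assuming nothing beyond the standard library (the `stub` object here is hypothetical, not the project's fixture):

```python
import sys
import types

# Python resolves `import trackio` through sys.modules before searching
# the import path, so a pre-registered object is returned as-is. pytest's
# monkeypatch.setitem(sys.modules, ...) uses the same trick with cleanup.
stub = types.SimpleNamespace(config={}, log=lambda payload, step=None: None)
sys.modules["trackio"] = stub

import trackio  # no real trackio package needs to be installed

assert trackio is stub
```

This is why the tests never require the real Trackio dependency: any later `import trackio` inside `training.trackio_utils` sees the stub's `log`, `config`, and `Table` attributes instead.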
training/trackio_utils.py CHANGED
@@ -174,6 +174,19 @@ TRACE_TABLE_COLUMNS = (
     "terminal_reason",
 )
 
+REWARD_CONFIG_TABLE_COLUMNS = (
+    "key",
+    "value",
+    "stage_value",
+    "cap",
+    "threshold",
+    "severe_threshold",
+    "terminate",
+    "description",
+)
+
+REWARD_STAGES_FOR_TRACKING = ("early", "middle", "late", "final")
+
 SENSITIVE_TEXT_PATTERNS = (
     re.compile(r"hf_[A-Za-z0-9_]+"),
     re.compile(r"(?i)(secret|token|password|api[_-]?key)\s*[:=]\s*[^,\s}]+"),
@@ -265,6 +278,10 @@ def _stable_hash(value: Any, length: int = 16) -> str:
     return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]
 
 
+def _metric_safe(value: str) -> str:
+    return re.sub(r"[^A-Za-z0-9_.-]+", "_", value).strip("_")
+
+
 def _redact_text(value: Any, limit: int = 800) -> str:
     text = str(value)
     for pattern in SENSITIVE_TEXT_PATTERNS:
@@ -524,6 +541,10 @@ def episode_record_from_state(
         "run/base_model": context.get("base_model", context.get("run/base_model", "")),
         "run/algo": context.get("algo", context.get("run/algo", "")),
         "run/reward_version": context.get("reward_version", "reward_v2"),
+        "run/reward_config_id": context.get("reward_config_id", ""),
+        "run/reward_config_hash": context.get("reward_config_hash", ""),
+        "run/reward_mode": context.get("reward_mode", ""),
+        "run/reward_stage": context.get("reward_stage", ""),
         "run/env_version": context.get("env_version", "0.1.0"),
         "episode_id": getattr(state, "episode_id", ""),
         "task_id": getattr(state, "task_id", ""),
@@ -926,7 +947,7 @@ def log_trace_table(
     rows = trace_table_rows(episodes)
     table = trackio.Table(
         columns=list(TRACE_TABLE_COLUMNS),
-        rows=[[row.get(column, "") for column in TRACE_TABLE_COLUMNS] for row in rows],
+        data=[[row.get(column, "") for column in TRACE_TABLE_COLUMNS] for row in rows],
        allow_mixed_types=True,
     )
     if step is None:
@@ -1053,6 +1074,132 @@ def log_trackio_metrics(metrics: dict[str, Any], step: int | None = None) -> None:
     trackio.log(numeric, step=step)
 
 
+def reward_config_trackio_config(settings: Any | None = None) -> dict[str, Any]:
+    """Return nonnumeric reward config identity fields for Trackio run config."""
+
+    try:
+        from CyberSecurity_OWASP.reward_config import (
+            load_reward_settings,
+            reward_config_run_config,
+        )
+    except ImportError:  # pragma: no cover
+        from reward_config import load_reward_settings, reward_config_run_config
+
+    settings = settings or load_reward_settings()
+    return reward_config_run_config(settings)
+
+
+def reward_config_scalar_metrics(settings: Any | None = None) -> dict[str, float]:
+    """Return numeric reward config values as scalar Trackio metrics."""
+
+    try:
+        from CyberSecurity_OWASP.reward_config import (
+            load_reward_settings,
+            reward_config_summary,
+        )
+    except ImportError:  # pragma: no cover
+        from reward_config import load_reward_settings, reward_config_summary
+
+    settings = settings or load_reward_settings()
+    summary = reward_config_summary(settings)
+    metrics = {
+        "reward_config/shaping_weight/resolved": _float(
+            summary.get("reward_shaping_weight")
+        )
+    }
+    for row in summary.get("reward_entries", []):
+        key = _metric_safe(str(row.get("key", "")))
+        if not key:
+            continue
+        for field in (
+            "value",
+            "stage_value",
+            "resolved",
+            "cap",
+            "threshold",
+            "severe_threshold",
+            "terminate",
+        ):
+            value = row.get(field)
+            if isinstance(value, (int, float, bool)):
+                metrics[f"reward_config/{key}/{field}"] = _float(value)
+
+        raw_entry = settings.entry(str(row.get("key", "")))
+        for extra_key, value in raw_entry.items():
+            if extra_key in {
+                "description",
+                "value",
+                "cap",
+                "threshold",
+                "threshold_lines",
+                "severe_threshold",
+                "severe_threshold_lines",
+                "terminate",
+                *REWARD_STAGES_FOR_TRACKING,
+            }:
+                continue
+            if isinstance(value, (int, float, bool)):
+                metrics[
+                    f"reward_config/{key}/{_metric_safe(str(extra_key))}"
+                ] = _float(value)
+    return metrics
+
+
+def log_reward_config(
+    settings: Any | None = None,
+    *,
+    step: int | None = 0,
+    table_name: str = "reward_config",
+) -> dict[str, Any]:
+    """Log reward config scalar values and a Trackio table for one run."""
+
+    try:
+        from CyberSecurity_OWASP.reward_config import (
+            load_reward_settings,
+            reward_config_summary,
+        )
+    except ImportError:  # pragma: no cover
+        from reward_config import load_reward_settings, reward_config_summary
+
+    settings = settings or load_reward_settings()
+    summary = reward_config_summary(settings)
+
+    trackio = _load_trackio()
+    config_payload = reward_config_trackio_config(settings)
+    active_config = getattr(trackio, "config", None)
+    if isinstance(active_config, dict):
+        active_config.update(config_payload)
+    context_vars = getattr(trackio, "context_vars", None)
+    current_run_var = getattr(context_vars, "current_run", None)
+    if current_run_var is not None:
+        current_run = current_run_var.get()
+        if current_run is not None and isinstance(getattr(current_run, "config", None), dict):
+            current_run.config.update(config_payload)
+            # Force Trackio to persist the enriched run config even if the
+            # trainer or auto GPU logger emitted an earlier config-only log.
+            current_run._config_logged = False
+    log_trackio_metrics(reward_config_scalar_metrics(settings), step=step)
+
+    rows = []
+    for entry in summary.get("reward_entries", []):
+        rows.append(
+            [
+                entry.get(column, "")
+                for column in REWARD_CONFIG_TABLE_COLUMNS
+            ]
+        )
+    table = trackio.Table(
+        columns=list(REWARD_CONFIG_TABLE_COLUMNS),
+        data=rows,
+        allow_mixed_types=True,
+    )
+    if step is None:
+        trackio.log({table_name: table})
+    else:
+        trackio.log({table_name: table}, step=step)
+    return summary
+
+
 def collect_torch_gpu_metrics() -> dict[str, float]:
     """Collect explicit torch CUDA metrics for Trackio scalar dashboards."""
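Downstream, the new helpers are designed to be invoked once at run start. A minimal usage sketch, assuming the training entrypoint has already initialized a Trackio run; the environment values mirror the modes the tests exercise:

```python
import os

from CyberSecurity_OWASP.reward_config import load_reward_settings
from training.trackio_utils import log_reward_config

# Select the reward schedule before loading settings; load_reward_settings
# reads these variables (the tests cover dense_train with early/late stages).
os.environ["CYBERSECURITY_OWASP_REWARD_MODE"] = "dense_train"
os.environ["CYBERSECURITY_OWASP_REWARD_STAGE"] = "early"

settings = load_reward_settings()

# Logs reward_config/<key>/<field> scalars and a reward_config table at
# step 0, and stamps reward_config_hash into the run config so parallel
# runs can be checked for identical reward settings.
summary = log_reward_config(settings, step=0)
print(summary["reward_config_hash"])
```

Because `log_reward_config` defaults `step` to 0 and returns the summary dict, callers can log once at startup and keep the hash for cross-run bookkeeping.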