Humanlearning committed on
Commit
0e7f59c
·
1 Parent(s): 7d32451

feat: enhance reward configuration management with new logging functions, add parallel Modal training guidelines to documentation, and improve reward config hashing for deterministic behavior

.agents/skills/cybersecurity-owasp-trainer/SKILL.md CHANGED
@@ -89,6 +89,70 @@ Stop or roll back if reward rises while sampled traces show deny-all patches, ha
 
Stop or downgrade to local-dev only if Modal training/eval shows runtime scenario compilation, cache misses in required mode, or cache hit rate below the configured target.

+ ## Parallel Modal Runs
+
+ Parallel GRPO runs are allowed, but they must not share mutable experiment
+ identity or mutate shared caches while another job is training.
+
+ Before launching another run:
+
+ 1. Check active Modal apps:
+
+ ```bash
+ uv run --extra modal modal app list
+ ```
+
+ 2. Inspect any active `CyberSecurity_OWASP` app before starting another job:
+
+ ```bash
+ uv run --extra modal modal app logs <app-id>
+ ```
+
+ 3. Use both detach layers for long jobs:
+
+ ```bash
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
+   --max-steps 300 \
+   --dataset-size 64 \
+   --num-generations 8 \
+   --max-completion-length 768 \
+   --difficulty 0 \
+   --trace-log-every 10 \
+   --seed-start 10000 \
+   --detach
+ ```
+
+ The Modal CLI `--detach` keeps the remote function alive after the local
+ entrypoint disconnects. The launcher `--detach` prevents the parent Modal
+ function from waiting on the spawned GPU call. Use both; using only the script
+ flag can let Modal stop the run when the local client exits.
+
+ For concurrent experiments:
+
+ - Assign every run a distinct `--seed-start` range, normally at least 10,000
+   seeds apart.
+ - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`.
+ - Do not run `prepare-cache --cache-force` while any training job is active.
+ - Leave `--push-to-hub` off unless every job has a unique `--output-repo-id`.
+ - Keep Trackio run names unique. The launcher timestamp normally handles this;
+   set `RUN_NAME` only when it is globally unique.
+ - Use the same Trackio Space/project for comparison, but never reuse a run
+   name.
+ - Treat `CyberSecurity_OWASP-model-cache` and
+   `CyberSecurity_OWASP-scenario-cache` as shared read-mostly volumes during
+   training. Run checkpoints and artifacts must live under the run-specific
+   output directory.
+ - For clean comparisons, keep model, difficulty, dataset size, generation
+   length, reward config, and cache version fixed; vary only `seed-start` or the
+   one hyperparameter being tested.
+
+ On Windows, if Modal startup fails with a Unicode `charmap` encoding error,
+ rerun the command with UTF-8 enabled:
+
+ ```powershell
+ $env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
+ ```
+
## TRL, OpenEnv, And Unsloth Guidance

- Use TRL GRPO for verifier-driven rewards. Keep multiple independent reward functions for logging and diagnosis.
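
The dual-detach rule above reduces, on the launcher side, to Modal's two call styles. A minimal sketch of the branch the launcher `--detach` flag selects (the pattern is taken from the baseline-mode code this commit adds to `scripts/modal_train_grpo.py`; `train_fn` is a stand-in name for the decorated Modal function):

```python
# Sketch: what the launcher's --detach flag toggles. spawn() fires and forgets;
# remote() blocks the parent until the GPU call finishes.
def launch(train_fn, detach: bool, **kwargs):
    if detach:
        call = train_fn.spawn(**kwargs)  # parent does not wait on the GPU call
        print(f"Spawned Modal call: {call.object_id}")
    else:
        result = train_fn.remote(**kwargs)  # parent waits for the result
        print(f"Result: {result}")
```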
AGENTS.md CHANGED
@@ -1079,6 +1079,62 @@ Then scale gradually:
 
For high-volume rollouts, prefer local Docker or Uvicorn over remote HF Spaces because local WebSocket sessions reduce latency and avoid Space limits.

+ ### Parallel Modal training runs
+
+ Parallel Modal GRPO runs are allowed only when they do not overwrite each
+ other's evidence, checkpoints, scenario assignments, or Hub outputs.
+
+ Before launching another run:
+
+ 1. Check active Modal apps:
+
+ ```bash
+ uv run --extra modal modal app list
+ ```
+
+ 2. If a `CyberSecurity_OWASP` app is active, inspect it before launching:
+
+ ```bash
+ uv run --extra modal modal app logs <app-id>
+ ```
+
+ 3. Use Modal CLI-level detach and the launcher detach flag together; otherwise
+   the spawned GPU function may stop when the local entrypoint exits:
+
+ ```bash
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
+   --max-steps 300 \
+   --dataset-size 64 \
+   --num-generations 8 \
+   --max-completion-length 768 \
+   --difficulty 0 \
+   --trace-log-every 10 \
+   --seed-start 10000 \
+   --detach
+ ```
+
+ When running jobs in parallel:
+
+ - Give every run a distinct `--seed-start` range, spaced by at least 10,000
+   seeds unless a smaller controlled comparison is intentional. (A helper sketch
+   for picking the values follows this section.)
+ - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
+   scenarios in the training hot path.
+ - Do not run `prepare-cache --cache-force` while any training job is active.
+   Scenario-cache writes can invalidate or race training resets.
+ - Leave `--push-to-hub` off for parallel experiments unless each run has a
+   unique `--output-repo-id`.
+ - Keep run names unique. The launcher timestamp normally handles this; set an
+   explicit `RUN_NAME` only when it is globally unique.
+ - Use different Trackio run names but the same Trackio Space so reward,
+   throughput, GPU utilization, invalid-action rate, and success metrics remain
+   comparable.
+ - Treat the shared Modal volumes as shared infrastructure: model cache and
+   scenario cache should be read-only during parallel training; run/checkpoint
+   outputs must live under each run's unique output directory.
+ - If the goal is a clean reward comparison, keep model, difficulty,
+   `dataset-size`, `num-generations`, `max-completion-length`, and reward config
+   fixed, changing only `seed-start` or the one hyperparameter being tested.
+
---

## README requirements
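
To make the 10,000-seed spacing rule above mechanical when launching a batch of parallel runs, a small helper along these lines works (hypothetical; not part of this commit):

```python
SEED_SPACING = 10_000  # minimum separation between parallel runs' seed ranges

def seed_starts(num_runs: int, base: int = 10_000) -> list[int]:
    """Return one --seed-start per run so the seed ranges never overlap."""
    return [base + index * SEED_SPACING for index in range(num_runs)]

print(seed_starts(3))  # [10000, 20000, 30000]
```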
README.md CHANGED
@@ -211,6 +211,20 @@ uv run python scripts/track_pytest.py tests
 
Evaluation summaries saved through `training.eval_before_after.save_eval_summary(...)`, Modal smoke runs, and GRPO training configs all initialize Trackio runs with CyberSecurity_OWASP run names.

+ Training, baseline, and smoke runs also log the effective reward config at step
+ 0. In Trackio, open **Media & Tables** and select the `reward_config` table to
+ see the actual values for each reward key, including stage-specific values,
+ caps, thresholds, terminate flags, and descriptions. Scalar metrics under
+ `reward_config/<key>/<field>` expose the same numeric values for plotting and
+ filtering, for example `reward_config/policy_inspected/value` and
+ `reward_config/shaping_weight/resolved`.
+
+ Each run config includes `reward_config_id`, `reward_config_hash`,
+ `reward_config_source`, `reward_mode`, and `reward_stage`. For manual ablations,
+ compare runs with the same scenario/model settings and different
+ `reward_config_hash` values to see which reward weights produced each training
+ curve.
+
## Modal Ephemeral Runs

Modal Labs support is kept in a separate launcher script so the local OpenEnv server and core training scaffold stay unchanged.
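
As a concrete illustration of the `reward_config/<key>/<field>` naming described in that hunk, a minimal sketch of how flattened config rows can map to scalar metric names (an approximation of `reward_config_scalar_metrics` in `training/trackio_utils.py`, not its exact code; the row values are illustrative):

```python
def scalar_metrics(rows: list[dict]) -> dict[str, float]:
    """Map flattened reward-config rows to reward_config/<key>/<field> scalars."""
    metrics: dict[str, float] = {}
    for row in rows:
        for field in ("value", "stage_value", "resolved", "cap", "threshold"):
            value = row.get(field)
            # Blank placeholders ("") and bools are not plottable scalars.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                metrics[f"reward_config/{row['key']}/{field}"] = float(value)
    return metrics

rows = [{"key": "policy_inspected", "value": 0.30, "stage_value": "", "resolved": 0.30}]
print(scalar_metrics(rows))
# {'reward_config/policy_inspected/value': 0.3, 'reward_config/policy_inspected/resolved': 0.3}
```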
@@ -307,6 +321,59 @@ container starts. Scalar Trackio metrics still log every reward callback, while
sample trace tables and Trace objects are throttled by `--trace-log-every`
(`1` restores every-callback logging, `0` disables trace artifacts).

+ ### Parallel Modal GRPO Runs
+
+ Parallel Modal GRPO runs are safe when each run has its own seed range, run
+ name, and output target, and the shared cache volumes remain read-only.
+ Before launching another job, check what is already active:
+
+ ```bash
+ uv run --extra modal modal app list
+ uv run --extra modal modal app logs <app-id>
+ ```
+
+ Launch long-running parallel jobs with both Modal CLI detach and the launcher
+ detach flag. The CLI-level `--detach` keeps the remote function alive after the
+ local entrypoint exits; the launcher `--detach` prevents the parent Modal
+ function from waiting on the GPU call.
+
+ ```bash
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
+   --max-steps 300 \
+   --dataset-size 64 \
+   --num-generations 8 \
+   --max-completion-length 768 \
+   --difficulty 0 \
+   --trace-log-every 10 \
+   --seed-start 10000 \
+   --detach
+ ```
+
+ For multiple concurrent experiments:
+
+ - Use a unique `--seed-start` range for every run, normally spaced by at least
+   10,000 seeds.
+ - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
+   scenarios during training.
+ - Do not run `prepare-cache --cache-force` while training jobs are active.
+ - Keep `--push-to-hub` disabled unless each run has a unique
+   `--output-repo-id`.
+ - Let the launcher generate unique timestamped Trackio run names, or set an
+   explicit `RUN_NAME` only when it is globally unique.
+ - Use the same Trackio Space/project for comparable metrics, but never reuse a
+   run name.
+ - Treat `CyberSecurity_OWASP-model-cache` and
+   `CyberSecurity_OWASP-scenario-cache` as shared read-mostly infrastructure
+   during training. Run outputs and checkpoints should stay under each run's
+   unique output directory.
+
+ If a Windows shell fails with a Unicode `charmap` encoding error during Modal
+ startup, rerun with UTF-8 enabled for that command:
+
+ ```powershell
+ $env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
+ ```
+
If running from a public repository and you do not want Modal to package the
local workspace, use public source mode:

reward_config.py CHANGED
@@ -2,6 +2,8 @@
 
from __future__ import annotations

+ import hashlib
+ import json
import os
from dataclasses import dataclass
from pathlib import Path
@@ -88,6 +90,106 @@ def load_reward_settings(path: str | Path | None = None) -> RewardSettings:
    return settings


+ def flatten_reward_config(
+     settings: RewardSettings | None = None,
+ ) -> list[dict[str, Any]]:
+     """Return display-friendly reward config rows for tracking dashboards."""
+
+     settings = settings or load_reward_settings()
+     rows: list[dict[str, Any]] = []
+     for key in sorted(settings.raw):
+         entry = settings.raw[key]
+         if not isinstance(entry, dict):
+             continue
+         has_resolved_value = "value" in entry or settings.stage in entry
+         rows.append(
+             {
+                 "key": key,
+                 "value": _empty_if_missing(entry.get("value")),
+                 "stage_value": _empty_if_missing(entry.get(settings.stage)),
+                 "resolved": settings.value(key, 0.0) if has_resolved_value else "",
+                 "cap": _empty_if_missing(entry.get("cap")),
+                 "threshold": _empty_if_missing(
+                     entry.get("threshold", entry.get("threshold_lines"))
+                 ),
+                 "severe_threshold": _empty_if_missing(
+                     entry.get("severe_threshold", entry.get("severe_threshold_lines"))
+                 ),
+                 "terminate": bool(entry.get("terminate", False)),
+                 "description": str(entry.get("description", "")),
+             }
+         )
+     return rows
+
+
+ def reward_config_hash(settings: RewardSettings | None = None) -> str:
+     """Return a deterministic hash for the effective reward configuration."""
+
+     settings = settings or load_reward_settings()
+     payload = {
+         "mode": settings.mode,
+         "training_mode": settings.training_mode,
+         "stage": settings.stage,
+         "shaping_weight": settings.shaping_weight,
+         "raw": _strip_descriptions(settings.raw),
+     }
+     encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
+     return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
+
+
+ def reward_config_summary(settings: RewardSettings | None = None) -> dict[str, Any]:
+     """Return reward config identity and flattened rows for run metadata."""
+
+     settings = settings or load_reward_settings()
+     config_hash = reward_config_hash(settings)
+     source = Path(settings.source_path)
+     return {
+         "reward_config_id": (
+             f"{source.stem}-{settings.mode}-{settings.stage}-{config_hash[:12]}"
+         ),
+         "reward_config_hash": config_hash,
+         "reward_config_source": str(source),
+         "reward_config_source_name": source.name,
+         "reward_mode": settings.mode,
+         "reward_training_mode": settings.training_mode,
+         "reward_stage": settings.stage,
+         "reward_shaping_weight": settings.shaping_weight,
+         "reward_entries": flatten_reward_config(settings),
+     }
+
+
+ def reward_config_run_config(settings: RewardSettings | None = None) -> dict[str, Any]:
+     """Return compact reward config fields safe to store in Trackio run config."""
+
+     summary = reward_config_summary(settings)
+     reward_values = {
+         str(row["key"]): {
+             key: value
+             for key, value in row.items()
+             if key != "key" and value != ""
+         }
+         for row in summary["reward_entries"]
+     }
+     config = {
+         "reward_config_id": summary["reward_config_id"],
+         "reward_config_hash": summary["reward_config_hash"],
+         "reward_config_source": summary["reward_config_source"],
+         "reward_config_source_name": summary["reward_config_source_name"],
+         "reward_mode": summary["reward_mode"],
+         "reward_training_mode": summary["reward_training_mode"],
+         "reward_stage": summary["reward_stage"],
+         "reward_shaping_weight": summary["reward_shaping_weight"],
+         "reward_config_values": reward_values,
+         "reward_config_values_json": json.dumps(reward_values, sort_keys=True),
+     }
+     for reward_key, values in reward_values.items():
+         safe_reward_key = _config_key_safe(reward_key)
+         for field, value in values.items():
+             if isinstance(value, (int, float, bool)):
+                 config[f"reward_config__{safe_reward_key}__{field}"] = value
+     return config
+
+
def validate_reward_settings(settings: RewardSettings) -> None:
    if settings.mode not in REWARD_MODES:
        raise ValueError("reward.mode must be dense_train or sparse_eval")
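
The determinism in `reward_config_hash` comes from canonical JSON encoding (sorted keys, fixed separators) fed into SHA-256. A standalone sketch of just that property, with an illustrative payload shaped like the one the function builds:

```python
import hashlib
import json

# Illustrative payload; the real one carries mode, training_mode, stage,
# shaping_weight, and the description-stripped raw config.
payload = {"mode": "dense_train", "stage": "middle", "raw": {"step_penalty": {"middle": -0.01}}}

encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
digest = hashlib.sha256(encoded.encode("utf-8")).hexdigest()

# Key order in the source dict does not matter; the digest is byte-identical.
reordered = {"raw": {"step_penalty": {"middle": -0.01}}, "stage": "middle", "mode": "dense_train"}
assert hashlib.sha256(
    json.dumps(reordered, sort_keys=True, separators=(",", ":"), default=str).encode("utf-8")
).hexdigest() == digest
print(len(digest))  # 64 hex chars, as asserted in tests/test_reward_config.py
```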
@@ -103,6 +205,26 @@ def validate_reward_settings(settings: RewardSettings) -> None:
            raise ValueError(f"reward.{key}.description is required")


+ def _empty_if_missing(value: Any) -> Any:
+     return "" if value is None else value
+
+
+ def _strip_descriptions(value: Any) -> Any:
+     if isinstance(value, dict):
+         return {
+             str(key): _strip_descriptions(item)
+             for key, item in value.items()
+             if key != "description"
+         }
+     if isinstance(value, list):
+         return [_strip_descriptions(item) for item in value]
+     return value
+
+
+ def _config_key_safe(value: str) -> str:
+     return "".join(char if char.isalnum() or char == "_" else "_" for char in value).strip("_")
+
+
def compute_token_penalty(
    completion_tokens: int,
    settings: RewardSettings | None = None,
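
Behavior of the three helpers added above, doctest-style (the module path matches the imports used elsewhere in this commit; the example values are illustrative):

```python
from CyberSecurity_OWASP.reward_config import (  # private helpers from this commit
    _config_key_safe,
    _empty_if_missing,
    _strip_descriptions,
)

assert _empty_if_missing(None) == ""  # missing fields render as blanks in tables
assert _empty_if_missing(0.3) == 0.3
assert _config_key_safe("token.penalty") == "token_penalty"  # safe Trackio config keys
assert _strip_descriptions(
    {"step_penalty": {"value": -0.01, "description": "per-step cost"}}
) == {"step_penalty": {"value": -0.01}}  # prose never affects the hash
```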
scripts/modal_ephemeral_train.py CHANGED
@@ -131,6 +131,7 @@ def run_ephemeral_smoke(
    _configure_scenario_cache_env(required=True)
    from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
    from CyberSecurity_OWASP.config import load_scenario_authoring_config
+     from CyberSecurity_OWASP.reward_config import load_reward_settings
    from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
        CybersecurityOwaspEnvironment,
    )
@@ -140,7 +141,9 @@ def run_ephemeral_smoke(
        aggregate_episode_metrics,
        episode_record_from_state,
        log_episode_batch,
+         log_reward_config,
        log_trackio_metrics,
+         reward_config_trackio_config,
        trace_table_rows,
        trackio_run,
    )
@@ -162,10 +165,13 @@ def run_ephemeral_smoke(
 
    baseline = []
    oracle = []
+     reward_settings = load_reward_settings()
+     reward_tracking_config = reward_config_trackio_config(reward_settings)
    run_context = {
        "algo": "modal_ephemeral_smoke",
        "reward_version": "reward_v2",
        "env_version": "0.1.0",
+         **reward_tracking_config,
    }

    for offset in range(episodes):
@@ -274,6 +280,7 @@ def run_ephemeral_smoke(
        "tracking_trace_rows": trace_table_rows(episode_records),
        "baseline": baseline,
        "oracle": oracle,
+         **reward_tracking_config,
    }
    with trackio_run(
        run_name=run_name,
@@ -284,9 +291,11 @@ def run_ephemeral_smoke(
            "episodes": episodes,
            "seed_start": seed_start,
            "mode": "smoke",
+             **reward_tracking_config,
        },
        group="smoke",
    ):
+         log_reward_config(reward_settings, step=0)
        logged_metrics = log_episode_batch(episode_records, step=0)
        log_trackio_metrics(
            {
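
Condensed, the wiring these five hunks add is: load the settings once, derive the Trackio identity fields, merge them into every config payload, and emit the step-0 table. A sketch of that ordering (assuming `trackio_run` accepts this keyword subset; the real calls above pass additional arguments):

```python
from CyberSecurity_OWASP.reward_config import load_reward_settings
from training.trackio_utils import (
    log_reward_config,
    reward_config_trackio_config,
    trackio_run,
)

reward_settings = load_reward_settings()
reward_tracking_config = reward_config_trackio_config(reward_settings)

with trackio_run(
    run_name="smoke-demo",  # hypothetical run name
    config={"mode": "smoke", **reward_tracking_config},
    group="smoke",
):
    # Step-0 logging makes the run identifiable by reward_config_id/hash from the start.
    log_reward_config(reward_settings, step=0)
```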
scripts/modal_train_grpo.py CHANGED
@@ -16,6 +16,7 @@ Example:
 
from __future__ import annotations

+ import json
import os
import pathlib
import subprocess
@@ -45,6 +46,7 @@ PROJECT_ROOT = pathlib.Path(__file__).resolve().parents[1]
PUBLIC_REPO_URL = "https://github.com/humandotlearning/CyberSecurity_OWASP.git"
PUBLIC_REPO_BRANCH = "master"
DEFAULT_GEMMA_MODEL = "unsloth/gemma-4-E2B-it"
+ GRPO_TRAINING_TIMEOUT_SECONDS = 24 * 60 * 60
_IMAGE_NOTICE_PRINTED = False

@@ -120,6 +122,56 @@ def _validate_vllm_config(*, use_vllm: bool, vllm_gpu_memory_utilization: float)
        )


+ def _extract_first_json_object(text: str) -> dict[str, Any] | None:
+     stripped = text.strip()
+     candidates = [stripped]
+     if "```" in stripped:
+         for part in stripped.split("```"):
+             part = part.strip()
+             if part.startswith("json"):
+                 part = part[4:].strip()
+             candidates.append(part)
+
+     for candidate in candidates:
+         try:
+             loaded = json.loads(candidate)
+         except Exception:
+             continue
+         if isinstance(loaded, dict):
+             return loaded
+
+     start = stripped.find("{")
+     while start >= 0:
+         depth = 0
+         in_string = False
+         escaped = False
+         for index in range(start, len(stripped)):
+             char = stripped[index]
+             if in_string:
+                 if escaped:
+                     escaped = False
+                 elif char == "\\":
+                     escaped = True
+                 elif char == '"':
+                     in_string = False
+                 continue
+             if char == '"':
+                 in_string = True
+             elif char == "{":
+                 depth += 1
+             elif char == "}":
+                 depth -= 1
+                 if depth == 0:
+                     try:
+                         loaded = json.loads(stripped[start : index + 1])
+                     except Exception:
+                         break
+                     if isinstance(loaded, dict):
+                         return loaded
+         start = stripped.find("{", start + 1)
+     return None
+
+
def _configure_modal_cache_env() -> dict[str, str]:
    values = {
        "HF_HOME": str(HF_HOME_DIR),
@@ -410,7 +462,6 @@ def verify_modal_scenario_cache_for_training(
    from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
        CybersecurityOwaspEnvironment,
    )
-     from CyberSecurity_OWASP.reward_config import compute_token_penalty
    from CyberSecurity_OWASP.server.curriculum import CurriculumController
    from CyberSecurity_OWASP.server.scenario_cache import ScenarioCache

@@ -514,6 +565,436 @@ def check_training_imports() -> dict[str, str]:
    volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume, SCENARIO_CACHE_DIR: scenario_cache_volume},
    secrets=secrets,
)
+ def run_cybersecurity_owasp_baseline(
+     max_steps: int = 50,
+     dataset_size: int = 1,
+     difficulty: int = 0,
+     split: str = "train",
+     model_name: str = DEFAULT_GEMMA_MODEL,
+     max_seq_length: int = 4096,
+     max_completion_length: int = 768,
+     trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
+     trackio_project: str = "CyberSecurity_OWASP-grpo",
+     num_generations: int = 1,
+     trace_log_every: int = 1,
+     seed_start: int = 0,
+     git_sha: str = "nogit",
+     run_name: str = "baseline",
+     source_mode: str = "local",
+     repo_url: str = PUBLIC_REPO_URL,
+     repo_branch: str = PUBLIC_REPO_BRANCH,
+ ) -> dict[str, str | int | float]:
+     import statistics
+     import time
+
+     import torch
+     from huggingface_hub import snapshot_download, whoami
+     from unsloth import FastVisionModel
+     import transformers.utils.hub as transformers_hub
+
+     from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
+     from CyberSecurity_OWASP.config import load_scenario_authoring_config
+     from CyberSecurity_OWASP.reward_config import load_reward_settings
+     from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
+         CybersecurityOwaspEnvironment,
+     )
+     from CyberSecurity_OWASP.server.curriculum import CurriculumController
+     from CyberSecurity_OWASP.server.scenario_cache import ScenarioCache
+     from training.trackio_utils import (
+         aggregate_episode_metrics,
+         episode_record_from_state,
+         log_reward_config,
+         log_trace_table,
+         log_trackio_metrics,
+         reward_config_trackio_config,
+         trackio_run,
+     )
+
+     model_name = _ensure_gemma4_model(model_name)
+     if int(num_generations) != 1:
+         raise ValueError("Baseline mode runs the untrained model with --num-generations 1.")
+
+     cache_env = _configure_modal_cache_env()
+     scenario_cache_env = _configure_scenario_cache_env(required=True)
+     transformers_hub.TRANSFORMERS_CACHE = cache_env["HF_HUB_CACHE"]
+     hf_token = os.environ.get("HF_TOKEN")
+     if not hf_token:
+         raise RuntimeError(f"HF_TOKEN is missing from the Modal secret {SECRET_NAME}.")
+     try:
+         whoami(token=hf_token)
+     except Exception as exc:
+         raise RuntimeError("HF_TOKEN could not be validated before baseline run.") from exc
+
+     os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
+     os.environ["TRACKIO_PROJECT"] = trackio_project
+     reward_settings = load_reward_settings()
+     reward_tracking_config = reward_config_trackio_config(reward_settings)
+     run_name = run_name or "baseline"
+     output_dir = RUNS_DIR / run_name
+     output_dir.mkdir(parents=True, exist_ok=True)
+
+     try:
+         cache_volume.reload()
+         print(f"Reloaded Modal model cache volume: {CACHE_VOLUME_NAME}")
+     except Exception as exc:
+         print(f"Model cache volume reload skipped: {exc!r}")
+     try:
+         scenario_cache_volume.reload()
+         print(f"Reloaded Modal scenario cache volume: {SCENARIO_CACHE_VOLUME_NAME}")
+     except Exception as exc:
+         print(f"Scenario cache volume reload skipped: {exc!r}")
+
+     settings = load_scenario_authoring_config()
+     scenario_profile = CurriculumController(settings=settings).select_profile(
+         seed=seed_start,
+         split=split,
+         requested_difficulty=difficulty,
+     )
+     resolved_difficulty = int(scenario_profile["difficulty"])
+     scenario_cache = ScenarioCache(SCENARIO_CACHE_DIR, settings=settings)
+     coverage = scenario_cache.assert_coverage(
+         split=split,
+         difficulty=resolved_difficulty,
+     )
+     entries = scenario_cache.validated_entries(
+         split=split,
+         difficulty=resolved_difficulty,
+     ) or scenario_cache.validated_entries(split=split)
+     if not entries:
+         raise RuntimeError(f"No validated scenario cache entries found for split={split!r}.")
+
+     print(f"Baseline run name: {run_name}")
+     print(f"Source mode: {source_mode}")
+     if source_mode == "public":
+         print(f"Installed CyberSecurity_OWASP from public repo: {repo_url}@{repo_branch}")
+     else:
+         print("Packaged local CyberSecurity_OWASP repo.")
+     print(f"Trackio Space: {trackio_space_id}")
+     print(f"Trackio Project: {trackio_project}")
+     print(f"Reward config: {reward_tracking_config['reward_config_id']}")
+     print(f"Reward config hash: {reward_tracking_config['reward_config_hash']}")
+     print(f"Scenario cache dir: {scenario_cache_env['CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR']}")
+     print(f"Scenario cache coverage: {coverage}")
+     print(
+         "Baseline generation config: "
+         f"episodes={dataset_size}, max_episode_steps={max_steps}, "
+         f"num_generations={num_generations}, max_completion_length={max_completion_length}, "
+         f"trace_log_every={trace_log_every}"
+     )
+
+     expected_model_cache = _hf_model_cache_path(model_name)
+     print(f"Expected HF model cache path: {expected_model_cache}")
+     print(f"Model cache hit before load: {expected_model_cache.exists()}")
+     try:
+         snapshot_path = snapshot_download(
+             repo_id=model_name,
+             cache_dir=str(HF_HUB_CACHE_DIR),
+             token=hf_token,
+         )
+         print(f"Model snapshot ready: {snapshot_path}")
+         cache_volume.commit()
+     except Exception as exc:
+         print(f"Explicit model snapshot prefetch failed; loading directly. Error: {exc!r}")
+
+     model_api = FastVisionModel
+     model, tokenizer = model_api.from_pretrained(
+         model_name=model_name,
+         max_seq_length=max_seq_length,
+         load_in_4bit=False,
+         fast_inference=False,
+         cache_dir=str(HF_HUB_CACHE_DIR),
+         token=hf_token,
+     )
+     if hasattr(model_api, "for_inference"):
+         model_api.for_inference(model)
+     model.eval()
+     cache_volume.commit()
+     device = next(model.parameters()).device
+     text_tokenizer = getattr(tokenizer, "tokenizer", tokenizer)
+
+     def render_prompt(observation, actions: list[dict[str, Any]]) -> str:
+         recent_actions = actions[-8:]
+         return (
+             "You are the untrained baseline model for a defensive local AppSec "
+             "repair environment. Use only the listed local tools. Return exactly "
+             "one JSON object and no markdown.\n\n"
+             f"{observation.scenario_prompt}\n\n"
+             f"Current phase: {observation.phase}\n"
+             f"Available actions: {observation.available_actions}\n"
+             f"Last tool result: {observation.last_tool_result}\n"
+             f"Recent actions: {json.dumps(recent_actions, sort_keys=True)}\n\n"
+             'Required format: {"tool_name":"inspect_policy_graph","arguments":{}}'
+         )
+
+     def generate_action_text(prompt: str) -> tuple[str, list[int], list[int]]:
+         messages = [{"role": "user", "content": prompt}]
+         prompt_text = prompt
+         for candidate in (tokenizer, text_tokenizer):
+             if hasattr(candidate, "apply_chat_template"):
+                 try:
+                     prompt_text = candidate.apply_chat_template(
+                         messages,
+                         tokenize=False,
+                         add_generation_prompt=True,
+                     )
+                     break
+                 except Exception:
+                     prompt_text = prompt
+         encode = tokenizer
+         try:
+             inputs = encode(
+                 prompt_text,
+                 return_tensors="pt",
+                 truncation=True,
+                 max_length=max_seq_length,
+             )
+         except Exception:
+             inputs = text_tokenizer(
+                 prompt_text,
+                 return_tensors="pt",
+                 truncation=True,
+                 max_length=max_seq_length,
+             )
+         if hasattr(inputs, "to"):
+             inputs = inputs.to(device)
+         else:
+             inputs = {
+                 key: value.to(device) if hasattr(value, "to") else value
+                 for key, value in inputs.items()
+             }
+         input_ids = inputs.get("input_ids")
+         input_len = int(input_ids.shape[-1]) if input_ids is not None else 0
+         pad_token_id = getattr(text_tokenizer, "pad_token_id", None)
+         if pad_token_id is None:
+             pad_token_id = getattr(text_tokenizer, "eos_token_id", None)
+         with torch.inference_mode():
+             generated = model.generate(
+                 **inputs,
+                 max_new_tokens=max_completion_length,
+                 do_sample=False,
+                 pad_token_id=pad_token_id,
+             )
+         output_ids = generated[0]
+         completion_ids = output_ids[input_len:]
+         decode = getattr(text_tokenizer, "decode", None) or getattr(tokenizer, "decode")
+         text = decode(completion_ids, skip_special_tokens=True)
+         prompt_ids = (
+             [int(item) for item in input_ids[0].detach().cpu().tolist()]
+             if input_ids is not None
+             else []
+         )
+         return text, prompt_ids, [int(item) for item in completion_ids.detach().cpu().tolist()]
+
+     def action_from_completion(raw_text: str) -> tuple[CyberSecurityOWASPAction, str | None]:
+         loaded = _extract_first_json_object(raw_text)
+         if loaded is None:
+             return CyberSecurityOWASPAction(tool_name="noop", arguments={}), "invalid_json"
+         arguments = loaded.get("arguments", {})
+         if not isinstance(arguments, dict):
+             arguments = {}
+         payload = {
+             "tool_name": loaded.get("tool_name", "noop"),
+             "arguments": arguments,
+         }
+         try:
+             return CyberSecurityOWASPAction(**payload), None
+         except Exception as exc:
+             return (
+                 CyberSecurityOWASPAction(tool_name="noop", arguments={}),
+                 f"invalid_action_schema: {exc}",
+             )
+
+     episode_records: list[dict[str, Any]] = []
+     raw_traces: list[dict[str, Any]] = []
+     invalid_model_outputs = 0
+     generation_started = time.monotonic()
+     config = {
+         "base_model": model_name,
+         "algo": "baseline",
+         "difficulty": difficulty,
+         "split": split,
+         "max_episode_steps": max_steps,
+         "dataset_size": dataset_size,
+         "num_generations": num_generations,
+         "max_completion_length": max_completion_length,
+         "git_sha": git_sha,
+         **reward_tracking_config,
+     }
+
+     with trackio_run(
+         run_name=run_name,
+         run_type="baseline",
+         config=config,
+         project=trackio_project,
+         space_id=trackio_space_id,
+         group="baseline",
+         auto_log_gpu=True,
+     ):
+         log_reward_config(reward_settings, step=0)
+         for episode_index in range(max(1, int(dataset_size))):
+             entry = entries[(seed_start + episode_index) % len(entries)]
+             env = CybersecurityOwaspEnvironment()
+             try:
+                 observation = env.reset(
+                     seed=int(entry["seed"]),
+                     split=str(entry["split"]),
+                     difficulty=int(entry["difficulty"]),
+                 )
+                 env.state.max_steps = int(max_steps)
+                 actions: list[dict[str, Any]] = []
+                 model_steps: list[dict[str, Any]] = []
+                 prompt_token_count = 0
+                 completion_token_count = 0
+
+                 for step_index in range(int(max_steps)):
+                     if observation.done:
+                         break
+                     prompt = render_prompt(observation, actions)
+                     raw_text, prompt_ids, completion_ids = generate_action_text(prompt)
+                     prompt_token_count += len(prompt_ids)
+                     completion_token_count += len(completion_ids)
+                     action, invalid_reason = action_from_completion(raw_text)
+                     if invalid_reason:
+                         invalid_model_outputs += 1
+                     observation = env.step(action)
+                     action_dump = action.model_dump()
+                     actions.append(action_dump)
+                     model_steps.append(
+                         {
+                             "step": step_index + 1,
+                             "raw_completion": raw_text,
+                             "action": action_dump,
+                             "invalid_model_output": invalid_reason,
+                             "observation_message": observation.message,
+                             "reward": observation.reward,
+                             "done": observation.done,
+                         }
+                     )
+
+                 env.state.completion_tokens = completion_token_count
+                 env.state.metrics["prompt_tokens"] = prompt_token_count
+                 env.state.metrics["completion_tokens"] = completion_token_count
+                 final_observation = observation.model_dump()
+                 record = episode_record_from_state(
+                     env.state,
+                     run_context={
+                         "base_model": model_name,
+                         "algo": "baseline",
+                         "reward_version": "reward_v2",
+                         "env_version": "0.1.0",
+                         **reward_tracking_config,
+                     },
+                     final_observation=final_observation,
+                 )
+                 record.update(
+                     {
+                         "reward_total": float(env.state.accumulated_reward),
+                         "success": bool(env.state.success),
+                         "episode_length": int(env.state.step_count),
+                         "invalid_model_output_count": sum(
+                             1 for item in model_steps if item["invalid_model_output"]
+                         ),
+                         "prompt_tokens": prompt_token_count,
+                         "completion_tokens": completion_token_count,
+                     }
+                 )
+                 episode_records.append(record)
+                 raw_traces.append(
+                     {
+                         "episode_index": episode_index,
+                         "task_id": env.state.task_id,
+                         "seed": env.state.seed,
+                         "split": env.state.split,
+                         "difficulty": env.state.difficulty,
+                         "domain": env.state.domain,
+                         "bug_family": env.state.bug_family,
+                         "steps": model_steps,
+                     }
+                 )
+             finally:
+                 env.close()
+
+             metrics = aggregate_episode_metrics(episode_records)
+             metrics.update(
+                 {
+                     "baseline/episode_count": float(len(episode_records)),
+                     "baseline/reward_total_mean": statistics.mean(
+                         float(item.get("reward_total", 0.0)) for item in episode_records
+                     ),
+                     "baseline/success_rate": statistics.mean(
+                         1.0 if item.get("success") else 0.0 for item in episode_records
+                     ),
+                     "baseline/invalid_model_output_rate": invalid_model_outputs
+                     / max(1.0, sum(float(item.get("episode_length", 0)) for item in episode_records)),
+                     "baseline/num_generations": float(num_generations),
+                     "baseline/max_episode_steps": float(max_steps),
+                     "baseline/max_completion_length": float(max_completion_length),
+                 }
+             )
+             log_trackio_metrics(metrics, step=episode_index + 1)
+             if trace_log_every > 0 and (
+                 episode_index == 0 or (episode_index + 1) % trace_log_every == 0
+             ):
+                 log_trace_table(
+                     [episode_records[-1]],
+                     table_name="baseline_traces",
+                     step=episode_index + 1,
+                 )
+
+     elapsed_s = time.monotonic() - generation_started
+     summary = {
+         "run_name": run_name,
+         "trackio_space_id": trackio_space_id,
+         "trackio_project": trackio_project,
+         "model_name": model_name,
+         "dataset_size": len(episode_records),
+         "max_episode_steps": int(max_steps),
+         "difficulty": int(difficulty),
+         "split": split,
+         "num_generations": int(num_generations),
+         "max_completion_length": int(max_completion_length),
+         "mean_reward": (
+             statistics.mean(float(item.get("reward_total", 0.0)) for item in episode_records)
+             if episode_records
+             else 0.0
+         ),
+         "success_rate": (
+             statistics.mean(1.0 if item.get("success") else 0.0 for item in episode_records)
+             if episode_records
+             else 0.0
+         ),
+         "invalid_model_output_count": int(invalid_model_outputs),
+         "elapsed_s": elapsed_s,
+         **reward_tracking_config,
+     }
+     artifact_path = output_dir / "baseline_rollouts.json"
+     artifact_path.write_text(
+         json.dumps(
+             {
+                 "summary": summary,
+                 "episodes": episode_records,
+                 "raw_traces": raw_traces,
+             },
+             indent=2,
+             sort_keys=True,
+             default=str,
+         ),
+         encoding="utf-8",
+     )
+     volume.commit()
+     cache_volume.commit()
+     scenario_cache_volume.commit()
+     print(f"Baseline artifact saved to {artifact_path}")
+     return {**summary, "artifact_path": str(artifact_path)}
+
+
+ @app.function(
+     image=training_image,
+     gpu="L4",
+     timeout=GRPO_TRAINING_TIMEOUT_SECONDS,
+     volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume, SCENARIO_CACHE_DIR: scenario_cache_volume},
+     secrets=secrets,
+ )
def train_cybersecurity_owasp_grpo(
    env_repo_id: str = "",
    output_repo_id: str = "",
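
One behavioral note on the baseline loop above: `action_from_completion` is a closure inside the function, so its contract is easiest to state as expected input/output pairs (illustrative expectations, not importable asserts):

```python
# One JSON object per completion, or the step degrades to a scored "noop".
good = '{"tool_name": "inspect_policy_graph", "arguments": {}}'
bad = "Let me think about which tool to call first."

# action_from_completion(good) -> (CyberSecurityOWASPAction(tool_name="inspect_policy_graph",
#                                   arguments={}), None)
# action_from_completion(bad)  -> (CyberSecurityOWASPAction(tool_name="noop",
#                                   arguments={}), "invalid_json")
# Every non-None reason increments invalid_model_outputs, which feeds
# baseline/invalid_model_output_rate in Trackio.
```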
@@ -580,16 +1061,21 @@ def train_cybersecurity_owasp_grpo(
    from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
        CybersecurityOwaspEnvironment,
    )
-     from CyberSecurity_OWASP.reward_config import compute_token_penalty
+     from CyberSecurity_OWASP.reward_config import (
+         compute_token_penalty,
+         load_reward_settings,
+     )
    from CyberSecurity_OWASP.server.curriculum import CurriculumController
    from CyberSecurity_OWASP.server.scenario_cache import ScenarioCache
    from training.trackio_utils import (
        aggregate_episode_metrics,
        episode_record_from_state,
        episode_trace_fingerprint,
+         log_reward_config,
        log_gpu_metrics,
        log_trace_table,
        log_trackio_metrics,
+         reward_config_trackio_config,
        train_metric_aliases,
    )
    from training.grpo_curriculum import (
@@ -625,6 +1111,8 @@ def train_cybersecurity_owasp_grpo(
    os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
    os.environ["TRACKIO_PROJECT"] = trackio_project
    os.environ.setdefault("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
+     reward_settings = load_reward_settings()
+     reward_tracking_config = reward_config_trackio_config(reward_settings)

    model_slug = model_name.replace("/", "-")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
@@ -761,6 +1249,10 @@ def train_cybersecurity_owasp_grpo(
                    "scenario_group_id": self.scenario_group_id,
                    "scenario_assignment": dict(self.scenario_assignment),
                    "scenario_prompt_length": len(obs.scenario_prompt),
+                     "reward_config_id": reward_tracking_config["reward_config_id"],
+                     "reward_config_hash": reward_tracking_config["reward_config_hash"],
+                     "reward_stage": reward_tracking_config["reward_stage"],
+                     "reward_mode": reward_tracking_config["reward_mode"],
                }
            )
            return obs.scenario_prompt
@@ -1012,6 +1504,7 @@ def train_cybersecurity_owasp_grpo(
                "algo": "grpo",
                "reward_version": "reward_v2",
                "env_version": "0.1.0",
+                 **reward_tracking_config,
            },
        )
        record.update(
@@ -1116,6 +1609,10 @@ def train_cybersecurity_owasp_grpo(
                "trace_fingerprint": fingerprint,
                "num_generations": num_generations,
                "run_name": run_name,
+                 "reward_config_id": reward_tracking_config["reward_config_id"],
+                 "reward_config_hash": reward_tracking_config["reward_config_hash"],
+                 "reward_stage": reward_tracking_config["reward_stage"],
+                 "reward_mode": reward_tracking_config["reward_mode"],
            }
        )
        try:
@@ -1148,6 +1645,7 @@ def train_cybersecurity_owasp_grpo(
    class TrackioSystemMetricsCallback(TrainerCallback):
        def on_train_begin(self, args, state, control, **kwargs):
            try:
+                 reward_summary = log_reward_config(reward_settings, step=int(state.global_step or 0))
                metrics = log_gpu_metrics(step=int(state.global_step or 0))
                log_trackio_metrics(
                    {
@@ -1160,8 +1658,13 @@ def train_cybersecurity_owasp_grpo(
                    },
                    step=int(state.global_step or 0),
                )
+                 print(
+                     "Trackio reward config logged: "
+                     f"{reward_summary['reward_config_id']} "
+                     f"({reward_summary['reward_config_hash']})"
+                 )
            except Exception as exc:
-                 print(f"Trackio GPU metrics initialization skipped: {exc!r}")
+                 print(f"Trackio initialization metrics skipped: {exc!r}")
                return control
            if metrics:
                system_summary = ", ".join(
@@ -1199,6 +1702,8 @@ def train_cybersecurity_owasp_grpo(
    print(f"Trackio Project: {trackio_project}")
    print(f"Output repo: {output_repo_id}")
    print(f"Run name: {run_name}")
+     print(f"Reward config: {reward_tracking_config['reward_config_id']}")
+     print(f"Reward config hash: {reward_tracking_config['reward_config_hash']}")
    print(f"Model cache volume: {CACHE_VOLUME_NAME}")
    print(f"Scenario cache volume: {SCENARIO_CACHE_VOLUME_NAME}")
    print(f"Scenario cache dir: {scenario_cache_env['CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR']}")
@@ -1451,6 +1956,7 @@ def train_cybersecurity_owasp_grpo(
        "push_to_hub": push_to_hub,
        "scenario_cache_volume": SCENARIO_CACHE_VOLUME_NAME,
        "scenario_cache_mode": "require",
+         **reward_tracking_config,
    }

@@ -1506,8 +2012,46 @@ def main(
        result = check_training_imports.remote()
        print(result)
        return
+     if mode == "baseline":
+         if int(num_generations) != 1:
+             raise ValueError("baseline mode expects --num-generations 1.")
+         trace_log_every = max(0, int(trace_log_every))
+         run_name = run_name or "baseline"
+         preflight = verify_modal_scenario_cache_for_training.remote(
+             split=split,
+             difficulty=difficulty,
+             dataset_size=dataset_size,
+             seed_start=seed_start,
+         )
+         print(f"CPU scenario cache preflight passed: {preflight}")
+         kwargs = dict(
+             max_steps=max_steps,
+             dataset_size=dataset_size,
+             difficulty=difficulty,
+             split=split,
+             model_name=model_name,
+             max_seq_length=max_seq_length,
+             max_completion_length=max_completion_length,
+             trackio_space_id=trackio_space_id,
+             trackio_project=trackio_project,
+             num_generations=num_generations,
+             trace_log_every=trace_log_every,
+             seed_start=seed_start,
+             git_sha=git_sha,
+             run_name=run_name,
+             source_mode=source_mode,
+             repo_url=repo_url,
+             repo_branch=repo_branch,
+         )
+         if detach:
+             call = run_cybersecurity_owasp_baseline.spawn(**kwargs)
+             print(f"Spawned Modal baseline call: {call.object_id}")
+         else:
+             result = run_cybersecurity_owasp_baseline.remote(**kwargs)
+             print(f"Baseline result: {result}")
+         return
    if mode != "train":
-         raise ValueError("mode must be 'prepare-cache', 'train', or 'config'")
+         raise ValueError("mode must be 'prepare-cache', 'train', 'baseline', or 'config'")

    (
        resolved_gradient_accumulation_steps,
tests/test_reward_config.py CHANGED
@@ -4,7 +4,11 @@ import pytest
 
from CyberSecurity_OWASP.reward_config import (
    compute_token_penalty,
+     flatten_reward_config,
    load_reward_settings,
+     reward_config_hash,
+     reward_config_run_config,
+     reward_config_summary,
)

@@ -32,6 +36,38 @@ def test_reward_config_env_overrides(monkeypatch):
    assert compute_token_penalty(850, settings) == -0.5


+ def test_reward_config_hash_and_flattened_values_are_deterministic(monkeypatch):
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "middle")
+
+     settings = load_reward_settings()
+     first_hash = reward_config_hash(settings)
+     second_hash = reward_config_hash(load_reward_settings())
+     summary = reward_config_summary(settings)
+     run_config = reward_config_run_config(settings)
+     rows = {row["key"]: row for row in flatten_reward_config(settings)}
+
+     assert first_hash == second_hash
+     assert len(first_hash) == 64
+     assert summary["reward_config_hash"] == first_hash
+     assert summary["reward_config_id"].endswith(first_hash[:12])
+     assert run_config["reward_config_hash"] == first_hash
+     assert run_config["reward_mode"] == "dense_train"
+     assert run_config["reward_stage"] == "middle"
+     assert run_config["reward_config_values"]["policy_inspected"]["value"] == 0.30
+     assert run_config["reward_config_values"]["shaping_weight"]["stage_value"] == 0.7
+     assert run_config["reward_config__policy_inspected__value"] == 0.30
+     assert run_config["reward_config__shaping_weight__stage_value"] == 0.7
+     assert "policy_inspected" in run_config["reward_config_values_json"]
+     assert rows["policy_inspected"]["value"] == 0.30
+     assert rows["shaping_weight"]["stage_value"] == 0.7
+     assert rows["shaping_weight"]["resolved"] == 0.7
+     assert rows["step_penalty"]["stage_value"] == -0.01
+     assert rows["oversized_patch"]["threshold"] == 80
+     assert rows["oversized_patch"]["severe_threshold"] == 180
+     assert rows["hidden_file_probe"]["terminate"] is True
+
+
def test_reward_config_rejects_missing_descriptions(monkeypatch):
    config_path = Path("outputs/test_reward_config_bad.yaml")
    config_path.parent.mkdir(parents=True, exist_ok=True)
tests/test_trackio_utils.py CHANGED
@@ -1,14 +1,20 @@
import json
+ import sys
+ import types

from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
+ from CyberSecurity_OWASP.reward_config import load_reward_settings
from training.trackio_utils import (
    CANONICAL_TRACKIO_SIGNALS,
    DERIVED_TRACKIO_METRICS,
+     REWARD_CONFIG_TABLE_COLUMNS,
    aggregate_episode_metrics,
    episode_record_from_state,
    episode_trace_fingerprint,
    episode_to_trace_row,
    episode_to_tracking_fields,
+     log_reward_config,
+     reward_config_scalar_metrics,
)

from .helpers import apply_secure_patch, make_env, secure_invoice_source, submit_valid_finding
@@ -129,3 +135,69 @@ def test_trace_fingerprint_ignores_episode_id_but_tracks_action_changes():
    assert episode_trace_fingerprint(base_record) == episode_trace_fingerprint(token_only_reward_change)
    assert episode_trace_fingerprint(base_record) != episode_trace_fingerprint(changed_trace)
    assert episode_trace_fingerprint(base_record) != episode_trace_fingerprint(different_scenario)
+
+
+ def test_log_reward_config_emits_scalar_values_and_table(monkeypatch):
+     logged: list[tuple[dict, int | None]] = []
+
+     class FakeTable:
+         def __init__(self, *, columns, data=None, rows=None, allow_mixed_types=False):
+             self.columns = columns
+             self.rows = data if data is not None else rows
+             self.data = self.rows
+             self.allow_mixed_types = allow_mixed_types
+
+     fake_trackio = types.SimpleNamespace(config={}, Table=FakeTable)
+
+     def fake_log(payload, step=None):
+         logged.append((payload, step))
+
+     fake_trackio.log = fake_log
+     monkeypatch.setitem(sys.modules, "trackio", fake_trackio)
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "early")
+
+     settings = load_reward_settings()
+     summary = log_reward_config(settings, step=0)
+
+     assert fake_trackio.config["reward_config_hash"] == summary["reward_config_hash"]
+     assert fake_trackio.config["reward_config_values"]["policy_inspected"]["value"] == 0.30
+     assert fake_trackio.config["reward_config__policy_inspected__value"] == 0.30
+     scalar_payload = next(payload for payload, _step in logged if "reward_config/policy_inspected/value" in payload)
+     assert scalar_payload["reward_config/policy_inspected/value"] == 0.30
+     assert scalar_payload["reward_config/shaping_weight/resolved"] == 1.0
+     assert scalar_payload["reward_config/invalid_action/value"] == -0.20
+     assert scalar_payload["reward_config/progressive_cap/value"] == 5.0
+     assert scalar_payload["reward_config/oversized_patch/severe_value"] == -1.0
+
+     table = next(payload["reward_config"] for payload, _step in logged if "reward_config" in payload)
+     assert table.columns == list(REWARD_CONFIG_TABLE_COLUMNS)
+     assert table.allow_mixed_types is True
+     rows = {row[0]: row for row in table.rows}
+     assert rows["policy_inspected"][1] == 0.30
+     assert rows["shaping_weight"][2] == 1.0
+     assert rows["hidden_file_probe"][6] is True
+
+     logged_text = json.dumps(
+         {
+             "summary": summary,
+             "scalar_payload": scalar_payload,
+             "table_rows": table.rows,
+         },
+         sort_keys=True,
+         default=str,
+     )
+     assert "owner_invoice_id" not in logged_text
+     assert "foreign_invoice_id" not in logged_text
+
+
+ def test_reward_config_scalar_metrics_uses_stage_resolved_values(monkeypatch):
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
+     monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "late")
+
+     metrics = reward_config_scalar_metrics(load_reward_settings())
+
+     assert metrics["reward_config/shaping_weight/resolved"] == 0.4
+     assert metrics["reward_config/shaping_weight/stage_value"] == 0.4
+     assert metrics["reward_config/step_penalty/stage_value"] == -0.02
+     assert metrics["reward_config/token_penalty/target_tokens"] == 350.0
1
  import json
2
+ import sys
3
+ import types
4
 
5
  from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
6
+ from CyberSecurity_OWASP.reward_config import load_reward_settings
7
  from training.trackio_utils import (
8
  CANONICAL_TRACKIO_SIGNALS,
9
  DERIVED_TRACKIO_METRICS,
10
+ REWARD_CONFIG_TABLE_COLUMNS,
11
  aggregate_episode_metrics,
12
  episode_record_from_state,
13
  episode_trace_fingerprint,
14
  episode_to_trace_row,
15
  episode_to_tracking_fields,
16
+ log_reward_config,
17
+ reward_config_scalar_metrics,
18
  )
19
 
20
  from .helpers import apply_secure_patch, make_env, secure_invoice_source, submit_valid_finding
 
135
  assert episode_trace_fingerprint(base_record) == episode_trace_fingerprint(token_only_reward_change)
136
  assert episode_trace_fingerprint(base_record) != episode_trace_fingerprint(changed_trace)
137
  assert episode_trace_fingerprint(base_record) != episode_trace_fingerprint(different_scenario)
138
+
139
+
140
+ def test_log_reward_config_emits_scalar_values_and_table(monkeypatch):
141
+ logged: list[tuple[dict, int | None]] = []
142
+
143
+ class FakeTable:
144
+ def __init__(self, *, columns, data=None, rows=None, allow_mixed_types=False):
145
+ self.columns = columns
146
+ self.rows = data if data is not None else rows
147
+ self.data = self.rows
148
+ self.allow_mixed_types = allow_mixed_types
149
+
150
+ fake_trackio = types.SimpleNamespace(config={}, Table=FakeTable)
151
+
152
+ def fake_log(payload, step=None):
153
+ logged.append((payload, step))
154
+
155
+ fake_trackio.log = fake_log
156
+ monkeypatch.setitem(sys.modules, "trackio", fake_trackio)
157
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
158
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "early")
159
+
160
+ settings = load_reward_settings()
161
+ summary = log_reward_config(settings, step=0)
162
+
163
+ assert fake_trackio.config["reward_config_hash"] == summary["reward_config_hash"]
164
+ assert fake_trackio.config["reward_config_values"]["policy_inspected"]["value"] == 0.30
165
+ assert fake_trackio.config["reward_config__policy_inspected__value"] == 0.30
166
+ scalar_payload = next(payload for payload, _step in logged if "reward_config/policy_inspected/value" in payload)
167
+ assert scalar_payload["reward_config/policy_inspected/value"] == 0.30
168
+ assert scalar_payload["reward_config/shaping_weight/resolved"] == 1.0
169
+ assert scalar_payload["reward_config/invalid_action/value"] == -0.20
170
+ assert scalar_payload["reward_config/progressive_cap/value"] == 5.0
171
+ assert scalar_payload["reward_config/oversized_patch/severe_value"] == -1.0
172
+
173
+ table = next(payload["reward_config"] for payload, _step in logged if "reward_config" in payload)
174
+ assert table.columns == list(REWARD_CONFIG_TABLE_COLUMNS)
175
+ assert table.allow_mixed_types is True
176
+ rows = {row[0]: row for row in table.rows}
177
+ assert rows["policy_inspected"][1] == 0.30
178
+ assert rows["shaping_weight"][2] == 1.0
179
+ assert rows["hidden_file_probe"][6] is True
180
+
181
+ logged_text = json.dumps(
182
+ {
183
+ "summary": summary,
184
+ "scalar_payload": scalar_payload,
185
+ "table_rows": table.rows,
186
+ },
187
+ sort_keys=True,
188
+ default=str,
189
+ )
190
+ assert "owner_invoice_id" not in logged_text
191
+ assert "foreign_invoice_id" not in logged_text
192
+
193
+
194
+ def test_reward_config_scalar_metrics_uses_stage_resolved_values(monkeypatch):
195
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
196
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "late")
197
+
198
+ metrics = reward_config_scalar_metrics(load_reward_settings())
199
+
200
+ assert metrics["reward_config/shaping_weight/resolved"] == 0.4
201
+ assert metrics["reward_config/shaping_weight/stage_value"] == 0.4
202
+ assert metrics["reward_config/step_penalty/stage_value"] == -0.02
203
+ assert metrics["reward_config/token_penalty/target_tokens"] == 350.0
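The tests above hinge on planting a fake `trackio` module in `sys.modules` before the module under test performs its lazy import (via `_load_trackio`). A minimal standalone sketch of that mechanism, assuming nothing beyond the standard library (the `stub` object here is hypothetical, not the project's fixture):

```python
import sys
import types

# Python resolves `import trackio` through sys.modules before searching
# the import path, so a pre-registered object is returned as-is. pytest's
# monkeypatch.setitem(sys.modules, ...) uses the same trick with cleanup.
stub = types.SimpleNamespace(config={}, log=lambda payload, step=None: None)
sys.modules["trackio"] = stub

import trackio  # no real trackio package needs to be installed

assert trackio is stub
```

This is why the tests never require the real Trackio dependency: any later `import trackio` inside `training.trackio_utils` sees the stub's `log`, `config`, and `Table` attributes instead.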
training/trackio_utils.py CHANGED
@@ -174,6 +174,19 @@ TRACE_TABLE_COLUMNS = (
     "terminal_reason",
 )
 
+REWARD_CONFIG_TABLE_COLUMNS = (
+    "key",
+    "value",
+    "stage_value",
+    "cap",
+    "threshold",
+    "severe_threshold",
+    "terminate",
+    "description",
+)
+
+REWARD_STAGES_FOR_TRACKING = ("early", "middle", "late", "final")
+
 SENSITIVE_TEXT_PATTERNS = (
     re.compile(r"hf_[A-Za-z0-9_]+"),
     re.compile(r"(?i)(secret|token|password|api[_-]?key)\s*[:=]\s*[^,\s}]+"),
@@ -265,6 +278,10 @@ def _stable_hash(value: Any, length: int = 16) -> str:
     return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]
 
 
+def _metric_safe(value: str) -> str:
+    return re.sub(r"[^A-Za-z0-9_.-]+", "_", value).strip("_")
+
+
 def _redact_text(value: Any, limit: int = 800) -> str:
     text = str(value)
     for pattern in SENSITIVE_TEXT_PATTERNS:
@@ -524,6 +541,10 @@ def episode_record_from_state(
         "run/base_model": context.get("base_model", context.get("run/base_model", "")),
         "run/algo": context.get("algo", context.get("run/algo", "")),
         "run/reward_version": context.get("reward_version", "reward_v2"),
+        "run/reward_config_id": context.get("reward_config_id", ""),
+        "run/reward_config_hash": context.get("reward_config_hash", ""),
+        "run/reward_mode": context.get("reward_mode", ""),
+        "run/reward_stage": context.get("reward_stage", ""),
         "run/env_version": context.get("env_version", "0.1.0"),
         "episode_id": getattr(state, "episode_id", ""),
         "task_id": getattr(state, "task_id", ""),
@@ -926,7 +947,7 @@ def log_trace_table(
     rows = trace_table_rows(episodes)
     table = trackio.Table(
         columns=list(TRACE_TABLE_COLUMNS),
-        rows=[[row.get(column, "") for column in TRACE_TABLE_COLUMNS] for row in rows],
+        data=[[row.get(column, "") for column in TRACE_TABLE_COLUMNS] for row in rows],
        allow_mixed_types=True,
     )
     if step is None:
@@ -1053,6 +1074,132 @@ def log_trackio_metrics(metrics: dict[str, Any], step: int | None = None) -> None:
     trackio.log(numeric, step=step)
 
 
+def reward_config_trackio_config(settings: Any | None = None) -> dict[str, Any]:
+    """Return nonnumeric reward config identity fields for Trackio run config."""
+
+    try:
+        from CyberSecurity_OWASP.reward_config import (
+            load_reward_settings,
+            reward_config_run_config,
+        )
+    except ImportError:  # pragma: no cover
+        from reward_config import load_reward_settings, reward_config_run_config
+
+    settings = settings or load_reward_settings()
+    return reward_config_run_config(settings)
+
+
+def reward_config_scalar_metrics(settings: Any | None = None) -> dict[str, float]:
+    """Return numeric reward config values as scalar Trackio metrics."""
+
+    try:
+        from CyberSecurity_OWASP.reward_config import (
+            load_reward_settings,
+            reward_config_summary,
+        )
+    except ImportError:  # pragma: no cover
+        from reward_config import load_reward_settings, reward_config_summary
+
+    settings = settings or load_reward_settings()
+    summary = reward_config_summary(settings)
+    metrics = {
+        "reward_config/shaping_weight/resolved": _float(
+            summary.get("reward_shaping_weight")
+        )
+    }
+    for row in summary.get("reward_entries", []):
+        key = _metric_safe(str(row.get("key", "")))
+        if not key:
+            continue
+        for field in (
+            "value",
+            "stage_value",
+            "resolved",
+            "cap",
+            "threshold",
+            "severe_threshold",
+            "terminate",
+        ):
+            value = row.get(field)
+            if isinstance(value, (int, float, bool)):
+                metrics[f"reward_config/{key}/{field}"] = _float(value)
+
+        raw_entry = settings.entry(str(row.get("key", "")))
+        for extra_key, value in raw_entry.items():
+            if extra_key in {
+                "description",
+                "value",
+                "cap",
+                "threshold",
+                "threshold_lines",
+                "severe_threshold",
+                "severe_threshold_lines",
+                "terminate",
+                *REWARD_STAGES_FOR_TRACKING,
+            }:
+                continue
+            if isinstance(value, (int, float, bool)):
+                metrics[
+                    f"reward_config/{key}/{_metric_safe(str(extra_key))}"
+                ] = _float(value)
+    return metrics
+
+
+def log_reward_config(
+    settings: Any | None = None,
+    *,
+    step: int | None = 0,
+    table_name: str = "reward_config",
+) -> dict[str, Any]:
+    """Log reward config scalar values and a Trackio table for one run."""
+
+    try:
+        from CyberSecurity_OWASP.reward_config import (
+            load_reward_settings,
+            reward_config_summary,
+        )
+    except ImportError:  # pragma: no cover
+        from reward_config import load_reward_settings, reward_config_summary
+
+    settings = settings or load_reward_settings()
+    summary = reward_config_summary(settings)
+
+    trackio = _load_trackio()
+    config_payload = reward_config_trackio_config(settings)
+    active_config = getattr(trackio, "config", None)
+    if isinstance(active_config, dict):
+        active_config.update(config_payload)
+    context_vars = getattr(trackio, "context_vars", None)
+    current_run_var = getattr(context_vars, "current_run", None)
+    if current_run_var is not None:
+        current_run = current_run_var.get()
+        if current_run is not None and isinstance(getattr(current_run, "config", None), dict):
+            current_run.config.update(config_payload)
+            # Force Trackio to persist the enriched run config even if the
+            # trainer or auto GPU logger emitted an earlier config-only log.
+            current_run._config_logged = False
+    log_trackio_metrics(reward_config_scalar_metrics(settings), step=step)
+
+    rows = []
+    for entry in summary.get("reward_entries", []):
+        rows.append(
+            [
+                entry.get(column, "")
+                for column in REWARD_CONFIG_TABLE_COLUMNS
+            ]
+        )
+    table = trackio.Table(
+        columns=list(REWARD_CONFIG_TABLE_COLUMNS),
+        data=rows,
+        allow_mixed_types=True,
+    )
+    if step is None:
+        trackio.log({table_name: table})
+    else:
+        trackio.log({table_name: table}, step=step)
+    return summary
+
+
 def collect_torch_gpu_metrics() -> dict[str, float]:
     """Collect explicit torch CUDA metrics for Trackio scalar dashboards."""
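Downstream, the new helpers are designed to be invoked once at run start. A minimal usage sketch, assuming the training entrypoint has already initialized a Trackio run; the environment values mirror the modes the tests exercise:

```python
import os

from CyberSecurity_OWASP.reward_config import load_reward_settings
from training.trackio_utils import log_reward_config

# Select the reward schedule before loading settings; load_reward_settings
# reads these variables (the tests cover dense_train with early/late stages).
os.environ["CYBERSECURITY_OWASP_REWARD_MODE"] = "dense_train"
os.environ["CYBERSECURITY_OWASP_REWARD_STAGE"] = "early"

settings = load_reward_settings()

# Logs reward_config/<key>/<field> scalars and a reward_config table at
# step 0, and stamps reward_config_hash into the run config so parallel
# runs can be checked for identical reward settings.
summary = log_reward_config(settings, step=0)
print(summary["reward_config_hash"])
```

Because `log_reward_config` defaults `step` to 0 and returns the summary dict, callers can log once at startup and keep the hash for cross-run bookkeeping.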