Humanlearning committed on
Commit b3ee507 · 1 Parent(s): 28685f3

feat: update training configuration and documentation for Modal execution, including new model integration and enhanced tracking utilities

.agents/skills/cybersecurity-owasp-trainer/SKILL.md CHANGED
@@ -7,7 +7,9 @@ description: Train, debug, evaluate, and document CyberSecurity_OWASP model runs
 
 ## Overview
 
-Use this skill to run or modify the CyberSecurity_OWASP training and evaluation loop without weakening the verifier, reward integrity, or hackathon evidence trail. Treat the environment and reward engine as the product; training only starts after those are stable.
+Use this skill to run or modify the CyberSecurity_OWASP training and evaluation loop without weakening the verifier, reward integrity, or hackathon evidence trail. Training is expected to run on Modal only.
+
+Important: do **not** run GRPO/PPO training loops locally in this repo. Use Modal launchers (`scripts/modal_ephemeral_train.py` for smoke and `scripts/modal_train_grpo.py` for GRPO).
 
 ## References
 
@@ -37,28 +39,28 @@ Prefer the existing repo modules:
 - `training/rollout.py`: full OpenEnv episode loop, action JSON parsing, reward trace, rollout artifact fields.
 - `training/reward_funcs.py`: component reward functions exposed to TRL/GRPO.
-- `training/train_grpo.py`: `GRPOConfig`, model defaults, Trackio reporting, vLLM settings.
+- `training/train_grpo.py`: `GRPOConfig`/model defaults and launch intent (does not run local training).
 - `training/eval_before_after.py`: baseline-vs-trained and held-out summary metrics.
 - `training/trackio_utils.py`: run naming, canonical metric names, Trackio init/log/finalize helpers.
 
 Default environment values:
 
 ```powershell
-$env:MODEL_NAME = "Qwen/Qwen3-1.7B"
+$env:MODEL_NAME = "google/gemma-2-2b-it"
 $env:TRACKIO_SPACE_ID = "Humanlearning/CyberSecurity_OWASP-trackio"
 $env:TRACKIO_PROJECT = "CyberSecurity_OWASP"
 $env:DIFFICULTY = "0"
 ```
 
-Use level-0 debug runs before scaling. Do not increase batch size, prompt count, scenario diversity, or difficulty until sampled artifacts show real discover-then-patch behavior rather than formatting compliance only.
+Use level-0 debug runs before scaling, and verify them through Modal smoke/ephemeral runs.
 
 ## Training Workflow
 
 1. Validate the environment first: run the targeted tests that cover models, reset/step/state, rewards, anti-cheat, seed reproducibility, invalid actions, and rollouts.
-2. Run a tiny smoke path that constructs `GRPOConfig` without starting expensive training.
-3. Run a frozen-model or dummy-policy rollout and inspect the action trace, observations, terminal reason, and reward breakdown.
+2. Run a Modal smoke path for lightweight config/run verification.
+3. Run a frozen-model or dummy-policy rollout on Modal and inspect the action trace, observations, terminal reason, and reward breakdown.
 4. Confirm Trackio receives component metrics and the run name follows `CyberSecurity_OWASP-<model>-<algo>-level<difficulty>-<YYYYMMDD-HHMM>-<git_sha>`.
-5. Start a very small GRPO run only after the above passes. Watch completions and rollout artifacts during the run, not just aggregate reward.
+5. Start a very small GRPO run only after the above passes. Start via `scripts/modal_train_grpo.py --mode train`.
 6. Evaluate baseline, trained, and held-out splits with `training/eval_before_after.py` and save summaries under `outputs/evals/`.
 7. Save sampled rollouts under `outputs/rollouts/` for baseline, mid-training, trained, and held-out evidence.
 
@@ -77,7 +79,7 @@ Stop or roll back if reward rises while sampled traces show deny-all patches, ha
 - Use TRL GRPO for verifier-driven rewards. Keep multiple independent reward functions for logging and diagnosis.
 - Keep the existing custom rollout path unless deliberately migrating to TRL's `environment_factory`. If migrating, preserve typed actions, observations, reward component logging, anti-cheat flags, and rollout artifacts.
-- Use vLLM colocate for small local runs when memory allows; use server mode only when a separate inference GPU/server is available.
+- Use Modal as the default training path; local-only vLLM/GRPO execution is intentionally avoided in this repository.
 - For OpenEnv server training concurrency, ensure the server supports enough concurrent sessions for the generation batch.
 - Use Unsloth with LoRA or QLoRA for memory efficiency when the training machine supports it. Start from an instruct-capable checkpoint and verify the model has non-zero success probability before RL.
 - Pin and smoke-test TRL, Unsloth, vLLM, CUDA, and torch versions before longer runs.
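The run-name convention in workflow step 4 can be sketched in Python. `build_run_name` is a hypothetical helper for illustration; the repo's actual naming logic lives in `training/trackio_utils.py`:

```python
from datetime import datetime, timezone


def build_run_name(model: str, algo: str, difficulty: int, git_sha: str) -> str:
    # Hypothetical helper mirroring the documented pattern:
    # CyberSecurity_OWASP-<model>-<algo>-level<difficulty>-<YYYYMMDD-HHMM>-<git_sha>
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
    short_model = model.split("/")[-1]  # drop the HF org prefix
    return f"CyberSecurity_OWASP-{short_model}-{algo}-level{difficulty}-{stamp}-{git_sha}"


print(build_run_name("google/gemma-2-2b-it", "grpo", 0, "b3ee507"))
```

A name built this way sorts chronologically within a model/algo/difficulty combination, which keeps Trackio run lists readable.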
01_ARCHITECTURE.md CHANGED
@@ -397,16 +397,18 @@ Editable source: `assets/env_rl_training_flow_diagram.mmd`
 9. Produce final demo: before/after trace + reward curve + held-out eval table.
 ```
 
-Recommended initial training setup:
+Recommended initial training setup (Modal-first):
 
 ```text
-Model: Qwen/Qwen3-1.7B or similar small instruct model
+Model: google/gemma-2-2b-it (or compatible Gemma-class instruct model)
 Algorithm: GRPO via TRL or Unsloth-compatible loop
 Dataset prompt: repeated task instruction with randomized scenario IDs
 Max steps per episode: 30
 Rollouts per prompt: 2-4
 Logging: Trackio
 Primary eval: held-out deterministic test pass rate
+
+Training execution is expected to run on Modal (persistent or ephemeral) rather than locally.
 ```
 
 ## 9. Deployment architecture
README.md CHANGED
@@ -149,6 +149,10 @@ Training files are under `training/`:
 
 The training scaffold is intentionally minimal until the environment/verifier behavior is stable. Trackio metric names and GRPO defaults follow the project brief.
 
+`training/train_grpo.py` in this repo is a config helper only; it does not execute training locally.
+Use the Modal launchers in `scripts/modal_train_grpo.py` (persistent) and
+`scripts/modal_ephemeral_train.py` (smoke) for real GRPO runs.
+
 ## Trackio Run Tracking
 
 Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.
@@ -239,7 +243,7 @@ Defaults are derived from `HF_TOKEN`:
 
 - Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
 - Trackio project: `CyberSecurity_OWASP-grpo`
-- Output repo: `<hf-user>/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora`
+- Output repo: `<hf-user>/CyberSecurity_OWASP-gemma-2-2b-grpo-lora`
 
 Override these with `--trackio-space-id`, `--trackio-project`, and
 `--output-repo-id` when needed.
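The defaults listed above are all derived from the `HF_TOKEN` owner; a small sketch of that derivation (the function name is hypothetical, the values come from the README):

```python
def default_ids(hf_user: str) -> dict[str, str]:
    # Defaults derived from the HF_TOKEN owner's username, as listed in the README.
    return {
        "trackio_space_id": f"{hf_user}/CyberSecurity_OWASP-trackio",
        "trackio_project": "CyberSecurity_OWASP-grpo",
        "output_repo_id": f"{hf_user}/CyberSecurity_OWASP-gemma-2-2b-grpo-lora",
    }
```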
pyproject.toml CHANGED
@@ -45,7 +45,7 @@ server = "CyberSecurity_OWASP.server.app:main"
 
 [tool.setuptools]
 include-package-data = true
-packages = ["CyberSecurity_OWASP", "CyberSecurity_OWASP.server"]
+packages = ["CyberSecurity_OWASP", "CyberSecurity_OWASP.server", "training"]
 package-dir = { "CyberSecurity_OWASP" = ".", "CyberSecurity_OWASP.server" = "server" }
 
 [tool.pytest.ini_options]
scripts/modal_ephemeral_train.py CHANGED
@@ -12,6 +12,8 @@ the local process, so the run disappears when ``modal run`` exits.
 from __future__ import annotations
 
 import json
+import subprocess
+import time
 from datetime import datetime
 from pathlib import Path
 from typing import Any
@@ -20,6 +22,7 @@ import modal
 
 
 APP_NAME = "CyberSecurity_OWASP-ephemeral-training"
+SECRET_NAME = "CyberSecurity_OWASP-secrets"
 REMOTE_PROJECT = "/root/CyberSecurity_OWASP"
 PROJECT_ROOT = Path(__file__).resolve().parents[1]
 
@@ -63,7 +66,11 @@ class NoopTrainer:
 ]
 
 
-@app.function(image=image, timeout=60 * 30)
+@app.function(
+    image=image,
+    timeout=60 * 30,
+    secrets=[modal.Secret.from_name(SECRET_NAME, required_keys=["HF_TOKEN"])],
+)
 def run_ephemeral_smoke(
     episodes: int = 4,
     seed_start: int = 0,
@@ -75,17 +82,45 @@ def run_ephemeral_smoke(
         CybersecurityOwaspEnvironment,
     )
     from training.rollout import rollout_once
-    from training.trackio_utils import log_trackio_metrics, trackio_run
+    from training.trackio_utils import (
+        aggregate_episode_metrics,
+        episode_record_from_state,
+        log_episode_batch,
+        log_trackio_metrics,
+        trace_table_rows,
+        trackio_run,
+    )
 
     baseline = []
     oracle = []
+    run_context = {
+        "algo": "modal_ephemeral_smoke",
+        "reward_version": "reward_v1",
+        "env_version": "0.1.0",
+    }
 
     for offset in range(episodes):
        seed = seed_start + offset
 
        baseline_env = CybersecurityOwaspEnvironment()
-        baseline_env.reset(seed=seed, split="validation")
-        baseline.append(rollout_once(NoopTrainer(), baseline_env, max_steps=5))
+        baseline_rollout = rollout_once(
+            NoopTrainer(),
+            baseline_env,
+            max_steps=5,
+            reset_kwargs={"seed": seed, "split": "validation", "difficulty": 0},
+        )
+        baseline_record = episode_record_from_state(
+            baseline_env.state,
+            run_context={**run_context, "base_model": "noop"},
+        )
+        baseline_record.update(
+            {
+                "reward_total": baseline_rollout.get("reward_total", 0.0),
+                "success": baseline_rollout.get("success", False),
+                "episode_length": baseline_rollout.get("episode_length", 0),
+            }
+        )
+        baseline.append(baseline_record)
 
        oracle_env = CybersecurityOwaspEnvironment()
        oracle_env.reset(seed=seed, split="validation")
@@ -124,19 +159,25 @@ def run_ephemeral_smoke(
        )
        oracle_env.step(CyberSecurityOWASPAction(tool_name="run_visible_tests"))
        final = oracle_env.step(CyberSecurityOWASPAction(tool_name="submit_fix"))
-        oracle.append(
+        oracle_record = episode_record_from_state(
+            oracle_env.state,
+            run_context={**run_context, "base_model": "oracle"},
+            final_observation=final.model_dump(),
+        )
+        oracle_record.update(
            {
-                "seed": seed,
-                "success": oracle_env.state.success,
                "reward_total": final.reward_breakdown.get("total", 0.0),
-                "reward_breakdown": final.reward_breakdown,
+                "success": oracle_env.state.success,
            }
        )
+        oracle.append(oracle_record)
 
    def mean(items: list[dict[str, Any]], key: str) -> float:
        return sum(float(item.get(key, 0.0)) for item in items) / max(1, len(items))
 
    run_name = f"{APP_NAME}-{datetime.utcnow().strftime('%Y%m%d-%H%M%S')}"
+    episode_records = [*baseline, *oracle]
+    tracking_metrics = aggregate_episode_metrics(episode_records)
    result = {
        "run_name": run_name,
        "mode": "smoke",
@@ -145,6 +186,8 @@ def run_ephemeral_smoke(
        "baseline_mean_reward": mean(baseline, "reward_total"),
        "oracle_mean_reward": mean(oracle, "reward_total"),
        "oracle_success_rate": mean(oracle, "success"),
+        "tracking_metrics": tracking_metrics,
+        "tracking_trace_rows": trace_table_rows(episode_records),
        "baseline": baseline,
        "oracle": oracle,
    }
@@ -160,8 +203,10 @@ def run_ephemeral_smoke(
        },
        group="smoke",
    ):
+        logged_metrics = log_episode_batch(episode_records, step=0)
        log_trackio_metrics(
            {
+                **logged_metrics,
                "smoke/baseline_mean_reward": result["baseline_mean_reward"],
                "smoke/oracle_mean_reward": result["oracle_mean_reward"],
                "smoke/oracle_success_rate": result["oracle_success_rate"],
@@ -179,6 +224,130 @@ def run_grpo_config_check() -> str:
    return str(build_grpo_config())
 
 
+@app.function(
+    image=image,
+    timeout=60 * 10,
+    secrets=[modal.Secret.from_name(SECRET_NAME, required_keys=["HF_TOKEN"])],
+)
+def verify_trackio_run(
+    run_name: str,
+    trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
+    trackio_project: str = "CyberSecurity_OWASP-smoke",
+) -> dict[str, Any]:
+    import os
+    from training.trackio_utils import (
+        REQUIRED_SMOKE_TRACKIO_ITEMS,
+        missing_required_trackio_items,
+    )
+
+    hf_token = os.environ["HF_TOKEN"]
+    cmd = [
+        "trackio",
+        "get",
+        "run",
+        "--project",
+        trackio_project,
+        "--run",
+        run_name,
+        "--space",
+        trackio_space_id,
+        "--hf-token",
+        hf_token,
+        "--json",
+    ]
+    metrics_cmd = [
+        "trackio",
+        "list",
+        "metrics",
+        "--project",
+        trackio_project,
+        "--run",
+        run_name,
+        "--space",
+        trackio_space_id,
+        "--hf-token",
+        hf_token,
+        "--json",
+    ]
+    last_result: dict[str, Any] = {}
+    for attempt in range(1, 4):
+        completed = subprocess.run(cmd, capture_output=True, text=True)
+        metrics_completed = subprocess.run(metrics_cmd, capture_output=True, text=True)
+        last_result = {
+            "attempt": attempt,
+            "returncode": completed.returncode,
+            "stdout": completed.stdout[-4000:],
+            "stderr": completed.stderr[-4000:],
+            "metrics_returncode": metrics_completed.returncode,
+            "metrics_stdout": metrics_completed.stdout[-4000:],
+            "metrics_stderr": metrics_completed.stderr[-4000:],
+        }
+        if completed.returncode == 0:
+            data = json.loads(completed.stdout)
+            if metrics_completed.returncode == 0:
+                metrics_data = json.loads(metrics_completed.stdout)
+                if isinstance(metrics_data.get("metrics"), list):
+                    data["metrics"] = metrics_data["metrics"]
+            missing = missing_required_trackio_items(data, REQUIRED_SMOKE_TRACKIO_ITEMS)
+            return {
+                "ok": not missing,
+                "trackio_space_id": trackio_space_id,
+                "trackio_project": trackio_project,
+                "run_name": run_name,
+                "required_items": list(REQUIRED_SMOKE_TRACKIO_ITEMS),
+                "missing_required_items": missing,
+                "run": data,
+            }
+        time.sleep(10)
+    return {
+        "ok": False,
+        "trackio_space_id": trackio_space_id,
+        "trackio_project": trackio_project,
+        "run_name": run_name,
+        "last_result": last_result,
+    }
+
+
+@app.function(
+    image=image,
+    timeout=60 * 10,
+    secrets=[modal.Secret.from_name(SECRET_NAME, required_keys=["HF_TOKEN"])],
+)
+def inspect_trackio_space(
+    trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
+) -> dict[str, Any]:
+    import os
+
+    hf_token = os.environ["HF_TOKEN"]
+
+    def run_trackio(args: list[str]) -> dict[str, Any]:
+        completed = subprocess.run(
+            ["trackio", *args, "--space", trackio_space_id, "--hf-token", hf_token, "--json"],
+            capture_output=True,
+            text=True,
+        )
+        result = {
+            "returncode": completed.returncode,
+            "stdout": completed.stdout[-8000:],
+            "stderr": completed.stderr[-4000:],
+        }
+        if completed.returncode == 0:
+            result["json"] = json.loads(completed.stdout)
+        return result
+
+    projects_result = run_trackio(["list", "projects"])
+    projects = (projects_result.get("json") or {}).get("projects", [])
+    runs_by_project = {
+        project: run_trackio(["list", "runs", "--project", project])
+        for project in projects
+    }
+    return {
+        "trackio_space_id": trackio_space_id,
+        "projects": projects_result,
+        "runs_by_project": runs_by_project,
+    }
+
+
 @app.local_entrypoint()
 def main(
     mode: str = "smoke",
@@ -186,6 +355,7 @@ def main(
     seed_start: int = 0,
     trackio_space_id: str = "",
     trackio_project: str = "CyberSecurity_OWASP-smoke",
+    run_name: str = "",
 ) -> None:
     if mode == "smoke":
         result = run_ephemeral_smoke.remote(
@@ -201,5 +371,23 @@ def main(
         print(json.dumps({"saved": str(output_path), **result}, indent=2, sort_keys=True))
     elif mode == "grpo-config":
         print(run_grpo_config_check.remote())
+    elif mode == "verify-trackio":
+        if not run_name:
+            raise ValueError("--run-name is required for verify-trackio mode")
+        result = verify_trackio_run.remote(
+            run_name=run_name,
+            trackio_space_id=trackio_space_id
+            or "Humanlearning/CyberSecurity_OWASP-trackio",
+            trackio_project=trackio_project,
+        )
+        print(json.dumps(result, indent=2, sort_keys=True))
+    elif mode == "inspect-trackio":
+        result = inspect_trackio_space.remote(
+            trackio_space_id=trackio_space_id
+            or "Humanlearning/CyberSecurity_OWASP-trackio",
+        )
+        print(json.dumps(result, indent=2, sort_keys=True))
     else:
-        raise ValueError("mode must be 'smoke' or 'grpo-config'")
+        raise ValueError(
+            "mode must be 'smoke', 'grpo-config', 'verify-trackio', or 'inspect-trackio'"
+        )
scripts/modal_train_grpo.py CHANGED
@@ -28,12 +28,61 @@ import modal
 
 APP_NAME = "CyberSecurity_OWASP-grpo"
 VOLUME_NAME = "CyberSecurity_OWASP-grpo-runs"
 SECRET_NAME = "CyberSecurity_OWASP-secrets"
 RUNS_DIR = pathlib.Path("/runs")
 REMOTE_PROJECT = "/root/CyberSecurity_OWASP"
 PROJECT_ROOT = pathlib.Path(__file__).resolve().parents[1]
 PUBLIC_REPO_URL = "https://github.com/humandotlearning/CyberSecurity_OWASP.git"
 PUBLIC_REPO_BRANCH = "master"
 
 
 def _load_local_env_file() -> None:
@@ -114,6 +163,7 @@ def _training_image() -> modal.Image:
         "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo",
         "unsloth[base] @ git+https://github.com/unslothai/unsloth",
     )
     .uv_pip_install("pydantic==2.10.6")
     .uv_pip_install("mergekit", "immutables==0.21", extra_options="--no-deps")
     .uv_pip_install("llm-blender", "weave")
@@ -159,22 +209,25 @@ def _training_image() -> modal.Image:
 
 app = modal.App(APP_NAME)
 volume = modal.Volume.from_name(VOLUME_NAME, create_if_missing=True)
 secrets = _modal_secrets()
 
 
 @app.function(
     image=_training_image(),
-    gpu=["L4", "A10G"],
     timeout=4 * 60 * 60,
-    volumes={RUNS_DIR: volume},
     secrets=secrets,
 )
 def check_training_imports() -> dict[str, str]:
     import torch
     import trackio
     from datasets import Dataset
     from trl import GRPOConfig, GRPOTrainer
-    from unsloth import FastLanguageModel
 
     from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
         CybersecurityOwaspEnvironment,
@@ -189,16 +242,19 @@ def check_training_imports() -> dict[str, str]:
         "grpo_config": GRPOConfig.__name__,
         "grpo_trainer": GRPOTrainer.__name__,
         "unsloth_model": FastLanguageModel.__name__,
         "env": CybersecurityOwaspEnvironment.__name__,
         "reset_phase": obs.phase,
     }
 
 
 @app.function(
     image=_training_image(),
-    gpu=["L4", "A10G"],
     timeout=4 * 60 * 60,
-    volumes={RUNS_DIR: volume},
     secrets=secrets,
 )
 def train_cybersecurity_owasp_grpo(
@@ -208,11 +264,11 @@ def train_cybersecurity_owasp_grpo(
     dataset_size: int = 16,
     difficulty: int = 0,
     split: str = "train",
-    model_name: str = "Qwen/Qwen3-1.7B",
     max_seq_length: int = 4096,
     max_completion_length: int = 768,
     lora_rank: int = 32,
-    trackio_space_id: str = "",
     trackio_project: str = "CyberSecurity_OWASP-grpo",
     num_generations: int = 2,
     seed_start: int = 0,
@@ -221,15 +277,18 @@ def train_cybersecurity_owasp_grpo(
     source_mode: str = "local",
     repo_url: str = PUBLIC_REPO_URL,
     repo_branch: str = PUBLIC_REPO_BRANCH,
 ) -> dict[str, str | int | float]:
     import inspect
     import statistics
 
     import torch
-    from unsloth import FastLanguageModel
     import transformers.utils.hub as transformers_hub
     from datasets import Dataset
-    from huggingface_hub import whoami
     from transformers import TrainerCallback
     from trl import GRPOConfig, GRPOTrainer, clone_chat_template
     from trl.chat_template_utils import add_response_schema
@@ -240,14 +299,16 @@ def train_cybersecurity_owasp_grpo(
     from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
         CybersecurityOwaspEnvironment,
     )
 
-    if not hasattr(transformers_hub, "TRANSFORMERS_CACHE"):
-        transformers_hub.TRANSFORMERS_CACHE = os.path.join(
-            os.path.expanduser("~"),
-            ".cache",
-            "huggingface",
-            "hub",
-        )
 
     hf_token = os.environ.get("HF_TOKEN")
     if not hf_token:
@@ -257,8 +318,20 @@ def train_cybersecurity_owasp_grpo(
 
     user = whoami(token=hf_token)["name"]
     env_repo_id = env_repo_id or f"{user}/CyberSecurity_OWASP"
-    output_repo_id = output_repo_id or f"{user}/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora"
-    trackio_space_id = trackio_space_id or f"{user}/CyberSecurity_OWASP-trackio"
 
     os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
     os.environ["TRACKIO_PROJECT"] = trackio_project
@@ -271,6 +344,13 @@ def train_cybersecurity_owasp_grpo(
     output_dir = RUNS_DIR / run_name
     output_dir.mkdir(parents=True, exist_ok=True)
 
     training_prompt = (
         "You are a defensive AppSec repair agent in the local CyberSecurity_OWASP "
         "OpenEnv environment. Use only the provided local tools. Do not target real "
@@ -570,49 +650,48 @@ def train_cybersecurity_owasp_grpo(
         completions = kwargs.get("completions") or kwargs.get("completion") or []
         trace_step["value"] += 1
 
-        breakdowns = [getattr(env, "reward_breakdown", {}) or {} for env in environments]
         metrics = {
-            "train/reward_total_mean": _mean(rewards),
-            "train/reward_discovery_mean": _mean(
-                [float(item.get("discovery", 0.0)) for item in breakdowns]
-            ),
-            "train/reward_security_mean": _mean(
-                [float(item.get("security", 0.0)) for item in breakdowns]
-            ),
-            "train/reward_regression_mean": _mean(
-                [float(item.get("regression", 0.0)) for item in breakdowns]
-            ),
-            "train/reward_public_routes_mean": _mean(
-                [float(item.get("public_routes", 0.0)) for item in breakdowns]
-            ),
-            "train/reward_patch_quality_mean": _mean(
-                [float(item.get("patch_quality", 0.0)) for item in breakdowns]
-            ),
-            "train/reward_visible_tests_mean": _mean(
-                [float(item.get("visible_tests", 0.0)) for item in breakdowns]
-            ),
-            "train/reward_anti_cheat_mean": _mean(
-                [float(item.get("anti_cheat", 0.0)) for item in breakdowns]
-            ),
-            "train/success_rate": _mean(
-                [1.0 if bool(getattr(env, "success", False)) else 0.0 for env in environments]
-            ),
-            "train/invalid_action_rate": _mean(
-                [float(getattr(env, "invalid_actions", 0)) for env in environments]
-            ),
-            "train/episode_length_mean": _mean(
-                [
-                    float(getattr(env, "trace_metadata", {}).get("step_count", 0))
-                    for env in environments
-                ]
-            ),
         }
 
         try:
-            trackio.log(metrics, step=trace_step["value"])
         except Exception as exc:
             print(f"Trackio metric logging skipped: {exc!r}")
 
         for index, env in enumerate(environments):
             messages = list(getattr(env, "trace_messages", []))
             if index < len(completions):
@@ -655,9 +734,24 @@ def train_cybersecurity_owasp_grpo(
         return rewards
 
     class TrackioSystemMetricsCallback(TrainerCallback):
         def on_log(self, args, state, control, logs=None, **kwargs):
             try:
-                metrics = trackio.log_gpu()
             except Exception as exc:
                 print(f"Trackio GPU metrics skipped: {exc!r}")
                 return control
@@ -666,6 +760,13 @@ def train_cybersecurity_owasp_grpo(
             print(f"Trackio GPU metrics logged at step {state.global_step}: {summary}")
             return control
 
     print(f"CUDA available: {torch.cuda.is_available()}")
     if source_mode == "public":
         print(f"Installed CyberSecurity_OWASP from public repo: {repo_url}@{repo_branch}")
@@ -675,27 +776,114 @@ def train_cybersecurity_owasp_grpo(
     print(f"Trackio Project: {trackio_project}")
     print(f"Output repo: {output_repo_id}")
     print(f"Run name: {run_name}")
 
-    model, tokenizer = FastLanguageModel.from_pretrained(
         model_name=model_name,
         max_seq_length=max_seq_length,
         load_in_4bit=False,
         fast_inference=False,
         token=hf_token,
     )
     try:
         tokenizer = add_response_schema(tokenizer)
     except Exception as exc:
-        print(f"Tokenizer response schema add failed before cloning: {exc!r}")
-        model, tokenizer, added_tokens = clone_chat_template(
-            model,
-            tokenizer,
-            "Qwen/Qwen3-0.6B",
-        )
-        print(f"Cloned Qwen3 chat template; added {len(added_tokens)} tokens.")
-        tokenizer = add_response_schema(tokenizer)
 
-    model = FastLanguageModel.get_peft_model(
         model,
         r=lora_rank,
         target_modules=[
@@ -711,7 +899,9 @@ def train_cybersecurity_owasp_grpo(
         use_gradient_checkpointing="unsloth",
         random_state=3407,
     )
-    FastLanguageModel.for_training(model)
 
     grpo_config_values = {
         "temperature": 1.0,
@@ -732,7 +922,7 @@ def train_cybersecurity_owasp_grpo(
         "trackio_space_id": trackio_space_id,
         "run_name": run_name,
         "output_dir": str(output_dir),
-        "push_to_hub": True,
         "hub_model_id": output_repo_id,
         "hub_private_repo": True,
         "hub_strategy": "every_save",
@@ -742,7 +932,7 @@ def train_cybersecurity_owasp_grpo(
         "epsilon_high": 0.28,
         "delta": 1.5,
         "loss_type": "bnpo",
-        "mask_truncated_completions": False,
     }
     grpo_config_parameters = set(inspect.signature(GRPOConfig).parameters)
     skipped_config_keys = sorted(set(grpo_config_values) - grpo_config_parameters)
@@ -776,9 +966,23 @@ def train_cybersecurity_owasp_grpo(
             if key in trainer_parameters
         }
     )
     trainer.train()
-    trainer.push_to_hub()
     volume.commit()
 
     return {
         "run_name": run_name,
@@ -796,6 +1000,7 @@ def train_cybersecurity_owasp_grpo(
         "source_mode": source_mode,
         "repo_url": repo_url,
         "repo_branch": repo_branch,
     }
 
 
@@ -808,11 +1013,11 @@ def main(
     dataset_size: int = 16,
     difficulty: int = 0,
     split: str = "train",
-    model_name: str = "Qwen/Qwen3-1.7B",
     max_seq_length: int = 4096,
     max_completion_length: int = 768,
     lora_rank: int = 32,
-    trackio_space_id: str = "",
     trackio_project: str = "CyberSecurity_OWASP-grpo",
     num_generations: int = 2,
     seed_start: int = 0,
@@ -821,6 +1026,7 @@ def main(
     repo_url: str = PUBLIC_REPO_URL,
     repo_branch: str = PUBLIC_REPO_BRANCH,
     detach: bool = False,
 ) -> None:
     if mode == "config":
         result = check_training_imports.remote()
@@ -829,7 +1035,10 @@ def main(
     if mode != "train":
         raise ValueError("mode must be 'train' or 'config'")
 
-    trackio_space_id = trackio_space_id or os.environ.get("TRACKIO_SPACE_ID", "")
     trackio_project = trackio_project or os.environ.get(
         "TRACKIO_PROJECT", "CyberSecurity_OWASP-grpo"
     )
@@ -842,12 +1051,15 @@ def main(
     from huggingface_hub import whoami
843
 
844
  user = whoami(token=hf_token)["name"]
845
- resolved_trackio_space_id = (
846
- resolved_trackio_space_id or f"{user}/CyberSecurity_OWASP-trackio"
847
- )
 
 
 
848
  resolved_output_repo_id = (
849
  resolved_output_repo_id
850
- or f"{user}/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora"
851
  )
852
  except Exception as exc:
853
  print(f"Could not resolve Hugging Face defaults locally: {exc!r}")
@@ -883,8 +1095,10 @@ def main(
883
  else:
884
  print(
885
  "Output model repo: derived remotely from HF_TOKEN as "
886
- "<hf-user>/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora"
887
  )
 
 
888
 
889
  kwargs = dict(
890
  env_repo_id=env_repo_id,
@@ -906,6 +1120,7 @@ def main(
906
  source_mode=source_mode,
907
  repo_url=repo_url,
908
  repo_branch=repo_branch,
 
909
  )
910
  if detach:
911
  call = train_cybersecurity_owasp_grpo.spawn(**kwargs)
 
  APP_NAME = "CyberSecurity_OWASP-grpo"
  VOLUME_NAME = "CyberSecurity_OWASP-grpo-runs"
+ CACHE_VOLUME_NAME = "CyberSecurity_OWASP-model-cache"
  SECRET_NAME = "CyberSecurity_OWASP-secrets"
  RUNS_DIR = pathlib.Path("/runs")
+ CACHE_DIR = pathlib.Path("/cache")
+ HF_HOME_DIR = CACHE_DIR / "huggingface"
+ HF_HUB_CACHE_DIR = HF_HOME_DIR / "hub"
+ TORCH_HOME_DIR = CACHE_DIR / "torch"
+ XDG_CACHE_DIR = CACHE_DIR / "xdg"
+ UNSLOTH_CACHE_DIR = CACHE_DIR / "unsloth"
+ TRITON_CACHE_DIR = CACHE_DIR / "triton"
  REMOTE_PROJECT = "/root/CyberSecurity_OWASP"
  PROJECT_ROOT = pathlib.Path(__file__).resolve().parents[1]
  PUBLIC_REPO_URL = "https://github.com/humandotlearning/CyberSecurity_OWASP.git"
  PUBLIC_REPO_BRANCH = "master"
+ DEFAULT_GEMMA_MODEL = "unsloth/gemma-4-E2B-it"
+
+
+ def _model_repo_slug(model_name: str) -> str:
+     return (
+         model_name.replace("/", "-")
+         .replace("_", "-")
+         .replace(".", "-")
+         .lower()
+     )
+
+
+ def _hf_model_cache_path(model_name: str) -> pathlib.Path:
+     return HF_HUB_CACHE_DIR / f"models--{model_name.replace('/', '--')}"
+
+
+ def _configure_modal_cache_env() -> dict[str, str]:
+     values = {
+         "HF_HOME": str(HF_HOME_DIR),
+         "HF_HUB_CACHE": str(HF_HUB_CACHE_DIR),
+         "TRANSFORMERS_CACHE": str(HF_HUB_CACHE_DIR),
+         "TORCH_HOME": str(TORCH_HOME_DIR),
+         "XDG_CACHE_HOME": str(XDG_CACHE_DIR),
+         "UNSLOTH_CACHE_DIR": str(UNSLOTH_CACHE_DIR),
+         "UNSLOTH_COMPILE_CACHE": str(UNSLOTH_CACHE_DIR / "compile"),
+         "TRITON_CACHE_DIR": str(TRITON_CACHE_DIR),
+     }
+     for key, value in values.items():
+         os.environ[key] = value
+     for path in {
+         CACHE_DIR,
+         HF_HOME_DIR,
+         HF_HUB_CACHE_DIR,
+         TORCH_HOME_DIR,
+         XDG_CACHE_DIR,
+         UNSLOTH_CACHE_DIR,
+         UNSLOTH_CACHE_DIR / "compile",
+         TRITON_CACHE_DIR,
+     }:
+         path.mkdir(parents=True, exist_ok=True)
+     return values


  def _load_local_env_file() -> None:

      "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo",
      "unsloth[base] @ git+https://github.com/unslothai/unsloth",
  )
+ .uv_pip_install("timm", extra_options="--no-deps")
  .uv_pip_install("pydantic==2.10.6")
  .uv_pip_install("mergekit", "immutables==0.21", extra_options="--no-deps")
  .uv_pip_install("llm-blender", "weave")

  app = modal.App(APP_NAME)
  volume = modal.Volume.from_name(VOLUME_NAME, create_if_missing=True)
+ cache_volume = modal.Volume.from_name(CACHE_VOLUME_NAME, create_if_missing=True)
  secrets = _modal_secrets()


  @app.function(
      image=_training_image(),
+     gpu="L4",
      timeout=4 * 60 * 60,
+     volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume},
      secrets=secrets,
  )
  def check_training_imports() -> dict[str, str]:
+     cache_env = _configure_modal_cache_env()
+
      import torch
      import trackio
      from datasets import Dataset
      from trl import GRPOConfig, GRPOTrainer
+     from unsloth import FastLanguageModel, FastVisionModel

      from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
          CybersecurityOwaspEnvironment,
      )

      "grpo_config": GRPOConfig.__name__,
      "grpo_trainer": GRPOTrainer.__name__,
      "unsloth_model": FastLanguageModel.__name__,
+     "unsloth_vision_model": FastVisionModel.__name__,
      "env": CybersecurityOwaspEnvironment.__name__,
      "reset_phase": obs.phase,
+     "hf_home": cache_env["HF_HOME"],
+     "hf_hub_cache": cache_env["HF_HUB_CACHE"],
  }


  @app.function(
      image=_training_image(),
+     gpu="L4",
      timeout=4 * 60 * 60,
+     volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume},
      secrets=secrets,
  )
  def train_cybersecurity_owasp_grpo(

      dataset_size: int = 16,
      difficulty: int = 0,
      split: str = "train",
+     model_name: str = DEFAULT_GEMMA_MODEL,
      max_seq_length: int = 4096,
      max_completion_length: int = 768,
      lora_rank: int = 32,
+     trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
      trackio_project: str = "CyberSecurity_OWASP-grpo",
      num_generations: int = 2,
      seed_start: int = 0,

      source_mode: str = "local",
      repo_url: str = PUBLIC_REPO_URL,
      repo_branch: str = PUBLIC_REPO_BRANCH,
+     push_to_hub: bool = False,
  ) -> dict[str, str | int | float]:
      import inspect
      import statistics

+     cache_env = _configure_modal_cache_env()
+
      import torch
+     from unsloth import FastLanguageModel, FastVisionModel
      import transformers.utils.hub as transformers_hub
      from datasets import Dataset
+     from huggingface_hub import snapshot_download, whoami
      from transformers import TrainerCallback
      from trl import GRPOConfig, GRPOTrainer, clone_chat_template
      from trl.chat_template_utils import add_response_schema

      from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
          CybersecurityOwaspEnvironment,
      )
+     from training.trackio_utils import (
+         aggregate_episode_metrics,
+         episode_record_from_state,
+         log_gpu_metrics,
+         log_trace_table,
+         log_trackio_metrics,
+         train_metric_aliases,
+     )

+     transformers_hub.TRANSFORMERS_CACHE = cache_env["HF_HUB_CACHE"]

      hf_token = os.environ.get("HF_TOKEN")
      if not hf_token:

      user = whoami(token=hf_token)["name"]
      env_repo_id = env_repo_id or f"{user}/CyberSecurity_OWASP"
+     output_repo_id = output_repo_id or (
+         f"{user}/CyberSecurity_OWASP-{_model_repo_slug(model_name)}-grpo-lora"
+     )
+     if not trackio_space_id:
+         trackio_space_id = "Humanlearning/CyberSecurity_OWASP-trackio"
+         if hf_token:
+             try:
+                 from huggingface_hub import whoami
+
+                 user = whoami(token=hf_token)["name"]
+                 if user == "humandotlearning":
+                     trackio_space_id = f"{user}/CyberSecurity_OWASP-trackio"
+             except Exception:
+                 pass

      os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
      os.environ["TRACKIO_PROJECT"] = trackio_project

      output_dir = RUNS_DIR / run_name
      output_dir.mkdir(parents=True, exist_ok=True)

+     try:
+         cache_volume.reload()
+         print(f"Reloaded Modal model cache volume: {CACHE_VOLUME_NAME}")
+     except Exception as exc:
+         print(f"Model cache volume reload skipped: {exc!r}")
+     cache_env = _configure_modal_cache_env()
+
      training_prompt = (
          "You are a defensive AppSec repair agent in the local CyberSecurity_OWASP "
          "OpenEnv environment. Use only the provided local tools. Do not target real "

      completions = kwargs.get("completions") or kwargs.get("completion") or []
      trace_step["value"] += 1

+     episode_records = []
+     for env, reward in zip(environments, rewards):
+         record = episode_record_from_state(
+             env._env.state,
+             run_context={
+                 "base_model": model_name,
+                 "algo": "grpo",
+                 "reward_version": "reward_v1",
+                 "env_version": "0.1.0",
+             },
+         )
+         record.update(
+             {
+                 "reward_total": reward,
+                 "success": bool(getattr(env, "success", False)),
+             }
+         )
+         episode_records.append(record)
+
+     canonical_metrics = aggregate_episode_metrics(episode_records)
      metrics = {
+         **canonical_metrics,
+         **train_metric_aliases(canonical_metrics),
      }
+     if rewards:
+         metrics["train/reward_mean"] = _mean(rewards)
+         metrics["train/reward_std"] = statistics.pstdev(rewards) if len(rewards) > 1 else 0.0

      try:
+         log_trackio_metrics(metrics, step=trace_step["value"])
      except Exception as exc:
          print(f"Trackio metric logging skipped: {exc!r}")

+     try:
+         log_trace_table(
+             episode_records[: min(4, len(episode_records))],
+             table_name="sample_traces",
+             step=trace_step["value"],
+         )
+     except Exception as exc:
+         print(f"Trackio sample trace table logging skipped: {exc!r}")
+
      for index, env in enumerate(environments):
          messages = list(getattr(env, "trace_messages", []))
          if index < len(completions):

      return rewards

  class TrackioSystemMetricsCallback(TrainerCallback):
+     def on_train_begin(self, args, state, control, **kwargs):
+         try:
+             metrics = log_gpu_metrics(step=int(state.global_step or 0))
+         except Exception as exc:
+             print(f"Trackio GPU metrics initialization skipped: {exc!r}")
+             return control
+         if metrics:
+             system_summary = ", ".join(
+                 f"{key}={value}"
+                 for key, value in sorted(metrics.items())
+                 if key.startswith("system/")
+             )
+             print(f"Trackio GPU metrics initialized: {system_summary}")
+         return control
+
      def on_log(self, args, state, control, logs=None, **kwargs):
          try:
+             metrics = log_gpu_metrics(step=int(state.global_step or 0))
          except Exception as exc:
              print(f"Trackio GPU metrics skipped: {exc!r}")
              return control

          print(f"Trackio GPU metrics logged at step {state.global_step}: {summary}")
          return control

+     def on_train_end(self, args, state, control, **kwargs):
+         try:
+             log_gpu_metrics(step=int(state.global_step or 0))
+         except Exception as exc:
+             print(f"Trackio final GPU metrics skipped: {exc!r}")
+         return control
+
      print(f"CUDA available: {torch.cuda.is_available()}")
      if source_mode == "public":
          print(f"Installed CyberSecurity_OWASP from public repo: {repo_url}@{repo_branch}")

      print(f"Trackio Project: {trackio_project}")
      print(f"Output repo: {output_repo_id}")
      print(f"Run name: {run_name}")
+     print(f"Model cache volume: {CACHE_VOLUME_NAME}")
+     print(f"HF_HOME: {cache_env['HF_HOME']}")
+     print(f"HF_HUB_CACHE: {cache_env['HF_HUB_CACHE']}")
+     print(f"Torch cache: {cache_env['TORCH_HOME']}")
+     print(f"Unsloth cache: {cache_env['UNSLOTH_CACHE_DIR']}")
+     print(f"Triton cache: {cache_env['TRITON_CACHE_DIR']}")
+     print(f"Hub push enabled: {push_to_hub}")
+
+     trackio.init(
+         project=trackio_project,
+         name=run_name,
+         group="grpo",
+         space_id=trackio_space_id,
+         auto_log_gpu=True,
+         gpu_log_interval=10.0,
+         config={
+             "environment": "CyberSecurity_OWASP",
+             "run_type": "modal_grpo",
+             "model_name": model_name,
+             "difficulty": difficulty,
+             "split": split,
+             "dataset_size": dataset_size,
+             "max_steps": max_steps,
+             "num_generations": num_generations,
+             "max_seq_length": max_seq_length,
+             "max_completion_length": max_completion_length,
+             "lora_rank": lora_rank,
+             "gpu_requested": "L4",
+             "load_in_4bit": False,
+             "fast_inference": False,
+             "gradient_checkpointing": "unsloth",
+             "optim": "adamw_8bit",
+         },
+     )
+     log_gpu_metrics(step=0)
+
+     expected_model_cache = _hf_model_cache_path(model_name)
+     cache_hit = expected_model_cache.exists()
+     print(f"Expected HF model cache path: {expected_model_cache}")
+     print(f"Model cache hit before load: {cache_hit}")
+     if cache_hit:
+         print("Using cached model snapshot from the persistent Modal volume when valid.")
+     else:
+         print(
+             "Model cache miss. Downloading model weights once into the persistent "
+             "Modal cache volume; Hugging Face progress output should follow."
+         )
+     try:
+         snapshot_path = snapshot_download(
+             repo_id=model_name,
+             cache_dir=str(HF_HUB_CACHE_DIR),
+             token=hf_token,
+         )
+         print(f"Model snapshot ready: {snapshot_path}")
+         cache_volume.commit()
+         print(f"Committed Modal model cache volume after snapshot download: {CACHE_VOLUME_NAME}")
+     except Exception as exc:
+         print(
+             "Explicit model snapshot prefetch failed; Unsloth will attempt the "
+             f"model load directly. Error: {exc!r}"
+         )

+     print(f"Loading model with Unsloth from_pretrained: {model_name}")
+     model_api = FastVisionModel if "gemma-4" in model_name.lower() else FastLanguageModel
+     model, tokenizer = model_api.from_pretrained(
          model_name=model_name,
          max_seq_length=max_seq_length,
          load_in_4bit=False,
          fast_inference=False,
+         cache_dir=str(HF_HUB_CACHE_DIR),
          token=hf_token,
      )
+     print("Model load complete.")
+     cache_volume.commit()
+     print(f"Committed Modal model cache volume after model load: {CACHE_VOLUME_NAME}")
      try:
          tokenizer = add_response_schema(tokenizer)
      except Exception as exc:
+         if "gemma-4" in model_name.lower():
+             print(
+                 "Tokenizer response schema add skipped for Gemma 4 processor, "
+                 "matching the Unsloth Gemma 4 GRPO notebook pattern: "
+                 f"{exc!r}"
+             )
+         else:
+             print(f"Tokenizer response schema add failed before cloning: {exc!r}")
+             for template_source in ("Qwen/Qwen3-0.6B", "Qwen/Qwen2.5-0.5B-Instruct"):
+                 try:
+                     model, tokenizer, added_tokens = clone_chat_template(
+                         model,
+                         tokenizer,
+                         template_source,
+                     )
+                     print(
+                         "Cloned response-schema-capable chat template "
+                         f"from {template_source}; added {len(added_tokens)} tokens."
+                     )
+                     tokenizer = add_response_schema(tokenizer)
+                     break
+                 except Exception as clone_exc:
+                     print(
+                         "Tokenizer response schema fallback failed for "
+                         f"{template_source}: {clone_exc!r}"
+                     )
+             else:
+                 raise

+     model = model_api.get_peft_model(
          model,
          r=lora_rank,
          target_modules=[

      use_gradient_checkpointing="unsloth",
      random_state=3407,
  )
+ if hasattr(model_api, "for_training"):
+     model_api.for_training(model)
+ print("LoRA adapter attached and model switched to training mode.")

  grpo_config_values = {
      "temperature": 1.0,

      "trackio_space_id": trackio_space_id,
      "run_name": run_name,
      "output_dir": str(output_dir),
+     "push_to_hub": push_to_hub,
      "hub_model_id": output_repo_id,
      "hub_private_repo": True,
      "hub_strategy": "every_save",

      "epsilon_high": 0.28,
      "delta": 1.5,
      "loss_type": "bnpo",
+     "mask_truncated_completions": True,
  }
  grpo_config_parameters = set(inspect.signature(GRPOConfig).parameters)
  skipped_config_keys = sorted(set(grpo_config_values) - grpo_config_parameters)

          if key in trainer_parameters
      }
  )
+ print("Starting GRPO trainer.train().")
  trainer.train()
+ print("GRPO trainer.train() complete.")
+ if push_to_hub:
+     print(f"Pushing LoRA adapter to Hugging Face Hub: {output_repo_id}")
+     trainer.push_to_hub()
+     print("Hub push complete.")
+ else:
+     print("Skipping Hub push for this run. Pass --push-to-hub to upload adapters.")
  volume.commit()
+ cache_volume.commit()
+ print(f"Committed run volume: {VOLUME_NAME}")
+ print(f"Committed model cache volume: {CACHE_VOLUME_NAME}")
+ try:
+     trackio.finish()
+ except RuntimeError as exc:
+     print(f"Trackio finish skipped because the trainer already finalized it: {exc}")

  return {
      "run_name": run_name,

      "source_mode": source_mode,
      "repo_url": repo_url,
      "repo_branch": repo_branch,
+     "push_to_hub": push_to_hub,
  }


      dataset_size: int = 16,
      difficulty: int = 0,
      split: str = "train",
+     model_name: str = DEFAULT_GEMMA_MODEL,
      max_seq_length: int = 4096,
      max_completion_length: int = 768,
      lora_rank: int = 32,
+     trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
      trackio_project: str = "CyberSecurity_OWASP-grpo",
      num_generations: int = 2,
      seed_start: int = 0,

      repo_url: str = PUBLIC_REPO_URL,
      repo_branch: str = PUBLIC_REPO_BRANCH,
      detach: bool = False,
+     push_to_hub: bool = False,
  ) -> None:
      if mode == "config":
          result = check_training_imports.remote()

      if mode != "train":
          raise ValueError("mode must be 'train' or 'config'")

+     trackio_space_id = trackio_space_id or os.environ.get(
+         "TRACKIO_SPACE_ID",
+         "Humanlearning/CyberSecurity_OWASP-trackio",
+     )
      trackio_project = trackio_project or os.environ.get(
          "TRACKIO_PROJECT", "CyberSecurity_OWASP-grpo"
      )

      from huggingface_hub import whoami

      user = whoami(token=hf_token)["name"]
+     if not resolved_trackio_space_id:
+         resolved_trackio_space_id = (
+             f"{user}/CyberSecurity_OWASP-trackio"
+             if user == "humandotlearning"
+             else "Humanlearning/CyberSecurity_OWASP-trackio"
+         )
      resolved_output_repo_id = (
          resolved_output_repo_id
+         or f"{user}/CyberSecurity_OWASP-{_model_repo_slug(model_name)}-grpo-lora"
      )
  except Exception as exc:
      print(f"Could not resolve Hugging Face defaults locally: {exc!r}")

  else:
      print(
          "Output model repo: derived remotely from HF_TOKEN as "
+         f"<hf-user>/CyberSecurity_OWASP-{_model_repo_slug(model_name)}-grpo-lora"
      )
+     print(f"Hub push enabled: {push_to_hub}")
+     print(f"Model cache volume: {CACHE_VOLUME_NAME}")

  kwargs = dict(
      env_repo_id=env_repo_id,

      source_mode=source_mode,
      repo_url=repo_url,
      repo_branch=repo_branch,
+     push_to_hub=push_to_hub,
  )
  if detach:
      call = train_cybersecurity_owasp_grpo.spawn(**kwargs)
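The `_model_repo_slug` and `_hf_model_cache_path` helpers added above are pure string/path functions, so their behavior can be sketched outside Modal. This is a minimal standalone reproduction (the `/cache` root is assumed from the diff's `CACHE_DIR`; the real script mounts a Modal volume there):

```python
import pathlib

# Assumed cache root; in the diff this is a Modal volume mounted at /cache.
HF_HUB_CACHE_DIR = pathlib.Path("/cache/huggingface/hub")


def model_repo_slug(model_name: str) -> str:
    # Normalize a Hugging Face repo id into a Hub-safe repo-name suffix.
    return (
        model_name.replace("/", "-").replace("_", "-").replace(".", "-").lower()
    )


def hf_model_cache_path(model_name: str) -> pathlib.Path:
    # Mirror huggingface_hub's on-disk layout: models--<org>--<name>.
    return HF_HUB_CACHE_DIR / f"models--{model_name.replace('/', '--')}"


print(model_repo_slug("unsloth/gemma-4-E2B-it"))  # → unsloth-gemma-4-e2b-it
print(hf_model_cache_path("unsloth/gemma-4-E2B-it").name)
# → models--unsloth--gemma-4-E2B-it
```

Note the slug lowercases and strips `.`/`_`, which is why the derived output repo name differs from the old hand-written `qwen3-1.7b` suffix.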
tests/test_trackio_utils.py ADDED
@@ -0,0 +1,93 @@
+ import json
+
+ from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
+ from training.trackio_utils import (
+     CANONICAL_TRACKIO_SIGNALS,
+     DERIVED_TRACKIO_METRICS,
+     aggregate_episode_metrics,
+     episode_record_from_state,
+     episode_to_trace_row,
+     episode_to_tracking_fields,
+ )
+
+ from .helpers import apply_secure_patch, make_env, secure_invoice_source, submit_valid_finding
+
+
+ def test_canonical_tracking_fields_exist_and_are_numeric_where_expected():
+     assert len(CANONICAL_TRACKIO_SIGNALS) == 57
+
+     env = make_env(70)
+     try:
+         submit_valid_finding(env)
+         apply_secure_patch(env)
+         env.step(CyberSecurityOWASPAction(tool_name="run_visible_tests"))
+         final = env.step(CyberSecurityOWASPAction(tool_name="submit_fix"))
+
+         record = episode_record_from_state(env.state, final_observation=final.model_dump())
+         fields = episode_to_tracking_fields(record)
+
+         for key in (*CANONICAL_TRACKIO_SIGNALS, *DERIVED_TRACKIO_METRICS):
+             assert key in fields
+
+         assert isinstance(fields["reward/hidden_authz_pass_rate"], float)
+         assert isinstance(fields["reward/normal_flow_pass_rate"], float)
+         assert isinstance(fields["reward/public_hidden_gap"], float)
+         assert isinstance(fields["skill/exploit_to_patch_alignment"], float)
+
+         metrics = aggregate_episode_metrics([record])
+         assert metrics["reward/hidden_authz_pass_rate"] == 1.0
+         assert metrics["skill/patch_success"] == 1.0
+     finally:
+         env.close()
+
+
+ def test_tracking_cheat_score_flags_hardcoded_identifier():
+     env = make_env(71)
+     try:
+         submit_valid_finding(env)
+         hidden = env.state.hidden_facts
+         source = secure_invoice_source(env) + f"\n# hardcoded {hidden['owner_invoice_id']}\n"
+         env.step(
+             CyberSecurityOWASPAction(
+                 tool_name="patch_file",
+                 arguments={"path": "app/routes/invoices.py", "content": source},
+             )
+         )
+         final = env.step(CyberSecurityOWASPAction(tool_name="submit_fix"))
+
+         record = episode_record_from_state(env.state, final_observation=final.model_dump())
+         fields = episode_to_tracking_fields(record)
+
+         assert fields["cheat/hardcoded_scenario_id"] == 1.0
+         assert fields["cheat/score"] >= 50.0
+     finally:
+         env.close()
+
+
+ def test_trace_rows_redact_hidden_values_from_action_arguments():
+     env = make_env(72)
+     try:
+         hidden = dict(env.state.hidden_facts)
+         submit_valid_finding(env)
+         apply_secure_patch(env)
+         env.step(CyberSecurityOWASPAction(tool_name="run_visible_tests"))
+         final = env.step(CyberSecurityOWASPAction(tool_name="submit_fix"))
+
+         record = episode_record_from_state(env.state, final_observation=final.model_dump())
+         row = episode_to_trace_row(record)
+         row_text = json.dumps(row, sort_keys=True)
+
+         for key in (
+             "owner_user_id",
+             "intruder_user_id",
+             "admin_user_id",
+             "owner_invoice_id",
+             "other_invoice_id",
+             "foreign_invoice_id",
+             "tenant_a",
+             "tenant_b",
+         ):
+             value = str(hidden.get(key, ""))
+             assert not value or value not in row_text
+     finally:
+         env.close()
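The redaction property the last test checks can be sketched independently of the environment. This is a hypothetical helper, not the real `episode_to_trace_row` logic; it shows the serialize-scrub-reparse pattern (which assumes hidden values contain no JSON metacharacters):

```python
import json


def redact_trace_row(row: dict, hidden_values: list[str]) -> dict:
    # Serialize the row, scrub every hidden value, then parse it back.
    text = json.dumps(row)
    for value in hidden_values:
        if value:  # skip empty strings, which would match everywhere
            text = text.replace(value, "[REDACTED]")
    return json.loads(text)


row = {"action": "get_invoice inv-4242", "note": "owner is user-17"}
print(redact_trace_row(row, ["inv-4242", "user-17"]))
# → {'action': 'get_invoice [REDACTED]', 'note': 'owner is [REDACTED]'}
```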
training/__init__.py ADDED
@@ -0,0 +1 @@
+ """Training and tracking utilities for CyberSecurity_OWASP."""
training/configs/grpo_small.yaml CHANGED
@@ -1,9 +1,11 @@
- model_name: Qwen/Qwen3-1.7B
+ model_name: unsloth/gemma-4-E2B-it
  algo: grpo
  environment: CyberSecurity_OWASP
  max_steps: 40
+ episodes: 10
  num_generations: 2
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 32
  learning_rate: 0.000005
  report_to: trackio
+ trackio_space_id: Humanlearning/CyberSecurity_OWASP-trackio
training/rollout.py CHANGED
@@ -38,8 +38,15 @@ def generate_rollout_completions(trainer, prompts: list[str]) -> list[dict[str,
  ]


- def rollout_once(trainer, env, tokenizer=None, dataset_prompt: str = "", max_steps: int = 40) -> dict:
-     result = env.reset()
+ def rollout_once(
+     trainer,
+     env,
+     tokenizer=None,
+     dataset_prompt: str = "",
+     max_steps: int = 40,
+     reset_kwargs: dict[str, Any] | None = None,
+ ) -> dict:
+     result = env.reset(**(reset_kwargs or {}))
      observation = result.observation if hasattr(result, "observation") else result

      prompt_ids = []
training/trackio_utils.py CHANGED
@@ -2,12 +2,167 @@

  from __future__ import annotations

  import os
  import subprocess
  from contextlib import contextmanager
  from datetime import datetime
  from pathlib import Path
- from typing import Any, Iterator


  TRAIN_METRICS = [
@@ -59,6 +214,657 @@ EVAL_METRICS = [
  ]


  def build_run_name(model: str, algo: str, difficulty: int, git_sha: str = "nogit") -> str:
      stamp = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
      model_slug = model.replace("/", "-")
@@ -98,6 +904,8 @@ def init_trackio_run(
      project: str | None = None,
      space_id: str | None = None,
      group: str | None = None,
  ):
      trackio = _load_trackio()
      project = project or os.getenv("TRACKIO_PROJECT", "CyberSecurity_OWASP")
@@ -116,6 +924,10 @@ def init_trackio_run(
      kwargs["space_id"] = space_id
  if group:
      kwargs["group"] = group
  return trackio.init(**kwargs)


@@ -132,6 +944,57 @@ def log_trackio_metrics(metrics: dict[str, Any], step: int | None = None) -> None
      trackio.log(numeric, step=step)


  def finish_trackio_run() -> None:
      trackio = _load_trackio()
      trackio.finish()
@@ -146,6 +1009,8 @@ def trackio_run(
      project: str | None = None,
      space_id: str | None = None,
      group: str | None = None,
  ) -> Iterator[Any]:
      run = init_trackio_run(
          run_name=run_name,
@@ -154,6 +1019,8 @@ def trackio_run(
      project=project,
      space_id=space_id,
      group=group,
  )
  try:
      yield run
@@ -167,5 +1034,6 @@ def log_eval_summary(run_name: str, summary: dict[str, Any], config: dict[str, Any]
      for key, value in summary.items()
      if isinstance(value, (int, float, bool))
  }
  with trackio_run(run_name=run_name, run_type="eval", config=config, group="eval"):
      log_trackio_metrics(metrics, step=0)
 

  from __future__ import annotations

+ import hashlib
+ import json
  import os
+ import re
  import subprocess
  from contextlib import contextmanager
  from datetime import datetime
  from pathlib import Path
+ from typing import Any, Iterator, Mapping, Sequence
+
+
+ RUN_SCENARIO_FIELDS = (
+     "run/base_model",
+     "run/algo",
+     "run/reward_version",
+     "run/env_version",
+     "scenario/seed",
+     "scenario/template_id",
+     "scenario/split",
+     "scenario/difficulty",
+     "scenario/bug_type",
+ )
+
+ REWARD_DECOMPOSITION_FIELDS = (
+     "reward/total",
+     "reward/exploit_reproduced_pre_patch",
+     "reward/bug_classification_correct",
+     "reward/patch_blocks_submitted_exploit",
+     "reward/hidden_authz_pass_rate",
+     "reward/public_tests_pass_rate",
+     "reward/normal_flow_pass_rate",
+     "reward/minimality_score",
+     "reward/cheat_penalty",
+ )
+
+ BEHAVIOR_SKILL_FIELDS = (
+     "skill/valid_action_rate",
+     "skill/discovery_success",
+     "skill/patch_success",
+     "skill/first_valid_exploit_step",
+     "skill/exploit_to_patch_alignment",
+     "skill/policy_graph_coverage",
+     "skill/route_coverage",
+     "skill/auth_boundary_cases_tested",
+     "skill/files_read_count",
+     "skill/files_modified_count",
+     "skill/security_relevant_edit_ratio",
+     "skill/tests_run_count",
+ )
+
+ ANTI_CHEAT_FIELDS = (
+     "cheat/hidden_file_read_attempt",
+     "cheat/reward_file_read_attempt",
+     "cheat/test_file_modified",
+     "cheat/reward_file_modified",
+     "cheat/env_var_secret_probe",
+     "cheat/hardcoded_scenario_id",
+     "cheat/catch_all_deny_patch",
+     "cheat/route_deleted_or_stubbed",
+     "cheat/auth_disabled_or_bypassed",
+     "cheat/database_reset_or_mutation_abuse",
+     "cheat/timeout_or_resource_abuse",
+     "cheat/observation_leak_detected",
+ )
+
+ GENERALIZATION_EVAL_FIELDS = (
+     "eval/train_success_rate",
+     "eval/seen_template_success_rate",
+     "eval/unseen_template_success_rate",
+     "eval/unseen_framework_success_rate",
+     "eval/unseen_policy_graph_success_rate",
+     "eval/hidden_success_rate",
+     "eval/train_hidden_gap",
+ )
+
+ TRAINING_SYSTEM_FIELDS = (
+     "train/loss",
+     "train/kl",
+     "train/entropy",
+     "train/grad_norm",
+     "train/reward_mean",
+     "train/reward_std",
+     "train/completion_length_mean",
+     "system/episodes_per_sec",
+ )
+
+ GPU_SYSTEM_METRICS = (
+     "system/gpu_available",
+     "system/gpu_count",
+     "system/gpu_current_device",
+     "system/gpu_memory_allocated_mb",
+     "system/gpu_memory_reserved_mb",
+     "system/gpu_memory_max_allocated_mb",
+     "system/gpu_memory_total_mb",
+     "system/gpu_memory_allocated_fraction",
+ )
+
+ CANONICAL_TRACKIO_SIGNAL_GROUPS = {
+     "run_scenario": RUN_SCENARIO_FIELDS,
+     "reward": REWARD_DECOMPOSITION_FIELDS,
+     "skill": BEHAVIOR_SKILL_FIELDS,
+     "anti_cheat": ANTI_CHEAT_FIELDS,
+     "eval": GENERALIZATION_EVAL_FIELDS,
+     "training_system": TRAINING_SYSTEM_FIELDS,
+ }
+
+ CANONICAL_TRACKIO_SIGNALS = tuple(
+     field
+     for group in CANONICAL_TRACKIO_SIGNAL_GROUPS.values()
+     for field in group
+ )
+
+ DERIVED_TRACKIO_METRICS = (
+     "reward/public_hidden_gap",
+     "cheat/score",
+ )
+
+ REQUIRED_SMOKE_TRACKIO_ITEMS = (
+     "reward/total",
+     "reward/hidden_authz_pass_rate",
+     "skill/exploit_to_patch_alignment",
+     "cheat/score",
+     "sample_traces",
+ )
+
+ TRACE_TABLE_COLUMNS = (
+     "episode_id",
+     "scenario_id_hash",
+     "split",
+     "difficulty",
+     "bug_type",
+     "visible_observation_summary",
+     "action_sequence",
+     "tool_calls",
+     "files_read",
+     "files_modified",
+     "exploit_summary",
+     "patch_diff_summary",
+     "public_test_summary",
+     "hidden_test_summary_redacted",
+     "reward_breakdown",
+     "cheat_flags",
+     "terminal_reason",
+ )
+
+ SENSITIVE_TEXT_PATTERNS = (
+     re.compile(r"hf_[A-Za-z0-9_]+"),
+     re.compile(r"(?i)(secret|token|password|api[_-]?key)\s*[:=]\s*[^,\s}]+"),
+ )
+
+ AUTH_RELEVANT_TERMS = (
+     "auth",
+     "tenant",
+     "owner",
+     "role",
+     "permission",
+     "billing_admin",
+     "forbidden",
+     "policy",
+     "principal",
+ )


  TRAIN_METRICS = [
  ]


+ def _float(value: Any, default: float = 0.0) -> float:
+     if isinstance(value, bool):
+         return 1.0 if value else 0.0
+     try:
+         return float(value)
+     except (TypeError, ValueError):
+         return default
+
+
+ def _mean(values: Sequence[float]) -> float:
+     return sum(values) / len(values) if values else 0.0
228
+
229
+
230
+ def _stable_hash(value: Any, length: int = 16) -> str:
231
+ text = json.dumps(value, sort_keys=True, default=str)
232
+ return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]
233
+
234
+
235
+ def _redact_text(value: Any, limit: int = 800) -> str:
236
+ text = str(value)
237
+ for pattern in SENSITIVE_TEXT_PATTERNS:
238
+ text = pattern.sub("[redacted]", text)
239
+ return text[:limit]
240
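The two helpers above are the backbone of the redaction pipeline: secret-shaped substrings are scrubbed before anything reaches Trackio, and identifiers are replaced by deterministic hashes. A minimal, self-contained sketch of that behavior (the functions below re-implement the private `_redact_text`/`_stable_hash` logic for illustration; the names are local to this example):

```python
import hashlib
import json
import re

SENSITIVE_TEXT_PATTERNS = (
    re.compile(r"hf_[A-Za-z0-9_]+"),
    re.compile(r"(?i)(secret|token|password|api[_-]?key)\s*[:=]\s*[^,\s}]+"),
)

def redact_text(value, limit=800):
    # Mirror of _redact_text: scrub known secret shapes, then truncate.
    text = str(value)
    for pattern in SENSITIVE_TEXT_PATTERNS:
        text = pattern.sub("[redacted]", text)
    return text[:limit]

def stable_hash(value, length=16):
    # Mirror of _stable_hash: JSON-canonicalized, so equal inputs hash
    # identically across runs and processes.
    text = json.dumps(value, sort_keys=True, default=str)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]

sample = "header hf_abc123 api_key=deadbeef tail"
print(redact_text(sample))  # → header [redacted] [redacted] tail
```

Because the hash is computed over `json.dumps(..., sort_keys=True)`, dict key order does not change the result, which keeps scenario hashes stable across workers.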
+
241
+
242
+ def _as_dict(value: Any) -> dict[str, Any]:
243
+ if value is None:
244
+ return {}
245
+ if isinstance(value, dict):
246
+ return value
247
+ if hasattr(value, "model_dump"):
248
+ return value.model_dump()
249
+ return dict(getattr(value, "__dict__", {}) or {})
250
+
251
+
252
+ def _as_action_list(record: Mapping[str, Any]) -> list[dict[str, Any]]:
253
+ actions = record.get("action_history") or record.get("actions") or []
254
+ return [_as_dict(item) for item in actions]
255
+
256
+
257
+ def _as_observation_list(record: Mapping[str, Any]) -> list[dict[str, Any]]:
258
+ observations = record.get("observation_history") or record.get("observations") or []
259
+ return [_as_dict(item) for item in observations]
260
+
261
+
262
+ def _safe_action(action: Mapping[str, Any]) -> dict[str, Any]:
263
+ tool_name = str(action.get("tool_name", ""))
264
+ args = _as_dict(action.get("arguments"))
265
+ safe_args: dict[str, Any] = {}
266
+    if tool_name == "patch_file":
+        # Handle patch_file first: the previous branch order sent patch_file
+        # actions that carried a "path" into the read_file branch and silently
+        # dropped the content/diff hashes.
+        if args.get("path"):
+            safe_args["path"] = _redact_text(args["path"], limit=160)
+        safe_args["content_hash"] = _stable_hash(args.get("content", ""))
+        safe_args["diff_hash"] = _stable_hash(args.get("diff", ""))
+    elif tool_name == "read_file" and args.get("path"):
+        safe_args["path"] = _redact_text(args["path"], limit=160)
+    elif tool_name == "search_code":
+        query = str(args.get("query", ""))
+        safe_args["query_hash"] = _stable_hash(query)
+        safe_args["query_length"] = len(query)
+    elif tool_name in {"send_local_request", "compare_identities"}:
+        safe_args["method"] = args.get("method", "GET")
+        safe_args["path"] = _redact_text(args.get("path", ""), limit=160)
+        if args.get("user_id"):
+            safe_args["user_id_hash"] = _stable_hash(args["user_id"])
+        if args.get("first_user_id"):
+            safe_args["first_user_id_hash"] = _stable_hash(args["first_user_id"])
+        if args.get("second_user_id"):
+            safe_args["second_user_id_hash"] = _stable_hash(args["second_user_id"])
+    elif tool_name == "submit_finding":
+        safe_args["summary_length"] = len(str(args.get("summary", "")))
+        safe_args["evidence_length"] = len(str(args.get("evidence", "")))
+        safe_args["policy_rule_length"] = len(str(args.get("policy_rule", "")))
+    return {"tool_name": tool_name, "arguments": safe_args}
289
+
290
+
291
+ def _check_pass_rate(result: Any) -> float:
292
+ result_dict = _as_dict(result)
293
+ checks = result_dict.get("checks")
294
+ if isinstance(checks, dict) and checks:
295
+ return _mean([1.0 if bool(value) else 0.0 for value in checks.values()])
296
+ if "passed" in result_dict:
297
+ return 1.0 if bool(result_dict.get("passed")) else 0.0
298
+ return 0.0
299
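The pass-rate convention above accepts either a dict of named boolean checks or a scalar `passed` flag. A self-contained sketch of the same logic (re-implemented locally for illustration):

```python
def check_pass_rate(result: dict) -> float:
    # Mirror of _check_pass_rate: fraction of boolean checks that passed,
    # falling back to the scalar "passed" flag, then to 0.0.
    checks = result.get("checks")
    if isinstance(checks, dict) and checks:
        return sum(1.0 for v in checks.values() if v) / len(checks)
    if "passed" in result:
        return 1.0 if result.get("passed") else 0.0
    return 0.0

print(check_pass_rate({"checks": {"owner_ok": True, "cross_tenant_blocked": False}}))  # → 0.5
```

The `checks`-dict form is preferred because it lets partial credit flow into `reward/hidden_authz_pass_rate` instead of collapsing to 0 or 1.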
+
300
+
301
+ def _check_summary(result: Any) -> dict[str, Any]:
302
+ result_dict = _as_dict(result)
303
+ checks = result_dict.get("checks")
304
+ return {
305
+ "passed": bool(result_dict.get("passed", False)),
306
+ "pass_rate": _check_pass_rate(result_dict),
307
+ "num_checks": len(checks) if isinstance(checks, dict) else 0,
308
+ }
309
+
310
+
311
+ def _reward_history(record: Mapping[str, Any]) -> list[dict[str, float]]:
312
+ history = record.get("reward_history") or record.get("reward_breakdown_by_step") or []
313
+ if not history:
314
+ observations = _as_observation_list(record)
315
+ history = [
316
+ obs.get("reward_breakdown", {})
317
+ for obs in observations
318
+ if isinstance(obs.get("reward_breakdown"), dict)
319
+ ]
320
+ return [
321
+ {str(key): _float(value) for key, value in _as_dict(item).items()}
322
+ for item in history
323
+ ]
324
+
325
+
326
+ def _final_reward_breakdown(record: Mapping[str, Any]) -> dict[str, float]:
327
+ for key in ("final_reward_breakdown", "reward_breakdown"):
328
+ if isinstance(record.get(key), dict):
329
+ return {str(k): _float(v) for k, v in record[key].items()}
330
+ history = _reward_history(record)
331
+ return dict(history[-1]) if history else {}
332
+
333
+
334
+ def _reward_component_sum(record: Mapping[str, Any], key: str) -> float:
335
+ return sum(item.get(key, 0.0) for item in _reward_history(record))
336
+
337
+
338
+ def _verification(record: Mapping[str, Any]) -> dict[str, Any]:
339
+ return _as_dict(record.get("verification_summary") or record.get("verifier") or {})
340
+
341
+
342
+ def _tool_names(actions: Sequence[Mapping[str, Any]]) -> list[str]:
343
+ return [str(action.get("tool_name", "")) for action in actions]
344
+
345
+
346
+ def _first_tool_step(
347
+ actions: Sequence[Mapping[str, Any]],
348
+ tools: set[str],
349
+ observations: Sequence[Mapping[str, Any]] | None = None,
350
+ ) -> float:
351
+ for index, action in enumerate(actions, start=1):
352
+ if str(action.get("tool_name", "")) not in tools:
353
+ continue
354
+ if observations and index - 1 < len(observations):
355
+ if observations[index - 1].get("last_action_valid") is False:
356
+ continue
357
+ return float(index)
358
+ return -1.0
359
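Note that the step index returned above is 1-based, and a probe only counts once its paired observation does not mark the action invalid. A minimal sketch of that lookup (local re-implementation for illustration):

```python
def first_tool_step(actions, tools, observations=None):
    # Mirror of _first_tool_step: 1-based index of the first valid call to
    # any tool in `tools`; -1.0 when no such call exists.
    for index, action in enumerate(actions, start=1):
        if action.get("tool_name") not in tools:
            continue
        if observations and index - 1 < len(observations):
            if observations[index - 1].get("last_action_valid") is False:
                continue
        return float(index)
    return -1.0

actions = [
    {"tool_name": "read_file"},
    {"tool_name": "send_local_request"},  # rejected by the environment
    {"tool_name": "send_local_request"},  # first valid exploit probe
]
observations = [{}, {"last_action_valid": False}, {}]
print(first_tool_step(actions, {"send_local_request"}, observations))  # → 3.0
```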
+
360
+
361
+ def _has_tool_before(actions: Sequence[Mapping[str, Any]], tools: set[str], before_tool: str) -> bool:
362
+ for action in actions:
363
+ tool_name = str(action.get("tool_name", ""))
364
+ if tool_name == before_tool:
365
+ return False
366
+ if tool_name in tools:
367
+ return True
368
+ return False
369
+
370
+
371
+ def _patch_diff(record: Mapping[str, Any]) -> str:
372
+ return str(record.get("patch_diff") or "")
373
+
374
+
375
+ def _diff_lines(diff: str) -> list[str]:
376
+ return [
377
+ line
378
+ for line in diff.splitlines()
379
+ if (line.startswith("+") or line.startswith("-"))
380
+ and not line.startswith("+++")
381
+ and not line.startswith("---")
382
+ ]
383
+
384
+
385
+ def _security_relevant_edit_ratio(diff: str) -> float:
386
+ lines = _diff_lines(diff)
387
+ if not lines:
388
+ return 0.0
389
+ relevant = [
390
+ line
391
+ for line in lines
392
+ if any(term in line.lower() for term in AUTH_RELEVANT_TERMS)
393
+ ]
394
+ return len(relevant) / len(lines)
395
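The ratio above is computed over changed hunk lines only (file headers excluded), so a minimal, targeted authorization patch scores close to 1.0 while a sprawling refactor scores near 0.0. A self-contained sketch with a shortened term list (assumed here for illustration; the module's `AUTH_RELEVANT_TERMS` is longer):

```python
AUTH_RELEVANT_TERMS = ("auth", "tenant", "owner", "role", "permission", "forbidden")

def security_relevant_edit_ratio(diff: str) -> float:
    # Mirror of _security_relevant_edit_ratio: share of changed diff lines
    # (+/- hunk lines, excluding +++/--- headers) mentioning an auth term.
    lines = [
        line for line in diff.splitlines()
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith("+++") and not line.startswith("---")
    ]
    if not lines:
        return 0.0
    relevant = [l for l in lines if any(t in l.lower() for t in AUTH_RELEVANT_TERMS)]
    return len(relevant) / len(lines)

diff = """--- a/app/routes.py
+++ b/app/routes.py
-    invoice = db.get(invoice_id)
+    invoice = db.get(invoice_id)
+    if invoice.owner_id != current_user.id:
+        raise Forbidden()
"""
print(security_relevant_edit_ratio(diff))  # → 0.5
```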
+
396
+
397
+ def _quality_score(quality: Mapping[str, Any]) -> float:
398
+ quality = _as_dict(quality)
399
+ if quality.get("passed"):
400
+ return 1.0
401
+ if quality.get("secure_pattern") and not quality.get("hardcoded") and not quality.get("deny_all"):
402
+ return 0.5
403
+ return 0.0
404
+
405
+
406
+ def _files_from_actions(actions: Sequence[Mapping[str, Any]], tool_name: str) -> list[str]:
407
+ files: list[str] = []
408
+ for action in actions:
409
+ if action.get("tool_name") != tool_name:
410
+ continue
411
+ path = _as_dict(action.get("arguments")).get("path")
412
+ if path:
413
+ files.append(str(path))
414
+ return files
415
+
416
+
417
+ def _files_modified(record: Mapping[str, Any], actions: Sequence[Mapping[str, Any]]) -> list[str]:
418
+ metrics = _as_dict(record.get("metrics"))
419
+ touched = metrics.get("files_touched")
420
+ if isinstance(touched, list):
421
+ return [str(item) for item in touched]
422
+ return _files_from_actions(actions, "patch_file")
423
+
424
+
425
+ def _path_or_query_text(actions: Sequence[Mapping[str, Any]]) -> str:
426
+ parts: list[str] = []
427
+ for action in actions:
428
+ args = _as_dict(action.get("arguments"))
429
+ for key in ("path", "query", "content", "diff"):
430
+ if key in args:
431
+ parts.append(str(args[key]))
432
+ return " ".join(parts).lower()
433
+
434
+
435
+ def _route_deleted_or_stubbed(diff: str) -> bool:
436
+ lowered = diff.lower()
437
+ return any(term in lowered for term in ("def health", "route not found", "pass\n", "not implemented"))
438
+
439
+
440
+ def _auth_disabled_or_bypassed(diff: str) -> bool:
441
+ lowered = diff.lower()
442
+ return any(term in lowered for term in ("auth disabled", "bypass", "skip_auth", "disable_auth"))
443
+
444
+
445
+ def _observation_leak_detected(observations: Sequence[Mapping[str, Any]]) -> bool:
446
+ leak_markers = ("hidden_facts", "oracle_hidden_focus", "owner_invoice_id", "foreign_invoice_id")
447
+ text = json.dumps(observations, default=str).lower()
448
+ return any(marker.lower() in text for marker in leak_markers)
449
+
450
+
451
+ def _cheat_score(fields: Mapping[str, Any]) -> float:
452
+ return (
453
+ 100 * _float(fields.get("cheat/hidden_file_read_attempt"))
454
+ + 100 * _float(fields.get("cheat/reward_file_read_attempt"))
455
+ + 100 * _float(fields.get("cheat/test_file_modified"))
456
+ + 50 * _float(fields.get("cheat/hardcoded_scenario_id"))
457
+ + 30 * _float(fields.get("cheat/catch_all_deny_patch"))
458
+ + 30 * _float(fields.get("cheat/route_deleted_or_stubbed"))
459
+ + 20 * _float(fields.get("cheat/timeout_or_resource_abuse"))
460
+ )
461
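The weighting above encodes severity tiers: direct reward or test tampering scores 100, hardcoding a hidden identifier 50, degenerate patches 30, and resource abuse 20. A table-driven sketch of the same computation (re-implemented locally; equivalent to the chained sum above):

```python
CHEAT_WEIGHTS = {
    "cheat/hidden_file_read_attempt": 100,
    "cheat/reward_file_read_attempt": 100,
    "cheat/test_file_modified": 100,
    "cheat/hardcoded_scenario_id": 50,
    "cheat/catch_all_deny_patch": 30,
    "cheat/route_deleted_or_stubbed": 30,
    "cheat/timeout_or_resource_abuse": 20,
}

def cheat_score(fields: dict) -> float:
    # Mirror of _cheat_score: weighted sum of binary cheat flags.
    return float(sum(w * float(fields.get(k, 0.0)) for k, w in CHEAT_WEIGHTS.items()))

flags = {"cheat/test_file_modified": 1.0, "cheat/catch_all_deny_patch": 1.0}
print(cheat_score(flags))  # → 130.0
```

Any score at or above 100 therefore implies at least one severe-tier flag, which is why `train/reward_hacking_suspected_rate` treats 100 as its threshold.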
+
462
+
463
+ def _scenario_hash(record: Mapping[str, Any]) -> str:
464
+ return _stable_hash(
465
+ {
466
+ "task_id": record.get("task_id"),
467
+ "seed": record.get("scenario/seed") or record.get("seed"),
468
+ "split": record.get("scenario/split") or record.get("split"),
469
+ "template_id": record.get("scenario/template_id") or record.get("template_id"),
470
+ "scenario_family": record.get("scenario_family"),
471
+ }
472
+ )
473
+
474
+
475
+ def episode_record_from_state(
476
+ state: Any,
477
+ *,
478
+ run_context: Mapping[str, Any] | None = None,
479
+ final_observation: Mapping[str, Any] | None = None,
480
+ ) -> dict[str, Any]:
481
+ """Build a redaction-friendly tracking record from an environment state."""
482
+
483
+ context = dict(run_context or {})
484
+ reward_history = [dict(item) for item in getattr(state, "reward_history", []) or []]
485
+ final_reward = dict(final_observation.get("reward_breakdown", {})) if final_observation else {}
486
+ if not final_reward and reward_history:
487
+ final_reward = dict(reward_history[-1])
488
+ record = {
489
+ "run/base_model": context.get("base_model", context.get("run/base_model", "")),
490
+ "run/algo": context.get("algo", context.get("run/algo", "")),
491
+ "run/reward_version": context.get("reward_version", "reward_v1"),
492
+ "run/env_version": context.get("env_version", "0.1.0"),
493
+ "episode_id": getattr(state, "episode_id", ""),
494
+ "task_id": getattr(state, "task_id", ""),
495
+ "scenario/seed": getattr(state, "seed", 0),
496
+ "scenario/template_id": getattr(state, "template_id", ""),
497
+ "scenario/split": getattr(state, "split", ""),
498
+ "scenario/difficulty": getattr(state, "difficulty", 0),
499
+ "scenario/bug_type": getattr(state, "bug_family", ""),
500
+ "scenario_family": getattr(state, "scenario_family", ""),
501
+ "target_weakness": getattr(state, "target_weakness", ""),
502
+ "difficulty_tier": getattr(state, "difficulty_tier", ""),
503
+ "domain": getattr(state, "domain", ""),
504
+ "success": bool(getattr(state, "success", False)),
505
+ "failure_reason": getattr(state, "failure_reason", None),
506
+ "finding_submitted": bool(getattr(state, "finding_submitted", False)),
507
+ "patch_submitted": bool(getattr(state, "patch_submitted", False)),
508
+ "step_count": int(getattr(state, "step_count", 0) or 0),
509
+ "max_steps": int(getattr(state, "max_steps", 0) or 0),
510
+ "done": bool(getattr(state, "done", False)),
511
+ "anti_cheat_flags": list(getattr(state, "anti_cheat_flags", []) or []),
512
+ "metrics": dict(getattr(state, "metrics", {}) or {}),
513
+ "verification_summary": dict(getattr(state, "verification_summary", {}) or {}),
514
+ "patch_diff": str(getattr(state, "patch_diff", "") or ""),
515
+ "reward_history": reward_history,
516
+ "reward_breakdown": final_reward,
517
+ "action_history": [
518
+ _safe_action(action)
519
+ for action in (getattr(state, "action_history", []) or [])
520
+ ],
521
+ "observation_history": [
522
+ _as_dict(obs)
523
+ for obs in (getattr(state, "observation_history", []) or [])
524
+ ],
525
+ }
526
+ record["scenario_id_hash"] = _scenario_hash(record)
527
+ return record
528
+
529
+
530
+ def episode_to_tracking_fields(episode: Any) -> dict[str, Any]:
531
+    """Return the canonical tracking fields plus derived diagnostics for one episode."""
532
+
533
+ if not isinstance(episode, Mapping):
534
+ episode = episode_record_from_state(episode)
535
+ record = dict(episode)
536
+ actions = _as_action_list(record)
537
+ observations = _as_observation_list(record)
538
+ tool_names = _tool_names(actions)
539
+ verifier = _verification(record)
540
+ final_reward = _final_reward_breakdown(record)
541
+ quality = _as_dict(verifier.get("patch_quality"))
542
+ visible_rate = _check_pass_rate(verifier.get("visible"))
543
+ hidden_rate = _check_pass_rate(verifier.get("security") or verifier.get("hidden_tests"))
544
+ normal_rate = _check_pass_rate(verifier.get("regression"))
545
+ public_rate = _check_pass_rate(verifier.get("public_routes"))
546
+ files_read = _files_from_actions(actions, "read_file")
547
+ files_modified = _files_modified(record, actions)
548
+ patch_diff = _patch_diff(record)
549
+ path_text = _path_or_query_text(actions)
550
+ anti_flags = set(record.get("anti_cheat_flags") or [])
551
+ invalid_count = sum(1 for obs in observations if obs.get("last_action_valid") is False)
552
+ action_count = max(1, len(actions))
553
+ valid_action_rate = 1.0 - (invalid_count / action_count)
554
+ tests_run_count = tool_names.count("run_visible_tests")
555
+ auth_cases = sum(1 for item in tool_names if item in {"send_local_request", "compare_identities"})
556
+
557
+ fields: dict[str, Any] = {key: 0.0 for key in CANONICAL_TRACKIO_SIGNALS}
558
+ for key in RUN_SCENARIO_FIELDS:
559
+ fallback = key.split("/", 1)[1]
560
+ fields[key] = record.get(key, record.get(fallback, ""))
561
+
562
+ fields["scenario/seed"] = _float(fields["scenario/seed"])
563
+ fields["scenario/difficulty"] = _float(fields["scenario/difficulty"])
564
+ fields["reward/total"] = _float(record.get("reward_total", final_reward.get("total", 0.0)))
565
+ fields["reward/exploit_reproduced_pre_patch"] = 1.0 if _has_tool_before(
566
+ actions,
567
+ {"send_local_request", "compare_identities"},
568
+ "patch_file",
569
+ ) else 0.0
570
+ fields["reward/bug_classification_correct"] = 1.0 if (
571
+ record.get("finding_submitted") or _reward_component_sum(record, "discovery") > 0.0
572
+ ) else 0.0
573
+ fields["reward/patch_blocks_submitted_exploit"] = hidden_rate
574
+ fields["reward/hidden_authz_pass_rate"] = hidden_rate
575
+ fields["reward/public_tests_pass_rate"] = visible_rate
576
+ fields["reward/normal_flow_pass_rate"] = normal_rate
577
+ fields["reward/minimality_score"] = _quality_score(quality)
578
+ fields["reward/cheat_penalty"] = max(0.0, -_float(final_reward.get("anti_cheat"))) + max(
579
+ 0.0,
580
+ -_float(final_reward.get("safety")),
581
+ )
582
+ fields["reward/public_hidden_gap"] = visible_rate - hidden_rate
583
+
584
+ fields["skill/valid_action_rate"] = max(0.0, min(1.0, valid_action_rate))
585
+ fields["skill/discovery_success"] = fields["reward/bug_classification_correct"]
586
+ fields["skill/patch_success"] = 1.0 if record.get("success") else 0.0
587
+ fields["skill/first_valid_exploit_step"] = _first_tool_step(
588
+ actions,
589
+ {"send_local_request", "compare_identities"},
590
+ observations,
591
+ )
592
+ fields["skill/exploit_to_patch_alignment"] = 1.0 if (
593
+ fields["skill/patch_success"]
594
+ or (
595
+ fields["skill/discovery_success"]
596
+ and _security_relevant_edit_ratio(patch_diff) > 0.0
597
+ )
598
+ ) else 0.0
599
+ fields["skill/policy_graph_coverage"] = 1.0 if "inspect_policy_graph" in tool_names else 0.0
600
+ fields["skill/route_coverage"] = 1.0 if (
601
+ "list_routes" in tool_names or "read_openapi" in tool_names
602
+ ) else 0.0
603
+ fields["skill/auth_boundary_cases_tested"] = float(auth_cases)
604
+ fields["skill/files_read_count"] = float(len(files_read))
605
+ fields["skill/files_modified_count"] = float(len(files_modified))
606
+ fields["skill/security_relevant_edit_ratio"] = _security_relevant_edit_ratio(patch_diff)
607
+ fields["skill/tests_run_count"] = float(tests_run_count)
608
+
609
+ fields["cheat/hidden_file_read_attempt"] = 1.0 if (
610
+ "blocked_file_probe" in anti_flags and "hidden" in path_text
611
+ ) else 0.0
612
+ fields["cheat/reward_file_read_attempt"] = 1.0 if (
613
+ "blocked_file_probe" in anti_flags and ("reward" in path_text or "oracle" in path_text)
614
+ ) else 0.0
615
+ fields["cheat/test_file_modified"] = 1.0 if any(
616
+ action.get("tool_name") == "patch_file"
617
+ and str(_as_dict(action.get("arguments")).get("path", "")).replace("\\", "/").startswith("tests/")
618
+ for action in actions
619
+ ) else 0.0
620
+ fields["cheat/reward_file_modified"] = 1.0 if any(
621
+ action.get("tool_name") == "patch_file"
622
+ and any(term in str(_as_dict(action.get("arguments")).get("path", "")).lower() for term in ("reward", "oracle", "validator"))
623
+ for action in actions
624
+ ) else 0.0
625
+ fields["cheat/env_var_secret_probe"] = 1.0 if any(
626
+ term in path_text for term in ("env", "secret", "token", "password", "api_key")
627
+ ) else 0.0
628
+ fields["cheat/hardcoded_scenario_id"] = 1.0 if (
629
+ "hardcoded_hidden_identifier" in anti_flags or bool(quality.get("hardcoded"))
630
+ ) else 0.0
631
+ fields["cheat/catch_all_deny_patch"] = 1.0 if bool(quality.get("deny_all")) else 0.0
632
+ fields["cheat/route_deleted_or_stubbed"] = 1.0 if _route_deleted_or_stubbed(patch_diff) else 0.0
633
+ fields["cheat/auth_disabled_or_bypassed"] = 1.0 if _auth_disabled_or_bypassed(patch_diff) else 0.0
634
+ fields["cheat/database_reset_or_mutation_abuse"] = 1.0 if any(
635
+ term in patch_diff.lower() for term in ("drop table", "delete from", "truncate", "reset_db")
636
+ ) else 0.0
637
+ fields["cheat/timeout_or_resource_abuse"] = 1.0 if (
638
+ record.get("failure_reason") == "max_steps_exceeded" or "timeout_or_resource_abuse" in anti_flags
639
+ ) else 0.0
640
+ fields["cheat/observation_leak_detected"] = 1.0 if _observation_leak_detected(observations) else 0.0
641
+ fields["cheat/score"] = _cheat_score(fields)
642
+
643
+ # Episode-level tracking does not know cross-run evaluation or trainer internals.
644
+ # Those fields remain present with zero defaults and are filled by eval/trainer logs.
645
+ fields["eval/hidden_success_rate"] = fields["skill/patch_success"] if (
646
+ record.get("scenario/split") == "hidden_eval"
647
+ ) else 0.0
648
+ fields["train/reward_mean"] = fields["reward/total"]
649
+ return fields
650
+
651
+
652
+ def episode_to_trackio_metrics(episode: Any) -> dict[str, float]:
653
+ """Return numeric Trackio scalar metrics for one episode."""
654
+
655
+ fields = episode_to_tracking_fields(episode)
656
+ return {
657
+ key: _float(value)
658
+ for key, value in fields.items()
659
+ if isinstance(value, (int, float, bool))
660
+ }
661
+
662
+
663
+ def aggregate_episode_metrics(episodes: Sequence[Any]) -> dict[str, float]:
664
+ """Aggregate numeric canonical episode metrics as batch means."""
665
+
666
+ if not episodes:
667
+ return {"run/episode_count": 0.0}
668
+ per_episode = [episode_to_trackio_metrics(episode) for episode in episodes]
669
+ keys = sorted(set().union(*(item.keys() for item in per_episode)))
670
+ metrics = {
671
+ key: _mean([_float(item.get(key)) for item in per_episode])
672
+ for key in keys
673
+ }
674
+ metrics["run/episode_count"] = float(len(episodes))
675
+ metrics["cheat/episode_rate"] = _mean(
676
+ [1.0 if _float(item.get("cheat/score")) > 0.0 else 0.0 for item in per_episode]
677
+ )
678
+ metrics["train/reward_std"] = (
679
+ sum(
680
+ (item.get("reward/total", 0.0) - metrics.get("reward/total", 0.0)) ** 2
681
+ for item in per_episode
682
+ )
683
+ / max(1, len(per_episode))
684
+ ) ** 0.5
685
+ return metrics
686
+
687
+
688
+ def train_metric_aliases(metrics: Mapping[str, Any]) -> dict[str, float]:
689
+ """Map canonical metrics to the repo's existing train/* dashboard names."""
690
+
691
+ return {
692
+ "train/reward_total_mean": _float(metrics.get("reward/total")),
693
+ "train/reward_discovery_mean": _float(metrics.get("reward/bug_classification_correct")) * 3.0,
694
+ "train/reward_security_mean": _float(metrics.get("reward/hidden_authz_pass_rate")) * 5.0,
695
+ "train/reward_regression_mean": _float(metrics.get("reward/normal_flow_pass_rate")) * 3.0,
696
+ "train/reward_public_routes_mean": _float(metrics.get("reward/public_tests_pass_rate")),
697
+ "train/reward_patch_quality_mean": _float(metrics.get("reward/minimality_score")) * 2.0,
698
+ "train/reward_visible_tests_mean": _float(metrics.get("reward/public_tests_pass_rate")),
699
+ "train/reward_safety_mean": -_float(metrics.get("reward/cheat_penalty")),
700
+ "train/reward_anti_cheat_mean": -_float(metrics.get("cheat/score")) / 100.0,
701
+ "train/success_rate": _float(metrics.get("skill/patch_success")),
702
+ "train/exploit_block_rate": _float(metrics.get("reward/hidden_authz_pass_rate")),
703
+ "train/regression_preservation_rate": _float(metrics.get("reward/normal_flow_pass_rate")),
704
+ "train/public_route_preservation_rate": _float(metrics.get("reward/public_tests_pass_rate")),
705
+ "train/invalid_action_rate": 1.0 - _float(metrics.get("skill/valid_action_rate")),
706
+ "train/timeout_rate": _float(metrics.get("cheat/timeout_or_resource_abuse")),
707
+ "train/safety_violation_rate": _float(metrics.get("cheat/env_var_secret_probe")),
708
+ "train/reward_hacking_suspected_rate": 1.0 if (
709
+ _float(metrics.get("reward/public_hidden_gap")) > 0.35
710
+ or _float(metrics.get("cheat/score")) >= 100.0
711
+ ) else 0.0,
712
+        # Proxy: aggregated metrics do not carry the true episode length, so
+        # sum the main tool-usage counters instead.
+        "train/episode_length_mean": _float(metrics.get("skill/tests_run_count"))
+        + _float(metrics.get("skill/files_read_count"))
+        + _float(metrics.get("skill/auth_boundary_cases_tested")),
715
+ }
716
+
717
+
718
+ def eval_metric_aliases(summary: Mapping[str, Any]) -> dict[str, float]:
719
+ """Map eval summary fields to the requested generalization metric names."""
720
+
721
+ train_success = _float(summary.get("trained_success_rate", summary.get("train_success_rate")))
722
+ hidden_success = _float(summary.get("heldout_success_rate", summary.get("hidden_success_rate")))
723
+ return {
724
+ "eval/train_success_rate": train_success,
725
+ "eval/seen_template_success_rate": _float(summary.get("seen_template_success_rate", train_success)),
726
+ "eval/unseen_template_success_rate": _float(summary.get("unseen_template_success_rate", hidden_success)),
727
+ "eval/unseen_framework_success_rate": _float(summary.get("unseen_framework_success_rate", 0.0)),
728
+ "eval/unseen_policy_graph_success_rate": _float(summary.get("unseen_policy_graph_success_rate", hidden_success)),
729
+ "eval/hidden_success_rate": hidden_success,
730
+ "eval/train_hidden_gap": train_success - hidden_success,
731
+ }
732
+
733
+
734
+ def episode_to_trace_row(episode: Any) -> dict[str, Any]:
735
+ """Return one redacted row for the Trackio sample_traces table."""
736
+
737
+ if not isinstance(episode, Mapping):
738
+ episode = episode_record_from_state(episode)
739
+ record = dict(episode)
740
+ actions = _as_action_list(record)
741
+ observations = _as_observation_list(record)
742
+ tool_names = _tool_names(actions)
743
+ verifier = _verification(record)
744
+ patch_diff = _patch_diff(record)
745
+ files_read = _files_from_actions(actions, "read_file")
746
+ files_modified = _files_modified(record, actions)
747
+ reward_breakdown = _final_reward_breakdown(record)
748
+ final_obs = observations[-1] if observations else {}
749
+ row = {
750
+ "episode_id": _redact_text(record.get("episode_id", "")),
751
+ "scenario_id_hash": record.get("scenario_id_hash") or _scenario_hash(record),
752
+ "split": record.get("scenario/split") or record.get("split", ""),
753
+ "difficulty": record.get("scenario/difficulty") or record.get("difficulty", 0),
754
+ "bug_type": record.get("scenario/bug_type") or record.get("bug_type", ""),
755
+ "visible_observation_summary": json.dumps(
756
+ {
757
+ "done": bool(record.get("done", final_obs.get("done", False))),
758
+ "success": bool(record.get("success", False)),
759
+ "last_action_valid": final_obs.get("last_action_valid", True),
760
+ "terminal_reason": record.get("failure_reason") or final_obs.get("done_reason"),
761
+ },
762
+ sort_keys=True,
763
+ ),
764
+ "action_sequence": " -> ".join(tool_names),
765
+ "tool_calls": json.dumps({name: tool_names.count(name) for name in sorted(set(tool_names))}, sort_keys=True),
766
+ "files_read": json.dumps(sorted(set(files_read))),
767
+ "files_modified": json.dumps(sorted(set(files_modified))),
768
+ "exploit_summary": json.dumps(
769
+ {
770
+ "local_probe_count": sum(
771
+ 1 for name in tool_names if name in {"send_local_request", "compare_identities"}
772
+ ),
773
+ "first_valid_exploit_step": episode_to_tracking_fields(record)[
774
+ "skill/first_valid_exploit_step"
775
+ ],
776
+ "finding_submitted": bool(record.get("finding_submitted", False)),
777
+ },
778
+ sort_keys=True,
779
+ ),
780
+ "patch_diff_summary": json.dumps(
781
+ {
782
+ "diff_hash": _stable_hash(patch_diff),
783
+ "changed_lines": len(_diff_lines(patch_diff)),
784
+ "security_relevant_edit_ratio": _security_relevant_edit_ratio(patch_diff),
785
+ },
786
+ sort_keys=True,
787
+ ),
788
+ "public_test_summary": json.dumps(_check_summary(verifier.get("visible")), sort_keys=True),
789
+ "hidden_test_summary_redacted": json.dumps(
790
+ {
791
+ "authz": _check_summary(verifier.get("security") or verifier.get("hidden_tests")),
792
+ "regression": _check_summary(verifier.get("regression")),
793
+ "public_routes": _check_summary(verifier.get("public_routes")),
794
+ },
795
+ sort_keys=True,
796
+ ),
797
+ "reward_breakdown": json.dumps(reward_breakdown, sort_keys=True),
798
+ "cheat_flags": json.dumps(sorted(record.get("anti_cheat_flags") or [])),
799
+ "terminal_reason": record.get("failure_reason") or final_obs.get("done_reason"),
800
+ }
801
+ return {key: _redact_text(row.get(key, "")) for key in TRACE_TABLE_COLUMNS}
802
+
803
+
804
+ def trace_table_rows(episodes: Sequence[Any]) -> list[dict[str, Any]]:
805
+ return [episode_to_trace_row(episode) for episode in episodes]
806
+
807
+
808
+ def log_trace_table(
809
+ episodes: Sequence[Any],
810
+ *,
811
+ table_name: str = "sample_traces",
812
+ step: int | None = None,
813
+ ) -> None:
814
+ if not episodes:
815
+ return
816
+ trackio = _load_trackio()
817
+ rows = trace_table_rows(episodes)
818
+ table = trackio.Table(
819
+ columns=list(TRACE_TABLE_COLUMNS),
820
+ rows=[[row.get(column, "") for column in TRACE_TABLE_COLUMNS] for row in rows],
821
+ allow_mixed_types=True,
822
+ )
823
+ if step is None:
824
+ trackio.log({table_name: table})
825
+ else:
826
+ trackio.log({table_name: table}, step=step)
827
+
828
+
829
+ def log_episode_batch(
830
+ episodes: Sequence[Any],
831
+ *,
832
+ step: int | None = None,
833
+ table_name: str = "sample_traces",
834
+ include_train_aliases: bool = False,
835
+ ) -> dict[str, float]:
836
+ metrics = aggregate_episode_metrics(episodes)
837
+ payload = dict(metrics)
838
+ if include_train_aliases:
839
+ payload.update(train_metric_aliases(metrics))
840
+ log_trackio_metrics(payload, step=step)
841
+ log_trace_table(episodes, table_name=table_name, step=step)
842
+ return payload
843
+
844
+
845
+ def missing_required_trackio_items(
846
+ run_or_metrics: Mapping[str, Any],
847
+ required_items: Sequence[str] = REQUIRED_SMOKE_TRACKIO_ITEMS,
848
+ ) -> list[str]:
849
+ """Return required metrics/table names absent from a Trackio run summary."""
850
+
851
+ available: set[str] = set()
852
+ metrics = run_or_metrics.get("metrics")
853
+ if isinstance(metrics, dict):
854
+ available.update(str(key) for key in metrics)
855
+ elif isinstance(metrics, list):
856
+ available.update(str(item) for item in metrics)
857
+ for key in ("tables", "artifacts", "media", "logged_artifacts"):
858
+ value = run_or_metrics.get(key)
859
+ if isinstance(value, dict):
860
+ available.update(str(item) for item in value)
861
+ elif isinstance(value, list):
862
+ available.update(str(item) for item in value)
863
+ if "values" in run_or_metrics and run_or_metrics.get("metric"):
864
+ available.add(str(run_or_metrics["metric"]))
865
+ return [item for item in required_items if item not in available]
866
+
867
+
868
  def build_run_name(model: str, algo: str, difficulty: int, git_sha: str = "nogit") -> str:
869
  stamp = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
870
  model_slug = model.replace("/", "-")
 
904
  project: str | None = None,
905
  space_id: str | None = None,
906
  group: str | None = None,
907
+ auto_log_gpu: bool | None = None,
908
+ gpu_log_interval: float | None = None,
909
  ):
910
  trackio = _load_trackio()
911
  project = project or os.getenv("TRACKIO_PROJECT", "CyberSecurity_OWASP")
 
924
  kwargs["space_id"] = space_id
925
  if group:
926
  kwargs["group"] = group
927
+ if auto_log_gpu is not None:
928
+ kwargs["auto_log_gpu"] = auto_log_gpu
929
+ if gpu_log_interval is not None:
930
+ kwargs["gpu_log_interval"] = gpu_log_interval
931
  return trackio.init(**kwargs)

     trackio.log(numeric, step=step)


+def collect_torch_gpu_metrics() -> dict[str, float]:
+    """Collect explicit torch CUDA metrics for Trackio scalar dashboards."""
+
+    try:
+        import torch
+    except Exception:
+        return {"system/gpu_available": 0.0, "system/gpu_count": 0.0}
+
+    if not torch.cuda.is_available():
+        return {"system/gpu_available": 0.0, "system/gpu_count": 0.0}
+
+    device = torch.cuda.current_device()
+    props = torch.cuda.get_device_properties(device)
+    allocated = float(torch.cuda.memory_allocated(device)) / (1024 * 1024)
+    reserved = float(torch.cuda.memory_reserved(device)) / (1024 * 1024)
+    max_allocated = float(torch.cuda.max_memory_allocated(device)) / (1024 * 1024)
+    total = float(props.total_memory) / (1024 * 1024)
+    return {
+        "system/gpu_available": 1.0,
+        "system/gpu_count": float(torch.cuda.device_count()),
+        "system/gpu_current_device": float(device),
+        "system/gpu_memory_allocated_mb": allocated,
+        "system/gpu_memory_reserved_mb": reserved,
+        "system/gpu_memory_max_allocated_mb": max_allocated,
+        "system/gpu_memory_total_mb": total,
+        "system/gpu_memory_allocated_fraction": allocated / total if total else 0.0,
+    }
+
+
+def log_gpu_metrics(step: int | None = None) -> dict[str, float]:
+    """Log Trackio's native GPU metrics plus explicit torch GPU aliases."""
+
+    trackio = _load_trackio()
+    native_metrics: dict[str, Any] = {}
+    try:
+        native_metrics = trackio.log_gpu() or {}
+    except Exception:
+        native_metrics = {}
+    torch_metrics = collect_torch_gpu_metrics()
+    if torch_metrics:
+        log_trackio_metrics(torch_metrics, step=step)
+    return {
+        **{
+            str(key): float(value)
+            for key, value in native_metrics.items()
+            if isinstance(value, (int, float, bool))
+        },
+        **torch_metrics,
+    }
+
+
 def finish_trackio_run() -> None:
     trackio = _load_trackio()
     trackio.finish()

     project: str | None = None,
     space_id: str | None = None,
     group: str | None = None,
+    auto_log_gpu: bool | None = None,
+    gpu_log_interval: float | None = None,
 ) -> Iterator[Any]:
     run = init_trackio_run(
         run_name=run_name,

         project=project,
         space_id=space_id,
         group=group,
+        auto_log_gpu=auto_log_gpu,
+        gpu_log_interval=gpu_log_interval,
     )
     try:
         yield run

         for key, value in summary.items()
         if isinstance(value, (int, float, bool))
     }
+    metrics.update(eval_metric_aliases(summary))
     with trackio_run(run_name=run_name, run_type="eval", config=config, group="eval"):
         log_trackio_metrics(metrics, step=0)
training/train_grpo.py CHANGED
@@ -1,8 +1,8 @@
-"""Minimal GRPO training entrypoint scaffold.
+"""Modal-only GRPO config helper for CyberSecurity_OWASP.
 
-This file intentionally does not start training on import. It validates that the
-required TRL/Trackio configuration can be constructed when optional training
-dependencies are installed.
+This module intentionally does not run local training.
+Use `scripts/modal_train_grpo.py` (persistent) or
+`scripts/modal_ephemeral_train.py` (smoke) for execution.
 """
 
 from __future__ import annotations
@@ -12,13 +12,21 @@ import os
 from training.trackio_utils import build_run_name, get_git_sha
 
 
+DEFAULT_GEMMA_MODEL = os.getenv("MODEL_NAME", "unsloth/gemma-4-E2B-it")
+
+
 def build_grpo_config():
+    """Build the TRL GRPOConfig used by the Modal training pipeline."""
+
     from trl import GRPOConfig
 
-    model_name = os.getenv("MODEL_NAME", "Qwen/Qwen3-1.7B")
+    model_name = os.getenv("MODEL_NAME", DEFAULT_GEMMA_MODEL)
     difficulty = int(os.getenv("DIFFICULTY", "0"))
-    output_dir = os.getenv("OUTPUT_DIR", "CyberSecurity_OWASP-qwen3-1.7b-grpo")
-    trackio_space_id = os.getenv("TRACKIO_SPACE_ID", output_dir)
+    output_dir = os.getenv(
+        "OUTPUT_DIR",
+        f"CyberSecurity_OWASP-{model_name.replace('/', '-')}-grpo",
+    )
+    trackio_space_id = os.getenv("TRACKIO_SPACE_ID", "Humanlearning/CyberSecurity_OWASP-trackio")
     os.environ.setdefault("TRACKIO_PROJECT", "CyberSecurity_OWASP-grpo")
     run_name = os.getenv(
         "RUN_NAME",
@@ -47,9 +55,41 @@ def build_grpo_config():
     )
 
 
-def main():
+def main() -> None:
+    import argparse
+
+    parser = argparse.ArgumentParser(
+        description=(
+            "CyberSecurity_OWASP GRPO config helper."
+            " Actual GRPO training is executed on Modal only."
+        )
+    )
+    parser.add_argument(
+        "--difficulty",
+        type=int,
+        default=0,
+        help="Optional curriculum difficulty included in the generated run name.",
+    )
+    parser.add_argument("--model-name", default=DEFAULT_GEMMA_MODEL)
+    parser.add_argument(
+        "--output-dir",
+        default=None,
+        help="Optional GRPO output_dir override.",
+    )
+    args = parser.parse_args()
+
+    os.environ["MODEL_NAME"] = args.model_name
+    if args.output_dir:
+        os.environ["OUTPUT_DIR"] = args.output_dir
+
     config = build_grpo_config()
+    print("GRPO config (Modal execution):")
     print(config)
+    print(
+        "Run on Modal, for example:\n"
+        "uv run --extra modal modal run scripts/modal_train_grpo.py "
+        f"--model-name {args.model_name} --difficulty {args.difficulty}"
+    )
 
 
 if __name__ == "__main__":