Humanlearning committed on
Commit be8eade · 1 Parent(s): 448eddd

feat: enhance scenario authoring and caching mechanisms, update action submission terminology, and improve reward configuration for CyberSecurity_OWASP environment

Files changed (45)
  1. .agents/skills/cybersecurity-owasp-trainer/SKILL.md +26 -9
  2. .agents/skills/cybersecurity-owasp-trainer/references/trl-unsloth-openenv-notes.md +1 -1
  3. 00_PROJECT_BRIEF.md +1 -2
  4. 01_ARCHITECTURE.md +126 -68
  5. AGENTS.md +5 -5
  6. README.md +54 -11
  7. assets/architecture_diagram.mmd +55 -48
  8. assets/architecture_diagram.svg +172 -81
  9. assets/env_rl_training_flow_diagram.mmd +36 -18
  10. assets/env_rl_training_flow_diagram.svg +146 -88
  11. config.py +180 -0
  12. configs/scenario_authoring.small.json +34 -0
  13. evals.py +6 -4
  14. models.py +15 -1
  15. pyproject.toml +5 -0
  16. reward_config.py +119 -0
  17. rewards.py +391 -40
  18. scripts/generate_scenario_cache.py +56 -0
  19. scripts/generate_scenarios.sh +1 -1
  20. scripts/modal_ephemeral_train.py +110 -11
  21. scripts/modal_train_grpo.py +325 -63
  22. server/CyberSecurity_OWASP_environment.py +96 -10
  23. server/__init__.py +2 -0
  24. server/app_sandbox.py +59 -3
  25. server/curriculum.py +13 -5
  26. server/episode_logger.py +7 -0
  27. server/scenario_cache.py +525 -0
  28. server/scenario_factory.py +1 -1
  29. server/verifier.py +3 -2
  30. tests/helpers.py +19 -7
  31. tests/test_closed_loop_runtime.py +2 -2
  32. tests/test_invalid_actions.py +4 -0
  33. tests/test_modal_scenario_cache_static.py +39 -0
  34. tests/test_reward_config.py +48 -0
  35. tests/test_rewards.py +63 -1
  36. tests/test_scenario_authoring_config.py +72 -0
  37. tests/test_scenario_cache.py +148 -0
  38. tests/test_trackio_utils.py +1 -1
  39. training/configs/grpo_small.yaml +142 -1
  40. training/reward_funcs.py +20 -0
  41. training/rollout.py +15 -2
  42. training/trackio_utils.py +84 -7
  43. training/train_grpo.py +12 -3
  44. uv.lock +2 -0
  45. validators.py +52 -2
.agents/skills/cybersecurity-owasp-trainer/SKILL.md CHANGED
@@ -29,6 +29,8 @@ Do not start real training until all checks below are true:
29
  - A local server or Docker server can run, and at least one manual episode completes.
30
  - Scripted random, bad, and oracle policies run without crashing; oracle gets high reward on easy seeds.
31
  - At least 10 validation rollouts complete and sampled rollout artifacts look behaviorally plausible.
 
 
32
  - Trackio run config is set and can log a smoke metric locally or to the canonical Space.
33
 
34
  If any gate fails, fix the environment, verifier, reward engine, or rollout parser before touching trainer scale.
@@ -46,23 +48,33 @@ Prefer the existing repo modules:
46
  Default environment values:
47
 
48
  ```powershell
49
- $env:MODEL_NAME = "google/gemma-2-2b-it"
50
  $env:TRACKIO_SPACE_ID = "Humanlearning/CyberSecurity_OWASP-trackio"
51
  $env:TRACKIO_PROJECT = "CyberSecurity_OWASP"
52
  $env:DIFFICULTY = "0"
 
 
53
  ```
54
 
55
  Use level-0 debug runs before scaling, and verify them through Modal smoke/ephemeral runs.
56
57
  ## Training Workflow
58
 
59
- 1. Validate the environment first: run the targeted tests that cover models, reset/step/state, rewards, anti-cheat, seed reproducibility, invalid actions, and rollouts.
60
- 2. Run a Modal smoke path for lightweight config/run verification.
61
- 3. Run a frozen-model or dummy-policy rollout on Modal and inspect the action trace, observations, terminal reason, and reward breakdown.
62
- 4. Confirm Trackio receives component metrics and the run name follows `CyberSecurity_OWASP-<model>-<algo>-level<difficulty>-<YYYYMMDD-HHMM>-<git_sha>`.
63
- 5. Start a very small GRPO run only after the above passes. Start via `scripts/modal_train_grpo.py --mode train`.
64
- 6. Evaluate baseline, trained, and held-out splits with `training/eval_before_after.py` and save summaries under `outputs/evals/`.
65
- 7. Save sampled rollouts under `outputs/rollouts/` for baseline, mid-training, trained, and held-out evidence.
 
66
 
67
  ## Reward And Monitoring
68
 
@@ -71,17 +83,22 @@ Track at least these behavior columns:
71
  - Reward components: total, discovery, security, regression, public routes, patch quality, visible tests, safety, anti-cheat.
72
  - Rates: success, exploit-block, regression preservation, public-route preservation, anti-cheat pass, invalid action, timeout, safety violation, reward-hacking suspected.
73
  - Efficiency: episode length mean/p95, rollouts per second, tokens per second, loss, learning rate, KL, grad norm.
74
- - Environment timing: reset, step, verifier, reward, scenario compile, error rate, difficulty, seed.
75
 
76
  Stop or roll back if reward rises while sampled traces show deny-all patches, hardcoded users/resources/tenants, fixture/test tampering, repeated invalid actions, public routes being locked, or visible-test-only optimization.
77
 
 
 
78
  ## TRL, OpenEnv, And Unsloth Guidance
79
 
80
  - Use TRL GRPO for verifier-driven rewards. Keep multiple independent reward functions for logging and diagnosis.
81
  - Keep the existing custom rollout path unless deliberately migrating to TRL's `environment_factory`. If migrating, preserve typed actions, observations, reward component logging, anti-cheat flags, and rollout artifacts.
82
  - Use Modal as the default training path; local-only vLLM/GRPO execution is intentionally avoided in this repository.
83
  - For OpenEnv server training concurrency, ensure the server supports enough concurrent sessions for the generation batch.
 
 
84
  - Use Unsloth with LoRA or QLoRA for memory efficiency when the training machine supports it. Start from an instruct-capable checkpoint and verify the model has non-zero success probability before RL.
 
85
  - Pin and smoke-test TRL, Unsloth, vLLM, CUDA, and torch versions before longer runs.
86
  - Save LoRA adapters or use Unsloth-supported merged save paths. Do not naively upcast a 4-bit model and merge adapters manually.
87
 
 
29
  - A local server or Docker server can run, and at least one manual episode completes.
30
  - Scripted random, bad, and oracle policies run without crashing; oracle gets high reward on easy seeds.
31
  - At least 10 validation rollouts complete and sampled rollout artifacts look behaviorally plausible.
32
+ - The validated scenario cache exists, is mounted, and meets the configured split/difficulty minimums.
33
+ - Modal smoke and GRPO runs use `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; runtime `reset()` must not compile scenarios or call an LLM during training/eval.
34
  - Trackio run config is set and can log a smoke metric locally or to the canonical Space.
35
 
36
  If any gate fails, fix the environment, verifier, reward engine, or rollout parser before touching trainer scale.
 
48
  Default environment values:
49
 
50
  ```powershell
51
+ $env:MODEL_NAME = "unsloth/gemma-4-E2B-it"
52
  $env:TRACKIO_SPACE_ID = "Humanlearning/CyberSecurity_OWASP-trackio"
53
  $env:TRACKIO_PROJECT = "CyberSecurity_OWASP"
54
  $env:DIFFICULTY = "0"
55
+ $env:CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR = "scenario_cache"
56
+ $env:CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE = "fallback"
57
  ```
58
 
59
  Use level-0 debug runs before scaling, and verify them through Modal smoke/ephemeral runs.
60
 
61
+ Modal uses two persistent cache volumes:
62
+
63
+ - `CyberSecurity_OWASP-model-cache`: Hugging Face, torch, Unsloth, Triton, and model artifacts.
64
+ - `CyberSecurity_OWASP-scenario-cache`: validated executable scenario bundles for `reset()`.
65
+
66
+ Scenario/curriculum authoring is config-driven through `configs/scenario_authoring.small.json`. The default offline author model is `deepseek-ai/DeepSeek-V4-Pro`; this is not the RL training policy model. The RL training model is pinned to `unsloth/gemma-4-E2B-it`, matching the Unsloth Gemma 4 E2B RL notebook.
67
+
68
  ## Training Workflow
69
 
70
+ 1. Validate the environment first: run the targeted tests that cover models, reset/step/state, rewards, anti-cheat, seed reproducibility, invalid actions, rollouts, config, and scenario cache.
71
+ 2. Prepare the scenario cache once per generator/verifier version: `scripts/modal_train_grpo.py --mode prepare-cache` or `scripts/modal_ephemeral_train.py --mode prepare-cache`.
72
+ 3. Run the CPU-only Modal scenario-cache preflight before any GPU training. If cache hit rate or coverage is below config, stop and refill the cache instead of allocating a GPU.
73
+ 4. Run a frozen-model or dummy-policy rollout on Modal and inspect the action trace, observations, terminal reason, cache metadata, and reward breakdown.
74
+ 5. Confirm Trackio receives component metrics and the run name follows `CyberSecurity_OWASP-<model>-<algo>-level<difficulty>-<YYYYMMDD-HHMM>-<git_sha>`.
75
+ 6. Start a very small GRPO run only after the above passes. Start via `scripts/modal_train_grpo.py --mode train`.
76
+ 7. Evaluate baseline, trained, and held-out splits with `training/eval_before_after.py` and save summaries under `outputs/evals/`.
77
+ 8. Save sampled rollouts under `outputs/rollouts/` for baseline, mid-training, trained, and held-out evidence.
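The run-name convention checked in step 5 can be sketched as a small helper. This is an illustrative function, not the launcher's actual implementation; the `nogit` fallback is an assumption:

```python
from datetime import datetime, timezone
import subprocess

def build_run_name(model: str, algo: str, difficulty: int) -> str:
    """Build CyberSecurity_OWASP-<model>-<algo>-level<difficulty>-<YYYYMMDD-HHMM>-<git_sha>."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        sha = "nogit"  # fallback for environments without git metadata
    short_model = model.split("/")[-1]  # drop the hub namespace prefix
    return f"CyberSecurity_OWASP-{short_model}-{algo}-level{difficulty}-{stamp}-{sha}"
```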
78
 
79
  ## Reward And Monitoring
80
 
 
83
  - Reward components: total, discovery, security, regression, public routes, patch quality, visible tests, safety, anti-cheat.
84
  - Rates: success, exploit-block, regression preservation, public-route preservation, anti-cheat pass, invalid action, timeout, safety violation, reward-hacking suspected.
85
  - Efficiency: episode length mean/p95, rollouts per second, tokens per second, loss, learning rate, KL, grad norm.
86
+ - Environment timing: reset, step, verifier, reward, scenario cache hit/miss, scenario bundle load, scenario compile fallback, error rate, difficulty, seed.
87
 
88
  Stop or roll back if reward rises while sampled traces show deny-all patches, hardcoded users/resources/tenants, fixture/test tampering, repeated invalid actions, public routes being locked, or visible-test-only optimization.
89
 
90
+ Stop the run, or downgrade it to local-dev only, if Modal training/eval shows runtime scenario compilation, cache misses in `require` mode, or a cache hit rate below the configured target.
91
+
92
  ## TRL, OpenEnv, And Unsloth Guidance
93
 
94
  - Use TRL GRPO for verifier-driven rewards. Keep multiple independent reward functions for logging and diagnosis.
95
  - Keep the existing custom rollout path unless deliberately migrating to TRL's `environment_factory`. If migrating, preserve typed actions, observations, reward component logging, anti-cheat flags, and rollout artifacts.
96
  - Use Modal as the default training path; local-only vLLM/GRPO execution is intentionally avoided in this repository.
97
  - For OpenEnv server training concurrency, ensure the server supports enough concurrent sessions for the generation batch.
98
+ - Keep scenario generation out of the rollout hot path. `reset()` should clone cached bundles; any LLM scenario authoring belongs to offline cache prep.
99
+ - GPU training launchers must call the CPU-only scenario-cache preflight before spawning the L4 function, so missing cache coverage fails before GPU allocation.
100
  - Use Unsloth with LoRA or QLoRA for memory efficiency when the training machine supports it. Start from an instruct-capable checkpoint and verify the model has non-zero success probability before RL.
101
+ - Do not swap the RL model away from `unsloth/gemma-4-E2B-it` for smoke runs. Cost-control should use `--max-steps`, `--dataset-size`, `--max-completion-length`, and cache preflight, not a different model.
102
  - Pin and smoke-test TRL, Unsloth, vLLM, CUDA, and torch versions before longer runs.
103
  - Save LoRA adapters or use Unsloth-supported merged save paths. Do not naively upcast a 4-bit model and merge adapters manually.
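The CPU-only cache preflight mentioned above can be sketched as a coverage check. The directory layout and the `(split, difficulty) -> minimum count` config shape are assumptions, not the repo's actual API:

```python
from pathlib import Path

def preflight_ok(cache_root: Path, minimums: dict[tuple[str, int], int]) -> bool:
    """Return True only if every (split, difficulty) slice holds enough
    validated bundle directories. Launchers abort before GPU allocation on False."""
    for (split, difficulty), minimum in minimums.items():
        slice_dir = cache_root / split / f"difficulty_{difficulty}"
        bundles = [p for p in slice_dir.iterdir() if p.is_dir()] if slice_dir.is_dir() else []
        if len(bundles) < minimum:
            return False
    return True
```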
104
 
.agents/skills/cybersecurity-owasp-trainer/references/trl-unsloth-openenv-notes.md CHANGED
@@ -32,7 +32,7 @@ Recheck these pages before major dependency upgrades because TRL, OpenEnv integr
32
  - Start from a capable instruct model or lightly format-tuned model. If success probability is effectively zero, RL will not bootstrap.
33
  - Keep reward functions/verifiers simple and trustworthy first; add shaping only after sparse reward blocks learning.
34
  - Unsloth recipes commonly use Qwen, Gemma, Llama, Phi, Mistral, and gpt-oss variants. For this repo, prefer the configured `Qwen/Qwen3-1.7B` or another small instruct/coder checkpoint for smoke runs.
35
- - For Unsloth-specific GRPO recipes, use more than two generations per prompt when hardware allows. Keep the repo's small `num_generations=2` only as a low-cost smoke/debug default unless tests prove it is sufficient.
36
  - Pin torch, CUDA, vLLM, TRL, and Unsloth versions for any serious run, then run a short smoke test before scaling.
37
 
38
  ## Saving And Serving
 
32
  - Start from a capable instruct model or lightly format-tuned model. If success probability is effectively zero, RL will not bootstrap.
33
  - Keep reward functions/verifiers simple and trustworthy first; add shaping only after sparse reward blocks learning.
34
  - Unsloth recipes commonly use Qwen, Gemma, Llama, Phi, Mistral, and gpt-oss variants. For this repo, prefer the configured `Qwen/Qwen3-1.7B` or another small instruct/coder checkpoint for smoke runs.
35
+ - For Unsloth-specific GRPO recipes, use more than two generations per prompt when hardware allows. In this repo, keep `num_generations=2` only for smoke/debug runs; for non-smoke training runs default to `num_generations>=6`.
36
  - Pin torch, CUDA, vLLM, TRL, and Unsloth versions for any serious run, then run a short smoke test before scaling.
37
 
38
  ## Saving And Serving
00_PROJECT_BRIEF.md CHANGED
@@ -56,7 +56,7 @@ This environment is useful because it targets a real gap between today’s scann
56
  - **Scanners detect patterns.** This environment trains policy-aware reasoning.
57
  - **Unit tests check known cases.** This environment includes hidden authorization invariants.
58
  - **Static repair can overfit.** This environment forces the model to preserve valid business behavior.
59
- - **One-app benchmarks are easy to memorize.** This environment compiles many equivalent-but-different apps from policy graphs, templates, route shapes, schema names, and hidden test seeds.
60
 
61
  The outcome is a model that becomes better at a practical DevSecOps workflow: safely reviewing and repairing authorization logic in small-to-medium web apps.
62
 
@@ -142,4 +142,3 @@ CyberSecurity_OWASP/
142
  | OpenEnv build/deploy docs | Defines the required OpenEnv structure: models, server, client, Docker, HF Spaces deployment. | 8.5/10 |
143
  | Hackathon judging criteria | Aligns deliverables with scoring: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
144
  | TRL/OpenEnv GRPO example | Shows a practical pattern for environment rollouts, reward functions, and Trackio logging. | 8/10 |
145
-
 
56
  - **Scanners detect patterns.** This environment trains policy-aware reasoning.
57
  - **Unit tests check known cases.** This environment includes hidden authorization invariants.
58
  - **Static repair can overfit.** This environment forces the model to preserve valid business behavior.
59
+ - **One-app benchmarks are easy to memorize.** This environment prepares and caches many equivalent-but-different apps from policy graphs, templates, route shapes, schema names, and hidden test seeds, then keeps runtime `reset()` deterministic and fast.
60
 
61
  The outcome is a model that becomes better at a practical DevSecOps workflow: safely reviewing and repairing authorization logic in small-to-medium web apps.
62
 
 
142
  | OpenEnv build/deploy docs | Defines the required OpenEnv structure: models, server, client, Docker, HF Spaces deployment. | 8.5/10 |
143
  | Hackathon judging criteria | Aligns deliverables with scoring: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
144
  | TRL/OpenEnv GRPO example | Shows a practical pattern for environment rollouts, reward functions, and Trackio logging. | 8/10 |
 
01_ARCHITECTURE.md CHANGED
@@ -22,32 +22,30 @@ Editable source: `assets/architecture_diagram.mmd`
22
 
23
  ```mermaid
24
  flowchart TB
25
- subgraph A[Scenario + Curriculum Factory]
26
- A1[Policy Graph Generator\nroles, users, tenants, ownership]
27
- A2[Curriculum Controller\nmastery, weak spots, difficulty tier]
28
- A3[Bounded Adversarial Designer\nsafe local scenario targets]
29
- A4[Template Renderer\nFastAPI routes, services, auth helpers]
30
- A5[A01 Bug Mutator\nIDOR, tenant, role, public-route traps]
31
- A6[ScenarioSpec + Oracle\nvisible hints + hidden policy tuples]
32
- A1 --> A3
33
- A2 --> A3
34
- A3 --> A4 --> A5 --> A6
35
  end
36
 
37
- subgraph B[CyberSecurity_OWASP OpenEnv Server]
38
- B1[reset\(seed, difficulty\)\nselect curriculum profile]
39
- B2[Episode State Store\nphase, history, metrics, weakness, patch diff]
40
- B3[Typed Action Tools\ninspect, request, patch, visible tests]
41
- B4[Ephemeral App Sandbox\ncode workspace + fixtures + local API model]
42
- B5[Multi-layer Verifier\nvisible, hidden, oracle, regression]
43
- B6[Deterministic Reward Engine\nstable components + penalties]
44
- B7[Episode Artifact Logger\nJSONL transcript + verifier + diff]
45
- B8[state\(\)\nstructured metadata for debugging/eval]
46
- B1 --> B2 --> B3
47
- B3 <--> B4
48
- B4 --> B5 --> B6 --> B2
49
- B2 --> B7 --> A2
50
- B2 --> B8
51
  end
52
 
53
  subgraph C[Single LLM Agent]
@@ -57,42 +55,63 @@ flowchart TB
57
  C1 --> C2 --> C3
58
  end
59
 
60
- subgraph D[Training + Evaluation]
61
- D1[Parallel Rollout Loop\nreset step* → terminal reward]
62
  D2[TRL GRPO + LoRA]
63
- D3[Trackio Metrics\nreward curves, pass rates, failure modes]
64
  D4[Held-out Family Eval\nbase vs trained model]
65
  D5[Demo Artifacts\nbefore/after traces + JSONL]
66
  D1 --> D2 --> D3 --> D4 --> D5
67
  end
68
 
69
- A6 --> B1
70
- C3 -->|typed action| B3
71
- B3 -->|observation + reward + done| C1
72
- B6 --> D1
73
  D2 --> C1
74
- B6 --> D4
 
75
  ```
76
 
77
  ## 3. Component responsibilities
78
 
79
- ### 3.1 Scenario Factory
 
 
 
 
80
 
81
- The scenario factory generates many small but realistic web apps from a structured authorization policy.
82
 
83
- It should output:
84
 
85
- - application code;
86
- - route map;
87
- - database fixture;
88
- - user/session/token fixtures;
89
- - policy graph;
90
- - intentionally injected access-control bug;
91
- - public tests visible to the agent;
92
- - hidden tests invisible to the agent;
93
- - metadata for eval and debugging.
94
 
95
- The scenario compiler is the main anti-overfitting mechanism. It should vary:
 
 
96
 
97
  - route names;
98
  - schema names;
@@ -105,13 +124,30 @@ The scenario compiler is the main anti-overfitting mechanism. It should vary:
105
  - visible test coverage;
106
  - hidden invariant seeds.
107
 
108
- The runtime now treats curriculum and adversarial targeting as first-class scenario inputs:
109
 
110
  - `CurriculumController` tracks target weakness mastery, recent reward trend, failure counts, and difficulty tier.
111
- - `BoundedAdversarialDesigner` chooses safe synthetic lab targets such as same-role cross-object access, cross-tenant boundaries, public-route overlocking, alternate-service reachability, and visible-test-only traps.
112
- - `ScenarioFactory` combines the policy graph, curriculum profile, adversarial target, renderer, and hidden oracle metadata into one deterministic scenario spec.
113
  - Hidden-eval episodes hold out scenario families, not only seeds, by marking evaluation-only scenario-family metadata in state rather than observations.
114
 
115
  ### 3.2 Policy Graph Generator
116
 
117
  The policy graph is the ground truth for intended behavior.
@@ -167,7 +203,7 @@ MVP bug classes:
167
 
168
  The OpenEnv server should implement the standard lifecycle:
169
 
170
- - `reset()` — initialize a fresh scenario instance.
171
  - `step(action)` — execute one typed action and return observation, reward, and done.
172
  - `state()` — expose episode metadata for debugging and evaluation.
173
 
@@ -189,15 +225,19 @@ The agent should interact through typed actions. Keep the interface small enough
189
  ```python
190
  @dataclass
191
  class CyberSecurityOWASPAction(Action):
192
- action_type: Literal[
193
- "read_file",
194
- "list_files",
195
  "list_routes",
196
- "inspect_policy",
 
 
197
  "send_local_request",
198
- "run_public_tests",
199
- "apply_patch",
 
 
200
  "submit_fix",
 
201
  ]
202
  arguments: dict
203
  ```
@@ -206,12 +246,13 @@ Recommended actions:
206
 
207
  | Action | Purpose | Safety boundary |
208
  |---|---|---|
209
- | `inspect_policy` | Read intended authorization rules. | Only synthetic policy. |
210
  | `list_routes` | See local app route map. | No internet target. |
211
  | `read_file` | Inspect selected source file. | Sandbox allowlist only. |
212
  | `send_local_request` | Validate behavior against local app. | Local generated app only. |
213
- | `run_public_tests` | Run visible tests. | No hidden test disclosure. |
214
- | `apply_patch` | Modify source through unified diff. | Patch size and file allowlist limits. |
 
215
  | `submit_fix` | End episode and trigger hidden eval. | Final hidden score only, no leaked test details. |
216
 
217
  ### 3.6 Observation schema
@@ -263,9 +304,9 @@ class CyberSecurityOWASPState(State):
263
  ```text
264
  1. reset()
265
  - curriculum selects difficulty tier and target weakness
266
- - bounded adversarial designer chooses a safe local scenario target
267
- - scenario factory compiles app from policy graph + template + injected bug
268
- - initialize ephemeral app sandbox and fixture state
269
  - return initial observation
270
 
271
  2. agent loop
@@ -290,6 +331,8 @@ class CyberSecurityOWASPState(State):
290
  - send metrics to Trackio during training/eval
291
  ```
292
 
 
 
293
  ## 5. Reward design
294
 
295
  The reward should be deterministic, decomposed, and resistant to reward hacking. The maximum terminal reward remains **15.0** and high reward requires deterministic verifier success, not explanation quality.
@@ -306,10 +349,22 @@ Stable reward keys:
306
  "visible_tests": 0.0,
307
  "safety": 0.0,
308
  "anti_cheat": 0.0,
309
  "total": 0.0,
310
  }
311
  ```
312
313
  ### Reward components
314
 
315
  | Component | Purpose |
@@ -387,26 +442,28 @@ Editable source: `assets/env_rl_training_flow_diagram.mmd`
387
 
388
  ```text
389
  1. Build CyberSecurity_OWASP OpenEnv server.
390
- 2. Generate 600 MVP scenarios.
391
- 3. Run baseline eval with the base model.
392
- 4. Train with GRPO/TRL or Unsloth using rollout episodes.
393
- 5. Log reward components to Trackio.
394
  6. Run held-out eval every N training steps.
395
- 7. Inspect failure clusters.
396
- 8. Add scenario mutations only if failures reveal overfitting.
397
  9. Produce final demo: before/after trace + reward curve + held-out eval table.
398
  ```
399
 
400
  Recommended initial training setup (Modal-first):
401
 
402
  ```text
403
- Model: google/gemma-2-2b-it (or compatible Gemma-class instruct model)
404
  Algorithm: GRPO via TRL or Unsloth-compatible loop
405
  Dataset prompt: repeated task instruction with randomized scenario IDs
406
  Max steps per episode: 30
407
  Rollouts per prompt: 2-4
408
  Logging: Trackio
409
  Primary eval: held-out deterministic test pass rate
 
 
410
 
411
  Training execution is expected to run on Modal (persistent or ephemeral) rather than locally.
412
  ```
@@ -501,3 +558,4 @@ Expected endpoints:
501
  | Hackathon judging criteria | Informs demo priorities: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
502
  | TRL/OpenEnv training example | Informs rollout function, decomposed reward functions, and Trackio logging pattern. | 8/10 |
503
  | Kube SRE Gym README | Informs the closed-loop pattern: adversarial scenario design, curriculum mastery tracking, real tool interaction, verification, and artifact-driven storytelling. | 8/10 |
 
 
22
 
23
  ```mermaid
24
  flowchart TB
25
+ subgraph A[Async Scenario Authoring + Curriculum Factory]
26
+ A1[Config-guided LLM Scenario Author\nDeepSeek-V4-Pro default]
27
+ A2[ScenarioSpec JSON\npolicy, app family, bug target]
28
+ A3[Template + A01 Mutator\nFastAPI code variants]
29
+ A4[Deterministic Compiler\nexecutable bundle]
30
+ A5[Static + Dynamic Verifier\nsolvable, safe, hidden/visible tests]
31
+ A6[Difficulty Calibrator\nbaseline pass-rate buckets]
32
+ A7[Versioned Scenario Cache\nsplit, difficulty, family, hash]
33
+ A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
 
34
  end
35
 
36
+ subgraph B[CyberSecurity_OWASP OpenEnv Runtime]
37
+ B1[reset\(seed, difficulty, family_budget\)\ncache lookup only]
38
+ B2[Curriculum Sampler\nvalidated cache slice]
39
+ B3[Episode State Store\nphase, history, cache metadata, patch diff]
40
+ B4[Typed Action Tools\ninspect, request, patch, visible tests]
41
+ B5[Ephemeral App Sandbox\ncloned cached workspace + fixtures]
42
+ B6[Multi-layer Verifier\nvisible, hidden, oracle, regression]
43
+ B7[Deterministic Reward Engine\nstable components + penalties]
44
+ B8[Episode Artifact Logger\nJSONL transcript + verifier + diff]
45
+ B1 --> B2 --> B3 --> B4
46
+ B4 <--> B5
47
+ B5 --> B6 --> B7 --> B3
48
+ B3 --> B8
 
49
  end
50
 
51
  subgraph C[Single LLM Agent]
 
55
  C1 --> C2 --> C3
56
  end
57
 
58
+ subgraph D[Training + Evaluation + Demo]
59
+ D1[Parallel Rollouts\nfast cached reset]
60
  D2[TRL GRPO + LoRA]
61
+ D3[Trackio Curves\nreward, pass rates, cache metrics]
62
  D4[Held-out Family Eval\nbase vs trained model]
63
  D5[Demo Artifacts\nbefore/after traces + JSONL]
64
  D1 --> D2 --> D3 --> D4 --> D5
65
  end
66
 
67
+ subgraph E[Feedback / Adaptation Loop]
68
+ E1[Episode logs + failures]
69
+ E2[Mastery Model\nweakness and plateau tracking]
70
+ E3[Cache Sampling Weights\nnew generation queue]
71
+ E1 --> E2 --> E3
72
+ end
73
+
74
+ A7 --> B1
75
+ C3 -->|typed action| B4
76
+ B4 -->|observation + reward + done| C1
77
+ B7 --> D1
78
  D2 --> C1
79
+ B8 --> E1
80
+ E3 --> A1
81
  ```
82
 
83
  ## 3. Component responsibilities
84
 
85
+ ### 3.1 Async Scenario Authoring Plane
86
+
87
+ Scenario generation is offline, asynchronous, validated, and cached. Runtime `reset()` must not call an LLM and must not compile a fresh app during Modal smoke, training, or evaluation runs.
88
+
89
+ The scenario authoring plane outputs complete executable bundles:
90
 
91
+ - `scenario.json`;
92
+ - `app_source/`;
93
+ - `policy_graph.json`;
94
+ - `visible_tests.py`;
95
+ - `hidden_tests.py`;
96
+ - `oracle_tests.py`;
97
+ - `expected_exploit_trace.json`;
98
+ - `reward_config.json`;
99
+ - `metadata.json`.
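A minimal completeness check over the bundle layout above might look like this (helper name is illustrative):

```python
from pathlib import Path

# Files every validated scenario bundle must ship, per the list above.
REQUIRED = [
    "scenario.json",
    "policy_graph.json",
    "visible_tests.py",
    "hidden_tests.py",
    "oracle_tests.py",
    "expected_exploit_trace.json",
    "reward_config.json",
    "metadata.json",
]

def missing_bundle_files(bundle_dir: Path) -> list[str]:
    """Return required files (or the app_source/ dir) absent from a bundle."""
    missing = [name for name in REQUIRED if not (bundle_dir / name).is_file()]
    if not (bundle_dir / "app_source").is_dir():
        missing.append("app_source/")
    return missing
```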
100
 
101
+ The default scenario/curriculum author is configured in `configs/scenario_authoring.small.json`:
102
 
103
+ ```json
+ {
+   "provider": "huggingface",
+   "model_id": "deepseek-ai/DeepSeek-V4-Pro",
+   "thinking_mode": "thinking",
+   "reasoning_effort": "high",
+   "temperature": 1.0,
+   "top_p": 1.0
+ }
+ ```
 
111
 
112
+ DeepSeek-V4-Pro is only used for offline scenario/curriculum authoring. It is not the RL policy model unless explicitly selected for training.
113
+
114
+ The compiler remains the main anti-overfitting mechanism. It should vary:
115
 
116
  - route names;
117
  - schema names;
 
124
  - visible test coverage;
125
  - hidden invariant seeds.
126
 
127
+ The runtime treats curriculum and cache sampling as first-class scenario inputs:
128
 
129
  - `CurriculumController` tracks target weakness mastery, recent reward trend, failure counts, and difficulty tier.
130
+ - Offline cache prep uses the configured LLM author, deterministic compiler, verifier, and baseline-agent difficulty calibrator.
131
+ - `ScenarioCache` stores validated bundles by split, difficulty, family, generator version, verifier version, and scenario hash.
132
  - Hidden-eval episodes hold out scenario families, not only seeds, by marking evaluation-only scenario-family metadata in state rather than observations.
133
 
134
+ Cache keys include:
135
+
136
+ ```text
137
+ difficulty_level
138
+ authz_bug_type
139
+ app_family
140
+ framework
141
+ policy_shape
142
+ tenant_model
143
+ exploit_depth
144
+ patch_scope
145
+ regression_risk
146
+ generator_version
147
+ verifier_version
148
+ scenario_hash
149
+ ```
150
+
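Under the assumption that bundles are laid out on disk by version, split, difficulty, and family, the key fields above could map to a deterministic path like this sketch (function name and layout are illustrative):

```python
import hashlib
import json
from pathlib import Path

def bundle_path(root: Path, meta: dict) -> Path:
    """Derive a deterministic location for a validated scenario bundle.

    `meta` holds cache-key fields from the list above plus a `split`;
    the scenario hash is computed from canonical JSON of the metadata
    so equal specs always map to the same directory.
    """
    scenario_hash = hashlib.sha256(
        json.dumps(meta, sort_keys=True).encode()
    ).hexdigest()[:16]
    return (
        root
        / f"v{meta['generator_version']}-{meta['verifier_version']}"
        / meta["split"]
        / f"difficulty_{meta['difficulty_level']}"
        / meta["app_family"]
        / scenario_hash
    )
```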
151
  ### 3.2 Policy Graph Generator
152
 
153
  The policy graph is the ground truth for intended behavior.
 
203
 
204
  The OpenEnv server should implement the standard lifecycle:
205
 
206
+ - `reset()` — initialize a fresh episode from a cached scenario bundle.
207
  - `step(action)` — execute one typed action and return observation, reward, and done.
208
  - `state()` — expose episode metadata for debugging and evaluation.
209
 
 
225
  ```python
226
  @dataclass
227
  class CyberSecurityOWASPAction(Action):
228
+ tool_name: Literal[
229
+ "inspect_policy_graph",
 
230
  "list_routes",
231
+ "read_openapi",
232
+ "read_file",
233
+ "search_code",
234
  "send_local_request",
235
+ "compare_identities",
236
+ "submit_diagnosis",
237
+ "patch_file",
238
+ "run_visible_tests",
239
  "submit_fix",
240
+ "noop",
241
  ]
242
  arguments: dict
243
  ```
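A standalone sketch of how a rollout parser might instantiate these typed actions. The OpenEnv `Action` base class is omitted here, and the argument keys are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Literal

ToolName = Literal[
    "inspect_policy_graph", "list_routes", "read_openapi", "read_file",
    "search_code", "send_local_request", "compare_identities",
    "submit_diagnosis", "patch_file", "run_visible_tests", "submit_fix", "noop",
]

@dataclass
class CyberSecurityOWASPAction:  # base Action omitted for a standalone sketch
    tool_name: ToolName
    arguments: dict = field(default_factory=dict)

# A rollout parser would emit actions like this from model output:
probe = CyberSecurityOWASPAction(
    "send_local_request",
    {"method": "GET", "path": "/api/notes/42", "as_user": "alice"},
)
```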
 
246
 
247
  | Action | Purpose | Safety boundary |
248
  |---|---|---|
249
+ | `inspect_policy_graph` | Read intended authorization rules. | Only synthetic policy. |
250
  | `list_routes` | See local app route map. | No internet target. |
251
  | `read_file` | Inspect selected source file. | Sandbox allowlist only. |
252
  | `send_local_request` | Validate behavior against local app. | Local generated app only. |
253
+ | `submit_diagnosis` | Record bug class, route, policy rule, evidence trace IDs, and fix plan. | Does not reveal hidden tests. |
254
+ | `run_visible_tests` | Run visible tests. | No hidden test disclosure. |
255
+ | `patch_file` | Modify source through unified diff or full content. | Patch size and file allowlist limits. |
256
  | `submit_fix` | End episode and trigger hidden eval. | Final hidden score only, no leaked test details. |
257
 
258
  ### 3.6 Observation schema
 
304
  ```text
305
  1. reset()
306
  - curriculum selects difficulty tier and target weakness
307
+ - runtime samples or directly loads a validated cached bundle
308
+ - clone cached `app_source/` into an isolated ephemeral workspace
309
+ - initialize fixture state, cache metadata, and sandbox handles
310
  - return initial observation
311
 
312
  2. agent loop
 
331
  - send metrics to Trackio during training/eval
332
  ```
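The clone step in `reset()` above can be sketched as follows (names illustrative): the cached bundle stays read-only and every episode gets its own copy.

```python
import shutil
import tempfile
from pathlib import Path

def clone_workspace(bundle_dir: Path) -> Path:
    """Copy the cached app_source/ into a fresh ephemeral workspace
    so episodes never mutate the shared scenario cache."""
    workspace = Path(tempfile.mkdtemp(prefix="owasp_episode_"))
    shutil.copytree(bundle_dir / "app_source", workspace / "app_source")
    return workspace
```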
333
 
334
+ `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` is mandatory for Modal smoke, training, and evaluation. In that mode a missing cache bundle is a hard failure. Local development may use `fallback`, which compiles deterministically on a miss, but that path is not allowed for meaningful training.
335
+
336
  ## 5. Reward design
337
 
338
  The reward should be deterministic, decomposed, and resistant to reward hacking. The maximum terminal reward remains **15.0** and high reward requires deterministic verifier success, not explanation quality.
 
349
  "visible_tests": 0.0,
350
  "safety": 0.0,
351
  "anti_cheat": 0.0,
352
+ "terminal_total": 0.0,
353
+ "progressive": 0.0,
354
+ "step_penalty": 0.0,
355
+ "speed_bonus": 0.0,
356
+ "token_penalty": 0.0,
357
+ "behavior_penalty": 0.0,
358
+ "train_total": 0.0,
359
  "total": 0.0,
360
  }
361
  ```
362
 
363
+ Sparse evaluation uses `terminal_total` as `total`. Dense training uses
364
+ `terminal_total + shaping_weight * progressive + efficiency - penalties` as `total`,
365
+ with all reward values and short descriptions configured in
366
+ `training/configs/grpo_small.yaml`.
367
+
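A sketch of the dense-training formula, assuming penalties are stored as positive magnitudes and that `speed_bonus` is the efficiency term; the real weights live in `training/configs/grpo_small.yaml`:

```python
def train_total(components: dict, shaping_weight: float = 0.1) -> float:
    """terminal_total + shaping_weight * progressive + efficiency - penalties,
    per the formula above. The 0.1 default and the efficiency/penalty
    groupings are assumptions."""
    efficiency = components.get("speed_bonus", 0.0)
    penalties = (
        components.get("step_penalty", 0.0)
        + components.get("token_penalty", 0.0)
        + components.get("behavior_penalty", 0.0)
    )
    return (
        components.get("terminal_total", 0.0)
        + shaping_weight * components.get("progressive", 0.0)
        + efficiency
        - penalties
    )
```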
368
  ### Reward components
369
 
370
  | Component | Purpose |
 
442
 
443
  ```text
444
  1. Build CyberSecurity_OWASP OpenEnv server.
445
+ 2. Prepare validated scenario cache once per generator/verifier version.
446
+ 3. Run baseline eval with cached validation/held-out bundles.
447
+ 4. Train with GRPO/TRL or Unsloth using cached rollout episodes.
448
+ 5. Log reward components, pass rates, reset latency, and cache hit metrics to Trackio.
449
  6. Run held-out eval every N training steps.
450
+ 7. Inspect failure clusters and adjust cache sampling weights.
451
+ 8. Refresh only 5-10% of scenarios per epoch when new weak spots are found.
452
  9. Produce final demo: before/after trace + reward curve + held-out eval table.
453
  ```
454
 
455
  Recommended initial training setup (Modal-first):
456
 
457
  ```text
458
+ Model: unsloth/gemma-4-E2B-it
459
  Algorithm: GRPO via TRL or Unsloth-compatible loop
460
  Dataset prompt: repeated task instruction with randomized scenario IDs
461
  Max steps per episode: 30
462
  Rollouts per prompt: 2-4
463
  Logging: Trackio
464
  Primary eval: held-out deterministic test pass rate
465
+ Scenario cache mode: require
466
+ Scenario cache volume: CyberSecurity_OWASP-scenario-cache
467
 
468
  Training execution is expected to run on Modal (persistent or ephemeral) rather than locally.
469
  ```
 
558
  | Hackathon judging criteria | Informs demo priorities: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
559
  | TRL/OpenEnv training example | Informs rollout function, decomposed reward functions, and Trackio logging pattern. | 8/10 |
560
  | Kube SRE Gym README | Informs the closed-loop pattern: adversarial scenario design, curriculum mastery tracking, real tool interaction, verification, and artifact-driven storytelling. | 8/10 |
561
+ | DeepSeek-V4-Pro Hugging Face model card and encoding notes | Informs the default offline scenario-author config and the note that prompt handling should not assume a Jinja chat template. | 8/10 |
AGENTS.md CHANGED
@@ -310,7 +310,7 @@ class CyberSecurityOWASPAction(Action):
310
  "search_code",
311
  "send_local_request",
312
  "compare_identities",
313
- "submit_finding",
314
  "patch_file",
315
  "run_visible_tests",
316
  "submit_fix",
@@ -370,7 +370,7 @@ Actions must be explicit, typed, serializable, and constrained. Invalid actions
370
 
371
  | Phase | Allowed tools |
372
  |---|---|
373
- | discover | `inspect_policy_graph`, `list_routes`, `read_openapi`, `read_file`, `search_code`, `send_local_request`, `compare_identities`, `submit_finding`, `noop` |
374
  | patch | `read_file`, `search_code`, `patch_file`, `run_visible_tests`, `send_local_request`, `submit_fix`, `noop` |
375
  | done | no state-changing tools; return stable done observation |
376
 
@@ -397,7 +397,7 @@ Actions must be explicit, typed, serializable, and constrained. Invalid actions
397
  `compare_identities`
398
  : Runs the same local request as two generated users and summarizes behavioral differences.
399
 
400
- `submit_finding`
401
  : Accepts structured evidence of the suspected authorization bug. Required before patch phase unless curriculum level explicitly allows blind patching.
402
 
403
  `patch_file`
@@ -484,7 +484,7 @@ class CyberSecurityOWASPEnvironment(Environment):
484
  3. Increment step count.
485
  4. Execute the tool.
486
  5. Update state/history.
487
- 6. Run verifier if `submit_finding`, `run_visible_tests`, or `submit_fix`.
488
  7. Compute reward components.
489
  8. Check terminal conditions.
490
  9. Return observation, reward, and done through OpenEnv step result handling.
@@ -805,7 +805,7 @@ grpo_config = GRPOConfig(
805
  num_train_epochs=1,
806
  per_device_train_batch_size=1,
807
  gradient_accumulation_steps=32,
808
- num_generations=2,
809
  max_prompt_length=4096,
810
  max_completion_length=768,
811
  use_vllm=True,
 
310
  "search_code",
311
  "send_local_request",
312
  "compare_identities",
313
+ "submit_diagnosis",
314
  "patch_file",
315
  "run_visible_tests",
316
  "submit_fix",
 
370
 
371
  | Phase | Allowed tools |
372
  |---|---|
373
+ | discover | `inspect_policy_graph`, `list_routes`, `read_openapi`, `read_file`, `search_code`, `send_local_request`, `compare_identities`, `submit_diagnosis`, `noop` |
374
  | patch | `read_file`, `search_code`, `patch_file`, `run_visible_tests`, `send_local_request`, `submit_fix`, `noop` |
375
  | done | no state-changing tools; return stable done observation |
376
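The phase table above can be expressed as a simple lookup; this is a minimal sketch, and the real dispatch in the server modules may differ in detail:

```python
# Allowed tools per phase, transcribed from the table above.
ALLOWED_TOOLS = {
    "discover": {
        "inspect_policy_graph", "list_routes", "read_openapi", "read_file",
        "search_code", "send_local_request", "compare_identities",
        "submit_diagnosis", "noop",
    },
    "patch": {
        "read_file", "search_code", "patch_file", "run_visible_tests",
        "send_local_request", "submit_fix", "noop",
    },
    "done": set(),  # no state-changing tools in the terminal phase
}


def gate(phase: str, tool: str) -> bool:
    """Return True when `tool` is allowed in `phase`."""
    return tool in ALLOWED_TOOLS.get(phase, set())


print(gate("discover", "submit_diagnosis"))  # True
print(gate("patch", "submit_diagnosis"))     # False
```

Gating by set membership keeps invalid actions cheap to reject before any tool executes.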
 
 
397
  `compare_identities`
398
  : Runs the same local request as two generated users and summarizes behavioral differences.
399
 
400
+ `submit_diagnosis`
401
  : Accepts structured evidence of the suspected authorization bug. Required before patch phase unless curriculum level explicitly allows blind patching.
402
 
403
  `patch_file`
 
484
  3. Increment step count.
485
  4. Execute the tool.
486
  5. Update state/history.
487
+ 6. Run verifier if `submit_diagnosis`, `run_visible_tests`, or `submit_fix`.
488
  7. Compute reward components.
489
  8. Check terminal conditions.
490
  9. Return observation, reward, and done through OpenEnv step result handling.
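Steps 3 through 9 above can be sketched as one loop body. This is a hedged illustration: the real logic lives in `server/CyberSecurity_OWASP_environment.py`, and the helpers here are stubs invented for the example:

```python
VERIFIED_ACTIONS = {"submit_diagnosis", "run_visible_tests", "submit_fix"}


def execute_tool(state, action):          # stub for step 4
    return {"ok": True, "tool": action["tool"]}

def run_verifier(state, action):          # stub for step 6
    return {"passed": True}

def compute_reward(state, verdict):       # stub for step 7
    return 1.0 if verdict and verdict["passed"] else 0.0

def build_observation(state):             # stub for step 9
    return {"step": state["step_count"], "phase": state["phase"]}


def step(state, action):
    state["step_count"] += 1                            # 3. increment step count
    result = execute_tool(state, action)                # 4. execute the tool
    state["history"].append((action["tool"], result))   # 5. update state/history
    verdict = None
    if action["tool"] in VERIFIED_ACTIONS:              # 6. verifier-triggering actions
        verdict = run_verifier(state, action)
    reward = compute_reward(state, verdict)             # 7. compute reward components
    done = (                                            # 8. check terminal conditions
        action["tool"] == "submit_fix"
        or state["step_count"] >= state["max_steps"]
    )
    return build_observation(state), reward, done       # 9. observation, reward, done


state = {"step_count": 0, "history": [], "phase": "patch", "max_steps": 30}
obs, reward, done = step(state, {"tool": "submit_fix"})
print(obs, reward, done)  # -> {'step': 1, 'phase': 'patch'} 1.0 True
```

The key property is that only the three verifier-triggering tools ever invoke the verifier, so ordinary inspection steps stay cheap.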
 
805
  num_train_epochs=1,
806
  per_device_train_batch_size=1,
807
  gradient_accumulation_steps=32,
808
+ num_generations=6,
809
  max_prompt_length=4096,
810
  max_completion_length=768,
811
  use_vllm=True,
README.md CHANGED
@@ -18,10 +18,10 @@ tags:
18
  `CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:
19
 
20
  ```text
21
- inspect generated app + policy -> discover authorization bug -> submit finding -> patch code -> preserve intended behavior
22
  ```
23
 
24
- The current implementation includes a functional closed-loop MVP scenario: an invoices FastAPI-style app with one injected OWASP A01 BOLA/IDOR defect, curriculum-aware scenario selection, bounded adversarial targeting, an ephemeral app sandbox, multi-layer deterministic verifier checks, anti-cheat safeguards, JSONL episode artifacts, and decomposed reward.
25
 
26
  ## Diagrams
27
 
@@ -36,6 +36,7 @@ Editable Mermaid sources are available in `assets/architecture_diagram.mmd` and
36
  ```bash
37
  uv sync --extra dev
38
  uv run --extra dev pytest
 
39
  uv run server --port 8000
40
  ```
41
 
@@ -68,7 +69,7 @@ Supported tools:
68
  - `search_code`
69
  - `send_local_request`
70
  - `compare_identities`
71
- - `submit_finding`
72
  - `patch_file`
73
  - `run_visible_tests`
74
  - `submit_fix`
@@ -76,7 +77,7 @@ Supported tools:
76
 
77
  Tools are phase-gated:
78
 
79
- - `discover`: inspect policy/routes/files, run safe local requests, compare identities, submit finding.
80
  - `patch`: read/search, patch editable app files, run visible tests, submit final fix.
81
  - `done`: stable terminal observation only.
82
 
@@ -94,15 +95,43 @@ Terminal reward uses stable components:
94
  "visible_tests": 0.0,
95
  "safety": 0.0,
96
  "anti_cheat": 0.0,
 
 
97
  "total": 0.0,
98
  }
99
  ```
100
 
101
  The verifier rewards blocking the hidden exploit while preserving legitimate owner/admin behavior and intentionally public routes. Terminal scoring requires visible checks, hidden authorization checks, a policy-oracle matrix, regression checks, public-route preservation, and patch-quality checks. It penalizes deny-all fixes, hardcoded IDs, repeated/invalid action patterns, hidden file probes, external URL attempts, and test/fixture tampering.
102
 
103
- ## Scenario Generation
 
104
 
105
- `reset(seed)` asks the `CurriculumController` for a difficulty tier and target weakness, then `ScenarioFactory` uses a bounded adversarial designer to compile a fresh isolated workspace under a temp directory. The MVP compiler generates:
 
 
 
106
 
107
  - invoices domain policy graph;
108
  - bounded adversarial target metadata such as same-role cross-object access, cross-tenant access, public-route overlocking traps, alternate route/service reachability, or visible-test-only edge cases;
@@ -118,8 +147,9 @@ Additional domains and bug families are scaffolded for extension.
118
  The OpenEnv runtime is split into small server modules:
119
 
120
  - `server/curriculum.py` tracks mastery, weak spots, reward trend, and difficulty tier.
 
121
  - `server/adversarial_designer.py` chooses safe synthetic scenario targets from tracked weaknesses.
122
- - `server/scenario_factory.py` compiles the generated app, visible hints, hidden facts, scenario family, and template metadata.
123
  - `server/app_sandbox.py` handles editable workspace reads, patches, local requests, and OpenAPI summaries.
124
  - `server/action_tools.py` dispatches typed tools through the sandbox.
125
  - `server/authz_oracle.py` builds the hidden allowed/denied user-resource-action matrix.
@@ -153,6 +183,16 @@ The training scaffold is intentionally minimal until the environment/verifier be
153
  Use the Modal launchers in `scripts/modal_train_grpo.py` (persistent) and
154
  `scripts/modal_ephemeral_train.py` (smoke) for real GRPO runs.
155
 
 
 
156
  ## Trackio Run Tracking
157
 
158
  Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.
@@ -184,6 +224,7 @@ uv sync --extra modal
184
  Run a temporary Modal app for a cheap environment/training smoke check:
185
 
186
  ```bash
 
187
  uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode smoke --episodes 4
188
  ```
189
 
@@ -218,10 +259,11 @@ uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
218
  Run the default smoke GRPO job:
219
 
220
  ```bash
 
221
  uv run --extra modal modal run scripts/modal_train_grpo.py \
222
  --max-steps 10 \
223
  --dataset-size 16 \
224
- --num-generations 2 \
225
  --difficulty 0
226
  ```
227
 
@@ -235,7 +277,7 @@ uv run --extra modal modal run scripts/modal_train_grpo.py \
235
  --repo-branch master \
236
  --max-steps 10 \
237
  --dataset-size 16 \
238
- --num-generations 2 \
239
  --difficulty 0
240
  ```
241
 
@@ -243,10 +285,11 @@ Defaults are derived from `HF_TOKEN`:
243
 
244
  - Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
245
  - Trackio project: `CyberSecurity_OWASP-grpo`
246
- - Output repo: `<hf-user>/CyberSecurity_OWASP-gemma-2-2b-grpo-lora`
 
247
 
248
  Override these with `--trackio-space-id`, `--trackio-project`, and
249
- `--output-repo-id` when needed.
250
 
251
  ## Docker / Spaces
252
 
 
18
  `CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:
19
 
20
  ```text
21
+ inspect generated app + policy -> discover authorization bug -> submit diagnosis -> patch code -> preserve intended behavior
22
  ```
23
 
24
+ The current implementation includes a functional closed-loop MVP scenario: an invoices FastAPI-style app with one injected OWASP A01 BOLA/IDOR defect, config-driven curriculum settings, cache-backed scenario reset, an ephemeral app sandbox, multi-layer deterministic verifier checks, anti-cheat safeguards, JSONL episode artifacts, and decomposed reward.
25
 
26
  ## Diagrams
27
 
 
36
  ```bash
37
  uv sync --extra dev
38
  uv run --extra dev pytest
39
+ uv run python scripts/generate_scenario_cache.py --train-per-bucket 3 --validation-per-bucket 3 --heldout-per-bucket 3
40
  uv run server --port 8000
41
  ```
42
 
 
69
  - `search_code`
70
  - `send_local_request`
71
  - `compare_identities`
72
+ - `submit_diagnosis`
73
  - `patch_file`
74
  - `run_visible_tests`
75
  - `submit_fix`
 
77
 
78
  Tools are phase-gated:
79
 
80
+ - `discover`: inspect policy/routes/files, run safe local requests, compare identities, submit diagnosis.
81
  - `patch`: read/search, patch editable app files, run visible tests, submit final fix.
82
  - `done`: stable terminal observation only.
83
 
 
95
  "visible_tests": 0.0,
96
  "safety": 0.0,
97
  "anti_cheat": 0.0,
98
+ "terminal_total": 0.0,
99
+ "progressive": 0.0,
100
+ "step_penalty": 0.0,
101
+ "speed_bonus": 0.0,
102
+ "token_penalty": 0.0,
103
+ "behavior_penalty": 0.0,
104
+ "train_total": 0.0,
105
  "total": 0.0,
106
  }
107
  ```
108
 
109
  The verifier rewards blocking the hidden exploit while preserving legitimate owner/admin behavior and intentionally public routes. Terminal scoring requires visible checks, hidden authorization checks, a policy-oracle matrix, regression checks, public-route preservation, and patch-quality checks. It penalizes deny-all fixes, hardcoded IDs, repeated/invalid action patterns, hidden file probes, external URL attempts, and test/fixture tampering.
110
 
111
+ Training can enable dense rewards with `CYBERSECURITY_OWASP_REWARD_MODE=dense_train`.
112
+ Dense mode adds configurable progressive rewards, efficiency shaping (step and token penalties plus a speed bonus), and capped behavior penalties from `training/configs/grpo_small.yaml`; evaluation defaults to sparse terminal scoring.
113
 
114
+ ## Scenario Cache And Generation
115
+
116
+ Scenario generation is an offline/cache-prep concern. `reset(seed)` asks the `CurriculumController` for a difficulty tier and target weakness, then loads a validated executable bundle from the scenario cache when `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`. Local development defaults to `fallback`, which compiles deterministically on a cache miss.
117
+
118
+ The scenario/curriculum author is config-driven through `configs/scenario_authoring.small.json`. The default offline author model is `deepseek-ai/DeepSeek-V4-Pro` with Hugging Face provider settings, thinking mode enabled, `temperature=1.0`, and `top_p=1.0`. This model config is for scenario authoring, not the RL policy model.
119
+
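A minimal sketch of the env-override behavior, assuming only the `CYBERSECURITY_OWASP_SCENARIO_AUTHOR_MODEL` variable from `config.py`; the real loader is `load_scenario_authoring_config`, and the override value used here is made up:

```python
import os

# Defaults transcribed from the scenario-author config described above.
DEFAULTS = {"model_id": "deepseek-ai/DeepSeek-V4-Pro", "temperature": 1.0, "top_p": 1.0}


def author_settings() -> dict:
    """Return author settings with the model-id env override applied, if set."""
    settings = dict(DEFAULTS)
    override = os.getenv("CYBERSECURITY_OWASP_SCENARIO_AUTHOR_MODEL")
    if override is not None:
        settings["model_id"] = override
    return settings


# Demonstration only: force an override in-process (hypothetical model id).
os.environ["CYBERSECURITY_OWASP_SCENARIO_AUTHOR_MODEL"] = "example-org/other-author"
print(author_settings()["model_id"])  # -> example-org/other-author
```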
120
+ The cache bundle contract is:
121
+
122
+ - `scenario.json`
123
+ - `app_source/`
124
+ - `policy_graph.json`
125
+ - `visible_tests.py`
126
+ - `hidden_tests.py`
127
+ - `oracle_tests.py`
128
+ - `expected_exploit_trace.json`
129
+ - `reward_config.json`
130
+ - `metadata.json`
131
+
132
+ Cache keys include difficulty, authorization bug type, app family, framework, policy shape, tenant model, exploit depth, patch scope, regression risk, generator version, verifier version, and scenario hash.
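An illustrative derivation of a cache key from the fields listed above; the actual key layout in `server/scenario_cache.py` may differ, and the field values below are hypothetical:

```python
import hashlib
import json

# Key fields transcribed from the cache-key description above.
KEY_FIELDS = [
    "difficulty", "bug_type", "app_family", "framework", "policy_shape",
    "tenant_model", "exploit_depth", "patch_scope", "regression_risk",
    "generator_version", "verifier_version",
]


def scenario_cache_key(fields: dict) -> str:
    """Derive a deterministic key; raises KeyError if a required field is missing."""
    payload = {name: fields[name] for name in KEY_FIELDS}
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return f'{payload["difficulty"]}/{payload["bug_type"]}/{digest[:12]}'


fields = {
    "difficulty": "D1", "bug_type": "bola", "app_family": "invoices",
    "framework": "fastapi", "policy_shape": "rbac", "tenant_model": "multi",
    "exploit_depth": "1", "patch_scope": "route", "regression_risk": "low",
    "generator_version": "scenario_generator_v1", "verifier_version": "verifier_v1",
}
print(scenario_cache_key(fields))
```

Sorting the JSON keys before hashing makes the key stable across field ordering, so generator/verifier version bumps are the only way an existing scenario gets a new key.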
133
+
134
+ The MVP compiler currently generates:
135
 
136
  - invoices domain policy graph;
137
  - bounded adversarial target metadata such as same-role cross-object access, cross-tenant access, public-route overlocking traps, alternate route/service reachability, or visible-test-only edge cases;
 
147
  The OpenEnv runtime is split into small server modules:
148
 
149
  - `server/curriculum.py` tracks mastery, weak spots, reward trend, and difficulty tier.
150
+ - `server/scenario_cache.py` writes and loads validated executable scenario bundles.
151
  - `server/adversarial_designer.py` chooses safe synthetic scenario targets from tracked weaknesses.
152
+ - `server/scenario_factory.py` compiles the generated app during cache prep or local fallback.
153
  - `server/app_sandbox.py` handles editable workspace reads, patches, local requests, and OpenAPI summaries.
154
  - `server/action_tools.py` dispatches typed tools through the sandbox.
155
  - `server/authz_oracle.py` builds the hidden allowed/denied user-resource-action matrix.
 
183
  Use the Modal launchers in `scripts/modal_train_grpo.py` (persistent) and
184
  `scripts/modal_ephemeral_train.py` (smoke) for real GRPO runs.
185
 
186
+ Modal smoke and GRPO runs use `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` and mount the persistent `CyberSecurity_OWASP-scenario-cache` volume. Prepare that cache before smoke/training:
187
+
188
+ ```bash
189
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
190
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
191
+ ```
192
+
193
+ If the cache slice is missing or below the configured per-bucket minimum, Modal training fails before rollouts rather than compiling scenarios during the run.
194
+ The persistent GRPO launcher runs a CPU-only scenario-cache preflight before it starts the L4 GPU function, so missing cache coverage fails before GPU allocation.
195
+
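The per-bucket preflight described above can be sketched with the split minimums from `config.py`'s `CurriculumCacheConfig` (defaults: 25 train / 10 validation / 10 hidden_eval per bucket); `preflight` and the counts dict are illustrative, not the real launcher code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurriculumCacheConfig:
    # Defaults copied from config.py.
    train_scenarios_per_bucket: int = 25
    validation_scenarios_per_bucket: int = 10
    heldout_eval_scenarios_per_bucket: int = 10

    def minimum_for_split(self, split: str) -> int:
        if split == "hidden_eval":
            return self.heldout_eval_scenarios_per_bucket
        if split == "validation":
            return self.validation_scenarios_per_bucket
        return self.train_scenarios_per_bucket


def preflight(counts: dict, cfg: CurriculumCacheConfig) -> list:
    """Return split/bucket pairs below the configured minimum (empty list = pass)."""
    failures = []
    for split, buckets in counts.items():
        minimum = cfg.minimum_for_split(split)
        for bucket, count in buckets.items():
            if count < minimum:
                failures.append(f"{split}/{bucket}: {count} < {minimum}")
    return failures


print(preflight({"train": {"D0": 25, "D1": 12}}, CurriculumCacheConfig()))
# -> ['train/D1: 12 < 25']
```

A non-empty result is what lets the launcher abort on CPU before the L4 GPU function ever starts.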
196
  ## Trackio Run Tracking
197
 
198
  Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.
 
224
  Run a temporary Modal app for a cheap environment/training smoke check:
225
 
226
  ```bash
227
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
228
  uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode smoke --episodes 4
229
  ```
230
 
 
259
  Run the default smoke GRPO job:
260
 
261
  ```bash
262
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
263
  uv run --extra modal modal run scripts/modal_train_grpo.py \
264
  --max-steps 10 \
265
  --dataset-size 16 \
266
+ --num-generations 6 \
267
  --difficulty 0
268
  ```
269
 
 
277
  --repo-branch master \
278
  --max-steps 10 \
279
  --dataset-size 16 \
280
+ --num-generations 6 \
281
  --difficulty 0
282
  ```
283
 
 
285
 
286
  - Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
287
  - Trackio project: `CyberSecurity_OWASP-grpo`
288
+ - Training model: `unsloth/gemma-4-E2B-it`
289
+ - Output repo: `<hf-user>/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-grpo-lora`
290
 
291
  Override these with `--trackio-space-id`, `--trackio-project`, and
292
+ `--output-repo-id` when needed. The persistent GRPO launcher intentionally rejects non-Gemma model overrides so smoke runs match the Unsloth Gemma 4 E2B RL notebook.
293
 
294
  ## Docker / Spaces
295
 
assets/architecture_diagram.mmd CHANGED
@@ -1,56 +1,63 @@
1
- flowchart TB
2
- subgraph Factory["Scenario + Curriculum Factory"]
3
- Policy["Policy graph generator\nusers, roles, tenants, ownership"]
4
- Curriculum["Curriculum controller\nmastery, weak spots, difficulty tier"]
5
- Designer["Bounded adversarial designer\nsafe local scenario targets"]
6
- Templates["Template renderer\nFastAPI routes, services, auth helpers"]
7
- Mutator["A01 bug mutator\nIDOR, tenant, role, public-route traps"]
8
- Compiler["ScenarioSpec + oracle\nvisible hints + hidden policy tuples"]
9
- Policy --> Designer
10
- Curriculum --> Designer
11
- Designer --> Templates
12
- Templates --> Mutator
13
- Mutator --> Compiler
 
 
14
  end
15
 
16
- subgraph Runtime["CyberSecurity_OWASP OpenEnv Runtime"]
17
- Reset["reset(seed, difficulty)\nselect curriculum profile"]
18
- State["Episode state store\nphase, history, metrics, weakness, patch diff"]
19
- Tools["Typed action tools\ninspect, request, patch, visible tests"]
20
- Sandbox["Ephemeral app sandbox\ncode workspace + fixture DB + local API model"]
21
- Verifier["Multi-layer verifier\nvisible, hidden, oracle, regression"]
22
- Reward["Deterministic reward engine\ncomponents + penalties"]
23
- Logger["Episode artifact logger\nJSONL transcript + verifier + diff"]
24
- App["FastAPI / WebSocket server\n/ws, /reset, /step, /state, /web"]
25
- Reset --> State
26
- State --> Tools
27
- Tools <--> Sandbox
28
- Sandbox --> Verifier
29
- Verifier --> Reward
30
- Reward --> State
31
- State --> Logger
32
- Logger --> Curriculum
33
- State --> App
34
  end
35
 
36
- subgraph Agent["Single LLM Agent"]
37
- Obs["Observation parser"]
38
- Reason["AuthZ + code reasoning"]
39
- Act["Discover -> Diagnose -> Patch -> Test\none JSON action at a time"]
40
- Obs --> Reason --> Act
 
41
  end
42
 
43
- subgraph Ops["Training / Evaluation / Demo"]
44
- Rollout["Parallel rollout loop\nreset -> step* -> terminal reward"]
45
- GRPO["TRL GRPO + LoRA"]
46
- Trackio["Trackio reward curves\npass rates and failure modes"]
47
- Eval["Held-out family eval\nbase vs trained model"]
48
- Artifacts["Demo artifacts\nbefore/after traces + JSONL"]
49
- Rollout --> GRPO --> Trackio --> Eval --> Artifacts
50
  end
51
 
52
- Compiler --> Reset
53
- App --> Obs
54
- Act --> App
55
- Reward --> Rollout
56
- GRPO --> Agent
 
 
 
1
+ %%{init: {"theme": "base", "themeVariables": {"fontFamily": "Arial, Helvetica, sans-serif", "primaryTextColor": "#111827", "lineColor": "#0f172a", "clusterBkg": "#ffffff", "clusterBorder": "#cbd5e1"}, "flowchart": {"htmlLabels": false, "curve": "basis", "nodeSpacing": 60, "rankSpacing": 80, "padding": 24}}}%%
2
+ flowchart LR
3
+ classDef factory fill:#eff6ff,stroke:#2563eb,stroke-width:2px,color:#111827;
4
+ classDef runtime fill:#ecfdf5,stroke:#059669,stroke-width:2px,color:#111827;
5
+ classDef agent fill:#fff7ed,stroke:#ea580c,stroke-width:2px,color:#111827;
6
+ classDef training fill:#f5f3ff,stroke:#7c3aed,stroke-width:2px,color:#111827;
7
+ classDef feedback fill:#f1f5f9,stroke:#64748b,stroke-width:2px,color:#111827;
8
+
9
+ subgraph Factory["Scenario factory\noffline authoring"]
10
+ direction TB
11
+ F1["LLM author\nconfig-driven drafts"] --> F2["ScenarioSpec\npolicy + bug target"]
12
+ F2 --> F3["A01 mutator\nFastAPI variants + traps"]
13
+ F3 --> F4["Compiler\nexecutable app bundle"]
14
+ F4 --> F5["Verifier\nvisible + hidden tests"]
15
+ F5 --> F6["Versioned cache\nsplit + difficulty + hash"]
16
+ end
17
+
18
+ subgraph Runtime["OpenEnv runtime\ncache-backed episodes"]
19
+ direction TB
20
+ R1["reset(seed)\nload cached bundle"] --> R2["Curriculum sampler\nvalidated slice"]
21
+ R2 --> R3["Episode state\nphase + history + diff"]
22
+ R3 --> R4["Typed tools\ninspect, request, patch"]
23
+ R4 --> R5["App sandbox\ncloned workspace"]
24
+ R5 --> R6["Verifier\nsecurity + regression"]
25
+ R6 --> R7["Reward engine\ncomponents + penalties"]
26
+ R7 --> R3
27
+ R3 --> R8["API + logger\n/ws, /step, artifacts"]
28
  end
29
 
30
+ subgraph Agent["Single LLM agent"]
31
+ direction TB
32
+ A1["Parse observation"] --> A2["Reason over\npolicy + code"]
33
+ A2 --> A3["Emit one\nJSON action"]
 
 
34
  end
35
 
36
+ subgraph Training["Training, eval, demo"]
37
+ direction TB
38
+ T1["Parallel rollouts\nfast cached reset"] --> T2["TRL GRPO + LoRA"]
39
+ T2 --> T3["Trackio metrics\nreward + pass rates"]
40
+ T3 --> T4["Held-out eval\nbaseline vs trained"]
41
+ T4 --> T5["Demo artifacts\nrollouts + summaries"]
42
  end
43
 
44
+ subgraph Feedback["Feedback loop"]
45
+ direction LR
46
+ B1["Episode logs"] --> B2["Failure analysis"]
47
+ B2 --> B3["Sampling weights\nand new jobs"]
 
 
 
48
  end
49
 
50
+ F6 == cached bundle ==> R1
51
+ R8 -- observation --> A1
52
+ A3 -- JSON action --> R4
53
+ R7 -- terminal reward --> T1
54
+ T2 -. adapter checkpoint .-> A2
55
+ R8 -- episode logs --> B1
56
+ B3 -. cache refresh .-> F1
57
+
58
+ class F1,F2,F3,F4,F5,F6 factory;
59
+ class R1,R2,R3,R4,R5,R6,R7,R8 runtime;
60
+ class A1,A2,A3 agent;
61
+ class T1,T2,T3,T4,T5 training;
62
+ class B1,B2,B3 feedback;
63
+ linkStyle default stroke:#0f172a,stroke-width:2px;
assets/architecture_diagram.svg CHANGED
assets/env_rl_training_flow_diagram.mmd CHANGED
@@ -1,26 +1,44 @@
 
1
  flowchart TD
2
- Start["Start run\nselect base model + config"] --> Cache["Prepare scenario splits\ntrain, validation, hidden_eval"]
3
- Cache --> Baseline["Baseline evaluation\nscripted/model rollouts"]
 
 
4
  Baseline --> TrainLoop["GRPO training loop"]
5
 
6
- subgraph Episode["One OpenEnv Episode"]
7
- Reset["env.reset(seed)\nnew generated app + policy"] --> Observe["Observation\nphase, hints, available tools"]
8
- Observe --> Prompt["Build action prompt\nJSON action only"]
9
- Prompt --> Generate["LLM generates action"]
10
- Generate --> Step["env.step(action)\nphase gate + execute tool"]
11
- Step --> Intermediate{"done?"}
12
- Intermediate -- "no" --> Observe
13
- Intermediate -- "yes" --> Final["Terminal verifier\nhidden security + regression + anti-cheat"]
 
 
14
  end
15
 
16
  TrainLoop --> Reset
17
- Final --> Rewards["Reward components\ndiscovery, security, regression, public_routes,\npatch_quality, visible_tests, safety, anti_cheat"]
18
- Rewards --> Update["GRPO update\nLoRA adapter checkpoint"]
19
- Update --> Metrics["Trackio logging\nreward means, pass rates, invalid actions, latency"]
20
- Metrics --> Validate{"Validation plateau\nor failure cluster?"}
21
- Validate -- "continue" --> TrainLoop
22
- Validate -- "adjust curriculum" --> Curriculum["Curriculum controller\nrebalance difficulty and traps"]
23
  Curriculum --> TrainLoop
24
- Validate -- "final checkpoint" --> Heldout["Held-out eval\nunseen seeds/layouts/domain combos"]
25
- Heldout --> Compare["Before/after summary\nsuccess, reward, exploit-block, regression preservation"]
 
 
26
  Compare --> Artifacts["Saved artifacts\noutputs/evals + outputs/rollouts"]
 
 
1
+ %%{init: {"theme": "base", "themeVariables": {"fontFamily": "Arial, Helvetica, sans-serif", "primaryTextColor": "#111827", "lineColor": "#0f172a", "clusterBkg": "#ffffff", "clusterBorder": "#cbd5e1"}, "flowchart": {"htmlLabels": false, "curve": "basis", "nodeSpacing": 58, "rankSpacing": 70, "padding": 24}}}%%
2
  flowchart TD
3
+ classDef setup fill:#eff6ff,stroke:#2563eb,stroke-width:2px,color:#111827;
4
+ classDef episode fill:#ecfdf5,stroke:#059669,stroke-width:2px,color:#111827;
5
+ classDef train fill:#f5f3ff,stroke:#7c3aed,stroke-width:2px,color:#111827;
6
+ classDef adapt fill:#fff7ed,stroke:#ea580c,stroke-width:2px,color:#111827;
7
+ classDef artifact fill:#f1f5f9,stroke:#64748b,stroke-width:2px,color:#111827;
8
+
9
+ Start["Start run\nbase model + config"] --> Cache["Prepare cache\ntrain / validation / hidden_eval"]
10
+ Cache --> Require["Modal cache mode\nrequire"]
11
+ Require --> Baseline["Baseline eval\nscripted or model rollouts"]
12
  Baseline --> TrainLoop["GRPO training loop"]
13
 
14
+ subgraph Episode["One OpenEnv episode"]
15
+ direction TB
16
+ Reset["reset(seed)\nload cached app + policy"] --> Observe["Observation\nphase, hints, tools"]
17
+ Observe --> Prompt["Build prompt\nJSON action only"]
18
+ Prompt --> Generate["Model generates\none action"]
19
+ Generate --> Step["step(action)\nphase gate + tool"]
20
+ Step --> Done{"done?"}
21
+ Done -- no --> Observe
22
+ Done -- yes --> Verify["Terminal verifier\nsecurity + regression + anti-cheat"]
23
+ Verify --> Rewards["Reward components\ndiscovery, security, regression, safety"]
24
  end
25
 
26
  TrainLoop --> Reset
27
+ Rewards --> Update["GRPO update\nLoRA checkpoint"]
28
+ Update --> Metrics["Trackio logging\nrewards, pass rates, latency"]
29
+ Metrics --> Decision{"next step?"}
30
+ Decision -- continue --> TrainLoop
31
+ Decision -- rebalance --> Curriculum["Curriculum controller\nsampling weights"]
 
32
  Curriculum --> TrainLoop
33
+ Decision -- weak spot --> Refresh["Async cache refresh\nnew validated bundles"]
34
+ Refresh --> Cache
35
+ Decision -- final --> Heldout["Held-out eval\nunseen seeds and layouts"]
36
+ Heldout --> Compare["Before/after summary\nsuccess + reward lift"]
37
  Compare --> Artifacts["Saved artifacts\noutputs/evals + outputs/rollouts"]
38
+
39
+ class Start,Cache,Require,Baseline setup;
40
+ class Reset,Observe,Prompt,Generate,Step,Done,Verify,Rewards episode;
41
+ class TrainLoop,Update,Metrics,Heldout,Compare train;
42
+ class Decision,Curriculum,Refresh adapt;
43
+ class Artifacts artifact;
44
+ linkStyle default stroke:#0f172a,stroke-width:2px;
assets/env_rl_training_flow_diagram.svg CHANGED
config.py ADDED
@@ -0,0 +1,180 @@
1
+ """Configuration for scenario authoring, curriculum, and cache-backed reset."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ import os
7
+ from dataclasses import dataclass, field
8
+ from pathlib import Path
9
+ from typing import Any, Literal
10
+
11
+
12
+ ScenarioCacheMode = Literal["fallback", "require", "disabled"]
13
+
14
+
15
+ DEFAULT_SCENARIO_CONFIG_PATH = (
16
+ Path(__file__).resolve().parent / "configs" / "scenario_authoring.small.json"
17
+ )
18
+
19
+
20
+ @dataclass(frozen=True)
21
+ class ScenarioAuthorConfig:
22
+ provider: str = "huggingface"
23
+ model_id: str = "deepseek-ai/DeepSeek-V4-Pro"
24
+ thinking_mode: str = "thinking"
25
+ reasoning_effort: str = "high"
26
+ temperature: float = 1.0
27
+ top_p: float = 1.0
28
+ max_context_tokens: int = 131072
29
+
30
+
31
+ @dataclass(frozen=True)
32
+ class CurriculumCacheConfig:
33
+ difficulty_bucket_count: int = 4
34
+ difficulty_labels: list[str] = field(default_factory=lambda: ["D0", "D1", "D2", "D3"])
35
+ train_scenarios_per_bucket: int = 25
36
+ validation_scenarios_per_bucket: int = 10
37
+ heldout_eval_scenarios_per_bucket: int = 10
38
+ target_cache_hit_rate: float = 0.95
39
+ target_reset_latency_ms: int = 200
40
+ scenario_refresh_rate_per_epoch: float = 0.05
41
+ difficulty_calibration_strategy: str = "baseline_agent_pass_rate"
42
+ pass_rate_thresholds: dict[str, tuple[float, float]] = field(
43
+ default_factory=lambda: {
44
+ "D0": (0.8, 1.0),
45
+ "D1": (0.6, 0.8),
46
+ "D2": (0.4, 0.6),
47
+ "D3": (0.2, 0.4),
48
+ }
49
+ )
50
+
51
+ def minimum_for_split(self, split: str) -> int:
52
+ if split == "hidden_eval":
53
+ return self.heldout_eval_scenarios_per_bucket
54
+ if split == "validation":
55
+ return self.validation_scenarios_per_bucket
56
+ return self.train_scenarios_per_bucket
57
+
58
+
59
+ @dataclass(frozen=True)
60
+ class ScenarioRuntimeConfig:
61
+ cache_mode: ScenarioCacheMode = "fallback"
62
+ cache_dir: str = "scenario_cache"
63
+ generator_version: str = "scenario_generator_v1"
64
+ verifier_version: str = "verifier_v1"
65
+
66
+
67
+ @dataclass(frozen=True)
68
+ class ScenarioAuthoringSettings:
69
+    scenario_author: ScenarioAuthorConfig = field(default_factory=ScenarioAuthorConfig)
+    curriculum: CurriculumCacheConfig = field(default_factory=CurriculumCacheConfig)
+    runtime: ScenarioRuntimeConfig = field(default_factory=ScenarioRuntimeConfig)
+    source_path: str = ""
+
+
+def load_scenario_authoring_config(path: str | Path | None = None) -> ScenarioAuthoringSettings:
+    """Load and validate the small scenario-authoring config with env overrides."""
+
+    configured_path = Path(
+        path
+        or os.getenv("CYBERSECURITY_OWASP_SCENARIO_CONFIG", "")
+        or DEFAULT_SCENARIO_CONFIG_PATH
+    )
+    raw = json.loads(configured_path.read_text(encoding="utf-8"))
+    raw = _apply_env_overrides(raw)
+    settings = ScenarioAuthoringSettings(
+        scenario_author=ScenarioAuthorConfig(**raw.get("scenario_author", {})),
+        curriculum=_curriculum_from_raw(raw.get("curriculum", {})),
+        runtime=ScenarioRuntimeConfig(**raw.get("runtime", {})),
+        source_path=str(configured_path),
+    )
+    _validate_settings(settings)
+    return settings
+
+
+def _apply_env_overrides(raw: dict[str, Any]) -> dict[str, Any]:
+    data = json.loads(json.dumps(raw))
+    author = data.setdefault("scenario_author", {})
+    curriculum = data.setdefault("curriculum", {})
+    runtime = data.setdefault("runtime", {})
+
+    _set_if_present(author, "model_id", "CYBERSECURITY_OWASP_SCENARIO_AUTHOR_MODEL")
+    _set_if_present(author, "provider", "CYBERSECURITY_OWASP_SCENARIO_AUTHOR_PROVIDER")
+    _set_if_present(author, "thinking_mode", "CYBERSECURITY_OWASP_SCENARIO_THINKING_MODE")
+    _set_if_present(author, "reasoning_effort", "CYBERSECURITY_OWASP_SCENARIO_REASONING_EFFORT")
+    _set_if_present(author, "temperature", "CYBERSECURITY_OWASP_SCENARIO_TEMPERATURE", float)
+    _set_if_present(author, "top_p", "CYBERSECURITY_OWASP_SCENARIO_TOP_P", float)
+    _set_if_present(author, "max_context_tokens", "CYBERSECURITY_OWASP_SCENARIO_MAX_CONTEXT", int)
+
+    _set_if_present(curriculum, "difficulty_bucket_count", "CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS", int)
+    _set_if_present(curriculum, "train_scenarios_per_bucket", "CYBERSECURITY_OWASP_TRAIN_SCENARIOS_PER_BUCKET", int)
+    _set_if_present(curriculum, "validation_scenarios_per_bucket", "CYBERSECURITY_OWASP_VALIDATION_SCENARIOS_PER_BUCKET", int)
+    _set_if_present(curriculum, "heldout_eval_scenarios_per_bucket", "CYBERSECURITY_OWASP_HELDOUT_SCENARIOS_PER_BUCKET", int)
+    _set_if_present(curriculum, "target_cache_hit_rate", "CYBERSECURITY_OWASP_TARGET_CACHE_HIT_RATE", float)
+    _set_if_present(curriculum, "target_reset_latency_ms", "CYBERSECURITY_OWASP_TARGET_RESET_LATENCY_MS", int)
+    _set_if_present(curriculum, "scenario_refresh_rate_per_epoch", "CYBERSECURITY_OWASP_SCENARIO_REFRESH_RATE", float)
+    _set_if_present(curriculum, "difficulty_calibration_strategy", "CYBERSECURITY_OWASP_DIFFICULTY_CALIBRATION")
+
+    _set_if_present(runtime, "cache_dir", "CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR")
+    _set_if_present(runtime, "cache_mode", "CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE")
+    _set_if_present(runtime, "generator_version", "CYBERSECURITY_OWASP_SCENARIO_GENERATOR_VERSION")
+    _set_if_present(runtime, "verifier_version", "CYBERSECURITY_OWASP_SCENARIO_VERIFIER_VERSION")
+    return data
+
+
+def _set_if_present(
+    target: dict[str, Any],
+    key: str,
+    env_name: str,
+    caster: type | None = None,
+) -> None:
+    value = os.getenv(env_name)
+    if value is None:
+        return
+    target[key] = caster(value) if caster else value
+
+
+def _curriculum_from_raw(raw: dict[str, Any]) -> CurriculumCacheConfig:
+    values = dict(raw)
+    bucket_count = int(values.get("difficulty_bucket_count", 4))
+    labels = list(values.get("difficulty_labels") or [])
+    if len(labels) < bucket_count:
+        labels.extend(f"D{index}" for index in range(len(labels), bucket_count))
+    values["difficulty_labels"] = labels
+    thresholds = values.get("pass_rate_thresholds") or {}
+    values["pass_rate_thresholds"] = {
+        str(key): tuple(float(item) for item in value)
+        for key, value in thresholds.items()
+    }
+    return CurriculumCacheConfig(**values)
+
+
+def _validate_settings(settings: ScenarioAuthoringSettings) -> None:
+    author = settings.scenario_author
+    curriculum = settings.curriculum
+    runtime = settings.runtime
+
+    if not author.model_id:
+        raise ValueError("scenario_author.model_id is required")
+    if author.temperature <= 0.0 or author.top_p <= 0.0:
+        raise ValueError("scenario author sampling values must be positive")
+    if author.max_context_tokens < 4096:
+        raise ValueError("scenario author max_context_tokens is too small")
+    if curriculum.difficulty_bucket_count <= 0:
+        raise ValueError("difficulty_bucket_count must be positive")
+    if len(curriculum.difficulty_labels) < curriculum.difficulty_bucket_count:
+        raise ValueError("difficulty_labels must cover every configured bucket")
+    for attr in (
+        "train_scenarios_per_bucket",
+        "validation_scenarios_per_bucket",
+        "heldout_eval_scenarios_per_bucket",
+        "target_reset_latency_ms",
+    ):
+        if int(getattr(curriculum, attr)) <= 0:
+            raise ValueError(f"{attr} must be positive")
+    if not 0.0 < curriculum.target_cache_hit_rate <= 1.0:
+        raise ValueError("target_cache_hit_rate must be in (0, 1]")
+    if not 0.0 <= curriculum.scenario_refresh_rate_per_epoch <= 1.0:
+        raise ValueError("scenario_refresh_rate_per_epoch must be in [0, 1]")
+    if runtime.cache_mode not in {"fallback", "require", "disabled"}:
+        raise ValueError("runtime.cache_mode must be fallback, require, or disabled")
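The environment-override helper above follows a simple precedence rule: a set environment variable wins over the JSON value, and an optional caster coerces the string. A minimal standalone sketch of that pattern (the dict and env var here are illustrative, mirroring the `_set_if_present` helper in the diff):

```python
import os


def set_if_present(target: dict, key: str, env_name: str, caster=None) -> None:
    # Copy an env var into the config dict, casting it when a caster
    # (e.g. int or float) is supplied; leave the dict untouched otherwise.
    value = os.getenv(env_name)
    if value is None:
        return
    target[key] = caster(value) if caster else value


curriculum = {"difficulty_bucket_count": 4}
os.environ["CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS"] = "6"
set_if_present(
    curriculum, "difficulty_bucket_count", "CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS", int
)
print(curriculum["difficulty_bucket_count"])  # → 6
```

Because the caster runs on the raw string, a malformed env value fails fast with a `ValueError` at load time rather than deep inside training.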
configs/scenario_authoring.small.json ADDED
@@ -0,0 +1,34 @@
+{
+  "scenario_author": {
+    "provider": "huggingface",
+    "model_id": "deepseek-ai/DeepSeek-V4-Pro",
+    "thinking_mode": "thinking",
+    "reasoning_effort": "high",
+    "temperature": 1.0,
+    "top_p": 1.0,
+    "max_context_tokens": 131072
+  },
+  "curriculum": {
+    "difficulty_bucket_count": 4,
+    "difficulty_labels": ["D0", "D1", "D2", "D3"],
+    "train_scenarios_per_bucket": 25,
+    "validation_scenarios_per_bucket": 10,
+    "heldout_eval_scenarios_per_bucket": 10,
+    "target_cache_hit_rate": 0.95,
+    "target_reset_latency_ms": 200,
+    "scenario_refresh_rate_per_epoch": 0.05,
+    "difficulty_calibration_strategy": "baseline_agent_pass_rate",
+    "pass_rate_thresholds": {
+      "D0": [0.8, 1.0],
+      "D1": [0.6, 0.8],
+      "D2": [0.4, 0.6],
+      "D3": [0.2, 0.4]
+    }
+  },
+  "runtime": {
+    "cache_mode": "fallback",
+    "cache_dir": "scenario_cache",
+    "generator_version": "scenario_generator_v1",
+    "verifier_version": "verifier_v1"
+  }
+}
evals.py CHANGED
@@ -51,11 +51,13 @@ def oracle_policy(original_source: str) -> list[CyberSecurityOWASPAction]:
             arguments={"method": "GET", "path": "__EXPLOIT_PATH__", "user_id": "__EXPLOIT_USER__"},
         ),
         CyberSecurityOWASPAction(
-            tool_name="submit_finding",
+            tool_name="submit_diagnosis",
             arguments={
-                "summary": "BOLA/IDOR authorization bug: same-tenant user can read another user's invoice.",
-                "evidence": "__EVIDENCE__",
-                "policy_rule": "Only the owner or billing_admin in the same tenant may read invoices.",
+                "bug_class": "idor_ownership_bug",
+                "route": "GET /invoices/{invoice_id}",
+                "violated_policy_rule": "Only the owner or billing_admin in the same tenant may read invoices.",
+                "evidence_trace_ids": ["req_001"],
+                "fix_plan": "Add tenant and owner/admin checks before returning invoice data.",
             },
         ),
        CyberSecurityOWASPAction(
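Elsewhere in this commit, the rewards module scores a `submit_diagnosis` payload like the one above field-by-field: its `_diagnosis_score` helper awards 0.25 per verified field (route, bug class, policy rule, local evidence). A self-contained sketch mirroring that scoring logic:

```python
def diagnosis_score(diagnosis: dict) -> float:
    # 0.25 points per verified diagnosis field, mirroring the
    # _diagnosis_score helper added to rewards.py in this commit.
    keys = (
        "correct_route",
        "correct_bug_class",
        "correct_policy_rule",
        "valid_local_evidence",
    )
    return sum(0.25 for key in keys if diagnosis.get(key))


print(diagnosis_score({"correct_route": True, "correct_bug_class": True}))  # → 0.5
```

A fully correct diagnosis therefore earns the full 1.0 discovery reward, and partial credit degrades linearly rather than all-or-nothing as with the old `submit_finding` check.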
models.py CHANGED
@@ -21,7 +21,7 @@ class CyberSecurityOWASPAction(Action):
         "search_code",
         "send_local_request",
         "compare_identities",
-        "submit_finding",
+        "submit_diagnosis",
         "patch_file",
         "run_visible_tests",
         "submit_fix",
@@ -62,17 +62,31 @@ class CyberSecurityOWASPState(State):
     scenario_family: str = ""
     template_id: str = "fastapi_basic"
     target_weakness: str = "same_role_cross_object"
+    cache_key: dict[str, Any] = Field(default_factory=dict)
+    scenario_hash: str = ""
+    generator_version: str = ""
+    verifier_version: str = ""
+    cache_hit: bool = False
+    reset_latency_ms: float = 0.0
     phase: CyberSecurityOWASPPhase = "discover"
     max_steps: int = 40
     done: bool = False
     success: bool = False
     failure_reason: str | None = None
     finding_submitted: bool = False
+    diagnosis_submitted: bool = False
     patch_submitted: bool = False
     accumulated_reward: float = 0.0
     last_reward: float = 0.0
     action_history: list[dict[str, Any]] = Field(default_factory=list)
     reward_history: list[dict[str, float]] = Field(default_factory=list)
+    progress_flags: dict[str, bool] = Field(default_factory=dict)
+    progress_reward_total: float = 0.0
+    diagnosis: dict[str, Any] = Field(default_factory=dict)
+    request_trace: list[dict[str, Any]] = Field(default_factory=list)
+    patch_attempt_count: int = 0
+    visible_test_count: int = 0
+    completion_tokens: int = 0
     visible_facts: dict[str, Any] = Field(default_factory=dict)
     hidden_facts: dict[str, Any] = Field(default_factory=dict)
     curriculum_snapshot: dict[str, Any] = Field(default_factory=dict)
pyproject.toml CHANGED
@@ -19,6 +19,7 @@ dependencies = [
     # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
     "openenv-core[core]>=0.2.2",
     "trackio>=0.22.0",
+    "PyYAML>=6.0",
     # Environment-specific dependencies
     # Add all dependencies needed for your environment here
     # Examples:
@@ -48,6 +49,10 @@
 packages = ["CyberSecurity_OWASP", "CyberSecurity_OWASP.server", "training"]
 package-dir = { "CyberSecurity_OWASP" = ".", "CyberSecurity_OWASP.server" = "server" }
 
+[tool.setuptools.package-data]
+CyberSecurity_OWASP = ["configs/*.json"]
+training = ["configs/*.yaml"]
+
 [tool.pytest.ini_options]
 testpaths = ["tests"]
 norecursedirs = [
reward_config.py ADDED
@@ -0,0 +1,119 @@
+"""Configurable reward shaping settings for CyberSecurity_OWASP."""
+
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import yaml
+
+
+DEFAULT_GRPO_CONFIG_PATH = (
+    Path(__file__).resolve().parent / "training" / "configs" / "grpo_small.yaml"
+)
+REWARD_MODES = {"dense_train", "sparse_eval"}
+REWARD_STAGES = {"early", "middle", "late", "final"}
+
+
+@dataclass(frozen=True)
+class RewardSettings:
+    """Loaded reward settings with stage-aware helpers."""
+
+    mode: str
+    training_mode: str
+    stage: str
+    raw: dict[str, Any]
+    source_path: str
+
+    @property
+    def dense_train(self) -> bool:
+        return self.mode == "dense_train"
+
+    @property
+    def shaping_weight(self) -> float:
+        override = os.getenv("CYBERSECURITY_OWASP_SHAPING_WEIGHT")
+        if override is not None:
+            return float(override)
+        return self.value("shaping_weight", 0.0)
+
+    def entry(self, name: str) -> dict[str, Any]:
+        value = self.raw.get(name, {})
+        return value if isinstance(value, dict) else {}
+
+    def value(self, name: str, default: float = 0.0) -> float:
+        entry = self.entry(name)
+        if self.stage in entry:
+            return float(entry[self.stage])
+        if "value" in entry:
+            return float(entry["value"])
+        return float(default)
+
+    def cap(self, name: str, default: float | None = None) -> float | None:
+        entry = self.entry(name)
+        if "cap" not in entry:
+            return default
+        return float(entry["cap"])
+
+    def int_value(self, name: str, key: str, default: int) -> int:
+        entry = self.entry(name)
+        return int(entry.get(key, default))
+
+    def terminate(self, name: str) -> bool:
+        return bool(self.entry(name).get("terminate", False))
+
+
+def load_reward_settings(path: str | Path | None = None) -> RewardSettings:
+    """Load reward settings from the GRPO YAML config with env overrides."""
+
+    configured_path = Path(
+        path
+        or os.getenv("CYBERSECURITY_OWASP_REWARD_CONFIG", "")
+        or DEFAULT_GRPO_CONFIG_PATH
+    )
+    raw = yaml.safe_load(configured_path.read_text(encoding="utf-8")) or {}
+    reward = dict(raw.get("reward") or {})
+    mode = os.getenv("CYBERSECURITY_OWASP_REWARD_MODE", str(reward.get("mode", "sparse_eval")))
+    training_mode = str(reward.get("training_mode", "dense_train"))
+    stage = os.getenv("CYBERSECURITY_OWASP_REWARD_STAGE", str(reward.get("stage", "early")))
+    settings = RewardSettings(
+        mode=mode,
+        training_mode=training_mode,
+        stage=stage,
+        raw=reward,
+        source_path=str(configured_path),
+    )
+    validate_reward_settings(settings)
+    return settings
+
+
+def validate_reward_settings(settings: RewardSettings) -> None:
+    if settings.mode not in REWARD_MODES:
+        raise ValueError("reward.mode must be dense_train or sparse_eval")
+    if settings.training_mode not in REWARD_MODES:
+        raise ValueError("reward.training_mode must be dense_train or sparse_eval")
+    if settings.stage not in REWARD_STAGES:
+        raise ValueError("reward.stage must be early, middle, late, or final")
+
+    for key, value in settings.raw.items():
+        if not isinstance(value, dict):
+            continue
+        if not str(value.get("description", "")).strip():
+            raise ValueError(f"reward.{key}.description is required")
+
+
+def compute_token_penalty(
+    completion_tokens: int,
+    settings: RewardSettings | None = None,
+) -> float:
+    """Return the trainer-side token penalty for a completion."""
+
+    settings = settings or load_reward_settings()
+    if not settings.dense_train:
+        return 0.0
+    target = settings.int_value("token_penalty", "target_tokens", 350)
+    excess = max(0, int(completion_tokens) - target)
+    penalty = settings.value("token_penalty", 0.0) * excess
+    cap = settings.cap("token_penalty", -0.5)
+    return max(penalty, cap if cap is not None else penalty)
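The token penalty above is linear in the tokens beyond a target budget, floored at a cap. The per-token rate lives in `grpo_small.yaml` and is not visible in this chunk, so the rate below (`-0.001`) is an assumed value for illustration only; the target (350) and cap (-0.5) mirror the defaults in `compute_token_penalty`:

```python
def token_penalty(completion_tokens: int, rate: float = -0.001,
                  target: int = 350, cap: float = -0.5) -> float:
    # Tokens beyond `target` accrue the (negative) per-token rate,
    # floored at `cap`; under-budget completions pay nothing.
    excess = max(0, completion_tokens - target)
    if excess == 0:
        return 0.0
    return max(rate * excess, cap)


print(token_penalty(300))   # → 0.0 (under the target)
print(token_penalty(2000))  # → -0.5 (floored at the cap)
```

With the assumed rate, the cap engages at 850 tokens, so the penalty can nudge completions shorter without dominating the terminal reward.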
rewards.py CHANGED
@@ -4,8 +4,10 @@ from __future__ import annotations
4
 
5
  try:
6
  from .models import CyberSecurityOWASPAction, CyberSecurityOWASPState
 
7
  except ImportError: # pragma: no cover
8
  from models import CyberSecurityOWASPAction, CyberSecurityOWASPState
 
9
 
10
 
11
  REWARD_KEYS = (
@@ -17,6 +19,13 @@ REWARD_KEYS = (
17
  "visible_tests",
18
  "safety",
19
  "anti_cheat",
 
 
 
 
 
 
 
20
  "total",
21
  )
22
 
@@ -30,62 +39,404 @@ def compute_reward(
30
  action: CyberSecurityOWASPAction,
31
  verifier_result: dict,
32
  ) -> dict[str, float]:
 
33
  reward = empty_reward()
34
- if action.tool_name == "submit_finding":
35
- finding = verifier_result.get("finding", {})
36
- reward["discovery"] = (2.0 if finding.get("valid") else 0.0) + (
37
- 1.0 if finding.get("tied_to_policy") else 0.0
38
- )
39
  elif action.tool_name == "run_visible_tests":
40
  visible = verifier_result.get("visible", {})
41
  reward["visible_tests"] = 1.0 if visible.get("passed") else 0.0
42
  elif action.tool_name == "submit_fix":
43
- security = verifier_result.get("security", {})
44
- oracle = verifier_result.get("oracle_matrix", {})
45
- regression = verifier_result.get("regression", {})
46
- public_routes = verifier_result.get("public_routes", {})
47
- quality = verifier_result.get("patch_quality", {})
48
- visible = verifier_result.get("visible", {})
49
- security_passed = bool(security.get("passed") and oracle.get("passed", True))
50
- reward["security"] = 5.0 if security_passed else -6.0
51
- reward["regression"] = 3.0 if regression.get("passed") else -5.0
52
- reward["public_routes"] = 1.0 if public_routes.get("passed") else -3.0
53
- reward["patch_quality"] = 2.0 if quality.get("passed") else 0.0
54
- if quality.get("passed") and _followed_repair_phase_order(state):
55
- reward["patch_quality"] += 0.5
56
- reward["visible_tests"] = 1.0 if visible.get("passed") else -6.0
57
- if quality.get("hardcoded"):
58
- reward["anti_cheat"] -= 4.0
59
- if quality.get("deny_all"):
60
- reward["safety"] -= 5.0
61
-
62
- flags = sorted(
63
- set(verifier_result.get("anti_cheat_flags", []) or [])
64
- | set(state.anti_cheat_flags or [])
65
- )
66
- if flags:
67
- if "external_network_attempt" in flags:
68
- reward["safety"] -= 8.0
69
- if "blocked_file_probe" in flags:
70
- reward["anti_cheat"] -= 8.0
71
- if "hardcoded_hidden_identifier" in flags:
72
- reward["anti_cheat"] -= 4.0
73
 
74
  if verifier_result.get("invalid_action"):
75
- reward["safety"] -= 0.5
76
  if verifier_result.get("repeated_action"):
77
- reward["safety"] -= 0.2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
78
 
79
- total = sum(value for key, value in reward.items() if key != "total")
80
- reward["total"] = min(15.0, total) if total > 0 else total
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
81
  return reward
82
 
83
 
84
  def _followed_repair_phase_order(state: CyberSecurityOWASPState) -> bool:
85
  tools = [item.get("tool_name") for item in state.action_history]
86
- required = ["submit_finding", "patch_file", "run_visible_tests", "submit_fix"]
87
  cursor = 0
88
  for tool in tools:
89
  if cursor < len(required) and tool == required[cursor]:
90
  cursor += 1
91
  return cursor == len(required)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4
 
5
  try:
6
  from .models import CyberSecurityOWASPAction, CyberSecurityOWASPState
7
+ from .reward_config import RewardSettings, load_reward_settings
8
  except ImportError: # pragma: no cover
9
  from models import CyberSecurityOWASPAction, CyberSecurityOWASPState
10
+ from reward_config import RewardSettings, load_reward_settings
11
 
12
 
13
  REWARD_KEYS = (
 
19
  "visible_tests",
20
  "safety",
21
  "anti_cheat",
22
+ "terminal_total",
23
+ "progressive",
24
+ "step_penalty",
25
+ "speed_bonus",
26
+ "token_penalty",
27
+ "behavior_penalty",
28
+ "train_total",
29
  "total",
30
  )
31
 
 
39
  action: CyberSecurityOWASPAction,
40
  verifier_result: dict,
41
  ) -> dict[str, float]:
42
+ settings = load_reward_settings()
43
  reward = empty_reward()
44
+ if action.tool_name == "submit_diagnosis":
45
+ diagnosis = verifier_result.get("diagnosis", verifier_result.get("finding", {}))
46
+ reward["discovery"] = _diagnosis_score(diagnosis)
 
 
47
  elif action.tool_name == "run_visible_tests":
48
  visible = verifier_result.get("visible", {})
49
  reward["visible_tests"] = 1.0 if visible.get("passed") else 0.0
50
  elif action.tool_name == "submit_fix":
51
+ _add_terminal_submit_fix_reward(state, verifier_result, reward, settings)
52
+
53
+ _add_current_anti_cheat_penalties(verifier_result, reward, settings)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
54
 
55
  if verifier_result.get("invalid_action"):
56
+ reward["behavior_penalty"] += settings.value("invalid_action", -0.2)
57
  if verifier_result.get("repeated_action"):
58
+ reward["behavior_penalty"] += (
59
+ settings.value("repeated_invalid_action", -0.3)
60
+ if verifier_result.get("invalid_action")
61
+ else settings.value("repeated_low_value_action", -0.1)
62
+ )
63
+
64
+ reward["progressive"] = _compute_progressive_reward(
65
+ state,
66
+ action,
67
+ verifier_result,
68
+ settings,
69
+ )
70
+ reward["step_penalty"] = _compute_step_penalty(state, settings)
71
+ reward["speed_bonus"] = _compute_speed_bonus(state, action, verifier_result, settings)
72
+ reward["behavior_penalty"] += _compute_behavior_penalty(
73
+ state,
74
+ action,
75
+ verifier_result,
76
+ settings,
77
+ reward["progressive"],
78
+ )
79
 
80
+ terminal_total = (
81
+ _component_total(reward)
82
+ if action.tool_name == "submit_fix"
83
+ else reward["safety"] + reward["anti_cheat"]
84
+ )
85
+ reward["terminal_total"] = _cap_terminal(terminal_total, settings)
86
+ reward["train_total"] = _cap_train(
87
+ reward["terminal_total"]
88
+ + settings.shaping_weight * reward["progressive"]
89
+ + reward["step_penalty"]
90
+ + reward["speed_bonus"]
91
+ + reward["token_penalty"]
92
+ + reward["behavior_penalty"],
93
+ settings,
94
+ state,
95
+ )
96
+ reward["total"] = reward["train_total"] if settings.dense_train else reward["terminal_total"]
97
  return reward
98
 
99
 
100
  def _followed_repair_phase_order(state: CyberSecurityOWASPState) -> bool:
101
  tools = [item.get("tool_name") for item in state.action_history]
102
+ required = ["submit_diagnosis", "patch_file", "run_visible_tests", "submit_fix"]
103
  cursor = 0
104
  for tool in tools:
105
  if cursor < len(required) and tool == required[cursor]:
106
  cursor += 1
107
  return cursor == len(required)
108
+
109
+
110
+ def should_terminate_for_flags(flags: list[str], settings: RewardSettings | None = None) -> bool:
111
+ settings = settings or load_reward_settings()
112
+ mapping = {
113
+ "external_network_attempt": "external_url_attempt",
114
+ "blocked_file_probe": "hidden_file_probe",
115
+ "host_path_escape": "host_path_escape",
116
+ "test_file_edit_attempt": "test_file_edit",
117
+ }
118
+ return any(settings.terminate(mapping[flag]) for flag in flags if flag in mapping)
119
+
120
+
121
+ def _diagnosis_score(diagnosis: dict) -> float:
122
+ return sum(
123
+ 0.25
124
+ for key in (
125
+ "correct_route",
126
+ "correct_bug_class",
127
+ "correct_policy_rule",
128
+ "valid_local_evidence",
129
+ )
130
+ if diagnosis.get(key)
131
+ )
132
+
133
+
134
+ def _add_terminal_submit_fix_reward(
135
+ state: CyberSecurityOWASPState,
136
+ verifier_result: dict,
137
+ reward: dict[str, float],
138
+ settings: RewardSettings,
139
+ ) -> None:
140
+ security = verifier_result.get("security", {})
141
+ oracle = verifier_result.get("oracle_matrix", {})
142
+ regression = verifier_result.get("regression", {})
143
+ public_routes = verifier_result.get("public_routes", {})
144
+ quality = verifier_result.get("patch_quality", {})
145
+ visible = verifier_result.get("visible", {})
146
+ security_passed = bool(security.get("passed") and oracle.get("passed", True))
147
+ reward["security"] = 5.0 if security_passed else -6.0
148
+ reward["regression"] = 3.0 if regression.get("passed") else -5.0
149
+ reward["public_routes"] = 1.0 if public_routes.get("passed") else -3.0
150
+ reward["patch_quality"] = 2.0 if quality.get("passed") else 0.0
151
+ if quality.get("passed") and _followed_repair_phase_order(state):
152
+ reward["patch_quality"] += 0.5
153
+ reward["visible_tests"] = 1.0 if visible.get("passed") else -6.0
154
+ if quality.get("hardcoded"):
155
+ reward["anti_cheat"] += settings.value("hardcoded_identifier", -2.0)
156
+ if quality.get("deny_all"):
157
+ reward["safety"] += settings.value("deny_all_patch", -3.0)
158
+
159
+
160
+ def _add_current_anti_cheat_penalties(
161
+ verifier_result: dict,
162
+ reward: dict[str, float],
163
+ settings: RewardSettings,
164
+ ) -> None:
165
+ flags = set(verifier_result.get("anti_cheat_flags", []) or [])
166
+ if "external_network_attempt" in flags:
167
+ reward["safety"] += settings.value("external_url_attempt", -5.0)
168
+ if "blocked_file_probe" in flags:
169
+ reward["anti_cheat"] += settings.value("hidden_file_probe", -5.0)
170
+ if "host_path_escape" in flags:
171
+ reward["anti_cheat"] += settings.value("host_path_escape", -4.0)
172
+ if "test_file_edit_attempt" in flags:
173
+ reward["anti_cheat"] += settings.value("test_file_edit", -5.0)
174
+ if "hardcoded_hidden_identifier" in flags:
175
+ reward["anti_cheat"] += settings.value("hardcoded_identifier", -2.0)
176
+
177
+
178
+ def _compute_progressive_reward(
179
+ state: CyberSecurityOWASPState,
180
+ action: CyberSecurityOWASPAction,
181
+ verifier_result: dict,
182
+ settings: RewardSettings,
183
+ ) -> float:
184
+ if not settings.dense_train:
185
+ return 0.0
186
+ delta = 0.0
187
+ if action.tool_name == "inspect_policy_graph":
188
+ delta += _award_progress_once(state, "policy_seen", "policy_inspected", settings)
189
+ if action.tool_name in {"list_routes", "read_openapi"}:
190
+ delta += _award_progress_once(state, "route_map_seen", "route_map_inspected", settings)
191
+ if action.tool_name in {"read_file", "search_code"} and _is_relevant_code_action(action):
192
+ delta += _award_progress_once(
193
+ state,
194
+ "relevant_file_seen",
195
+ "relevant_file_inspected",
196
+ settings,
197
+ )
198
+ if action.tool_name in {"send_local_request", "compare_identities"} and any(
199
+ trace.get("unauthorized_success") for trace in state.request_trace
200
+ ):
201
+ delta += _award_progress_once(
202
+ state,
203
+ "local_evidence_found",
204
+ "local_evidence_found",
205
+ settings,
206
+ )
207
+ if action.tool_name == "submit_diagnosis":
208
+ diagnosis = verifier_result.get("diagnosis", verifier_result.get("finding", {}))
209
+ if all(
210
+ diagnosis.get(key)
211
+ for key in (
212
+ "correct_route",
213
+ "correct_bug_class",
214
+ "correct_policy_rule",
215
+ "valid_local_evidence",
216
+ )
217
+ ):
218
+ delta += _award_progress_once(
219
+ state,
220
+ "diagnosis_correct",
221
+ "diagnosis_correct",
222
+ settings,
223
+ )
224
+ if action.tool_name == "patch_file" and not verifier_result.get("invalid_action"):
225
+ delta += _award_progress_once(state, "patch_applies", "patch_applies", settings)
226
+ if action.tool_name == "run_visible_tests":
227
+ visible = verifier_result.get("visible", {})
228
+ checks = visible.get("checks", {}) if isinstance(visible, dict) else {}
229
+ if visible.get("passed"):
230
+ delta += _award_progress_once(
231
+ state,
232
+ "app_boots",
233
+ "app_boots_after_patch",
234
+ settings,
235
+ )
236
+ delta += _award_progress_once(
237
+ state,
238
+ "visible_tests_improved",
239
+ "visible_tests_improved",
240
+ settings,
241
+ )
242
+ if checks.get("health_public"):
243
+ delta += _award_progress_once(
244
+ state,
245
+ "public_routes_visible_pass",
246
+ "public_routes_visible_pass",
247
+ settings,
248
+ )
249
+ return delta
250
+
251
+
252
+ def _award_progress_once(
253
+ state: CyberSecurityOWASPState,
254
+ flag_name: str,
255
+ config_name: str,
256
+ settings: RewardSettings,
257
+ ) -> float:
258
+ if state.progress_flags.get(flag_name):
259
+ return 0.0
260
+ cap = settings.value("progressive_cap", 5.0)
261
+ remaining = max(0.0, cap - float(state.progress_reward_total or 0.0))
262
+ if remaining <= 0.0:
263
+ return 0.0
264
+ state.progress_flags[flag_name] = True
265
+ value = min(settings.value(config_name, 0.0), remaining)
266
+ state.progress_reward_total += value
267
+ return value
268
+
269
+
270
+ def _is_relevant_code_action(action: CyberSecurityOWASPAction) -> bool:
271
+ args = action.arguments or {}
272
+ text = f"{args.get('path', '')} {args.get('query', '')}".lower()
273
+ return any(
274
+ term in text
275
+ for term in ("auth", "tenant", "owner", "role", "invoice", "route", "guard", "policy")
276
+ )
277
+
278
+
279
+ def _compute_step_penalty(
280
+ state: CyberSecurityOWASPState,
281
+ settings: RewardSettings,
282
+ ) -> float:
283
+ if not settings.dense_train:
284
+ return 0.0
285
+ rate = settings.value("step_penalty", 0.0)
286
+ if rate >= 0.0:
287
+ return 0.0
288
+ current = float(state.metrics.get("step_penalty_total", 0.0))
289
+ cap = settings.cap("step_penalty", -0.6)
290
+ delta = max(rate, float(cap) - current) if cap is not None else rate
291
+ state.metrics["step_penalty_total"] = current + delta
292
+ return delta
293
+
294
+
295
+ def _compute_speed_bonus(
296
+ state: CyberSecurityOWASPState,
297
+ action: CyberSecurityOWASPAction,
298
+ verifier_result: dict,
299
+ settings: RewardSettings,
300
+ ) -> float:
301
+ if not settings.dense_train or action.tool_name != "submit_fix":
302
+ return 0.0
303
+ success = all(
304
+ bool((verifier_result.get(key) or {}).get("passed", False))
305
+ for key in ("security", "oracle_matrix", "regression", "public_routes", "patch_quality")
306
+ )
307
+ if not success:
308
+ return 0.0
309
+ max_steps = max(1, int(state.max_steps or 1))
310
+ bonus = settings.value("speed_bonus", 1.0) * (1.0 - min(state.step_count, max_steps) / max_steps)
311
+ return max(0.0, bonus)
312
+
313
+
314
+ def _compute_behavior_penalty(
315
+ state: CyberSecurityOWASPState,
316
+ action: CyberSecurityOWASPAction,
317
+ verifier_result: dict,
318
+ settings: RewardSettings,
319
+ progressive_delta: float,
320
+ ) -> float:
321
+ if not settings.dense_train:
322
+ return 0.0
323
+ penalty = 0.0
324
+ tools = [item.get("tool_name") for item in state.action_history]
325
+ if action.tool_name == "noop":
326
+ penalty += settings.value("noop_action", -0.02)
327
+ if action.tool_name == "read_file":
328
+ path = str((action.arguments or {}).get("path", ""))
329
+ reads = [
330
+ item
331
+ for item in state.action_history
332
+ if item.get("tool_name") == "read_file"
333
+ and str((item.get("arguments") or {}).get("path", "")) == path
334
+ ]
335
+ if len(reads) > 1:
336
+ penalty += settings.value("repeated_file_read", -0.05)
337
+ if action.tool_name == "send_local_request":
338
+ args = action.arguments or {}
339
+ current = (
340
+ str(args.get("method", "GET")).upper(),
341
+ str(args.get("path", "")),
342
+ str(args.get("user_id", "")),
343
+ )
344
+    matches = [
+        item
+        for item in state.action_history
+        if item.get("tool_name") == "send_local_request"
+        and (
+            str((item.get("arguments") or {}).get("method", "GET")).upper(),
+            str((item.get("arguments") or {}).get("path", "")),
+            str((item.get("arguments") or {}).get("user_id", "")),
+        )
+        == current
+    ]
+    if len(matches) > 1:
+        penalty += settings.value("repeated_local_request", -0.05)
+    if action.tool_name == "run_visible_tests" and state.visible_test_count > 1:
+        penalty += settings.value("repeated_visible_tests", -0.1)
+    if action.tool_name == "patch_file" and not state.progress_flags.get("policy_seen"):
+        penalty += settings.value("patch_before_policy", -0.3)
+    if action.tool_name == "submit_fix":
+        if "patch_file" not in tools:
+            penalty += settings.value("submit_without_patch", -0.5)
+        if state.patch_attempt_count > 0 and state.visible_test_count == 0:
+            penalty += settings.value("submit_without_visible_tests", -0.3)
+    if action.tool_name == "patch_file" and state.patch_attempt_count > 3:
+        penalty += settings.value("excessive_patch_attempt", -0.2)
+    files_touched = state.metrics.get("files_touched", [])
+    if isinstance(files_touched, list) and len(files_touched) > 5:
+        penalty += settings.value("too_many_files_changed", -0.5)
+    if action.tool_name == "patch_file":
+        penalty += _oversized_patch_penalty(state, settings)
+    if (
+        progressive_delta <= 0.0
+        and not verifier_result.get("invalid_action")
+        and action.tool_name
+        in {
+            "inspect_policy_graph",
+            "list_routes",
+            "read_openapi",
+            "noop",
+            "run_visible_tests",
+            "send_local_request",
+            "compare_identities",
+        }
+    ):
+        penalty += settings.value("no_progress_action", -0.05)
+    return penalty
+
+
+def _oversized_patch_penalty(
+    state: CyberSecurityOWASPState,
+    settings: RewardSettings,
+) -> float:
+    diff_lines = [
+        line
+        for line in str(state.patch_diff or "").splitlines()
+        if (line.startswith("+") or line.startswith("-"))
+        and not line.startswith("+++")
+        and not line.startswith("---")
+    ]
+    entry = settings.entry("oversized_patch")
+    threshold = int(entry.get("threshold_lines", 80))
+    severe_threshold = int(entry.get("severe_threshold_lines", 180))
+    if len(diff_lines) >= severe_threshold:
+        return float(entry.get("severe_value", -1.0))
+    if len(diff_lines) >= threshold:
+        return settings.value("oversized_patch", -0.25)
+    return 0.0
+
+
+def _component_total(reward: dict[str, float]) -> float:
+    excluded = {
+        "total",
+        "terminal_total",
+        "progressive",
+        "step_penalty",
+        "speed_bonus",
+        "token_penalty",
+        "behavior_penalty",
+        "train_total",
+    }
+    return sum(value for key, value in reward.items() if key not in excluded)
+
+
+def _cap_terminal(total: float, settings: RewardSettings) -> float:
+    cap = settings.value("terminal_cap", 15.0)
+    return min(cap, total) if total > 0 else total
+
+
+def _cap_train(
+    total: float,
+    settings: RewardSettings,
+    state: CyberSecurityOWASPState,
+) -> float:
+    floor = settings.value("penalty_floor", -6.0)
+    capped = max(floor, total)
+    cap = settings.value("train_cap", 21.0)
+    if capped > 0.0:
+        remaining = max(0.0, cap - float(state.accumulated_reward or 0.0))
+        return min(capped, remaining)
+    return capped
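The training-side cap above clamps negatives to a floor and only pays positive reward out of whatever per-episode budget remains. A minimal standalone sketch of that rule, using the default floor/cap constants from the diff (the `accumulated_reward` argument stands in for the episode's running total):

```python
def cap_train(total: float, accumulated_reward: float,
              floor: float = -6.0, cap: float = 21.0) -> float:
    # Negative totals are clamped to the penalty floor; positive totals
    # only pay out whatever budget remains under the per-episode cap.
    capped = max(floor, total)
    if capped > 0.0:
        remaining = max(0.0, cap - accumulated_reward)
        return min(capped, remaining)
    return capped

print(cap_train(-10.0, 0.0))  # floored to -6.0
print(cap_train(5.0, 19.0))   # only 2.0 of budget remains
print(cap_train(5.0, 25.0))   # budget exhausted -> 0.0
```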
scripts/generate_scenario_cache.py ADDED
@@ -0,0 +1,56 @@
+"""Prepare the validated CyberSecurity_OWASP scenario cache.
+
+This command is intentionally offline/cache-prep work. Runtime ``reset()`` can
+load these bundles in required mode without compiling a fresh scenario during a
+Modal smoke or training run.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+from pathlib import Path
+
+from CyberSecurity_OWASP.config import load_scenario_authoring_config
+from CyberSecurity_OWASP.server.scenario_cache import prepare_scenario_cache
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Generate validated scenario cache bundles.")
+    parser.add_argument("--config", default="", help="Path to scenario authoring JSON config.")
+    parser.add_argument("--cache-dir", default="", help="Output scenario cache directory.")
+    parser.add_argument("--seed-start", type=int, default=0)
+    parser.add_argument("--difficulty-buckets", type=int, default=0)
+    parser.add_argument("--train-per-bucket", type=int, default=0)
+    parser.add_argument("--validation-per-bucket", type=int, default=0)
+    parser.add_argument("--heldout-per-bucket", type=int, default=0)
+    parser.add_argument("--force", action="store_true", help="Overwrite existing bundles.")
+    args = parser.parse_args()
+
+    if args.difficulty_buckets:
+        os.environ["CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS"] = str(args.difficulty_buckets)
+    if args.train_per_bucket:
+        os.environ["CYBERSECURITY_OWASP_TRAIN_SCENARIOS_PER_BUCKET"] = str(args.train_per_bucket)
+    if args.validation_per_bucket:
+        os.environ["CYBERSECURITY_OWASP_VALIDATION_SCENARIOS_PER_BUCKET"] = str(args.validation_per_bucket)
+    if args.heldout_per_bucket:
+        os.environ["CYBERSECURITY_OWASP_HELDOUT_SCENARIOS_PER_BUCKET"] = str(args.heldout_per_bucket)
+    if args.config:
+        os.environ["CYBERSECURITY_OWASP_SCENARIO_CONFIG"] = args.config
+    if args.cache_dir:
+        os.environ["CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR"] = args.cache_dir
+
+    settings = load_scenario_authoring_config()
+    cache_dir = Path(args.cache_dir or settings.runtime.cache_dir)
+    result = prepare_scenario_cache(
+        cache_dir=cache_dir,
+        settings=settings,
+        seed_start=args.seed_start,
+        force=args.force,
+    )
+    print(json.dumps(result, indent=2, sort_keys=True))
+
+
+if __name__ == "__main__":
+    main()
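The flag-to-env bridging in this script follows one rule: only non-zero CLI values override the process environment, so config-file defaults survive unset flags. A minimal sketch of that rule in isolation (the `apply_overrides` helper is hypothetical, not part of the repo; it takes a plain dict instead of `os.environ` so it is side-effect free):

```python
# Hypothetical helper mirroring the script's pattern: zero means "not set",
# so config-file defaults are left untouched.
def apply_overrides(env: dict, train_per_bucket: int = 0,
                    validation_per_bucket: int = 0) -> dict:
    if train_per_bucket:
        env["CYBERSECURITY_OWASP_TRAIN_SCENARIOS_PER_BUCKET"] = str(train_per_bucket)
    if validation_per_bucket:
        env["CYBERSECURITY_OWASP_VALIDATION_SCENARIOS_PER_BUCKET"] = str(validation_per_bucket)
    return env

print(apply_overrides({}, train_per_bucket=3))
# only the train key is set; validation keeps its config default
```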
scripts/generate_scenarios.sh CHANGED
@@ -1,3 +1,3 @@
 #!/usr/bin/env bash
 set -euo pipefail
-uv run python -c "from CyberSecurity_OWASP.scenario_compiler import compile_scenario; [compile_scenario(i) for i in range(3)]; print('generated 3 smoke scenarios')"
+uv run python scripts/generate_scenario_cache.py --train-per-bucket 3 --validation-per-bucket 3 --heldout-per-bucket 3
scripts/modal_ephemeral_train.py CHANGED
@@ -12,6 +12,7 @@ the local process, so the run disappears when ``modal run`` exits.
 from __future__ import annotations
 
 import json
+import os
 import subprocess
 import time
 from datetime import datetime
@@ -23,14 +24,18 @@ import modal
 
 APP_NAME = "CyberSecurity_OWASP-ephemeral-training"
 SECRET_NAME = "CyberSecurity_OWASP-secrets"
+SCENARIO_CACHE_VOLUME_NAME = "CyberSecurity_OWASP-scenario-cache"
+SCENARIO_CACHE_DIR = Path("/scenario-cache")
 REMOTE_PROJECT = "/root/CyberSecurity_OWASP"
 PROJECT_ROOT = Path(__file__).resolve().parents[1]
 
 app = modal.App(APP_NAME)
+scenario_cache_volume = modal.Volume.from_name(SCENARIO_CACHE_VOLUME_NAME, create_if_missing=True)
 
 image = (
     modal.Image.debian_slim(python_version="3.11")
     .apt_install("git")
+    .pip_install("openenv-core[core]>=0.2.2", "trackio>=0.22.0")
     .add_local_dir(
         PROJECT_ROOT,
         remote_path=REMOTE_PROJECT,
@@ -46,11 +51,17 @@ image = (
             "*.pyc",
         ],
     )
-    .run_commands(f"pip install -e {REMOTE_PROJECT}")
+    .run_commands(f"pip install --no-deps -e {REMOTE_PROJECT}")
     .workdir(REMOTE_PROJECT)
 )
 
 
+def _configure_scenario_cache_env(*, required: bool = True) -> None:
+    SCENARIO_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    os.environ["CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR"] = str(SCENARIO_CACHE_DIR)
+    os.environ["CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE"] = "require" if required else "fallback"
+
+
 class NoopTrainer:
     """Deterministic placeholder policy for cheap Modal smoke runs."""
 
@@ -66,9 +77,49 @@ class NoopTrainer:
     ]
 
 
+@app.function(
+    image=image,
+    timeout=60 * 60,
+    volumes={SCENARIO_CACHE_DIR: scenario_cache_volume},
+)
+def prepare_ephemeral_scenario_cache(
+    seed_start: int = 0,
+    difficulty_buckets: int = 0,
+    train_per_bucket: int = 0,
+    validation_per_bucket: int = 0,
+    heldout_per_bucket: int = 0,
+    force: bool = False,
+) -> dict[str, Any]:
+    import os
+
+    if difficulty_buckets:
+        os.environ["CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS"] = str(difficulty_buckets)
+    if train_per_bucket:
+        os.environ["CYBERSECURITY_OWASP_TRAIN_SCENARIOS_PER_BUCKET"] = str(train_per_bucket)
+    if validation_per_bucket:
+        os.environ["CYBERSECURITY_OWASP_VALIDATION_SCENARIOS_PER_BUCKET"] = str(validation_per_bucket)
+    if heldout_per_bucket:
+        os.environ["CYBERSECURITY_OWASP_HELDOUT_SCENARIOS_PER_BUCKET"] = str(heldout_per_bucket)
+    _configure_scenario_cache_env(required=False)
+    from CyberSecurity_OWASP.config import load_scenario_authoring_config
+    from CyberSecurity_OWASP.server.scenario_cache import prepare_scenario_cache
+
+    settings = load_scenario_authoring_config()
+    result = prepare_scenario_cache(
+        cache_dir=SCENARIO_CACHE_DIR,
+        settings=settings,
+        seed_start=seed_start,
+        force=force,
+    )
+    scenario_cache_volume.commit()
+    result["scenario_cache_volume"] = SCENARIO_CACHE_VOLUME_NAME
+    return result
+
+
 @app.function(
     image=image,
     timeout=60 * 30,
+    volumes={SCENARIO_CACHE_DIR: scenario_cache_volume},
     secrets=[modal.Secret.from_name(SECRET_NAME, required_keys=["HF_TOKEN"])],
 )
 def run_ephemeral_smoke(
@@ -77,10 +128,13 @@ def run_ephemeral_smoke(
     trackio_space_id: str = "",
     trackio_project: str = "CyberSecurity_OWASP-smoke",
 ) -> dict[str, Any]:
+    _configure_scenario_cache_env(required=True)
     from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
+    from CyberSecurity_OWASP.config import load_scenario_authoring_config
    from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
         CybersecurityOwaspEnvironment,
     )
+    from CyberSecurity_OWASP.server.scenario_cache import ScenarioCache
     from training.rollout import rollout_once
     from training.trackio_utils import (
         aggregate_episode_metrics,
@@ -91,11 +145,26 @@
         trackio_run,
     )
 
+    scenario_cache_volume.reload()
+    settings = load_scenario_authoring_config()
+    cache_coverage = ScenarioCache(SCENARIO_CACHE_DIR, settings=settings).assert_coverage(
+        split="validation",
+        difficulty=0,
+    )
+    available_scenarios = int(
+        cache_coverage.get("counts", {}).get("validation", {}).get("0", 0)
+    )
+    if available_scenarios < episodes:
+        raise RuntimeError(
+            "Scenario cache does not cover this smoke run. Run prepare-cache "
+            f"with a larger validation count. available={available_scenarios}, episodes={episodes}"
+        )
+
     baseline = []
     oracle = []
     run_context = {
         "algo": "modal_ephemeral_smoke",
-        "reward_version": "reward_v1",
+        "reward_version": "reward_v2",
         "env_version": "0.1.0",
     }
 
@@ -125,16 +194,28 @@
     oracle_env = CybersecurityOwaspEnvironment()
     oracle_env.reset(seed=seed, split="validation")
     hidden = oracle_env.state.hidden_facts
+    evidence = oracle_env.step(
+        CyberSecurityOWASPAction(
+            tool_name="send_local_request",
+            arguments={
+                "method": "GET",
+                "path": f"/invoices/{hidden['other_invoice_id']}",
+                "user_id": hidden["owner_user_id"],
+            },
+        )
+    )
+    trace_id = "req_001"
+    if '"trace_id": "req_' in evidence.last_tool_result:
+        trace_id = evidence.last_tool_result.split('"trace_id": "', 1)[1].split('"', 1)[0]
     oracle_env.step(
         CyberSecurityOWASPAction(
-            tool_name="submit_finding",
+            tool_name="submit_diagnosis",
             arguments={
-                "summary": "BOLA/IDOR authorization bug in invoice read route.",
-                "evidence": (
-                    f"user {hidden['owner_user_id']} can request invoice "
-                    f"{hidden['other_invoice_id']} despite the owner/admin policy"
-                ),
-                "policy_rule": "Only owner or billing_admin in same tenant may read invoices.",
+                "bug_class": "idor_ownership_bug",
+                "route": "GET /invoices/{invoice_id}",
+                "violated_policy_rule": "Only owner or billing_admin in same tenant may read invoices.",
+                "evidence_trace_ids": [trace_id],
+                "fix_plan": "Add tenant and owner/admin checks before returning invoice data.",
             },
         )
     )
@@ -186,6 +267,9 @@
         "baseline_mean_reward": mean(baseline, "reward_total"),
         "oracle_mean_reward": mean(oracle, "reward_total"),
         "oracle_success_rate": mean(oracle, "success"),
+        "scenario_cache_volume": SCENARIO_CACHE_VOLUME_NAME,
+        "scenario_cache_mode": "require",
+        "scenario_cache_coverage": cache_coverage,
         "tracking_metrics": tracking_metrics,
         "tracking_trace_rows": trace_table_rows(episode_records),
         "baseline": baseline,
@@ -356,8 +440,23 @@ def main(
     trackio_space_id: str = "",
     trackio_project: str = "CyberSecurity_OWASP-smoke",
     run_name: str = "",
+    cache_difficulty_buckets: int = 0,
+    cache_train_per_bucket: int = 0,
+    cache_validation_per_bucket: int = 0,
+    cache_heldout_per_bucket: int = 0,
+    cache_force: bool = False,
 ) -> None:
-    if mode == "smoke":
+    if mode == "prepare-cache":
+        result = prepare_ephemeral_scenario_cache.remote(
+            seed_start=seed_start,
+            difficulty_buckets=cache_difficulty_buckets,
+            train_per_bucket=cache_train_per_bucket,
+            validation_per_bucket=cache_validation_per_bucket,
+            heldout_per_bucket=cache_heldout_per_bucket,
+            force=cache_force,
+        )
+        print(json.dumps(result, indent=2, sort_keys=True))
+    elif mode == "smoke":
        result = run_ephemeral_smoke.remote(
            episodes=episodes,
            seed_start=seed_start,
@@ -389,5 +488,5 @@
         print(json.dumps(result, indent=2, sort_keys=True))
     else:
         raise ValueError(
-            "mode must be 'smoke', 'grpo-config', 'verify-trackio', or 'inspect-trackio'"
+            "mode must be 'prepare-cache', 'smoke', 'grpo-config', 'verify-trackio', or 'inspect-trackio'"
         )
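The smoke oracle above pulls `trace_id` out of `last_tool_result` with string splitting. A stricter alternative is a sketch like the following, assuming the tool result is a JSON object carrying a `trace_id` field (the `extract_trace_id` helper is hypothetical, not part of the repo), with the same `req_001` fallback:

```python
import json

def extract_trace_id(last_tool_result: str, default: str = "req_001") -> str:
    # Parse the tool result as JSON and read trace_id directly,
    # falling back to the default on malformed or missing data.
    try:
        payload = json.loads(last_tool_result)
    except (TypeError, ValueError):
        return default
    trace_id = payload.get("trace_id") if isinstance(payload, dict) else None
    return trace_id if isinstance(trace_id, str) and trace_id else default

print(extract_trace_id('{"status": 200, "trace_id": "req_007"}'))  # req_007
print(extract_trace_id("not json"))  # req_001
```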
scripts/modal_train_grpo.py CHANGED
@@ -10,7 +10,7 @@ Example:
     uv run --extra modal modal run scripts/modal_train_grpo.py \
         --max-steps 10 \
         --dataset-size 16 \
-        --num-generations 2 \
+        --num-generations 6 \
         --difficulty 0
 """
 
@@ -29,9 +29,11 @@ import modal
 APP_NAME = "CyberSecurity_OWASP-grpo"
 VOLUME_NAME = "CyberSecurity_OWASP-grpo-runs"
 CACHE_VOLUME_NAME = "CyberSecurity_OWASP-model-cache"
+SCENARIO_CACHE_VOLUME_NAME = "CyberSecurity_OWASP-scenario-cache"
 SECRET_NAME = "CyberSecurity_OWASP-secrets"
 RUNS_DIR = pathlib.Path("/runs")
 CACHE_DIR = pathlib.Path("/cache")
+SCENARIO_CACHE_DIR = pathlib.Path("/scenario-cache")
 HF_HOME_DIR = CACHE_DIR / "huggingface"
 HF_HUB_CACHE_DIR = HF_HOME_DIR / "hub"
 TORCH_HOME_DIR = CACHE_DIR / "torch"
@@ -46,6 +48,16 @@ DEFAULT_GEMMA_MODEL = "unsloth/gemma-4-E2B-it"
 _IMAGE_NOTICE_PRINTED = False
 
 
+def _ensure_gemma4_model(model_name: str) -> str:
+    if model_name != DEFAULT_GEMMA_MODEL:
+        raise ValueError(
+            "CyberSecurity_OWASP GRPO training is pinned to "
+            f"{DEFAULT_GEMMA_MODEL}, matching the Unsloth Gemma 4 E2B RL notebook. "
+            f"Received {model_name!r}."
+        )
+    return model_name
+
+
 def _model_repo_slug(model_name: str) -> str:
     return (
         model_name.replace("/", "-")
@@ -86,6 +98,17 @@ def _configure_modal_cache_env() -> dict[str, str]:
     return values
 
 
+def _configure_scenario_cache_env(*, required: bool = True) -> dict[str, str]:
+    values = {
+        "CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR": str(SCENARIO_CACHE_DIR),
+        "CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE": "require" if required else "fallback",
+    }
+    for key, value in values.items():
+        os.environ[key] = value
+    SCENARIO_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    return values
+
+
 def _print_image_startup_notice() -> None:
     global _IMAGE_NOTICE_PRINTED
     if _IMAGE_NOTICE_PRINTED:
@@ -134,6 +157,16 @@ def _is_config_mode() -> bool:
     return False
 
 
+def _is_prepare_cache_mode() -> bool:
+    args = sys.argv[1:]
+    for index, arg in enumerate(args):
+        if arg == "--mode" and index + 1 < len(args):
+            return args[index + 1] == "prepare-cache"
+        if arg.startswith("--mode="):
+            return arg.split("=", 1)[1] == "prepare-cache"
+    return False
+
+
 _load_local_env_file()
 
 
@@ -153,7 +186,10 @@ def _source_mode() -> str:
 
 
 def _training_image() -> modal.Image:
-    _print_image_startup_notice()
+    if _is_prepare_cache_mode():
+        return _scenario_cache_image()
+    if not _is_prepare_cache_mode():
+        _print_image_startup_notice()
     image = (
         modal.Image.from_registry(
             "nvidia/cuda:12.8.0-devel-ubuntu22.04",
@@ -225,28 +261,182 @@ def _training_image() -> modal.Image:
     ).workdir(REMOTE_PROJECT)
 
 
+def _scenario_cache_image() -> modal.Image:
+    image = (
+        modal.Image.debian_slim(python_version="3.11")
+        .apt_install("git")
+        .uv_pip_install("openenv-core[core]>=0.2.3", "trackio>=0.25.0")
+    )
+
+    if _source_mode() == "public":
+        repo_url = _cli_arg_value("repo-url", PUBLIC_REPO_URL)
+        repo_branch = _cli_arg_value("repo-branch", PUBLIC_REPO_BRANCH)
+        image = image.run_commands(
+            f"git clone --depth 1 --branch {repo_branch} {repo_url} {REMOTE_PROJECT}",
+            f"python -m pip install --no-deps -e {REMOTE_PROJECT}",
+        )
+    else:
+        image = image.add_local_dir(
+            PROJECT_ROOT,
+            remote_path=REMOTE_PROJECT,
+            copy=True,
+            ignore=[
+                ".git",
+                ".venv",
+                ".env",
+                ".env.*",
+                "__pycache__",
+                ".pytest_cache",
+                "outputs",
+                "*.pyc",
+            ],
+        )
+        image = image.run_commands(
+            f"python -m pip install --no-deps -e {REMOTE_PROJECT}",
+        )
+    return image.workdir(REMOTE_PROJECT)
+
+
 app = modal.App(APP_NAME)
 volume = modal.Volume.from_name(VOLUME_NAME, create_if_missing=True)
 cache_volume = modal.Volume.from_name(CACHE_VOLUME_NAME, create_if_missing=True)
+scenario_cache_volume = modal.Volume.from_name(SCENARIO_CACHE_VOLUME_NAME, create_if_missing=True)
 secrets = _modal_secrets()
+scenario_cache_image = _scenario_cache_image()
 training_image = _training_image()
 
 
+@app.function(
+    image=scenario_cache_image,
+    timeout=2 * 60 * 60,
+    volumes={SCENARIO_CACHE_DIR: scenario_cache_volume},
+)
+def prepare_modal_scenario_cache(
+    seed_start: int = 0,
+    difficulty_buckets: int = 0,
+    train_per_bucket: int = 0,
+    validation_per_bucket: int = 0,
+    heldout_per_bucket: int = 0,
+    force: bool = False,
+) -> dict[str, Any]:
+    if difficulty_buckets:
+        os.environ["CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS"] = str(difficulty_buckets)
+    if train_per_bucket:
+        os.environ["CYBERSECURITY_OWASP_TRAIN_SCENARIOS_PER_BUCKET"] = str(train_per_bucket)
+    if validation_per_bucket:
+        os.environ["CYBERSECURITY_OWASP_VALIDATION_SCENARIOS_PER_BUCKET"] = str(validation_per_bucket)
+    if heldout_per_bucket:
+        os.environ["CYBERSECURITY_OWASP_HELDOUT_SCENARIOS_PER_BUCKET"] = str(heldout_per_bucket)
+    _configure_scenario_cache_env(required=False)
+    from CyberSecurity_OWASP.config import load_scenario_authoring_config
+    from CyberSecurity_OWASP.server.scenario_cache import prepare_scenario_cache
+
+    settings = load_scenario_authoring_config()
+    result = prepare_scenario_cache(
+        cache_dir=SCENARIO_CACHE_DIR,
+        settings=settings,
+        seed_start=seed_start,
+        force=force,
+    )
+    scenario_cache_volume.commit()
+    result["scenario_cache_volume"] = SCENARIO_CACHE_VOLUME_NAME
+    return result
+
+
+@app.function(
+    image=scenario_cache_image,

 @app.function(
     image=training_image,
     gpu="L4",
     timeout=4 * 60 * 60,
-    volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume},
     secrets=secrets,
 )
 def check_training_imports() -> dict[str, str]:
     cache_env = _configure_modal_cache_env()
 
     import torch
     import trackio
     from datasets import Dataset
     from trl import GRPOConfig, GRPOTrainer
-    from unsloth import FastLanguageModel, FastVisionModel
 
     from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
         CybersecurityOwaspEnvironment,
@@ -260,12 +450,12 @@ def check_training_imports() -> dict[str, str]:
         "dataset": Dataset.__name__,
         "grpo_config": GRPOConfig.__name__,
         "grpo_trainer": GRPOTrainer.__name__,
-        "unsloth_model": FastLanguageModel.__name__,
         "unsloth_vision_model": FastVisionModel.__name__,
         "env": CybersecurityOwaspEnvironment.__name__,
         "reset_phase": obs.phase,
         "hf_home": cache_env["HF_HOME"],
         "hf_hub_cache": cache_env["HF_HUB_CACHE"],

     }
 
 
@@ -273,7 +463,7 @@ def check_training_imports() -> dict[str, str]:
     image=training_image,
     gpu="L4",
     timeout=4 * 60 * 60,
-    volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume},

     secrets=secrets,
 )
 def train_cybersecurity_owasp_grpo(
@@ -289,7 +479,7 @@ def train_cybersecurity_owasp_grpo(
     lora_rank: int = 32,
     trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
     trackio_project: str = "CyberSecurity_OWASP-grpo",
-    num_generations: int = 2,

     seed_start: int = 0,
     git_sha: str = "nogit",
     run_name: str = "",
@@ -303,10 +493,11 @@ def train_cybersecurity_owasp_grpo(
     import threading
     import time
 
     cache_env = _configure_modal_cache_env()
 
     import torch
-    from unsloth import FastLanguageModel, FastVisionModel

     import transformers.utils.hub as transformers_hub
     from datasets import Dataset
     from huggingface_hub import snapshot_download, whoami
@@ -317,9 +508,13 @@ def train_cybersecurity_owasp_grpo(
     import trackio
 
     from CyberSecurity_OWASP.models import CyberSecurityOWASPAction

     from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
         CybersecurityOwaspEnvironment,
     )

     from training.trackio_utils import (
         aggregate_episode_metrics,
         episode_record_from_state,
@@ -356,6 +551,7 @@ def train_cybersecurity_owasp_grpo(
 
     os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
     os.environ["TRACKIO_PROJECT"] = trackio_project

 
     model_slug = model_name.replace("/", "-")
     stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
@@ -370,13 +566,42 @@ def train_cybersecurity_owasp_grpo(
         print(f"Reloaded Modal model cache volume: {CACHE_VOLUME_NAME}")
     except Exception as exc:
         print(f"Model cache volume reload skipped: {exc!r}")

     cache_env = _configure_modal_cache_env()

 
     training_prompt = (
         "You are a defensive AppSec repair agent in the local CyberSecurity_OWASP "
         "OpenEnv environment. Use only the provided local tools. Do not target real "
         "systems. Work step by step: inspect policy and generated code, reproduce the "
-        "authorization issue locally, submit a policy-tied finding, patch the generated "

         "app, run visible tests, then submit the fix. Do not write explanations unless "
         "a tool argument needs evidence text."
     )
@@ -403,6 +628,8 @@ def train_cybersecurity_owasp_grpo(
         "difficulty": state.difficulty,
         "domain": state.domain,
         "bug_family": state.bug_family,

         "phase": state.phase,
         "step_count": state.step_count,
         "done": state.done,
@@ -463,7 +690,7 @@ def train_cybersecurity_owasp_grpo(
             obs = self._env.step(action)
             if not obs.last_action_valid:
                 self.invalid_actions += 1
-            self.reward = float(obs.reward_breakdown.get("total", obs.reward or 0.0))

             self.reward_breakdown = dict(obs.reward_breakdown or {})
             self.done = bool(obs.done)
             self.success = bool(self._env.state.success)
@@ -484,6 +711,8 @@ def train_cybersecurity_owasp_grpo(
                     "reward": self.reward,
                     "reward_breakdown": self.reward_breakdown,
                     "invalid_actions": self.invalid_actions,

                 }
             )
             return obs.message
@@ -575,29 +804,35 @@ def train_cybersecurity_owasp_grpo(
                 },
             )
 
-        def submit_finding(

             self,
-            summary: str,
-            evidence: str,
-            policy_rule: str,

         ) -> str:
             """
-            Submit structured evidence for the suspected authorization bug.

 
             Args:
-                summary: Concise description of the suspected access-control bug.
-                evidence: Local reproduction evidence from policy, code, or requests.
-                policy_rule: Policy rule that the observed behavior violates.

 
             Returns:
-                Finding acceptance result and next phase information.

             """
             return self._step(
-                "submit_finding",

                 {
-                    "summary": summary,
-                    "evidence": evidence,
-                    "policy_rule": policy_rule,

                 },
             )
 
@@ -637,8 +872,12 @@ def train_cybersecurity_owasp_grpo(
             """Take no action."""
             return self._step("noop")
 
-        def _score(self) -> float:
-            return float(self.reward)

 
         def __del__(self):
             try:
@@ -667,24 +906,31 @@ def train_cybersecurity_owasp_grpo(
         return float(sum(values) / len(values)) if values else 0.0
 
     def cybersecurity_owasp_reward(environments, **kwargs) -> list[float]:
-        rewards = [float(env._score()) for env in environments]
         completions = kwargs.get("completions") or kwargs.get("completion") or []

         trace_step["value"] += 1
 
         episode_records = []
-        for env, reward in zip(environments, rewards):

             record = episode_record_from_state(
                 env._env.state,
                 run_context={
                     "base_model": model_name,
                     "algo": "grpo",
-                    "reward_version": "reward_v1",

                     "env_version": "0.1.0",
                 },
             )
             record.update(
                 {
                     "reward_total": reward,

                     "success": bool(getattr(env, "success", False)),
                 }
             )
@@ -761,6 +1007,10 @@ def train_cybersecurity_owasp_grpo(
         log_trackio_metrics(
             {
                 "system/model_cache_hit": float(cache_hit),

                 "system/hub_push_enabled": float(push_to_hub),
             },
             step=int(state.global_step or 0),
@@ -805,6 +1055,10 @@ def train_cybersecurity_owasp_grpo(
     print(f"Output repo: {output_repo_id}")
     print(f"Run name: {run_name}")
     print(f"Model cache volume: {CACHE_VOLUME_NAME}")

     print(f"HF_HOME: {cache_env['HF_HOME']}")
     print(f"HF_HUB_CACHE: {cache_env['HF_HUB_CACHE']}")
     print(f"Torch cache: {cache_env['TORCH_HOME']}")
@@ -839,7 +1093,7 @@ def train_cybersecurity_owasp_grpo(
     )
 
     print(f"Loading model with Unsloth from_pretrained: {model_name}")
-    model_api = FastVisionModel if "gemma-4" in model_name.lower() else FastLanguageModel

     model, tokenizer = model_api.from_pretrained(
         model_name=model_name,
         max_seq_length=max_seq_length,
@@ -854,34 +1108,11 @@ def train_cybersecurity_owasp_grpo(
     try:
         tokenizer = add_response_schema(tokenizer)
     except Exception as exc:
-        if "gemma-4" in model_name.lower():
-            print(
-                "Tokenizer response schema add skipped for Gemma 4 processor, "
-                "matching the Unsloth Gemma 4 GRPO notebook pattern: "
-                f"{exc!r}"
-            )
-        else:
-            print(f"Tokenizer response schema add failed before cloning: {exc!r}")
-            for template_source in ("Qwen/Qwen3-0.6B", "Qwen/Qwen2.5-0.5B-Instruct"):
-                try:
-                    model, tokenizer, added_tokens = clone_chat_template(
-                        model,
-                        tokenizer,
-                        template_source,
-                    )
-                    print(
-                        "Cloned response-schema-capable chat template "
-                        f"from {template_source}; added {len(added_tokens)} tokens."
-                    )
-                    tokenizer = add_response_schema(tokenizer)
-                    break
-                except Exception as clone_exc:
-                    print(
-                        "Tokenizer response schema fallback failed for "
-                        f"{template_source}: {clone_exc!r}"
-                    )
-            else:
-                raise

 
     model = model_api.get_peft_model(
         model,
@@ -1001,8 +1232,10 @@ def train_cybersecurity_owasp_grpo(
         print("Skipping Hub push for this run. Pass --push-to-hub to upload adapters.")
     volume.commit()
     cache_volume.commit()

     print(f"Committed run volume: {VOLUME_NAME}")
     print(f"Committed model cache volume: {CACHE_VOLUME_NAME}")

     try:
         trackio.finish()
     except RuntimeError as exc:
@@ -1025,6 +1258,8 @@ def train_cybersecurity_owasp_grpo(
         "repo_url": repo_url,
         "repo_branch": repo_branch,
         "push_to_hub": push_to_hub,

     }
 
 
@@ -1043,7 +1278,7 @@ def main(
     lora_rank: int = 32,
     trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
     trackio_project: str = "CyberSecurity_OWASP-grpo",
-    num_generations: int = 2,

     seed_start: int = 0,
     git_sha: str = "nogit",
     source_mode: str = "local",
@@ -1051,13 +1286,31 @@ def main(
     repo_branch: str = PUBLIC_REPO_BRANCH,
     detach: bool = False,
     push_to_hub: bool = False,

 ) -> None:

     if mode == "config":
         result = check_training_imports.remote()
         print(result)
         return
     if mode != "train":
-        raise ValueError("mode must be 'train' or 'config'")

     trackio_space_id = trackio_space_id or os.environ.get(
         "TRACKIO_SPACE_ID",
@@ -1123,15 +1376,17 @@ def main(
     )
     print(f"Hub push enabled: {push_to_hub}")
     print(f"Model cache volume: {CACHE_VOLUME_NAME}")

     print("Launch phases:")
     print(
         "1. Modal image build/validation: happens before remote Python logs; "
         "slow when local source or dependency layers changed."
     )
-    print("2. GPU container start on one L4 and persistent volume reload.")
-    print("3. Model cache check in CyberSecurity_OWASP-model-cache.")
-    print("4. Cached snapshot load into GPU RAM with Unsloth progress.")
-    print("5. One GRPO step, Trackio sync, and volume commit.")

     print(
         "If there is a long pause after trainer.train() starts, watch for "
         "Training heartbeat lines every 30 seconds."
     )
@@ -1159,6 +1414,13 @@ def main(
         repo_branch=repo_branch,
         push_to_hub=push_to_hub,
     )

     if detach:
         call = train_cybersecurity_owasp_grpo.spawn(**kwargs)
         print(f"Spawned Modal training call: {call.object_id}")
348
+ timeout=60 * 10,
349
+ volumes={SCENARIO_CACHE_DIR: scenario_cache_volume},
350
+ )
351
+ def verify_modal_scenario_cache_for_training(
352
+ split: str = "train",
353
+ difficulty: int = 0,
354
+ dataset_size: int = 2,
355
+ seed_start: int = 0,
356
+ ) -> dict[str, Any]:
357
+ _configure_scenario_cache_env(required=True)
358
+ scenario_cache_volume.reload()
359
+
360
+ from CyberSecurity_OWASP.config import load_scenario_authoring_config
361
+ from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
362
+ CybersecurityOwaspEnvironment,
363
+ )
365
+ from CyberSecurity_OWASP.server.curriculum import CurriculumController
366
+ from CyberSecurity_OWASP.server.scenario_cache import ScenarioCache
367
+
368
+ settings = load_scenario_authoring_config()
369
+ scenario_profile = CurriculumController(settings=settings).select_profile(
370
+ seed=seed_start,
371
+ split=split,
372
+ requested_difficulty=difficulty,
373
+ )
374
+ resolved_difficulty = int(scenario_profile["difficulty"])
375
+ cache = ScenarioCache(SCENARIO_CACHE_DIR, settings=settings)
376
+ coverage = cache.assert_coverage(split=split, difficulty=resolved_difficulty)
377
+ available_scenarios = int(
378
+ coverage.get("counts", {})
379
+ .get(split, {})
380
+ .get(str(resolved_difficulty), 0)
381
+ )
382
+ if available_scenarios < dataset_size:
383
+ raise RuntimeError(
384
+ "Scenario cache does not cover this Modal dataset. Run "
385
+ "--mode prepare-cache with a larger per-bucket count before training. "
386
+ f"available={available_scenarios}, requested_dataset_size={dataset_size}, "
387
+ f"split={split}, difficulty={resolved_difficulty}"
388
+ )
389
+
390
+ env = CybersecurityOwaspEnvironment()
391
+ try:
392
+ obs = env.reset(seed=seed_start, split=split, difficulty=difficulty)
393
+ if not env.state.cache_hit:
394
+ raise RuntimeError("Scenario cache preflight reset did not hit cache.")
395
+ if env.state.metrics.get("scenario_compile_latency_ms", 0.0):
396
+ raise RuntimeError("Scenario cache preflight unexpectedly compiled a scenario.")
397
+ sample = {
398
+ "phase": obs.phase,
399
+ "task_id": env.state.task_id,
400
+ "cache_hit": env.state.cache_hit,
401
+ "scenario_hash": env.state.scenario_hash,
402
+ "reset_latency_ms": env.state.reset_latency_ms,
403
+ "bundle_load_latency_ms": env.state.metrics.get(
404
+ "scenario_bundle_load_latency_ms",
405
+ 0.0,
406
+ ),
407
+ }
408
+ finally:
409
+ env.close()
410
+
411
+ return {
412
+ "scenario_cache_volume": SCENARIO_CACHE_VOLUME_NAME,
413
+ "scenario_cache_dir": str(SCENARIO_CACHE_DIR),
414
+ "scenario_cache_mode": "require",
415
+ "split": split,
416
+ "difficulty": resolved_difficulty,
417
+ "dataset_size": dataset_size,
418
+ "available_scenarios": available_scenarios,
419
+ "coverage": coverage,
420
+ "sample_reset": sample,
421
+ }
422
+
423
+
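The nested `coverage` lookup in the preflight above assumes a payload with an `entries` total plus per-split counts keyed by stringified difficulty; that shape is inferred from this call site, not taken from `ScenarioCache` itself. A minimal sketch:

```python
def available_scenarios(coverage: dict, split: str, difficulty: int) -> int:
    # Same defensive chain of .get() calls as the preflight above, so a
    # missing split or difficulty bucket reads as zero coverage.
    return int(coverage.get("counts", {}).get(split, {}).get(str(difficulty), 0))

# Illustrative payload; the field names follow the lookup above.
coverage = {"entries": 12, "counts": {"train": {"0": 8, "1": 4}}}
print(available_scenarios(coverage, "train", 0))    # 8
print(available_scenarios(coverage, "heldout", 2))  # 0
```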
424
  @app.function(
425
  image=training_image,
426
  gpu="L4",
427
  timeout=4 * 60 * 60,
428
+ volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume, SCENARIO_CACHE_DIR: scenario_cache_volume},
429
  secrets=secrets,
430
  )
431
  def check_training_imports() -> dict[str, str]:
432
  cache_env = _configure_modal_cache_env()
433
+ scenario_cache_env = _configure_scenario_cache_env(required=False)
434
 
435
  import torch
436
  import trackio
437
  from datasets import Dataset
438
  from trl import GRPOConfig, GRPOTrainer
439
+ from unsloth import FastVisionModel
440
 
441
  from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
442
  CybersecurityOwaspEnvironment,
 
450
  "dataset": Dataset.__name__,
451
  "grpo_config": GRPOConfig.__name__,
452
  "grpo_trainer": GRPOTrainer.__name__,
 
453
  "unsloth_vision_model": FastVisionModel.__name__,
454
  "env": CybersecurityOwaspEnvironment.__name__,
455
  "reset_phase": obs.phase,
456
  "hf_home": cache_env["HF_HOME"],
457
  "hf_hub_cache": cache_env["HF_HUB_CACHE"],
458
+ "scenario_cache_dir": scenario_cache_env["CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR"],
459
  }
460
 
461
 
 
463
  image=training_image,
464
  gpu="L4",
465
  timeout=4 * 60 * 60,
466
+ volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume, SCENARIO_CACHE_DIR: scenario_cache_volume},
467
  secrets=secrets,
468
  )
469
  def train_cybersecurity_owasp_grpo(
 
479
  lora_rank: int = 32,
480
  trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
481
  trackio_project: str = "CyberSecurity_OWASP-grpo",
482
+ num_generations: int = 6,
483
  seed_start: int = 0,
484
  git_sha: str = "nogit",
485
  run_name: str = "",
 
493
  import threading
494
  import time
495
 
496
+ model_name = _ensure_gemma4_model(model_name)
497
  cache_env = _configure_modal_cache_env()
498
 
499
  import torch
500
+ from unsloth import FastVisionModel
501
  import transformers.utils.hub as transformers_hub
502
  from datasets import Dataset
503
  from huggingface_hub import snapshot_download, whoami
 
508
  import trackio
509
 
510
  from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
511
+ from CyberSecurity_OWASP.config import load_scenario_authoring_config
512
  from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
513
  CybersecurityOwaspEnvironment,
514
  )
515
+ from CyberSecurity_OWASP.reward_config import compute_token_penalty
516
+ from CyberSecurity_OWASP.server.curriculum import CurriculumController
517
+ from CyberSecurity_OWASP.server.scenario_cache import ScenarioCache
518
  from training.trackio_utils import (
519
  aggregate_episode_metrics,
520
  episode_record_from_state,
 
551
 
552
  os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
553
  os.environ["TRACKIO_PROJECT"] = trackio_project
554
+ os.environ.setdefault("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
555
 
556
  model_slug = model_name.replace("/", "-")
557
  stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
 
566
  print(f"Reloaded Modal model cache volume: {CACHE_VOLUME_NAME}")
567
  except Exception as exc:
568
  print(f"Model cache volume reload skipped: {exc!r}")
569
+ try:
570
+ scenario_cache_volume.reload()
571
+ print(f"Reloaded Modal scenario cache volume: {SCENARIO_CACHE_VOLUME_NAME}")
572
+ except Exception as exc:
573
+ print(f"Scenario cache volume reload skipped: {exc!r}")
574
  cache_env = _configure_modal_cache_env()
575
+ scenario_cache_env = _configure_scenario_cache_env(required=True)
576
+ scenario_settings = load_scenario_authoring_config()
577
+ scenario_profile = CurriculumController(settings=scenario_settings).select_profile(
578
+ seed=seed_start,
579
+ split=split,
580
+ requested_difficulty=difficulty,
581
+ )
582
+ scenario_cache = ScenarioCache(SCENARIO_CACHE_DIR, settings=scenario_settings)
583
+ scenario_cache_coverage = scenario_cache.assert_coverage(
584
+ split=split,
585
+ difficulty=int(scenario_profile["difficulty"]),
586
+ )
587
+ available_scenarios = int(
588
+ scenario_cache_coverage.get("counts", {})
589
+ .get(split, {})
590
+ .get(str(int(scenario_profile["difficulty"])), 0)
591
+ )
592
+ if available_scenarios < dataset_size:
593
+ raise RuntimeError(
594
+ "Scenario cache does not cover this Modal dataset. Run "
595
+ "--mode prepare-cache with a larger per-bucket count before training. "
596
+ f"available={available_scenarios}, requested_dataset_size={dataset_size}, "
597
+ f"split={split}, difficulty={scenario_profile['difficulty']}"
598
+ )
599
 
600
  training_prompt = (
601
  "You are a defensive AppSec repair agent in the local CyberSecurity_OWASP "
602
  "OpenEnv environment. Use only the provided local tools. Do not target real "
603
  "systems. Work step by step: inspect policy and generated code, reproduce the "
604
+ "authorization issue locally, submit a policy-tied diagnosis, patch the generated "
605
  "app, run visible tests, then submit the fix. Do not write explanations unless "
606
  "a tool argument needs evidence text."
607
  )
 
628
  "difficulty": state.difficulty,
629
  "domain": state.domain,
630
  "bug_family": state.bug_family,
631
+ "cache_hit": state.cache_hit,
632
+ "scenario_hash": state.scenario_hash,
633
  "phase": state.phase,
634
  "step_count": state.step_count,
635
  "done": state.done,
 
690
  obs = self._env.step(action)
691
  if not obs.last_action_valid:
692
  self.invalid_actions += 1
693
+ self.reward = float(self._env.state.accumulated_reward)
694
  self.reward_breakdown = dict(obs.reward_breakdown or {})
695
  self.done = bool(obs.done)
696
  self.success = bool(self._env.state.success)
 
711
  "reward": self.reward,
712
  "reward_breakdown": self.reward_breakdown,
713
  "invalid_actions": self.invalid_actions,
714
+ "scenario_cache_hit": self._env.state.cache_hit,
715
+ "scenario_hash": self._env.state.scenario_hash,
716
  }
717
  )
718
  return obs.message
 
804
  },
805
  )
806
 
807
+ def submit_diagnosis(
808
  self,
809
+ bug_class: str,
810
+ route: str,
811
+ violated_policy_rule: str,
812
+ evidence_trace_ids: list[str],
813
+ fix_plan: str,
814
  ) -> str:
815
  """
816
+ Submit structured diagnosis for the suspected authorization bug.
817
 
818
  Args:
819
+ bug_class: Short class such as idor_ownership_bug.
820
+ route: Method and route pattern believed to be vulnerable.
821
+ violated_policy_rule: Policy rule that the behavior violates.
822
+ evidence_trace_ids: Request trace IDs from local evidence tools.
823
+ fix_plan: Concise secure repair plan.
824
 
825
  Returns:
826
+ Diagnosis acceptance result and next phase information.
827
  """
828
  return self._step(
829
+ "submit_diagnosis",
830
  {
831
+ "bug_class": bug_class,
832
+ "route": route,
833
+ "violated_policy_rule": violated_policy_rule,
834
+ "evidence_trace_ids": evidence_trace_ids,
835
+ "fix_plan": fix_plan,
836
  },
837
  )
838
 
 
872
  """Take no action."""
873
  return self._step("noop")
874
 
875
+ def _score(self, completion_tokens: int = 0) -> float:
876
+ token_penalty = compute_token_penalty(completion_tokens)
877
+ self._env.state.completion_tokens = int(completion_tokens)
878
+ self._env.state.metrics["completion_tokens"] = int(completion_tokens)
879
+ self._env.state.metrics["token_penalty"] = token_penalty
880
+ return float(self._env.state.accumulated_reward + token_penalty)
881
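`_score` combines the environment's accumulated reward with a token penalty from `reward_config.compute_token_penalty`. A sketch with a stand-in penalty function; the linear rate and floor below are illustrative defaults, not the project's tuned values:

```python
def compute_token_penalty(tokens: int, rate: float = -0.0005, cap: float = -0.5) -> float:
    # Stand-in for reward_config.compute_token_penalty: a non-positive
    # linear penalty per completion token, floored at a fixed cap so long
    # completions are not punished without bound.
    return max(cap, rate * tokens)

def score(accumulated_reward: float, completion_tokens: int) -> float:
    # Mirrors _score above: episode reward plus the token penalty.
    return float(accumulated_reward + compute_token_penalty(completion_tokens))

print(score(1.0, 200))   # 0.9
print(score(0.5, 5000))  # 0.0  (penalty hits the -0.5 floor)
```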
 
882
  def __del__(self):
883
  try:
 
906
  return float(sum(values) / len(values)) if values else 0.0
907
 
908
  def cybersecurity_owasp_reward(environments, **kwargs) -> list[float]:
 
909
  completions = kwargs.get("completions") or kwargs.get("completion") or []
910
+ completion_texts = [_completion_to_text(item) for item in completions]
911
+ completion_tokens = [len(text.split()) for text in completion_texts]
912
+ rewards = [
913
+ float(env._score(completion_tokens[index] if index < len(completion_tokens) else 0))
914
+ for index, env in enumerate(environments)
915
+ ]
916
  trace_step["value"] += 1
917
 
918
  episode_records = []
919
+ for index, (env, reward) in enumerate(zip(environments, rewards)):
920
  record = episode_record_from_state(
921
  env._env.state,
922
  run_context={
923
  "base_model": model_name,
924
  "algo": "grpo",
925
+ "reward_version": "reward_v2",
926
  "env_version": "0.1.0",
927
  },
928
  )
929
  record.update(
930
  {
931
  "reward_total": reward,
932
+ "reward_token_penalty": float(env._env.state.metrics.get("token_penalty", 0.0)),
933
+ "completion_tokens": completion_tokens[index] if index < len(completion_tokens) else 0,
934
  "success": bool(getattr(env, "success", False)),
935
  }
936
  )
 
1007
  log_trackio_metrics(
1008
  {
1009
  "system/model_cache_hit": float(cache_hit),
1010
+ "system/scenario_cache_required": 1.0,
1011
+ "system/scenario_cache_entries": float(
1012
+ scenario_cache_coverage.get("entries", 0)
1013
+ ),
1014
  "system/hub_push_enabled": float(push_to_hub),
1015
  },
1016
  step=int(state.global_step or 0),
 
1055
  print(f"Output repo: {output_repo_id}")
1056
  print(f"Run name: {run_name}")
1057
  print(f"Model cache volume: {CACHE_VOLUME_NAME}")
1058
+ print(f"Scenario cache volume: {SCENARIO_CACHE_VOLUME_NAME}")
1059
+ print(f"Scenario cache dir: {scenario_cache_env['CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR']}")
1060
+ print("Scenario cache mode: require")
1061
+ print(f"Scenario cache coverage: {scenario_cache_coverage}")
1062
  print(f"HF_HOME: {cache_env['HF_HOME']}")
1063
  print(f"HF_HUB_CACHE: {cache_env['HF_HUB_CACHE']}")
1064
  print(f"Torch cache: {cache_env['TORCH_HOME']}")
 
1093
  )
1094
 
1095
  print(f"Loading model with Unsloth from_pretrained: {model_name}")
1096
+ model_api = FastVisionModel
1097
  model, tokenizer = model_api.from_pretrained(
1098
  model_name=model_name,
1099
  max_seq_length=max_seq_length,
 
1108
  try:
1109
  tokenizer = add_response_schema(tokenizer)
1110
  except Exception as exc:
1111
+ print(
1112
+ "Skipped adding the tokenizer response schema for the Gemma 4 processor, "
1113
+ "matching the Unsloth Gemma 4 GRPO notebook pattern: "
1114
+ f"{exc!r}"
1115
+ )
 
1116
 
1117
  model = model_api.get_peft_model(
1118
  model,
 
1232
  print("Skipping Hub push for this run. Pass --push-to-hub to upload adapters.")
1233
  volume.commit()
1234
  cache_volume.commit()
1235
+ scenario_cache_volume.commit()
1236
  print(f"Committed run volume: {VOLUME_NAME}")
1237
  print(f"Committed model cache volume: {CACHE_VOLUME_NAME}")
1238
+ print(f"Committed scenario cache volume: {SCENARIO_CACHE_VOLUME_NAME}")
1239
  try:
1240
  trackio.finish()
1241
  except RuntimeError as exc:
 
1258
  "repo_url": repo_url,
1259
  "repo_branch": repo_branch,
1260
  "push_to_hub": push_to_hub,
1261
+ "scenario_cache_volume": SCENARIO_CACHE_VOLUME_NAME,
1262
+ "scenario_cache_mode": "require",
1263
  }
1264
 
1265
 
 
1278
  lora_rank: int = 32,
1279
  trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
1280
  trackio_project: str = "CyberSecurity_OWASP-grpo",
1281
+ num_generations: int = 6,
1282
  seed_start: int = 0,
1283
  git_sha: str = "nogit",
1284
  source_mode: str = "local",
 
1286
  repo_branch: str = PUBLIC_REPO_BRANCH,
1287
  detach: bool = False,
1288
  push_to_hub: bool = False,
1289
+ cache_seed_start: int = 0,
1290
+ cache_difficulty_buckets: int = 0,
1291
+ cache_train_per_bucket: int = 0,
1292
+ cache_validation_per_bucket: int = 0,
1293
+ cache_heldout_per_bucket: int = 0,
1294
+ cache_force: bool = False,
1295
  ) -> None:
1296
+ model_name = _ensure_gemma4_model(model_name)
1297
+ if mode == "prepare-cache":
1298
+ result = prepare_modal_scenario_cache.remote(
1299
+ seed_start=cache_seed_start,
1300
+ difficulty_buckets=cache_difficulty_buckets,
1301
+ train_per_bucket=cache_train_per_bucket,
1302
+ validation_per_bucket=cache_validation_per_bucket,
1303
+ heldout_per_bucket=cache_heldout_per_bucket,
1304
+ force=cache_force,
1305
+ )
1306
+ print(f"Prepared scenario cache: {result}")
1307
+ return
1308
  if mode == "config":
1309
  result = check_training_imports.remote()
1310
  print(result)
1311
  return
1312
  if mode != "train":
1313
+ raise ValueError("mode must be 'prepare-cache', 'train', or 'config'")
1314
 
1315
  trackio_space_id = trackio_space_id or os.environ.get(
1316
  "TRACKIO_SPACE_ID",
 
1376
  )
1377
  print(f"Hub push enabled: {push_to_hub}")
1378
  print(f"Model cache volume: {CACHE_VOLUME_NAME}")
1379
+ print(f"Scenario cache volume: {SCENARIO_CACHE_VOLUME_NAME}")
1380
  print("Launch phases:")
1381
  print(
1382
  "1. Modal image build/validation: happens before remote Python logs; "
1383
  "slow when local source or dependency layers changed."
1384
  )
1385
+ print("2. CPU-only scenario cache preflight in CyberSecurity_OWASP-scenario-cache.")
1386
+ print("3. GPU container start on one L4 only after cache preflight passes.")
1387
+ print("4. Model cache check in CyberSecurity_OWASP-model-cache.")
1388
+ print("5. Cached snapshot load into GPU RAM with Unsloth progress.")
1389
+ print("6. GRPO steps, Trackio sync, and volume commit.")
1390
  print(
1391
  "If there is a long pause after trainer.train() starts, watch for "
1392
  "Training heartbeat lines every 30 seconds."
 
1414
  repo_branch=repo_branch,
1415
  push_to_hub=push_to_hub,
1416
  )
1417
+ preflight = verify_modal_scenario_cache_for_training.remote(
1418
+ split=split,
1419
+ difficulty=difficulty,
1420
+ dataset_size=dataset_size,
1421
+ seed_start=seed_start,
1422
+ )
1423
+ print(f"CPU scenario cache preflight passed: {preflight}")
1424
  if detach:
1425
  call = train_cybersecurity_owasp_grpo.spawn(**kwargs)
1426
  print(f"Spawned Modal training call: {call.object_id}")
server/CyberSecurity_OWASP_environment.py CHANGED
@@ -4,12 +4,15 @@ from __future__ import annotations
4
 
5
  import json
6
  import shutil
 
 
7
  from typing import Any
8
  from uuid import uuid4
9
 
10
  from openenv.core.env_server.interfaces import Environment
11
 
12
  try:
 
13
  from ..models import (
14
  CyberSecurityOWASPAction,
15
  CyberSecurityOWASPObservation,
@@ -20,14 +23,19 @@ try:
20
  from .curriculum import CurriculumController
21
  from .episode_logger import EpisodeArtifactLogger
22
  from .reward_engine import evaluate_action
 
 
23
  from .scenario_factory import ScenarioFactory
24
  except ImportError: # pragma: no cover
 
25
  from models import CyberSecurityOWASPAction, CyberSecurityOWASPObservation, CyberSecurityOWASPState
26
  from validators import detect_cheating
27
  from server.action_tools import ActionTools
28
  from server.curriculum import CurriculumController
29
  from server.episode_logger import EpisodeArtifactLogger
30
  from server.reward_engine import evaluate_action
 
 
31
  from server.scenario_factory import ScenarioFactory
32
 
33
 
@@ -40,7 +48,7 @@ ALLOWED_TOOLS = {
40
  "search_code",
41
  "send_local_request",
42
  "compare_identities",
43
- "submit_finding",
44
  "noop",
45
  },
46
  "patch": {
@@ -80,21 +88,31 @@ class CybersecurityOwaspEnvironment(
80
  episode_id: str | None = None,
81
  split: str = "train",
82
  difficulty: int = 0,
 
83
  **_: Any,
84
  ) -> CyberSecurityOWASPObservation:
 
85
  self.close()
 
 
86
  actual_seed = int(seed if seed is not None else 0)
87
  curriculum_profile = self._curriculum.select_profile(
88
  seed=actual_seed,
89
  split=split,
90
  requested_difficulty=difficulty,
91
  )
92
- scenario = self._scenario_factory.compile_scenario(
93
  actual_seed,
94
  split=split,
95
- difficulty=difficulty,
96
  curriculum_profile=curriculum_profile,
 
 
97
  )
 
 
 
 
98
  self._state = CyberSecurityOWASPState(
99
  episode_id=episode_id or str(uuid4()),
100
  task_id=scenario["task_id"],
@@ -107,6 +125,12 @@ class CybersecurityOwaspEnvironment(
107
  scenario_family=scenario["scenario_family"],
108
  template_id=scenario["template_id"],
109
  target_weakness=scenario["target_weakness"],
 
 
 
 
 
 
110
  phase="discover",
111
  step_count=0,
112
  max_steps=40,
@@ -115,7 +139,17 @@ class CybersecurityOwaspEnvironment(
115
  visible_facts={"workspace_summary": scenario["workspace_summary"]},
116
  hidden_facts=scenario["hidden_facts"],
117
  curriculum_snapshot=scenario["curriculum_snapshot"],
118
- metrics={"reset_count": 1},
119
  )
120
  self._task_brief = scenario["task_brief"]
121
  self._visible_policy_hint = scenario["public_hint"]
@@ -123,6 +157,51 @@ class CybersecurityOwaspEnvironment(
123
  self._last_done_observation = None
124
  return self._observation("Scenario ready. Start in discover phase.", reward=0.0)
125
 
126
  def step(
127
  self,
128
  action: CyberSecurityOWASPAction,
@@ -195,8 +274,6 @@ class CybersecurityOwaspEnvironment(
195
  def _execute(
196
  self, action: CyberSecurityOWASPAction, anti_cheat_flags: list[str]
197
  ) -> tuple[str, dict, dict[str, float], str | None]:
198
- verifier, reward = evaluate_action(self._state, action, anti_cheat_flags)
199
-
200
  if action.tool_name in {
201
  "noop",
202
  "inspect_policy_graph",
@@ -213,16 +290,20 @@ class CybersecurityOwaspEnvironment(
213
  self._visible_policy_hint,
214
  self._workspace_summary,
215
  ).execute(action)
 
216
  return result.message, verifier, reward, result.visible_test_result
217
- if action.tool_name == "submit_finding":
218
  verifier, reward = evaluate_action(self._state, action, anti_cheat_flags)
219
  self._state.verification_summary = verifier
220
- if verifier.get("finding", {}).get("valid"):
 
 
221
  self._state.finding_submitted = True
222
  self._state.phase = "patch"
223
- return "Finding accepted. Patch phase unlocked.", verifier, reward, None
224
- return "Finding was not specific enough to unlock patching.", verifier, reward, None
225
  if action.tool_name == "run_visible_tests":
 
226
  verifier, reward = evaluate_action(self._state, action, anti_cheat_flags)
227
  self._state.verification_summary = verifier
228
  visible_tests = json.dumps(verifier.get("visible", {}), indent=2, sort_keys=True)
@@ -256,6 +337,11 @@ class CybersecurityOwaspEnvironment(
256
  self._state.last_reward = float(reward.get("total", 0.0))
257
  self._state.accumulated_reward += self._state.last_reward
258
  self._state.reward_history.append(reward)
 
 
 
 
 
259
  if self._state.step_count >= self._state.max_steps and not self._state.done:
260
  self._state.done = True
261
  self._state.phase = "done"
 
4
 
5
  import json
6
  import shutil
7
+ import time
8
+ from dataclasses import asdict
9
  from typing import Any
10
  from uuid import uuid4
11
 
12
  from openenv.core.env_server.interfaces import Environment
13
 
14
  try:
15
+ from ..config import load_scenario_authoring_config
16
  from ..models import (
17
  CyberSecurityOWASPAction,
18
  CyberSecurityOWASPObservation,
 
23
  from .curriculum import CurriculumController
24
  from .episode_logger import EpisodeArtifactLogger
25
  from .reward_engine import evaluate_action
26
+ from ..rewards import should_terminate_for_flags
27
+ from .scenario_cache import ScenarioCache, ScenarioCacheMiss, cache_key_for_scenario
28
  from .scenario_factory import ScenarioFactory
29
  except ImportError: # pragma: no cover
30
+ from config import load_scenario_authoring_config
31
  from models import CyberSecurityOWASPAction, CyberSecurityOWASPObservation, CyberSecurityOWASPState
32
  from validators import detect_cheating
33
  from server.action_tools import ActionTools
34
  from server.curriculum import CurriculumController
35
  from server.episode_logger import EpisodeArtifactLogger
36
  from server.reward_engine import evaluate_action
37
+ from rewards import should_terminate_for_flags
38
+ from server.scenario_cache import ScenarioCache, ScenarioCacheMiss, cache_key_for_scenario
39
  from server.scenario_factory import ScenarioFactory
40
 
41
 
 
48
  "search_code",
49
  "send_local_request",
50
  "compare_identities",
51
+ "submit_diagnosis",
52
  "noop",
53
  },
54
  "patch": {
 
88
  episode_id: str | None = None,
89
  split: str = "train",
90
  difficulty: int = 0,
91
+ family_budget: dict[str, Any] | None = None,
92
  **_: Any,
93
  ) -> CyberSecurityOWASPObservation:
94
+ reset_started = time.perf_counter()
95
  self.close()
96
+ settings = load_scenario_authoring_config()
97
+ self._curriculum.settings = settings
98
  actual_seed = int(seed if seed is not None else 0)
99
  curriculum_profile = self._curriculum.select_profile(
100
  seed=actual_seed,
101
  split=split,
102
  requested_difficulty=difficulty,
103
  )
104
+ scenario = self._load_or_compile_scenario(
105
  actual_seed,
106
  split=split,
107
+ requested_difficulty=difficulty,
108
  curriculum_profile=curriculum_profile,
109
+ family_budget=family_budget,
110
+ settings=settings,
111
  )
112
+ cache_info = dict(scenario.get("cache", {}))
113
+ cache_key = cache_info.get("cache_key", {})
114
+ scenario_hash = str(cache_info.get("scenario_hash", ""))
115
+ reset_latency_ms = (time.perf_counter() - reset_started) * 1000
116
  self._state = CyberSecurityOWASPState(
117
  episode_id=episode_id or str(uuid4()),
118
  task_id=scenario["task_id"],
 
125
  scenario_family=scenario["scenario_family"],
126
  template_id=scenario["template_id"],
127
  target_weakness=scenario["target_weakness"],
128
+ cache_key=cache_key,
129
+ scenario_hash=scenario_hash,
130
+ generator_version=str(cache_key.get("generator_version", settings.runtime.generator_version)),
131
+ verifier_version=str(cache_key.get("verifier_version", settings.runtime.verifier_version)),
132
+ cache_hit=bool(cache_info.get("hit", False)),
133
+ reset_latency_ms=reset_latency_ms,
134
  phase="discover",
135
  step_count=0,
136
  max_steps=40,
 
139
  visible_facts={"workspace_summary": scenario["workspace_summary"]},
140
  hidden_facts=scenario["hidden_facts"],
141
  curriculum_snapshot=scenario["curriculum_snapshot"],
142
+ metrics={
143
+ "reset_count": 1,
144
+ "reset_latency_ms": reset_latency_ms,
145
+ "scenario_cache_hit": bool(cache_info.get("hit", False)),
146
+ "scenario_cache_mode": settings.runtime.cache_mode,
147
+ "scenario_cache_key": cache_key,
148
+ "scenario_hash": scenario_hash,
149
+ "scenario_bundle_load_latency_ms": float(cache_info.get("load_latency_ms", 0.0)),
150
+ "scenario_compile_latency_ms": float(cache_info.get("compile_latency_ms", 0.0)),
151
+ "scenario_cache_dir": settings.runtime.cache_dir,
152
+ },
153
  )
154
  self._task_brief = scenario["task_brief"]
155
  self._visible_policy_hint = scenario["public_hint"]
 
157
  self._last_done_observation = None
158
  return self._observation("Scenario ready. Start in discover phase.", reward=0.0)
159
 
160
+ def _load_or_compile_scenario(
161
+ self,
162
+ seed: int,
163
+ *,
164
+ split: str,
165
+ requested_difficulty: int,
166
+ curriculum_profile: dict[str, Any],
167
+ family_budget: dict[str, Any] | None,
168
+ settings: Any,
169
+ ) -> dict[str, Any]:
170
+ difficulty = int(curriculum_profile.get("difficulty", requested_difficulty))
171
+ if settings.runtime.cache_mode != "disabled":
172
+ cache = ScenarioCache(settings.runtime.cache_dir, settings=settings)
173
+ try:
174
+ cached = cache.load_bundle(
175
+ seed=seed,
176
+ split=split,
177
+ difficulty=difficulty,
178
+ family_budget=family_budget,
179
+ )
180
+ return cached.scenario
181
+ except ScenarioCacheMiss as exc:
182
+ if settings.runtime.cache_mode == "require":
183
+ raise RuntimeError(
184
+ "Scenario cache miss in required mode. Run cache prep before "
185
+ "training/eval; runtime reset must not compile scenarios. "
186
+ f"Details: {exc}"
187
+ ) from exc
188
+
189
+ compile_started = time.perf_counter()
190
+ scenario = self._scenario_factory.compile_scenario(
191
+ seed,
192
+ split=split,
193
+ difficulty=requested_difficulty,
194
+ curriculum_profile=curriculum_profile,
195
+ )
196
+ key = cache_key_for_scenario(scenario, settings=settings)
197
+ scenario["cache"] = {
198
+ "hit": False,
199
+ "cache_key": asdict(key),
200
+ "scenario_hash": key.scenario_hash,
201
+ "compile_latency_ms": (time.perf_counter() - compile_started) * 1000,
202
+ }
203
+ return scenario
204
+
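The cache-mode branching in `_load_or_compile_scenario` can be sketched with the cache and factory stubbed out; `"prefer"` below is an assumed name standing in for any non-`require`, non-`disabled` mode:

```python
class ScenarioCacheMiss(Exception):
    """Raised when the requested scenario bundle is absent from the cache."""

def load_or_compile(cache_mode: str, load_bundle, compile_scenario):
    # Non-disabled modes try the cache first; "require" turns a miss into a
    # hard error so runtime resets never compile scenarios on the fly.
    if cache_mode != "disabled":
        try:
            return load_bundle()
        except ScenarioCacheMiss as exc:
            if cache_mode == "require":
                raise RuntimeError(f"Scenario cache miss in required mode: {exc}") from exc
    return compile_scenario()

def missing_bundle():
    raise ScenarioCacheMiss("no bundle for seed")

print(load_or_compile("prefer", missing_bundle, lambda: "compiled"))    # compiled
print(load_or_compile("disabled", missing_bundle, lambda: "compiled"))  # compiled
```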
205
  def step(
206
  self,
207
  action: CyberSecurityOWASPAction,
 
274
  def _execute(
275
  self, action: CyberSecurityOWASPAction, anti_cheat_flags: list[str]
276
  ) -> tuple[str, dict, dict[str, float], str | None]:
 
 
277
  if action.tool_name in {
278
  "noop",
279
  "inspect_policy_graph",
 
290
  self._visible_policy_hint,
291
  self._workspace_summary,
292
  ).execute(action)
293
+ verifier, reward = evaluate_action(self._state, action, anti_cheat_flags)
294
  return result.message, verifier, reward, result.visible_test_result
295
+ if action.tool_name == "submit_diagnosis":
296
  verifier, reward = evaluate_action(self._state, action, anti_cheat_flags)
297
  self._state.verification_summary = verifier
298
+ self._state.diagnosis = dict(action.arguments or {})
299
+ if verifier.get("diagnosis", {}).get("valid"):
300
+ self._state.diagnosis_submitted = True
301
  self._state.finding_submitted = True
302
  self._state.phase = "patch"
303
+ return "Diagnosis recorded. Patch phase unlocked.", verifier, reward, None
304
+ return "Diagnosis was not specific enough to unlock patching.", verifier, reward, None
305
  if action.tool_name == "run_visible_tests":
306
+ self._state.visible_test_count += 1
307
  verifier, reward = evaluate_action(self._state, action, anti_cheat_flags)
308
  self._state.verification_summary = verifier
309
  visible_tests = json.dumps(verifier.get("visible", {}), indent=2, sort_keys=True)
 
337
  self._state.last_reward = float(reward.get("total", 0.0))
338
  self._state.accumulated_reward += self._state.last_reward
339
  self._state.reward_history.append(reward)
340
+ flags = list((verifier or {}).get("anti_cheat_flags", []) or [])
341
+ if flags and should_terminate_for_flags(flags):
342
+ self._state.done = True
343
+ self._state.phase = "done"
344
+ self._state.failure_reason = "anti_cheat_violation"
345
  if self._state.step_count >= self._state.max_steps and not self._state.done:
346
  self._state.done = True
347
  self._state.phase = "done"
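A sketch of the flag-based early termination above. The flag names and the rule that any hard flag ends the episode are assumptions here; the real policy lives in `rewards.should_terminate_for_flags`:

```python
HARD_FLAGS = {"test_tampering", "policy_file_edit"}  # hypothetical flag names

def should_terminate_for_flags(flags: list[str]) -> bool:
    # One hard violation is enough to end the episode with
    # failure_reason="anti_cheat_violation", as in the step loop above.
    return any(flag in HARD_FLAGS for flag in flags)

print(should_terminate_for_flags(["test_tampering"]))  # True
print(should_terminate_for_flags(["minor_warning"]))   # False
```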
server/__init__.py CHANGED
@@ -10,6 +10,7 @@ from .adversarial_designer import BoundedAdversarialDesigner
10
  from .CyberSecurity_OWASP_environment import CybersecurityOwaspEnvironment
11
  from .curriculum import CurriculumController
12
  from .scenario_factory import ScenarioFactory
 
13
  from .verifier import MultiLayerVerifier
14
 
15
  __all__ = [
@@ -17,5 +18,6 @@ __all__ = [
17
  "CurriculumController",
18
  "CybersecurityOwaspEnvironment",
19
  "MultiLayerVerifier",
 
20
  "ScenarioFactory",
21
  ]
 
10
  from .CyberSecurity_OWASP_environment import CybersecurityOwaspEnvironment
11
  from .curriculum import CurriculumController
12
  from .scenario_factory import ScenarioFactory
13
+ from .scenario_cache import ScenarioCache
14
  from .verifier import MultiLayerVerifier
15
 
16
  __all__ = [
 
18
  "CurriculumController",
19
  "CybersecurityOwaspEnvironment",
20
  "MultiLayerVerifier",
21
+ "ScenarioCache",
22
  "ScenarioFactory",
23
  ]
server/app_sandbox.py CHANGED
@@ -59,6 +59,7 @@ class AppSandbox:
59
  )
60
  )
61
  self.state.patch_diff = patch_diff
 
62
  files_touched = self.state.metrics.setdefault("files_touched", [])
63
  if path not in files_touched:
64
  files_touched.append(path)
@@ -84,7 +85,14 @@ class AppSandbox:
84
  def send_local_request(self, method: str, path: str, user_id: str | None = None) -> dict[str, Any]:
85
  if not is_local_route(path):
86
  raise ValueError("send_local_request only accepts local route paths")
87
- return simulate_request(self.state, method, path, user_id)
88
 
89
  def compare_identities(
90
  self,
@@ -95,11 +103,59 @@ class AppSandbox:
95
  ) -> dict[str, Any]:
96
  if not is_local_route(path):
97
  raise ValueError("compare_identities only accepts local route paths")
98
  return {
99
- "first": simulate_request(self.state, method, path, first_user_id),
100
- "second": simulate_request(self.state, method, path, second_user_id),
 
101
  }
102
 
103
  def _resolve_path(self, path: str, *, write: bool = False) -> Path:
104
  allowed, normalized_or_error = is_path_allowed(self.state, path, write=write)
105
  if not allowed:
 
59
  )
60
  )
61
  self.state.patch_diff = patch_diff
62
+ self.state.patch_attempt_count += 1
63
  files_touched = self.state.metrics.setdefault("files_touched", [])
64
  if path not in files_touched:
65
  files_touched.append(path)
 
85
  def send_local_request(self, method: str, path: str, user_id: str | None = None) -> dict[str, Any]:
86
  if not is_local_route(path):
87
  raise ValueError("send_local_request only accepts local route paths")
88
+ response = simulate_request(self.state, method, path, user_id)
89
+ trace_id = self._record_request_trace(
90
+ method=method,
91
+ path=path,
92
+ user_id=user_id,
93
+ status=int(response.get("status", 0) or 0),
94
+ )
95
+ return {"trace_id": trace_id, **response}
96
 
97
  def compare_identities(
98
  self,
 
103
  ) -> dict[str, Any]:
104
  if not is_local_route(path):
105
  raise ValueError("compare_identities only accepts local route paths")
106
+ first = simulate_request(self.state, method, path, first_user_id)
107
+ second = simulate_request(self.state, method, path, second_user_id)
108
+ trace_id = self._record_request_trace(
109
+ method=method,
110
+ path=path,
111
+ user_id=first_user_id,
112
+ status=int(first.get("status", 0) or 0),
113
+ comparison_user_id=second_user_id,
114
+ comparison_status=int(second.get("status", 0) or 0),
115
+ )
116
  return {
117
+ "trace_id": trace_id,
118
+ "first": first,
119
+ "second": second,
120
  }
121
 
122
+ def _record_request_trace(
123
+ self,
124
+ *,
125
+ method: str,
126
+ path: str,
127
+ user_id: str | None,
128
+ status: int,
129
+ comparison_user_id: str | None = None,
130
+ comparison_status: int | None = None,
131
+ ) -> str:
132
+ trace_id = f"req_{len(self.state.request_trace) + 1:03d}"
133
+ hidden = self.state.hidden_facts
134
+ unauthorized_success = (
135
+ str(hidden.get("other_invoice_id", "")) in path
136
+ and user_id == hidden.get("owner_user_id")
137
+ and status == 200
138
+ )
139
+ if comparison_user_id is not None and comparison_status is not None:
140
+ unauthorized_success = unauthorized_success or (
141
+ str(hidden.get("other_invoice_id", "")) in path
142
+ and comparison_user_id == hidden.get("owner_user_id")
143
+ and comparison_status == 200
144
+ )
145
+ self.state.request_trace.append(
146
+ {
147
+ "trace_id": trace_id,
148
+ "method": method.upper(),
149
+ "path": path,
150
+ "user_id": user_id,
151
+ "status": status,
152
+ "comparison_user_id": comparison_user_id,
153
+ "comparison_status": comparison_status,
154
+ "unauthorized_success": unauthorized_success,
155
+ }
156
+ )
157
+ return trace_id
158
+
159
  def _resolve_path(self, path: str, *, write: bool = False) -> Path:
160
  allowed, normalized_or_error = is_path_allowed(self.state, path, write=write)
161
  if not allowed:
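The `_record_request_trace` bookkeeping above can be reproduced standalone: trace ids are sequential, and a request is flagged as an unauthorized success when the probing user retrieves the foreign invoice with HTTP 200. This is a simplified sketch with module-level state standing in for `AppSandbox`:

```python
# Standalone sketch of the request-trace bookkeeping from the diff.
hidden_facts = {"other_invoice_id": "inv_42", "owner_user_id": "user_a"}
request_trace = []

def record_request_trace(method, path, user_id, status):
    # Sequential ids: req_001, req_002, ...
    trace_id = f"req_{len(request_trace) + 1:03d}"
    unauthorized_success = (
        str(hidden_facts.get("other_invoice_id", "")) in path
        and user_id == hidden_facts.get("owner_user_id")
        and status == 200
    )
    request_trace.append(
        {
            "trace_id": trace_id,
            "method": method.upper(),
            "path": path,
            "user_id": user_id,
            "status": status,
            "unauthorized_success": unauthorized_success,
        }
    )
    return trace_id

first = record_request_trace("get", "/invoices/inv_42", "user_a", 200)
second = record_request_trace("get", "/invoices/inv_42", "user_a", 403)
print(first, second)  # req_001 req_002
```

The 200 response on the foreign invoice is flagged; the 403 on the same route is not, which is exactly the evidence the diagnosis step later consumes.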
server/curriculum.py CHANGED
@@ -7,12 +7,14 @@ from dataclasses import dataclass, field
7
  from typing import Any
8
 
9
  try:
 
10
  from ..models import CyberSecurityOWASPState
11
  except ImportError: # pragma: no cover
 
12
  from models import CyberSecurityOWASPState
13
 
14
 
15
- DIFFICULTY_TIERS = ("warmup", "beginner", "intermediate", "advanced", "expert")
16
  WEAKNESS_TARGETS = (
17
  "same_role_cross_object",
18
  "cross_tenant_boundary",
@@ -31,6 +33,7 @@ class CurriculumController:
31
  outcomes_by_target: dict[str, list[bool]] = field(default_factory=lambda: defaultdict(list))
32
  failures_by_target: dict[str, int] = field(default_factory=lambda: defaultdict(int))
33
  episodes_seen: int = 0
 
34
 
35
  def select_profile(
36
  self,
@@ -48,7 +51,7 @@ class CurriculumController:
48
  )
49
  return {
50
  "difficulty": difficulty,
51
- "difficulty_tier": DIFFICULTY_TIERS[min(difficulty, len(DIFFICULTY_TIERS) - 1)],
52
  "target_weakness": target,
53
  "split": split,
54
  "episodes_seen": self.episodes_seen,
@@ -82,11 +85,12 @@ class CurriculumController:
82
  }
83
 
84
  def _difficulty_for_split(self, split: str, requested_difficulty: int) -> int:
85
- difficulty = max(0, min(int(requested_difficulty), len(DIFFICULTY_TIERS) - 1))
 
86
  if split == "hidden_eval":
87
- return max(3, difficulty)
88
  if self.episodes_seen >= self.window_size and self._recent_reward_mean() > 10.0:
89
- return min(difficulty + 1, len(DIFFICULTY_TIERS) - 1)
90
  return difficulty
91
 
92
  def _target_for_seed(self, seed: int, split: str) -> str:
@@ -97,3 +101,7 @@ class CurriculumController:
97
  if not self.reward_trend:
98
  return 0.0
99
  return sum(self.reward_trend) / len(self.reward_trend)
7
  from typing import Any
8
 
9
  try:
10
+ from ..config import ScenarioAuthoringSettings, load_scenario_authoring_config
11
  from ..models import CyberSecurityOWASPState
12
  except ImportError: # pragma: no cover
13
+ from config import ScenarioAuthoringSettings, load_scenario_authoring_config
14
  from models import CyberSecurityOWASPState
15
 
16
 
17
+ DIFFICULTY_TIERS = ("D0", "D1", "D2", "D3")
18
  WEAKNESS_TARGETS = (
19
  "same_role_cross_object",
20
  "cross_tenant_boundary",
 
33
  outcomes_by_target: dict[str, list[bool]] = field(default_factory=lambda: defaultdict(list))
34
  failures_by_target: dict[str, int] = field(default_factory=lambda: defaultdict(int))
35
  episodes_seen: int = 0
36
+ settings: ScenarioAuthoringSettings = field(default_factory=load_scenario_authoring_config)
37
 
38
  def select_profile(
39
  self,
 
51
  )
52
  return {
53
  "difficulty": difficulty,
54
+ "difficulty_tier": self._difficulty_label(difficulty),
55
  "target_weakness": target,
56
  "split": split,
57
  "episodes_seen": self.episodes_seen,
 
85
  }
86
 
87
  def _difficulty_for_split(self, split: str, requested_difficulty: int) -> int:
88
+ max_difficulty = self.settings.curriculum.difficulty_bucket_count - 1
89
+ difficulty = max(0, min(int(requested_difficulty), max_difficulty))
90
  if split == "hidden_eval":
91
+ return max(min(3, max_difficulty), difficulty)
92
  if self.episodes_seen >= self.window_size and self._recent_reward_mean() > 10.0:
93
+ return min(difficulty + 1, max_difficulty)
94
  return difficulty
95
 
96
  def _target_for_seed(self, seed: int, split: str) -> str:
 
101
  if not self.reward_trend:
102
  return 0.0
103
  return sum(self.reward_trend) / len(self.reward_trend)
104
+
105
+ def _difficulty_label(self, difficulty: int) -> str:
106
+ labels = self.settings.curriculum.difficulty_labels
107
+ return labels[min(max(0, difficulty), len(labels) - 1)]
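The clamping in `_difficulty_label` above degrades gracefully for out-of-range indices; a minimal sketch using the new tier labels:

```python
# Mirrors _difficulty_label: clamp the index into [0, len(labels) - 1].
DIFFICULTY_TIERS = ("D0", "D1", "D2", "D3")

def difficulty_label(difficulty: int) -> str:
    return DIFFICULTY_TIERS[min(max(0, difficulty), len(DIFFICULTY_TIERS) - 1)]

print([difficulty_label(d) for d in (-5, 0, 2, 99)])  # ['D0', 'D0', 'D2', 'D3']
```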
server/episode_logger.py CHANGED
@@ -49,6 +49,13 @@ class EpisodeArtifactLogger:
49
  "regression_result": self._verifier_layer(state, "regression"),
50
  "reward_breakdown": state.reward_history[-1] if state.reward_history else {},
51
  "reward_breakdown_by_step": state.reward_history,
52
  "final_status": "resolved" if state.success else "failed",
53
  "failure_reason": state.failure_reason,
54
  "safety_violations": [
 
49
  "regression_result": self._verifier_layer(state, "regression"),
50
  "reward_breakdown": state.reward_history[-1] if state.reward_history else {},
51
  "reward_breakdown_by_step": state.reward_history,
52
+ "total_reward": state.accumulated_reward,
53
+ "final_reward_breakdown": state.reward_history[-1] if state.reward_history else {},
54
+ "progress_reward_total": state.progress_reward_total,
55
+ "completion_tokens": state.completion_tokens,
56
+ "diagnosis_submitted": state.diagnosis_submitted,
57
+ "diagnosis": state.diagnosis,
58
+ "request_trace": state.request_trace,
59
  "final_status": "resolved" if state.success else "failed",
60
  "failure_reason": state.failure_reason,
61
  "safety_violations": [
server/scenario_cache.py ADDED
@@ -0,0 +1,525 @@
1
+ """Versioned executable scenario cache for fast deterministic reset."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import hashlib
6
+ import json
7
+ import os
8
+ import shutil
9
+ import tempfile
10
+ import time
11
+ from dataclasses import asdict, dataclass
12
+ from pathlib import Path
13
+ from typing import Any, Iterable
14
+ from uuid import uuid4
15
+
16
+ try:
17
+ from ..config import ScenarioAuthoringSettings, load_scenario_authoring_config
18
+ from .curriculum import CurriculumController
19
+ from .scenario_factory import ScenarioFactory
20
+ except ImportError: # pragma: no cover
21
+ from config import ScenarioAuthoringSettings, load_scenario_authoring_config
22
+ from server.curriculum import CurriculumController
23
+ from server.scenario_factory import ScenarioFactory
24
+
25
+
26
+ SCENARIO_CACHE_REQUIRED_FILES = (
27
+ "scenario.json",
28
+ "app_source",
29
+ "policy_graph.json",
30
+ "visible_tests.py",
31
+ "hidden_tests.py",
32
+ "oracle_tests.py",
33
+ "expected_exploit_trace.json",
34
+ "reward_config.json",
35
+ "metadata.json",
36
+ )
37
+ MANIFEST_FILE = "manifest.json"
38
+
39
+
40
+ @dataclass(frozen=True)
41
+ class ScenarioCacheKey:
42
+ difficulty_level: int
43
+ authz_bug_type: str
44
+ app_family: str
45
+ framework: str
46
+ policy_shape: str
47
+ tenant_model: str
48
+ exploit_depth: str
49
+ patch_scope: str
50
+ regression_risk: str
51
+ generator_version: str
52
+ verifier_version: str
53
+ scenario_hash: str
54
+
55
+ def stable_id(self) -> str:
56
+ return _stable_hash(asdict(self))[:16]
57
+
58
+ def path_slug(self) -> str:
59
+ return (
60
+ f"d{self.difficulty_level}-{self.authz_bug_type}-"
61
+ f"{self.app_family}-{self.framework}-{self.stable_id()}"
62
+ ).replace("/", "-").replace("_style_python", "")
63
+
64
+
65
+ @dataclass(frozen=True)
66
+ class ScenarioCacheLoad:
67
+ scenario: dict[str, Any]
68
+ bundle_path: Path
69
+ load_latency_ms: float
70
+
71
+
72
+ class ScenarioCacheMiss(RuntimeError):
73
+ """Raised when runtime cache mode requires a bundle that is not present."""
74
+
75
+
76
+ class ScenarioCache:
77
+ """Reads and writes complete executable scenario bundles."""
78
+
79
+ def __init__(
80
+ self,
81
+ root: str | Path,
82
+ *,
83
+ settings: ScenarioAuthoringSettings | None = None,
84
+ ):
85
+ self.root = Path(root)
86
+ self.settings = settings or load_scenario_authoring_config()
87
+
88
+ def write_bundle(self, scenario: dict[str, Any], *, force: bool = False) -> dict[str, Any]:
89
+ key = cache_key_for_scenario(scenario, settings=self.settings)
90
+ bundle_path = self._bundle_path(
91
+ split=str(scenario.get("split", scenario["curriculum_snapshot"].get("split", "train"))),
92
+ difficulty=int(scenario["difficulty"]),
93
+ key=key,
94
+ )
95
+ if bundle_path.exists() and not force:
96
+ metadata = self._read_json(bundle_path / "metadata.json")
97
+ return {"created": False, "bundle_path": str(bundle_path), **metadata}
98
+
99
+ workspace = Path(scenario["workspace"])
100
+ if bundle_path.exists():
101
+ shutil.rmtree(bundle_path)
102
+ bundle_path.mkdir(parents=True, exist_ok=True)
103
+ app_source = bundle_path / "app_source"
104
+ app_source.mkdir(parents=True, exist_ok=True)
105
+
106
+ editable_files = list(scenario["hidden_facts"].get("editable_files", []))
107
+ for rel in editable_files:
108
+ source = workspace / rel
109
+ target = app_source / rel
110
+ target.parent.mkdir(parents=True, exist_ok=True)
111
+ shutil.copy2(source, target)
112
+
113
+ hidden_facts = _cacheable_hidden_facts(scenario["hidden_facts"])
114
+ scenario_record = {
115
+ "schema_version": 1,
116
+ "task_id": scenario["task_id"],
117
+ "seed": _seed_from_task_id(scenario["task_id"]),
118
+ "split": scenario["curriculum_snapshot"].get("split", "train"),
119
+ "difficulty": int(scenario["difficulty"]),
120
+ "difficulty_tier": scenario["difficulty_tier"],
121
+ "domain": scenario["domain"],
122
+ "bug_family": scenario["bug_family"],
123
+ "scenario_family": scenario["scenario_family"],
124
+ "template_id": scenario["template_id"],
125
+ "target_weakness": scenario["target_weakness"],
126
+ "task_brief": scenario["task_brief"],
127
+ "public_hint": scenario["public_hint"],
128
+ "workspace_summary": scenario["workspace_summary"],
129
+ "hidden_facts": hidden_facts,
130
+ "editable_files": editable_files,
131
+ "curriculum_snapshot": scenario.get("curriculum_snapshot", {}),
132
+ "cache_key": asdict(key),
133
+ }
134
+ metadata = {
135
+ "cache_key": asdict(key),
136
+ "scenario_hash": key.scenario_hash,
137
+ "generator_version": self.settings.runtime.generator_version,
138
+ "verifier_version": self.settings.runtime.verifier_version,
139
+ "scenario_author_model": self.settings.scenario_author.model_id,
140
+ "scenario_author_provider": self.settings.scenario_author.provider,
141
+ "difficulty_calibration_strategy": (
142
+ self.settings.curriculum.difficulty_calibration_strategy
143
+ ),
144
+ "validated": True,
145
+ "bundle_files": list(SCENARIO_CACHE_REQUIRED_FILES),
146
+ }
147
+
148
+ _write_json(bundle_path / "scenario.json", scenario_record)
149
+ _write_json(bundle_path / "policy_graph.json", scenario["public_hint"])
150
+ _write_json(bundle_path / "expected_exploit_trace.json", _expected_exploit_trace(hidden_facts))
151
+ _write_json(bundle_path / "reward_config.json", _reward_config())
152
+ _write_json(bundle_path / "metadata.json", metadata)
153
+ (bundle_path / "visible_tests.py").write_text(
154
+ (workspace / "tests/test_visible.py").read_text(encoding="utf-8"),
155
+ encoding="utf-8",
156
+ )
157
+ (bundle_path / "hidden_tests.py").write_text(
158
+ _hidden_tests_contract(),
159
+ encoding="utf-8",
160
+ )
161
+ (bundle_path / "oracle_tests.py").write_text(
162
+ _oracle_tests_contract(),
163
+ encoding="utf-8",
164
+ )
165
+ self._update_manifest(bundle_path, scenario_record, metadata)
166
+ return {"created": True, "bundle_path": str(bundle_path), **metadata}
167
+
168
+ def load_bundle(
169
+ self,
170
+ *,
171
+ seed: int,
172
+ split: str,
173
+ difficulty: int,
174
+ family_budget: dict[str, Any] | None = None,
175
+ ) -> ScenarioCacheLoad:
176
+ del family_budget # reserved for weighted family sampling once multiple families exist
177
+ started = time.perf_counter()
178
+ bundle_path = self.find_bundle(seed=seed, split=split, difficulty=difficulty)
179
+ if bundle_path is None:
180
+ raise ScenarioCacheMiss(
181
+ f"No cached scenario bundle for split={split!r}, difficulty={difficulty}, seed={seed}."
182
+ )
183
+ validate_bundle(bundle_path)
184
+ scenario_record = self._read_json(bundle_path / "scenario.json")
185
+ metadata = self._read_json(bundle_path / "metadata.json")
186
+ workspace = _make_workspace(prefix=f"cybersecurity_owasp_cached_{split}_{seed}_")
187
+ shutil.copytree(bundle_path / "app_source", workspace, dirs_exist_ok=True)
188
+
189
+ editable_files = list(scenario_record["editable_files"])
190
+ hidden_facts = dict(scenario_record["hidden_facts"])
191
+ hidden_facts.update(
192
+ {
193
+ "workspace": str(workspace),
194
+ "editable_files": editable_files,
195
+ "initial_file_hashes": {
196
+ rel: (workspace / rel).read_text(encoding="utf-8")
197
+ for rel in editable_files
198
+ },
199
+ "scenario_cache": {
200
+ "bundle_path": str(bundle_path),
201
+ "cache_key": metadata["cache_key"],
202
+ "scenario_hash": metadata["scenario_hash"],
203
+ "generator_version": metadata["generator_version"],
204
+ "verifier_version": metadata["verifier_version"],
205
+ },
206
+ }
207
+ )
208
+ scenario = {
209
+ "task_id": scenario_record["task_id"],
210
+ "workspace": workspace,
211
+ "domain": scenario_record["domain"],
212
+ "bug_family": scenario_record["bug_family"],
213
+ "scenario_family": scenario_record["scenario_family"],
214
+ "template_id": scenario_record["template_id"],
215
+ "target_weakness": scenario_record["target_weakness"],
216
+ "difficulty": int(scenario_record["difficulty"]),
217
+ "difficulty_tier": scenario_record["difficulty_tier"],
218
+ "curriculum_snapshot": {
219
+ **scenario_record.get("curriculum_snapshot", {}),
220
+ "split": split,
221
+ "cache_key": metadata["cache_key"],
222
+ "scenario_hash": metadata["scenario_hash"],
223
+ },
224
+ "task_brief": scenario_record["task_brief"],
225
+ "public_hint": scenario_record["public_hint"],
226
+ "workspace_summary": scenario_record["workspace_summary"],
227
+ "hidden_facts": hidden_facts,
228
+ "cache": {
229
+ "hit": True,
230
+ "bundle_path": str(bundle_path),
231
+ "cache_key": metadata["cache_key"],
232
+ "scenario_hash": metadata["scenario_hash"],
233
+ "load_latency_ms": (time.perf_counter() - started) * 1000,
234
+ },
235
+ }
236
+ return ScenarioCacheLoad(
237
+ scenario=scenario,
238
+ bundle_path=bundle_path,
239
+ load_latency_ms=float(scenario["cache"]["load_latency_ms"]),
240
+ )
241
+
242
+ def find_bundle(self, *, seed: int, split: str, difficulty: int) -> Path | None:
243
+ entries = [
244
+ entry
245
+ for entry in self._manifest_entries()
246
+ if entry.get("seed") == int(seed)
247
+ and entry.get("split") == split
248
+ and entry.get("difficulty") == int(difficulty)
249
+ and entry.get("validated") is True
250
+ ]
251
+ if not entries:
252
+ return None
253
+ selected = sorted(entries, key=lambda item: str(item.get("scenario_hash", "")))[0]
254
+ path = self.root / str(selected["bundle_path"])
255
+ return path if path.exists() else None
256
+
257
+ def coverage(self) -> dict[str, Any]:
258
+ counts: dict[str, dict[str, int]] = {}
259
+ for entry in self._manifest_entries():
260
+ if not entry.get("validated"):
261
+ continue
262
+ split = str(entry.get("split", "train"))
263
+ difficulty = str(entry.get("difficulty", 0))
264
+ counts.setdefault(split, {})
265
+ counts[split][difficulty] = counts[split].get(difficulty, 0) + 1
266
+ return {"root": str(self.root), "counts": counts, "entries": len(self._manifest_entries())}
267
+
268
+ def assert_coverage(self, *, split: str, difficulty: int | None = None) -> dict[str, Any]:
269
+ coverage = self.coverage()
270
+ required = self.settings.curriculum.minimum_for_split(split)
271
+ difficulties: Iterable[int]
272
+ if difficulty is None:
273
+ difficulties = range(self.settings.curriculum.difficulty_bucket_count)
274
+ else:
275
+ difficulties = [difficulty]
276
+ missing: list[dict[str, int]] = []
277
+ split_counts = coverage["counts"].get(split, {})
278
+ for item in difficulties:
279
+ actual = int(split_counts.get(str(item), 0))
280
+ if actual < required:
281
+ missing.append({"difficulty": int(item), "actual": actual, "required": required})
282
+ if missing:
283
+ raise ScenarioCacheMiss(
284
+ f"Scenario cache coverage is below minimum for split={split!r}: {missing}"
285
+ )
286
+ return coverage
287
+
288
+ def _bundle_path(self, *, split: str, difficulty: int, key: ScenarioCacheKey) -> Path:
289
+ return self.root / split / f"difficulty_{difficulty}" / key.path_slug()
290
+
291
+ def _manifest_entries(self) -> list[dict[str, Any]]:
292
+ manifest_path = self.root / MANIFEST_FILE
293
+ if manifest_path.exists():
294
+ return list(self._read_json(manifest_path).get("entries", []))
295
+ return self._scan_entries()
296
+
297
+ def _scan_entries(self) -> list[dict[str, Any]]:
298
+ entries = []
299
+ for metadata_path in self.root.glob("**/metadata.json"):
300
+ bundle_path = metadata_path.parent
301
+ try:
302
+ validate_bundle(bundle_path)
303
+ scenario = self._read_json(bundle_path / "scenario.json")
304
+ metadata = self._read_json(metadata_path)
305
+ except Exception:
306
+ continue
307
+ entries.append(_manifest_entry(self.root, bundle_path, scenario, metadata))
308
+ return entries
309
+
310
+ def _update_manifest(
311
+ self,
312
+ bundle_path: Path,
313
+ scenario_record: dict[str, Any],
314
+ metadata: dict[str, Any],
315
+ ) -> None:
316
+ self.root.mkdir(parents=True, exist_ok=True)
317
+ manifest_path = self.root / MANIFEST_FILE
318
+ entries = self._manifest_entries()
319
+ entry = _manifest_entry(self.root, bundle_path, scenario_record, metadata)
320
+ entries = [
321
+ item for item in entries if item.get("bundle_path") != entry["bundle_path"]
322
+ ]
323
+ entries.append(entry)
324
+ _write_json(manifest_path, {"schema_version": 1, "entries": sorted(entries, key=lambda item: item["bundle_path"])})
325
+
326
+ def _read_json(self, path: Path) -> dict[str, Any]:
327
+ return json.loads(path.read_text(encoding="utf-8"))
328
+
329
+
330
+ def cache_key_for_scenario(
331
+ scenario: dict[str, Any],
332
+ *,
333
+ settings: ScenarioAuthoringSettings | None = None,
334
+ ) -> ScenarioCacheKey:
335
+ settings = settings or load_scenario_authoring_config()
336
+ workspace_summary = scenario.get("workspace_summary", {})
337
+ hidden = scenario.get("hidden_facts", {})
338
+ stable_payload = {
339
+ "task_id": scenario.get("task_id"),
340
+ "difficulty": scenario.get("difficulty"),
341
+ "domain": scenario.get("domain"),
342
+ "bug_family": scenario.get("bug_family"),
343
+ "scenario_family": scenario.get("scenario_family"),
344
+ "template_id": scenario.get("template_id"),
345
+ "target_weakness": scenario.get("target_weakness"),
346
+ "public_hint": scenario.get("public_hint"),
347
+ "users": hidden.get("users"),
348
+ "invoices": hidden.get("invoices"),
349
+ }
350
+ return ScenarioCacheKey(
351
+ difficulty_level=int(scenario.get("difficulty", 0)),
352
+ authz_bug_type=str(scenario.get("bug_family", "unknown")),
353
+ app_family=str(scenario.get("domain", "unknown")),
354
+ framework=str(workspace_summary.get("framework", "unknown")),
355
+ policy_shape="owner_admin_tenant_policy",
356
+ tenant_model="same_tenant_with_foreign_tenant",
357
+ exploit_depth=str(scenario.get("target_weakness", "direct_object_reference")),
358
+ patch_scope="route_guard",
359
+ regression_risk="owner_admin_public_routes",
360
+ generator_version=settings.runtime.generator_version,
361
+ verifier_version=settings.runtime.verifier_version,
362
+ scenario_hash=_stable_hash(stable_payload),
363
+ )
364
+
365
+
366
+ def validate_bundle(bundle_path: str | Path) -> None:
367
+ path = Path(bundle_path)
368
+ missing = [name for name in SCENARIO_CACHE_REQUIRED_FILES if not (path / name).exists()]
369
+ if missing:
370
+ raise ScenarioCacheMiss(f"Scenario bundle is incomplete at {path}: missing {missing}")
371
+ scenario = json.loads((path / "scenario.json").read_text(encoding="utf-8"))
372
+ editable = set(scenario.get("editable_files", []))
373
+ protected = {"hidden_tests.py", "oracle_tests.py", "reward_config.json", "metadata.json"}
374
+ if editable.intersection(protected):
375
+ raise ScenarioCacheMiss(f"Scenario bundle exposes protected files as editable: {sorted(editable & protected)}")
376
+
377
+
378
+ def prepare_scenario_cache(
379
+ *,
380
+ cache_dir: str | Path | None = None,
381
+ settings: ScenarioAuthoringSettings | None = None,
382
+ seed_start: int = 0,
383
+ force: bool = False,
384
+ ) -> dict[str, Any]:
385
+ settings = settings or load_scenario_authoring_config()
386
+ cache_root = Path(cache_dir or settings.runtime.cache_dir)
387
+ cache = ScenarioCache(cache_root, settings=settings)
388
+ factory = ScenarioFactory()
389
+ curriculum = CurriculumController()
390
+ created: list[dict[str, Any]] = []
391
+ split_counts = {
392
+ "train": settings.curriculum.train_scenarios_per_bucket,
393
+ "validation": settings.curriculum.validation_scenarios_per_bucket,
394
+ "hidden_eval": settings.curriculum.heldout_eval_scenarios_per_bucket,
395
+ }
396
+ for split, per_bucket in split_counts.items():
397
+ for requested_difficulty in range(settings.curriculum.difficulty_bucket_count):
398
+ for index in range(per_bucket):
399
+ seed = int(seed_start) + requested_difficulty * per_bucket + index
400
+ profile = curriculum.select_profile(
401
+ seed=seed,
402
+ split=split,
403
+ requested_difficulty=requested_difficulty,
404
+ )
405
+ scenario = factory.compile_scenario(
406
+ seed,
407
+ split=split,
408
+ difficulty=requested_difficulty,
409
+ curriculum_profile=profile,
410
+ )
411
+ try:
412
+ created.append(cache.write_bundle(scenario, force=force))
413
+ finally:
414
+ workspace = scenario.get("workspace")
415
+ if workspace:
416
+ shutil.rmtree(workspace, ignore_errors=True)
417
+ return {
418
+ "cache_dir": str(cache_root),
419
+ "created": sum(1 for item in created if item.get("created")),
420
+ "seen": len(created),
421
+ "coverage": cache.coverage(),
422
+ "config": {
423
+ "difficulty_bucket_count": settings.curriculum.difficulty_bucket_count,
424
+ "train_scenarios_per_bucket": settings.curriculum.train_scenarios_per_bucket,
425
+ "validation_scenarios_per_bucket": settings.curriculum.validation_scenarios_per_bucket,
426
+ "heldout_eval_scenarios_per_bucket": settings.curriculum.heldout_eval_scenarios_per_bucket,
427
+ "scenario_author_model": settings.scenario_author.model_id,
428
+ },
429
+ }
430
+
431
+
432
+ def _cacheable_hidden_facts(hidden: dict[str, Any]) -> dict[str, Any]:
433
+ return {
434
+ key: value
435
+ for key, value in hidden.items()
436
+ if key not in {"workspace", "editable_files", "initial_file_hashes"}
437
+ }
438
+
439
+
440
+ def _expected_exploit_trace(hidden: dict[str, Any]) -> dict[str, Any]:
441
+ return {
442
+ "method": "GET",
443
+ "path": f"/invoices/{hidden['other_invoice_id']}",
444
+ "actor": hidden["owner_user_id"],
445
+ "vulnerable_status": 200,
446
+ "fixed_status": 403,
447
+ }
448
+
449
+
450
+ def _reward_config() -> dict[str, Any]:
451
+ return {
452
+ "max_reward": 15.0,
453
+ "components": [
454
+ "discovery",
455
+ "security",
456
+ "regression",
457
+ "public_routes",
458
+ "patch_quality",
459
+ "visible_tests",
460
+ "safety",
461
+ "anti_cheat",
462
+ ],
463
+ }
464
+
465
+
466
+ def _hidden_tests_contract() -> str:
467
+ return (
468
+ "# Hidden invariant contract for the deterministic verifier.\n"
469
+ "# Runtime observations must never expose this file.\n"
470
+ "CHECKS = ['same_tenant_other_user_blocked', 'cross_tenant_blocked']\n"
471
+ )
472
+
473
+
474
+ def _oracle_tests_contract() -> str:
475
+ return (
476
+ "# Oracle matrix contract for allowed/denied authorization tuples.\n"
477
+ "# Runtime observations must never expose this file.\n"
478
+ "CHECKS = ['owner_allowed', 'admin_allowed', 'public_allowed', 'cross_tenant_denied']\n"
479
+ )
480
+
481
+
482
+ def _manifest_entry(
483
+ root: Path,
484
+ bundle_path: Path,
485
+ scenario_record: dict[str, Any],
486
+ metadata: dict[str, Any],
487
+ ) -> dict[str, Any]:
488
+ return {
489
+ "bundle_path": str(bundle_path.relative_to(root)).replace("\\", "/"),
490
+ "seed": int(scenario_record.get("seed", 0)),
491
+ "split": str(scenario_record.get("split", "train")),
492
+ "difficulty": int(scenario_record.get("difficulty", 0)),
493
+ "scenario_hash": str(metadata.get("scenario_hash", "")),
494
+ "cache_key": metadata.get("cache_key", {}),
495
+ "validated": bool(metadata.get("validated", False)),
496
+ }
497
+
498
+
499
+ def _make_workspace(prefix: str) -> Path:
500
+ root = Path(os.getenv("CYBERSECURITY_OWASP_WORKSPACE_ROOT", tempfile.gettempdir()))
501
+ root.mkdir(parents=True, exist_ok=True)
502
+ for _ in range(100):
503
+ workspace = root / f"{prefix}{uuid4().hex[:12]}"
504
+ try:
505
+ workspace.mkdir()
506
+ except FileExistsError:
507
+ continue
508
+ return workspace
509
+ raise RuntimeError("Unable to create isolated cached scenario workspace")
510
+
511
+
512
+ def _seed_from_task_id(task_id: str) -> int:
513
+ try:
514
+ return int(task_id.rsplit("-", 1)[-1])
515
+ except ValueError:
516
+ return 0
517
+
518
+
519
+ def _stable_hash(payload: Any) -> str:
520
+ encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
521
+ return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
522
+
523
+
524
+ def _write_json(path: Path, payload: Any) -> None:
525
+ path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
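The `_stable_hash` construction above makes cache keys independent of dict insertion order by canonicalizing the JSON first; `stable_id()` then keeps the leading 16 hex characters. A quick standalone check of that property:

```python
import hashlib
import json

# Same construction as _stable_hash: canonical JSON (sorted keys, compact
# separators, str fallback) so equivalent payloads hash identically.
def stable_hash(payload):
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()

a = stable_hash({"difficulty": 1, "bug_family": "bola_idor"})
b = stable_hash({"bug_family": "bola_idor", "difficulty": 1})
print(a == b, len(a), len(a[:16]))  # True 64 16
```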
server/scenario_factory.py CHANGED
@@ -125,7 +125,7 @@ class ScenarioFactory:
125
  "curriculum_snapshot": profile,
126
  "task_brief": (
127
  "Inspect the generated invoices app and policy. Find the broken "
128
- "authorization behavior, submit a finding with local evidence, patch "
129
  "the app, preserve intended owner/admin/public behavior, then submit."
130
  ),
131
  "public_hint": public_hint,
 
125
  "curriculum_snapshot": profile,
126
  "task_brief": (
127
  "Inspect the generated invoices app and policy. Find the broken "
128
+ "authorization behavior, submit a diagnosis with local evidence, patch "
129
  "the app, preserve intended owner/admin/public behavior, then submit."
130
  ),
131
  "public_hint": public_hint,
server/verifier.py CHANGED
@@ -49,8 +49,9 @@ class MultiLayerVerifier:
49
  "invalid_action": invalid_action,
50
  "repeated_action": self._is_repeated_action(state, action),
51
  }
52
- if action.tool_name == "submit_finding":
53
- verifier_result["finding"] = verify_finding(state, action.arguments)
 
54
  elif action.tool_name == "run_visible_tests":
55
  verifier_result["visible"] = run_visible_tests(state)
56
  elif action.tool_name == "submit_fix":
 
49
  "invalid_action": invalid_action,
50
  "repeated_action": self._is_repeated_action(state, action),
51
  }
52
+ if action.tool_name == "submit_diagnosis":
53
+ verifier_result["diagnosis"] = verify_finding(state, action.arguments)
54
+ verifier_result["finding"] = verifier_result["diagnosis"]
55
  elif action.tool_name == "run_visible_tests":
56
  verifier_result["visible"] = run_visible_tests(state)
57
  elif action.tool_name == "submit_fix":
tests/helpers.py CHANGED
@@ -12,16 +12,28 @@ def make_env(seed: int = 7) -> CybersecurityOwaspEnvironment:
12
 
13
  def submit_valid_finding(env: CybersecurityOwaspEnvironment):
14
  hidden = env.state.hidden_facts
15
  return env.step(
16
  CyberSecurityOWASPAction(
17
- tool_name="submit_finding",
18
  arguments={
19
- "summary": "BOLA IDOR authorization bug lets a same-tenant user read another user's invoice.",
20
- "evidence": (
21
- f"user {hidden['owner_user_id']} can request invoice "
22
- f"{hidden['other_invoice_id']} even though policy requires owner or billing_admin"
23
- ),
24
- "policy_rule": "owner or billing_admin in same tenant only",
25
  },
26
  )
27
  )
 
12
 
13
  def submit_valid_finding(env: CybersecurityOwaspEnvironment):
14
  hidden = env.state.hidden_facts
15
+ trace = env.step(
16
+ CyberSecurityOWASPAction(
17
+ tool_name="send_local_request",
18
+ arguments={
19
+ "method": "GET",
20
+ "path": f"/invoices/{hidden['other_invoice_id']}",
21
+ "user_id": hidden["owner_user_id"],
22
+ },
23
+ )
24
+ )
25
+ trace_id = "req_001"
26
+ if '"trace_id": "req_' in trace.last_tool_result:
27
+ trace_id = trace.last_tool_result.split('"trace_id": "', 1)[1].split('"', 1)[0]
28
  return env.step(
29
  CyberSecurityOWASPAction(
30
+ tool_name="submit_diagnosis",
31
  arguments={
32
+ "bug_class": "idor_ownership_bug",
33
+ "route": "GET /invoices/{invoice_id}",
34
+ "violated_policy_rule": "owner or billing_admin in same tenant only",
35
+ "evidence_trace_ids": [trace_id],
36
+ "fix_plan": "add tenant and owner/admin checks before returning the invoice",
 
37
  },
38
  )
39
  )
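The updated helper above recovers the `trace_id` by string-splitting `last_tool_result` with a `"req_001"` fallback. A sketch of the same extraction done with a real JSON parse (the function name and the `req_` prefix convention are assumptions taken from the helper, not a documented API):

```python
import json

def extract_trace_id(last_tool_result: str, default: str = "req_001") -> str:
    # Prefer an actual JSON parse; fall back to the default when the
    # payload is not JSON or carries no usable trace_id.
    try:
        payload = json.loads(last_tool_result)
    except (TypeError, ValueError):
        return default
    trace_id = payload.get("trace_id") if isinstance(payload, dict) else None
    if isinstance(trace_id, str) and trace_id.startswith("req_"):
        return trace_id
    return default
```

Parsing instead of splitting avoids silently picking up a `trace_id` substring embedded in some other field.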
tests/test_closed_loop_runtime.py CHANGED
@@ -14,7 +14,7 @@ def test_curriculum_selects_profile_and_tracks_mastery():
14
  controller = CurriculumController()
15
  profile = controller.select_profile(seed=3, split="train", requested_difficulty=1)
16
 
17
- assert profile["difficulty_tier"] == "beginner"
18
  assert profile["target_weakness"]
19
  assert "target_mastery" in profile["mastery"]
20
 
@@ -43,7 +43,7 @@ def test_reset_records_scenario_family_and_partial_observability():
43
  serialized_hint = json.dumps(obs.visible_policy_hint).lower()
44
 
45
  assert env.state.scenario_family.startswith("heldout.")
46
- assert env.state.difficulty_tier in {"advanced", "expert"}
47
  assert "oracle_matrix" not in serialized_hint
48
  assert "hidden_tests" not in serialized_hint
49
  assert "injected bug" not in serialized_hint
 
14
  controller = CurriculumController()
15
  profile = controller.select_profile(seed=3, split="train", requested_difficulty=1)
16
 
17
+ assert profile["difficulty_tier"] == "D1"
18
  assert profile["target_weakness"]
19
  assert "target_mastery" in profile["mastery"]
20
 
 
43
  serialized_hint = json.dumps(obs.visible_policy_hint).lower()
44
 
45
  assert env.state.scenario_family.startswith("heldout.")
46
+ assert env.state.difficulty_tier == "D3"
47
  assert "oracle_matrix" not in serialized_hint
48
  assert "hidden_tests" not in serialized_hint
49
  assert "injected bug" not in serialized_hint
tests/test_invalid_actions.py CHANGED
@@ -21,6 +21,8 @@ def test_blocked_hidden_path_does_not_crash():
21
  assert obs.last_action_valid is False
22
  assert "blocked" in (obs.last_action_error or "")
23
  assert "blocked_file_probe" in env.state.anti_cheat_flags
 
 
24
 
25
 
26
  def test_external_request_is_blocked():
@@ -33,6 +35,8 @@ def test_external_request_is_blocked():
33
  )
34
  assert obs.last_action_valid is False
35
  assert "external_network_attempt" in env.state.anti_cheat_flags
 
 
36
 
37
 
38
  def test_visible_tests_are_not_patchable():
 
21
  assert obs.last_action_valid is False
22
  assert "blocked" in (obs.last_action_error or "")
23
  assert "blocked_file_probe" in env.state.anti_cheat_flags
24
+ assert obs.done is True
25
+ assert env.state.failure_reason == "anti_cheat_violation"
26
 
27
 
28
  def test_external_request_is_blocked():
 
35
  )
36
  assert obs.last_action_valid is False
37
  assert "external_network_attempt" in env.state.anti_cheat_flags
38
+ assert obs.done is True
39
+ assert env.state.failure_reason == "anti_cheat_violation"
40
 
41
 
42
  def test_visible_tests_are_not_patchable():
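The new assertions make anti-cheat violations terminal: the episode ends immediately with `failure_reason == "anti_cheat_violation"`. A minimal sketch of that flag-handling rule, using a plain dict in place of the real environment state (helper name and flag set are illustrative):

```python
# Flags that should hard-terminate an episode, per the tests above.
TERMINAL_FLAGS = {"blocked_file_probe", "external_network_attempt"}

def register_anti_cheat_flag(state: dict, flag: str) -> dict:
    # Always record the violation; terminal flags additionally end the
    # episode with a dedicated failure reason.
    state.setdefault("anti_cheat_flags", []).append(flag)
    if flag in TERMINAL_FLAGS:
        state["done"] = True
        state["failure_reason"] = "anti_cheat_violation"
    return state
```

Terminating on the first violation keeps a policy from farming dense shaping reward after it has already been caught probing hidden files or the network.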
tests/test_modal_scenario_cache_static.py ADDED
@@ -0,0 +1,39 @@
1
+ from pathlib import Path
2
+
3
+
4
+ ROOT = Path(__file__).resolve().parents[1]
5
+
6
+
7
+ def test_modal_train_uses_persistent_required_scenario_cache():
8
+ source = (ROOT / "scripts" / "modal_train_grpo.py").read_text(encoding="utf-8")
9
+
10
+ assert "SCENARIO_CACHE_VOLUME_NAME = \"CyberSecurity_OWASP-scenario-cache\"" in source
11
+ assert "SCENARIO_CACHE_DIR = pathlib.Path(\"/scenario-cache\")" in source
12
+ assert "CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE" in source
13
+ assert "\"require\" if required else \"fallback\"" in source
14
+ assert "mode == \"prepare-cache\"" in source
15
+ assert "def verify_modal_scenario_cache_for_training" in source
16
+ assert "CPU scenario cache preflight passed" in source
17
+ assert "scenario_cache.assert_coverage" in source
18
+ assert "volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume, SCENARIO_CACHE_DIR: scenario_cache_volume}" in source
19
+
20
+
21
+ def test_modal_ephemeral_smoke_uses_required_scenario_cache():
22
+ source = (ROOT / "scripts" / "modal_ephemeral_train.py").read_text(encoding="utf-8")
23
+
24
+ assert "SCENARIO_CACHE_VOLUME_NAME = \"CyberSecurity_OWASP-scenario-cache\"" in source
25
+ assert "SCENARIO_CACHE_DIR = Path(\"/scenario-cache\")" in source
26
+ assert "mode == \"prepare-cache\"" in source
27
+ assert "_configure_scenario_cache_env(required=True)" in source
28
+ assert "ScenarioCache(SCENARIO_CACHE_DIR" in source
29
+
30
+
31
+ def test_modal_training_is_pinned_to_gemma4_e2b():
32
+ source = (ROOT / "scripts" / "modal_train_grpo.py").read_text(encoding="utf-8")
33
+
34
+ assert "DEFAULT_GEMMA_MODEL = \"unsloth/gemma-4-E2B-it\"" in source
35
+ assert "def _ensure_gemma4_model(model_name: str) -> str:" in source
36
+ assert "model_name = _ensure_gemma4_model(model_name)" in source
37
+ assert "from unsloth import FastVisionModel" in source
38
+ assert "Qwen" not in source
39
+ assert "FastLanguageModel" not in source
tests/test_reward_config.py ADDED
@@ -0,0 +1,48 @@
1
+ from pathlib import Path
2
+
3
+ import pytest
4
+
5
+ from CyberSecurity_OWASP.reward_config import (
6
+ compute_token_penalty,
7
+ load_reward_settings,
8
+ )
9
+
10
+
11
+ def test_default_reward_config_has_descriptions():
12
+ settings = load_reward_settings()
13
+
14
+ assert settings.mode == "sparse_eval"
15
+ assert settings.training_mode == "dense_train"
16
+ assert settings.value("terminal_cap") == 15.0
17
+ for key, value in settings.raw.items():
18
+ if isinstance(value, dict):
19
+ assert value.get("description")
20
+
21
+
22
+ def test_reward_config_env_overrides(monkeypatch):
23
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
24
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "late")
25
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SHAPING_WEIGHT", "0.25")
26
+
27
+ settings = load_reward_settings()
28
+
29
+ assert settings.mode == "dense_train"
30
+ assert settings.stage == "late"
31
+ assert settings.shaping_weight == 0.25
32
+ assert compute_token_penalty(850, settings) == -0.5
33
+
34
+
35
+ def test_reward_config_rejects_missing_descriptions(monkeypatch):
36
+ config_path = Path("outputs/test_reward_config_bad.yaml")
37
+ config_path.parent.mkdir(parents=True, exist_ok=True)
38
+ config_path.write_text(
39
+ "reward:\n mode: sparse_eval\n policy_inspected:\n value: 0.3\n",
40
+ encoding="utf-8",
41
+ )
42
+ try:
43
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_CONFIG", str(config_path))
44
+
45
+ with pytest.raises(ValueError, match="description"):
46
+ load_reward_settings()
47
+ finally:
48
+ config_path.unlink(missing_ok=True)
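The test above pins `compute_token_penalty(850, settings) == -0.5`, which is consistent with the `token_penalty` block in `grpo_small.yaml` (target 350 tokens, late-stage rate -0.001 per token, cap -0.50). A sketch of a penalty function with those numbers baked in as defaults (the real implementation reads them from the reward settings instead):

```python
def compute_token_penalty(completion_tokens: int,
                          target_tokens: int = 350,
                          per_token_rate: float = -0.001,
                          cap: float = -0.50) -> float:
    # Only tokens beyond the target are penalized, and the total penalty
    # never drops below the configured cap.
    overflow = max(0, completion_tokens - target_tokens)
    return max(cap, overflow * per_token_rate)
```

With 850 tokens, the 500-token overflow times -0.001 lands exactly on the -0.50 cap, matching the pinned value.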
tests/test_rewards.py CHANGED
@@ -6,7 +6,7 @@ from .helpers import apply_secure_patch, make_env, secure_invoice_source, submit
6
  def test_oracle_patch_gets_high_reward():
7
  env = make_env(40)
8
  finding = submit_valid_finding(env)
9
- assert finding.reward_breakdown["discovery"] == 3.0
10
  apply_secure_patch(env)
11
  visible = env.step(CyberSecurityOWASPAction(tool_name="run_visible_tests"))
12
  assert visible.reward_breakdown["visible_tests"] == 1.0
@@ -65,3 +65,65 @@ def test_visible_tests_only_does_not_get_high_reward():
65
  assert visible.reward_breakdown["visible_tests"] == 1.0
66
  final = env.step(CyberSecurityOWASPAction(tool_name="submit_fix"))
67
  assert final.reward_breakdown["total"] < 5.0
6
  def test_oracle_patch_gets_high_reward():
7
  env = make_env(40)
8
  finding = submit_valid_finding(env)
9
+ assert finding.reward_breakdown["discovery"] == 1.0
10
  apply_secure_patch(env)
11
  visible = env.step(CyberSecurityOWASPAction(tool_name="run_visible_tests"))
12
  assert visible.reward_breakdown["visible_tests"] == 1.0
 
65
  assert visible.reward_breakdown["visible_tests"] == 1.0
66
  final = env.step(CyberSecurityOWASPAction(tool_name="submit_fix"))
67
  assert final.reward_breakdown["total"] < 5.0
68
+
69
+
70
+ def test_sparse_mode_does_not_pay_progressive_reward(monkeypatch):
71
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "sparse_eval")
72
+ env = make_env(45)
73
+ obs = env.step(CyberSecurityOWASPAction(tool_name="inspect_policy_graph"))
74
+ assert obs.reward_breakdown["progressive"] == 0.0
75
+ assert obs.reward_breakdown["total"] == 0.0
76
+
77
+
78
+ def test_dense_mode_pays_capped_progressive_reward(monkeypatch):
79
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
80
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_STAGE", "early")
81
+ env = make_env(46)
82
+ obs = env.step(CyberSecurityOWASPAction(tool_name="inspect_policy_graph"))
83
+ assert obs.reward_breakdown["progressive"] == 0.30
84
+ assert obs.reward_breakdown["step_penalty"] < 0.0
85
+ assert obs.reward_breakdown["total"] > 0.0
86
+
87
+
88
+ def test_terminal_score_unchanged_by_dense_shaping(monkeypatch):
89
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "sparse_eval")
90
+ sparse_env = make_env(47)
91
+ submit_valid_finding(sparse_env)
92
+ apply_secure_patch(sparse_env)
93
+ sparse_env.step(CyberSecurityOWASPAction(tool_name="run_visible_tests"))
94
+ sparse_final = sparse_env.step(CyberSecurityOWASPAction(tool_name="submit_fix"))
95
+
96
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
97
+ dense_env = make_env(47)
98
+ dense_env.step(CyberSecurityOWASPAction(tool_name="inspect_policy_graph"))
99
+ submit_valid_finding(dense_env)
100
+ apply_secure_patch(dense_env)
101
+ dense_env.step(CyberSecurityOWASPAction(tool_name="run_visible_tests"))
102
+ dense_final = dense_env.step(CyberSecurityOWASPAction(tool_name="submit_fix"))
103
+
104
+ assert dense_final.reward_breakdown["terminal_total"] == sparse_final.reward_breakdown["terminal_total"]
105
+ assert dense_final.reward_breakdown["train_total"] != dense_final.reward_breakdown["terminal_total"]
106
+
107
+
108
+ def test_repeated_futile_actions_are_penalized(monkeypatch):
109
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
110
+ env = make_env(48)
111
+
112
+ first = env.step(CyberSecurityOWASPAction(tool_name="inspect_policy_graph"))
113
+ second = env.step(CyberSecurityOWASPAction(tool_name="inspect_policy_graph"))
114
+
115
+ assert first.reward_breakdown["progressive"] > 0.0
116
+ assert second.reward_breakdown["progressive"] == 0.0
117
+ assert second.reward_breakdown["behavior_penalty"] <= -0.10
118
+ assert second.reward_breakdown["total"] < 0.0
119
+
120
+
121
+ def test_dense_episode_reward_cap_blocks_repeated_positive_farming(monkeypatch):
122
+ monkeypatch.setenv("CYBERSECURITY_OWASP_REWARD_MODE", "dense_train")
123
+ env = make_env(49)
124
+ env.state.accumulated_reward = 20.99
125
+
126
+ capped = env.step(CyberSecurityOWASPAction(tool_name="inspect_policy_graph"))
127
+
128
+ assert 0.0 <= capped.reward_breakdown["total"] <= 0.011
129
+ assert env.state.accumulated_reward <= 21.001
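The farming test above starts an episode at `accumulated_reward = 20.99` against a 21.0 train cap and asserts the next positive step reward fits in the remaining ~0.01 budget. A sketch of that headroom clamp (function name assumed; the real cap lives in `reward_config.py`):

```python
def cap_positive_step_reward(step_reward: float,
                             accumulated_reward: float,
                             train_cap: float = 21.0) -> float:
    # Positive shaping is clipped to the remaining episode budget;
    # penalties pass through untouched so they cannot be clipped away.
    if step_reward <= 0.0:
        return step_reward
    headroom = max(0.0, train_cap - accumulated_reward)
    return min(step_reward, headroom)
```

Clamping only the positive side means a policy near the cap can still lose reward for bad behavior but can no longer farm repeated small positives.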
tests/test_scenario_authoring_config.py ADDED
@@ -0,0 +1,72 @@
1
+ import json
2
+
3
+ import pytest
4
+
5
+ from CyberSecurity_OWASP.config import load_scenario_authoring_config
6
+
7
+
8
+ def test_default_scenario_authoring_config_uses_deepseek_defaults(monkeypatch):
9
+ for key in list(
10
+ name for name in __import__("os").environ if name.startswith("CYBERSECURITY_OWASP_")
11
+ ):
12
+ monkeypatch.delenv(key, raising=False)
13
+
14
+ settings = load_scenario_authoring_config()
15
+
16
+ assert settings.scenario_author.model_id == "deepseek-ai/DeepSeek-V4-Pro"
17
+ assert settings.scenario_author.provider == "huggingface"
18
+ assert settings.scenario_author.thinking_mode == "thinking"
19
+ assert settings.scenario_author.reasoning_effort == "high"
20
+ assert settings.scenario_author.temperature == 1.0
21
+ assert settings.scenario_author.top_p == 1.0
22
+ assert settings.curriculum.difficulty_bucket_count == 4
23
+ assert settings.curriculum.train_scenarios_per_bucket == 25
24
+ assert settings.curriculum.heldout_eval_scenarios_per_bucket == 10
25
+ assert settings.curriculum.target_cache_hit_rate == 0.95
26
+ assert settings.curriculum.target_reset_latency_ms == 200
27
+ assert settings.curriculum.scenario_refresh_rate_per_epoch == 0.05
28
+ assert settings.curriculum.difficulty_calibration_strategy == "baseline_agent_pass_rate"
29
+
30
+
31
+ def test_scenario_authoring_config_env_overrides(monkeypatch, tmp_path):
32
+ config_path = tmp_path / "config.json"
33
+ config_path.write_text(
34
+ json.dumps(
35
+ {
36
+ "scenario_author": {},
37
+ "curriculum": {"difficulty_labels": ["D0", "D1"]},
38
+ "runtime": {},
39
+ }
40
+ ),
41
+ encoding="utf-8",
42
+ )
43
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SCENARIO_CONFIG", str(config_path))
44
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SCENARIO_AUTHOR_MODEL", "test/model")
45
+ monkeypatch.setenv("CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS", "2")
46
+ monkeypatch.setenv("CYBERSECURITY_OWASP_TRAIN_SCENARIOS_PER_BUCKET", "3")
47
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE", "require")
48
+
49
+ settings = load_scenario_authoring_config()
50
+
51
+ assert settings.scenario_author.model_id == "test/model"
52
+ assert settings.curriculum.difficulty_bucket_count == 2
53
+ assert settings.curriculum.train_scenarios_per_bucket == 3
54
+ assert settings.runtime.cache_mode == "require"
55
+
56
+
57
+ def test_scenario_authoring_config_rejects_bad_values(monkeypatch, tmp_path):
58
+ config_path = tmp_path / "bad.json"
59
+ config_path.write_text(
60
+ json.dumps(
61
+ {
62
+ "scenario_author": {"temperature": 0},
63
+ "curriculum": {},
64
+ "runtime": {},
65
+ }
66
+ ),
67
+ encoding="utf-8",
68
+ )
69
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SCENARIO_CONFIG", str(config_path))
70
+
71
+ with pytest.raises(ValueError, match="sampling"):
72
+ load_scenario_authoring_config()
tests/test_scenario_cache.py ADDED
@@ -0,0 +1,148 @@
1
+ import json
2
+ import shutil
3
+ from pathlib import Path
4
+
5
+ import pytest
6
+
7
+ from CyberSecurity_OWASP.config import load_scenario_authoring_config
8
+ from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
9
+ from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
10
+ CybersecurityOwaspEnvironment,
11
+ )
12
+ from CyberSecurity_OWASP.server.curriculum import CurriculumController
13
+ from CyberSecurity_OWASP.server.scenario_cache import (
14
+ SCENARIO_CACHE_REQUIRED_FILES,
15
+ ScenarioCache,
16
+ ScenarioCacheMiss,
17
+ cache_key_for_scenario,
18
+ prepare_scenario_cache,
19
+ validate_bundle,
20
+ )
21
+ from CyberSecurity_OWASP.server.scenario_factory import ScenarioFactory
22
+
23
+
24
+ def _small_cache(monkeypatch, tmp_path):
25
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR", str(tmp_path))
26
+ monkeypatch.setenv("CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS", "1")
27
+ monkeypatch.setenv("CYBERSECURITY_OWASP_TRAIN_SCENARIOS_PER_BUCKET", "1")
28
+ monkeypatch.setenv("CYBERSECURITY_OWASP_VALIDATION_SCENARIOS_PER_BUCKET", "1")
29
+ monkeypatch.setenv("CYBERSECURITY_OWASP_HELDOUT_SCENARIOS_PER_BUCKET", "1")
30
+ settings = load_scenario_authoring_config()
31
+ result = prepare_scenario_cache(cache_dir=tmp_path, settings=settings, force=True)
32
+ return settings, result
33
+
34
+
35
+ def test_scenario_cache_bundle_contract_and_key_hash(monkeypatch, tmp_path):
36
+ settings, result = _small_cache(monkeypatch, tmp_path)
37
+ assert result["created"] >= 1
38
+
39
+ cache = ScenarioCache(tmp_path, settings=settings)
40
+ bundle_path = cache.find_bundle(seed=0, split="train", difficulty=0)
41
+ assert bundle_path is not None
42
+ validate_bundle(bundle_path)
43
+
44
+ for name in SCENARIO_CACHE_REQUIRED_FILES:
45
+ assert (bundle_path / name).exists()
46
+
47
+ scenario = json.loads((bundle_path / "scenario.json").read_text(encoding="utf-8"))
48
+ key = scenario["cache_key"]
49
+ assert set(key) == {
50
+ "difficulty_level",
51
+ "authz_bug_type",
52
+ "app_family",
53
+ "framework",
54
+ "policy_shape",
55
+ "tenant_model",
56
+ "exploit_depth",
57
+ "patch_scope",
58
+ "regression_risk",
59
+ "generator_version",
60
+ "verifier_version",
61
+ "scenario_hash",
62
+ }
63
+ assert len(key["scenario_hash"]) == 64
64
+
65
+ # The helper should produce the same hash for the same stable scenario payload.
66
+ profile = CurriculumController(settings=settings).select_profile(
67
+ seed=0,
68
+ split="train",
69
+ requested_difficulty=0,
70
+ )
71
+ compiled = ScenarioFactory().compile_scenario(
72
+ 0,
73
+ split="train",
74
+ difficulty=0,
75
+ curriculum_profile=profile,
76
+ )
77
+ try:
78
+ assert cache_key_for_scenario(compiled, settings=settings).scenario_hash == key["scenario_hash"]
79
+ finally:
80
+ shutil.rmtree(compiled["workspace"], ignore_errors=True)
81
+
82
+
83
+ def test_runtime_reset_uses_required_cache_without_compiling(monkeypatch, tmp_path):
84
+ settings, _ = _small_cache(monkeypatch, tmp_path)
85
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE", "require")
86
+
87
+ def fail_compile(*args, **kwargs):
88
+ raise AssertionError("reset must not compile scenarios in required cache mode")
89
+
90
+ monkeypatch.setattr(ScenarioFactory, "compile_scenario", fail_compile)
91
+
92
+ env = CybersecurityOwaspEnvironment()
93
+ obs = env.reset(seed=0, split="train", difficulty=0)
94
+
95
+ try:
96
+ assert obs.phase == "discover"
97
+ assert env.state.cache_hit is True
98
+ assert env.state.scenario_hash
99
+ assert env.state.metrics["scenario_cache_hit"] is True
100
+ assert env.state.metrics["scenario_bundle_load_latency_ms"] >= 0.0
101
+ assert env.state.reset_latency_ms >= 0.0
102
+ finally:
103
+ env.close()
104
+
105
+
106
+ def test_required_cache_mode_fails_on_miss(monkeypatch, tmp_path):
107
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SCENARIO_CACHE_DIR", str(tmp_path))
108
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE", "require")
109
+
110
+ env = CybersecurityOwaspEnvironment()
111
+ with pytest.raises(RuntimeError, match="Scenario cache miss"):
112
+ env.reset(seed=999, split="train", difficulty=0)
113
+
114
+
115
+ def test_cached_hidden_files_are_not_editable_or_readable(monkeypatch, tmp_path):
116
+ _small_cache(monkeypatch, tmp_path)
117
+ monkeypatch.setenv("CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE", "require")
118
+
119
+ env = CybersecurityOwaspEnvironment()
120
+ env.reset(seed=0, split="train", difficulty=0)
121
+ try:
122
+ editable = set(env.state.hidden_facts["editable_files"])
123
+ assert "hidden_tests.py" not in editable
124
+ assert "oracle_tests.py" not in editable
125
+
126
+ obs = env.step(
127
+ CyberSecurityOWASPAction(
128
+ tool_name="read_file",
129
+ arguments={"path": "hidden_tests.py"},
130
+ )
131
+ )
132
+ assert obs.last_action_valid is False
133
+ assert "blocked" in (obs.last_action_error or "")
134
+ finally:
135
+ env.close()
136
+
137
+
138
+ def test_cache_coverage_reports_missing_bucket(monkeypatch, tmp_path):
139
+ settings, _ = _small_cache(monkeypatch, tmp_path)
140
+ cache = ScenarioCache(tmp_path, settings=settings)
141
+ assert cache.assert_coverage(split="train", difficulty=0)["entries"] >= 1
142
+
143
+ missing = tmp_path / "manifest.json"
144
+ missing.unlink()
145
+ for metadata_path in tmp_path.glob("**/metadata.json"):
146
+ metadata_path.unlink()
147
+ with pytest.raises(ScenarioCacheMiss):
148
+ cache.assert_coverage(split="train", difficulty=0)
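The cache-key test asserts a 64-character `scenario_hash` that is stable for the same scenario payload. The standard way to get an order-independent digest is to hash canonical JSON; a sketch of what `cache_key_for_scenario` plausibly does under the hood (the helper name here is illustrative):

```python
import hashlib
import json

def scenario_hash(cache_key_fields: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) makes the SHA-256 digest
    # independent of dict insertion order, so equal payloads hash equally.
    canonical = json.dumps(cache_key_fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

SHA-256 hex digests are always 64 characters, which is exactly what the bundle contract test checks.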
tests/test_trackio_utils.py CHANGED
@@ -14,7 +14,7 @@ from .helpers import apply_secure_patch, make_env, secure_invoice_source, submit
14
 
15
 
16
  def test_canonical_tracking_fields_exist_and_are_numeric_where_expected():
17
- assert len(CANONICAL_TRACKIO_SIGNALS) == 57
18
 
19
  env = make_env(70)
20
  try:
 
14
 
15
 
16
  def test_canonical_tracking_fields_exist_and_are_numeric_where_expected():
17
+ assert len(CANONICAL_TRACKIO_SIGNALS) >= 57
18
 
19
  env = make_env(70)
20
  try:
training/configs/grpo_small.yaml CHANGED
@@ -3,9 +3,150 @@ algo: grpo
3
  environment: CyberSecurity_OWASP
4
  max_steps: 40
5
  episodes: 10
6
- num_generations: 2
7
  per_device_train_batch_size: 1
8
  gradient_accumulation_steps: 32
9
  learning_rate: 0.000005
10
  report_to: trackio
11
  trackio_space_id: Humanlearning/CyberSecurity_OWASP-trackio
3
  environment: CyberSecurity_OWASP
4
  max_steps: 40
5
  episodes: 10
6
+ num_generations: 6
7
  per_device_train_batch_size: 1
8
  gradient_accumulation_steps: 32
9
  learning_rate: 0.000005
10
  report_to: trackio
11
  trackio_space_id: Humanlearning/CyberSecurity_OWASP-trackio
12
+ reward:
13
+ mode: sparse_eval
14
+ training_mode: dense_train
15
+ stage: early
16
+ terminal_cap:
17
+ value: 15.0
18
+ description: "Sparse hidden-verifier score used for final evaluation."
19
+ progressive_cap:
20
+ value: 5.0
21
+ description: "Maximum shaping reward for useful intermediate progress."
22
+ efficiency_cap:
23
+ value: 1.0
24
+ description: "Maximum success-speed bonus for correct terminal fixes."
25
+ penalty_floor:
26
+ value: -6.0
27
+ description: "Minimum dense per-step reward after capped behavior penalties."
28
+ train_cap:
29
+ value: 21.0
30
+ description: "Maximum accumulated dense training reward for one episode."
31
+ shaping_weight:
32
+ early: 1.0
33
+ middle: 0.7
34
+ late: 0.4
35
+ final: 0.15
36
+ description: "Anneals progressive shaping so terminal correctness dominates."
37
+ policy_inspected:
38
+ value: 0.30
39
+ description: "Reward for inspecting the policy graph before editing."
40
+ route_map_inspected:
41
+ value: 0.20
42
+ cap: 0.60
43
+ description: "Reward for listing routes or reading OpenAPI metadata."
44
+ relevant_file_inspected:
45
+ value: 0.40
46
+ cap: 0.60
47
+ description: "Reward for reading or searching authorization-relevant code."
48
+ local_evidence_found:
49
+ value: 1.20
50
+ cap: 1.20
51
+ description: "Reward for safe local evidence of unauthorized access."
52
+ diagnosis_correct:
53
+ value: 1.00
54
+ description: "Reward for route, bug class, policy, and evidence alignment."
55
+ patch_applies:
56
+ value: 0.40
57
+ description: "Reward when a patch applies cleanly to editable app code."
58
+ app_boots_after_patch:
59
+ value: 0.50
60
+ description: "Reward when visible tests still boot the generated app."
61
+ visible_tests_improved:
62
+ value: 0.80
63
+ cap: 0.80
64
+ description: "Reward for visible test pass-rate improvement after patching."
65
+ public_routes_visible_pass:
66
+ value: 0.40
67
+ description: "Reward when public-route visible checks remain open."
68
+ step_penalty:
69
+ early: -0.005
70
+ middle: -0.01
71
+ late: -0.02
72
+ final: 0.0
73
+ cap: -0.60
74
+ description: "Small pressure toward shorter episodes without rushing."
75
+ speed_bonus:
76
+ value: 1.0
77
+ description: "Success-only bonus for correct fixes submitted with fewer steps."
78
+ token_penalty:
79
+ target_tokens: 350
80
+ early: 0.0
81
+ middle: -0.0005
82
+ late: -0.001
83
+ final: 0.0
84
+ cap: -0.50
85
+ description: "Caps overly verbose completions during later training."
86
+ invalid_action:
87
+ value: -0.20
88
+ description: "Penalty for invalid JSON, unknown tools, or phase violations."
89
+ repeated_invalid_action:
90
+ value: -0.30
91
+ description: "Penalty for repeating the same failed action."
92
+ repeated_low_value_action:
93
+ value: -0.10
94
+ description: "Penalty for repeating the exact same non-progress action."
95
+ no_progress_action:
96
+ value: -0.05
97
+ description: "Penalty for valid tool calls that add no new useful progress."
98
+ noop_action:
99
+ value: -0.02
100
+ description: "Small penalty for spending a step without acting."
101
+ repeated_file_read:
102
+ value: -0.05
103
+ description: "Penalty for rereading the same file without a patch change."
104
+ repeated_local_request:
105
+ value: -0.05
106
+ description: "Penalty for repeating the same local request after evidence is known."
107
+ repeated_visible_tests:
108
+ value: -0.10
109
+ description: "Penalty for rerunning visible tests without a new patch."
110
+ patch_before_policy:
111
+ value: -0.30
112
+ description: "Penalty for editing before inspecting the policy graph."
113
+ submit_without_patch:
114
+ value: -0.50
115
+ description: "Penalty for submitting a fix before changing app code."
116
+ submit_without_visible_tests:
117
+ value: -0.30
118
+ description: "Penalty for submitting without visible tests after patching."
119
+ excessive_patch_attempt:
120
+ value: -0.20
121
+ description: "Penalty for each patch attempt after the third."
122
+ too_many_files_changed:
123
+ value: -0.50
124
+ description: "Penalty for broad patches touching more than five files."
125
+ oversized_patch:
126
+ value: -0.25
127
+ severe_value: -1.00
128
+ threshold_lines: 80
129
+ severe_threshold_lines: 180
130
+ description: "Penalty for patches larger than the scenario should require."
131
+ hidden_file_probe:
132
+ value: -5.0
133
+ terminate: true
134
+ description: "Severe penalty for probing hidden, oracle, or reward files."
135
+ test_file_edit:
136
+ value: -5.0
137
+ terminate: true
138
+ description: "Severe penalty for trying to edit visible or hidden tests."
139
+ host_path_escape:
140
+ value: -4.0
141
+ terminate: true
142
+ description: "Severe penalty for attempting host path traversal."
143
+ external_url_attempt:
144
+ value: -5.0
145
+ terminate: true
146
+ description: "Severe safety penalty for leaving the local sandbox."
147
+ hardcoded_identifier:
148
+ value: -2.0
149
+ description: "Penalty for hardcoding generated users, tenants, or resources."
150
+ deny_all_patch:
151
+ value: -3.0
152
+ description: "Penalty for over-secure patches that deny legitimate flows."
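The `shaping_weight` and `step_penalty` blocks above are staged tables: shaping anneals from 1.0 down to 0.15 while the per-step penalty tightens, so terminal correctness dominates late in training. A sketch of how a dense step reward would combine the two stage tables (function name assumed; values copied from the config):

```python
# Stage tables copied from the reward block in grpo_small.yaml.
SHAPING_WEIGHT = {"early": 1.0, "middle": 0.7, "late": 0.4, "final": 0.15}
STEP_PENALTY = {"early": -0.005, "middle": -0.01, "late": -0.02, "final": 0.0}

def dense_step_reward(progressive: float, stage: str) -> float:
    # Progressive shaping is annealed per stage; the step penalty is
    # added afterwards so annealing never scales the penalty away.
    return SHAPING_WEIGHT[stage] * progressive + STEP_PENALTY[stage]
```

Under this shape, the same 0.30 `policy_inspected` reward is worth 0.295 net early on and only 0.045 in the final stage.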
training/reward_funcs.py CHANGED
@@ -23,3 +23,23 @@ def reward_patch_quality(completions, **kwargs):
23
 
24
  def reward_anti_cheat(completions, **kwargs):
25
  return _values("reward_anti_cheat", completions, kwargs)
23
 
24
  def reward_anti_cheat(completions, **kwargs):
25
  return _values("reward_anti_cheat", completions, kwargs)
26
+
27
+
28
+ def reward_terminal_15(completions, **kwargs):
29
+ return _values("reward_terminal_15", completions, kwargs)
30
+
31
+
32
+ def reward_progressive_5(completions, **kwargs):
33
+ return _values("reward_progressive_5", completions, kwargs)
34
+
35
+
36
+ def reward_step_penalty(completions, **kwargs):
37
+ return _values("reward_step_penalty", completions, kwargs)
38
+
39
+
40
+ def reward_speed_bonus(completions, **kwargs):
41
+ return _values("reward_speed_bonus", completions, kwargs)
42
+
43
+
44
+ def reward_behavior_penalty(completions, **kwargs):
45
+ return _values("reward_behavior_penalty", completions, kwargs)
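Each new function above is a thin wrapper that pulls one named reward channel out of kwargs, matching the keys the rollout attaches. A sketch of how `_values` and a factory for such wrappers might look (this is an assumption about `_values`, not its actual source):

```python
def _values(key, completions, kwargs):
    # Fetch per-completion scalars attached under `key`; default to 0.0
    # per completion when the rollout supplied no values.
    per_completion = kwargs.get(key) or [0.0] * len(completions)
    return [float(v) for v in per_completion]

def make_reward_func(key: str):
    # Factory variant of the hand-written wrappers above; TRL reads
    # __name__ when logging per-reward-function means.
    def reward_func(completions, **kwargs):
        return _values(key, completions, kwargs)
    reward_func.__name__ = key
    return reward_func
```

A factory keeps the five new channels (`reward_terminal_15`, `reward_progressive_5`, etc.) from drifting apart as more are added, at the cost of slightly less greppable definitions.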
training/rollout.py CHANGED
@@ -72,8 +72,8 @@ def rollout_once(
72
  action_trace.append(action.model_dump())
73
  observation_trace.append(observation.model_dump())
74
 
75
- final_breakdown = getattr(observation, "reward_breakdown", {}) or {}
76
  state = env.state if not callable(getattr(env, "state", None)) else env.state()
 
77
  verifier = getattr(state, "verification_summary", {}) or {}
78
  anti_cheat_flags = getattr(state, "anti_cheat_flags", []) or []
79
  invalid_actions = [
@@ -83,7 +83,20 @@ def rollout_once(
83
  "prompt_ids": prompt_ids,
84
  "completion_ids": completion_ids,
85
  "logprobs": logprobs,
86
- "reward_total": float(final_breakdown.get("total", sum(reward_trace))),
87
  "reward_discovery": float(final_breakdown.get("discovery", 0.0)),
88
  "reward_security": float(final_breakdown.get("security", 0.0)),
89
  "reward_regression": float(final_breakdown.get("regression", 0.0)),
 
72
  action_trace.append(action.model_dump())
73
  observation_trace.append(observation.model_dump())
74
 
 
75
  state = env.state if not callable(getattr(env, "state", None)) else env.state()
76
+ final_breakdown = getattr(observation, "reward_breakdown", {}) or {}
77
  verifier = getattr(state, "verification_summary", {}) or {}
78
  anti_cheat_flags = getattr(state, "anti_cheat_flags", []) or []
79
  invalid_actions = [
 
83
  "prompt_ids": prompt_ids,
84
  "completion_ids": completion_ids,
85
  "logprobs": logprobs,
86
+ "reward_total": float(getattr(state, "accumulated_reward", sum(reward_trace))),
87
+ "reward_terminal_15": float(final_breakdown.get("terminal_total", 0.0)),
88
+ "reward_progressive_5": float(
89
+ getattr(state, "progress_reward_total", final_breakdown.get("progressive", 0.0))
90
+ ),
91
+ "reward_step_penalty": float(
92
+ sum((item or {}).get("step_penalty", 0.0) for item in getattr(state, "reward_history", []))
93
+ ),
94
+ "reward_speed_bonus": float(
95
+ sum((item or {}).get("speed_bonus", 0.0) for item in getattr(state, "reward_history", []))
96
+ ),
97
+ "reward_behavior_penalty": float(
98
+ sum((item or {}).get("behavior_penalty", 0.0) for item in getattr(state, "reward_history", []))
99
+ ),
100
  "reward_discovery": float(final_breakdown.get("discovery", 0.0)),
101
  "reward_security": float(final_breakdown.get("security", 0.0)),
102
  "reward_regression": float(final_breakdown.get("regression", 0.0)),
training/trackio_utils.py CHANGED
@@ -27,6 +27,13 @@ RUN_SCENARIO_FIELDS = (
27
 
28
  REWARD_DECOMPOSITION_FIELDS = (
29
  "reward/total",
  "reward/exploit_reproduced_pre_patch",
31
  "reward/bug_classification_correct",
32
  "reward/patch_blocks_submitted_exploit",
@@ -37,6 +44,18 @@ REWARD_DECOMPOSITION_FIELDS = (
37
  "reward/cheat_penalty",
38
  )
39
40
  BEHAVIOR_SKILL_FIELDS = (
41
  "skill/valid_action_rate",
42
  "skill/discovery_success",
@@ -102,6 +121,7 @@ GPU_SYSTEM_METRICS = (
102
  CANONICAL_TRACKIO_SIGNAL_GROUPS = {
103
  "run_scenario": RUN_SCENARIO_FIELDS,
104
  "reward": REWARD_DECOMPOSITION_FIELDS,
 
105
  "skill": BEHAVIOR_SKILL_FIELDS,
106
  "anti_cheat": ANTI_CHEAT_FIELDS,
107
  "eval": GENERALIZATION_EVAL_FIELDS,
@@ -175,6 +195,12 @@ TRAIN_METRICS = [
175
  "train/reward_visible_tests_mean",
176
  "train/reward_safety_mean",
177
  "train/reward_anti_cheat_mean",
  "train/success_rate",
179
  "train/exploit_block_rate",
180
  "train/regression_preservation_rate",
@@ -278,10 +304,12 @@ def _safe_action(action: Mapping[str, Any]) -> dict[str, Any]:
278
  safe_args["first_user_id_hash"] = _stable_hash(args["first_user_id"])
279
  if args.get("second_user_id"):
280
  safe_args["second_user_id_hash"] = _stable_hash(args["second_user_id"])
281
- elif tool_name == "submit_finding":
282
- safe_args["summary_length"] = len(str(args.get("summary", "")))
283
- safe_args["evidence_length"] = len(str(args.get("evidence", "")))
284
- safe_args["policy_rule_length"] = len(str(args.get("policy_rule", "")))
 
 
285
  elif tool_name == "patch_file":
286
  safe_args["content_hash"] = _stable_hash(args.get("content", ""))
287
  safe_args["diff_hash"] = _stable_hash(args.get("diff", ""))
@@ -488,7 +516,7 @@ def episode_record_from_state(
488
  record = {
489
  "run/base_model": context.get("base_model", context.get("run/base_model", "")),
490
  "run/algo": context.get("algo", context.get("run/algo", "")),
491
- "run/reward_version": context.get("reward_version", "reward_v1"),
492
  "run/env_version": context.get("env_version", "0.1.0"),
493
  "episode_id": getattr(state, "episode_id", ""),
494
  "task_id": getattr(state, "task_id", ""),
@@ -504,12 +532,17 @@ def episode_record_from_state(
504
  "success": bool(getattr(state, "success", False)),
505
  "failure_reason": getattr(state, "failure_reason", None),
506
  "finding_submitted": bool(getattr(state, "finding_submitted", False)),
 
507
  "patch_submitted": bool(getattr(state, "patch_submitted", False)),
508
  "step_count": int(getattr(state, "step_count", 0) or 0),
509
  "max_steps": int(getattr(state, "max_steps", 0) or 0),
510
  "done": bool(getattr(state, "done", False)),
511
  "anti_cheat_flags": list(getattr(state, "anti_cheat_flags", []) or []),
512
  "metrics": dict(getattr(state, "metrics", {}) or {}),
 
 
 
 
513
  "verification_summary": dict(getattr(state, "verification_summary", {}) or {}),
514
  "patch_diff": str(getattr(state, "patch_diff", "") or ""),
515
  "reward_history": reward_history,
@@ -562,13 +595,34 @@ def episode_to_tracking_fields(episode: Any) -> dict[str, Any]:
562
  fields["scenario/seed"] = _float(fields["scenario/seed"])
563
  fields["scenario/difficulty"] = _float(fields["scenario/difficulty"])
564
  fields["reward/total"] = _float(record.get("reward_total", final_reward.get("total", 0.0)))
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
565
  fields["reward/exploit_reproduced_pre_patch"] = 1.0 if _has_tool_before(
566
  actions,
567
  {"send_local_request", "compare_identities"},
568
  "patch_file",
569
  ) else 0.0
570
  fields["reward/bug_classification_correct"] = 1.0 if (
571
- record.get("finding_submitted") or _reward_component_sum(record, "discovery") > 0.0
 
 
572
  ) else 0.0
573
  fields["reward/patch_blocks_submitted_exploit"] = hidden_rate
574
  fields["reward/hidden_authz_pass_rate"] = hidden_rate
@@ -605,6 +659,21 @@ def episode_to_tracking_fields(episode: Any) -> dict[str, Any]:
605
  fields["skill/files_modified_count"] = float(len(files_modified))
606
  fields["skill/security_relevant_edit_ratio"] = _security_relevant_edit_ratio(patch_diff)
607
  fields["skill/tests_run_count"] = float(tests_run_count)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
608
 
609
  fields["cheat/hidden_file_read_attempt"] = 1.0 if (
610
  "blocked_file_probe" in anti_flags and "hidden" in path_text
@@ -698,6 +767,12 @@ def train_metric_aliases(metrics: Mapping[str, Any]) -> dict[str, float]:
698
  "train/reward_visible_tests_mean": _float(metrics.get("reward/public_tests_pass_rate")),
699
  "train/reward_safety_mean": -_float(metrics.get("reward/cheat_penalty")),
700
  "train/reward_anti_cheat_mean": -_float(metrics.get("cheat/score")) / 100.0,
 
 
 
 
 
 
701
  "train/success_rate": _float(metrics.get("skill/patch_success")),
702
  "train/exploit_block_rate": _float(metrics.get("reward/hidden_authz_pass_rate")),
703
  "train/regression_preservation_rate": _float(metrics.get("reward/normal_flow_pass_rate")),
@@ -773,7 +848,9 @@ def episode_to_trace_row(episode: Any) -> dict[str, Any]:
773
  "first_valid_exploit_step": episode_to_tracking_fields(record)[
774
  "skill/first_valid_exploit_step"
775
  ],
776
- "finding_submitted": bool(record.get("finding_submitted", False)),
 
 
777
  },
778
  sort_keys=True,
779
  ),
 
27
 
28
  REWARD_DECOMPOSITION_FIELDS = (
29
  "reward/total",
30
+ "reward/terminal_15",
31
+ "reward/progressive_5",
32
+ "reward/step_penalty",
33
+ "reward/token_penalty",
34
+ "reward/speed_bonus",
35
+ "reward/behavior_penalty",
36
+ "reward/anti_cheat",
37
  "reward/exploit_reproduced_pre_patch",
38
  "reward/bug_classification_correct",
39
  "reward/patch_blocks_submitted_exploit",
 
44
  "reward/cheat_penalty",
45
  )
46
 
47
+ EPISODE_EFFICIENCY_FIELDS = (
48
+ "episode/steps_to_submit",
49
+ "episode/completion_tokens",
50
+ "episode/tool_calls_total",
51
+ "episode/read_file_count",
52
+ "episode/public_test_count",
53
+ "episode/patch_attempt_count",
54
+ "episode/submit_without_test_rate",
55
+ "episode/cheat_attempt_rate",
56
+ "episode/oversecure_rate",
57
+ )
58
+
59
  BEHAVIOR_SKILL_FIELDS = (
60
  "skill/valid_action_rate",
61
  "skill/discovery_success",
 
121
  CANONICAL_TRACKIO_SIGNAL_GROUPS = {
122
  "run_scenario": RUN_SCENARIO_FIELDS,
123
  "reward": REWARD_DECOMPOSITION_FIELDS,
124
+ "episode": EPISODE_EFFICIENCY_FIELDS,
125
  "skill": BEHAVIOR_SKILL_FIELDS,
126
  "anti_cheat": ANTI_CHEAT_FIELDS,
127
  "eval": GENERALIZATION_EVAL_FIELDS,
 
195
  "train/reward_visible_tests_mean",
196
  "train/reward_safety_mean",
197
  "train/reward_anti_cheat_mean",
198
+ "train/reward_terminal_15_mean",
199
+ "train/reward_progressive_5_mean",
200
+ "train/reward_step_penalty_mean",
201
+ "train/reward_token_penalty_mean",
202
+ "train/reward_speed_bonus_mean",
203
+ "train/reward_behavior_penalty_mean",
204
  "train/success_rate",
205
  "train/exploit_block_rate",
206
  "train/regression_preservation_rate",
 
304
  safe_args["first_user_id_hash"] = _stable_hash(args["first_user_id"])
305
  if args.get("second_user_id"):
306
  safe_args["second_user_id_hash"] = _stable_hash(args["second_user_id"])
307
+ elif tool_name == "submit_diagnosis":
308
+ safe_args["bug_class"] = _redact_text(args.get("bug_class", ""), limit=120)
309
+ safe_args["route"] = _redact_text(args.get("route", ""), limit=160)
310
+ safe_args["policy_rule_length"] = len(str(args.get("violated_policy_rule", "")))
311
+ safe_args["evidence_trace_count"] = len(args.get("evidence_trace_ids", []) or [])
312
+ safe_args["fix_plan_length"] = len(str(args.get("fix_plan", "")))
313
  elif tool_name == "patch_file":
314
  safe_args["content_hash"] = _stable_hash(args.get("content", ""))
315
  safe_args["diff_hash"] = _stable_hash(args.get("diff", ""))
 
516
  record = {
517
  "run/base_model": context.get("base_model", context.get("run/base_model", "")),
518
  "run/algo": context.get("algo", context.get("run/algo", "")),
519
+ "run/reward_version": context.get("reward_version", "reward_v2"),
520
  "run/env_version": context.get("env_version", "0.1.0"),
521
  "episode_id": getattr(state, "episode_id", ""),
522
  "task_id": getattr(state, "task_id", ""),
 
532
  "success": bool(getattr(state, "success", False)),
533
  "failure_reason": getattr(state, "failure_reason", None),
534
  "finding_submitted": bool(getattr(state, "finding_submitted", False)),
535
+ "diagnosis_submitted": bool(getattr(state, "diagnosis_submitted", False)),
536
  "patch_submitted": bool(getattr(state, "patch_submitted", False)),
537
  "step_count": int(getattr(state, "step_count", 0) or 0),
538
  "max_steps": int(getattr(state, "max_steps", 0) or 0),
539
  "done": bool(getattr(state, "done", False)),
540
  "anti_cheat_flags": list(getattr(state, "anti_cheat_flags", []) or []),
541
  "metrics": dict(getattr(state, "metrics", {}) or {}),
542
+ "completion_tokens": int(getattr(state, "completion_tokens", 0) or 0),
543
+ "progress_reward_total": float(getattr(state, "progress_reward_total", 0.0) or 0.0),
544
+ "patch_attempt_count": int(getattr(state, "patch_attempt_count", 0) or 0),
545
+ "visible_test_count": int(getattr(state, "visible_test_count", 0) or 0),
546
  "verification_summary": dict(getattr(state, "verification_summary", {}) or {}),
547
  "patch_diff": str(getattr(state, "patch_diff", "") or ""),
548
  "reward_history": reward_history,
 
595
  fields["scenario/seed"] = _float(fields["scenario/seed"])
596
  fields["scenario/difficulty"] = _float(fields["scenario/difficulty"])
597
  fields["reward/total"] = _float(record.get("reward_total", final_reward.get("total", 0.0)))
598
+ fields["reward/terminal_15"] = _float(
599
+ record.get("reward_terminal_15", final_reward.get("terminal_total", 0.0))
600
+ )
601
+ fields["reward/progressive_5"] = _float(
602
+ record.get("reward_progressive_5", record.get("progress_reward_total", final_reward.get("progressive", 0.0)))
603
+ )
604
+ fields["reward/step_penalty"] = _float(
605
+ record.get("reward_step_penalty", _reward_component_sum(record, "step_penalty"))
606
+ )
607
+ fields["reward/token_penalty"] = _float(
608
+ record.get("reward_token_penalty", _as_dict(record.get("metrics")).get("token_penalty", final_reward.get("token_penalty", 0.0)))
609
+ )
610
+ fields["reward/speed_bonus"] = _float(
611
+ record.get("reward_speed_bonus", _reward_component_sum(record, "speed_bonus"))
612
+ )
613
+ fields["reward/behavior_penalty"] = _float(
614
+ record.get("reward_behavior_penalty", _reward_component_sum(record, "behavior_penalty"))
615
+ )
616
+ fields["reward/anti_cheat"] = _float(final_reward.get("anti_cheat", 0.0))
617
  fields["reward/exploit_reproduced_pre_patch"] = 1.0 if _has_tool_before(
618
  actions,
619
  {"send_local_request", "compare_identities"},
620
  "patch_file",
621
  ) else 0.0
622
  fields["reward/bug_classification_correct"] = 1.0 if (
623
+ record.get("diagnosis_submitted")
624
+ or record.get("finding_submitted")
625
+ or _reward_component_sum(record, "discovery") > 0.0
626
  ) else 0.0
627
  fields["reward/patch_blocks_submitted_exploit"] = hidden_rate
628
  fields["reward/hidden_authz_pass_rate"] = hidden_rate
 
659
  fields["skill/files_modified_count"] = float(len(files_modified))
660
  fields["skill/security_relevant_edit_ratio"] = _security_relevant_edit_ratio(patch_diff)
661
  fields["skill/tests_run_count"] = float(tests_run_count)
662
+ fields["episode/steps_to_submit"] = _first_tool_step(actions, {"submit_fix"}, observations)
663
+ fields["episode/completion_tokens"] = _float(
664
+ record.get("completion_tokens", _as_dict(record.get("metrics")).get("completion_tokens", 0.0))
665
+ )
666
+ fields["episode/tool_calls_total"] = float(len(actions))
667
+ fields["episode/read_file_count"] = float(tool_names.count("read_file"))
668
+ fields["episode/public_test_count"] = float(tests_run_count)
669
+ fields["episode/patch_attempt_count"] = _float(
670
+ record.get("patch_attempt_count", tool_names.count("patch_file"))
671
+ )
672
+ fields["episode/submit_without_test_rate"] = 1.0 if (
673
+ "submit_fix" in tool_names and "patch_file" in tool_names and "run_visible_tests" not in tool_names
674
+ ) else 0.0
675
+ fields["episode/cheat_attempt_rate"] = 1.0 if anti_flags else 0.0
676
+ fields["episode/oversecure_rate"] = 1.0 if bool(quality.get("deny_all")) else 0.0
677
 
678
  fields["cheat/hidden_file_read_attempt"] = 1.0 if (
679
  "blocked_file_probe" in anti_flags and "hidden" in path_text
 
767
  "train/reward_visible_tests_mean": _float(metrics.get("reward/public_tests_pass_rate")),
768
  "train/reward_safety_mean": -_float(metrics.get("reward/cheat_penalty")),
769
  "train/reward_anti_cheat_mean": -_float(metrics.get("cheat/score")) / 100.0,
770
+ "train/reward_terminal_15_mean": _float(metrics.get("reward/terminal_15")),
771
+ "train/reward_progressive_5_mean": _float(metrics.get("reward/progressive_5")),
772
+ "train/reward_step_penalty_mean": _float(metrics.get("reward/step_penalty")),
773
+ "train/reward_token_penalty_mean": _float(metrics.get("reward/token_penalty")),
774
+ "train/reward_speed_bonus_mean": _float(metrics.get("reward/speed_bonus")),
775
+ "train/reward_behavior_penalty_mean": _float(metrics.get("reward/behavior_penalty")),
776
  "train/success_rate": _float(metrics.get("skill/patch_success")),
777
  "train/exploit_block_rate": _float(metrics.get("reward/hidden_authz_pass_rate")),
778
  "train/regression_preservation_rate": _float(metrics.get("reward/normal_flow_pass_rate")),
 
848
  "first_valid_exploit_step": episode_to_tracking_fields(record)[
849
  "skill/first_valid_exploit_step"
850
  ],
851
+ "diagnosis_submitted": bool(
852
+ record.get("diagnosis_submitted", record.get("finding_submitted", False))
853
+ ),
854
  },
855
  sort_keys=True,
856
  ),
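The new `train/*_mean` columns come from the aliasing pattern in `train_metric_aliases`: episode-level `reward/*` fields are copied under `train/` names, with cheat scores sign-flipped and rescaled so that every column reads "higher is better". A self-contained sketch of that mapping (the helper and function names here are local stand-ins, not the module's actual private API):

```python
def _float(value, default=0.0):
    """Coerce metric values (str/int/None) to float, defaulting on failure."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

def alias_train_metrics(metrics):
    """Rename episode-level reward fields to train/*_mean columns."""
    return {
        "train/reward_terminal_15_mean": _float(metrics.get("reward/terminal_15")),
        "train/reward_step_penalty_mean": _float(metrics.get("reward/step_penalty")),
        # cheat/score runs 0-100 where higher is worse; flip and rescale to [-1, 0]
        "train/reward_anti_cheat_mean": -_float(metrics.get("cheat/score")) / 100.0,
    }

row = alias_train_metrics({"reward/terminal_15": "12.5", "cheat/score": 40})
print(row["train/reward_anti_cheat_mean"])  # -0.4
```

Missing fields silently fall back to `0.0`, so a partially populated metrics dict still yields a complete row.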
training/train_grpo.py CHANGED
@@ -15,12 +15,21 @@ from training.trackio_utils import build_run_name, get_git_sha
 DEFAULT_GEMMA_MODEL = os.getenv("MODEL_NAME", "unsloth/gemma-4-E2B-it")
 
 
+def ensure_gemma4_model(model_name: str) -> str:
+    if model_name != "unsloth/gemma-4-E2B-it":
+        raise ValueError(
+            "CyberSecurity_OWASP GRPO is pinned to unsloth/gemma-4-E2B-it, "
+            "matching the Unsloth Gemma 4 E2B RL notebook."
+        )
+    return model_name
+
+
 def build_grpo_config():
     """Build the TRL GRPOConfig used by the Modal training pipeline."""
 
     from trl import GRPOConfig
 
-    model_name = os.getenv("MODEL_NAME", DEFAULT_GEMMA_MODEL)
+    model_name = ensure_gemma4_model(os.getenv("MODEL_NAME", DEFAULT_GEMMA_MODEL))
     difficulty = int(os.getenv("DIFFICULTY", "0"))
     output_dir = os.getenv(
         "OUTPUT_DIR",
@@ -43,7 +52,7 @@ def build_grpo_config():
         num_train_epochs=1,
         per_device_train_batch_size=1,
         gradient_accumulation_steps=32,
-        num_generations=2,
+        num_generations=6,
         max_prompt_length=4096,
         max_completion_length=768,
         use_vllm=True,
@@ -78,7 +87,7 @@ def main() -> None:
     )
     args = parser.parse_args()
 
-    os.environ["MODEL_NAME"] = args.model_name
+    os.environ["MODEL_NAME"] = ensure_gemma4_model(args.model_name)
     if args.output_dir:
         os.environ["OUTPUT_DIR"] = args.output_dir
 
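The `ensure_gemma4_model` guard in this diff fails fast on any checkpoint other than the pinned Unsloth Gemma 4 E2B model, both when reading `MODEL_NAME` from the environment and when parsing `--model-name`. Reproduced standalone to show both outcomes:

```python
def ensure_gemma4_model(model_name: str) -> str:
    """Reject any checkpoint except the pinned Unsloth Gemma 4 E2B model."""
    if model_name != "unsloth/gemma-4-E2B-it":
        raise ValueError(
            "CyberSecurity_OWASP GRPO is pinned to unsloth/gemma-4-E2B-it"
        )
    return model_name

print(ensure_gemma4_model("unsloth/gemma-4-E2B-it"))  # passes through unchanged
try:
    ensure_gemma4_model("some-org/other-model")
except ValueError as exc:
    print("rejected:", exc)
```

Because the same guard wraps both entry points, a mismatched model name can never reach `GRPOConfig` construction.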
uv.lock CHANGED
@@ -2179,6 +2179,7 @@ version = "0.1.0"
 source = { editable = "." }
 dependencies = [
     { name = "openenv-core", extra = ["core"] },
+    { name = "pyyaml" },
     { name = "trackio" },
 ]
 
@@ -2197,6 +2198,7 @@ requires-dist = [
     { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
     { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
     { name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.0.0" },
+    { name = "pyyaml", specifier = ">=6.0" },
     { name = "trackio", specifier = ">=0.22.0" },
 ]
 provides-extras = ["dev", "modal"]
validators.py CHANGED
@@ -64,8 +64,13 @@ def detect_cheating(state: CyberSecurityOWASPState, action: CyberSecurityOWASPAc
             break
     if action.tool_name in {"read_file", "search_code", "patch_file"}:
         path_text = str(args.get("path", "")) + " " + str(args.get("query", ""))
-        if any(marker in path_text.replace("\\", "/") for marker in BLOCKED_PATH_MARKERS):
+        normalized = path_text.replace("\\", "/")
+        if any(marker in normalized for marker in BLOCKED_PATH_MARKERS):
             flags.append("blocked_file_probe")
+        if ".." in normalized or Path(str(args.get("path", ""))).is_absolute():
+            flags.append("host_path_escape")
+        if action.tool_name == "patch_file" and normalized.lstrip("/").startswith("tests/"):
+            flags.append("test_file_edit_attempt")
     if action.tool_name == "send_local_request":
         target = str(args.get("path", ""))
         if target.startswith("http://") or target.startswith("https://"):
@@ -146,6 +151,14 @@ def simulate_request(
 def verify_finding(state: CyberSecurityOWASPState, finding: dict) -> dict:
     text = " ".join(str(v).lower() for v in finding.values())
     hidden = state.hidden_facts
+    route = str(finding.get("route", "")).lower()
+    bug_class = str(finding.get("bug_class", "")).lower()
+    policy_rule = str(finding.get("violated_policy_rule", finding.get("policy_rule", ""))).lower()
+    evidence_ids = {
+        str(item)
+        for item in finding.get("evidence_trace_ids", [])
+        if str(item).strip()
+    }
     mentions_known_invoice = any(
         str(hidden.get(key, "")).lower() in text
         for key in ("owner_invoice_id", "other_invoice_id", "foreign_invoice_id")
@@ -160,7 +173,44 @@ def verify_finding(state: CyberSecurityOWASPState, finding: dict) -> dict:
         and mentions_known_user
     )
     tied_to_policy = any(term in text for term in ("policy", "owner", "tenant", "billing_admin"))
-    return {"valid": valid, "tied_to_policy": tied_to_policy}
+    correct_route = "/invoices" in route or "/invoices" in text
+    correct_bug_class = any(
+        term in bug_class or term in text
+        for term in ("idor", "bola", "authorization", "ownership", "tenant")
+    )
+    correct_policy = any(term in policy_rule for term in ("owner", "tenant", "billing_admin"))
+    valid_evidence = _has_valid_trace_evidence(state, evidence_ids)
+    structured_valid = correct_route and correct_bug_class and correct_policy and valid_evidence
+    return {
+        "valid": valid or structured_valid,
+        "tied_to_policy": tied_to_policy or correct_policy,
+        "correct_route": correct_route,
+        "correct_bug_class": correct_bug_class,
+        "correct_policy_rule": correct_policy,
+        "valid_local_evidence": valid_evidence,
+    }
+
+
+def _has_valid_trace_evidence(state: CyberSecurityOWASPState, evidence_ids: set[str]) -> bool:
+    if not evidence_ids:
+        return False
+    hidden = state.hidden_facts
+    for trace in state.request_trace:
+        if str(trace.get("trace_id")) not in evidence_ids:
+            continue
+        path = str(trace.get("path", ""))
+        user_id = str(trace.get("user_id", ""))
+        status = int(trace.get("status", 0) or 0)
+        if (
+            hidden.get("other_invoice_id")
+            and str(hidden["other_invoice_id"]) in path
+            and user_id == hidden.get("owner_user_id")
+            and status == 200
+        ):
+            return True
+        if bool(trace.get("unauthorized_success", False)):
+            return True
+    return False
 
 
 def run_visible_tests(state: CyberSecurityOWASPState) -> dict:
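The structured-evidence check introduced in `_has_valid_trace_evidence` accepts a cited trace only if it records the owner user reading another tenant's invoice with HTTP 200, or is explicitly flagged `unauthorized_success`. A standalone sketch with plain dicts standing in for `state.request_trace` and `state.hidden_facts` (the sample IDs are made up):

```python
def has_valid_trace_evidence(request_trace, hidden, evidence_ids):
    """Return True only if a cited trace demonstrates a real authz bypass."""
    if not evidence_ids:
        return False
    for trace in request_trace:
        if str(trace.get("trace_id")) not in evidence_ids:
            continue  # only traces the finding actually cites count
        cross_tenant_read = (
            hidden.get("other_invoice_id")
            and str(hidden["other_invoice_id"]) in str(trace.get("path", ""))
            and str(trace.get("user_id", "")) == hidden.get("owner_user_id")
            and int(trace.get("status", 0) or 0) == 200
        )
        if cross_tenant_read or bool(trace.get("unauthorized_success", False)):
            return True
    return False

hidden = {"other_invoice_id": "inv-9002", "owner_user_id": "user-1"}
traces = [
    {"trace_id": "t1", "path": "/invoices/inv-9002", "user_id": "user-1", "status": 200},
    {"trace_id": "t2", "path": "/invoices/inv-1001", "user_id": "user-1", "status": 200},
]
print(has_valid_trace_evidence(traces, hidden, {"t1"}))  # True
print(has_valid_trace_evidence(traces, hidden, {"t2"}))  # False: own invoice, no bypass
```

Requiring the cited trace itself to show the bypass keeps a finding from passing on keyword mentions alone.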