# CyberSecurity_OWASP — Architecture

## 1. System goal

`CyberSecurity_OWASP` is an OpenEnv environment for training a **single LLM policy** to perform a complete defensive authorization-repair workflow:

```text
Understand policy → discover local evidence → patch code → validate → submit
```

The environment is intentionally not a two-agent red-team/blue-team setup. The agent is one model with one trajectory. It must learn both sides of the defensive workflow: finding the policy violation and fixing it safely.

## 2. Final architecture diagram

Rendered asset:



Editable source: `assets/architecture_diagram.mmd`
```mermaid
flowchart TB
  subgraph A[Async Scenario Authoring + Curriculum Factory]
    A1[Config-guided LLM Scenario Author\nDeepSeek-V4-Pro default]
    A2[ScenarioSpec JSON\npolicy, app family, bug target]
    A3[Template + A01 Mutator\nFastAPI code variants]
    A4[Deterministic Compiler\nexecutable bundle]
    A5[Static + Dynamic Verifier\nsolvable, safe, hidden/visible tests]
    A6[Difficulty Calibrator\nbaseline pass-rate buckets]
    A7[Versioned Scenario Cache\nsplit, difficulty, family, hash]
    A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
  end
  subgraph B[CyberSecurity_OWASP OpenEnv Runtime]
    B1["reset(seed, difficulty, family_budget)\ncache lookup only"]
    B2[Curriculum Sampler\nvalidated cache slice]
    B3[Episode State Store\nphase, history, cache metadata, patch diff]
    B4[Typed Action Tools\ninspect, request, patch, visible tests]
    B5[Ephemeral App Sandbox\ncloned cached workspace + fixtures]
    B6[Multi-layer Verifier\nvisible, hidden, oracle, regression]
    B7[Deterministic Reward Engine\nstable components + penalties]
    B8[Episode Artifact Logger\nJSONL transcript + verifier + diff]
    B1 --> B2 --> B3 --> B4
    B4 <--> B5
    B5 --> B6 --> B7 --> B3
    B3 --> B8
  end
  subgraph C[Single LLM Agent]
    C1[Observation Parser]
    C2[AuthZ + Code Reasoning]
    C3[Discover → Diagnose → Patch → Test\none JSON action]
    C1 --> C2 --> C3
  end
  subgraph D[Training + Evaluation + Demo]
    D1[Parallel Rollouts\nfast cached reset]
    D2[TRL GRPO + LoRA]
    D3[Trackio Curves\nreward, pass rates, cache metrics]
    D4[Held-out Family Eval\nbase vs trained model]
    D5[Demo Artifacts\nbefore/after traces + JSONL]
    D1 --> D2 --> D3 --> D4 --> D5
  end
  subgraph E[Feedback / Adaptation Loop]
    E1[Episode logs + failures]
    E2[Mastery Model\nweakness and plateau tracking]
    E3[Cache Sampling Weights\nnew generation queue]
    E1 --> E2 --> E3
  end
  A7 --> B1
  C3 -->|typed action| B4
  B4 -->|observation + reward + done| C1
  B7 --> D1
  D2 --> C1
  B8 --> E1
  E3 --> A1
```
## 3. Component responsibilities

### 3.1 Async Scenario Authoring Plane

Scenario generation is offline, asynchronous, validated, and cached. Runtime `reset()` must not call an LLM and must not compile a fresh app during Modal smoke, training, or evaluation runs.

The scenario authoring plane outputs complete executable bundles:

- `scenario.json`;
- `app_source/`;
- `policy_graph.json`;
- `visible_tests.py`;
- `hidden_tests.py`;
- `oracle_tests.py`;
- `expected_exploit_trace.json`;
- `reward_config.json`;
- `metadata.json`.

The default scenario/curriculum author is configured in `configs/scenario_authoring.small.json`:
```json
{
  "provider": "huggingface",
  "model_id": "deepseek-ai/DeepSeek-V4-Pro",
  "thinking_mode": "thinking",
  "reasoning_effort": "high",
  "temperature": 1.0,
  "top_p": 1.0
}
```
DeepSeek-V4-Pro is only used for offline scenario/curriculum authoring. It is not the RL policy model unless explicitly selected for training.

The compiler remains the main anti-overfitting mechanism. It should vary:

- route names;
- schema names;
- ORM query structure;
- framework template;
- role names;
- tenant IDs;
- object ownership patterns;
- file layout;
- visible test coverage;
- hidden invariant seeds.
The runtime treats curriculum and cache sampling as first-class scenario inputs:

- `CurriculumController` tracks target weakness mastery, recent reward trend, failure counts, and difficulty tier.
- Offline cache prep uses the configured LLM author, the deterministic compiler, the verifier, and the baseline-agent difficulty calibrator.
- `ScenarioCache` stores validated bundles by split, difficulty, family, generator version, verifier version, and scenario hash.
- Hidden-eval episodes hold out entire scenario families, not just seeds; evaluation-only family labels are kept in state metadata rather than in observations.

Cache keys include:

```text
difficulty_level
authz_bug_type
app_family
framework
policy_shape
tenant_model
exploit_depth
patch_scope
regression_risk
generator_version
verifier_version
scenario_hash
```
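As an illustration, these fields can be folded into one deterministic key. The function below is a sketch, not the implemented `ScenarioCache` API; the bundle metadata layout and the key format are assumptions:

```python
import hashlib
import json

# Sketch only: derive a deterministic cache key from the fields listed above.
# The metadata layout and key format are assumptions, not the real API.
def scenario_cache_key(meta: dict) -> str:
    key_fields = [
        "difficulty_level", "authz_bug_type", "app_family", "framework",
        "policy_shape", "tenant_model", "exploit_depth", "patch_scope",
        "regression_risk", "generator_version", "verifier_version",
    ]
    # Canonical JSON keeps the hash stable regardless of dict ordering.
    canonical = json.dumps({k: meta[k] for k in key_fields}, sort_keys=True)
    scenario_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{meta['split']}/{meta['difficulty_level']}/{meta['app_family']}/{scenario_hash}"
```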
### 3.2 Policy Graph Generator

The policy graph is the ground truth for intended behavior.

Example internal representation:

```yaml
resources:
  invoice:
    owner_field: owner_user_id
    tenant_field: tenant_id
roles:
  user:
    can:
      - read:invoice where owner_user_id == actor.user_id
      - update:invoice where owner_user_id == actor.user_id and status != locked
  support:
    can:
      - read:invoice where tenant_id == actor.tenant_id
  admin:
    can:
      - read:any_invoice where tenant_id == actor.tenant_id
      - update:any_invoice where tenant_id == actor.tenant_id
public_routes:
  - GET /health
  - GET /pricing
forbidden:
  - cross_tenant_read
  - cross_tenant_update
  - user_reads_other_user_invoice
```

The policy graph prevents false rewards for over-securing intentionally public or intentionally allowed routes.
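To make that contract concrete, here is a minimal sketch of checking behavior against the example graph above; the function and field names are illustrative, not the verifier's real API:

```python
# Illustrative checks against the example policy graph above; the real
# policy-oracle evaluator is generated per scenario and may look different.
PUBLIC_ROUTES = {"GET /health", "GET /pricing"}

def user_can_read_invoice(actor: dict, invoice: dict) -> bool:
    # Encodes: read:invoice where owner_user_id == actor.user_id
    return invoice["owner_user_id"] == actor["user_id"]

def is_intentionally_public(method: str, path: str) -> bool:
    # Locking these routes down should trigger the over-securing penalty,
    # not a security reward.
    return f"{method} {path}" in PUBLIC_ROUTES
```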
### 3.3 Bug Injector

The bug injector creates controlled, defensive lab scenarios. It should only generate bugs inside local synthetic apps.

MVP bug classes:

| Bug class | Example failure mode | Expected fix type |
|---|---|---|
| Missing route guard | Protected endpoint lacks authorization middleware | Add policy check/middleware |
| IDOR / ownership bug | User can access another user’s object by changing ID | Add owner check in query/policy |
| Tenant leak | Tenant A can list Tenant B records | Add tenant filter |
| Role confusion | Support/editor/admin boundary is wrong | Correct role-to-permission mapping |
| Client-side-only auth | Server trusts UI to hide forbidden action | Enforce server-side authorization |
| Query omission | List/export/search endpoint lacks auth filter | Filter query by actor permissions |
| Over-broad mutation | User can update/delete forbidden object | Add mutation permission check |
| Public route decoy | Agent may wrongly lock down intended public endpoint | Preserve intended public behavior |
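For intuition, here is a hedged sketch of what the "IDOR / ownership bug" class could look like in a generated FastAPI variant, along with the expected fix; all route names, fixtures, and helpers are invented for illustration:

```python
from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()

# Toy stand-ins for the generated DB fixtures (values are illustrative).
INVOICES = {1: {"id": 1, "owner_user_id": 7, "total": 120}}

def current_actor() -> dict:
    # Placeholder auth dependency; generated apps derive the actor from the request.
    return {"user_id": 7, "tenant_id": 1, "role": "user"}

# Injected bug: any authenticated user can read any invoice by changing the ID.
@app.get("/invoices/{invoice_id}")
def get_invoice(invoice_id: int, actor: dict = Depends(current_actor)):
    invoice = INVOICES.get(invoice_id)
    if invoice is None:
        raise HTTPException(status_code=404)
    return invoice  # missing owner check -> IDOR

# Expected fix: enforce ownership server-side before returning the object.
@app.get("/invoices-fixed/{invoice_id}")
def get_invoice_fixed(invoice_id: int, actor: dict = Depends(current_actor)):
    invoice = INVOICES.get(invoice_id)
    if invoice is None or invoice["owner_user_id"] != actor["user_id"]:
        raise HTTPException(status_code=404)  # hide existence from non-owners
    return invoice
```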
### 3.4 OpenEnv Server

The OpenEnv server should implement the standard lifecycle:

- `reset()` — initialize a fresh episode from a cached scenario bundle.
- `step(action)` — execute one typed action and return observation, reward, and done.
- `state()` — expose episode metadata for debugging and evaluation.

Recommended package/class names:

```text
Repo name: CyberSecurity_OWASP
Python package: cybersecurity_owasp
Client class: CyberSecurityOWASPEnv
Action class: CyberSecurityOWASPAction
Observation: CyberSecurityOWASPObservation
State: CyberSecurityOWASPState
```

### 3.5 Tool API

The agent should interact through typed actions. Keep the interface small enough for RL but expressive enough for realistic repair.
```python
from dataclasses import dataclass
from typing import Literal

# Base Action type from the OpenEnv core package; the exact import path
# depends on the installed OpenEnv version.
from core.env_server.types import Action


@dataclass
class CyberSecurityOWASPAction(Action):
    tool_name: Literal[
        "inspect_policy_graph",
        "list_routes",
        "read_openapi",
        "read_file",
        "search_code",
        "send_local_request",
        "compare_identities",
        "submit_diagnosis",
        "patch_file",
        "run_visible_tests",
        "submit_fix",
        "noop",
    ]
    arguments: dict
```
Recommended actions:

| Action | Purpose | Safety boundary |
|---|---|---|
| `inspect_policy_graph` | Read intended authorization rules. | Only synthetic policy. |
| `list_routes` | See local app route map. | No internet target. |
| `read_file` | Inspect selected source file. | Sandbox allowlist only. |
| `send_local_request` | Validate behavior against local app. | Local generated app only. |
| `submit_diagnosis` | Record bug class, route, policy rule, evidence trace IDs, and fix plan. | Does not reveal hidden tests. |
| `run_visible_tests` | Run visible tests. | No hidden test disclosure. |
| `patch_file` | Modify source through unified diff or full content. | Patch size and file allowlist limits. |
| `submit_fix` | End episode and trigger hidden eval. | Final hidden score only, no leaked test details. |
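A single step might then look like the following; the `arguments` keys are illustrative, not a frozen schema:

```python
# One typed agent step: apply a unified-diff patch to an allowlisted file.
# The argument keys are an assumption about the eventual tool schema.
action = CyberSecurityOWASPAction(
    tool_name="patch_file",
    arguments={
        "path": "app/routes/invoices.py",
        "patch_format": "unified_diff",
        "patch": "--- a/app/routes/invoices.py\n+++ b/app/routes/invoices.py\n@@ ... @@\n",
    },
)
```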
### 3.6 Observation schema

Observations should be compact and structured.

```python
from dataclasses import dataclass, field
from typing import Literal

# Base Observation type from the OpenEnv core package (import path may vary).
from core.env_server.types import Observation


@dataclass
class CyberSecurityOWASPObservation(Observation):
    phase: Literal["discover", "patch", "done"]
    message: str
    task_brief: str
    visible_policy_hint: dict
    workspace_summary: dict
    available_actions: list[str]
    last_tool_result: str
    visible_test_result: str | None = None
    reward_breakdown: dict[str, float] = field(default_factory=dict)
    done_reason: str | None = None
```

The policy hint is deliberately partial. It may include product rules, fixture aliases, route summaries, and public-route intent, but it must not expose the hidden oracle matrix, hidden test bodies, injected bug labels, or held-out family labels.
### 3.7 State schema

State should support debugging and training analytics.

```python
from dataclasses import dataclass, field
from typing import Literal

# Base State type from the OpenEnv core package (import path may vary).
from core.env_server.types import State


@dataclass
class CyberSecurityOWASPState(State):
    episode_id: str
    task_id: str
    split: Literal["train", "validation", "hidden_eval"]
    step_count: int = 0
    max_steps: int = 40
    difficulty_tier: str = "warmup"
    scenario_family: str = ""
    template_id: str = "fastapi_basic"
    target_weakness: str = ""
    curriculum_snapshot: dict = field(default_factory=dict)
    verification_summary: dict = field(default_factory=dict)
    patch_diff: str = ""
    episode_artifact_path: str | None = None
    accumulated_reward: float = 0.0
```
## 4. Episode lifecycle

```text
1. reset()
   - curriculum selects difficulty tier and target weakness
   - runtime samples or directly loads a validated cached bundle
   - clone cached `app_source/` into an isolated ephemeral workspace
   - initialize fixture state, cache metadata, and sandbox handles
   - return initial observation
2. agent loop
   - inspect policy/routes/files
   - send local requests only inside the sandbox
   - run visible tests
   - apply one or more patches
   - rerun visible tests
3. submit_fix
   - freeze patch
   - run visible tests
   - run hidden authorization invariants
   - run policy-oracle matrix
   - run regression and public-route preservation tests
   - compute deterministic reward
   - return final observation, reward, done=True
4. logging
   - append JSONL artifact with scenario metadata, action trace, observations, patch diff, verifier result, and reward components
   - feed terminal success/failure back into curriculum mastery tracking
   - send metrics to Trackio during training/eval
```
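From the client side, an episode could look like this minimal sketch. It follows the usual OpenEnv client pattern, but the constructor, the `policy` callable, and the result fields are assumptions about the final API:

```python
# Minimal rollout sketch using the client class from section 3.4.
env = CyberSecurityOWASPEnv.from_docker_image("cybersecurity-owasp:latest")

result = env.reset()  # cache lookup only: no LLM call, no fresh compile
while not result.done:
    action = policy(result.observation)  # single LLM policy emits one JSON action
    result = env.step(action)

# submit_fix (or the step budget) ended the episode and triggered hidden eval.
print(result.observation.reward_breakdown, result.observation.done_reason)
```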
`CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` is mandatory for Modal smoke, training, and evaluation. In that mode a missing cache bundle is a hard failure. Local development may use `fallback`, which compiles deterministically on a miss, but that path is not allowed for meaningful training.
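A sketch of that reset-time decision (the helper and cache method names are hypothetical):

```python
import os

# Hypothetical reset-time cache resolution mirroring the modes described above.
def load_scenario_bundle(cache, key):
    mode = os.environ.get("CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE", "require")
    bundle = cache.get(key)
    if bundle is not None:
        return bundle
    if mode == "require":
        # Modal smoke/training/eval: a miss is a hard failure, never a rebuild.
        raise RuntimeError(f"scenario cache miss for {key!r} in 'require' mode")
    # Local development only ('fallback'): deterministic compile on a miss.
    return cache.compile_and_store(key)
```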
## 5. Reward design

The reward should be deterministic, decomposed, and resistant to reward hacking. The maximum terminal reward remains **15.0**, and high reward requires deterministic verifier success, not explanation quality.

Stable reward keys:

```python
{
    "discovery": 0.0,
    "security": 0.0,
    "regression": 0.0,
    "public_routes": 0.0,
    "patch_quality": 0.0,
    "visible_tests": 0.0,
    "safety": 0.0,
    "anti_cheat": 0.0,
    "terminal_total": 0.0,
    "progressive": 0.0,
    "step_penalty": 0.0,
    "speed_bonus": 0.0,
    "token_penalty": 0.0,
    "behavior_penalty": 0.0,
    "train_total": 0.0,
    "total": 0.0,
}
```

Sparse evaluation uses `terminal_total` as `total`. Dense training uses `terminal_total + shaping_weight * progressive + efficiency - penalties` as `total`, with all reward values and short descriptions configured in `training/configs/grpo_small.yaml`.
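A minimal sketch of how those two totals could be composed from the stable keys; which keys feed the efficiency and penalty terms is an assumption here, and the actual weights live in `training/configs/grpo_small.yaml`:

```python
# Sketch only: compose `total` from the stable reward keys above. The mapping
# of keys into efficiency/penalties is an assumption, not the frozen contract.
def combine_total(r: dict[str, float], dense: bool, shaping_weight: float = 0.1) -> float:
    if not dense:
        return r["terminal_total"]  # sparse evaluation
    efficiency = r["speed_bonus"] - r["step_penalty"] - r["token_penalty"]
    return (
        r["terminal_total"]
        + shaping_weight * r["progressive"]
        + efficiency
        - r["behavior_penalty"]
    )
```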
### Reward components

| Component | Purpose |
|---|---|
| `discovery` | Valid local evidence and correct violated policy rule. |
| `security` | Hidden exploit blocking plus policy-oracle matrix pass. |
| `regression` | Legitimate owner/admin/support flows still work. |
| `public_routes` | Intentionally public routes remain public. |
| `patch_quality` | Localized, policy-aligned patch and efficient phase order. |
| `visible_tests` | Visible tests pass and the app still boots. |
| `safety` | Penalizes invalid action patterns, unsafe targets, timeouts, and deny-all behavior. |
| `anti_cheat` | Penalizes hidden-file probing, hardcoded fixture IDs, and test/oracle tampering. |

### Penalties

| Penalty | Trigger |
|---|---|
| public route penalty | Breaks a route intentionally marked public. |
| anti-cheat penalty | Deletes or probes tests, hidden files, reward code, oracle data, or host paths. |
| hardcoding penalty | Hardcodes seed-specific IDs, users, tenants, or hidden assumptions. |
| safety penalty | Over-broad denial, malformed/invalid actions, repeated failed actions, or external target attempts. |

The LLM judge, if used at all, should only annotate trace quality for analysis. It must not decide security-critical reward.
## 6. Hidden tests and anti-overfitting

Hidden tests are necessary because visible tests can be gamed or memorized. They should test policy invariants rather than exact implementation details.

Use **4 anti-overfitting layers**:

1. **Seed diversity** — route names, user IDs, tenant IDs, object names, and schemas change every episode.
2. **Template diversity** — the same policy bug appears in different frameworks and file layouts.
3. **Hidden invariant tests** — final reward uses unseen authorization cases.
4. **Held-out eval split** — at least 20% of scenario families/seeds are never used in training.

Recommended split:

```text
Train: 70%
Validation: 10%
Held-out: 20%
```
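Family-level splitting can be made deterministic by hashing the family name, as in this sketch (thresholds follow the 70/10/20 recommendation; the function is illustrative):

```python
import hashlib

# Deterministic family-level split assignment: whole families, not just seeds,
# land in exactly one split, matching the held-out requirement above.
def assign_split(scenario_family: str) -> str:
    digest = hashlib.sha256(scenario_family.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    if bucket < 0.70:
        return "train"
    if bucket < 0.80:
        return "validation"
    return "hidden_eval"
```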
## 7. Evaluation plan

Run before/after evaluation on the same held-out suite.

### Metrics

| Metric | Meaning |
|---|---|
| `episode_success_rate` | Visible + hidden + regression tests pass. |
| `hidden_authz_pass_rate` | Security-critical hidden checks pass. |
| `regression_pass_rate` | Normal valid behavior remains intact. |
| `oversecure_rate` | Agent blocks intended legitimate/public behavior. |
| `patch_compile_rate` | Patch applies and the app still runs. |
| `median_steps_to_submit` | Efficiency of the repair workflow. |
| `median_files_changed` | Patch focus/minimality. |
| `reward_hacking_rate` | Attempts to delete tests, hardcode fixtures, or bypass eval. |

### Eval table template

| Model | Split | Success | Hidden authz | Regression | Oversecure | Median steps | Median files changed |
|---|---|---:|---:|---:|---:|---:|---:|
| Base model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
| RL-trained model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
## 8. Training flow

Rendered asset:



Editable source: `assets/env_rl_training_flow_diagram.mmd`

```text
1. Build the CyberSecurity_OWASP OpenEnv server.
2. Prepare the validated scenario cache once per generator/verifier version.
3. Run baseline eval with cached validation/held-out bundles.
4. Train with GRPO/TRL or Unsloth using cached rollout episodes.
5. Log reward components, pass rates, reset latency, and cache hit metrics to Trackio.
6. Run held-out eval every N training steps.
7. Inspect failure clusters and cache sampling weights.
8. Refresh only 5-10% of scenarios per epoch when new weak spots are found.
9. Produce the final demo: before/after trace + reward curve + held-out eval table.
```
Recommended initial training setup (Modal-first):

```text
Model: unsloth/gemma-4-E2B-it
Algorithm: GRPO via TRL or Unsloth-compatible loop
Dataset prompt: repeated task instruction with randomized scenario IDs
Max steps per episode: 30
Rollouts per prompt: 2-4
Logging: Trackio
Primary eval: held-out deterministic test pass rate
Scenario cache mode: require
Scenario cache volume: CyberSecurity_OWASP-scenario-cache
```

Training execution is expected to run on Modal (persistent or ephemeral) rather than locally.
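A heavily hedged sketch of the TRL side; `run_episode_and_score` and `prompt_dataset` are hypothetical placeholders, and bridging OpenEnv rollouts into a GRPO reward function is an integration choice, not a fixed TRL recipe:

```python
from trl import GRPOConfig, GRPOTrainer

# Hypothetical bridge: score each completion by replaying it as an episode in
# CyberSecurity_OWASP and returning the deterministic reward `total`.
def env_reward(prompts, completions, **kwargs):
    return [run_episode_and_score(p, c) for p, c in zip(prompts, completions)]

config = GRPOConfig(
    output_dir="checkpoints/grpo_small",
    num_generations=4,  # rollouts per prompt, matching the 2-4 recommendation
)
trainer = GRPOTrainer(
    model="unsloth/gemma-4-E2B-it",
    reward_funcs=env_reward,
    args=config,
    train_dataset=prompt_dataset,  # repeated task instruction + scenario IDs
)
trainer.train()
```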
## 9. Deployment architecture

The environment should be runnable in 3 modes:

| Mode | Purpose |
|---|---|
| Local Uvicorn | Fast engineer iteration. |
| Docker | Reproducible local training/eval. |
| Hugging Face Spaces | Public hackathon demo and OpenEnv-compliant hosting. |

Expected endpoints:

```text
/ws      OpenEnv client session
/health  health check
/reset   debug reset
/step    debug step
/state   debug state
/docs    FastAPI docs
/web     optional web UI
```
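A quick smoke check against the debug endpoints might look like this sketch; only the paths come from the table above, and the payload shapes are assumptions:

```python
import httpx

BASE = "http://localhost:8000"  # local Uvicorn default; adjust per deployment

def smoke_check() -> None:
    # /health should answer immediately; /reset should return an initial
    # observation from a cached scenario bundle.
    assert httpx.get(f"{BASE}/health", timeout=10).status_code == 200
    reset = httpx.post(f"{BASE}/reset", json={}, timeout=60)
    reset.raise_for_status()
    print(reset.json())

if __name__ == "__main__":
    smoke_check()
```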
## 10. Implementation milestones

### Milestone 1 — Skeleton environment

- `models.py`
- `client.py`
- `server/environment.py`
- `server/app.py`
- `server/Dockerfile`
- `openenv.yaml`
- health check
- one hand-written scenario

### Milestone 2 — Scenario compiler

- policy graph format
- app template renderer
- bug injector
- DB fixture generator
- visible and hidden test generator

### Milestone 3 — Reward engine

- visible test score
- hidden invariant score
- regression score
- patch minimality score
- safety/reward-hacking penalties
- reward component logging

### Milestone 4 — Training script

- rollout loop
- GRPO/TRL or Unsloth training script
- Trackio logging
- checkpoint save/push
- baseline and post-training eval

### Milestone 5 — Hackathon demo

- HF Spaces deployment
- mini-blog
- 2-minute video
- before/after traces
- reward curve
- held-out eval table

## 11. Engineering notes

- Keep scenario apps small: ideally 5-15 files each.
- Prefer deterministic tests over LLM judging.
- Hide final hidden test details from observations.
- Log enough trace data to debug failures, but never leak hidden tests to the agent.
- Include intentionally public routes and allowed cross-role cases so the model does not learn “add auth everywhere.”
- The best demo is not just “agent finds bug,” but “agent learns not to break valid business behavior.”
## 12. Source notes and credibility

| Source | How it informs this architecture | Credibility |
|---|---|---:|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms why access control is the right security focus. | 10/10 |
| OWASP ASVS access-control guidance | Informs policy invariants and server-side authorization checks. | 9.5/10 |
| OpenEnv environment-building docs | Defines required models, reset/step/state, FastAPI server, Docker, and client. | 8.5/10 |
| OpenEnv quickstart/architecture docs | Informs WebSocket client/server design, typed EnvClient, and container isolation. | 8.5/10 |
| OpenEnv deployment docs | Informs HF Spaces deployment, endpoints, Docker workflow, and installable client package. | 8.5/10 |
| Hackathon judging criteria | Informs demo priorities: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv training example | Informs rollout function, decomposed reward functions, and Trackio logging pattern. | 8/10 |
| Kube SRE Gym README | Informs the closed-loop pattern: adversarial scenario design, curriculum mastery tracking, real tool interaction, verification, and artifact-driven storytelling. | 8/10 |
| DeepSeek-V4-Pro Hugging Face model card and encoding notes | Informs the default offline scenario-author config and the note that prompt handling should not assume a Jinja chat template. | 8/10 |