# 01_ARCHITECTURE.md

# CyberSecurity_OWASP — Architecture

## 1. System goal

`CyberSecurity_OWASP` is an OpenEnv environment for training a **single LLM policy** to perform a complete defensive authorization-repair workflow:

```text
Understand policy → discover local evidence → patch code → validate → submit
```

The environment is intentionally not a two-agent red-team/blue-team setup. The agent is one model with one trajectory. It must learn both sides of the defensive workflow: finding the policy violation and fixing it safely.

## 2. Final architecture diagram

Rendered asset: ![CyberSecurity_OWASP architecture](assets/architecture_diagram.svg)

Editable source: `assets/architecture_diagram.mmd`

```mermaid
flowchart TB
  subgraph A[Async Scenario Authoring + Curriculum Factory]
    A1[Config-guided LLM Scenario Author\nDeepSeek-V4-Pro default]
    A2[ScenarioSpec JSON\npolicy, app family, bug target]
    A3[Template + A01 Mutator\nFastAPI code variants]
    A4[Deterministic Compiler\nexecutable bundle]
    A5[Static + Dynamic Verifier\nsolvable, safe, hidden/visible tests]
    A6[Difficulty Calibrator\nbaseline pass-rate buckets]
    A7[Versioned Scenario Cache\nsplit, difficulty, family, hash]
    A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
  end
  subgraph B[CyberSecurity_OWASP OpenEnv Runtime]
    B1["reset(seed, difficulty, family_budget)\ncache lookup only"]
    B2[Curriculum Sampler\nvalidated cache slice]
    B3[Episode State Store\nphase, history, cache metadata, patch diff]
    B4[Typed Action Tools\ninspect, request, patch, visible tests]
    B5[Ephemeral App Sandbox\ncloned cached workspace + fixtures]
    B6[Multi-layer Verifier\nvisible, hidden, oracle, regression]
    B7[Deterministic Reward Engine\nstable components + penalties]
    B8[Episode Artifact Logger\nJSONL transcript + verifier + diff]
    B1 --> B2 --> B3 --> B4
    B4 <--> B5
    B5 --> B6 --> B7 --> B3
    B3 --> B8
  end
  subgraph C[Single LLM Agent]
    C1[Observation Parser]
    C2[AuthZ + Code Reasoning]
    C3[Discover → Diagnose → Patch → Test\none JSON action]
    C1 --> C2 --> C3
  end
  subgraph D[Training + Evaluation + Demo]
    D1[Parallel Rollouts\nfast cached reset]
    D2[TRL GRPO + LoRA]
    D3[Trackio Curves\nreward, pass rates, cache metrics]
    D4[Held-out Family Eval\nbase vs trained model]
    D5[Demo Artifacts\nbefore/after traces + JSONL]
    D1 --> D2 --> D3 --> D4 --> D5
  end
  subgraph E[Feedback / Adaptation Loop]
    E1[Episode logs + failures]
    E2[Mastery Model\nweakness and plateau tracking]
    E3[Cache Sampling Weights\nnew generation queue]
    E1 --> E2 --> E3
  end
  A7 --> B1
  C3 -->|typed action| B4
  B4 -->|observation + reward + done| C1
  B7 --> D1
  D2 --> C1
  B8 --> E1
  E3 --> A1
```

## 3. Component responsibilities

### 3.1 Async Scenario Authoring Plane

Scenario generation is offline, asynchronous, validated, and cached. Runtime `reset()` must not call an LLM and must not compile a fresh app during Modal smoke, training, or evaluation runs.

The scenario authoring plane outputs complete executable bundles:

- `scenario.json`;
- `app_source/`;
- `policy_graph.json`;
- `visible_tests.py`;
- `hidden_tests.py`;
- `oracle_tests.py`;
- `expected_exploit_trace.json`;
- `reward_config.json`;
- `metadata.json`.

The default scenario/curriculum author is configured in `configs/scenario_authoring.small.json`:

```yaml
provider: huggingface
model_id: deepseek-ai/DeepSeek-V4-Pro
thinking_mode: thinking
reasoning_effort: high
temperature: 1.0
top_p: 1.0
```

DeepSeek-V4-Pro is only used for offline scenario/curriculum authoring. It is not the RL policy model unless explicitly selected for training.

The compiler remains the main anti-overfitting mechanism. It should vary:

- route names;
- schema names;
- ORM query structure;
- framework template;
- role names;
- tenant IDs;
- object ownership patterns;
- file layout;
- visible test coverage;
- hidden invariant seeds.

The runtime treats curriculum and cache sampling as first-class scenario inputs:

- `CurriculumController` tracks target weakness mastery, recent reward trend, failure counts, and difficulty tier.
- Offline cache prep uses the configured LLM author, deterministic compiler, verifier, and baseline-agent difficulty calibrator.
- `ScenarioCache` stores validated bundles by split, difficulty, family, generator version, verifier version, and scenario hash.
- Hidden-eval episodes hold out scenario families, not only seeds, by marking evaluation-only scenario-family metadata in state rather than in observations.

Cache keys include:

```text
difficulty_level
authz_bug_type
app_family
framework
policy_shape
tenant_model
exploit_depth
patch_scope
regression_risk
generator_version
verifier_version
scenario_hash
```

### 3.2 Policy Graph Generator

The policy graph is the ground truth for intended behavior. Example internal representation:

```yaml
resources:
  invoice:
    owner_field: owner_user_id
    tenant_field: tenant_id
roles:
  user:
    can:
      - read:invoice where owner_user_id == actor.user_id
      - update:invoice where owner_user_id == actor.user_id and status != locked
  support:
    can:
      - read:invoice where tenant_id == actor.tenant_id
  admin:
    can:
      - read:any_invoice where tenant_id == actor.tenant_id
      - update:any_invoice where tenant_id == actor.tenant_id
public_routes:
  - GET /health
  - GET /pricing
forbidden:
  - cross_tenant_read
  - cross_tenant_update
  - user_reads_other_user_invoice
```

The policy graph prevents false rewards for over-securing intentionally public or intentionally allowed routes.

### 3.3 Bug Injector

The bug injector creates controlled, defensive lab scenarios. It should only generate bugs inside local synthetic apps.
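A missing-route-guard variant, for example, can be produced by mechanically stripping the authorization dependency from one generated route. A minimal sketch, assuming a hypothetical `require_owner` dependency name in the FastAPI templates (the real injector would operate on compiled template source):

```python
import re

# Hypothetical injector for the "missing route guard" bug class: strip the
# ownership dependency from one route so the variant violates the policy
# graph while still booting. `require_owner` is an assumed guard name.
GUARD_PATTERN = re.compile(r",?\s*\w+\s*=\s*Depends\(require_owner\)")

def inject_missing_route_guard(route_source: str) -> str:
    """Return the route source with its ownership guard removed."""
    return GUARD_PATTERN.sub("", route_source, count=1)

secure = (
    "@app.get('/invoices/{invoice_id}')\n"
    "def read_invoice(invoice_id: int, actor=Depends(require_owner)):\n"
    "    return db.get_invoice(invoice_id)\n"
)
vulnerable = inject_missing_route_guard(secure)
```

Because the mutation is textual and seed-independent, the same injector can apply across the route-name and schema variants that the compiler produces.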
MVP bug classes:

| Bug class | Example failure mode | Expected fix type |
|---|---|---|
| Missing route guard | Protected endpoint lacks authorization middleware | Add policy check/middleware |
| IDOR / ownership bug | User can access another user’s object by changing ID | Add owner check in query/policy |
| Tenant leak | Tenant A can list Tenant B records | Add tenant filter |
| Role confusion | Support/editor/admin boundary is wrong | Correct role-to-permission mapping |
| Client-side-only auth | Server trusts UI to hide forbidden action | Enforce server-side authorization |
| Query omission | List/export/search endpoint lacks auth filter | Filter query by actor permissions |
| Over-broad mutation | User can update/delete forbidden object | Add mutation permission check |
| Public route decoy | Agent may wrongly lock down intended public endpoint | Preserve intended public behavior |

### 3.4 OpenEnv Server

The OpenEnv server should implement the standard lifecycle:

- `reset()` — initialize a fresh episode from a cached scenario bundle.
- `step(action)` — execute one typed action and return observation, reward, and done.
- `state()` — expose episode metadata for debugging and evaluation.

Recommended package/class names:

```text
Repo name: CyberSecurity_OWASP
Python package: cybersecurity_owasp
Client class: CyberSecurityOWASPEnv
Action class: CyberSecurityOWASPAction
Observation: CyberSecurityOWASPObservation
State: CyberSecurityOWASPState
```

### 3.5 Tool API

The agent should interact through typed actions. Keep the interface small enough for RL but expressive enough for realistic repair.
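On the server side, typed actions can be dispatched through a closed tool registry so that anything outside the declared tool names is rejected rather than interpreted. A minimal sketch with hypothetical handlers (the real environment would route each tool to its sandboxed implementation):

```python
from typing import Any, Callable

# Hypothetical handlers; names, signatures, and return values are illustrative.
def _list_routes(arguments: dict) -> Any:
    return ["GET /health", "GET /invoices/{invoice_id}"]

def _noop(arguments: dict) -> Any:
    return "noop"

# Closed registry: only registered tool names are executable.
TOOL_HANDLERS: dict[str, Callable[[dict], Any]] = {
    "list_routes": _list_routes,
    "noop": _noop,
    # ... one entry per declared tool name
}

def dispatch(tool_name: str, arguments: dict) -> Any:
    """Execute one typed action; unknown tools are rejected, never guessed."""
    handler = TOOL_HANDLERS.get(tool_name)
    if handler is None:
        raise ValueError(f"unknown tool: {tool_name}")
    return handler(arguments)
```

Keeping the dispatch table closed is what makes the action space enumerable for RL while still allowing new tools to be added per scenario family.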
```python
@dataclass
class CyberSecurityOWASPAction(Action):
    tool_name: Literal[
        "inspect_policy_graph",
        "list_routes",
        "read_openapi",
        "read_file",
        "search_code",
        "send_local_request",
        "compare_identities",
        "submit_diagnosis",
        "patch_file",
        "run_visible_tests",
        "submit_fix",
        "noop",
    ]
    arguments: dict
```

Recommended actions:

| Action | Purpose | Safety boundary |
|---|---|---|
| `inspect_policy_graph` | Read intended authorization rules. | Only synthetic policy. |
| `list_routes` | See local app route map. | No internet target. |
| `read_file` | Inspect selected source file. | Sandbox allowlist only. |
| `send_local_request` | Validate behavior against local app. | Local generated app only. |
| `submit_diagnosis` | Record bug class, route, policy rule, evidence trace IDs, and fix plan. | Does not reveal hidden tests. |
| `run_visible_tests` | Run visible tests. | No hidden test disclosure. |
| `patch_file` | Modify source through unified diff or full content. | Patch size and file allowlist limits. |
| `submit_fix` | End episode and trigger hidden eval. | Final hidden score only, no leaked test details. |

### 3.6 Observation schema

Observations should be compact and structured.

```python
@dataclass
class CyberSecurityOWASPObservation(Observation):
    phase: Literal["discover", "patch", "done"]
    message: str
    task_brief: str
    visible_policy_hint: dict
    workspace_summary: dict
    available_actions: list[str]
    last_tool_result: str
    visible_test_result: str | None = None
    reward_breakdown: dict[str, float] = field(default_factory=dict)
    done_reason: str | None = None
```

The policy hint is deliberately partial. It may include product rules, fixture aliases, route summaries, and public-route intent, but it must not expose the hidden oracle matrix, hidden test bodies, injected bug labels, or held-out family labels.

### 3.7 State schema

State should support debugging and training analytics.
```python
@dataclass
class CyberSecurityOWASPState(State):
    episode_id: str
    task_id: str
    split: Literal["train", "validation", "hidden_eval"]
    step_count: int = 0
    max_steps: int = 40
    difficulty_tier: str = "warmup"
    scenario_family: str = ""
    template_id: str = "fastapi_basic"
    target_weakness: str = ""
    curriculum_snapshot: dict = field(default_factory=dict)
    verification_summary: dict = field(default_factory=dict)
    patch_diff: str = ""
    episode_artifact_path: str | None = None
    accumulated_reward: float = 0.0
```

## 4. Episode lifecycle

```text
1. reset()
   - curriculum selects difficulty tier and target weakness
   - runtime samples or directly loads a validated cached bundle
   - clone cached `app_source/` into an isolated ephemeral workspace
   - initialize fixture state, cache metadata, and sandbox handles
   - return initial observation

2. agent loop
   - inspect policy/routes/files
   - send local requests only inside sandbox
   - run visible tests
   - apply one or more patches
   - rerun visible tests

3. submit_fix
   - freeze patch
   - run visible tests
   - run hidden authorization invariants
   - run policy-oracle matrix
   - run regression and public-route preservation tests
   - compute deterministic reward
   - return final observation, reward, done=True

4. logging
   - append JSONL artifact with scenario metadata, action trace, observations,
     patch diff, verifier result, and reward components
   - feed terminal success/failure back into curriculum mastery tracking
   - send metrics to Trackio during training/eval
```

`CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` is mandatory for Modal smoke, training, and evaluation. In that mode a missing cache bundle is a hard failure. Local development may use `fallback`, which compiles deterministically on a cache miss, but that path is not allowed for meaningful training.

## 5. Reward design

The reward should be deterministic, decomposed, and resistant to reward hacking.
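That decomposition can be kept deterministic by computing episode totals as pure functions of the logged component values. A minimal sketch; the function names, example numbers, and default `shaping_weight` are hypothetical, not tuned configuration:

```python
# Illustrative only: deterministic totals computed from reward components.
# Component keys follow this environment's stable reward keys; the values
# and the shaping weight below are made-up numbers.
def eval_total(components: dict[str, float]) -> float:
    """Sparse evaluation reward: terminal_total only."""
    return components["terminal_total"]

def train_total(components: dict[str, float], shaping_weight: float = 0.1) -> float:
    """Dense training reward: terminal + shaped progress + efficiency - penalties."""
    efficiency = (
        components["speed_bonus"]
        - components["step_penalty"]
        - components["token_penalty"]
    )
    return (
        components["terminal_total"]
        + shaping_weight * components["progressive"]
        + efficiency
        - components["behavior_penalty"]
    )

components = {
    "terminal_total": 12.0,
    "progressive": 3.0,
    "speed_bonus": 0.5,
    "step_penalty": 0.2,
    "token_penalty": 0.1,
    "behavior_penalty": 0.0,
}
```

Keeping both totals as pure functions of logged components means any reported reward can be recomputed offline from the JSONL artifact, which is what makes reward audits and anti-hacking checks cheap.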
The maximum terminal reward remains **15.0**, and high reward requires deterministic verifier success, not explanation quality.

Stable reward keys:

```python
{
    "discovery": 0.0,
    "security": 0.0,
    "regression": 0.0,
    "public_routes": 0.0,
    "patch_quality": 0.0,
    "visible_tests": 0.0,
    "safety": 0.0,
    "anti_cheat": 0.0,
    "terminal_total": 0.0,
    "progressive": 0.0,
    "step_penalty": 0.0,
    "speed_bonus": 0.0,
    "token_penalty": 0.0,
    "behavior_penalty": 0.0,
    "train_total": 0.0,
    "total": 0.0,
}
```

Sparse evaluation uses `terminal_total` as `total`. Dense training uses `terminal_total + shaping_weight * progressive + efficiency - penalties` as `total`, with all reward values and short descriptions configured in `training/configs/grpo_small.yaml`.

### Reward components

| Component | Purpose |
|---|---|
| `discovery` | Valid local evidence and correct violated policy rule. |
| `security` | Hidden exploit blocking plus policy-oracle matrix pass. |
| `regression` | Legitimate owner/admin/support flows still work. |
| `public_routes` | Intentionally public routes remain public. |
| `patch_quality` | Localized policy-aligned patch and efficient phase order. |
| `visible_tests` | Visible tests pass and app still boots. |
| `safety` | Penalizes invalid action patterns, unsafe targets, timeouts, and deny-all behavior. |
| `anti_cheat` | Penalizes hidden-file probing, hardcoded fixture IDs, and test/oracle tampering. |

### Penalties

| Penalty | Trigger |
|---|---|
| public route penalty | Breaks a route intentionally marked public. |
| anti-cheat penalty | Deletes or probes tests, hidden files, reward code, oracle data, or host paths. |
| hardcoding penalty | Hardcodes seed-specific IDs, users, tenants, or hidden assumptions. |
| safety penalty | Over-broad denial, malformed/invalid actions, repeated failed actions, or external target attempts. |

The LLM judge, if used at all, should only annotate trace quality for analysis. It must not decide security-critical reward.

## 6. Hidden tests and anti-overfitting

Hidden tests are necessary because visible tests can be gamed or memorized. They should test policy invariants rather than exact implementation details.

Use **4 anti-overfitting layers**:

1. **Seed diversity** — route names, user IDs, tenant IDs, object names, and schemas change every episode.
2. **Template diversity** — the same policy bug appears in different frameworks and file layouts.
3. **Hidden invariant tests** — final reward uses unseen authorization cases.
4. **Held-out eval split** — at least 20% of scenario families/seeds are never used in training.

Recommended split:

```text
Train: 70%
Validation: 10%
Held-out: 20%
```

## 7. Evaluation plan

Run before/after evaluation on the same held-out suite.

### Metrics

| Metric | Meaning |
|---|---|
| `episode_success_rate` | Visible + hidden + regression tests pass. |
| `hidden_authz_pass_rate` | Security-critical hidden checks pass. |
| `regression_pass_rate` | Normal valid behavior remains intact. |
| `oversecure_rate` | Agent blocks intended legitimate/public behavior. |
| `patch_compile_rate` | Patch applies and app still runs. |
| `median_steps_to_submit` | Efficiency of the repair workflow. |
| `median_files_changed` | Patch focus/minimality. |
| `reward_hacking_rate` | Attempts to delete tests, hardcode fixtures, or bypass eval. |

### Eval table template

| Model | Split | Success | Hidden authz | Regression | Oversecure | Median steps | Median files changed |
|---|---|---:|---:|---:|---:|---:|---:|
| Base model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
| RL-trained model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |

## 8. Training flow

Rendered asset: ![CyberSecurity_OWASP RL training flow](assets/env_rl_training_flow_diagram.svg)

Editable source: `assets/env_rl_training_flow_diagram.mmd`

```text
1. Build the CyberSecurity_OWASP OpenEnv server.
2. Prepare the validated scenario cache once per generator/verifier version.
3. Run baseline eval with cached validation/held-out bundles.
4. Train with GRPO/TRL or Unsloth using cached rollout episodes.
5. Log reward components, pass rates, reset latency, and cache hit metrics to Trackio.
6. Run held-out eval every N training steps.
7. Inspect failure clusters and cache sampling weights.
8. Refresh only 5-10% of scenarios per epoch when new weak spots are found.
9. Produce the final demo: before/after trace + reward curve + held-out eval table.
```

Recommended initial training setup (Modal-first):

```text
Model: unsloth/gemma-4-E2B-it
Algorithm: GRPO via TRL or an Unsloth-compatible loop
Dataset prompt: repeated task instruction with randomized scenario IDs
Max steps per episode: 30
Rollouts per prompt: 2-4
Logging: Trackio
Primary eval: held-out deterministic test pass rate
Scenario cache mode: require
Scenario cache volume: CyberSecurity_OWASP-scenario-cache
```

Training execution is expected to run on Modal (persistent or ephemeral) rather than locally.

## 9. Deployment architecture

The environment should be runnable in three modes:

| Mode | Purpose |
|---|---|
| Local Uvicorn | Fast engineer iteration. |
| Docker | Reproducible local training/eval. |
| Hugging Face Spaces | Public hackathon demo and OpenEnv-compliant hosting. |

Expected endpoints:

```text
/ws      OpenEnv client session
/health  health check
/reset   debug reset
/step    debug step
/state   debug state
/docs    FastAPI docs
/web     optional web UI
```

## 10. Implementation milestones

### Milestone 1 — Skeleton environment

- `models.py`
- `client.py`
- `server/environment.py`
- `server/app.py`
- `server/Dockerfile`
- `openenv.yaml`
- health check
- one hand-written scenario

### Milestone 2 — Scenario compiler

- policy graph format
- app template renderer
- bug injector
- DB fixture generator
- visible and hidden test generator

### Milestone 3 — Reward engine

- visible test score
- hidden invariant score
- regression score
- patch minimality score
- safety/reward-hacking penalties
- reward component logging

### Milestone 4 — Training script

- rollout loop
- GRPO/TRL or Unsloth training script
- Trackio logging
- checkpoint save/push
- baseline and post-training eval

### Milestone 5 — Hackathon demo

- HF Spaces deployment
- mini-blog
- 2-minute video
- before/after traces
- reward curve
- held-out eval table

## 11. Engineering notes

- Keep scenario apps small: ideally 5-15 files each.
- Prefer deterministic tests over LLM judging.
- Hide final hidden test details from observations.
- Log enough trace data to debug failures, but never leak hidden tests to the agent.
- Include intentionally public routes and allowed cross-role cases so the model does not learn “add auth everywhere.”
- The best demo is not just “agent finds bug,” but “agent learns not to break valid business behavior.”

## 12. Source notes and credibility

| Source | How it informs this architecture | Credibility |
|---|---|---:|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms why access control is the right security focus. | 10/10 |
| OWASP ASVS access-control guidance | Informs policy invariants and server-side authorization checks. | 9.5/10 |
| OpenEnv environment-building docs | Defines required models, reset/step/state, FastAPI server, Docker, and client. | 8.5/10 |
| OpenEnv quickstart/architecture docs | Informs WebSocket client/server design, typed EnvClient, and container isolation. | 8.5/10 |
| OpenEnv deployment docs | Informs HF Spaces deployment, endpoints, Docker workflow, and installable client package. | 8.5/10 |
| Hackathon judging criteria | Informs demo priorities: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv training example | Informs rollout function, decomposed reward functions, and Trackio logging pattern. | 8/10 |
| Kube SRE Gym README | Informs the closed-loop pattern: adversarial scenario design, curriculum mastery tracking, real tool interaction, verification, and artifact-driven storytelling. | 8/10 |
| DeepSeek-V4-Pro Hugging Face model card and encoding notes | Informs the default offline scenario-author config and the note that prompt handling should not assume a Jinja chat template. | 8/10 |