Spaces:
Sleeping
01_ARCHITECTURE.md
CyberSecurity_OWASP — Architecture
1. System goal
CyberSecurity_OWASP is an OpenEnv environment for training a single LLM policy to perform a complete defensive authorization-repair workflow:
Understand policy → discover local evidence → patch code → validate → submit
The environment is intentionally not a two-agent red-team/blue-team setup. The agent is one model with one trajectory. It must learn both sides of the defensive workflow: finding the policy violation and fixing it safely.
2. Final architecture diagram
flowchart TB
%% =========================
%% Offline Build Layer
%% =========================
subgraph A[Offline Scenario Factory]
A1[Policy Graph Generator\nroles, users, tenants, ownership, route intent]
A2[App Template Library\nFastAPI, Express, Django MVP templates]
A3[Bug Injector\nmissing guard, IDOR, tenant leak, role confusion, query omission]
A4[Scenario Compiler\nmaterializes app + DB + public tests + hidden invariants]
A5[Split Manager\ntrain seeds, validation seeds, hidden held-out seeds]
A1 --> A4
A2 --> A4
A3 --> A4
A5 --> A4
end
%% =========================
%% OpenEnv Runtime
%% =========================
subgraph B[CyberSecurity_OWASP OpenEnv Server]
B1[reset\(\)\nselect scenario + start sandbox]
B2[Sandbox App Runtime\nlocal app, DB fixture, logs, route map]
B3[Tool API exposed through step\(action\)\nReadFile, ListRoutes, SendLocalRequest, RunTests, ApplyPatch, SubmitFix]
B4[State Store\nepisode_id, step_count, scenario_id, patch diff, test history]
B5[Deterministic Reward Engine\npolicy tests + hidden tests + regression tests + penalties]
B6[state\(\)\nstructured metadata for debugging/eval]
B1 --> B2
B2 --> B3
B3 --> B4
B4 --> B5
B4 --> B6
end
%% =========================
%% Agent + Training
%% =========================
subgraph C[Single LLM Agent]
C1[Observation Parser]
C2[Planner\npolicy reasoning + patch strategy]
C3[Action Generator\nchooses next OpenEnv action]
C1 --> C2 --> C3
end
subgraph D[Training + Evaluation]
D1[Rollout Loop\nreset → step* → final reward]
D2[GRPO / TRL / Unsloth Training]
D3[Trackio Metrics\nreward curves, pass rates, patch size, steps]
D4[Held-out Eval Suite\nunseen templates, seeds, names, route structures]
D5[Demo Artifacts\nbefore/after traces, mini-blog, 2-minute video]
D1 --> D2 --> D3
D3 --> D4 --> D5
end
A4 --> B1
C3 -->|typed action| B3
B3 -->|observation + reward + done| C1
B5 --> D1
D2 --> C1
B5 --> D4
3. Component responsibilities
3.1 Scenario Factory
The scenario factory generates many small but realistic web apps from a structured authorization policy.
It should output:
- application code;
- route map;
- database fixture;
- user/session/token fixtures;
- policy graph;
- intentionally injected access-control bug;
- public tests visible to the agent;
- hidden tests invisible to the agent;
- metadata for eval and debugging.
The scenario compiler is the main anti-overfitting mechanism. It should vary:
- route names;
- schema names;
- ORM query structure;
- framework template;
- role names;
- tenant IDs;
- object ownership patterns;
- file layout;
- visible test coverage;
- hidden invariant seeds.
3.2 Policy Graph Generator
The policy graph is the ground truth for intended behavior.
Example internal representation:
resources:
invoice:
owner_field: owner_user_id
tenant_field: tenant_id
roles:
user:
can:
- read:invoice where owner_user_id == actor.user_id
- update:invoice where owner_user_id == actor.user_id and status != locked
support:
can:
- read:invoice where tenant_id == actor.tenant_id
admin:
can:
- read:any_invoice where tenant_id == actor.tenant_id
- update:any_invoice where tenant_id == actor.tenant_id
public_routes:
- GET /health
- GET /pricing
forbidden:
- cross_tenant_read
- cross_tenant_update
- user_reads_other_user_invoice
The policy graph prevents false rewards for over-securing intentionally public or intentionally allowed routes.
3.3 Bug Injector
The bug injector creates controlled, defensive lab scenarios. It should only generate bugs inside local synthetic apps.
MVP bug classes:
| Bug class | Example failure mode | Expected fix type |
|---|---|---|
| Missing route guard | Protected endpoint lacks authorization middleware | Add policy check/middleware |
| IDOR / ownership bug | User can access another user’s object by changing ID | Add owner check in query/policy |
| Tenant leak | Tenant A can list Tenant B records | Add tenant filter |
| Role confusion | Support/editor/admin boundary is wrong | Correct role-to-permission mapping |
| Client-side-only auth | Server trusts UI to hide forbidden action | Enforce server-side authorization |
| Query omission | List/export/search endpoint lacks auth filter | Filter query by actor permissions |
| Over-broad mutation | User can update/delete forbidden object | Add mutation permission check |
| Public route decoy | Agent may wrongly lock down intended public endpoint | Preserve intended public behavior |
3.4 OpenEnv Server
The OpenEnv server should implement the standard lifecycle:
reset()— initialize a fresh scenario instance.step(action)— execute one typed action and return observation, reward, and done.state()— expose episode metadata for debugging and evaluation.
Recommended package/class names:
Repo name: CyberSecurity_OWASP
Python package: cybersecurity_owasp
Client class: CyberSecurityOWASPEnv
Action class: CyberSecurityOWASPAction
Observation: CyberSecurityOWASPObservation
State: CyberSecurityOWASPState
3.5 Tool API
The agent should interact through typed actions. Keep the interface small enough for RL but expressive enough for realistic repair.
@dataclass
class CyberSecurityOWASPAction(Action):
action_type: Literal[
"read_file",
"list_files",
"list_routes",
"inspect_policy",
"send_local_request",
"run_public_tests",
"apply_patch",
"submit_fix",
]
arguments: dict
Recommended actions:
| Action | Purpose | Safety boundary |
|---|---|---|
inspect_policy |
Read intended authorization rules. | Only synthetic policy. |
list_routes |
See local app route map. | No internet target. |
read_file |
Inspect selected source file. | Sandbox allowlist only. |
send_local_request |
Validate behavior against local app. | Local generated app only. |
run_public_tests |
Run visible tests. | No hidden test disclosure. |
apply_patch |
Modify source through unified diff. | Patch size and file allowlist limits. |
submit_fix |
End episode and trigger hidden eval. | Final hidden score only, no leaked test details. |
3.6 Observation schema
Observations should be compact and structured.
@dataclass
class CyberSecurityOWASPObservation(Observation):
message: str
visible_policy_summary: str
route_summary: list[dict]
last_action_result: dict
public_test_summary: dict
patch_summary: dict
done_reason: str | None = None
Do not expose hidden test bodies, hidden expected outputs, or seed-specific solution hints.
3.7 State schema
State should support debugging and training analytics.
@dataclass
class CyberSecurityOWASPState(State):
episode_id: str
scenario_id: str
split: Literal["train", "validation", "heldout"]
step_count: int = 0
max_steps: int = 30
scenario_family: str = ""
app_template: str = ""
files_touched: list[str] = field(default_factory=list)
public_tests_passed: int = 0
public_tests_total: int = 0
hidden_tests_passed: int = 0
hidden_tests_total: int = 0
accumulated_reward: float = 0.0
4. Episode lifecycle
1. reset()
- sample train/validation scenario seed
- compile app from policy graph + template + injected bug
- start local sandbox app and DB fixture
- return initial observation
2. agent loop
- inspect policy/routes/files
- send local requests only inside sandbox
- run public tests
- apply one or more patches
- rerun public tests
3. submit_fix
- freeze patch
- run public tests
- run hidden authorization invariants
- run regression tests
- compute deterministic reward
- return final observation, reward, done=True
4. logging
- record scenario_id, action trace, patch diff, reward components
- send metrics to Trackio during training/eval
5. Reward design
The reward should be deterministic, decomposed, and resistant to reward hacking.
Recommended reward formula:
R = 0.35 * public_policy_tests
+ 0.30 * hidden_authz_invariants
+ 0.15 * regression_preservation
+ 0.10 * evidence_quality
+ 0.05 * patch_minimality
+ 0.05 * efficiency
- penalties
Reward components
| Component | Weight | What it rewards |
|---|---|---|
| Public policy tests | 0.35 | Agent fixes known failing behavior. |
| Hidden authz invariants | 0.30 | Patch generalizes beyond visible tests. |
| Regression preservation | 0.15 | Valid user flows and intended public routes still work. |
| Evidence quality | 0.10 | Agent gathered relevant policy/test/file evidence before patching. |
| Patch minimality | 0.05 | Small focused patches instead of broad rewrites. |
| Efficiency | 0.05 | Fewer wasted steps and repeated actions. |
Penalties
| Penalty | Trigger |
|---|---|
-0.25 |
Breaks public route intentionally marked public. |
-0.25 |
Deletes tests, policy file, or route instead of fixing authorization. |
-0.20 |
Hardcodes seed-specific IDs, users, tenants, or hidden assumptions. |
-0.15 |
Over-broad denial that blocks legitimate authorized users. |
-0.10 |
Patch exceeds file or diff-size budget. |
-1.00 |
Attempts external network access, credential extraction, persistence, or unsafe behavior. |
The LLM judge, if used at all, should only annotate trace quality for analysis. It must not decide security-critical reward.
6. Hidden tests and anti-overfitting
Hidden tests are necessary because visible tests can be gamed or memorized. They should test policy invariants rather than exact implementation details.
Use 4 anti-overfitting layers:
- Seed diversity — route names, user IDs, tenant IDs, object names, and schemas change every episode.
- Template diversity — same policy bug appears in different frameworks and file layouts.
- Hidden invariant tests — final reward uses unseen authorization cases.
- Held-out eval split — at least 20% of scenario families/seeds are never used in training.
Recommended split:
Train: 70%
Validation: 10%
Held-out: 20%
7. Evaluation plan
Run before/after evaluation on the same held-out suite.
Metrics
| Metric | Meaning |
|---|---|
episode_success_rate |
Public + hidden + regression tests pass. |
hidden_authz_pass_rate |
Security-critical hidden checks pass. |
regression_pass_rate |
Normal valid behavior remains intact. |
oversecure_rate |
Agent blocks intended legitimate/public behavior. |
patch_compile_rate |
Patch applies and app still runs. |
median_steps_to_submit |
Efficiency of the repair workflow. |
median_files_changed |
Patch focus/minimality. |
reward_hacking_rate |
Attempts to delete tests, hardcode fixtures, or bypass eval. |
Eval table template
| Model | Split | Success | Hidden authz | Regression | Oversecure | Median steps | Median files changed |
|---|---|---|---|---|---|---|---|
| Base model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
| RL-trained model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
8. Training flow
1. Build CyberSecurity_OWASP OpenEnv server.
2. Generate 600 MVP scenarios.
3. Run baseline eval with the base model.
4. Train with GRPO/TRL or Unsloth using rollout episodes.
5. Log reward components to Trackio.
6. Run held-out eval every N training steps.
7. Inspect failure clusters.
8. Add scenario mutations only if failures reveal overfitting.
9. Produce final demo: before/after trace + reward curve + held-out eval table.
Recommended initial training setup:
Model: Qwen/Qwen3-1.7B or similar small instruct model
Algorithm: GRPO via TRL or Unsloth-compatible loop
Dataset prompt: repeated task instruction with randomized scenario IDs
Max steps per episode: 30
Rollouts per prompt: 2-4
Logging: Trackio
Primary eval: held-out deterministic test pass rate
9. Deployment architecture
The environment should be runnable in 3 modes:
| Mode | Purpose |
|---|---|
| Local Uvicorn | Fast engineer iteration. |
| Docker | Reproducible local training/eval. |
| Hugging Face Spaces | Public hackathon demo and OpenEnv-compliant hosting. |
Expected endpoints:
/ws OpenEnv client session
/health health check
/reset debug reset
/step debug step
/state debug state
/docs FastAPI docs
/web optional web UI
10. Implementation milestones
Milestone 1 — Skeleton environment
models.pyclient.pyserver/environment.pyserver/app.pyserver/Dockerfileopenenv.yaml- health check
- one hand-written scenario
Milestone 2 — Scenario compiler
- policy graph format
- app template renderer
- bug injector
- DB fixture generator
- public and hidden test generator
Milestone 3 — Reward engine
- public test score
- hidden invariant score
- regression score
- patch minimality score
- safety/reward-hacking penalties
- reward component logging
Milestone 4 — Training script
- rollout loop
- GRPO/TRL or Unsloth training script
- Trackio logging
- checkpoint save/push
- baseline and post-training eval
Milestone 5 — Hackathon demo
- HF Spaces deployment
- mini-blog
- 2-minute video
- before/after traces
- reward curve
- held-out eval table
11. Engineering notes
- Keep scenario apps small: ideally 5-15 files each.
- Prefer deterministic tests over LLM judging.
- Hide final hidden test details from observations.
- Log enough trace data to debug failures but never leak hidden tests to the agent.
- Include intentionally public routes and allowed cross-role cases so the model does not learn “add auth everywhere.”
- The best demo is not just “agent finds bug,” but “agent learns not to break valid business behavior.”
12. Source notes and credibility
| Source | How it informs this architecture | Credibility |
|---|---|---|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms why access control is the right security focus. | 10/10 |
| OWASP ASVS access-control guidance | Informs policy invariants and server-side authorization checks. | 9.5/10 |
| OpenEnv environment-building docs | Defines required models, reset/step/state, FastAPI server, Docker, and client. | 8.5/10 |
| OpenEnv quickstart/architecture docs | Informs WebSocket client/server design, typed EnvClient, and container isolation. | 8.5/10 |
| OpenEnv deployment docs | Informs HF Spaces deployment, endpoints, Docker workflow, and installable client package. | 8.5/10 |
| Hackathon judging criteria | Informs demo priorities: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv training example | Informs rollout function, decomposed reward functions, and Trackio logging pattern. | 8/10 |