File size: 15,798 Bytes
06bfd31
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
# 01_ARCHITECTURE.md

# CyberSecurity_OWASP — Architecture

## 1. System goal

`CyberSecurity_OWASP` is an OpenEnv environment for training a **single LLM policy** to perform a complete defensive authorization-repair workflow:

```text
Understand policy → discover local evidence → patch code → validate → submit
```

The environment is intentionally not a two-agent red-team/blue-team setup. The agent is one model with one trajectory. It must learn both sides of the defensive workflow: finding the policy violation and fixing it safely.

## 2. Final architecture diagram

```mermaid
flowchart TB
    %% =========================
    %% Offline Build Layer
    %% =========================
    subgraph A[Offline Scenario Factory]
        A1[Policy Graph Generator\nroles, users, tenants, ownership, route intent]
        A2[App Template Library\nFastAPI, Express, Django MVP templates]
        A3[Bug Injector\nmissing guard, IDOR, tenant leak, role confusion, query omission]
        A4[Scenario Compiler\nmaterializes app + DB + public tests + hidden invariants]
        A5[Split Manager\ntrain seeds, validation seeds, hidden held-out seeds]
        A1 --> A4
        A2 --> A4
        A3 --> A4
        A5 --> A4
    end

    %% =========================
    %% OpenEnv Runtime
    %% =========================
    subgraph B[CyberSecurity_OWASP OpenEnv Server]
        B1[reset\(\)\nselect scenario + start sandbox]
        B2[Sandbox App Runtime\nlocal app, DB fixture, logs, route map]
        B3[Tool API exposed through step\(action\)\nReadFile, ListRoutes, SendLocalRequest, RunTests, ApplyPatch, SubmitFix]
        B4[State Store\nepisode_id, step_count, scenario_id, patch diff, test history]
        B5[Deterministic Reward Engine\npolicy tests + hidden tests + regression tests + penalties]
        B6[state\(\)\nstructured metadata for debugging/eval]
        B1 --> B2
        B2 --> B3
        B3 --> B4
        B4 --> B5
        B4 --> B6
    end

    %% =========================
    %% Agent + Training
    %% =========================
    subgraph C[Single LLM Agent]
        C1[Observation Parser]
        C2[Planner\npolicy reasoning + patch strategy]
        C3[Action Generator\nchooses next OpenEnv action]
        C1 --> C2 --> C3
    end

    subgraph D[Training + Evaluation]
        D1[Rollout Loop\nreset → step* → final reward]
        D2[GRPO / TRL / Unsloth Training]
        D3[Trackio Metrics\nreward curves, pass rates, patch size, steps]
        D4[Held-out Eval Suite\nunseen templates, seeds, names, route structures]
        D5[Demo Artifacts\nbefore/after traces, mini-blog, 2-minute video]
        D1 --> D2 --> D3
        D3 --> D4 --> D5
    end

    A4 --> B1
    C3 -->|typed action| B3
    B3 -->|observation + reward + done| C1
    B5 --> D1
    D2 --> C1
    B5 --> D4
```

## 3. Component responsibilities

### 3.1 Scenario Factory

The scenario factory generates many small but realistic web apps from a structured authorization policy.

It should output:

- application code;
- route map;
- database fixture;
- user/session/token fixtures;
- policy graph;
- intentionally injected access-control bug;
- public tests visible to the agent;
- hidden tests invisible to the agent;
- metadata for eval and debugging.

The scenario compiler is the main anti-overfitting mechanism. It should vary:

- route names;
- schema names;
- ORM query structure;
- framework template;
- role names;
- tenant IDs;
- object ownership patterns;
- file layout;
- visible test coverage;
- hidden invariant seeds.

### 3.2 Policy Graph Generator

The policy graph is the ground truth for intended behavior.

Example internal representation:

```yaml
resources:
  invoice:
    owner_field: owner_user_id
    tenant_field: tenant_id
roles:
  user:
    can:
      - read:invoice where owner_user_id == actor.user_id
      - update:invoice where owner_user_id == actor.user_id and status != locked
  support:
    can:
      - read:invoice where tenant_id == actor.tenant_id
  admin:
    can:
      - read:any_invoice where tenant_id == actor.tenant_id
      - update:any_invoice where tenant_id == actor.tenant_id
public_routes:
  - GET /health
  - GET /pricing
forbidden:
  - cross_tenant_read
  - cross_tenant_update
  - user_reads_other_user_invoice
```

The policy graph prevents false rewards for over-securing intentionally public or intentionally allowed routes.

### 3.3 Bug Injector

The bug injector creates controlled, defensive lab scenarios. It should only generate bugs inside local synthetic apps.

MVP bug classes:

| Bug class | Example failure mode | Expected fix type |
|---|---|---|
| Missing route guard | Protected endpoint lacks authorization middleware | Add policy check/middleware |
| IDOR / ownership bug | User can access another user’s object by changing ID | Add owner check in query/policy |
| Tenant leak | Tenant A can list Tenant B records | Add tenant filter |
| Role confusion | Support/editor/admin boundary is wrong | Correct role-to-permission mapping |
| Client-side-only auth | Server trusts UI to hide forbidden action | Enforce server-side authorization |
| Query omission | List/export/search endpoint lacks auth filter | Filter query by actor permissions |
| Over-broad mutation | User can update/delete forbidden object | Add mutation permission check |
| Public route decoy | Agent may wrongly lock down intended public endpoint | Preserve intended public behavior |

### 3.4 OpenEnv Server

The OpenEnv server should implement the standard lifecycle:

- `reset()` — initialize a fresh scenario instance.
- `step(action)` — execute one typed action and return observation, reward, and done.
- `state()` — expose episode metadata for debugging and evaluation.

Recommended package/class names:

```text
Repo name:      CyberSecurity_OWASP
Python package: cybersecurity_owasp
Client class:   CyberSecurityOWASPEnv
Action class:   CyberSecurityOWASPAction
Observation:    CyberSecurityOWASPObservation
State:          CyberSecurityOWASPState
```

### 3.5 Tool API

The agent should interact through typed actions. Keep the interface small enough for RL but expressive enough for realistic repair.

```python
@dataclass
class CyberSecurityOWASPAction(Action):
    action_type: Literal[
        "read_file",
        "list_files",
        "list_routes",
        "inspect_policy",
        "send_local_request",
        "run_public_tests",
        "apply_patch",
        "submit_fix",
    ]
    arguments: dict
```

Recommended actions:

| Action | Purpose | Safety boundary |
|---|---|---|
| `inspect_policy` | Read intended authorization rules. | Only synthetic policy. |
| `list_routes` | See local app route map. | No internet target. |
| `read_file` | Inspect selected source file. | Sandbox allowlist only. |
| `send_local_request` | Validate behavior against local app. | Local generated app only. |
| `run_public_tests` | Run visible tests. | No hidden test disclosure. |
| `apply_patch` | Modify source through unified diff. | Patch size and file allowlist limits. |
| `submit_fix` | End episode and trigger hidden eval. | Final hidden score only, no leaked test details. |

### 3.6 Observation schema

Observations should be compact and structured.

```python
@dataclass
class CyberSecurityOWASPObservation(Observation):
    message: str
    visible_policy_summary: str
    route_summary: list[dict]
    last_action_result: dict
    public_test_summary: dict
    patch_summary: dict
    done_reason: str | None = None
```

Do not expose hidden test bodies, hidden expected outputs, or seed-specific solution hints.

### 3.7 State schema

State should support debugging and training analytics.

```python
@dataclass
class CyberSecurityOWASPState(State):
    episode_id: str
    scenario_id: str
    split: Literal["train", "validation", "heldout"]
    step_count: int = 0
    max_steps: int = 30
    scenario_family: str = ""
    app_template: str = ""
    files_touched: list[str] = field(default_factory=list)
    public_tests_passed: int = 0
    public_tests_total: int = 0
    hidden_tests_passed: int = 0
    hidden_tests_total: int = 0
    accumulated_reward: float = 0.0
```

## 4. Episode lifecycle

```text
1. reset()
   - sample train/validation scenario seed
   - compile app from policy graph + template + injected bug
   - start local sandbox app and DB fixture
   - return initial observation

2. agent loop
   - inspect policy/routes/files
   - send local requests only inside sandbox
   - run public tests
   - apply one or more patches
   - rerun public tests

3. submit_fix
   - freeze patch
   - run public tests
   - run hidden authorization invariants
   - run regression tests
   - compute deterministic reward
   - return final observation, reward, done=True

4. logging
   - record scenario_id, action trace, patch diff, reward components
   - send metrics to Trackio during training/eval
```

## 5. Reward design

The reward should be deterministic, decomposed, and resistant to reward hacking.

Recommended reward formula:

```text
R = 0.35 * public_policy_tests
  + 0.30 * hidden_authz_invariants
  + 0.15 * regression_preservation
  + 0.10 * evidence_quality
  + 0.05 * patch_minimality
  + 0.05 * efficiency
  - penalties
```

### Reward components

| Component | Weight | What it rewards |
|---|---:|---|
| Public policy tests | 0.35 | Agent fixes known failing behavior. |
| Hidden authz invariants | 0.30 | Patch generalizes beyond visible tests. |
| Regression preservation | 0.15 | Valid user flows and intended public routes still work. |
| Evidence quality | 0.10 | Agent gathered relevant policy/test/file evidence before patching. |
| Patch minimality | 0.05 | Small focused patches instead of broad rewrites. |
| Efficiency | 0.05 | Fewer wasted steps and repeated actions. |

### Penalties

| Penalty | Trigger |
|---|---|
| `-0.25` | Breaks public route intentionally marked public. |
| `-0.25` | Deletes tests, policy file, or route instead of fixing authorization. |
| `-0.20` | Hardcodes seed-specific IDs, users, tenants, or hidden assumptions. |
| `-0.15` | Over-broad denial that blocks legitimate authorized users. |
| `-0.10` | Patch exceeds file or diff-size budget. |
| `-1.00` | Attempts external network access, credential extraction, persistence, or unsafe behavior. |

The LLM judge, if used at all, should only annotate trace quality for analysis. It must not decide security-critical reward.

## 6. Hidden tests and anti-overfitting

Hidden tests are necessary because visible tests can be gamed or memorized. They should test policy invariants rather than exact implementation details.

Use **4 anti-overfitting layers**:

1. **Seed diversity** — route names, user IDs, tenant IDs, object names, and schemas change every episode.
2. **Template diversity** — same policy bug appears in different frameworks and file layouts.
3. **Hidden invariant tests** — final reward uses unseen authorization cases.
4. **Held-out eval split** — at least 20% of scenario families/seeds are never used in training.

Recommended split:

```text
Train:      70%
Validation: 10%
Held-out:   20%
```

## 7. Evaluation plan

Run before/after evaluation on the same held-out suite.

### Metrics

| Metric | Meaning |
|---|---|
| `episode_success_rate` | Public + hidden + regression tests pass. |
| `hidden_authz_pass_rate` | Security-critical hidden checks pass. |
| `regression_pass_rate` | Normal valid behavior remains intact. |
| `oversecure_rate` | Agent blocks intended legitimate/public behavior. |
| `patch_compile_rate` | Patch applies and app still runs. |
| `median_steps_to_submit` | Efficiency of the repair workflow. |
| `median_files_changed` | Patch focus/minimality. |
| `reward_hacking_rate` | Attempts to delete tests, hardcode fixtures, or bypass eval. |

### Eval table template

| Model | Split | Success | Hidden authz | Regression | Oversecure | Median steps | Median files changed |
|---|---|---:|---:|---:|---:|---:|---:|
| Base model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
| RL-trained model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |

## 8. Training flow

```text
1. Build CyberSecurity_OWASP OpenEnv server.
2. Generate 600 MVP scenarios.
3. Run baseline eval with the base model.
4. Train with GRPO/TRL or Unsloth using rollout episodes.
5. Log reward components to Trackio.
6. Run held-out eval every N training steps.
7. Inspect failure clusters.
8. Add scenario mutations only if failures reveal overfitting.
9. Produce final demo: before/after trace + reward curve + held-out eval table.
```

Recommended initial training setup:

```text
Model: Qwen/Qwen3-1.7B or similar small instruct model
Algorithm: GRPO via TRL or Unsloth-compatible loop
Dataset prompt: repeated task instruction with randomized scenario IDs
Max steps per episode: 30
Rollouts per prompt: 2-4
Logging: Trackio
Primary eval: held-out deterministic test pass rate
```

## 9. Deployment architecture

The environment should be runnable in 3 modes:

| Mode | Purpose |
|---|---|
| Local Uvicorn | Fast engineer iteration. |
| Docker | Reproducible local training/eval. |
| Hugging Face Spaces | Public hackathon demo and OpenEnv-compliant hosting. |

Expected endpoints:

```text
/ws       OpenEnv client session
/health   health check
/reset    debug reset
/step     debug step
/state    debug state
/docs     FastAPI docs
/web      optional web UI
```

## 10. Implementation milestones

### Milestone 1 — Skeleton environment

- `models.py`
- `client.py`
- `server/environment.py`
- `server/app.py`
- `server/Dockerfile`
- `openenv.yaml`
- health check
- one hand-written scenario

### Milestone 2 — Scenario compiler

- policy graph format
- app template renderer
- bug injector
- DB fixture generator
- public and hidden test generator

### Milestone 3 — Reward engine

- public test score
- hidden invariant score
- regression score
- patch minimality score
- safety/reward-hacking penalties
- reward component logging

### Milestone 4 — Training script

- rollout loop
- GRPO/TRL or Unsloth training script
- Trackio logging
- checkpoint save/push
- baseline and post-training eval

### Milestone 5 — Hackathon demo

- HF Spaces deployment
- mini-blog
- 2-minute video
- before/after traces
- reward curve
- held-out eval table

## 11. Engineering notes

- Keep scenario apps small: ideally 5-15 files each.
- Prefer deterministic tests over LLM judging.
- Hide final hidden test details from observations.
- Log enough trace data to debug failures but never leak hidden tests to the agent.
- Include intentionally public routes and allowed cross-role cases so the model does not learn “add auth everywhere.”
- The best demo is not just “agent finds bug,” but “agent learns not to break valid business behavior.”

## 12. Source notes and credibility

| Source | How it informs this architecture | Credibility |
|---|---|---:|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms why access control is the right security focus. | 10/10 |
| OWASP ASVS access-control guidance | Informs policy invariants and server-side authorization checks. | 9.5/10 |
| OpenEnv environment-building docs | Defines required models, reset/step/state, FastAPI server, Docker, and client. | 8.5/10 |
| OpenEnv quickstart/architecture docs | Informs WebSocket client/server design, typed EnvClient, and container isolation. | 8.5/10 |
| OpenEnv deployment docs | Informs HF Spaces deployment, endpoints, Docker workflow, and installable client package. | 8.5/10 |
| Hackathon judging criteria | Informs demo priorities: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv training example | Informs rollout function, decomposed reward functions, and Trackio logging pattern. | 8/10 |