# CyberSecurity_OWASP β€” Project Brief

## 1. One-line summary

`CyberSecurity_OWASP` is an OpenEnv reinforcement-learning environment where a **single LLM agent learns the full defensive workflow for OWASP access-control bugs**: understand the intended authorization policy, discover a broken access-control path in a local synthetic app, patch the code, and prove that the fix blocks unauthorized access without breaking valid user flows.

## 2. Problem

Broken access control remains one of the most important web-application security risks because the correct behavior is usually **application-specific**. Generic scanners can find some missing checks, but they often lack enough context to answer the real engineering question:

> β€œGiven this app’s policy, users, roles, tenants, routes, and data model, is this behavior intended or a security bug?”

Modern LLMs can read code, reason about tests, and propose patches, but they still struggle with:

- distinguishing intended public/feature behavior from accidental over-permission;
- following authorization logic across routes, middleware, ORM queries, tenants, roles, and ownership checks;
- validating that a patch fixes the bug without introducing regressions;
- avoiding reward hacking when tests are visible or too narrow;
- generalizing across app templates instead of memorizing one codebase.

`CyberSecurity_OWASP` turns this into a trainable environment.

## 3. What the environment trains

The environment trains **one agent**, not a separate red-team and blue-team pair. The same model must perform the entire secure-repair loop:

1. **Understand policy** β€” read the policy graph, user roles, route intent, tenant rules, and allowed operations.
2. **Discover evidence** β€” use safe local requests, logs, route metadata, and visible tests to identify the likely access-control failure.
3. **Patch** β€” edit application code, middleware, route guards, query filters, or policy mappings.
4. **Validate** β€” run public tests, policy checks, and regression tests.
5. **Submit** β€” final answer is judged by deterministic hidden tests and reward logic.
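
The five stages above can be sketched as a toy `reset`/`step` loop. This is an illustration only: `Observation`, `ToyRepairEnv`, and the action strings are invented here; the real types would live in `cybersecurity_owasp/models.py` and the client in `client.py`.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    stage: str          # which of the five stages the episode is in
    done: bool = False  # True once the agent has submitted
    reward: float = 0.0 # deterministic, paid only at submission time

class ToyRepairEnv:
    """Stands in for the real OpenEnv server: one fixed five-stage episode."""
    STAGES = ("understand", "discover", "patch", "validate", "submit")

    def reset(self) -> Observation:
        self.turn = 0
        return Observation(stage=self.STAGES[0])

    def step(self, action: str) -> Observation:
        self.turn += 1
        done = self.turn >= len(self.STAGES)
        return Observation(
            stage=self.STAGES[min(self.turn, len(self.STAGES) - 1)],
            done=done,
            reward=1.0 if done else 0.0,
        )

env = ToyRepairEnv()
obs = env.reset()
history = [obs.stage]
for action in ("read_policy", "probe_routes", "edit_guard", "run_tests", "submit"):
    obs = env.step(action)
    history.append(obs.stage)
```

In the real environment each `step` would carry a structured action (file edit, local request, test run) and the terminal reward would come from the hidden test suite rather than a constant.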

## 4. Scope for MVP

The MVP should focus on **OWASP A01: Broken Access Control** with ASVS-inspired access-control requirements.

Initial scenario families:

1. Missing route-level authorization check.
2. Insecure direct object reference / object ownership bug.
3. Cross-tenant data leakage.
4. Role confusion: user/admin/support/editor boundary error.
5. Client-side-only authorization assumption.
6. Query filter omission in list/search/export endpoint.
7. Over-broad update/delete permission.
8. A feature route that is intentionally public, which the agent must not over-secure.
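
As a concrete illustration of family 1 (missing route-level authorization check), the bug-and-patch pair might look like the following framework-agnostic sketch. The user store, role model, and route names are invented for illustration; the real scenarios are generated per app template.

```python
# Hypothetical scenario family 1: a handler that forgets to check the
# caller's role. Data and policy ("only admins may read reports") are toy.

USERS = {"alice": "admin", "bob": "user"}
REPORTS = {"q3": "confidential revenue numbers"}

def get_report_buggy(caller: str, report_id: str) -> str:
    # BUG: any authenticated caller can read any report.
    return REPORTS[report_id]

def get_report_patched(caller: str, report_id: str) -> str:
    # FIX: enforce the role check at the route level, before data access.
    if USERS.get(caller) != "admin":
        raise PermissionError(f"{caller} may not read reports")
    return REPORTS[report_id]
```

The hidden invariant tests for this family would assert that the unauthorized caller is rejected after the patch while the admin's valid flow keeps working.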

Recommended MVP size: **8 scenario families Γ— 3 app templates Γ— 25 seeds = 600 trainable scenarios**, with separate held-out families and hidden seeds for evaluation.
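
The grid can be enumerated deterministically at build time. The following sketch assumes invented family names (the real list would be defined by `scenarios/compiler.py`) and shows how a stable scenario ID could be derived so that every (family, template, seed) triple maps to the same scenario on every run:

```python
import hashlib
import itertools

# Illustrative names only; the real families/templates live in scenarios/.
FAMILIES = [
    "missing_route_check", "idor", "cross_tenant_leak", "role_confusion",
    "client_side_only", "query_filter_omission", "over_broad_write",
    "intentionally_public",
]
TEMPLATES = ["fastapi_basic", "express_basic", "django_basic"]
SEEDS = range(25)

def scenario_id(family: str, template: str, seed: int) -> str:
    # Stable ID so reset() can rebuild the exact same app every time.
    key = f"{family}:{template}:{seed}"
    return hashlib.sha256(key.encode()).hexdigest()[:12]

grid = [scenario_id(f, t, s)
        for f, t, s in itertools.product(FAMILIES, TEMPLATES, SEEDS)]
```

Held-out families and hidden seeds simply stay out of `FAMILIES`/`SEEDS` for training and get their own evaluation grid.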

## 5. Why this is useful

This environment is useful because it targets a real gap between today’s scanners and useful defensive agents:

- **Scanners detect patterns.** This environment trains policy-aware reasoning.
- **Unit tests check known cases.** This environment includes hidden authorization invariants.
- **Static repair can overfit.** This environment forces the model to preserve valid business behavior.
- **One-app benchmarks are easy to memorize.** This environment prepares and caches many equivalent-but-different apps from policy graphs, templates, route shapes, schema names, and hidden test seeds, then keeps runtime `reset()` deterministic and fast.

The outcome is a model that becomes better at a practical DevSecOps workflow: safely reviewing and repairing authorization logic in small-to-medium web apps.
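
One way to keep `reset()` deterministic and fast, as the last bullet above requires, is to derive every randomized choice from a seeded RNG and cache the compiled app. This is a sketch under invented names (`compile_app`, the schema fields); the real compiler would live in `scenarios/compiler.py`:

```python
import functools
import json
import random

@functools.lru_cache(maxsize=None)
def compile_app(family: str, template: str, seed: int) -> str:
    # Expensive generation runs once per (family, template, seed);
    # later reset() calls hit the cache. Because every choice comes from
    # a seeded RNG, even a cold regeneration is byte-identical.
    rng = random.Random(f"{family}:{template}:{seed}")
    schema = {
        "table": rng.choice(["invoices", "orders", "tickets"]),
        "tenant_col": rng.choice(["org_id", "tenant_id"]),
    }
    return json.dumps(schema, sort_keys=True)

a = compile_app("idor", "fastapi_basic", 7)
compile_app.cache_clear()   # simulate a fresh process
b = compile_app("idor", "fastapi_basic", 7)
```

`a == b` holds even across the cache clear, which is what makes hidden-seed evaluation reproducible.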

## 6. What success looks like

A successful submission should show **measurable reward improvement** and better held-out security behavior after RL training.

### Minimum success criteria

- Environment runs through OpenEnv `reset`, `step`, and `state` APIs.
- Hosted on Hugging Face Spaces.
- Provides a minimal GRPO/TRL or Unsloth training script.
- Tracks training/eval metrics with Trackio or equivalent.
- Shows reward curves and before/after agent behavior.
- Uses deterministic reward as the primary reward source.
- Keeps hidden tests hidden from the agent.

### Target metrics

| Metric | MVP target |
|---|---:|
| Valid episode completion rate | β‰₯ 85% |
| Hidden authorization test pass rate | β‰₯ 65% after initial RL run |
| Regression preservation rate | β‰₯ 80% |
| Held-out scenario success lift vs base model | β‰₯ +15 percentage points |
| Reward-hacking incidents found in eval | 0 critical |
| Median patch size | ≀ 3 files changed |

## 7. Core design principle

The environment should reward **correct defensive repair**, not exploit creativity. The discovery stage exists only to help the agent gather enough local evidence to make a safe patch. The reward engine must never reward real-world misuse, data exfiltration, persistence, credential theft, or evasion behavior.

## 8. Deliverables for engineers

Initial implementation should produce:

```text
CyberSecurity_OWASP/
β”œβ”€β”€ 00_PROJECT_BRIEF.md
β”œβ”€β”€ 01_ARCHITECTURE.md
β”œβ”€β”€ README.md
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ cybersecurity_owasp/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ models.py
β”‚   β”œβ”€β”€ client.py
β”‚   β”œβ”€β”€ rewards.py
β”‚   β”œβ”€β”€ scenarios/
β”‚   β”‚   β”œβ”€β”€ compiler.py
β”‚   β”‚   β”œβ”€β”€ policy_graph.py
β”‚   β”‚   β”œβ”€β”€ templates/
β”‚   β”‚   └── seeds/
β”‚   β”œβ”€β”€ apps/
β”‚   β”‚   β”œβ”€β”€ fastapi_basic/
β”‚   β”‚   β”œβ”€β”€ express_basic/
β”‚   β”‚   └── django_basic/
β”‚   β”œβ”€β”€ evals/
β”‚   β”‚   β”œβ”€β”€ public_tests.py
β”‚   β”‚   β”œβ”€β”€ hidden_invariants.py
β”‚   β”‚   └── heldout_eval.py
β”‚   └── server/
β”‚       β”œβ”€β”€ environment.py
β”‚       β”œβ”€β”€ app.py
β”‚       β”œβ”€β”€ requirements.txt
β”‚       └── Dockerfile
β”œβ”€β”€ training/
β”‚   β”œβ”€β”€ train_grpo.py
β”‚   β”œβ”€β”€ rollout.py
β”‚   └── eval_before_after.py
└── outputs/
    β”œβ”€β”€ logs/
    β”œβ”€β”€ evals/
    └── reward_curves/
```

## 9. Source notes and credibility

| Source | How it informs this project | Credibility |
|---|---|---:|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms current relevance of Broken Access Control as a top web-app risk. | 10/10 |
| OWASP ASVS | Provides security-control requirements that can be translated into policy invariants and hidden tests. | 9.5/10 |
| OpenEnv build/deploy docs | Defines the required OpenEnv structure: models, server, client, Docker, HF Spaces deployment. | 8.5/10 |
| Hackathon judging criteria | Aligns deliverables with scoring: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv GRPO example | Shows a practical pattern for environment rollouts, reward functions, and Trackio logging. | 8/10 |