Humanlearning committed on
Commit
54079d2
·
1 Parent(s): 1b6d30b

Add mini-blog

Files changed (2)
  1. README.md +2 -0
  2. blog/blog.md +270 -0
README.md CHANGED
@@ -15,6 +15,8 @@ tags:
 
 # CyberSecurity_OWASP
 
+[Hugging Face Space](https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP) | [Mini-blog](blog/blog.md)
+
 `CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:
 
 ```text
blog/blog.md ADDED
@@ -0,0 +1,270 @@
# From Mythos to Mobile-Sized Defenders: Training Small Models to Repair OWASP Broken Access Control

## Motivation

Anthropic's Project Glasswing was the moment this project clicked for me.[1]

Glasswing is aimed at securing critical software with Claude Mythos Preview, a frontier model Anthropic describes as capable of finding high-severity vulnerabilities in major operating systems and browsers.[2] That is important defensive work, but it raises an uncomfortable question:

**What about everyone else?**

Large operating systems, browsers, banks, and cloud providers may get access to frontier cybersecurity models and expensive scanning pipelines. Smaller teams, solo developers, open-source maintainers, indie hackers, and "vibe coders" are also shipping real software. Their code handles invoices, accounts, uploads, profiles, subscriptions, internal dashboards, and customer data. They face the same class of vulnerabilities, but they do not have the same budget, security staff, or model access.

So I built **CyberSecurity_OWASP** around a different idea:

> If frontier models can scale vulnerability discovery, small RL-trained defenders should scale vulnerability prevention.

The goal is not another benchmark where an LLM answers security trivia. The goal is an OpenEnv environment where a small open model can learn an actual defensive workflow: inspect an application, understand the intended authorization policy, discover a broken access control bug, patch the code, and preserve legitimate behavior.

## Why OWASP A01?

The first target is **OWASP A01:2025 - Broken Access Control**.

OWASP ranks Broken Access Control as the number one web application security risk in the 2025 Top 10. Access control failures can let users act outside their intended permissions, including unauthorized disclosure, modification, or destruction of data.[3]

This is exactly the kind of bug that small teams often ship accidentally. A route works. The visible test passes. The developer checks authentication. But one missing ownership check lets Bob read Alice's invoice by changing an ID in the URL.

OWASP lists insecure direct object references, parameter tampering, missing API access controls, privilege escalation, and force browsing as common broken access control patterns. It also recommends enforcing access control in trusted server-side code, denying by default, and checking record ownership instead of allowing arbitrary object access.
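
Those recommendations boil down to one server-side guard per record access. A minimal sketch of a deny-by-default ownership check (the `User` and `Invoice` shapes here are invented for illustration, not taken from the environment's generated apps):

```python
from dataclasses import dataclass

@dataclass
class User:
    id: str
    tenant_id: str
    role: str  # e.g. "member" or "admin" (illustrative roles)

@dataclass
class Invoice:
    id: str
    owner_id: str
    tenant_id: str

def can_read_invoice(user: User, invoice: Invoice) -> bool:
    """Deny by default: only the record owner or an admin of the
    invoice's tenant may read it."""
    if user.id == invoice.owner_id:
        return True
    if user.role == "admin" and user.tenant_id == invoice.tenant_id:
        return True
    return False
```

The point of the deny-by-default shape is that forgetting a rule fails closed: an unhandled case returns `False` rather than leaking a record.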

That makes A01 a strong starting point for reinforcement learning:

- common enough to matter
- subtle enough that shallow unit tests miss it
- concrete enough for deterministic verification
- realistic enough to map to developer workflows
- expandable into many scenario families

## What CyberSecurity_OWASP Does

**CyberSecurity_OWASP** is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent performing a defensive authorization-repair task.

The episode loop is:

```text
inspect generated app + policy
-> discover authorization bug
-> submit diagnosis
-> patch code
-> preserve intended behavior
```

The current MVP focuses on generated FastAPI-style invoice applications with injected OWASP A01 BOLA/IDOR defects. The agent must inspect the app, compare identities, use safe local requests, diagnose the bug, patch the vulnerable route or service code, run visible checks, and submit a final fix.

This is not a static multiple-choice benchmark. It is an interactive environment with tools, state, hidden checks, and reward feedback.

The agent can use tools such as:

```text
inspect_policy_graph
list_routes
read_openapi
read_file
search_code
send_local_request
compare_identities
submit_diagnosis
patch_file
run_visible_tests
submit_fix
```

The tools are phase-gated. During discovery, the agent can inspect policy, routes, files, OpenAPI summaries, and safe local request behavior. During patching, it can edit allowed app files and run visible tests. After completion, it receives a stable terminal observation.
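
Phase gating can be pictured as an allow-list per phase. This is only a sketch of the idea; the phase names and the exact tool groupings are assumptions, not the environment's actual tables:

```python
# Illustrative phase -> allowed-tool mapping; the real environment's
# gating logic may differ.
PHASE_TOOLS = {
    "discovery": {
        "inspect_policy_graph", "list_routes", "read_openapi",
        "read_file", "search_code", "send_local_request",
        "compare_identities", "submit_diagnosis",
    },
    "patching": {
        "read_file", "search_code", "patch_file",
        "run_visible_tests", "submit_fix",
    },
    "done": set(),  # terminal: no further tool calls accepted
}

def is_allowed(phase: str, tool_name: str) -> bool:
    """Return True if the tool may be called in the given phase."""
    return tool_name in PHASE_TOOLS.get(phase, set())
```

A gate like this is what makes "patch before diagnosing" an invalid action rather than a shortcut.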

## The Task the Model Learns

A typical episode looks like this:

1. The environment generates an invoices app with users, tenants, invoices, routes, and a policy graph.
2. One authorization defect is injected.
3. The model sees partial information: product rules, route summaries, fixture aliases, visible test results, and tool outputs.
4. The model must infer the intended policy.
5. It must find a route where one user can access another user's resource.
6. It must submit a diagnosis before patching.
7. It must patch the application without breaking valid owner, admin, or public-route behavior.
8. It must pass visible tests and hidden verifier checks.

For example, the bug may be:

```text
GET /invoices/{invoice_id}
```

The route authenticates the user, loads the invoice by ID, and returns it. But it forgets to verify that the invoice belongs to the current user's tenant or that the current user is an admin. A shallow test may confirm that Alice can fetch Alice's invoice. The hidden exploit checks whether Bob can fetch Alice's invoice.
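
Stripped of framework plumbing, the vulnerable handler and its repair might look like the pair below. This is an invented illustration of the bug class, not the environment's generated code; the in-memory `DB`, field names, and `Forbidden` exception are all assumptions:

```python
# Illustrative sketch of a BOLA/IDOR defect and its fix.
DB = {"inv_alice_001": {"id": "inv_alice_001", "owner_id": "alice", "tenant_id": "t1"}}

class Forbidden(Exception):
    pass

def get_invoice_vulnerable(user: dict, invoice_id: str) -> dict:
    # Caller is authenticated and the record loads by ID --
    # but ownership is never checked, so any user can read any invoice.
    return DB[invoice_id]

def get_invoice_fixed(user: dict, invoice_id: str) -> dict:
    invoice = DB[invoice_id]
    # The missing check: owner, or admin of the invoice's tenant.
    is_owner = user["id"] == invoice["owner_id"]
    is_tenant_admin = (user["role"] == "admin"
                       and user["tenant_id"] == invoice["tenant_id"])
    if not (is_owner or is_tenant_admin):
        raise Forbidden(invoice_id)  # deny by default
    return invoice
```

Note that both versions pass the shallow "Alice fetches Alice's invoice" test; only the hidden Bob-fetches-Alice probe separates them.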

A useful model must not simply block everything. It has to preserve intended behavior:

```text
owner can read own invoice -> allowed
admin can read tenant invoice -> allowed
other user can read invoice -> denied
public status route still works -> allowed
```

That is the core reason this is useful for RL. The model is rewarded not for sounding secure, but for making the application secure while preserving product behavior.
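
The four-row contract above is naturally table-driven, which is roughly the shape a policy-oracle check could take. A sketch under assumed names (the `authorize` signature and the case encoding are mine, not the environment's):

```python
# Table-driven sketch of the behavior matrix; the hidden verifier's
# actual cases and API will differ.
def authorize(role: str, is_owner: bool, route_is_public: bool) -> bool:
    if route_is_public:
        return True
    return is_owner or role == "admin"

CASES = [
    ("owner reads own invoice",   ("member", True,  False), True),
    ("admin reads tenant invoice", ("admin",  False, False), True),
    ("other user reads invoice",   ("member", False, False), False),
    ("public status route",        ("anon",   False, True),  True),
]

def run_matrix(fn=authorize) -> dict:
    """Return {case name: did fn match the expected decision}."""
    return {name: fn(*args) == expected for name, args, expected in CASES}
```

A deny-all "fix" fails the first, second, and fourth rows, which is exactly why the matrix catches it while a single exploit test would not.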

## Reward Design

The reward is decomposed so the model cannot win by shortcutting:

```python
{
    "discovery": ...,
    "security": ...,
    "regression": ...,
    "public_routes": ...,
    "patch_quality": ...,
    "visible_tests": ...,
    "safety": ...,
    "anti_cheat": ...,
    "terminal_total": ...,
}
```

Training can also enable dense shaping signals:

```python
{
    "progressive": ...,
    "step_penalty": ...,
    "speed_bonus": ...,
    "token_penalty": ...,
    "behavior_penalty": ...,
    "train_total": ...,
}
```
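
One plausible reading of the two dictionaries: terminal components are combined into `terminal_total`, and training adds the dense shaping terms on top to produce `train_total`. The sketch below assumes a weighted sum; the weights and the actual aggregation rule are guesses, not the environment's published formula:

```python
def combine_rewards(terminal: dict, shaping: dict, weights=None) -> float:
    """Weighted sum of terminal components plus dense shaping terms.
    Illustrative only: the environment's real aggregation may differ."""
    weights = weights or {}
    terminal_total = sum(weights.get(k, 1.0) * v for k, v in terminal.items())
    # Penalty-style shaping terms are assumed to be negative already.
    return terminal_total + sum(shaping.values())
```

Decomposition matters for debugging training runs: when `train_total` drops, the per-component values show whether the model lost reward on security, regressions, or behavior penalties.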

The verifier checks multiple layers:

```text
visible tests
+ hidden authorization exploit tests
+ policy-oracle matrix
+ regression checks
+ public-route preservation
+ patch-quality checks
+ anti-cheat rules
```

The model is penalized for patterns like:

```text
deny-all fixes
hardcoded user IDs
hardcoded invoice IDs
test or fixture tampering
probing hidden files
external URL attempts
invalid or repeated action loops
breaking public routes
```

This matters because security environments are especially vulnerable to reward hacking. A model that "fixes" IDOR by returning 403 for every endpoint is not a security agent; it is a product outage generator. CyberSecurity_OWASP rewards the harder behavior: block the exploit while preserving the intended application contract.

## Why This Is an Environment

CyberSecurity_OWASP is an environment because the model must interact with a partially observable world. It does not receive the answer. It has to gather evidence, run safe local probes, understand policy, make edits, and submit a final fix.

It is long-horizon because success requires a sequence of correct steps. Reading the wrong file, skipping diagnosis, patching the wrong layer, or breaking public routes can reduce reward.

It is self-improving because the curriculum controller can select difficulty tiers and target weak spots. Scenario generation is cache-backed, configurable, and extensible. The environment can create more authorization tasks as the model improves.

It is measurable because every episode ends with deterministic verification. The model either blocked the exploit and preserved behavior, or it did not.

## Training Approach

The target policy model is **Gemma 4 E2B Instruct** through Unsloth.

The model choice is intentional: small, instruction-tuned, code-capable models are closer to the cost and latency profile needed for local developer workflows. Google describes Gemma 4 as an open model family designed for reasoning, agentic workflows, function calling, structured JSON output, code generation, and efficient deployment across hardware sizes.[4] Unsloth supports fine-tuning Gemma 4 E2B, including text and RL workflows.[5]

The training scaffold has two stages.

### 1. Synthetic SFT warm start

A teacher model executes real environment trajectories. Only trajectories that pass the deterministic verifier are kept. This creates supervised data where each row teaches the model a valid step in a successful security-repair workflow.

Example actions include:

```json
{"tool_name": "inspect_policy_graph"}
{"tool_name": "read_file", "arguments": {"path": "app/routes/invoices.py"}}
{"tool_name": "send_local_request", "arguments": {"method": "GET", "path": "/invoices/inv_alice_001"}}
{"tool_name": "submit_diagnosis", "arguments": {"bug_type": "broken_access_control"}}
{"tool_name": "patch_file", "arguments": {"path": "app/routes/invoices.py", "diff": "..."}}
{"tool_name": "submit_fix"}
```
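
The filter-then-flatten step behind the warm start can be sketched as follows. Field names like `verifier_passed`, `observation`, and `action` are invented for illustration; the real pipeline's schema may differ:

```python
def trajectories_to_sft_rows(trajectories):
    """Keep only verifier-passing episodes and emit one supervised row
    per (observation, action) step. Illustrative schema only."""
    rows = []
    for traj in trajectories:
        if not traj.get("verifier_passed"):
            continue  # discard anything the deterministic verifier rejected
        for step in traj["steps"]:
            rows.append({"prompt": step["observation"],
                         "completion": step["action"]})
    return rows
```

Because the verifier is deterministic, this filter is the only quality gate needed: no human labeling, no preference model, just kept-or-discarded episodes.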

### 2. GRPO reinforcement learning

After SFT, GRPO trains the model against the live OpenEnv environment. The model receives reward from the verifier, not from a preference label. This lets it optimize for real task success: discovering the bug, repairing it, and preserving behavior.
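
The core idea in GRPO is group-relative advantage: sample several rollouts of the same scenario, then score each rollout's verifier reward against the group's mean and standard deviation instead of a learned value function. A minimal sketch of that normalization step (not Unsloth's implementation):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's statistics.
    rewards: verifier rewards for one group of rollouts of one scenario."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Rollouts that block the exploit while preserving behavior get positive advantage; deny-all or broken patches in the same group get negative advantage, so the contrast the verifier draws is exactly what the policy gradient sees.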

Runs are logged through Trackio. Modal launchers support cache preparation, smoke tests, SFT, and GRPO. The environment keeps scenario generation separate from training so GPU jobs do not waste time compiling scenarios during rollout.

## Evaluation Focus

The most important metric is not whether the model can say "this is IDOR." The important metric is whether it can produce a patch that passes hidden authorization checks while keeping legitimate application behavior intact.

The evaluation suite tracks:

- average terminal reward
- exploit-block rate
- regression-preservation rate
- public-route preservation rate
- invalid-action rate
- anti-cheat pass rate
- full success rate
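
Each of these rates falls out of per-episode verifier records. A sketch of the aggregation, with invented field names standing in for whatever `outputs/evals/` actually stores:

```python
def summarize(episodes):
    """Aggregate per-episode verifier booleans into eval-suite rates.
    Field names are illustrative, not the environment's real schema."""
    n = len(episodes)

    def rate(key):
        return sum(1 for e in episodes if e[key]) / n

    return {
        "avg_terminal_reward": sum(e["terminal_total"] for e in episodes) / n,
        "exploit_block_rate": rate("exploit_blocked"),
        "regression_preservation_rate": rate("regressions_pass"),
        # Full success demands both: exploit blocked AND behavior preserved.
        "full_success_rate": sum(
            1 for e in episodes
            if e["exploit_blocked"] and e["regressions_pass"]
        ) / n,
    }
```

Reporting exploit-block and regression-preservation separately is what exposes deny-all policies: they score high on the first rate and collapse on the second.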

This post should be paired with the latest Trackio dashboard and `outputs/evals/` summaries for concrete before-and-after numbers. This write-up intentionally avoids claiming training results that are not present in the repository.

## What Makes This Useful

Most developer security tooling is reactive. A scanner runs after code is written. A bug bounty report arrives after deployment. A security review happens late, if it happens at all.

The long-term direction here is proactive protection.

Imagine a small local model that runs inside a developer workflow:

```text
before commit
before deploy
inside CI
inside an IDE
inside a mobile or offline coding assistant
```

It reads the policy, inspects changed routes, tries safe local probes, identifies authorization gaps, and proposes patches. Because the model is small, it can be cheap enough to run repeatedly. Because it is RL-trained in an environment with hidden policy-oracle checks, it can learn behavior beyond static pattern matching.

The point is not to replace professional security review. The point is to make a baseline level of defensive reasoning available to teams that currently have almost none.

## Responsible Scope

CyberSecurity_OWASP is defensive by construction.

The environment uses synthetic generated applications and safe local requests. Hidden tests, exploit labels, oracle tuples, and scenario-family labels are not exposed to the agent. External URL attempts are penalized. The model is trained to diagnose and patch authorization bugs in a sandbox, not to attack real systems.

That boundary is important. The same AI progress that makes automated vulnerability discovery powerful also makes safe training environments more urgent.

## Future Work

The current MVP focuses on OWASP A01 Broken Access Control, especially BOLA/IDOR-style defects. The same framework can expand to additional OWASP risk families and richer application shapes:

- more app domains
- more policy shapes
- multi-route authorization chains
- schema drift
- stronger curriculum adaptation
- realistic CI-style patch review
- larger held-out scenario families

The bigger vision is a family of small, specialized security agents that can run close to where software is written.

Project Glasswing shows what frontier models may do for the world's most critical software. CyberSecurity_OWASP asks a complementary question:

**Can we train small open models to protect the long tail of everyday software?**

This submission is a first step toward that answer.

[1]: https://www.anthropic.com/glasswing "Project Glasswing: Securing critical software for the AI era | Anthropic"
[2]: https://red.anthropic.com/2026/mythos-preview/ "Claude Mythos Preview | red.anthropic.com"
[3]: https://owasp.org/Top10/2025/A01_2025-Broken_Access_Control/ "A01 Broken Access Control - OWASP Top 10:2025"
[4]: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/ "Gemma 4: Our most capable open models to date"
[5]: https://unsloth.ai/docs/models/gemma-4/train "Gemma 4 Fine-tuning Guide | Unsloth Documentation"