# From Mythos to Mobile-Sized Defenders: Training Small Models to Repair the Top OWASP 2025 Vulnerability

## Motivation

Anthropic's Project Glasswing was the moment this project clicked for me.[1]

Glasswing is aimed at securing critical software with Claude Mythos Preview, a frontier model Anthropic describes as capable of finding high-severity vulnerabilities in major operating systems and browsers.[2] That is important defensive work, but it raises an uncomfortable question:

**What about everyone else?**

Large operating systems, browsers, banks, and cloud providers may get access to frontier cybersecurity models and expensive scanning pipelines. Smaller teams, solo developers, open-source maintainers, indie hackers, and "vibe coders" are also shipping real software. Their code handles invoices, accounts, uploads, profiles, subscriptions, internal dashboards, and customer data. They face the same class of vulnerabilities, but they do not have the same budget, security staff, or model access.

So I built **CyberSecurity_OWASP** around that idea:



> If frontier models can scale vulnerability discovery, small RL-trained defenders should scale **vulnerability prevention**.



The goal is an OpenEnv environment where a **small open model** (in this case, **Gemma 4 E2B**) can learn an actual defensive workflow: inspect an application, understand the intended authorization policy, discover a broken access control bug, patch the code, and preserve legitimate behavior.



## Ablations: Where the Reward Started Working



I ran a lot of short ablations, and the full trail is in the [CyberSecurity_OWASP Trackio dashboard](https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP-trackio).



![CyberSecurity_OWASP reward ablations](../assets/trackio_reward_ablations.png)



The first plain-RL rubric looked promising on paper, but the agent learned a cheap loop: inspect the policy, collect shaping reward, and stall. Tightening the rubric helped, but reward still climbed too slowly.



The useful jump came from changing the recipe: first teach the model successful trajectories with SFT, then run GRPO on that LoRA. With the updated rubric, the SFT-warm-started agent spends less time gaming the interface and more time doing the real job: find the auth bug, patch it, and keep valid behavior alive.



The plot is backed by repeatable scripts, not a one-off notebook. `scripts/modal_train_sft.py` trains the warm-start LoRA on verified trajectories, `scripts/modal_train_grpo.py` continues from that adapter with live OpenEnv rewards, and `scripts/launch_reward_ablations.ps1` launches comparable rubric trials using the YAML configs in `training/configs/reward_ablations/`.



The core handoff is intentionally simple:



```bash
uv run --extra modal modal run --detach scripts/modal_train_sft.py --push-to-hub --detach

uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
  --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
  --max-steps 300 --difficulty 0 --trace-log-every 10 --detach
```



## Why OWASP A01?



The first target is **OWASP A01:2025 - Broken Access Control**.



OWASP ranks Broken Access Control as the number one web application security risk in the 2025 Top 10. Access control failures can let users act outside their intended permissions, including unauthorized disclosure, modification, or destruction of data.[3]



This is exactly the kind of bug that small teams often ship accidentally. A route works. The visible test passes. The developer checks authentication. But one missing ownership check lets Bob read Alice's invoice by changing an ID in the URL.



OWASP lists insecure direct object references, parameter tampering, missing API access controls, privilege escalation, and force browsing as common broken access control patterns. It also recommends enforcing access control in trusted server-side code, denying by default, and checking record ownership instead of allowing arbitrary object access.



That makes A01 a strong starting point for reinforcement learning:



- common enough to matter

- subtle enough that shallow unit tests miss it

- concrete enough for deterministic verification

- realistic enough to map to developer workflows

- expandable into many scenario families



## What CyberSecurity_OWASP Does



**CyberSecurity_OWASP** is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent performing a defensive authorization-repair task.



The episode loop is:



```text
inspect generated app + policy
-> discover authorization bug
-> submit diagnosis
-> patch code
-> preserve intended behavior
```



The current MVP focuses on generated FastAPI-style invoice applications with injected OWASP A01 BOLA/IDOR defects. The agent must inspect the app, compare identities, use safe local requests, diagnose the bug, patch the vulnerable route or service code, run visible checks, and submit a final fix.





## Architecture and Training Flow



[Architecture diagram](../assets/architecture_diagram.svg) | [RL training flow diagram](../assets/env_rl_training_flow_diagram.svg)



![CyberSecurity_OWASP architecture](../assets/architecture_diagram.svg)



![CyberSecurity_OWASP RL training flow](../assets/env_rl_training_flow_diagram.svg)



The agent can use tools such as:



```text
inspect_policy_graph

list_routes
read_openapi

read_file
search_code

send_local_request

compare_identities
submit_diagnosis

patch_file
run_visible_tests
submit_fix
```



The tools are phase-gated. During discovery, the agent can inspect policy, routes, files, OpenAPI summaries, and safe local request behavior. During patching, it can edit allowed app files and run visible tests. After completion, it receives a stable terminal observation.
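
As a rough illustration of phase gating, the sketch below maps each phase to a set of allowed tools. The `Phase` enum, the `ALLOWED_TOOLS` table, and both helper functions are hypothetical names, and the exact split of tools per phase is an assumption based on the description above rather than the environment's actual code.

```python
from enum import Enum


class Phase(Enum):
    DISCOVERY = "discovery"
    PATCHING = "patching"
    DONE = "done"


# Hypothetical phase-to-tool mapping. The tool names come from the list above;
# which phase each tool belongs to is assumed for illustration.
ALLOWED_TOOLS = {
    Phase.DISCOVERY: {
        "inspect_policy_graph", "list_routes", "read_openapi", "read_file",
        "search_code", "send_local_request", "compare_identities",
        "submit_diagnosis",
    },
    Phase.PATCHING: {"patch_file", "run_visible_tests", "submit_fix"},
    Phase.DONE: set(),
}


def is_allowed(phase: Phase, tool_name: str) -> bool:
    """Return True if the tool may be called in the given phase."""
    return tool_name in ALLOWED_TOOLS[phase]


def next_phase(phase: Phase, tool_name: str) -> Phase:
    """Advance on the two transition tools; otherwise stay in the same phase."""
    if phase is Phase.DISCOVERY and tool_name == "submit_diagnosis":
        return Phase.PATCHING
    if phase is Phase.PATCHING and tool_name == "submit_fix":
        return Phase.DONE
    return phase
```

Under this reading, a `patch_file` call during discovery would be rejected as an invalid action, and nothing is callable once the episode is done.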



## The Task the Model Learns



A typical episode looks like this:



1. The environment generates an invoices app with users, tenants, invoices, routes, and a policy graph.

2. One authorization defect is injected.

3. The model sees partial information: product rules, route summaries, fixture aliases, visible test results, and tool outputs.

4. The model must infer the intended policy.

5. It must find a route where one user can access another user's resource.

6. It must submit a diagnosis before patching.

7. It must patch the application without breaking valid owner, admin, or public-route behavior.

8. It must pass visible tests and hidden verifier checks.



For example, the bug may be:



```text
GET /invoices/{invoice_id}
```



The route authenticates the user, loads the invoice by ID, and returns it. But it forgets to verify that the invoice belongs to the current user's tenant or that the current user is an admin. A shallow test may confirm that Alice can fetch Alice's invoice. The hidden exploit checks whether Bob can fetch Alice's invoice.
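
To make the defect concrete, here is a minimal, framework-free sketch of the vulnerable handler next to one possible fix. The `User`, `Invoice`, and `Forbidden` types, the in-memory `INVOICES` store, and the exact rule (owner, or admin of the same tenant) are illustrative assumptions, not the generated app's real code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    tenant_id: str
    is_admin: bool = False


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    owner_id: str
    tenant_id: str


class Forbidden(Exception):
    """Raised when the caller may not access the resource."""


# Hypothetical fixture store standing in for the app's database.
INVOICES = {"inv_alice_001": Invoice("inv_alice_001", "alice", "tenant_a")}


def get_invoice_vulnerable(current_user: User, invoice_id: str) -> Invoice:
    # The caller is authenticated, but nothing ties the invoice to them:
    # this is the BOLA/IDOR pattern described above.
    return INVOICES[invoice_id]


def get_invoice_fixed(current_user: User, invoice_id: str) -> Invoice:
    invoice = INVOICES[invoice_id]
    is_owner = invoice.owner_id == current_user.user_id
    is_tenant_admin = (
        current_user.is_admin and invoice.tenant_id == current_user.tenant_id
    )
    if not (is_owner or is_tenant_admin):
        raise Forbidden(invoice_id)
    return invoice
```

With these fixtures, Bob can read `inv_alice_001` through the vulnerable handler, while the fixed handler raises `Forbidden` for Bob and still serves Alice and her tenant's admin.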



A useful model must not simply block everything. It has to preserve intended behavior:



```text
owner can read own invoice       -> allowed
admin can read tenant invoice    -> allowed
other user can read invoice      -> denied
public status route still works  -> allowed
```

That is the core reason this is useful for RL. The model is rewarded not for sounding secure, but for making the application secure while preserving product behavior.

## Reward Design

The reward is decomposed so the model cannot win by shortcutting:

```python
{
    "discovery": ...,
    "security": ...,
    "regression": ...,
    "public_routes": ...,
    "patch_quality": ...,
    "visible_tests": ...,
    "safety": ...,
    "anti_cheat": ...,
    "terminal_total": ...,
}
```

Training can also enable dense shaping signals:

```python
{
    "progressive": ...,
    "step_penalty": ...,
    "speed_bonus": ...,
    "token_penalty": ...,
    "behavior_penalty": ...,
    "train_total": ...,
}
```
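
One way the terminal components could be folded into `terminal_total` is a weighted sum with penalties subtracted. The weights below are invented for illustration (the actual values are not stated here), and treating `safety` and `anti_cheat` as non-negative penalties is also an assumption.

```python
# Hypothetical weights; the real rubric's values are not given in this post.
TERMINAL_WEIGHTS = {
    "discovery": 0.15,
    "security": 0.35,
    "regression": 0.20,
    "public_routes": 0.10,
    "patch_quality": 0.10,
    "visible_tests": 0.10,
}


def terminal_total(components: dict[str, float]) -> float:
    """Weighted sum of positive components minus safety/anti-cheat penalties."""
    base = sum(
        weight * components.get(name, 0.0)
        for name, weight in TERMINAL_WEIGHTS.items()
    )
    # Assumed convention: these two keys hold penalties >= 0.
    penalty = components.get("safety", 0.0) + components.get("anti_cheat", 0.0)
    return base - penalty
```

Under these assumed weights, a deny-all patch that blocks the exploit but fails regression, public-route, and anti-cheat checks scores well below a clean fix, which is the property the decomposition is meant to give.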

The verifier checks multiple layers:

```text
visible tests
+ hidden authorization exploit tests
+ policy-oracle matrix
+ regression checks
+ public-route preservation
+ patch-quality checks
+ anti-cheat rules
```
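
Of the layers above, the policy-oracle matrix is the easiest to sketch: a table of (actor, resource owner, expected status) rows replayed against the patched handler. The `ORACLE` rows, `check_matrix`, and the three toy handlers are hypothetical stand-ins, not the verifier's real data or API.

```python
from typing import Callable

# Hypothetical oracle rows: (actor, resource owner, expected HTTP status).
ORACLE = [
    ("alice", "alice", 200),  # owner reads own invoice
    ("admin", "alice", 200),  # admin reads tenant invoice
    ("bob", "alice", 403),    # other user is denied
]


def check_matrix(
    handler: Callable[[str, str], int],
    oracle: list[tuple[str, str, int]] = ORACLE,
) -> list[tuple[str, str, int, int]]:
    """Return every row where the handler's status differs from the oracle."""
    mismatches = []
    for actor, owner, expected in oracle:
        actual = handler(actor, owner)
        if actual != expected:
            mismatches.append((actor, owner, expected, actual))
    return mismatches


def vulnerable(actor: str, owner: str) -> int:
    return 200  # no ownership check at all


def deny_all(actor: str, owner: str) -> int:
    return 403  # blocks the exploit but breaks legitimate access


def patched(actor: str, owner: str) -> int:
    return 200 if actor in (owner, "admin") else 403
```

In this toy setup only `patched` produces an empty mismatch list; `vulnerable` fails the Bob row and `deny_all` fails both allowed rows.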

The model is penalized for patterns like:

```text
deny-all fixes
hardcoded user IDs
hardcoded invoice IDs
test or fixture tampering
probing hidden files
external URL attempts
invalid or repeated action loops
breaking public routes
```

This matters because security environments are especially vulnerable to reward hacking. A model that "fixes" IDOR by returning 403 for every endpoint is not a security agent; it is a product outage generator. CyberSecurity_OWASP rewards the harder behavior: block the exploit while preserving the intended application contract.



## Why This Is an Environment



CyberSecurity_OWASP is an environment because the model must interact with a partially observable world. It does not receive the answer. It has to gather evidence, run safe local probes, understand policy, make edits, and submit a final fix.

It is long-horizon because success requires a sequence of correct steps. Reading the wrong file, skipping diagnosis, patching the wrong layer, or breaking public routes can reduce reward.

It is self-improving because the curriculum controller can select difficulty tiers and target weak spots. Scenario generation is cache-backed, configurable, and extensible. The environment can create more authorization tasks as the model improves.

It is measurable because every episode ends with deterministic verification. The model either blocked the exploit and preserved behavior, or it did not.

## Training Approach

The target policy model is **Gemma 4 E2B Instruct**, fine-tuned through Unsloth.

The model choice is intentional: small, instruction-tuned, code-capable models are closer to the cost and latency profile needed for local developer workflows. Google describes Gemma 4 as an open model family designed for reasoning, agentic workflows, function calling, structured JSON output, code generation, and efficient deployment across hardware sizes.[4] Unsloth supports fine-tuning Gemma 4 E2B, including text and RL workflows.[5]

The training scaffold has two stages.

### 1. Synthetic SFT warm start

A teacher model executes real environment trajectories. Only trajectories that pass the deterministic verifier are kept. This creates supervised data where each row teaches the model a valid step in a successful security-repair workflow.

Example actions include:

```json
{"tool_name": "inspect_policy_graph"}
{"tool_name": "read_file", "arguments": {"path": "app/routes/invoices.py"}}
{"tool_name": "send_local_request", "arguments": {"method": "GET", "path": "/invoices/inv_alice_001"}}
{"tool_name": "submit_diagnosis", "arguments": {"bug_type": "broken_access_control"}}
{"tool_name": "patch_file", "arguments": {"path": "app/routes/invoices.py", "diff": "..."}}
{"tool_name": "submit_fix"}
```
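
The verifier-gated filtering described in this section could look roughly like the sketch below. The trajectory layout (`verified`, `steps`, `action`, `observation`) and the `prompt`/`completion` row format are assumptions for illustration; the real dataset schema may differ.

```python
import json
from typing import Iterable


def build_sft_rows(trajectories: Iterable[dict]) -> list[dict]:
    """Keep verified trajectories only and emit one training row per step."""
    rows = []
    for traj in trajectories:
        if not traj.get("verified", False):
            continue  # drop anything the deterministic verifier did not pass
        history: list[dict] = []
        for step in traj["steps"]:
            rows.append({
                # Prompt is the serialized history seen before this action.
                "prompt": json.dumps(history),
                # Completion is the action the teacher actually took.
                "completion": json.dumps(step["action"]),
            })
            history.append(step)
    return rows
```

Each kept row pairs the history so far with the teacher's next action, so a single successful episode yields as many training examples as it has steps.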

### 2. GRPO reinforcement learning

After SFT, GRPO trains the model against the live OpenEnv environment. The model receives reward from the verifier, not from a preference label. This lets it optimize for real task success: discovering the bug, repairing it, and preserving behavior.
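
For readers new to GRPO, its central idea is that each rollout's reward is normalized against the other rollouts sampled for the same prompt, so no learned value model is needed. The helper below is a minimal standalone sketch of that normalization, not the training script's code; using the population standard deviation and the `eps` value are choices made here for simplicity.

```python
from statistics import mean, pstdev


def group_relative_advantages(
    rewards: list[float], eps: float = 1e-6
) -> list[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one; a group where every rollout scores the same contributes no learning signal.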

Runs are logged through Trackio. Modal launchers support cache preparation, smoke tests, SFT, and GRPO. The environment keeps scenario generation separate from training so GPU jobs do not waste time compiling scenarios during rollout.

## Evaluation Focus

The most important metric is not whether the model can say "this is IDOR." The important metric is whether it can produce a patch that passes hidden authorization checks while keeping legitimate application behavior intact.

The evaluation suite tracks:

- average terminal reward
- exploit-block rate
- regression-preservation rate
- public-route preservation rate
- invalid-action rate
- anti-cheat pass rate
- full success rate
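
A sketch of how these metrics could be aggregated from per-episode results follows. The episode field names and the definition of full success (exploit blocked, regressions and public routes preserved, anti-cheat passed) are assumptions for illustration, not the evaluation suite's actual schema.

```python
def summarize(episodes: list[dict]) -> dict[str, float]:
    """Aggregate per-episode verifier results into suite-level metrics."""
    n = len(episodes)
    if n == 0:
        return {}

    def rate(key: str) -> float:
        return sum(1 for ep in episodes if ep[key]) / n

    total_actions = sum(ep["num_actions"] for ep in episodes)
    invalid_actions = sum(ep["num_invalid_actions"] for ep in episodes)
    # Assumed definition: an episode fully succeeds only if all four hold.
    success_keys = (
        "exploit_blocked", "regressions_preserved",
        "public_routes_preserved", "anti_cheat_passed",
    )
    return {
        "avg_terminal_reward": sum(ep["terminal_total"] for ep in episodes) / n,
        "exploit_block_rate": rate("exploit_blocked"),
        "regression_preservation_rate": rate("regressions_preserved"),
        "public_route_preservation_rate": rate("public_routes_preserved"),
        "invalid_action_rate": (
            invalid_actions / total_actions if total_actions else 0.0
        ),
        "anti_cheat_pass_rate": rate("anti_cheat_passed"),
        "full_success_rate": sum(
            1 for ep in episodes if all(ep[k] for k in success_keys)
        ) / n,
    }
```

Reporting the exploit-block rate next to the full success rate makes deny-all behavior visible: a model can score high on the first while scoring near zero on the second.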

For concrete before-and-after numbers, pair this post with the latest Trackio dashboard and the `outputs/evals/` summaries. This write-up intentionally avoids claiming training results that are not present in the repository.

## What Makes This Useful

Most developer security tooling is reactive. A scanner runs after code is written. A bug bounty report arrives after deployment. A security review happens late, if it happens at all.

The long-term direction here is proactive protection.

Imagine a small local model that runs inside a developer workflow:

```text
before commit
before deploy
inside CI
inside an IDE
inside a mobile or offline coding assistant
```

It reads the policy, inspects changed routes, tries safe local probes, identifies authorization gaps, and proposes patches. Because the model is small, it can be cheap enough to run repeatedly. Because it is RL-trained in an environment with hidden policy-oracle checks, it can learn behavior beyond static pattern matching.

The point is not to replace professional security review. The point is to make a baseline level of defensive reasoning available to teams that currently have almost none.

## Responsible Scope

CyberSecurity_OWASP is defensive by construction.



The environment uses synthetic generated applications and safe local requests. Hidden tests, exploit labels, oracle tuples, and scenario-family labels are not exposed to the agent. External URL attempts are penalized. The model is trained to diagnose and patch authorization bugs in a sandbox, not to attack real systems.



That boundary is important. The same AI progress that makes automated vulnerability discovery powerful also makes safe training environments more urgent.



## Future Work



The current MVP focuses on OWASP A01 Broken Access Control, especially BOLA/IDOR-style defects. The same framework can expand to additional OWASP risk families and richer application shapes:



- more app domains

- more policy shapes

- multi-route authorization chains

- schema drift

- stronger curriculum adaptation

- realistic CI-style patch review

- larger held-out scenario families



The bigger vision is a family of small, specialized security agents that can run close to where software is written.



Project Glasswing shows what frontier models may do for the world's most critical software. CyberSecurity_OWASP asks a complementary question:

**Can we train small open models to protect the long tail of everyday software?**

This submission is a first step toward that answer.

[1]: https://www.anthropic.com/glasswing "Project Glasswing: Securing critical software for the AI era | Anthropic"
[2]: https://red.anthropic.com/2026/mythos-preview/ "Claude Mythos Preview | red.anthropic.com"
[3]: https://owasp.org/Top10/2025/A01_2025-Broken_Access_Control/ "A01 Broken Access Control - OWASP Top 10:2025"
[4]: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/ "Gemma 4: Our most capable open models to date"
[5]: https://unsloth.ai/docs/models/gemma-4/train "Gemma 4 Fine-tuning Guide | Unsloth Documentation"