Humanlearning committed · Commit 401c7b1 (verified) · Parent: d531538

Sync updated mini-blog

Files changed: blog/blog.md (+298 -277)

# From Mythos to Mobile-Sized Defenders: Training Small Models to Repair the OWASP 2025 Top Vulnerability

## Motivation

Anthropic's Project Glasswing was the moment this project clicked for me.[1]

Glasswing is aimed at securing critical software with Claude Mythos Preview, a frontier model Anthropic describes as capable of finding high-severity vulnerabilities in major operating systems and browsers.[2] That is important defensive work, but it raises an uncomfortable question:

**What about everyone else?**

Large operating systems, browsers, banks, and cloud providers may get access to frontier cybersecurity models and expensive scanning pipelines. Smaller teams, solo developers, open-source maintainers, indie hackers, and "vibe coders" are also shipping real software. Their code handles invoices, accounts, uploads, profiles, subscriptions, internal dashboards, and customer data. They face the same class of vulnerabilities, but they do not have the same budget, security staff, or model access.

So I built **CyberSecurity_OWASP** around that idea:

> If frontier models can scale vulnerability discovery, small RL-trained defenders should scale **vulnerability prevention**.

The goal is an OpenEnv environment where a **small open model** (in this case **Gemma 4 E2B**) can learn an actual defensive workflow: inspect an application, understand the intended authorization policy, discover a broken access control bug, patch the code, and preserve legitimate behavior.

## Ablations: Where the Reward Started Working

I ran a lot of short ablations, and the full trail is in the [CyberSecurity_OWASP Trackio dashboard](https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP-trackio).

![CyberSecurity_OWASP reward ablations](../assets/trackio_reward_ablations.png)

The first plain-RL rubric looked promising on paper, but the agent learned a cheap loop: inspect the policy, collect shaping reward, and stall. Tightening the rubric helped, but reward still climbed too slowly.

The useful jump came from changing the recipe: first teach the model successful trajectories with SFT, then run GRPO on that LoRA. With the updated rubric, the SFT-warm-started agent spends less time gaming the interface and more time doing the real job: find the auth bug, patch it, and keep valid behavior alive.

The plot is backed by repeatable scripts, not a one-off notebook. `scripts/modal_train_sft.py` trains the warm-start LoRA on verified trajectories, `scripts/modal_train_grpo.py` continues from that adapter with live OpenEnv rewards, and `scripts/launch_reward_ablations.ps1` launches comparable rubric trials using the YAML configs in `training/configs/reward_ablations/`.

The core handoff is intentionally simple:

```bash
uv run --extra modal modal run --detach scripts/modal_train_sft.py --push-to-hub --detach
uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
  --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
  --max-steps 300 --difficulty 0 --trace-log-every 10 --detach
```

## Why OWASP A01?

The first target is **OWASP A01:2025 - Broken Access Control**.

OWASP ranks Broken Access Control as the number one web application security risk in the 2025 Top 10. Access control failures can let users act outside their intended permissions, including unauthorized disclosure, modification, or destruction of data.[3]

This is exactly the kind of bug that small teams often ship accidentally. A route works. The visible test passes. The developer checks authentication. But one missing ownership check lets Bob read Alice's invoice by changing an ID in the URL.

OWASP lists insecure direct object references, parameter tampering, missing API access controls, privilege escalation, and forced browsing as common broken access control patterns. It also recommends enforcing access control in trusted server-side code, denying by default, and checking record ownership instead of allowing arbitrary object access.

That makes A01 a strong starting point for reinforcement learning:

- common enough to matter
- subtle enough that shallow unit tests miss it
- concrete enough for deterministic verification
- realistic enough to map to developer workflows
- expandable into many scenario families

## What CyberSecurity_OWASP Does

**CyberSecurity_OWASP** is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent performing a defensive authorization-repair task.

The episode loop is:

```text
inspect generated app + policy
-> discover authorization bug
-> submit diagnosis
-> patch code
-> preserve intended behavior
```
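
In driver form, that loop is just an agent stepping an environment until the episode terminates. Below is a minimal sketch of one rollout, assuming a Gym-style OpenEnv client; the result fields and the `policy.act` interface are illustrative assumptions, not the environment's actual API.

```python
# Illustrative rollout loop. The result fields (observation, reward, done)
# and the policy interface are assumptions for this sketch, not the real API.
def run_episode(env, policy) -> float:
    """One defensive-repair episode: discover -> diagnose -> patch -> submit."""
    result = env.reset()                 # fresh generated app + injected A01 defect
    obs, done, total_reward = result.observation, False, 0.0
    while not done:
        action = policy.act(obs)         # model emits a JSON tool call
        result = env.step(action)        # e.g. {"tool_name": "read_file", ...}
        obs = result.observation
        total_reward += result.reward or 0.0
        done = result.done
    return total_reward
```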

The current MVP focuses on generated FastAPI-style invoice applications with injected OWASP A01 BOLA/IDOR defects. The agent must inspect the app, compare identities, use safe local requests, diagnose the bug, patch the vulnerable route or service code, run visible checks, and submit a final fix.

## Architecture and Training Flow

[Architecture diagram](../assets/architecture_diagram.svg) | [RL training flow diagram](../assets/env_rl_training_flow_diagram.svg)

![CyberSecurity_OWASP architecture](../assets/architecture_diagram.svg)

![CyberSecurity_OWASP RL training flow](../assets/env_rl_training_flow_diagram.svg)

The agent can use tools such as:

```text
inspect_policy_graph
list_routes
read_openapi
read_file
search_code
send_local_request
compare_identities
submit_diagnosis
patch_file
run_visible_tests
submit_fix
```

The tools are phase-gated. During discovery, the agent can inspect policy, routes, files, OpenAPI summaries, and safe local request behavior. During patching, it can edit allowed app files and run visible tests. After completion, it receives a stable terminal observation.
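
A minimal sketch of what that gating could look like; the phase names and allow-lists mirror the tool list above, but the environment's actual gating logic may differ.

```python
# Assumed shape of the phase gate, not the environment's actual source.
DISCOVERY_TOOLS = {
    "inspect_policy_graph", "list_routes", "read_openapi", "read_file",
    "search_code", "send_local_request", "compare_identities", "submit_diagnosis",
}
PATCHING_TOOLS = {"read_file", "search_code", "patch_file", "run_visible_tests", "submit_fix"}
ALLOWED_BY_PHASE = {"discovery": DISCOVERY_TOOLS, "patching": PATCHING_TOOLS, "done": set()}

def is_allowed(phase: str, tool_name: str) -> bool:
    """Out-of-phase tool calls are rejected (and penalized) instead of executed."""
    return tool_name in ALLOWED_BY_PHASE.get(phase, set())
```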

## The Task the Model Learns

A typical episode looks like this:

1. The environment generates an invoices app with users, tenants, invoices, routes, and a policy graph.
2. One authorization defect is injected.
3. The model sees partial information: product rules, route summaries, fixture aliases, visible test results, and tool outputs.
4. The model must infer the intended policy.
5. It must find a route where one user can access another user's resource.
6. It must submit a diagnosis before patching.
7. It must patch the application without breaking valid owner, admin, or public-route behavior.
8. It must pass visible tests and hidden verifier checks.

For example, the bug may be:

```text
GET /invoices/{invoice_id}
```

The route authenticates the user, loads the invoice by ID, and returns it. But it forgets to verify that the invoice belongs to the current user's tenant or that the current user is an admin. A shallow test may confirm that Alice can fetch Alice's invoice. The hidden exploit checks whether Bob can fetch Alice's invoice.
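
In code, the defect could look like the hypothetical FastAPI route below; `get_current_user`, `load_invoice`, and the model fields are stand-ins for whatever the scenario generator actually emits, not the project's real source.

```python
# Hypothetical vulnerable route in the generated app's style (illustrative only).
from dataclasses import dataclass
from fastapi import APIRouter, Depends, HTTPException

@dataclass
class User:
    id: str
    tenant_id: str
    is_admin: bool

def get_current_user() -> User: ...      # stand-in for the app's auth dependency
def load_invoice(invoice_id: str): ...   # stand-in for the app's data layer

router = APIRouter()

@router.get("/invoices/{invoice_id}")
def get_invoice(invoice_id: str, user: User = Depends(get_current_user)):
    invoice = load_invoice(invoice_id)   # loads by ID alone
    if invoice is None:
        raise HTTPException(status_code=404)
    return invoice                       # BUG: tenant/ownership never checked
```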

A useful model must not simply block everything. It has to preserve intended behavior:

```text
owner can read own invoice -> allowed
admin can read tenant invoice -> allowed
other user can read invoice -> denied
public status route still works -> allowed
```

That is the core reason this is useful for RL. The model is rewarded not for sounding secure, but for making the application secure while preserving product behavior.
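
Continuing the hypothetical route above, a patch that satisfies that matrix could look like the sketch below; the ownership and admin fields remain illustrative assumptions.

```python
# Patched handler replacing the vulnerable one above (same illustrative names).
@router.get("/invoices/{invoice_id}")
def get_invoice(invoice_id: str, user: User = Depends(get_current_user)):
    invoice = load_invoice(invoice_id)
    if invoice is None:
        raise HTTPException(status_code=404)
    is_owner = invoice.owner_id == user.id
    is_tenant_admin = user.is_admin and invoice.tenant_id == user.tenant_id
    if not (is_owner or is_tenant_admin):
        raise HTTPException(status_code=403)   # deny cross-user access only
    return invoice                             # owner and admin paths still work
```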

## Reward Design

The reward is decomposed so the model cannot win by shortcutting:

```python
{
    "discovery": ...,
    "security": ...,
    "regression": ...,
    "public_routes": ...,
    "patch_quality": ...,
    "visible_tests": ...,
    "safety": ...,
    "anti_cheat": ...,
    "terminal_total": ...,
}
```

Training can also enable dense shaping signals:

```python
{
    "progressive": ...,
    "step_penalty": ...,
    "speed_bonus": ...,
    "token_penalty": ...,
    "behavior_penalty": ...,
    "train_total": ...,
}
```
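
To make the decomposition concrete, here is a toy sketch of how the totals could be composed. The weights and the shaping arithmetic are invented for illustration; the repository's configs are authoritative.

```python
# Toy composition of the reward components above (illustrative weights only).
TERMINAL_WEIGHTS = {
    "discovery": 0.10, "security": 0.35, "regression": 0.15, "public_routes": 0.10,
    "patch_quality": 0.10, "visible_tests": 0.10, "safety": 0.05, "anti_cheat": 0.05,
}

def terminal_total(components: dict[str, float]) -> float:
    """Weighted sum of the terminal components."""
    return sum(w * components.get(k, 0.0) for k, w in TERMINAL_WEIGHTS.items())

def train_total(terminal: float, shaping: dict[str, float]) -> float:
    """Dense shaping rides on top of the terminal signal during training."""
    bonuses = shaping.get("progressive", 0.0) + shaping.get("speed_bonus", 0.0)
    penalties = (shaping.get("step_penalty", 0.0) + shaping.get("token_penalty", 0.0)
                 + shaping.get("behavior_penalty", 0.0))
    return terminal + bonuses - penalties
```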

The verifier checks multiple layers:

```text
visible tests
+ hidden authorization exploit tests
+ policy-oracle matrix
+ regression checks
+ public-route preservation
+ patch-quality checks
+ anti-cheat rules
```

The model is penalized for patterns like:

```text
deny-all fixes
hardcoded user IDs
hardcoded invoice IDs
test or fixture tampering
probing hidden files
external URL attempts
invalid or repeated action loops
breaking public routes
```

This matters because security environments are especially vulnerable to reward hacking. A model that "fixes" IDOR by returning 403 for every endpoint is not a security agent; it is a product outage generator. CyberSecurity_OWASP rewards the harder behavior: block the exploit while preserving the intended application contract.
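
Two of those anti-cheat rules are easy to picture as code. The heuristics below (a deny-all detector driven by the policy-oracle matrix, and a fixture-ID scan over the patch diff) are illustrative sketches, not the environment's actual checks.

```python
import re

def looks_deny_all(oracle_results: list[tuple[str, bool, int]]) -> bool:
    """oracle_results: (case_id, should_allow, observed_status) per oracle tuple.
    If every case that should be allowed now returns 4xx, the 'fix' is an outage."""
    allowed = [status for _, should_allow, status in oracle_results if should_allow]
    return bool(allowed) and all(status >= 400 for status in allowed)

def hardcodes_fixture_ids(patch_diff: str) -> bool:
    """Flag patches that pin fixture-style IDs (e.g. "inv_alice_001") in code."""
    return re.search(r'["\'](?:inv|user)_[a-z]+_\d+["\']', patch_diff) is not None
```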

## Why This Is an Environment

CyberSecurity_OWASP is an environment because the model must interact with a partially observable world. It does not receive the answer. It has to gather evidence, run safe local probes, understand policy, make edits, and submit a final fix.

It is long-horizon because success requires a sequence of correct steps. Reading the wrong file, skipping diagnosis, patching the wrong layer, or breaking public routes can reduce reward.

It is self-improving because the curriculum controller can select difficulty tiers and target weak spots. Scenario generation is cache-backed, configurable, and extensible. The environment can create more authorization tasks as the model improves.

It is measurable because every episode ends with deterministic verification. The model either blocked the exploit and preserved behavior, or it did not.

## Training Approach

The target policy model is **Gemma 4 E2B Instruct** through Unsloth.

The model choice is intentional: small, instruction-tuned, code-capable models are closer to the cost and latency profile needed for local developer workflows. Google describes Gemma 4 as an open model family designed for reasoning, agentic workflows, function calling, structured JSON output, code generation, and efficient deployment across hardware sizes.[4] Unsloth supports fine-tuning Gemma 4 E2B, including text and RL workflows.[5]
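
On the Unsloth side, loading a small model with a LoRA adapter is roughly the following; the model id and hyperparameters are placeholders rather than the project's actual settings.

```python
from unsloth import FastLanguageModel

# Placeholder model id and hyperparameters; the project's training configs
# are authoritative.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-e2b-it",   # assumed id for illustration
    max_seq_length=8192,
    load_in_4bit=True,                     # small-footprint loading
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```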

The training scaffold has two stages.

### 1. Synthetic SFT warm start

A teacher model executes real environment trajectories. Only trajectories that pass the deterministic verifier are kept. This creates supervised data where each row teaches the model a valid step in a successful security-repair workflow.

Example actions include:

```json
{"tool_name": "inspect_policy_graph"}
{"tool_name": "read_file", "arguments": {"path": "app/routes/invoices.py"}}
{"tool_name": "send_local_request", "arguments": {"method": "GET", "path": "/invoices/inv_alice_001"}}
{"tool_name": "submit_diagnosis", "arguments": {"bug_type": "broken_access_control"}}
{"tool_name": "patch_file", "arguments": {"path": "app/routes/invoices.py", "diff": "..."}}
{"tool_name": "submit_fix"}
```
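
A sketch of how such verified trajectories could be flattened into supervised rows; the trajectory field names are assumptions for illustration.

```python
import json

def to_sft_rows(trajectory: dict) -> list[dict]:
    """One chat-format row per step of a verifier-passing episode.
    Field names (verifier_passed, steps, observation, action) are assumptions."""
    if not trajectory.get("verifier_passed"):
        return []                        # keep only verified successes
    return [
        {"messages": [
            {"role": "user", "content": step["observation"]},
            {"role": "assistant", "content": json.dumps(step["action"])},
        ]}
        for step in trajectory["steps"]
    ]
```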

### 2. GRPO reinforcement learning

After SFT, GRPO trains the model against the live OpenEnv environment. The model receives reward from the verifier, not from a preference label. This lets it optimize for real task success: discovering the bug, repairing it, and preserving behavior.
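
With TRL's `GRPOTrainer`, the wiring could look roughly like the sketch below. The environment-scoring hook (`rollout_and_verify`), the prompt dataset, and the model handle are assumed to come from the project's own scripts.

```python
from trl import GRPOConfig, GRPOTrainer

def env_reward(prompts, completions, **kwargs):
    # Score each sampled completion by replaying it against the live
    # environment; rollout_and_verify stands in for the project's real hook.
    return [rollout_and_verify(p, c) for p, c in zip(prompts, completions)]

trainer = GRPOTrainer(
    model=model,                      # the SFT-warm-started LoRA from stage 1
    reward_funcs=env_reward,
    args=GRPOConfig(output_dir="outputs/grpo", max_steps=300),
    train_dataset=prompt_dataset,     # episode prompts, prepared upstream
)
trainer.train()
```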

Runs are logged through Trackio. Modal launchers support cache preparation, smoke tests, SFT, and GRPO. The environment keeps scenario generation separate from training so GPU jobs do not waste time compiling scenarios during rollout.

## Evaluation Focus

The most important metric is not whether the model can say "this is IDOR." The important metric is whether it can produce a patch that passes hidden authorization checks while keeping legitimate application behavior intact.

The evaluation suite tracks the metrics below; a toy aggregation sketch follows the list:

- average terminal reward
- exploit-block rate
- regression-preservation rate
- public-route preservation rate
- invalid-action rate
- anti-cheat pass rate
- full success rate
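
The aggregation over episode records might look like this; the record field names are assumptions, not the repo's actual schema.

```python
from statistics import mean

def summarize(episodes: list[dict]) -> dict:
    """Aggregate episode records into the tracked rates. Field names here
    (exploit_blocked, regressions_pass, ...) are assumptions for this sketch."""
    n = len(episodes)
    rate = lambda key: sum(1 for e in episodes if e.get(key)) / n
    return {
        "avg_terminal_reward": mean(e["terminal_total"] for e in episodes),
        "exploit_block_rate": rate("exploit_blocked"),
        "regression_preservation_rate": rate("regressions_pass"),
        "public_route_preservation_rate": rate("public_routes_pass"),
        "invalid_action_rate": mean(
            e.get("invalid_actions", 0) / max(e.get("steps", 1), 1) for e in episodes
        ),
        "anti_cheat_pass_rate": rate("anti_cheat_pass"),
        "full_success_rate": rate("full_success"),
    }
```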

The blog should be paired with the latest Trackio dashboard and `outputs/evals/` summaries for concrete before-and-after numbers. This write-up intentionally avoids claiming training results that are not present in the repository.

## What Makes This Useful

Most developer security tooling is reactive. A scanner runs after code is written. A bug bounty report arrives after deployment. A security review happens late, if it happens at all.

The long-term direction here is proactive protection.

Imagine a small local model that runs inside a developer workflow:

```text
before commit
before deploy
inside CI
inside an IDE
inside a mobile or offline coding assistant
```

It reads the policy, inspects changed routes, tries safe local probes, identifies authorization gaps, and proposes patches. Because the model is small, it can be cheap enough to run repeatedly. Because it is RL-trained in an environment with hidden policy-oracle checks, it can learn behavior beyond static pattern matching.

The point is not to replace professional security review. The point is to make a baseline level of defensive reasoning available to teams that currently have almost none.

## Responsible Scope

CyberSecurity_OWASP is defensive by construction.

The environment uses synthetically generated applications and safe local requests. Hidden tests, exploit labels, oracle tuples, and scenario-family labels are not exposed to the agent. External URL attempts are penalized. The model is trained to diagnose and patch authorization bugs in a sandbox, not to attack real systems.

That boundary is important. The same AI progress that makes automated vulnerability discovery powerful also makes safe training environments more urgent.

## Future Work

The current MVP focuses on OWASP A01 Broken Access Control, especially BOLA/IDOR-style defects. The same framework can expand to additional OWASP risk families and richer application shapes:

- more app domains
- more policy shapes
- multi-route authorization chains
- schema drift
- stronger curriculum adaptation
- realistic CI-style patch review
- larger held-out scenario families

The bigger vision is a family of small, specialized security agents that can run close to where software is written.

Project Glasswing shows what frontier models may do for the world's most critical software. CyberSecurity_OWASP asks a complementary question:

**Can we train small open models to protect the long tail of everyday software?**

This submission is a first step toward that answer.

[1]: https://www.anthropic.com/glasswing "Project Glasswing: Securing critical software for the AI era | Anthropic"
[2]: https://red.anthropic.com/2026/mythos-preview/ "Claude Mythos Preview | red.anthropic.com"
[3]: https://owasp.org/Top10/2025/A01_2025-Broken_Access_Control/ "A01 Broken Access Control - OWASP Top 10:2025"
[4]: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/ "Gemma 4: Our most capable open models to date"
[5]: https://unsloth.ai/docs/models/gemma-4/train "Gemma 4 Fine-tuning Guide | Unsloth Documentation"