# 01_ARCHITECTURE.md
# CyberSecurity_OWASP: Architecture
## 1. System goal
`CyberSecurity_OWASP` is an OpenEnv environment for training a **single LLM policy** to perform a complete defensive authorization-repair workflow:
```text
Understand policy → discover local evidence → patch code → validate → submit
```
The environment is intentionally not a two-agent red-team/blue-team setup. The agent is one model with one trajectory. It must learn both sides of the defensive workflow: finding the policy violation and fixing it safely.
## 2. Final architecture diagram
Rendered asset:

Editable source: `assets/architecture_diagram.mmd`
```mermaid
flowchart TB
subgraph A[Async Scenario Authoring + Curriculum Factory]
A1[Config-guided LLM Scenario Author\nDeepSeek-V4-Pro default]
A2[ScenarioSpec JSON\npolicy, app family, bug target]
A3[Template + A01 Mutator\nFastAPI code variants]
A4[Deterministic Compiler\nexecutable bundle]
A5[Static + Dynamic Verifier\nsolvable, safe, hidden/visible tests]
A6[Difficulty Calibrator\nbaseline pass-rate buckets]
A7[Versioned Scenario Cache\nsplit, difficulty, family, hash]
A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
end
subgraph B[CyberSecurity_OWASP OpenEnv Runtime]
B1[reset\(seed, difficulty, family_budget\)\ncache lookup only]
B2[Curriculum Sampler\nvalidated cache slice]
B3[Episode State Store\nphase, history, cache metadata, patch diff]
B4[Typed Action Tools\ninspect, request, patch, visible tests]
B5[Ephemeral App Sandbox\ncloned cached workspace + fixtures]
B6[Multi-layer Verifier\nvisible, hidden, oracle, regression]
B7[Deterministic Reward Engine\nstable components + penalties]
B8[Episode Artifact Logger\nJSONL transcript + verifier + diff]
B1 --> B2 --> B3 --> B4
B4 <--> B5
B5 --> B6 --> B7 --> B3
B3 --> B8
end
subgraph C[Single LLM Agent]
C1[Observation Parser]
C2[AuthZ + Code Reasoning]
C3[Discover → Diagnose → Patch → Test\none JSON action]
C1 --> C2 --> C3
end
subgraph D[Training + Evaluation + Demo]
D1[Parallel Rollouts\nfast cached reset]
D2[TRL GRPO + LoRA]
D3[Trackio Curves\nreward, pass rates, cache metrics]
D4[Held-out Family Eval\nbase vs trained model]
D5[Demo Artifacts\nbefore/after traces + JSONL]
D1 --> D2 --> D3 --> D4 --> D5
end
subgraph E[Feedback / Adaptation Loop]
E1[Episode logs + failures]
E2[Mastery Model\nweakness and plateau tracking]
E3[Cache Sampling Weights\nnew generation queue]
E1 --> E2 --> E3
end
A7 --> B1
C3 -->|typed action| B4
B4 -->|observation + reward + done| C1
B7 --> D1
D2 --> C1
B8 --> E1
E3 --> A1
```
## 3. Component responsibilities
### 3.1 Async Scenario Authoring Plane
Scenario generation is offline, asynchronous, validated, and cached. Runtime `reset()` must not call an LLM and must not compile a fresh app during Modal smoke, training, or evaluation runs.
The scenario authoring plane outputs complete executable bundles:
- `scenario.json`;
- `app_source/`;
- `policy_graph.json`;
- `visible_tests.py`;
- `hidden_tests.py`;
- `oracle_tests.py`;
- `expected_exploit_trace.json`;
- `reward_config.json`;
- `metadata.json`.
The default scenario/curriculum author is configured in `configs/scenario_authoring.small.json`:
```yaml
provider: huggingface
model_id: deepseek-ai/DeepSeek-V4-Pro
thinking_mode: thinking
reasoning_effort: high
temperature: 1.0
top_p: 1.0
```
DeepSeek-V4-Pro is only used for offline scenario/curriculum authoring. It is not the RL policy model unless explicitly selected for training.
The compiler remains the main anti-overfitting mechanism. It should vary:
- route names;
- schema names;
- ORM query structure;
- framework template;
- role names;
- tenant IDs;
- object ownership patterns;
- file layout;
- visible test coverage;
- hidden invariant seeds.
The runtime treats curriculum and cache sampling as first-class scenario inputs:
- `CurriculumController` tracks target weakness mastery, recent reward trend, failure counts, and difficulty tier.
- Offline cache prep uses the configured LLM author, deterministic compiler, verifier, and baseline-agent difficulty calibrator.
- `ScenarioCache` stores validated bundles by split, difficulty, family, generator version, verifier version, and scenario hash.
- Hidden-eval episodes hold out scenario families, not only seeds, by marking evaluation-only scenario-family metadata in state rather than observations.
Cache keys include:
```text
difficulty_level
authz_bug_type
app_family
framework
policy_shape
tenant_model
exploit_depth
patch_scope
regression_risk
generator_version
verifier_version
scenario_hash
```
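One way to turn these fields into a single stable lookup string is to hash a canonical JSON encoding of them. The sketch below is an assumption about how such a key could be derived, not the project's actual cache code; `build_cache_key`, `KEY_FIELDS`, and the sample values are hypothetical.

```python
import hashlib
import json

# Field names copied from the cache-key list above. scenario_hash is left out
# because in this sketch the digest plays that role.
KEY_FIELDS = (
    "difficulty_level", "authz_bug_type", "app_family", "framework",
    "policy_shape", "tenant_model", "exploit_depth", "patch_scope",
    "regression_risk", "generator_version", "verifier_version",
)


def build_cache_key(metadata: dict) -> str:
    """Return a deterministic hex digest over the key fields (hypothetical helper)."""
    missing = [name for name in KEY_FIELDS if name not in metadata]
    if missing:
        raise KeyError(f"missing cache key fields: {missing}")
    # sort_keys plus fixed separators give a canonical encoding, so the order
    # in which the dict was built cannot change the digest.
    canonical = json.dumps(
        {name: metadata[name] for name in KEY_FIELDS},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only.
example = {name: "v1" for name in KEY_FIELDS}
example["app_family"] = "invoices"
key = build_cache_key(example)
```

Because the generator and verifier versions are part of the hashed payload, bumping either one yields a different key, which matches the idea of preparing the cache once per generator/verifier version.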
### 3.2 Policy Graph Generator
The policy graph is the ground truth for intended behavior.
Example internal representation:
```yaml
resources:
  invoice:
    owner_field: owner_user_id
    tenant_field: tenant_id
roles:
  user:
    can:
      - read:invoice where owner_user_id == actor.user_id
      - update:invoice where owner_user_id == actor.user_id and status != locked
  support:
    can:
      - read:invoice where tenant_id == actor.tenant_id
  admin:
    can:
      - read:any_invoice where tenant_id == actor.tenant_id
      - update:any_invoice where tenant_id == actor.tenant_id
public_routes:
  - GET /health
  - GET /pricing
forbidden:
  - cross_tenant_read
  - cross_tenant_update
  - user_reads_other_user_invoice
```
The policy graph prevents false rewards for over-securing intentionally public or intentionally allowed routes.
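To make the example rules concrete, here is a minimal sketch that hand-translates them into a Python check. It is not a parser for the policy graph format, and the names `Actor`, `Invoice`, `is_allowed`, and `requires_auth` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actor:
    user_id: str
    tenant_id: str
    role: str  # "user", "support", or "admin"


@dataclass(frozen=True)
class Invoice:
    owner_user_id: str
    tenant_id: str
    status: str = "open"


PUBLIC_ROUTES = {"GET /health", "GET /pricing"}


def requires_auth(route: str) -> bool:
    """Public routes must stay reachable without authorization."""
    return route not in PUBLIC_ROUTES


def is_allowed(actor: Actor, action: str, invoice: Invoice) -> bool:
    """Hand-coded version of the example rules; action is "read" or "update"."""
    same_tenant = invoice.tenant_id == actor.tenant_id
    is_owner = invoice.owner_user_id == actor.user_id
    if actor.role == "user":
        if action == "read":
            return is_owner
        if action == "update":
            return is_owner and invoice.status != "locked"
        return False
    if actor.role == "support":
        return action == "read" and same_tenant
    if actor.role == "admin":
        return action in {"read", "update"} and same_tenant
    return False  # unknown roles are denied by default


alice = Actor("u1", "t1", "user")
support_t2 = Actor("s9", "t2", "support")
own_invoice = Invoice("u1", "t1")
other_invoice = Invoice("u2", "t1")
```

A verifier built on a check like this can score both directions: `is_allowed(alice, "read", other_invoice)` must be false, while `requires_auth("GET /pricing")` must also be false, so a deny-all patch does not earn reward.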
### 3.3 Bug Injector
The bug injector creates controlled, defensive lab scenarios. It should only generate bugs inside local synthetic apps.
MVP bug classes:
| Bug class | Example failure mode | Expected fix type |
|---|---|---|
| Missing route guard | Protected endpoint lacks authorization middleware | Add policy check/middleware |
| IDOR / ownership bug | User can access another user's object by changing ID | Add owner check in query/policy |
| Tenant leak | Tenant A can list Tenant B records | Add tenant filter |
| Role confusion | Support/editor/admin boundary is wrong | Correct role-to-permission mapping |
| Client-side-only auth | Server trusts UI to hide forbidden action | Enforce server-side authorization |
| Query omission | List/export/search endpoint lacks auth filter | Filter query by actor permissions |
| Over-broad mutation | User can update/delete forbidden object | Add mutation permission check |
| Public route decoy | Agent may wrongly lock down intended public endpoint | Preserve intended public behavior |
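
As an illustration of the IDOR / ownership row, the sketch below shows a vulnerable lookup next to a repaired one. It is a framework-free stand-in for a generated FastAPI handler; the in-memory `INVOICES` store, the `Forbidden` exception, and both function names are hypothetical.

```python
# Synthetic fixture data standing in for a scenario database.
INVOICES = {
    1: {"id": 1, "owner_user_id": "alice", "tenant_id": "t1"},
    2: {"id": 2, "owner_user_id": "bob", "tenant_id": "t1"},
}


class Forbidden(Exception):
    """Raised when the actor may not access the requested object."""


def get_invoice_vulnerable(actor_user_id: str, invoice_id: int) -> dict:
    # Bug: the lookup trusts the client-supplied ID and never checks ownership.
    return INVOICES[invoice_id]


def get_invoice_fixed(actor_user_id: str, invoice_id: int) -> dict:
    invoice = INVOICES[invoice_id]
    # Fix: enforce the ownership rule on the server before returning data.
    if invoice["owner_user_id"] != actor_user_id:
        raise Forbidden(f"user {actor_user_id!r} does not own invoice {invoice_id}")
    return invoice
```

The fix is local to the handler and leaves the legitimate owner path untouched, which is the kind of patch the regression and patch-quality checks are meant to favor.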
### 3.4 OpenEnv Server
The OpenEnv server should implement the standard lifecycle:
- `reset()`: initialize a fresh episode from a cached scenario bundle.
- `step(action)`: execute one typed action and return observation, reward, and done.
- `state()`: expose episode metadata for debugging and evaluation.
Recommended package/class names:
```text
Repo name: CyberSecurity_OWASP
Python package: cybersecurity_owasp
Client class: CyberSecurityOWASPEnv
Action class: CyberSecurityOWASPAction
Observation: CyberSecurityOWASPObservation
State: CyberSecurityOWASPState
```
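The reset/step/state contract can be sketched without any OpenEnv dependency. The toy classes below are plain-Python stand-ins that do not subclass OpenEnv types; `ToyEnv`, `ToyState`, and the observation dict shape are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ToyState:
    episode_id: str
    step_count: int = 0
    max_steps: int = 40
    done: bool = False
    history: list = field(default_factory=list)


class ToyEnv:
    """Plain-Python stand-in for the lifecycle; not an OpenEnv subclass."""

    def __init__(self) -> None:
        self._episodes = 0
        self._state = ToyState(episode_id="unset")

    def reset(self) -> dict:
        # A real reset would load a cached scenario bundle here.
        self._episodes += 1
        self._state = ToyState(episode_id=f"ep-{self._episodes}")
        return {"phase": "discover", "message": "episode started"}

    def step(self, action: dict) -> tuple[dict, float, bool]:
        if self._state.done:
            raise RuntimeError("episode finished; call reset()")
        self._state.step_count += 1
        self._state.history.append(action["tool_name"])
        submitted = action["tool_name"] == "submit_fix"
        out_of_steps = self._state.step_count >= self._state.max_steps
        self._state.done = submitted or out_of_steps
        phase = "done" if self._state.done else "discover"
        # Reward is a placeholder; the real value comes from the reward engine.
        return {"phase": phase, "message": "ok"}, 0.0, self._state.done

    def state(self) -> ToyState:
        return self._state


env = ToyEnv()
first_obs = env.reset()
_, _, done_after_list = env.step({"tool_name": "list_routes", "arguments": {}})
_, _, done_after_submit = env.step({"tool_name": "submit_fix", "arguments": {}})
```

The episode ends either on `submit_fix` or when the step budget runs out, so a policy that never submits still terminates.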
### 3.5 Tool API
The agent should interact through typed actions. Keep the interface small enough for RL but expressive enough for realistic repair.
```python
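from dataclasses import dataclass
from typing import Literal

# `Action` is the OpenEnv base action type; import it from the installed
# OpenEnv package (the module path depends on the OpenEnv version).


@dataclass
class CyberSecurityOWASPAction(Action):
    tool_name: Literal[
        "inspect_policy_graph",
        "list_routes",
        "read_openapi",
        "read_file",
        "search_code",
        "send_local_request",
        "compare_identities",
        "submit_diagnosis",
        "patch_file",
        "run_visible_tests",
        "submit_fix",
        "noop",
    ]
    arguments: dict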
```
Recommended actions:
| Action | Purpose | Safety boundary |
|---|---|---|
| `inspect_policy_graph` | Read intended authorization rules. | Only synthetic policy. |
| `list_routes` | See local app route map. | No internet target. |
| `read_file` | Inspect selected source file. | Sandbox allowlist only. |
| `send_local_request` | Validate behavior against local app. | Local generated app only. |
| `submit_diagnosis` | Record bug class, route, policy rule, evidence trace IDs, and fix plan. | Does not reveal hidden tests. |
| `run_visible_tests` | Run visible tests. | No hidden test disclosure. |
| `patch_file` | Modify source through unified diff or full content. | Patch size and file allowlist limits. |
| `submit_fix` | End episode and trigger hidden eval. | Final hidden score only, no leaked test details. |
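
The "sandbox allowlist" boundary for `read_file` and `patch_file` could be enforced with a path check along these lines. This is a sketch under assumptions: the allowed root and the blocked file names are taken from the bundle layout described earlier, and `is_path_allowed` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Assumption: the agent may only touch files under the cloned app source.
ALLOWED_ROOTS = ("app_source",)
# Assumption: these bundle files must never be readable or patchable.
BLOCKED_NAMES = {"hidden_tests.py", "oracle_tests.py", "reward_config.json"}


def is_path_allowed(relative_path: str) -> bool:
    """Return True only for relative paths inside an allowed root."""
    path = PurePosixPath(relative_path)
    # Reject absolute paths and any parent-directory traversal.
    if path.is_absolute() or ".." in path.parts:
        return False
    if path.name in BLOCKED_NAMES:
        return False
    return bool(path.parts) and path.parts[0] in ALLOWED_ROOTS
```

A rejected path is also a natural signal for the `anti_cheat` reward component, since probing hidden files is listed among its triggers.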
### 3.6 Observation schema
Observations should be compact and structured.
```python
from dataclasses import dataclass, field
from typing import Literal

# `Observation` is the OpenEnv base observation type.


@dataclass
class CyberSecurityOWASPObservation(Observation):
    phase: Literal["discover", "patch", "done"]
    message: str
    task_brief: str
    visible_policy_hint: dict
    workspace_summary: dict
    available_actions: list[str]
    last_tool_result: str
    visible_test_result: str | None = None
    reward_breakdown: dict[str, float] = field(default_factory=dict)
    done_reason: str | None = None
```
The policy hint is deliberately partial. It may include product rules, fixture aliases, route summaries, and public-route intent, but it must not expose the hidden oracle matrix, hidden test bodies, injected bug labels, or held-out family labels.
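One way to keep the hint partial is to build it from an allowlist, so any field not explicitly approved stays out of the observation. The key names below are illustrative assumptions (the source only names the categories of information), and `build_visible_policy_hint` is a hypothetical helper.

```python
# Assumption: illustrative key names for the permitted hint categories.
ALLOWED_HINT_KEYS = frozenset(
    {"product_rules", "fixture_aliases", "route_summaries", "public_routes"}
)


def build_visible_policy_hint(scenario_metadata: dict) -> dict:
    """Copy only allowlisted keys so new hidden fields are excluded by default."""
    return {
        key: value
        for key, value in scenario_metadata.items()
        if key in ALLOWED_HINT_KEYS
    }


full_metadata = {
    "product_rules": ["users see only their own invoices"],
    "public_routes": ["GET /health", "GET /pricing"],
    "oracle_matrix": {"hidden": True},      # must never reach the agent
    "injected_bug_label": "idor",           # must never reach the agent
}
hint = build_visible_policy_hint(full_metadata)
```

An allowlist fails closed: if a later generator version adds another sensitive field, it is dropped unless someone deliberately approves it.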
### 3.7 State schema
State should support debugging and training analytics.
```python
from dataclasses import dataclass, field
from typing import Literal

# `State` is the OpenEnv base state type.


@dataclass
class CyberSecurityOWASPState(State):
    episode_id: str
    task_id: str
    split: Literal["train", "validation", "hidden_eval"]
    step_count: int = 0
    max_steps: int = 40
    difficulty_tier: str = "warmup"
    scenario_family: str = ""
    template_id: str = "fastapi_basic"
    target_weakness: str = ""
    curriculum_snapshot: dict = field(default_factory=dict)
    verification_summary: dict = field(default_factory=dict)
    patch_diff: str = ""
    episode_artifact_path: str | None = None
    accumulated_reward: float = 0.0
```
## 4. Episode lifecycle
```text
1. reset()
   - curriculum selects difficulty tier and target weakness
   - runtime samples or directly loads a validated cached bundle
   - clone cached `app_source/` into an isolated ephemeral workspace
   - initialize fixture state, cache metadata, and sandbox handles
   - return initial observation
2. agent loop
   - inspect policy/routes/files
   - send local requests only inside sandbox
   - run public tests
   - apply one or more patches
   - rerun public tests
3. submit_fix
   - freeze patch
   - run public tests
   - run hidden authorization invariants
   - run policy-oracle matrix
   - run regression and public-route preservation tests
   - compute deterministic reward
   - return final observation, reward, done=True
4. logging
   - append JSONL artifact with scenario metadata, action trace, observations, patch diff, verifier result, and reward components
   - feed terminal success/failure back into curriculum mastery tracking
   - send metrics to Trackio during training/eval
```
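The logging step can be sketched as an append-only JSONL writer, one JSON object per episode. The helper name `append_episode_artifact` and the record fields used in the example are hypothetical.

```python
import json
from pathlib import Path


def append_episode_artifact(path: Path, record: dict) -> None:
    """Append one episode record as a single JSON line (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier episodes; sort_keys makes lines easy to diff.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

A record might carry fields such as `episode_id`, `actions`, `patch_diff`, and `reward_breakdown`. These artifacts are written for offline analysis and should not be surfaced in observations.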
`CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` is mandatory for Modal smoke, training, and evaluation. In that mode a missing cache bundle is a hard failure. Local development may use `fallback`, which compiles deterministically on a miss, but that path is not allowed for meaningful training.
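The difference between the two modes can be sketched as follows. The environment variable name comes from the text above; the in-memory `cache` dict, the `compile_fallback` callback, the `ScenarioCacheMiss` exception, and defaulting to `require` when the variable is unset are all assumptions.

```python
import os


class ScenarioCacheMiss(RuntimeError):
    """Raised when a bundle is missing and compiling on demand is not allowed."""


def load_bundle(cache: dict, key: str, compile_fallback=None) -> dict:
    # Assumption: treat an unset variable as the strict mode.
    mode = os.environ.get("CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE", "require")
    if key in cache:
        return cache[key]
    if mode == "fallback" and compile_fallback is not None:
        # Local development only: compile deterministically and remember it.
        bundle = compile_fallback(key)
        cache[key] = bundle
        return bundle
    # In require mode a miss is a hard failure, never a silent compile.
    raise ScenarioCacheMiss(f"no cached bundle for {key!r} (mode={mode})")
```

Failing hard in `require` mode keeps reset latency predictable and guarantees that every training episode uses a bundle that already passed verification.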
## 5. Reward design
The reward should be deterministic, decomposed, and resistant to reward hacking. The maximum terminal reward remains **15.0** and high reward requires deterministic verifier success, not explanation quality.
Stable reward keys:
```python
{
    "discovery": 0.0,
    "security": 0.0,
    "regression": 0.0,
    "public_routes": 0.0,
    "patch_quality": 0.0,
    "visible_tests": 0.0,
    "safety": 0.0,
    "anti_cheat": 0.0,
    "terminal_total": 0.0,
    "progressive": 0.0,
    "step_penalty": 0.0,
    "speed_bonus": 0.0,
    "token_penalty": 0.0,
    "behavior_penalty": 0.0,
    "train_total": 0.0,
    "total": 0.0,
}
```
Sparse evaluation uses `terminal_total` as `total`. Dense training uses
`terminal_total + shaping_weight * progressive + efficiency - penalties` as `total`,
with all reward values and short descriptions configured in
`training/configs/grpo_small.yaml`.
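The two totals can be written out as a small sketch. The formula follows the sentence above, but the mapping of "efficiency" to `speed_bonus` and of "penalties" to the sum of `step_penalty`, `token_penalty`, and `behavior_penalty` (stored as non-negative magnitudes) is an assumption, as are the function names and sample numbers.

```python
def sparse_total(breakdown: dict) -> float:
    """Sparse evaluation: the terminal score is the total."""
    return breakdown["terminal_total"]


def dense_total(breakdown: dict, shaping_weight: float) -> float:
    """Dense training total; term mapping is an assumption, see lead-in."""
    efficiency = breakdown["speed_bonus"]
    penalties = (
        breakdown["step_penalty"]
        + breakdown["token_penalty"]
        + breakdown["behavior_penalty"]
    )
    return (
        breakdown["terminal_total"]
        + shaping_weight * breakdown["progressive"]
        + efficiency
        - penalties
    )


# Illustrative numbers only.
sample = {
    "terminal_total": 10.0,
    "progressive": 2.0,
    "speed_bonus": 0.5,
    "step_penalty": 0.25,
    "token_penalty": 0.25,
    "behavior_penalty": 0.0,
}
```

With `shaping_weight=0.5` the sample gives `10.0 + 1.0 + 0.5 - 0.5 = 11.0` for training, while evaluation reports `10.0`.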
### Reward components
| Component | Purpose |
|---|---|
| `discovery` | Valid local evidence and correct violated policy rule. |
| `security` | Hidden exploit blocking plus policy-oracle matrix pass. |
| `regression` | Legitimate owner/admin/support flows still work. |
| `public_routes` | Intentionally public routes remain public. |
| `patch_quality` | Localized policy-aligned patch and efficient phase order. |
| `visible_tests` | Visible tests pass and app still boots. |
| `safety` | Penalizes invalid action patterns, unsafe targets, timeouts, and deny-all behavior. |
| `anti_cheat` | Penalizes hidden-file probing, hardcoded fixture IDs, and test/oracle tampering. |
### Penalties
| Penalty | Trigger |
|---|---|
| public route penalty | Breaks a route intentionally marked public. |
| anti-cheat penalty | Deletes or probes tests, hidden files, reward code, oracle data, or host paths. |
| hardcoding penalty | Hardcodes seed-specific IDs, users, tenants, or hidden assumptions. |
| safety penalty | Over-broad denial, malformed/invalid actions, repeated failed actions, or external target attempts. |
The LLM judge, if used at all, should only annotate trace quality for analysis. It must not decide security-critical reward.
## 6. Hidden tests and anti-overfitting
Hidden tests are necessary because visible tests can be gamed or memorized. They should test policy invariants rather than exact implementation details.
Use **4 anti-overfitting layers**:
1. **Seed diversity**: route names, user IDs, tenant IDs, object names, and schemas change every episode.
2. **Template diversity**: same policy bug appears in different frameworks and file layouts.
3. **Hidden invariant tests**: final reward uses unseen authorization cases.
4. **Held-out eval split**: at least 20% of scenario families/seeds are never used in training.
Recommended split:
```text
Train: 70%
Validation: 10%
Held-out: 20%
```
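A deterministic way to approximate this split at the family level is to hash the family name into 100 buckets. This is a sketch, not the project's sampler: `assign_split` is a hypothetical helper, and with few families the realized proportions will only roughly match 70/10/20.

```python
import hashlib


def assign_split(scenario_family: str) -> str:
    """Map a scenario family to a split; the same family always lands in one split."""
    digest = hashlib.sha256(scenario_family.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < 70:
        return "train"
    if bucket < 80:
        return "validation"
    return "hidden_eval"
```

Splitting on the family name rather than the seed means every seed of a held-out family stays out of training, which is what makes the held-out evaluation a test of generalization rather than memorization.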
## 7. Evaluation plan
Run before/after evaluation on the same held-out suite.
### Metrics
| Metric | Meaning |
|---|---|
| `episode_success_rate` | Public + hidden + regression tests pass. |
| `hidden_authz_pass_rate` | Security-critical hidden checks pass. |
| `regression_pass_rate` | Normal valid behavior remains intact. |
| `oversecure_rate` | Agent blocks intended legitimate/public behavior. |
| `patch_compile_rate` | Patch applies and app still runs. |
| `median_steps_to_submit` | Efficiency of the repair workflow. |
| `median_files_changed` | Patch focus/minimality. |
| `reward_hacking_rate` | Attempts to delete tests, hardcode fixtures, or bypass eval. |
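
Several of these metrics can be aggregated from per-episode records as sketched below. The record keys (`visible_pass`, `hidden_pass`, `regression_pass`, `oversecure`, `steps`, `files_changed`) and the `summarize` helper are hypothetical, and the sample episodes are made up for illustration.

```python
from statistics import median


def summarize(episodes: list[dict]) -> dict:
    """Aggregate per-episode verifier results into headline metrics."""
    n = len(episodes)
    if n == 0:
        raise ValueError("no episodes to summarize")

    def rate(flag: str) -> float:
        return sum(1 for ep in episodes if ep[flag]) / n

    successes = sum(
        1
        for ep in episodes
        if ep["visible_pass"] and ep["hidden_pass"] and ep["regression_pass"]
    )
    return {
        "episode_success_rate": successes / n,
        "hidden_authz_pass_rate": rate("hidden_pass"),
        "regression_pass_rate": rate("regression_pass"),
        "oversecure_rate": rate("oversecure"),
        "median_steps_to_submit": median(ep["steps"] for ep in episodes),
        "median_files_changed": median(ep["files_changed"] for ep in episodes),
    }


# Made-up records for illustration.
episodes = [
    {"visible_pass": True, "hidden_pass": True, "regression_pass": True,
     "oversecure": False, "steps": 12, "files_changed": 1},
    {"visible_pass": True, "hidden_pass": True, "regression_pass": False,
     "oversecure": True, "steps": 20, "files_changed": 3},
    {"visible_pass": True, "hidden_pass": False, "regression_pass": True,
     "oversecure": False, "steps": 30, "files_changed": 2},
    {"visible_pass": True, "hidden_pass": True, "regression_pass": True,
     "oversecure": False, "steps": 10, "files_changed": 1},
]
summary = summarize(episodes)
```

Running the same function over base-model and trained-model episodes on the held-out suite fills both rows of the evaluation table with directly comparable numbers.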
### Eval table template
| Model | Split | Success | Hidden authz | Regression | Oversecure | Median steps | Median files changed |
|---|---|---:|---:|---:|---:|---:|---:|
| Base model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
| RL-trained model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
## 8. Training flow
Rendered asset:

Editable source: `assets/env_rl_training_flow_diagram.mmd`
```text
1. Build CyberSecurity_OWASP OpenEnv server.
2. Prepare validated scenario cache once per generator/verifier version.
3. Run baseline eval with cached validation/held-out bundles.
4. Train with GRPO/TRL or Unsloth using cached rollout episodes.
5. Log reward components, pass rates, reset latency, and cache hit metrics to Trackio.
6. Run held-out eval every N training steps.
7. Inspect failure clusters and cache sampling weights.
8. Refresh only 5-10% of scenarios per epoch when new weak spots are found.
9. Produce final demo: before/after trace + reward curve + held-out eval table.
```
Recommended initial training setup (Modal-first):
```text
Model: unsloth/gemma-4-E2B-it
Algorithm: GRPO via TRL or Unsloth-compatible loop
Dataset prompt: repeated task instruction with randomized scenario IDs
Max steps per episode: 30
Rollouts per prompt: 2-4
Logging: Trackio
Primary eval: held-out deterministic test pass rate
Scenario cache mode: require
Scenario cache volume: CyberSecurity_OWASP-scenario-cache
```
Training execution is expected to run on Modal (persistent or ephemeral) rather than locally.
## 9. Deployment architecture
The environment should be runnable in 3 modes:
| Mode | Purpose |
|---|---|
| Local Uvicorn | Fast engineer iteration. |
| Docker | Reproducible local training/eval. |
| Hugging Face Spaces | Public hackathon demo and OpenEnv-compliant hosting. |
Expected endpoints:
```text
/ws      OpenEnv client session
/health  health check
/reset   debug reset
/step    debug step
/state   debug state
/docs    FastAPI docs
/web     optional web UI
```
## 10. Implementation milestones
### Milestone 1: Skeleton environment
- `models.py`
- `client.py`
- `server/environment.py`
- `server/app.py`
- `server/Dockerfile`
- `openenv.yaml`
- health check
- one hand-written scenario
### Milestone 2: Scenario compiler
- policy graph format
- app template renderer
- bug injector
- DB fixture generator
- public and hidden test generator
### Milestone 3: Reward engine
- public test score
- hidden invariant score
- regression score
- patch minimality score
- safety/reward-hacking penalties
- reward component logging
### Milestone 4: Training script
- rollout loop
- GRPO/TRL or Unsloth training script
- Trackio logging
- checkpoint save/push
- baseline and post-training eval
### Milestone 5: Hackathon demo
- HF Spaces deployment
- mini-blog
- 2-minute video
- before/after traces
- reward curve
- held-out eval table
## 11. Engineering notes
- Keep scenario apps small: ideally 5-15 files each.
- Prefer deterministic tests over LLM judging.
- Hide final hidden test details from observations.
- Log enough trace data to debug failures but never leak hidden tests to the agent.
- Include intentionally public routes and allowed cross-role cases so the model does not learn “add auth everywhere.”
- The best demo is not just “agent finds bug,” but “agent learns not to break valid business behavior.”
## 12. Source notes and credibility
| Source | How it informs this architecture | Credibility |
|---|---|---:|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms why access control is the right security focus. | 10/10 |
| OWASP ASVS access-control guidance | Informs policy invariants and server-side authorization checks. | 9.5/10 |
| OpenEnv environment-building docs | Defines required models, reset/step/state, FastAPI server, Docker, and client. | 8.5/10 |
| OpenEnv quickstart/architecture docs | Informs WebSocket client/server design, typed EnvClient, and container isolation. | 8.5/10 |
| OpenEnv deployment docs | Informs HF Spaces deployment, endpoints, Docker workflow, and installable client package. | 8.5/10 |
| Hackathon judging criteria | Informs demo priorities: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv training example | Informs rollout function, decomposed reward functions, and Trackio logging pattern. | 8/10 |
| Kube SRE Gym README | Informs the closed-loop pattern: adversarial scenario design, curriculum mastery tracking, real tool interaction, verification, and artifact-driven storytelling. | 8/10 |
| DeepSeek-V4-Pro Hugging Face model card and encoding notes | Informs the default offline scenario-author config and the note that prompt handling should not assume a Jinja chat template. | 8/10 |