# CyberSecurity_OWASP — Architecture
## 1. System goal
`CyberSecurity_OWASP` is an OpenEnv environment for training a **single LLM policy** to perform a complete defensive authorization-repair workflow:
```text
Understand policy → discover local evidence → patch code → validate → submit
```
The environment is intentionally not a two-agent red-team/blue-team setup. The agent is one model with one trajectory. It must learn both sides of the defensive workflow: finding the policy violation and fixing it safely.
## 2. Final architecture diagram
Rendered asset:
![CyberSecurity_OWASP architecture](assets/architecture_diagram.svg)
Editable source: `assets/architecture_diagram.mmd`
```mermaid
flowchart TB
subgraph A[Async Scenario Authoring + Curriculum Factory]
A1[Config-guided LLM Scenario Author\nDeepSeek-V4-Pro default]
A2[ScenarioSpec JSON\npolicy, app family, bug target]
A3[Template + A01 Mutator\nFastAPI code variants]
A4[Deterministic Compiler\nexecutable bundle]
A5[Static + Dynamic Verifier\nsolvable, safe, hidden/visible tests]
A6[Difficulty Calibrator\nbaseline pass-rate buckets]
A7[Versioned Scenario Cache\nsplit, difficulty, family, hash]
A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
end
subgraph B[CyberSecurity_OWASP OpenEnv Runtime]
B1["reset(seed, difficulty, family_budget)\ncache lookup only"]
B2[Curriculum Sampler\nvalidated cache slice]
B3[Episode State Store\nphase, history, cache metadata, patch diff]
B4[Typed Action Tools\ninspect, request, patch, visible tests]
B5[Ephemeral App Sandbox\ncloned cached workspace + fixtures]
B6[Multi-layer Verifier\nvisible, hidden, oracle, regression]
B7[Deterministic Reward Engine\nstable components + penalties]
B8[Episode Artifact Logger\nJSONL transcript + verifier + diff]
B1 --> B2 --> B3 --> B4
B4 <--> B5
B5 --> B6 --> B7 --> B3
B3 --> B8
end
subgraph C[Single LLM Agent]
C1[Observation Parser]
C2[AuthZ + Code Reasoning]
C3[Discover → Diagnose → Patch → Test\none JSON action]
C1 --> C2 --> C3
end
subgraph D[Training + Evaluation + Demo]
D1[Parallel Rollouts\nfast cached reset]
D2[TRL GRPO + LoRA]
D3[Trackio Curves\nreward, pass rates, cache metrics]
D4[Held-out Family Eval\nbase vs trained model]
D5[Demo Artifacts\nbefore/after traces + JSONL]
D1 --> D2 --> D3 --> D4 --> D5
end
subgraph E[Feedback / Adaptation Loop]
E1[Episode logs + failures]
E2[Mastery Model\nweakness and plateau tracking]
E3[Cache Sampling Weights\nnew generation queue]
E1 --> E2 --> E3
end
A7 --> B1
C3 -->|typed action| B4
B4 -->|observation + reward + done| C1
B7 --> D1
D2 --> C1
B8 --> E1
E3 --> A1
```
## 3. Component responsibilities
### 3.1 Async Scenario Authoring Plane
Scenario generation is offline, asynchronous, validated, and cached. Runtime `reset()` must not call an LLM and must not compile a fresh app during Modal smoke, training, or evaluation runs.
The scenario authoring plane outputs complete executable bundles:
- `scenario.json`;
- `app_source/`;
- `policy_graph.json`;
- `visible_tests.py`;
- `hidden_tests.py`;
- `oracle_tests.py`;
- `expected_exploit_trace.json`;
- `reward_config.json`;
- `metadata.json`.
The default scenario/curriculum author is configured in `configs/scenario_authoring.small.json`:
```json
{
  "provider": "huggingface",
  "model_id": "deepseek-ai/DeepSeek-V4-Pro",
  "thinking_mode": "thinking",
  "reasoning_effort": "high",
  "temperature": 1.0,
  "top_p": 1.0
}
```
DeepSeek-V4-Pro is only used for offline scenario/curriculum authoring. It is not the RL policy model unless explicitly selected for training.
The compiler remains the main anti-overfitting mechanism. It should vary:
- route names;
- schema names;
- ORM query structure;
- framework template;
- role names;
- tenant IDs;
- object ownership patterns;
- file layout;
- visible test coverage;
- hidden invariant seeds.
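For instance, a per-seed surface-variation plan could look roughly like the sketch below; the axes and value pools shown here are illustrative, and the real compiler varies all of the dimensions listed above.
```python
import random

def surface_variation(seed: int) -> dict:
    """Illustrative per-seed variation plan; the real compiler varies every axis listed above."""
    rng = random.Random(seed)
    return {
        "route_prefix": rng.choice(["/invoices", "/billing/docs", "/statements"]),
        "role_names": rng.sample(["user", "member", "support", "agent", "admin", "owner"], k=3),
        "tenant_ids": [f"tenant_{rng.randrange(10_000):05d}" for _ in range(2)],
        "schema_name": rng.choice(["Invoice", "BillingDocument", "Statement"]),
    }
```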
The runtime treats curriculum and cache sampling as first-class scenario inputs:
- `CurriculumController` tracks target weakness mastery, recent reward trend, failure counts, and difficulty tier.
- Offline cache prep uses the configured LLM author, deterministic compiler, verifier, and baseline-agent difficulty calibrator.
- `ScenarioCache` stores validated bundles by split, difficulty, family, generator version, verifier version, and scenario hash.
- Hidden-eval episodes hold out entire scenario families, not just seeds; the evaluation-only family label is kept in state metadata rather than exposed in observations.
Cache keys include:
```text
difficulty_level
authz_bug_type
app_family
framework
policy_shape
tenant_model
exploit_depth
patch_scope
regression_risk
generator_version
verifier_version
scenario_hash
```
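For illustration, a lookup key for the versioned cache could be derived from these fields roughly as follows; the function name, key layout, and the use of `split` as a path prefix are assumptions, not the actual cache implementation.
```python
import hashlib
import json

CACHE_KEY_FIELDS = [
    "difficulty_level", "authz_bug_type", "app_family", "framework",
    "policy_shape", "tenant_model", "exploit_depth", "patch_scope",
    "regression_risk", "generator_version", "verifier_version",
]

def scenario_cache_key(meta: dict) -> str:
    """Hypothetical key derivation: canonical JSON of the fields above, then a short hash."""
    canonical = json.dumps({k: meta.get(k) for k in CACHE_KEY_FIELDS}, sort_keys=True)
    scenario_hash = hashlib.sha256(canonical.encode()).hexdigest()[:16]
    return f"{meta['split']}/{meta['app_family']}/{meta['difficulty_level']}/{scenario_hash}"
```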
### 3.2 Policy Graph Generator
The policy graph is the ground truth for intended behavior.
Example internal representation:
```yaml
resources:
  invoice:
    owner_field: owner_user_id
    tenant_field: tenant_id
roles:
  user:
    can:
      - read:invoice where owner_user_id == actor.user_id
      - update:invoice where owner_user_id == actor.user_id and status != locked
  support:
    can:
      - read:invoice where tenant_id == actor.tenant_id
  admin:
    can:
      - read:any_invoice where tenant_id == actor.tenant_id
      - update:any_invoice where tenant_id == actor.tenant_id
public_routes:
  - GET /health
  - GET /pricing
forbidden:
  - cross_tenant_read
  - cross_tenant_update
  - user_reads_other_user_invoice
```
The policy graph prevents false rewards for over-securing intentionally public or intentionally allowed routes.
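To make the rule semantics concrete, here is a rough sketch of how a verifier might evaluate the read rules above against a single access attempt; the helper and field names are illustrative, and the real oracle would parse the `where` clauses generically.
```python
def may_read_invoice(policy: dict, actor: dict, invoice: dict) -> bool:
    """Illustrative oracle check against the policy graph shown above."""
    res = policy["resources"]["invoice"]
    if actor["role"] == "user":
        # read:invoice where owner_user_id == actor.user_id
        return invoice[res["owner_field"]] == actor["user_id"]
    if actor["role"] in ("support", "admin"):
        # read is scoped to the actor's own tenant
        return invoice[res["tenant_field"]] == actor["tenant_id"]
    return False
```
Cross-tenant reads fall through to `False`, which corresponds to the `cross_tenant_read` entry in the `forbidden` list.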
### 3.3 Bug Injector
The bug injector creates controlled, defensive lab scenarios. It should only generate bugs inside local synthetic apps.
MVP bug classes:
| Bug class | Example failure mode | Expected fix type |
|---|---|---|
| Missing route guard | Protected endpoint lacks authorization middleware | Add policy check/middleware |
| IDOR / ownership bug | User can access another user’s object by changing ID | Add owner check in query/policy |
| Tenant leak | Tenant A can list Tenant B records | Add tenant filter |
| Role confusion | Support/editor/admin boundary is wrong | Correct role-to-permission mapping |
| Client-side-only auth | Server trusts UI to hide forbidden action | Enforce server-side authorization |
| Query omission | List/export/search endpoint lacks auth filter | Filter query by actor permissions |
| Over-broad mutation | User can update/delete forbidden object | Add mutation permission check |
| Public route decoy | Agent may wrongly lock down an intentionally public endpoint | Preserve intended public behavior |
### 3.4 OpenEnv Server
The OpenEnv server should implement the standard lifecycle:
- `reset()` — initialize a fresh episode from a cached scenario bundle.
- `step(action)` — execute one typed action and return observation, reward, and done.
- `state()` — expose episode metadata for debugging and evaluation.
Recommended package/class names:
```text
Repo name: CyberSecurity_OWASP
Python package: cybersecurity_owasp
Client class: CyberSecurityOWASPEnv
Action class: CyberSecurityOWASPAction
Observation: CyberSecurityOWASPObservation
State: CyberSecurityOWASPState
```
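A minimal episode-loop sketch using the recommended names, assuming an OpenEnv-style client; the constructor and result fields follow the common OpenEnv client pattern and may differ in the final implementation.
```python
from cybersecurity_owasp import CyberSecurityOWASPAction, CyberSecurityOWASPEnv

# from_docker_image follows the usual OpenEnv client pattern; the exact helper may differ.
env = CyberSecurityOWASPEnv.from_docker_image("cybersecurity-owasp:latest")
result = env.reset()
for _ in range(40):  # respect max_steps
    # A trained policy would choose the next tool call from result.observation;
    # this stub only inspects the policy graph at every step.
    action = CyberSecurityOWASPAction(tool_name="inspect_policy_graph", arguments={})
    result = env.step(action)
    if result.done:
        break
env.close()
```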
### 3.5 Tool API
The agent should interact through typed actions. Keep the interface small enough for RL but expressive enough for realistic repair.
```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class CyberSecurityOWASPAction(Action):  # Action is the OpenEnv base action class
    tool_name: Literal[
        "inspect_policy_graph",
        "list_routes",
        "read_openapi",
        "read_file",
        "search_code",
        "send_local_request",
        "compare_identities",
        "submit_diagnosis",
        "patch_file",
        "run_visible_tests",
        "submit_fix",
        "noop",
    ]
    arguments: dict
```
Recommended actions:
| Action | Purpose | Safety boundary |
|---|---|---|
| `inspect_policy_graph` | Read intended authorization rules. | Only synthetic policy. |
| `list_routes` | See local app route map. | No internet target. |
| `read_file` | Inspect selected source file. | Sandbox allowlist only. |
| `send_local_request` | Validate behavior against local app. | Local generated app only. |
| `submit_diagnosis` | Record bug class, route, policy rule, evidence trace IDs, and fix plan. | Does not reveal hidden tests. |
| `run_visible_tests` | Run visible tests. | No hidden test disclosure. |
| `patch_file` | Modify source through unified diff or full content. | Patch size and file allowlist limits. |
| `submit_fix` | End episode and trigger hidden eval. | Final hidden score only, no leaked test details. |
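For example, a `patch_file` action might look like the following; the argument keys (`path`, `mode`, `patch`) and the helper referenced in the diff are illustrative, and the real schema is defined by the server-side tool handler.
```python
action = CyberSecurityOWASPAction(
    tool_name="patch_file",
    arguments={
        # Illustrative keys only; the real schema is defined by the tool handler.
        "path": "app/routers/invoices.py",
        "mode": "unified_diff",
        "patch": (
            "--- a/app/routers/invoices.py\n"
            "+++ b/app/routers/invoices.py\n"
            "@@ ... @@\n"
            "+    require_owner_or_tenant_admin(actor, invoice)\n"
        ),
    },
)
```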
### 3.6 Observation schema
Observations should be compact and structured.
```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class CyberSecurityOWASPObservation(Observation):  # Observation is the OpenEnv base class
    phase: Literal["discover", "patch", "done"]
    message: str
    task_brief: str
    visible_policy_hint: dict
    workspace_summary: dict
    available_actions: list[str]
    last_tool_result: str
    visible_test_result: str | None = None
    reward_breakdown: dict[str, float] = field(default_factory=dict)
    done_reason: str | None = None
```
The policy hint is deliberately partial. It may include product rules, fixture aliases, route summaries, and public-route intent, but it must not expose the hidden oracle matrix, hidden test bodies, injected bug labels, or held-out family labels.
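As an illustration, a visible hint could look like the dict below: product rules, fixture aliases, route summaries, and public-route intent are present, while oracle data and bug labels stay server-side. All values here are made up.
```python
visible_policy_hint = {
    # Made-up illustrative content; generated per scenario in practice.
    "product_rules": ["Users may only read and update their own invoices."],
    "public_routes": ["GET /health", "GET /pricing"],
    "route_summaries": {"GET /invoices/{invoice_id}": "Fetch a single invoice"},
    "fixture_aliases": {"alice": "tenant_a_user", "bob": "tenant_b_user"},
}
```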
### 3.7 State schema
State should support debugging and training analytics.
```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class CyberSecurityOWASPState(State):  # State is the OpenEnv base class
    episode_id: str
    task_id: str
    split: Literal["train", "validation", "hidden_eval"]
    step_count: int = 0
    max_steps: int = 40
    difficulty_tier: str = "warmup"
    scenario_family: str = ""
    template_id: str = "fastapi_basic"
    target_weakness: str = ""
    curriculum_snapshot: dict = field(default_factory=dict)
    verification_summary: dict = field(default_factory=dict)
    patch_diff: str = ""
    episode_artifact_path: str | None = None
    accumulated_reward: float = 0.0
```
## 4. Episode lifecycle
```text
1. reset()
   - curriculum selects difficulty tier and target weakness
   - runtime samples or directly loads a validated cached bundle
   - clone cached `app_source/` into an isolated ephemeral workspace
   - initialize fixture state, cache metadata, and sandbox handles
   - return initial observation
2. agent loop
   - inspect policy/routes/files
   - send local requests only inside sandbox
   - run visible tests
   - apply one or more patches
   - rerun visible tests
3. submit_fix
   - freeze patch
   - run visible tests
   - run hidden authorization invariants
   - run policy-oracle matrix
   - run regression and public-route preservation tests
   - compute deterministic reward
   - return final observation, reward, done=True
4. logging
   - append JSONL artifact with scenario metadata, action trace, observations, patch diff, verifier result, and reward components
   - feed terminal success/failure back into curriculum mastery tracking
   - send metrics to Trackio during training/eval
```
`CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` is mandatory for Modal smoke, training, and evaluation. In that mode a missing cache bundle is a hard failure. Local development may use `fallback`, which compiles deterministically on a miss, but that path is not allowed for meaningful training.
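A minimal sketch of how `reset()` could enforce this, assuming hypothetical `cache.get` / `cache.compile_and_store` helpers; only the environment-variable name and the require/fallback semantics come from this document.
```python
import os

def load_bundle(cache, key):
    """Hypothetical reset-time guard for the scenario cache mode."""
    mode = os.environ.get("CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE", "require")
    bundle = cache.get(key)
    if bundle is not None:
        return bundle
    if mode == "require":
        # Modal smoke, training, and evaluation: a cache miss is a hard failure.
        raise RuntimeError(f"scenario cache miss for {key!r} with cache mode 'require'")
    # Local-development fallback: deterministic compile on miss, never an LLM call.
    return cache.compile_and_store(key)
```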
## 5. Reward design
The reward should be deterministic, decomposed, and resistant to reward hacking. The maximum terminal reward remains **15.0**, and a high reward requires deterministic verifier success, not explanation quality.
Stable reward keys:
```python
{
"discovery": 0.0,
"security": 0.0,
"regression": 0.0,
"public_routes": 0.0,
"patch_quality": 0.0,
"visible_tests": 0.0,
"safety": 0.0,
"anti_cheat": 0.0,
"terminal_total": 0.0,
"progressive": 0.0,
"step_penalty": 0.0,
"speed_bonus": 0.0,
"token_penalty": 0.0,
"behavior_penalty": 0.0,
"train_total": 0.0,
"total": 0.0,
}
```
Sparse evaluation uses `terminal_total` as `total`. Dense training uses
`terminal_total + shaping_weight * progressive + efficiency - penalties` as `total`,
with all reward values and short descriptions configured in
`training/configs/grpo_small.yaml`.
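Roughly, that composition could look like the sketch below; the `shaping_weight` value and the sign conventions for the penalty keys are assumptions that would in practice come from `training/configs/grpo_small.yaml`.
```python
def compose_total(r: dict[str, float], shaping_weight: float = 0.3, dense: bool = True) -> float:
    """Illustrative composition of `total` from the stable keys above.

    Assumes penalty keys hold positive magnitudes; actual values and weights
    live in training/configs/grpo_small.yaml.
    """
    if not dense:
        return r["terminal_total"]
    efficiency = r["speed_bonus"] - r["step_penalty"] - r["token_penalty"]
    return r["terminal_total"] + shaping_weight * r["progressive"] + efficiency - r["behavior_penalty"]
```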
### Reward components
| Component | Purpose |
|---|---|
| `discovery` | Valid local evidence and correct violated policy rule. |
| `security` | Hidden exploit blocking plus policy-oracle matrix pass. |
| `regression` | Legitimate owner/admin/support flows still work. |
| `public_routes` | Intentionally public routes remain public. |
| `patch_quality` | Localized policy-aligned patch and efficient phase order. |
| `visible_tests` | Visible tests pass and app still boots. |
| `safety` | Penalizes invalid action patterns, unsafe targets, timeouts, and deny-all behavior. |
| `anti_cheat` | Penalizes hidden-file probing, hardcoded fixture IDs, and test/oracle tampering. |
### Penalties
| Penalty | Trigger |
|---|---|
| public route penalty | Breaks a route intentionally marked public. |
| anti-cheat penalty | Deletes or probes tests, hidden files, reward code, oracle data, or host paths. |
| hardcoding penalty | Hardcodes seed-specific IDs, users, tenants, or hidden assumptions. |
| safety penalty | Over-broad denial, malformed/invalid actions, repeated failed actions, or external target attempts. |
The LLM judge, if used at all, should only annotate trace quality for analysis. It must not decide security-critical reward.
## 6. Hidden tests and anti-overfitting
Hidden tests are necessary because visible tests can be gamed or memorized. They should test policy invariants rather than exact implementation details.
Use **4 anti-overfitting layers**:
1. **Seed diversity** — route names, user IDs, tenant IDs, object names, and schemas change every episode.
2. **Template diversity** — same policy bug appears in different frameworks and file layouts.
3. **Hidden invariant tests** — final reward uses unseen authorization cases.
4. **Held-out eval split** — at least 20% of scenario families/seeds are never used in training.
Recommended split:
```text
Train: 70%
Validation: 10%
Held-out: 20%
```
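One way to keep the split stable across cache regeneration is to bucket each scenario family by hash, for example (proportions from the block above; the helper itself is illustrative):
```python
import hashlib

def assign_split(scenario_family: str) -> str:
    """Deterministically bucket a scenario family into the 70/10/20 split above."""
    bucket = int(hashlib.sha256(scenario_family.encode()).hexdigest(), 16) % 100
    if bucket < 70:
        return "train"
    if bucket < 80:
        return "validation"
    return "hidden_eval"
```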
## 7. Evaluation plan
Run before/after evaluation on the same held-out suite.
### Metrics
| Metric | Meaning |
|---|---|
| `episode_success_rate` | Public + hidden + regression tests pass. |
| `hidden_authz_pass_rate` | Security-critical hidden checks pass. |
| `regression_pass_rate` | Normal valid behavior remains intact. |
| `oversecure_rate` | Agent blocks intended legitimate/public behavior. |
| `patch_compile_rate` | Patch applies and app still runs. |
| `median_steps_to_submit` | Efficiency of the repair workflow. |
| `median_files_changed` | Patch focus/minimality. |
| `reward_hacking_rate` | Attempts to delete tests, hardcode fixtures, or bypass eval. |
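As an illustration, two of these metrics could be aggregated from the JSONL episode artifacts like this; the record field names are assumptions about the logger output, not a fixed schema.
```python
import json
from pathlib import Path

def aggregate_metrics(jsonl_path: str) -> dict:
    """Illustrative aggregation of two report metrics from episode artifacts."""
    episodes = [json.loads(line) for line in Path(jsonl_path).read_text().splitlines() if line]
    n = max(len(episodes), 1)
    success = sum(1 for e in episodes if e.get("verifier", {}).get("all_passed"))
    oversecure = sum(1 for e in episodes if e.get("verifier", {}).get("public_route_broken"))
    return {"episode_success_rate": success / n, "oversecure_rate": oversecure / n}
```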
### Eval table template
| Model | Split | Success | Hidden authz | Regression | Oversecure | Median steps | Median files changed |
|---|---|---:|---:|---:|---:|---:|---:|
| Base model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
| RL-trained model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
## 8. Training flow
Rendered asset:
![CyberSecurity_OWASP RL training flow](assets/env_rl_training_flow_diagram.svg)
Editable source: `assets/env_rl_training_flow_diagram.mmd`
```text
1. Build CyberSecurity_OWASP OpenEnv server.
2. Prepare validated scenario cache once per generator/verifier version.
3. Run baseline eval with cached validation/held-out bundles.
4. Train with GRPO/TRL or Unsloth using cached rollout episodes.
5. Log reward components, pass rates, reset latency, and cache hit metrics to Trackio.
6. Run held-out eval every N training steps.
7. Inspect failure clusters and cache sampling weights.
8. Refresh only 5-10% of scenarios per epoch when new weak spots are found.
9. Produce final demo: before/after trace + reward curve + held-out eval table.
```
Recommended initial training setup (Modal-first):
```text
Model: unsloth/gemma-4-E2B-it
Algorithm: GRPO via TRL or Unsloth-compatible loop
Dataset prompt: repeated task instruction with randomized scenario IDs
Max steps per episode: 30
Rollouts per prompt: 2-4
Logging: Trackio
Primary eval: held-out deterministic test pass rate
Scenario cache mode: require
Scenario cache volume: CyberSecurity_OWASP-scenario-cache
```
Training execution is expected to run on Modal (persistent or ephemeral) rather than locally.
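Metric logging can follow Trackio's wandb-style `init`/`log`/`finish` calls; the metric names and values below are placeholders, not a fixed schema.
```python
import trackio

trackio.init(project="cybersecurity-owasp-grpo")
# Inside the rollout/eval loop:
trackio.log({
    "reward/terminal_total": 9.5,          # placeholder value
    "eval/hidden_authz_pass_rate": 0.40,   # placeholder value
    "env/reset_latency_ms": 35.0,
    "env/cache_hit_rate": 1.0,
})
trackio.finish()
```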
## 9. Deployment architecture
The environment should be runnable in 3 modes:
| Mode | Purpose |
|---|---|
| Local Uvicorn | Fast engineer iteration. |
| Docker | Reproducible local training/eval. |
| Hugging Face Spaces | Public hackathon demo and OpenEnv-compliant hosting. |
Expected endpoints:
```text
/ws OpenEnv client session
/health health check
/reset debug reset
/step debug step
/state debug state
/docs FastAPI docs
/web optional web UI
```
## 10. Implementation milestones
### Milestone 1 — Skeleton environment
- `models.py`
- `client.py`
- `server/environment.py`
- `server/app.py`
- `server/Dockerfile`
- `openenv.yaml`
- health check
- one hand-written scenario
### Milestone 2 — Scenario compiler
- policy graph format
- app template renderer
- bug injector
- DB fixture generator
- visible and hidden test generator
### Milestone 3 — Reward engine
- public test score
- hidden invariant score
- regression score
- patch minimality score
- safety/reward-hacking penalties
- reward component logging
### Milestone 4 — Training script
- rollout loop
- GRPO/TRL or Unsloth training script
- Trackio logging
- checkpoint save/push
- baseline and post-training eval
### Milestone 5 — Hackathon demo
- HF Spaces deployment
- mini-blog
- 2-minute video
- before/after traces
- reward curve
- held-out eval table
## 11. Engineering notes
- Keep scenario apps small: ideally 5-15 files each.
- Prefer deterministic tests over LLM judging.
- Hide final hidden test details from observations.
- Log enough trace data to debug failures but never leak hidden tests to the agent.
- Include intentionally public routes and allowed cross-role cases so the model does not learn “add auth everywhere.”
- The best demo is not just “agent finds bug,” but “agent learns not to break valid business behavior.”
## 12. Source notes and credibility
| Source | How it informs this architecture | Credibility |
|---|---|---:|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms why access control is the right security focus. | 10/10 |
| OWASP ASVS access-control guidance | Informs policy invariants and server-side authorization checks. | 9.5/10 |
| OpenEnv environment-building docs | Defines required models, reset/step/state, FastAPI server, Docker, and client. | 8.5/10 |
| OpenEnv quickstart/architecture docs | Informs WebSocket client/server design, typed EnvClient, and container isolation. | 8.5/10 |
| OpenEnv deployment docs | Informs HF Spaces deployment, endpoints, Docker workflow, and installable client package. | 8.5/10 |
| Hackathon judging criteria | Informs demo priorities: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv training example | Informs rollout function, decomposed reward functions, and Trackio logging pattern. | 8/10 |
| Kube SRE Gym README | Informs the closed-loop pattern: adversarial scenario design, curriculum mastery tracking, real tool interaction, verification, and artifact-driven storytelling. | 8/10 |
| DeepSeek-V4-Pro Hugging Face model card and encoding notes | Informs the default offline scenario-author config and the note that prompt handling should not assume a Jinja chat template. | 8/10 |