# CyberSecurity_OWASP β€” Project Brief
## 1. One-line summary
`CyberSecurity_OWASP` is an OpenEnv reinforcement-learning environment where a **single LLM agent learns the full defensive workflow for OWASP access-control bugs**: understand the intended authorization policy, discover a broken access-control path in a local synthetic app, patch the code, and prove that the fix blocks unauthorized access without breaking valid user flows.
## 2. Problem
Broken access control remains one of the most important web-application security risks because the correct behavior is usually **application-specific**. Generic scanners can find some missing checks, but they often lack enough context to answer the real engineering question:
> β€œGiven this app’s policy, users, roles, tenants, routes, and data model, is this behavior intended or a security bug?”
Modern LLMs can read code, reason about tests, and propose patches, but they still struggle with:
- distinguishing intended public/feature behavior from accidental over-permission;
- following authorization logic across routes, middleware, ORM queries, tenants, roles, and ownership checks;
- validating that a patch fixes the bug without introducing regressions;
- avoiding reward hacking when tests are visible or too narrow;
- generalizing across app templates instead of memorizing one codebase.
`CyberSecurity_OWASP` turns this into a trainable environment.
## 3. What the environment trains
The environment trains **one agent**, not a separate red-team and blue-team pair. The same model must perform the entire secure-repair loop:
1. **Understand policy** β€” read the policy graph, user roles, route intent, tenant rules, and allowed operations.
2. **Discover evidence** β€” use safe local requests, logs, route metadata, and visible tests to identify the likely access-control failure.
3. **Patch** β€” edit application code, middleware, route guards, query filters, or policy mappings.
4. **Validate** β€” run public tests, policy checks, and regression tests.
5. **Submit** β€” final answer is judged by deterministic hidden tests and reward logic.
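The five-phase loop above can be sketched as a toy OpenEnv-style episode. The `ToyRepairEnv` and `run_episode` names, the `StepResult` shape, and the terminal-only reward are illustrative assumptions, not the real `CyberSecurity_OWASP` API.

```python
# A minimal sketch of the five-phase secure-repair episode loop.
# The env interface below is an illustrative stand-in for the OpenEnv
# reset()/step() contract, not the real CyberSecurity_OWASP API.

from dataclasses import dataclass

PHASES = ["understand", "discover", "patch", "validate", "submit"]

@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool

class ToyRepairEnv:
    """Walks the agent through the five phases; rewards only at submit."""
    def reset(self) -> str:
        self.phase = 0
        return PHASES[self.phase]          # initial observation: policy info

    def step(self, action: str) -> StepResult:
        self.phase += 1
        done = self.phase == len(PHASES)
        reward = 1.0 if done else 0.0      # deterministic terminal reward
        obs = "episode-over" if done else PHASES[self.phase]
        return StepResult(obs, reward, done)

def run_episode(env) -> float:
    obs, total, done = env.reset(), 0.0, False
    while not done:
        result = env.step(f"do-{obs}")     # a real agent would choose actions
        total += result.reward
        obs, done = result.observation, result.done
    return total
```

The key property the sketch preserves is that reward arrives only at submission, which is where the hidden tests and deterministic reward logic are applied.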
## 4. Scope for MVP
The MVP should focus on **OWASP A01: Broken Access Control** with ASVS-inspired access-control requirements.
Initial scenario families:
1. Missing route-level authorization check.
2. Insecure direct object reference / object ownership bug.
3. Cross-tenant data leakage.
4. Role confusion: user/admin/support/editor boundary error.
5. Client-side-only authorization assumption.
6. Query filter omission in list/search/export endpoint.
7. Over-broad update/delete permission.
8. Intentionally public feature route, which the agent must not over-secure.
Recommended MVP size: **8 scenario families Γ— 3 app templates Γ— 25 seeds = 600 trainable scenarios**, with separate held-out families and hidden seeds for evaluation.
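The MVP grid can be enumerated deterministically. The family slugs below are illustrative assumptions; the template names come from the app directories listed later in this brief.

```python
# Sketch of deterministic scenario enumeration for the MVP grid:
# 8 families x 3 templates x 25 seeds = 600 trainable scenarios.
# Family slug names are illustrative, not a fixed schema.

from itertools import product

FAMILIES = [
    "missing_route_auth", "idor", "cross_tenant_leak", "role_confusion",
    "client_side_only", "query_filter_omission", "overbroad_write",
    "intentionally_public",
]
TEMPLATES = ["fastapi_basic", "express_basic", "django_basic"]
SEEDS = range(25)

def scenario_ids() -> list[str]:
    """Stable, sorted-by-construction scenario identifiers."""
    return [f"{fam}/{tpl}/seed{seed:02d}"
            for fam, tpl, seed in product(FAMILIES, TEMPLATES, SEEDS)]
```

A stable ID scheme like this makes it straightforward to hold out entire families or seed ranges for evaluation.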
## 5. Why this is useful
This environment is useful because it targets a real gap between today’s scanners and useful defensive agents:
- **Scanners detect patterns.** This environment trains policy-aware reasoning.
- **Unit tests check known cases.** This environment includes hidden authorization invariants.
- **Static repair can overfit.** This environment forces the model to preserve valid business behavior.
- **One-app benchmarks are easy to memorize.** This environment generates and caches many equivalent-but-different apps by varying policy graphs, templates, route shapes, schema names, and hidden test seeds, so runtime `reset()` stays deterministic and fast.
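The compile-once / reset-fast idea in the last bullet can be sketched as follows. The hashing scheme, cache layout, and the seed-derived variation field are all assumptions for illustration.

```python
# Sketch of compile-once / reset-fast scenario caching: variants are
# derived deterministically from (template, family, seed) ahead of time,
# and reset() only loads the cached artifact. The hash scheme and cache
# layout here are illustrative assumptions.

import hashlib

_CACHE: dict[str, dict] = {}

def compile_scenario(template: str, family: str, seed: int) -> str:
    """Deterministically derive a scenario variant and cache it; returns its key."""
    key = hashlib.sha256(f"{template}:{family}:{seed}".encode()).hexdigest()[:12]
    salt = int(key[:8], 16)
    _CACHE[key] = {
        "template": template,
        "family": family,
        # seed-derived surface variation (schema names, route shapes, ...)
        "schema_suffix": f"v{salt % 97}",
    }
    return key

def reset(key: str) -> dict:
    """Runtime reset: a cache lookup, no generation work."""
    return _CACHE[key]
```

Because the key is a pure function of `(template, family, seed)`, two calls with the same inputs always resolve to the same cached app, which is what keeps `reset()` deterministic.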
The outcome is a model that becomes better at a practical DevSecOps workflow: safely reviewing and repairing authorization logic in small-to-medium web apps.
## 6. What success looks like
A successful submission should show **measurable reward improvement** and better held-out security behavior after RL training.
### Minimum success criteria
- Environment runs through OpenEnv `reset`, `step`, and `state` APIs.
- Hosted on Hugging Face Spaces.
- Provides a minimal GRPO/TRL or Unsloth training script.
- Tracks training/eval metrics with Trackio or equivalent.
- Shows reward curves and before/after agent behavior.
- Uses deterministic reward as the primary reward source.
- Keeps hidden tests hidden from the agent.
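One way to make the deterministic reward resist reward hacking is to gate security credit on regression preservation. The multiplicative form below is an illustrative assumption, not the shipped reward logic.

```python
# Sketch of a deterministic episode reward: hidden authorization tests
# gated by regression tests, so a patch that blocks attacks but breaks
# valid user flows earns little. The multiplicative form is an assumption.

def episode_reward(hidden_passed: int, hidden_total: int,
                   regr_passed: int, regr_total: int) -> float:
    security = hidden_passed / hidden_total       # hidden invariant pass rate
    preservation = regr_passed / regr_total       # valid-flow pass rate
    # Multiplicative gate: over-securing (e.g. locking down an
    # intentionally public route) zeroes out credit via regressions.
    return security * preservation
```

The gate directly penalizes the over-securing failure mode in scenario family 8: a patch that passes every hidden test but fails the public-route regressions scores near zero.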
### Target metrics
| Metric | MVP target |
|---|---:|
| Valid episode completion rate | β‰₯ 85% |
| Hidden authorization test pass rate | β‰₯ 65% after initial RL run |
| Regression preservation rate | β‰₯ 80% |
| Held-out scenario success lift vs base model | β‰₯ +15 percentage points |
| Reward-hacking incidents found in eval | 0 critical |
| Median patch size | ≀ 3 files changed |
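The held-out lift row in the table is a simple success-rate difference in percentage points; a small sketch, assuming success counts over a fixed held-out set:

```python
# Sketch of the held-out lift metric: trained-vs-base success-rate
# difference on hidden scenarios, in percentage points.

def heldout_lift_pp(base_successes: int, trained_successes: int,
                    n_scenarios: int) -> float:
    base = base_successes / n_scenarios
    trained = trained_successes / n_scenarios
    return round(100 * (trained - base), 1)
```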
## 7. Core design principle
The environment should reward **correct defensive repair**, not exploit creativity. The discovery stage exists only to help the agent gather enough local evidence to make a safe patch. The reward engine must never reward real-world misuse, data exfiltration, persistence, credential theft, or evasion behavior.
## 8. Deliverables for engineers
Initial implementation should produce:
```text
CyberSecurity_OWASP/
β”œβ”€β”€ 00_PROJECT_BRIEF.md
β”œβ”€β”€ 01_ARCHITECTURE.md
β”œβ”€β”€ README.md
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ cybersecurity_owasp/
β”‚ β”œβ”€β”€ __init__.py
β”‚ β”œβ”€β”€ models.py
β”‚ β”œβ”€β”€ client.py
β”‚ β”œβ”€β”€ rewards.py
β”‚ β”œβ”€β”€ scenarios/
β”‚ β”‚ β”œβ”€β”€ compiler.py
β”‚ β”‚ β”œβ”€β”€ policy_graph.py
β”‚ β”‚ β”œβ”€β”€ templates/
β”‚ β”‚ └── seeds/
β”‚ β”œβ”€β”€ apps/
β”‚ β”‚ β”œβ”€β”€ fastapi_basic/
β”‚ β”‚ β”œβ”€β”€ express_basic/
β”‚ β”‚ └── django_basic/
β”‚ β”œβ”€β”€ evals/
β”‚ β”‚ β”œβ”€β”€ public_tests.py
β”‚ β”‚ β”œβ”€β”€ hidden_invariants.py
β”‚ β”‚ └── heldout_eval.py
β”‚ └── server/
β”‚ β”œβ”€β”€ environment.py
β”‚ β”œβ”€β”€ app.py
β”‚ β”œβ”€β”€ requirements.txt
β”‚ └── Dockerfile
β”œβ”€β”€ training/
β”‚ β”œβ”€β”€ train_grpo.py
β”‚ β”œβ”€β”€ rollout.py
β”‚ └── eval_before_after.py
└── outputs/
β”œβ”€β”€ logs/
β”œβ”€β”€ evals/
└── reward_curves/
```
## 9. Source notes and credibility
| Source | How it informs this project | Credibility |
|---|---|---:|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms current relevance of Broken Access Control as a top web-app risk. | 10/10 |
| OWASP ASVS | Provides security-control requirements that can be translated into policy invariants and hidden tests. | 9.5/10 |
| OpenEnv build/deploy docs | Defines the required OpenEnv structure: models, server, client, Docker, HF Spaces deployment. | 8.5/10 |
| Hackathon judging criteria | Aligns deliverables with scoring: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv GRPO example | Shows a practical pattern for environment rollouts, reward functions, and Trackio logging. | 8/10 |