# CyberSecurity_OWASP — Project Brief

## 1. One-line summary

`CyberSecurity_OWASP` is an OpenEnv reinforcement-learning environment where a **single LLM agent learns the full defensive workflow for OWASP access-control bugs**: understand the intended authorization policy, discover a broken access-control path in a local synthetic app, patch the code, and prove that the fix blocks unauthorized access without breaking valid user flows.

## 2. Problem

Broken access control remains one of the most important web-application security risks because the correct behavior is usually **application-specific**. Generic scanners can find some missing checks, but they often lack enough context to answer the real engineering question:

> “Given this app’s policy, users, roles, tenants, routes, and data model, is this behavior intended or a security bug?”

Modern LLMs can read code, reason about tests, and propose patches, but they still struggle with:

- distinguishing intended public/feature behavior from accidental over-permission;
- following authorization logic across routes, middleware, ORM queries, tenants, roles, and ownership checks;
- validating that a patch fixes the bug without introducing regressions;
- avoiding reward hacking when tests are visible or too narrow;
- generalizing across app templates instead of memorizing one codebase.

`CyberSecurity_OWASP` turns this into a trainable environment.

## 3. What the environment trains

The environment trains **one agent**, not a separate red-team and blue-team pair. The same model must perform the entire secure-repair loop:

1. **Understand policy** — read the policy graph, user roles, route intent, tenant rules, and allowed operations.
2. **Discover evidence** — use safe local requests, logs, route metadata, and visible tests to identify the likely access-control failure.
3. **Patch** — edit application code, middleware, route guards, query filters, or policy mappings.
4. **Validate** — run public tests, policy checks, and regression tests.
5. **Submit** — the final answer is judged by deterministic hidden tests and reward logic.

## 4. Scope for MVP

The MVP focuses on **OWASP A01: Broken Access Control**, with ASVS-inspired access-control requirements. Initial scenario families:

1. Missing route-level authorization check.
2. Insecure direct object reference (IDOR) / object-ownership bug.
3. Cross-tenant data leakage.
4. Role confusion: user/admin/support/editor boundary error.
5. Client-side-only authorization assumption.
6. Query-filter omission in a list/search/export endpoint.
7. Over-broad update/delete permission.
8. A route that is intentionally public, so the agent must learn not to over-secure it.

Recommended MVP size: **8 scenario families × 3 app templates × 25 seeds = 600 trainable scenarios**, with separate held-out families and hidden seeds for evaluation.

## 5. Why this is useful

This environment targets a real gap between today’s scanners and useful defensive agents:

- **Scanners detect patterns.** This environment trains policy-aware reasoning.
- **Unit tests check known cases.** This environment includes hidden authorization invariants.
- **Static repair can overfit.** This environment forces the model to preserve valid business behavior.
- **One-app benchmarks are easy to memorize.** This environment prepares and caches many equivalent-but-different apps from policy graphs, templates, route shapes, schema names, and hidden test seeds, then keeps runtime `reset()` deterministic and fast (see the sketch below).
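To make the caching claim concrete, here is a minimal sketch of how a seeded scenario compiler could keep `reset()` deterministic. All names here (`ScenarioSpec`, `compile_scenario`, the cache layout) are illustrative assumptions, not the actual `scenarios/compiler.py` API:

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass
from pathlib import Path

# Hypothetical sketch: names and cache layout are assumptions, not the
# actual scenarios/compiler.py interface.

@dataclass(frozen=True)
class ScenarioSpec:
    family: str    # e.g. "idor_ownership"
    template: str  # e.g. "fastapi_basic"
    seed: int      # drives every randomized choice below

def scenario_id(spec: ScenarioSpec) -> str:
    """Stable ID: the same spec always maps to the same cached app."""
    blob = json.dumps(asdict(spec), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

def compile_scenario(spec: ScenarioSpec, cache_dir: Path) -> Path:
    """Materialize an app instance once; later resets reuse the cache."""
    out = cache_dir / scenario_id(spec)
    if out.exists():
        return out  # deterministic: reset() never re-randomizes
    rng = random.Random(spec.seed)  # all variation flows from the seed
    out.mkdir(parents=True)
    # Illustrative randomized surface: route and schema names vary per
    # seed, while the injected bug family stays fixed by spec.family.
    routes = rng.sample(["orders", "invoices", "reports", "tickets"], k=2)
    (out / "scenario.json").write_text(json.dumps(
        {"family": spec.family, "template": spec.template, "routes": routes},
        indent=2,
    ))
    return out
```

Because the scenario ID is a pure function of `(family, template, seed)`, `reset()` can resolve directly to a cached directory instead of re-running generation.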
The outcome is a model that becomes better at a practical DevSecOps workflow: safely reviewing and repairing authorization logic in small-to-medium web apps.

## 6. What success looks like

A successful submission should show **measurable reward improvement** and better held-out security behavior after RL training.

### Minimum success criteria

- Environment runs through the OpenEnv `reset`, `step`, and `state` APIs.
- Hosted on Hugging Face Spaces.
- Provides a minimal GRPO/TRL or Unsloth training script.
- Tracks training/eval metrics with Trackio or equivalent.
- Shows reward curves and before/after agent behavior.
- Uses deterministic reward as the primary reward source.
- Keeps hidden tests hidden from the agent.

### Target metrics

| Metric | MVP target |
|---|---:|
| Valid episode completion rate | ≥ 85% |
| Hidden authorization test pass rate | ≥ 65% after initial RL run |
| Regression preservation rate | ≥ 80% |
| Held-out scenario success lift vs base model | ≥ +15 percentage points |
| Reward-hacking incidents found in eval | 0 critical |
| Median patch size | ≤ 3 files changed |

## 7. Core design principle

The environment should reward **correct defensive repair**, not exploit creativity. The discovery stage exists only to help the agent gather enough local evidence to make a safe patch. The reward engine must never reward real-world misuse, data exfiltration, persistence, credential theft, or evasion behavior.

## 8. Deliverables for engineers

The initial implementation should produce:

```text
CyberSecurity_OWASP/
├── 00_PROJECT_BRIEF.md
├── 01_ARCHITECTURE.md
├── README.md
├── pyproject.toml
├── openenv.yaml
├── cybersecurity_owasp/
│   ├── __init__.py
│   ├── models.py
│   ├── client.py
│   ├── rewards.py
│   ├── scenarios/
│   │   ├── compiler.py
│   │   ├── policy_graph.py
│   │   ├── templates/
│   │   └── seeds/
│   ├── apps/
│   │   ├── fastapi_basic/
│   │   ├── express_basic/
│   │   └── django_basic/
│   ├── evals/
│   │   ├── public_tests.py
│   │   ├── hidden_invariants.py
│   │   └── heldout_eval.py
│   └── server/
│       ├── environment.py
│       ├── app.py
│       ├── requirements.txt
│       └── Dockerfile
├── training/
│   ├── train_grpo.py
│   ├── rollout.py
│   └── eval_before_after.py
└── outputs/
    ├── logs/
    ├── evals/
    └── reward_curves/
```

## 9. Source notes and credibility

| Source | How it informs this project | Credibility |
|---|---|---:|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms the current relevance of Broken Access Control as a top web-app risk. | 10/10 |
| OWASP ASVS | Provides security-control requirements that can be translated into policy invariants and hidden tests. | 9.5/10 |
| OpenEnv build/deploy docs | Defines the required OpenEnv structure: models, server, client, Docker, HF Spaces deployment. | 8.5/10 |
| Hackathon judging criteria | Aligns deliverables with scoring: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv GRPO example | Shows a practical pattern for environment rollouts, reward functions, and Trackio logging. | 8/10 |
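## 10. Appendix: illustrative sketches (non-normative)

Two short sketches ground the hidden-test and reward ideas above. Neither is the actual implementation; every route, fixture, credential, and weight is an assumption for illustration.

First, a hidden authorization invariant for the IDOR / object-ownership family (scenario family 2) might look like this pytest-style check. The `/auth/login` and `/orders` routes, the seeded credentials, and the port are hypothetical, not the real `evals/hidden_invariants.py` contents:

```python
# Hypothetical hidden invariant for the IDOR / object-ownership family.
# Routes, credentials, and port are assumptions for illustration.
import httpx

BASE = "http://127.0.0.1:8000"  # assumed local synthetic app

def login(username: str, password: str) -> dict:
    """Exchange seeded credentials for an auth header (assumed endpoint)."""
    r = httpx.post(f"{BASE}/auth/login",
                   json={"username": username, "password": password})
    r.raise_for_status()
    return {"Authorization": f"Bearer {r.json()['token']}"}

def test_non_owner_cannot_read_order():
    alice = login("alice", "alice-seeded-pw")
    bob = login("bob", "bob-seeded-pw")
    # Alice creates an order she owns.
    order = httpx.post(f"{BASE}/orders", headers=alice,
                       json={"item": "widget"}).json()
    # Invariant: a non-owner is denied (403) or cannot observe existence (404).
    r = httpx.get(f"{BASE}/orders/{order['id']}", headers=bob)
    assert r.status_code in (403, 404)
    # Regression guard: the owner's valid flow must still work.
    r = httpx.get(f"{BASE}/orders/{order['id']}", headers=alice)
    assert r.status_code == 200
```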
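Second, to make “deterministic reward as the primary reward source” concrete, a minimal composite-reward sketch. The weights, field names, and gating are illustrative assumptions, not the actual `rewards.py` logic:

```python
from dataclasses import dataclass

# Hypothetical sketch of a deterministic composite reward; weights and
# field names are assumptions, not the actual rewards.py logic.

@dataclass
class EpisodeResult:
    hidden_passed: int       # hidden authorization invariants that now pass
    hidden_total: int
    regressions_passed: int  # valid-flow tests that still pass
    regressions_total: int
    files_changed: int       # patch footprint

def reward(res: EpisodeResult) -> float:
    security = res.hidden_passed / max(res.hidden_total, 1)
    preserved = res.regressions_passed / max(res.regressions_total, 1)
    # Gate security credit on behavior preservation so "delete the route"
    # style patches cannot score well (anti-reward-hacking).
    score = 0.6 * security * preserved + 0.4 * preserved
    # Mild penalty nudges toward the "<= 3 files changed" target in section 6.
    penalty = 0.05 * max(res.files_changed - 3, 0)
    return max(score - penalty, 0.0)
```

Gating the security term on `preserved` directly encodes the section 7 principle: a patch earns security reward only if it also keeps valid user flows working.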