# CyberSecurity_OWASP – Project Brief

## 1. One-line summary

`CyberSecurity_OWASP` is an OpenEnv reinforcement-learning environment where a **single LLM agent learns the full defensive workflow for OWASP access-control bugs**: understand the intended authorization policy, discover a broken access-control path in a local synthetic app, patch the code, and prove that the fix blocks unauthorized access without breaking valid user flows.

## 2. Problem

Broken access control remains one of the most important web-application security risks because the correct behavior is usually **application-specific**. Generic scanners can find some missing checks, but they often lack enough context to answer the real engineering question:

> “Given this app’s policy, users, roles, tenants, routes, and data model, is this behavior intended or a security bug?”

Modern LLMs can read code, reason about tests, and propose patches, but they still struggle with:

- distinguishing intended public/feature behavior from accidental over-permission;
- following authorization logic across routes, middleware, ORM queries, tenants, roles, and ownership checks;
- validating that a patch fixes the bug without introducing regressions;
- avoiding reward hacking when tests are visible or too narrow;
- generalizing across app templates instead of memorizing one codebase.

`CyberSecurity_OWASP` turns this into a trainable environment.
## 3. What the environment trains

The environment trains **one agent**, not a separate red-team and blue-team pair. The same model must perform the entire secure-repair loop:

1. **Understand policy** – read the policy graph, user roles, route intent, tenant rules, and allowed operations.
2. **Discover evidence** – use safe local requests, logs, route metadata, and visible tests to identify the likely access-control failure.
3. **Patch** – edit application code, middleware, route guards, query filters, or policy mappings.
4. **Validate** – run public tests, policy checks, and regression tests.
5. **Submit** – the final answer is judged by deterministic hidden tests and reward logic.
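The five-step loop above can be sketched against an OpenEnv-style `reset`/`step` interface. Everything below is an illustrative assumption, not the project's API: the action kinds, the `Action`/`Observation` fields, and the in-memory `FakeRepairEnv` stand-in for the real server.

```python
from dataclasses import dataclass

# Hypothetical action/observation shapes for the secure-repair loop.
@dataclass
class Action:
    kind: str           # "read", "request", "patch", "run_tests", "submit"
    payload: str = ""

@dataclass
class Observation:
    text: str
    done: bool = False
    reward: float = 0.0

class FakeRepairEnv:
    """In-memory stand-in for the OpenEnv server, for illustration only."""

    def reset(self) -> Observation:
        self.steps = 0
        # Steps 1-2: the first observation carries the policy/evidence context.
        return Observation(text="policy graph + route metadata")

    def step(self, action: Action) -> Observation:
        self.steps += 1
        if action.kind == "submit":
            # Step 5: deterministic hidden tests would grade the patch here.
            return Observation(text="graded", done=True, reward=1.0)
        return Observation(text=f"result of {action.kind}")

env = FakeRepairEnv()
obs = env.reset()                                      # understand + discover
obs = env.step(Action("patch", "guard /admin route"))  # step 3: patch
obs = env.step(Action("run_tests"))                    # step 4: validate
obs = env.step(Action("submit"))                       # step 5: submit
```

The point of the shape is that the agent only ever sees observations; grading stays server-side, so hidden tests never leak into the prompt.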
## 4. Scope for MVP

The MVP should focus on **OWASP A01: Broken Access Control**, with ASVS-inspired access-control requirements.

Initial scenario families:

1. Missing route-level authorization check.
2. Insecure direct object reference / object-ownership bug.
3. Cross-tenant data leakage.
4. Role confusion: user/admin/support/editor boundary error.
5. Client-side-only authorization assumption.
6. Query-filter omission in a list/search/export endpoint.
7. Over-broad update/delete permission.
8. Feature route intentionally public, so the agent must not over-secure it.

Recommended MVP size: **8 scenario families × 3 app templates × 25 seeds = 600 trainable scenarios**, with separate held-out families and hidden seeds for evaluation.
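The recommended grid can be enumerated deterministically. The family identifiers below are placeholder slugs for the eight families listed above; only the three template names come from the planned repo layout.

```python
from itertools import product

# Placeholder slugs for the eight scenario families (illustrative names).
FAMILIES = [
    "missing_route_auth", "idor_ownership", "cross_tenant_leak",
    "role_confusion", "client_side_only_auth", "query_filter_omission",
    "overbroad_update_delete", "intentionally_public",
]
TEMPLATES = ["fastapi_basic", "express_basic", "django_basic"]
SEEDS = range(25)

# One stable scenario ID per (family, template, seed) combination.
scenarios = [f"{f}/{t}/{s:02d}" for f, t, s in product(FAMILIES, TEMPLATES, SEEDS)]
assert len(scenarios) == 8 * 3 * 25 == 600
```

Enumerating IDs up front makes the train/eval split trivial: held-out families and hidden seeds are just subsets of this list that never appear in training.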
## 5. Why this is useful

This environment targets a real gap between today’s scanners and useful defensive agents:

- **Scanners detect patterns.** This environment trains policy-aware reasoning.
- **Unit tests check known cases.** This environment includes hidden authorization invariants.
- **Static repair can overfit.** This environment forces the model to preserve valid business behavior.
- **One-app benchmarks are easy to memorize.** This environment prepares and caches many equivalent-but-different apps from policy graphs, templates, route shapes, schema names, and hidden test seeds, then keeps runtime `reset()` deterministic and fast.

The outcome is a model that gets better at a practical DevSecOps workflow: safely reviewing and repairing authorization logic in small-to-medium web apps.
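One way to keep `reset()` deterministic and fast is to key the pre-built app cache on the scenario parameters, so identical parameters always resolve to the same compiled variant. This is a sketch; the function name and parameter set are assumptions, not the compiler's actual interface.

```python
import hashlib
import json

def cache_key(family: str, template: str, seed: int) -> str:
    """Stable key for a compiled app variant: same inputs -> same key."""
    blob = json.dumps(
        {"family": family, "template": template, "seed": seed},
        sort_keys=True,  # canonical ordering keeps the hash stable
    )
    return hashlib.sha256(blob.encode()).hexdigest()[:16]

# reset() can look up a pre-compiled workspace by key instead of rebuilding,
# and two resets of the same scenario are guaranteed to see the same app.
k1 = cache_key("cross_tenant_leak", "fastapi_basic", seed=7)
k2 = cache_key("cross_tenant_leak", "fastapi_basic", seed=7)
assert k1 == k2
```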
## 6. What success looks like

A successful submission should show **measurable reward improvement** and better held-out security behavior after RL training.

### Minimum success criteria

- Environment runs through the OpenEnv `reset`, `step`, and `state` APIs.
- Hosted on Hugging Face Spaces.
- Provides a minimal GRPO/TRL or Unsloth training script.
- Tracks training/eval metrics with Trackio or equivalent.
- Shows reward curves and before/after agent behavior.
- Uses deterministic reward as the primary reward source.
- Keeps hidden tests hidden from the agent.
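A deterministic reward in this spirit could combine hidden-test results with regression preservation and a small patch-size penalty. The weights and the penalty threshold below are illustrative assumptions, not the project's reward specification.

```python
def episode_reward(hidden_pass: int, hidden_total: int,
                   regressions_kept: int, regressions_total: int,
                   files_changed: int) -> float:
    """Deterministic scalar reward: no LLM judge in the loop."""
    security = hidden_pass / hidden_total            # hidden authorization invariants
    preserve = regressions_kept / regressions_total  # valid user flows still work
    size_penalty = 0.05 * max(0, files_changed - 3)  # discourage sprawling patches
    return max(0.0, 0.6 * security + 0.4 * preserve - size_penalty)

# A fix passing all hidden tests, keeping all regressions, touching 2 files:
assert episode_reward(10, 10, 20, 20, files_changed=2) == 1.0
```

Weighting preservation separately from security is what blocks the trivial reward hack of locking every route down: an over-secured patch fails regression tests and loses the `preserve` term.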
### Target metrics

| Metric | MVP target |
|---|---:|
| Valid episode completion rate | ≥ 85% |
| Hidden authorization test pass rate | ≥ 65% after initial RL run |
| Regression preservation rate | ≥ 80% |
| Held-out scenario success lift vs base model | ≥ +15 percentage points |
| Reward-hacking incidents found in eval | 0 critical |
| Median patch size | ≤ 3 files changed |
## 7. Core design principle

The environment should reward **correct defensive repair**, not exploit creativity. The discovery stage exists only to help the agent gather enough local evidence to make a safe patch. The reward engine must never reward real-world misuse, data exfiltration, persistence, credential theft, or evasion behavior.
## 8. Deliverables for engineers

The initial implementation should produce:

```text
CyberSecurity_OWASP/
├── 00_PROJECT_BRIEF.md
├── 01_ARCHITECTURE.md
├── README.md
├── pyproject.toml
├── openenv.yaml
├── cybersecurity_owasp/
│   ├── __init__.py
│   ├── models.py
│   ├── client.py
│   ├── rewards.py
│   ├── scenarios/
│   │   ├── compiler.py
│   │   ├── policy_graph.py
│   │   ├── templates/
│   │   └── seeds/
│   ├── apps/
│   │   ├── fastapi_basic/
│   │   ├── express_basic/
│   │   └── django_basic/
│   ├── evals/
│   │   ├── public_tests.py
│   │   ├── hidden_invariants.py
│   │   └── heldout_eval.py
│   └── server/
│       ├── environment.py
│       ├── app.py
│       ├── requirements.txt
│       └── Dockerfile
├── training/
│   ├── train_grpo.py
│   ├── rollout.py
│   └── eval_before_after.py
└── outputs/
    ├── logs/
    ├── evals/
    └── reward_curves/
```
## 9. Source notes and credibility

| Source | How it informs this project | Credibility |
|---|---|---:|
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms the current relevance of Broken Access Control as a top web-app risk. | 10/10 |
| OWASP ASVS | Provides security-control requirements that can be translated into policy invariants and hidden tests. | 9.5/10 |
| OpenEnv build/deploy docs | Defines the required OpenEnv structure: models, server, client, Docker, HF Spaces deployment. | 8.5/10 |
| Hackathon judging criteria | Aligns deliverables with scoring: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv GRPO example | Shows a practical pattern for environment rollouts, reward functions, and Trackio logging. | 8/10 |