
CyberSecurity_OWASP — Architecture

1. System goal

CyberSecurity_OWASP is an OpenEnv environment for training a single LLM policy to perform a complete defensive authorization-repair workflow:

Understand policy → discover local evidence → patch code → validate → submit

The environment is intentionally not a two-agent red-team/blue-team setup. The agent is one model with one trajectory. It must learn both sides of the defensive workflow: finding the policy violation and fixing it safely.

2. Final architecture diagram

Rendered asset:

CyberSecurity_OWASP architecture

Editable source: assets/architecture_diagram.mmd

flowchart TB
    subgraph A[Async Scenario Authoring + Curriculum Factory]
        A1[Config-guided LLM Scenario Author\nDeepSeek-V4-Pro default]
        A2[ScenarioSpec JSON\npolicy, app family, bug target]
        A3[Template + A01 Mutator\nFastAPI code variants]
        A4[Deterministic Compiler\nexecutable bundle]
        A5[Static + Dynamic Verifier\nsolvable, safe, hidden/visible tests]
        A6[Difficulty Calibrator\nbaseline pass-rate buckets]
        A7[Versioned Scenario Cache\nsplit, difficulty, family, hash]
        A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
    end

    subgraph B[CyberSecurity_OWASP OpenEnv Runtime]
        B1["reset(seed, difficulty, family_budget)\ncache lookup only"]
        B2[Curriculum Sampler\nvalidated cache slice]
        B3[Episode State Store\nphase, history, cache metadata, patch diff]
        B4[Typed Action Tools\ninspect, request, patch, visible tests]
        B5[Ephemeral App Sandbox\ncloned cached workspace + fixtures]
        B6[Multi-layer Verifier\nvisible, hidden, oracle, regression]
        B7[Deterministic Reward Engine\nstable components + penalties]
        B8[Episode Artifact Logger\nJSONL transcript + verifier + diff]
        B1 --> B2 --> B3 --> B4
        B4 <--> B5
        B5 --> B6 --> B7 --> B3
        B3 --> B8
    end

    subgraph C[Single LLM Agent]
        C1[Observation Parser]
        C2[AuthZ + Code Reasoning]
        C3[Discover → Diagnose → Patch → Test\none JSON action]
        C1 --> C2 --> C3
    end

    subgraph D[Training + Evaluation + Demo]
        D1[Parallel Rollouts\nfast cached reset]
        D2[TRL GRPO + LoRA]
        D3[Trackio Curves\nreward, pass rates, cache metrics]
        D4[Held-out Family Eval\nbase vs trained model]
        D5[Demo Artifacts\nbefore/after traces + JSONL]
        D1 --> D2 --> D3 --> D4 --> D5
    end

    subgraph E[Feedback / Adaptation Loop]
        E1[Episode logs + failures]
        E2[Mastery Model\nweakness and plateau tracking]
        E3[Cache Sampling Weights\nnew generation queue]
        E1 --> E2 --> E3
    end

    A7 --> B1
    C3 -->|typed action| B4
    B4 -->|observation + reward + done| C1
    B7 --> D1
    D2 --> C1
    B8 --> E1
    E3 --> A1

3. Component responsibilities

3.1 Async Scenario Authoring Plane

Scenario generation is offline, asynchronous, validated, and cached. Runtime reset() must not call an LLM and must not compile a fresh app during Modal smoke, training, or evaluation runs.

The scenario authoring plane outputs complete executable bundles:

  • scenario.json;
  • app_source/;
  • policy_graph.json;
  • visible_tests.py;
  • hidden_tests.py;
  • oracle_tests.py;
  • expected_exploit_trace.json;
  • reward_config.json;
  • metadata.json.

The default scenario/curriculum author is configured in configs/scenario_authoring.small.json:

{
  "provider": "huggingface",
  "model_id": "deepseek-ai/DeepSeek-V4-Pro",
  "thinking_mode": "thinking",
  "reasoning_effort": "high",
  "temperature": 1.0,
  "top_p": 1.0
}

DeepSeek-V4-Pro is only used for offline scenario/curriculum authoring. It is not the RL policy model unless explicitly selected for training.

The compiler remains the main anti-overfitting mechanism. It should vary the following (a seeded variation sketch follows the list):

  • route names;
  • schema names;
  • ORM query structure;
  • framework template;
  • role names;
  • tenant IDs;
  • object ownership patterns;
  • file layout;
  • visible test coverage;
  • hidden invariant seeds.

The runtime treats curriculum and cache sampling as first-class scenario inputs:

  • CurriculumController tracks target weakness mastery, recent reward trend, failure counts, and difficulty tier.
  • Offline cache prep uses the configured LLM author, deterministic compiler, verifier, and baseline-agent difficulty calibrator.
  • ScenarioCache stores validated bundles by split, difficulty, family, generator version, verifier version, and scenario hash.
  • Hidden-eval episodes hold out scenario families, not only seeds, by marking evaluation-only scenario-family metadata in state rather than observations.

Cache keys include:

difficulty_level
authz_bug_type
app_family
framework
policy_shape
tenant_model
exploit_depth
patch_scope
regression_risk
generator_version
verifier_version
scenario_hash
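
A sketch of deriving a deterministic cache key from these fields; the field list matches the keys above, but the hashing scheme itself is an assumption, not the documented cache layout:

import hashlib
import json

# Assumed scheme: a fixed field order makes the digest deterministic across runs.
CACHE_KEY_FIELDS = [
    "difficulty_level", "authz_bug_type", "app_family", "framework",
    "policy_shape", "tenant_model", "exploit_depth", "patch_scope",
    "regression_risk", "generator_version", "verifier_version", "scenario_hash",
]

def cache_key(metadata: dict) -> str:
    payload = json.dumps([metadata[f] for f in CACHE_KEY_FIELDS])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]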

3.2 Policy Graph Generator

The policy graph is the ground truth for intended behavior.

Example internal representation:

resources:
  invoice:
    owner_field: owner_user_id
    tenant_field: tenant_id
roles:
  user:
    can:
      - read:invoice where owner_user_id == actor.user_id
      - update:invoice where owner_user_id == actor.user_id and status != locked
  support:
    can:
      - read:invoice where tenant_id == actor.tenant_id
  admin:
    can:
      - read:any_invoice where tenant_id == actor.tenant_id
      - update:any_invoice where tenant_id == actor.tenant_id
public_routes:
  - GET /health
  - GET /pricing
forbidden:
  - cross_tenant_read
  - cross_tenant_update
  - user_reads_other_user_invoice

The policy graph prevents false rewards for over-securing intentionally public or intentionally allowed routes.
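To make that oracle role concrete, here is a toy checker over a policy graph loaded from the YAML above; it only evaluates the single-equality "field == actor.attr" condition form, so it is a sketch, not the real verifier:

# Toy policy-graph lookup, assuming the YAML above is loaded into `policy`
# as nested dicts/lists. Compound conditions ("... and ...") are not handled.
def can(policy: dict, role: str, verb: str, resource: str,
        actor: dict, obj: dict) -> bool:
    for rule in policy["roles"].get(role, {}).get("can", []):
        action_part, _, cond = rule.partition(" where ")
        if action_part != f"{verb}:{resource}":
            continue
        if not cond:
            return True
        field_name, sep, actor_attr = cond.partition(" == actor.")
        if sep and obj.get(field_name.strip()) == actor.get(actor_attr.strip()):
            return True
    return False

# Example: a user reading someone else's invoice is denied.
# can(policy, "user", "read", "invoice",
#     actor={"user_id": 1}, obj={"owner_user_id": 2})  -> False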

3.3 Bug Injector

The bug injector creates controlled, defensive lab scenarios. It should only generate bugs inside local synthetic apps.

MVP bug classes:

| Bug class | Example failure mode | Expected fix type |
| --- | --- | --- |
| Missing route guard | Protected endpoint lacks authorization middleware | Add policy check/middleware |
| IDOR / ownership bug | User can access another user's object by changing ID | Add owner check in query/policy |
| Tenant leak | Tenant A can list Tenant B records | Add tenant filter |
| Role confusion | Support/editor/admin boundary is wrong | Correct role-to-permission mapping |
| Client-side-only auth | Server trusts UI to hide forbidden action | Enforce server-side authorization |
| Query omission | List/export/search endpoint lacks auth filter | Filter query by actor permissions |
| Over-broad mutation | User can update/delete forbidden object | Add mutation permission check |
| Public route decoy | Agent may wrongly lock down intended public endpoint | Preserve intended public behavior |
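
For the IDOR / ownership class, a buggy/fixed pair might look like the FastAPI fragment below; the route, fixture, and helper names are made up for illustration:

from fastapi import Depends, FastAPI

app = FastAPI()

# Stand-ins for the generated app's auth and fixtures (illustrative only).
def current_actor() -> dict:
    return {"user_id": 1}

INVOICES = {42: {"id": 42, "owner_user_id": 2, "total": 99}}

# Injected bug: any authenticated user can read any invoice by guessing IDs.
@app.get("/invoices/{invoice_id}")
def get_invoice(invoice_id: int, actor: dict = Depends(current_actor)):
    return INVOICES.get(invoice_id)  # BUG: no owner_user_id check

# Shape of the expected fix: scope the read to the acting user.
@app.get("/v2/invoices/{invoice_id}")
def get_invoice_fixed(invoice_id: int, actor: dict = Depends(current_actor)):
    inv = INVOICES.get(invoice_id)
    if inv is None or inv["owner_user_id"] != actor["user_id"]:
        return {"detail": "forbidden"}  # a real app would raise HTTPException
    return inv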

3.4 OpenEnv Server

The OpenEnv server should implement the standard lifecycle:

  • reset() β€” initialize a fresh episode from a cached scenario bundle.
  • step(action) β€” execute one typed action and return observation, reward, and done.
  • state() β€” expose episode metadata for debugging and evaluation.

Recommended package/class names:

Repo name:      CyberSecurity_OWASP
Python package: cybersecurity_owasp
Client class:   CyberSecurityOWASPEnv
Action class:   CyberSecurityOWASPAction
Observation:    CyberSecurityOWASPObservation
State:          CyberSecurityOWASPState
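
A hedged sketch of one episode through the client, assuming the usual OpenEnv step-result shape (observation, reward, done); the import path and constructor arguments depend on the installed packages:

from cybersecurity_owasp import CyberSecurityOWASPAction, CyberSecurityOWASPEnv

env = CyberSecurityOWASPEnv(base_url="http://localhost:8000")
result = env.reset()
while not result.done:
    # A trained policy would pick the tool; "noop" keeps the sketch minimal.
    action = CyberSecurityOWASPAction(tool_name="noop", arguments={})
    result = env.step(action)
print(result.reward, result.observation.done_reason)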

3.5 Tool API

The agent should interact through typed actions. Keep the interface small enough for RL but expressive enough for realistic repair.

from dataclasses import dataclass
from typing import Literal

# Action, Observation, and State in this document are the OpenEnv base types;
# their exact import paths depend on the installed OpenEnv version.

@dataclass
class CyberSecurityOWASPAction(Action):
    tool_name: Literal[
        "inspect_policy_graph",
        "list_routes",
        "read_openapi",
        "read_file",
        "search_code",
        "send_local_request",
        "compare_identities",
        "submit_diagnosis",
        "patch_file",
        "run_visible_tests",
        "submit_fix",
        "noop",
    ]
    arguments: dict

Recommended actions:

| Action | Purpose | Safety boundary |
| --- | --- | --- |
| inspect_policy_graph | Read intended authorization rules. | Only synthetic policy. |
| list_routes | See local app route map. | No internet target. |
| read_file | Inspect selected source file. | Sandbox allowlist only. |
| send_local_request | Validate behavior against local app. | Local generated app only. |
| submit_diagnosis | Record bug class, route, policy rule, evidence trace IDs, and fix plan. | Does not reveal hidden tests. |
| run_visible_tests | Run visible tests. | No hidden test disclosure. |
| patch_file | Modify source through unified diff or full content. | Patch size and file allowlist limits. |
| submit_fix | End episode and trigger hidden eval. | Final hidden score only, no leaked test details. |
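
Example action payloads built from the dataclass in this section; the argument keys are hypothetical, standing in for whatever per-tool schema the environment actually defines:

probe = CyberSecurityOWASPAction(
    tool_name="send_local_request",
    arguments={"method": "GET", "path": "/invoices/42", "actor": "user_b"},
)
fix = CyberSecurityOWASPAction(
    tool_name="patch_file",
    arguments={"path": "app/routes/invoices.py", "unified_diff": "--- a/..."},
)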

3.6 Observation schema

Observations should be compact and structured.

@dataclass
class CyberSecurityOWASPObservation(Observation):
    phase: Literal["discover", "patch", "done"]
    message: str
    task_brief: str
    visible_policy_hint: dict
    workspace_summary: dict
    available_actions: list[str]
    last_tool_result: str
    visible_test_result: str | None = None
    reward_breakdown: dict[str, float] = field(default_factory=dict)
    done_reason: str | None = None

The policy hint is deliberately partial. It may include product rules, fixture aliases, route summaries, and public-route intent, but it must not expose the hidden oracle matrix, hidden test bodies, injected bug labels, or held-out family labels.

3.7 State schema

State should support debugging and training analytics.

@dataclass
class CyberSecurityOWASPState(State):
    episode_id: str
    task_id: str
    split: Literal["train", "validation", "hidden_eval"]
    step_count: int = 0
    max_steps: int = 40
    difficulty_tier: str = "warmup"
    scenario_family: str = ""
    template_id: str = "fastapi_basic"
    target_weakness: str = ""
    curriculum_snapshot: dict = field(default_factory=dict)
    verification_summary: dict = field(default_factory=dict)
    patch_diff: str = ""
    episode_artifact_path: str | None = None
    accumulated_reward: float = 0.0

4. Episode lifecycle

1. reset()
   - curriculum selects difficulty tier and target weakness
   - runtime samples or directly loads a validated cached bundle
   - clone cached `app_source/` into an isolated ephemeral workspace
   - initialize fixture state, cache metadata, and sandbox handles
   - return initial observation

2. agent loop
   - inspect policy/routes/files
   - send local requests only inside sandbox
   - run visible tests
   - apply one or more patches
   - rerun visible tests

3. submit_fix
   - freeze patch
   - run visible tests
   - run hidden authorization invariants
   - run policy-oracle matrix
   - run regression and public-route preservation tests
   - compute deterministic reward
   - return final observation, reward, done=True

4. logging
   - append JSONL artifact with scenario metadata, action trace, observations, patch diff, verifier result, and reward components
   - feed terminal success/failure back into curriculum mastery tracking
   - send metrics to Trackio during training/eval
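
One JSONL artifact record could look like the following (a single line, wrapped here for readability); every field value is illustrative, mirroring the contents listed above rather than the exact logger schema:

{"episode_id": "ep_000123", "task_id": "idor_invoices_v3", "split": "train",
 "actions": ["list_routes", "send_local_request", "patch_file", "submit_fix"],
 "patch_diff": "--- a/app/routes/invoices.py\n+++ b/app/routes/invoices.py\n...",
 "verification_summary": {"visible": "pass", "hidden": "pass", "regression": "pass"},
 "reward_breakdown": {"security": 6.0, "regression": 3.0, "terminal_total": 13.5}}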

CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require is mandatory for Modal smoke, training, and evaluation. In that mode a missing cache bundle is a hard failure. Local development may use fallback, which compiles deterministically on a miss, but that path is not allowed for meaningful training.
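
A sketch of that gate, assuming the environment variable is read at reset time; the function names are illustrative:

import os
from pathlib import Path

def compile_fallback_bundle(bundle_dir: Path) -> Path:
    raise NotImplementedError("deterministic local-dev compiler lives elsewhere")

def resolve_bundle(bundle_dir: Path) -> Path:
    mode = os.environ.get("CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE", "require")
    if bundle_dir.exists():
        return bundle_dir
    if mode == "require":
        # Modal smoke/training/eval: a miss is a hard failure, never an LLM call.
        raise RuntimeError(f"scenario cache miss in require mode: {bundle_dir}")
    # Local development only: deterministic compile on a miss.
    return compile_fallback_bundle(bundle_dir)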

5. Reward design

The reward should be deterministic, decomposed, and resistant to reward hacking. The maximum terminal reward remains 15.0, and high reward requires deterministic verifier success, not explanation quality.

Stable reward keys:

{
    "discovery": 0.0,
    "security": 0.0,
    "regression": 0.0,
    "public_routes": 0.0,
    "patch_quality": 0.0,
    "visible_tests": 0.0,
    "safety": 0.0,
    "anti_cheat": 0.0,
    "terminal_total": 0.0,
    "progressive": 0.0,
    "step_penalty": 0.0,
    "speed_bonus": 0.0,
    "token_penalty": 0.0,
    "behavior_penalty": 0.0,
    "train_total": 0.0,
    "total": 0.0,
}

Sparse evaluation uses terminal_total as total. Dense training uses terminal_total + shaping_weight * progressive + efficiency - penalties as total, with all reward values and short descriptions configured in training/configs/grpo_small.yaml.
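
The two totals can be read as the sketch below; shaping_weight and the exact grouping of efficiency terms and penalties are assumptions here, with the authoritative values living in training/configs/grpo_small.yaml:

# Assumed composition of the totals from the stable keys above.
def eval_total(r: dict[str, float]) -> float:
    return r["terminal_total"]  # sparse evaluation: total == terminal_total

def train_total(r: dict[str, float], shaping_weight: float = 0.1) -> float:
    efficiency = r["speed_bonus"]
    penalties = r["step_penalty"] + r["token_penalty"] + r["behavior_penalty"]
    return r["terminal_total"] + shaping_weight * r["progressive"] + efficiency - penalties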

Reward components

| Component | Purpose |
| --- | --- |
| discovery | Valid local evidence and correct violated policy rule. |
| security | Hidden exploit blocking plus policy-oracle matrix pass. |
| regression | Legitimate owner/admin/support flows still work. |
| public_routes | Intentionally public routes remain public. |
| patch_quality | Localized policy-aligned patch and efficient phase order. |
| visible_tests | Visible tests pass and app still boots. |
| safety | Penalizes invalid action patterns, unsafe targets, timeouts, and deny-all behavior. |
| anti_cheat | Penalizes hidden-file probing, hardcoded fixture IDs, and test/oracle tampering. |

Penalties

| Penalty | Trigger |
| --- | --- |
| public route penalty | Breaks a route intentionally marked public. |
| anti-cheat penalty | Deletes or probes tests, hidden files, reward code, oracle data, or host paths. |
| hardcoding penalty | Hardcodes seed-specific IDs, users, tenants, or hidden assumptions. |
| safety penalty | Over-broad denial, malformed/invalid actions, repeated failed actions, or external target attempts. |

The LLM judge, if used at all, should only annotate trace quality for analysis. It must not decide security-critical reward.

6. Hidden tests and anti-overfitting

Hidden tests are necessary because visible tests can be gamed or memorized. They should test policy invariants rather than exact implementation details.

Use 4 anti-overfitting layers:

  1. Seed diversity — route names, user IDs, tenant IDs, object names, and schemas change every episode.
  2. Template diversity — same policy bug appears in different frameworks and file layouts.
  3. Hidden invariant tests — final reward uses unseen authorization cases.
  4. Held-out eval split — at least 20% of scenario families/seeds are never used in training.

Recommended split:

Train:      70%
Validation: 10%
Held-out:   20%
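
One way to make the family-level holdout reproducible is to hash the scenario family into a split bucket; this is a sketch of the idea, not the cache's actual assignment code:

import hashlib

def assign_split(scenario_family: str) -> str:
    # Hash the family (not the seed) so whole families stay out of training.
    bucket = int(hashlib.sha256(scenario_family.encode()).hexdigest(), 16) % 100
    if bucket < 70:
        return "train"
    if bucket < 80:
        return "validation"
    return "hidden_eval"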

7. Evaluation plan

Run before/after evaluation on the same held-out suite.

Metrics

| Metric | Meaning |
| --- | --- |
| episode_success_rate | Public + hidden + regression tests pass. |
| hidden_authz_pass_rate | Security-critical hidden checks pass. |
| regression_pass_rate | Normal valid behavior remains intact. |
| oversecure_rate | Agent blocks intended legitimate/public behavior. |
| patch_compile_rate | Patch applies and app still runs. |
| median_steps_to_submit | Efficiency of the repair workflow. |
| median_files_changed | Patch focus/minimality. |
| reward_hacking_rate | Attempts to delete tests, hardcode fixtures, or bypass eval. |

Eval table template

| Model | Split | Success | Hidden authz | Regression | Oversecure | Median steps | Median files changed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |
| RL-trained model | heldout | TBD | TBD | TBD | TBD | TBD | TBD |

8. Training flow

Rendered asset:

CyberSecurity_OWASP RL training flow

Editable source: assets/env_rl_training_flow_diagram.mmd

1. Build CyberSecurity_OWASP OpenEnv server.
2. Prepare validated scenario cache once per generator/verifier version.
3. Run baseline eval with cached validation/held-out bundles.
4. Train with GRPO/TRL or Unsloth using cached rollout episodes.
5. Log reward components, pass rates, reset latency, and cache hit metrics to Trackio.
6. Run held-out eval every N training steps.
7. Inspect failure clusters and cache sampling weights.
8. Refresh only 5-10% of scenarios per epoch when new weak spots are found.
9. Produce final demo: before/after trace + reward curve + held-out eval table.

Recommended initial training setup (Modal-first):

Model: unsloth/gemma-4-E2B-it
Algorithm: GRPO via TRL or Unsloth-compatible loop
Dataset prompt: repeated task instruction with randomized scenario IDs
Max steps per episode: 30
Rollouts per prompt: 2-4
Logging: Trackio
Primary eval: held-out deterministic test pass rate
Scenario cache mode: require
Scenario cache volume: CyberSecurity_OWASP-scenario-cache

Training execution is expected to run on Modal (persistent or ephemeral) rather than locally.
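
A minimal wiring sketch for the GRPO side, assuming TRL's GRPOTrainer; the OpenEnv rollout integration differs by TRL version, and env_reward below is a placeholder for the environment's deterministic, verifier-driven total:

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Repeated task instruction with randomized scenario IDs baked into the prompt.
prompt_dataset = Dataset.from_dict(
    {"prompt": [f"Repair the authorization bug in scenario {i}." for i in range(64)]}
)

def env_reward(completions, **kwargs):
    # Placeholder: real rewards come from the environment's verifier, not
    # from scoring completion text here.
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model="unsloth/gemma-4-E2B-it",
    reward_funcs=env_reward,
    args=GRPOConfig(output_dir="checkpoints", num_generations=4),
    train_dataset=prompt_dataset,
)
trainer.train()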

9. Deployment architecture

The environment should be runnable in 3 modes:

| Mode | Purpose |
| --- | --- |
| Local Uvicorn | Fast engineer iteration. |
| Docker | Reproducible local training/eval. |
| Hugging Face Spaces | Public hackathon demo and OpenEnv-compliant hosting. |

Expected endpoints:

/ws       OpenEnv client session
/health   health check
/reset    debug reset
/step     debug step
/state    debug state
/docs     FastAPI docs
/web      optional web UI

10. Implementation milestones

Milestone 1 — Skeleton environment

  • models.py
  • client.py
  • server/environment.py
  • server/app.py
  • server/Dockerfile
  • openenv.yaml
  • health check
  • one hand-written scenario

Milestone 2 — Scenario compiler

  • policy graph format
  • app template renderer
  • bug injector
  • DB fixture generator
  • visible and hidden test generator

Milestone 3 — Reward engine

  • visible test score
  • hidden invariant score
  • regression score
  • patch minimality score
  • safety/reward-hacking penalties
  • reward component logging

Milestone 4 — Training script

  • rollout loop
  • GRPO/TRL or Unsloth training script
  • Trackio logging
  • checkpoint save/push
  • baseline and post-training eval

Milestone 5 — Hackathon demo

  • HF Spaces deployment
  • mini-blog
  • 2-minute video
  • before/after traces
  • reward curve
  • held-out eval table

11. Engineering notes

  • Keep scenario apps small: ideally 5-15 files each.
  • Prefer deterministic tests over LLM judging.
  • Hide final hidden test details from observations.
  • Log enough trace data to debug failures but never leak hidden tests to the agent.
  • Include intentionally public routes and allowed cross-role cases so the model does not learn β€œadd auth everywhere.”
  • The best demo is not just β€œagent finds bug,” but β€œagent learns not to break valid business behavior.”

12. Source notes and credibility

| Source | How it informs this architecture | Credibility |
| --- | --- | --- |
| OWASP Top 10 2025 / A01 Broken Access Control | Confirms why access control is the right security focus. | 10/10 |
| OWASP ASVS access-control guidance | Informs policy invariants and server-side authorization checks. | 9.5/10 |
| OpenEnv environment-building docs | Defines required models, reset/step/state, FastAPI server, Docker, and client. | 8.5/10 |
| OpenEnv quickstart/architecture docs | Informs WebSocket client/server design, typed EnvClient, and container isolation. | 8.5/10 |
| OpenEnv deployment docs | Informs HF Spaces deployment, endpoints, Docker workflow, and installable client package. | 8.5/10 |
| Hackathon judging criteria | Informs demo priorities: innovation, storytelling, reward improvement, and training pipeline. | 9/10 |
| TRL/OpenEnv training example | Informs rollout function, decomposed reward functions, and Trackio logging pattern. | 8/10 |
| Kube SRE Gym README | Informs the closed-loop pattern: adversarial scenario design, curriculum mastery tracking, real tool interaction, verification, and artifact-driven storytelling. | 8/10 |
| DeepSeek-V4-Pro Hugging Face model card and encoding notes | Informs the default offline scenario-author config and the note that prompt handling should not assume a Jinja chat template. | 8/10 |