Humanlearning committed · Commit 63a6397 · verified · 1 parent: 252f427

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,158 +1,227 @@
1
- ---
2
- title: Cyber Analyst Environment Server
3
- emoji: 🎯
4
- colorFrom: pink
5
- colorTo: red
6
- sdk: docker
7
- pinned: false
8
- app_port: 8000
9
- base_path: /web
10
- tags:
11
- - openenv
12
- ---
13
-
14
- # Cyber Analyst Environment
15
-
16
- Cyber Analyst is an OpenEnv implementation of the "SecOps Evidence Gym" design from `docs/deep-research-report.md`. It benchmarks a bounded, safe security triage workflow: investigate synthetic artifacts, cite evidence IDs, validate candidate findings with deterministic verifiers, and submit a remediation report.
17
-
18
- The environment contains no live targets, no real secrets, no exploit workflow, no shell, and no outbound investigation tools. All evidence is static synthetic lab data.
19
-
20
- ## Tasks
21
-
22
- The manifest ships three graded tasks:
23
-
24
- | Task id | Difficulty | Goal |
25
- | --- | --- | --- |
26
- | `secret_exposure_easy` | easy | Find a synthetic API-key-like secret in a repo snapshot and propose removal plus rotation. |
27
- | `missing_security_headers_medium` | medium | Detect missing HSTS/CSP headers in a synthetic gateway header snapshot. |
28
- | `authz_boundary_hard` | hard | Detect an admin route role-policy mismatch without exploitation. |
29
-
30
- ## Action Contract
31
-
32
- Use one bounded tool call per `step`:
33
-
34
- ```python
35
- CyberAnalystAction(
36
- tool_name="search_repo",
37
- args={"query": "api key"},
38
- )
39
- ```
40
-
41
- Approved tools:
42
-
43
- - `list_assets()`
44
- - `get_log_events(service_id, query)`
45
- - `check_security_headers(service_id)`
46
- - `search_repo(query)`
47
- - `scan_dependencies()`
48
- - `create_finding(finding_type, evidence_ids, severity_guess, remediation)`
49
- - `validate_finding(finding_id)`
50
- - `submit_report(report_json)`
51
-
52
- ## Observation Contract
53
-
54
- Each observation includes:
55
-
56
- - `alert`: task prompt
57
- - `tool_catalog`: approved tool list
58
- - `tool_result`: latest tool result
59
- - `evidence_ids`: discovered evidence IDs
60
- - `candidate_findings`: created findings
61
- - `verified_findings`: verifier-confirmed findings
62
- - `score_breakdown`: deterministic scoring explanation
63
- - `step_budget_remaining`, `error`, `done`, and `reward`
64
-
65
- Rewards and final scores are clamped to `0.01..0.99` for validator compatibility.
66
-
67
- `submit_report` also returns `trajectory_jsonl`, a JSONL export of the episode
68
- events up to report submission. This is intended for offline inspection and
69
- future training data extraction.
70
-
71
- ## Local Run
72
-
73
- From this directory:
74
-
75
- ```bash
76
- uv run server
77
- ```
78
-
79
- Then connect with the client:
80
-
81
- ```python
82
- from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv
83
-
84
- with CyberAnalystEnv(base_url="http://localhost:8000").sync() as env:
85
- result = env.reset(task_id="secret_exposure_easy", seed=7)
86
- result = env.step(CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}))
87
- print(result.observation.tool_result)
88
- ```
89
-
90
- ## Baseline Inference
91
-
92
- `inference.py` runs a deterministic scripted baseline and prints strict parser-friendly logs:
93
-
94
- ```text
95
- [START] task=<task_id> env=Cyber_analyst model=<model_name>
96
- [STEP] step=<n> action=<tool_name> reward=<0.00> done=<true|false> error=<msg|null>
97
- [END] task=<task_id> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>
98
- ```
99
-
100
- LLM calls are not enabled by default. The script already includes OpenAI SDK configuration compatible with Hugging Face Inference Providers so model-backed report drafting can be added later:
101
-
102
- ```bash
103
- set ENV_URL=http://localhost:8000
104
- set API_BASE_URL=https://router.huggingface.co/v1
105
- set MODEL_NAME=openai/gpt-oss-120b:novita
106
- set HF_TOKEN=<your-hugging-face-token>
107
- python inference.py
108
- ```
109
-
110
- ## Validation
111
-
112
- Useful local checks:
113
-
114
- ```bash
115
- python -m py_compile server/Cyber_analyst_environment.py inference.py
116
- python -m pytest tests
117
- .\.venv\Scripts\openenv.exe validate . --json
118
- ```
119
-
120
- ## Docker
121
-
122
- Build the environment image from this directory:
123
-
124
- ```bash
125
- docker build -t cyber-analyst-env:latest -f server/Dockerfile .
126
- ```
127
-
128
- Run:
129
-
130
- ```bash
131
- docker run -p 8000:8000 cyber-analyst-env:latest
132
- ```
133
-
134
- Health check:
135
-
136
- ```bash
137
- curl http://localhost:8000/health
138
- ```
139
-
140
- ## Deployment
141
-
142
- Deploy to Hugging Face Spaces with OpenEnv:
143
-
144
- ```bash
145
- openenv push --repo-id <your-hf-username>/Cyber_analyst
146
- ```
147
-
148
- The deployed Space exposes `/health`, `/docs`, `/ws`, and the optional `/web` interface when web UI support is enabled by the OpenEnv runtime.
149
-
150
- ## Adding Scenarios
151
-
152
- Add new safe scenarios in `server/tasks.py` by extending `SCENARIOS` with:
153
-
154
- - a stable `task_id`
155
- - synthetic `assets`, `repo`, `logs`, `headers`, and `dependencies` entries
156
- - `ground_truth_id`, `finding_type`, `required_evidence`, `impact_keywords`, and `remediation_keywords`
157
-
158
- Then add a grader adapter in `server/graders.py` and a matching `tasks` entry in `openenv.yaml`. Keep all artifacts synthetic, keep correctness deterministic, and avoid adding real targets or arbitrary execution tools.
1
+ ---
2
+ title: Cyber Analyst Environment Server
3
+ emoji: 🎯
4
+ colorFrom: pink
5
+ colorTo: red
6
+ sdk: docker
7
+ pinned: false
8
+ app_port: 8000
9
+ base_path: /web
10
+ tags:
11
+ - openenv
12
+ ---
13
+
14
+ # Cyber Analyst Environment
15
+
16
+ Cyber Analyst is an OpenEnv implementation of the "SecOps Evidence Gym". It benchmarks a bounded, safe security-triage workflow: investigate synthetic artifacts, cite evidence IDs, validate candidate findings with deterministic verifiers, and submit a remediation report.
17
+
18
+ The environment contains no live targets, no real secrets, no exploit workflow, no shell, and no outbound investigation tools. All evidence is static synthetic lab data.
19
+
20
+ ## Motivation
21
+
22
+ Frontier models are becoming much stronger at security-relevant reasoning. Anthropic's April 7, 2026 report, [Assessing Claude Mythos Preview's cybersecurity capabilities](https://red.anthropic.com/2026/mythos-preview/), describes a model that can identify and exploit subtle vulnerabilities across real software targets, and argues that the same capability jump should be directed toward defense.
23
+
24
+ That creates a practical gap: many modern applications are built quickly, including "vibe coded" apps whose security review may not keep pace with generation speed. This environment is a small, safe training and evaluation surface for the defensive side of that gap. The goal is to help train and benchmark smaller, more accessible models to behave like careful application-security analysts: gather evidence, avoid unsupported claims, validate findings, and recommend concrete fixes.
25
+
26
+ ## Environment Description
27
+
28
+ Each episode simulates a synthetic microservice organization with three services:
29
+
30
+ - `gateway`
31
+ - `profile-service`
32
+ - `admin-service`
33
+
34
+ The agent starts from an alert and can inspect only closed-world artifact collections:
35
+
36
+ - `repo_snapshot`: static code/config snippets
37
+ - `telemetry`: sanitized log events
38
+ - `headers`: static response-header snapshots
39
+ - `dependencies`: static dependency manifest excerpts
40
+
41
+ The episode budget is 12 steps. Seeds deterministically vary benign details such as service aliases and evidence ordering while keeping the same task ground truth reproducible.
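As a minimal sketch of that seeding behaviour (the helper name and alias table below are illustrative, not part of the environment's API), a single `random.Random(seed)` instance is enough to make every benign variation reproducible:

```python
import random

def aliased_services(seed: int) -> list[str]:
    # Hypothetical illustration: derive per-episode service aliases from the
    # seed. The real alias tables live in the environment's task definitions.
    rng = random.Random(seed)
    base = ["gateway", "profile-service", "admin-service"]
    aliases = {"profile-service": ["profile-service", "user-profile"]}
    # Same seed -> same choices -> same episode surface every time.
    return [rng.choice(aliases.get(name, [name])) for name in base]
```

Because the generator is seeded once per episode, replaying a task with the same `seed` yields the same aliases and evidence ordering.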
42
+
43
+ ## Tasks
44
+
45
+ The manifest ships three graded tasks:
46
+
47
+ | Task id | Difficulty | Task description | Expected solution path |
48
+ | --- | --- | --- | --- |
49
+ | `secret_exposure_easy` | easy | Find a synthetic API-key-like secret in a repo snapshot and propose removal plus rotation. | Easiest path: one focused `search_repo` call can surface the relevant evidence, then the agent must create, validate, and report the finding. |
50
+ | `missing_security_headers_medium` | medium | Detect missing HSTS/CSP headers in a synthetic gateway header snapshot. | Requires choosing the purpose-built `check_security_headers` tool and mapping missing headers to remediation instead of over-searching unrelated artifacts. |
51
+ | `authz_boundary_hard` | hard | Detect an admin route role-policy mismatch without exploitation. | Requires correlating route/role policy evidence with a supporting log event and recommending least-privilege policy remediation plus regression testing. |
52
+
53
+ ## Action Space
54
+
55
+ Each `step` accepts exactly one bounded simulator tool call:
56
+
57
+ ```python
58
+ CyberAnalystAction(
59
+ tool_name="search_repo",
60
+ args={"query": "api key"},
61
+ )
62
+ ```
63
+
64
+ Approved tools:
65
+
66
+ | Tool | Arguments | Purpose |
67
+ | --- | --- | --- |
68
+ | `list_assets` | `{}` | List synthetic services, routes, and artifact collections. |
69
+ | `get_log_events` | `{"service_id": "str", "query": "str"}` | Return sanitized telemetry evidence IDs for a service/query. |
70
+ | `check_security_headers` | `{"service_id": "str"}` | Inspect a service header snapshot and return pass/fail evidence. |
71
+ | `search_repo` | `{"query": "str"}` | Search synthetic repo/config snippets for evidence IDs. |
72
+ | `scan_dependencies` | `{}` | Inspect a synthetic dependency manifest excerpt. |
73
+ | `create_finding` | `{"finding_type": "str", "evidence_ids": ["str"], "severity_guess": "str", "remediation": "str"}` | Store a candidate finding for verifier review. |
74
+ | `validate_finding` | `{"finding_id": "str"}` | Run the deterministic verifier for a candidate finding. |
75
+ | `submit_report` | `{"report_json": {"findings": [...]}}` | Submit the final structured report and end the episode. |
76
+
77
+ Unsupported tools return an observation error instead of running arbitrary commands. Repeating the exact same action is penalized, and six repeated identical actions hard-stop the episode.
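The intended happy path can be sketched as plain action payloads. The tool and argument names come from the table above; the evidence ID, finding ID, and remediation text are placeholders, not values the environment guarantees:

```python
# Hypothetical episode plan for secret_exposure_easy, expressed as the raw
# (tool_name, args) payloads that CyberAnalystAction wraps.
EPISODE_PLAN = [
    {"tool_name": "search_repo", "args": {"query": "api key"}},
    {"tool_name": "create_finding", "args": {
        "finding_type": "secret_exposure",
        "evidence_ids": ["EVID-001"],          # placeholder evidence ID
        "severity_guess": "high",
        "remediation": "Remove the key from the repo and rotate it.",
    }},
    {"tool_name": "validate_finding", "args": {"finding_id": "F-1"}},  # placeholder ID
    {"tool_name": "submit_report", "args": {"report_json": {"findings": []}}},
]

APPROVED_TOOLS = {
    "list_assets", "get_log_events", "check_security_headers", "search_repo",
    "scan_dependencies", "create_finding", "validate_finding", "submit_report",
}
# Every planned call must use an approved tool, or the env returns an error.
assert all(a["tool_name"] in APPROVED_TOOLS for a in EPISODE_PLAN)
```

Note that each entry is one `step` call, so this plan fits comfortably inside the 12-step budget.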
78
+
79
+ ## Observation Space
80
+
81
+ Each observation is a `CyberAnalystObservation` with:
82
+
83
+ | Field | Definition |
84
+ | --- | --- |
85
+ | `task_id` | Current benchmark task ID. |
86
+ | `alert` | Initial alert or task prompt. |
87
+ | `phase` | Current episode phase, usually `investigate` or `done`. |
88
+ | `tool_catalog` | Approved tool list and argument schemas. |
89
+ | `tool_result` | Result returned by the latest tool call. |
90
+ | `evidence_ids` | Evidence IDs discovered so far. |
91
+ | `candidate_findings` | Candidate findings created by the agent. |
92
+ | `verified_findings` | Verifier-confirmed findings. |
93
+ | `step_budget_remaining` | Steps remaining before timeout. |
94
+ | `score_breakdown` | Deterministic final scoring explanation after report submission. |
95
+ | `error` | Non-fatal environment error, if any. |
96
+ | `done` | Whether the episode has ended. |
97
+ | `reward` | Step reward clamped to the validator-compatible range. |
98
+
99
+ `submit_report` also returns `trajectory_jsonl`, a JSONL export of the episode events up to report submission. This is intended for offline inspection and future training data extraction.
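A minimal sketch of offline inspection, assuming only that each JSONL line is one JSON event object; the `step` and `tool_name` fields in the sample are illustrative, not a documented export schema:

```python
import json

def load_trajectory(jsonl_text: str) -> list[dict]:
    # Parse the trajectory_jsonl export into a list of event dicts,
    # skipping any blank lines.
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Illustrative records only; real field names come from the server export.
sample = "\n".join([
    '{"step": 1, "tool_name": "search_repo"}',
    '{"step": 2, "tool_name": "submit_report"}',
])
events = load_trajectory(sample)
```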
100
+
101
+ ## Scoring
102
+
103
+ Final reports are scored deterministically:
104
+
105
+ - base score: `0.05`
106
+ - verified correct finding with matching report impact: `+0.60`
107
+ - valid evidence ID in the report: `+0.15`
108
+ - actionable remediation keywords: `+0.15`
109
+ - hallucinated or unverified finding claims: `-0.40` each
110
+ - submitting without verifier validation: `-0.20`
111
+
112
+ Rewards and final scores are clamped to `0.01..0.99` for validator compatibility.
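The rubric above can be recomputed with a short helper; the function and argument names are illustrative, not the server's actual grader API:

```python
def score_report(verified_finding: bool, valid_evidence: bool,
                 actionable_remediation: bool, hallucinated_claims: int,
                 skipped_validation: bool) -> float:
    # Recompute the documented rubric, then clamp for validator compatibility.
    score = 0.05                       # base score
    if verified_finding:
        score += 0.60                  # verified finding with matching impact
    if valid_evidence:
        score += 0.15                  # valid evidence ID cited in the report
    if actionable_remediation:
        score += 0.15                  # actionable remediation keywords
    score -= 0.40 * hallucinated_claims  # each hallucinated/unverified claim
    if skipped_validation:
        score -= 0.20                  # submitted without verifier validation
    return min(0.99, max(0.01, score))
```

A fully correct single-finding report sums to `0.05 + 0.60 + 0.15 + 0.15 = 0.95`, consistent with the baseline scores, while a hallucinated, unvalidated report clamps to `0.01`.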
113
+
114
+ ## Baseline Scores
115
+
116
+ The current deterministic oracle rollout follows the intended evidence -> finding -> validation -> report path for each task. These scores were measured locally against the environment with `seed=7`.
117
+
118
+ | Task id | Baseline type | Steps | Final score | Step rewards |
119
+ | --- | --- | ---: | ---: | --- |
120
+ | `secret_exposure_easy` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
121
+ | `missing_security_headers_medium` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
122
+ | `authz_boundary_hard` | deterministic oracle | 6 | `0.95` | `0.03, 0.05, 0.05, 0.06, 0.11, 0.98` |
123
+
124
+ A hallucinated one-step report scores `0.01`; repeated identical actions hard-stop at a low score.
125
+
126
+ ## Setup
127
+
128
+ From this directory, install dependencies:
129
+
130
+ ```bash
131
+ uv sync
132
+ ```
133
+
134
+ Run the local server:
135
+
136
+ ```bash
137
+ uv run server
138
+ ```
139
+
140
+ Health check:
141
+
142
+ ```bash
143
+ curl http://localhost:8000/health
144
+ ```
145
+
146
+ Then connect with the client:
147
+
148
+ ```python
149
+ from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv
150
+
151
+ with CyberAnalystEnv(base_url="http://localhost:8000").sync() as env:
152
+ result = env.reset(task_id="secret_exposure_easy", seed=7)
153
+ result = env.step(CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}))
154
+ print(result.observation.tool_result)
155
+ ```
156
+
157
+ ## Baseline Inference
158
+
159
+ `inference.py` runs a model-backed baseline over the configured task set and prints strict parser-friendly logs:
160
+
161
+ ```text
162
+ [START] task=<task_id> env=Cyber_analyst model=<model_name>
163
+ [STEP] step=<n> action=<compact_json_action> reward=<0.00> done=<true|false> error=<msg|null>
164
+ [END] task=<task_id> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>
165
+ ```
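A sketch of a consumer for the `[STEP]` lines; it assumes the fields always appear in the documented order, so the compact JSON action can be captured non-greedily up to the ` reward=` delimiter:

```python
import re

# Matches one [STEP] line from the inference log protocol above.
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>[0-9.]+) done=(?P<done>true|false) error=(?P<error>.+)"
)

def parse_step(line: str) -> dict:
    # Return typed fields for a [STEP] line, or {} if the line does not match.
    m = STEP_RE.match(line)
    if not m:
        return {}
    d = m.groupdict()
    return {
        "step": int(d["step"]),
        "action": d["action"],
        "reward": float(d["reward"]),
        "done": d["done"] == "true",
        "error": None if d["error"] == "null" else d["error"],
    }
```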
166
+
167
+ The script uses the OpenAI SDK with Hugging Face Inference Providers by default:
168
+
169
+ ```powershell
170
+ $env:ENV_URL = "http://localhost:8000"
171
+ $env:API_BASE_URL = "https://router.huggingface.co/v1"
172
+ $env:MODEL_NAME = "google/gemma-4-31B-it:fastest"
173
+ $env:HF_TOKEN = "<your-hugging-face-token>"
174
+ python inference.py
175
+ ```
176
+
177
+ Use `$env:TASK_NAME = "<task_id>"` to run one task instead of all three.
178
+
179
+ ## Validation
180
+
181
+ Useful local checks:
182
+
183
+ ```bash
184
+ python -m py_compile server/Cyber_analyst_environment.py inference.py
185
+ python -m pytest tests
186
+ .\.venv\Scripts\openenv.exe validate . --json
187
+ ```
188
+
189
+ ## Docker
190
+
191
+ Build the environment image from this directory:
192
+
193
+ ```bash
194
+ docker build -t cyber-analyst-env:latest -f server/Dockerfile .
195
+ ```
196
+
197
+ Run:
198
+
199
+ ```bash
200
+ docker run -p 8000:8000 cyber-analyst-env:latest
201
+ ```
202
+
203
+ Health check:
204
+
205
+ ```bash
206
+ curl http://localhost:8000/health
207
+ ```
208
+
209
+ ## Deployment
210
+
211
+ Deploy to Hugging Face Spaces with OpenEnv:
212
+
213
+ ```bash
214
+ openenv push --repo-id <your-hf-username>/Cyber_analyst
215
+ ```
216
+
217
+ The deployed Space exposes `/health`, `/docs`, `/ws`, and the optional `/web` interface when web UI support is enabled by the OpenEnv runtime.
218
+
219
+ ## Adding Scenarios
220
+
221
+ Add new safe scenarios in `server/tasks.py` by extending `SCENARIOS` with:
222
+
223
+ - a stable `task_id`
224
+ - synthetic `assets`, `repo`, `logs`, `headers`, and `dependencies` entries
225
+ - `ground_truth_id`, `finding_type`, `required_evidence`, `impact_keywords`, and `remediation_keywords`
226
+
227
+ Then add a grader adapter in `server/graders.py` and a matching `tasks` entry in `openenv.yaml`. Keep all artifacts synthetic, keep correctness deterministic, and avoid adding real targets or arbitrary execution tools.
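A hypothetical shape for one new `SCENARIOS` entry, assuming each scenario is a plain dict keyed by the fields listed above; every value here is a synthetic placeholder, and the real structure is defined in `server/tasks.py`:

```python
# Illustrative scenario entry; all data is synthetic lab content.
NEW_SCENARIO = {
    "task_id": "debug_endpoint_easy",           # stable task ID (hypothetical)
    "assets": [{"service_id": "gateway", "routes": ["/debug"]}],
    "repo": [{"evidence_id": "EVID-201",
              "snippet": "DEBUG_ENDPOINT_ENABLED = true"}],
    "logs": [],
    "headers": {},
    "dependencies": [],
    "ground_truth_id": "GT-debug-endpoint",
    "finding_type": "exposed_debug_endpoint",
    "required_evidence": ["EVID-201"],
    "impact_keywords": ["information disclosure"],
    "remediation_keywords": ["disable", "remove"],
}

# The field names the README requires for every scenario entry.
REQUIRED_KEYS = {
    "task_id", "assets", "repo", "logs", "headers", "dependencies",
    "ground_truth_id", "finding_type", "required_evidence",
    "impact_keywords", "remediation_keywords",
}
assert REQUIRED_KEYS <= NEW_SCENARIO.keys()
```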
__init__.py CHANGED
@@ -1,17 +1,17 @@
1
- # Copyright (c) Meta Platforms, Inc. and affiliates.
2
- # All rights reserved.
3
- #
4
- # This source code is licensed under the BSD-style license found in the
5
- # LICENSE file in the root directory of this source tree.
6
-
7
- """Cyber Analyst Environment."""
8
-
9
- from .client import CyberAnalystEnv
10
- from .models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState
11
-
12
- __all__ = [
13
- "CyberAnalystAction",
14
- "CyberAnalystObservation",
15
- "CyberAnalystState",
16
- "CyberAnalystEnv",
17
- ]
 
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the BSD-style license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ """Cyber Analyst Environment."""
8
+
9
+ from .client import CyberAnalystEnv
10
+ from .models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState
11
+
12
+ __all__ = [
13
+ "CyberAnalystAction",
14
+ "CyberAnalystObservation",
15
+ "CyberAnalystState",
16
+ "CyberAnalystEnv",
17
+ ]
client.py CHANGED
@@ -1,39 +1,39 @@
1
- # Copyright (c) Meta Platforms, Inc. and affiliates.
2
- # All rights reserved.
3
- #
4
- # This source code is licensed under the BSD-style license found in the
5
- # LICENSE file in the root directory of this source tree.
6
-
7
- """Cyber Analyst Environment client."""
8
-
9
- from typing import Any
10
-
11
- from openenv.core import EnvClient
12
- from openenv.core.client_types import StepResult
13
-
14
- from .models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState
15
-
16
-
17
- class CyberAnalystEnv(
18
- EnvClient[CyberAnalystAction, CyberAnalystObservation, CyberAnalystState]
19
- ):
20
- """WebSocket client for the Cyber Analyst OpenEnv environment."""
21
-
22
- def _step_payload(self, action: CyberAnalystAction) -> dict[str, Any]:
23
- return action.model_dump(exclude_none=True)
24
-
25
- def _parse_result(
26
- self, payload: dict[str, Any]
27
- ) -> StepResult[CyberAnalystObservation]:
28
- obs_data = dict(payload.get("observation", {}))
29
- obs_data["done"] = payload.get("done", False)
30
- obs_data["reward"] = payload.get("reward")
31
- observation = CyberAnalystObservation.model_validate(obs_data)
32
- return StepResult(
33
- observation=observation,
34
- reward=payload.get("reward"),
35
- done=payload.get("done", False),
36
- )
37
-
38
- def _parse_state(self, payload: dict[str, Any]) -> CyberAnalystState:
39
- return CyberAnalystState.model_validate(payload)
 
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the BSD-style license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ """Cyber Analyst Environment client."""
8
+
9
+ from typing import Any
10
+
11
+ from openenv.core import EnvClient
12
+ from openenv.core.client_types import StepResult
13
+
14
+ from .models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState
15
+
16
+
17
+ class CyberAnalystEnv(
18
+ EnvClient[CyberAnalystAction, CyberAnalystObservation, CyberAnalystState]
19
+ ):
20
+ """WebSocket client for the Cyber Analyst OpenEnv environment."""
21
+
22
+ def _step_payload(self, action: CyberAnalystAction) -> dict[str, Any]:
23
+ return action.model_dump(exclude_none=True)
24
+
25
+ def _parse_result(
26
+ self, payload: dict[str, Any]
27
+ ) -> StepResult[CyberAnalystObservation]:
28
+ obs_data = dict(payload.get("observation", {}))
29
+ obs_data["done"] = payload.get("done", False)
30
+ obs_data["reward"] = payload.get("reward")
31
+ observation = CyberAnalystObservation.model_validate(obs_data)
32
+ return StepResult(
33
+ observation=observation,
34
+ reward=payload.get("reward"),
35
+ done=payload.get("done", False),
36
+ )
37
+
38
+ def _parse_state(self, payload: dict[str, Any]) -> CyberAnalystState:
39
+ return CyberAnalystState.model_validate(payload)
docs/deep-research-report.md CHANGED
@@ -1,531 +1,531 @@
1
- # OpenEnv Hackathon Execution Pipeline for a Safe Cybersecurity Analyst Environment
2
-
3
- ## Executive decision and scope selection
4
-
5
- **SECTION 1 — Executive Decision**
6
-
7
- [SOURCED] The hackathon’s *validator + judging* constraints strongly favour environments that: (a) simulate a real-world task (not “games/toys”), (b) are fully OpenEnv-compliant, (c) ship with **≥3 tasks with graders**, (d) produce **scores/rewards in the 0–1 range**, (e) include a reproducible `inference.py` that uses the **OpenAI client** (for any LLM calls) and prints strict `[START]/[STEP]/[END]` logs, and (f) run within a ~**20 minute** budget on ~**2 vCPU / 8 GB** infra.
8
-
9
- [INFERENCE] Under these realities, your cybersecurity-analyst direction is *not* automatically the best, but it *can* become a high-probability-to-win choice if—and only if—you narrow to a deterministic, bounded, “investigate → cite evidence → verify → remediate” loop where (i) tools are tightly sandboxed, (ii) graders are deterministic, and (iii) the action space is small enough to be learnable and demo-able under the runtime cap.
10
-
11
- [PROPOSAL] **Decision:** keep the cybersecurity direction, but **narrow aggressively** to a V1 environment that benchmarks **disciplined security triage + evidence-grounded reporting**, not pentesting/exploitation. The V1 I recommend building is:
12
-
13
- [PROPOSAL] **“SecOps Evidence Gym”** — a safe, isolated OpenEnv environment where an agent investigates a *synthetic* microservice “organisation” via a **bounded tool API**, collects **evidence IDs**, validates candidate findings through **deterministic verifiers**, and submits a structured remediation report.
14
-
15
- [SOURCED] This matches strong “winner DNA” seen in `kube-sre-gym` (realistic professional workflow + verification + narrative clarity) while remaining implementable in a hackathon budget.
16
-
17
- [PROPOSAL] **What to cut entirely in V1 (non-negotiable):**
18
- - “Live target” behaviour; no external network targets; no arbitrary HTTP to the internet. 🔒
19
- - Any exploit payload recipes, exploit chains, privilege-escalation playbooks, or “how to hack X”. 🔒
20
- - Arbitrary shell access (`bash`, `kubectl`, `nmap`, etc.) inside the environment. (Action space explosion + safety risk.)
21
- - LLM-only grading/judging for correctness. (Reward hacking + non-determinism.)
22
-
23
- [SOURCED] **What to keep (but narrow):** tool-using investigation, multi-step interaction, and deterministic verification—these are consistent with what OpenEnv is designed to support (typed `reset/step/state`, isolated server, type-safe schemas).
24
-
25
- **SECTION 3 — Candidate Scope Comparison**
26
-
27
- [SOURCED] The scoring below is anchored on hackathon validator requirements (3+ graded tasks, 0–1 scoring, strict inference logging, runtime limits) plus OpenEnv’s scaffolding/CLI/deployment model.
28
-
29
- [PROPOSAL] Weighted criteria (sum=1.00): judging fit 0.14, OpenEnv fit 0.12, grader determinism 0.14, implementation risk 0.12, runtime feasibility 0.10, demoability 0.10, real-world usefulness 0.10, novelty 0.08, training usefulness 0.06, shipping-on-time likelihood 0.04.
30
-
31
- | Candidate scope | Judging fit | OpenEnv fit | Determinism | Impl risk (lower=better) | Runtime | Demo | Real-world use | Novelty | Training use | Ship-on-time | Weighted total |
32
- |---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
33
- | **A. Your original direction:** “disciplined cyber analyst investigating a sandbox” (broad) | 8 | 7 | 6 | 4 | 6 | 8 | 8 | 8 | 8 | 4 | **6.7** |
34
- | **B. Narrow cyber variant (recommended):** evidence-first triage lab with bounded tools + deterministic verifiers + structured report | 9 | 8 | 9 | 7 | 9 | 9 | 8 | 7 | 8 | 8 | **8.4** |
35
- | **C. Adjacent: SRE incident triage (single-turn, deterministic logs → RCA)** | 9 | 8 | 10 | 9 | 10 | 8 | 8 | 5 | 6 | 9 | **8.3** |
36
- | **D. Adjacent: prompt-injection “WAF” benchmark** | 7 | 8 | 8 | 7 | 9 | 9 | 7 | 7 | 6 | 7 | **7.6** |
37
-
38
- [INFERENCE] Candidate C (SRE triage) is extremely validator-safe (many examples already pass deep validation), but it is likely more saturated and less “new” in judging. Candidate B keeps your cybersecurity theme while retaining the determinism and boundedness that the validator and time budget demand, so it is the best balance for **winning + real-world usefulness**.
39
-
40
- **SECTION 4 — Final V1 Problem Statement**
41
-
42
- [PROPOSAL] **One-sentence version:** Build a safe OpenEnv environment that trains and benchmarks an agent to perform **evidence-grounded security triage** on a bounded synthetic system and produce a **remediation-oriented report** without hallucinating.
43
-
44
- [PROPOSAL] **Short pitch (demo-ready):** “SecOps Evidence Gym” gives the model an alert and a constrained set of investigation tools. The model must gather evidence, validate findings with deterministic verifiers, and submit a structured report. Scores reward verified correctness, penalise hallucinated claims and wasted steps, and remain strictly within (0,1) to satisfy the hackathon validator. ✅🔒
45
-
46
- [PROPOSAL] **Precise implementation version:** A FastAPI OpenEnv server exposing typed `reset()`, `step()`, `state()` and a manifest `openenv.yaml` with at least three tasks (easy/medium/hard), each associated with a grader. The environment implements multi-step tool calls inside `step()`, and uses deterministic verifiers plus a strict score clamp to `(0,1)` for validator compatibility.
47
-
48
- ## Winner patterns and judging fit extraction
49
-
50
- **SECTION 2 — Winner Pattern Extraction**
51
-
52
- [SOURCED] `kube-sre-gym` (OpenEnv Hackathon SF winner) demonstrates several patterns that appear strongly aligned with what judges value: a **realistic professional task**, an explicit **multi-step workflow** (triage → investigate → fix → verify), **multi-layer verification** (programmatic health checks + judge), a strong narrative that explains learning dynamics, and deployment on **Hugging Face Spaces**.
53
-
55
-
56
- [SOURCED] Concretely, `kube-sre-gym` highlights: “real cluster/tool interaction”, verification layers to prevent false success, curriculum progression, and strong documentation that makes the environment’s value obvious quickly.
57
-
58
- [SOURCED] A 2026 hackathon submission that explicitly claims Phase-2 deep-validation success (`Incident-Triage-Environment`) reveals particularly transferable “validator-winning” patterns:
59
- - A manifest with `spec_version`, `runtime`, `app`, `port`, and a **`tasks:` list where each task has a `grader:` pointing to importable Python functions**.
60
- - Deterministic graders that clamp outputs to a validator-friendly range.
61
- - An `inference.py` that uses the OpenAI SDK and prints the strict stdout protocol with `[START]`, `[STEP]`, `[END]` lines in a stable key=value format.
62
-
63
- [SOURCED] The official OpenEnv repo reinforces what is transferable: type-safe action/observation/state contracts, containerised isolation, and standard Gym-like APIs. It also explicitly says OpenEnv is experimental and APIs can change, which increases the value of a minimal, validator-first build loop.
64
-
65
- [INFERENCE] Given hackathon evaluation combines programmatic checks with LLM scoring, you must optimise for **deterministic correctness** *and* a compelling narrative/demo.
66
-
67
- [PROPOSAL] Transferable “winner patterns” you should copy (selectively):
68
- - **Strong “professional workflow” framing** (SRE, security analyst, triage) with clear step boundaries.
69
- - **Small, discoverable tool set** that mirrors real practice (logs, config, policy checks) while staying bounded.
70
- - **Deterministic verification** (programmatic checks) as the source of truth for correctness.
71
- - **Narrative traceability**: logs, episode IDs, and a short “watch the agent work” demo.
72
- - **Deployment excellence**: clean Docker build, working `/health`, working `/web` UI if enabled, and reproducible inference.
73
-
74
- [PROPOSAL] What *not* to copy blindly:
75
- - The “real cluster” dependency (e.g., GKE) is high operational burden and can fail the hackathon’s limited infra budget.
76
- - LLM-as-judge for correctness (too easy to reward-hack; non-deterministic). (Use it, at most, for *format/style*, not correctness.)
77
-
78
- ## Core V1 environment design and benchmark tasks
79
-
80
- **SECTION 8 — Core Environment Design**
81
-
82
- **V1 concept (aggressively narrow).**
83
- [PROPOSAL] Your environment is a **synthetic organisation** with a small, fixed topology (three “services” + artefacts). The agent receives an alert. It can only interact via **approved tools** (implemented inside the simulator). It must (a) gather evidence IDs, (b) validate candidate findings, and (c) submit a report.
84
-
85
- **Topology (V1).**
86
- [PROPOSAL] Fixed components (no containers inside containers):
87
- - `gateway` (public entry), `profile-service`, `admin-service`
88
- - `repo_snapshot` (static code/config excerpts)
89
- - `telemetry` (sanitised logs + “header snapshot” + “dependency manifest snapshot”)
90
-
91
- **Reset logic.**
92
- [PROPOSAL] `reset(task_id=..., seed=...)` selects a scenario variant and initialises:
93
- - episode ID, step count
94
- - scenario ground truth (one injected issue per episode in V1)
95
- - tool budgets + “allowed scope” banner
96
- - an evidence registry mapping `EVID-### → artefact snippet`
97
- Return an initial observation containing the alert, the tool catalogue, and an empty “verified findings” list.
98
-
99
- **Randomisation strategy.**
100
- [PROPOSAL] Use seed-driven, deterministic randomisation:
101
- - rename services/routes/IDs (`profile-service` might become `user-profile`),
102
- - shuffle benign log lines around the key evidence,
103
- - vary exact header sets / dependency versions within a small closed set,
104
- - keep each scenario **fully reproducible from the seed**.
105
-
106
- [SOURCED] Benchmark generators (e.g., AMaze) exist specifically to create diverse but controlled environments for evaluating generalisation, supporting the idea of seeded procedural variation rather than a single static scenario.
107
-
108
- **Safety boundaries.**
109
- [PROPOSAL] The sandbox contains **no live targets**, no real secrets, and no outbound network. “Secrets” are synthetic strings with an explicit “DO NOT USE OUTSIDE LAB” marker. Tools return synthetic results only. 🔒
110
-
111
- [SOURCED] NIST’s cyber range guidance emphasises cyber ranges as safe and legal environments for training and assessment; separate research also discusses that cyber ranges themselves have security risks that must be mitigated (e.g., leakage/misuse), reinforcing the need for strict isolation and artefact sanitisation.
112
-
113
- **How state is exposed to the agent.**
114
- [PROPOSAL] Expose only a concise state summary: current phase, step budget remaining, tools remaining, verified findings count, and recent evidence IDs. Keep full ground truth hidden.

**Tool/action design (bounded action space).**
[PROPOSAL] V1 tool list (keep it ≤8 tools):
1) `list_assets()` → returns asset IDs and route IDs
2) `get_log_events(service_id, query)` → returns evidence IDs
3) `check_security_headers(service_id)` → returns evidence IDs + pass/fail list
4) `search_repo(query)` → returns evidence IDs from code snippets
5) `scan_dependencies()` → returns evidence IDs from a lockfile excerpt
6) `create_finding(finding_type, evidence_ids, severity_guess, remediation)` → stores a candidate finding
7) `validate_finding(finding_id)` → deterministic verifier; returns `(verified, matching_gt_id)`
8) `submit_report(report_json)` → terminal action
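Dispatch for this bounded tool set can be a plain allowlist lookup. The two stub tools below are placeholders, and the error-observation shape is an assumption:

```python
def route_tool(tools: dict, tool_name: str, args: dict) -> dict:
    """Bounded dispatch: unknown tools or bad args yield an error observation, never a crash."""
    if tool_name not in tools:
        return {"error": f"unknown tool: {tool_name}", "evidence_ids": []}
    try:
        return tools[tool_name](**args)
    except TypeError as exc:  # malformed args -> structured error, not a server 500
        return {"error": str(exc), "evidence_ids": []}


TOOLS = {
    "list_assets": lambda: {"asset_ids": ["svc-1"], "route_ids": ["/admin"]},
    "search_repo": lambda query: {"evidence_ids": ["EVID-101"] if "key" in query else []},
}
```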

**Anti-loop logic.**
[PROPOSAL] Track action signatures `(tool_name, args_hash)` and:
- apply increasing penalties for repeats,
- hard-stop an episode if identical actions repeat ≥6 times, returning `done=True` with a low score,
- always return a valid observation (never a server crash) to preserve training rollouts.
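A sketch of that signature tracking; the repeat threshold and penalty step follow the proposal, everything else is assumed:

```python
import hashlib
import json
from collections import Counter


class LoopGuard:
    """Tracks (tool_name, args_hash) signatures across an episode."""
    MAX_REPEATS = 6  # hard-stop threshold from the proposal

    def __init__(self):
        self.counts = Counter()

    def record(self, tool_name: str, args: dict):
        args_hash = hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()
        sig = (tool_name, args_hash)
        self.counts[sig] += 1
        repeats = self.counts[sig] - 1
        penalty = -0.03 * repeats                          # increasing penalty for repeats
        hard_stop = self.counts[sig] >= self.MAX_REPEATS   # caller returns done=True, low score
        return penalty, hard_stop
```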

[SOURCED] OpenEnv’s environment-creation guidance strongly implies you should implement robust behaviour around `reset/step/state` with typed contracts and predictable server behaviour.

**SECTION 9 — Tasks / Benchmarks**

[SOURCED] The hackathon requires **at least 3 tasks with graders** and explicitly checks the tasks registry.

[PROPOSAL] V1 ships exactly **3 flagship tasks**, difficulty-tiered, each with deterministic success criteria and intermediate milestones.

**Flagship tasks (easy/medium/hard).**
[PROPOSAL] Each task is a *family* with small seeded variants.

**Easy: Secret exposure in repo snapshot**
- Goal: identify a leaked synthetic API key in a config file excerpt; propose rotation/removal.
- Deterministic success: report includes the correct finding type `secret_exposure`, includes ≥1 correct evidence ID, and remediation mentions rotation + removal.
- Intermediate rewards: `search_repo()` surfaces the evidence ID; `create_finding()` with the correct type gets partial credit; `validate_finding()` confirms.
- False-positive check: claiming *additional* vulnerabilities that were not verified triggers a penalty.
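A deterministic grader for this task could look like the sketch below. The component weights follow the reward section later in this report, while the report/finding schema is an assumption:

```python
def grade_easy(report: dict, verified_ids: set) -> float:
    """Deterministic grading sketch for the secret-exposure task."""
    score = 0.0
    for finding in report.get("findings", []):
        if finding.get("type") == "secret_exposure" and finding.get("id") in verified_ids:
            score += 0.60                                   # verified, correctly typed
            if finding.get("evidence_ids"):
                score += 0.15                               # cites at least one evidence ID
            remediation = finding.get("remediation", "").lower()
            if "rotat" in remediation and "remov" in remediation:
                score += 0.15                               # mentions rotation + removal
        elif finding.get("id") not in verified_ids:
            score -= 0.40                                   # hallucinated finding penalty
    return min(max(score, 0.01), 0.99)                      # strict (0,1) clamp
```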

**Medium: Missing security headers**
- Goal: detect missing/weak security headers in a service “header snapshot”; propose remediation.
- Deterministic success: correct identification of the missing header set (from a fixed list), plus a remediation mapping (e.g., add HSTS, CSP) within the environment’s rubric.
- Intermediate rewards: correct tool usage (`check_security_headers()`), correct mapping to the finding type, successful verifier validation.
- Generalisation: header ordering/extra benign headers vary by seed.

**Hard: Authorisation boundary misconfiguration**
- Goal: detect an access control policy bug in a route/role matrix (modelled safely, without exploitation).
- Deterministic success: evidence IDs must show the policy mismatch; the report must describe impact and remediation (principle of least privilege + policy fix + regression test).
- Intermediate rewards: `list_assets()` + `get_log_events()` reveal the mismatch pattern; candidate finding validated.
- False-positive guardrail: generic “SQLi/RCE” claims are penalised unless evidence supports them (it won’t, by design).

**Stretch tasks (post-V1, not for hackathon critical path).**
[PROPOSAL] Dependency-risk identification (synthetic CVE mapping), error-handling info leak, prioritisation under a strict budget, and multi-finding episodes (2 findings) — but only once the validator-safe V1 is shipped.

## OpenEnv compliance blueprint and repo plan

**SECTION 6 — OpenEnv Compliance Blueprint**

[SOURCED] OpenEnv’s core contract is Gymnasium-like APIs (`reset()`, `step()`, `state()`), with type-safe models, packaged behind a FastAPI server and typically accessed via an EnvClient.

[SOURCED] For environment creators, OpenEnv explicitly supports `openenv init`, and documents a canonical structure: `models.py`, `client.py`, `server/app.py`, `server/<environment>.py`, plus `openenv.yaml` and packaging metadata.

[SOURCED] OpenEnv provides CLI commands including `openenv init` and `openenv push` for deploying to **Hugging Face Spaces**.

[SOURCED] The OpenEnv repo’s environment-building guide demonstrates typed models (Action/Observation/State) as Python dataclasses and a `create_fastapi_app(...)` helper to serve the environment.

[SOURCED] The OpenEnv repo explicitly warns *not* to copy outdated manifest patterns; current examples use `spec_version`, `type`, `runtime`, `app`, `port`.

**Validator-sensitive details you must implement (non-negotiable).**
[PROPOSAL] Based on official requirements + observed validator behaviour:
- Provide `openenv.yaml` with `spec_version: 1`, `name`, `runtime: fastapi`, `app: server.app:app`, `port: <int>`, and a `tasks:` list with **≥3 tasks each having `id`, `description`, `grader`**.
- Ensure each task’s final score is **strictly within (0,1)** to avoid fail-fast validation errors.
- Implement an `inference.py` that prints `[START]/[STEP]/[END]` lines exactly and uses the OpenAI SDK for LLM calls (if any), reading `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`.
- Provide a `/health` endpoint that returns 200 once ready (commonly used in examples and deployment docs).

**Sync vs async.**
[SOURCED] OpenEnv supports async-first clients with a `.sync()` wrapper for synchronous usage. For hackathon inference scripts, synchronous control flow is often simpler and widely used in examples.

**What not to copy from older examples.**
[SOURCED] Some course material shows a simplified `openenv.yaml` (`name/version/description`), but the repo’s skill guidance explicitly warns against outdated manifests; follow the current spec-style manifest used in validated examples.

**SECTION 7 — Repo / File Tree Plan**

[SOURCED] OpenEnv’s scaffold and common community submissions converge on a predictable repository layout and file naming.

[PROPOSAL] Recommended repo structure (submission-ready):

```
secops_evidence_gym/
  openenv.yaml               # REQUIRED: spec_version, runtime, app, port, tasks+graders
  pyproject.toml             # REQUIRED: package metadata + deps
  README.md                  # REQUIRED: judging narrative + quickstart + safety boundaries
  inference.py               # REQUIRED: strict stdout logs + OpenAI client usage
  models.py                  # REQUIRED: typed Action/Observation/State dataclasses
  client.py                  # REQUIRED: EnvClient wrapper (sync + async)
  __init__.py                # REQUIRED: export Env + models for pip install

  server/
    app.py                   # REQUIRED: create_fastapi_app(...) wiring + /health
    environment.py           # REQUIRED: SecOpsEvidenceGymEnvironment(reset/step/state)
    graders.py               # REQUIRED: grade_easy/medium/hard + safe_reward clamp
    tasks.py                 # OPTIONAL (high-leverage): scenario registry + seed sampling
    safety.py                # OPTIONAL (high-leverage): tool allowlist + sanitisation helpers
    requirements.txt         # OPTIONAL (if Docker build uses it)
    Dockerfile               # REQUIRED (practically): HF Spaces docker build

  tests/
    test_api_contract.py     # smoke: reset/step/state doesn’t crash; reward range
    test_graders.py          # unit: deterministic scoring + strict (0,1) clamp
    test_seed_determinism.py # unit: same seed → same evidence IDs
```

[PROPOSAL] Mandatory for hackathon success: `openenv.yaml`, server app wiring, three tasks+graders, Docker build success, `inference.py` with strict logs, and a README that makes the environment’s value obvious in <60 seconds.

## Reward, grading, and anti-hallucination design

**SECTION 10 — Reward Design**

[SOURCED] OpenEnv leaves reward semantics to the environment; you are responsible for correctness scoring and determinism.

[SOURCED] Hackathon validation has shown strict “score must be between 0 and 1 (not 0.0 and not 1.0)” behaviour, and teams clamp rewards (e.g., 0.01–0.99).

[SOURCED] Empirical RL research in other domains (e.g., autonomous racing) shows reward design choices materially affect performance and generalisation, supporting the need for careful shaping rather than a single sparse terminal reward.

[PROPOSAL] **Core principle:** correctness is **verifier-gated**, not language-judged. You can optionally add *format/style* checks, but never allow style to dominate the correctness reward.

### Reward structure (practical V1)

[PROPOSAL] Normalise the final *task score* into `(0,1)` and keep per-step rewards small enough that the summed episode reward stays in `(0,1)` as well (or use only the final reward, depending on your environment semantics). Use a single “score” to satisfy the validator and expose detailed breakdowns in `observation.metadata`.

**Terminal (sparse) components** ✅
[PROPOSAL]
- `+0.60` if at least one ground-truth finding is verified and correctly described (type + impact).
- `+0.15` if the report includes **≥1 valid evidence ID** per finding and those IDs correspond to the right artefacts.
- `+0.15` if remediation is actionable (specific control, config, test).
- `-0.40` per hallucinated/unverified finding claimed in the report.
- `-0.20` if the agent fails to run `validate_finding()` before `submit_report()`.
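Summing these components and clamping can be centralised in one helper; the 0.01/0.99 bounds follow the clamp convention mentioned above:

```python
def safe_reward(raw: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Clamp any raw score into the strictly open (0, 1) range the validator expects."""
    return min(max(raw, lo), hi)


# One verified finding (0.60 + 0.15 + 0.15) minus one hallucinated claim (0.40):
final_score = safe_reward(0.60 + 0.15 + 0.15 - 0.40)
```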

**Intermediate (dense) components** 🧭
[PROPOSAL]
- `+0.02` for discovering a *new* relevant evidence ID (first time only).
- `+0.03` for creating a well-formed candidate finding that references evidence IDs.
- `-0.01` per step (efficiency pressure).
- `-0.03` for repeating the same tool call (exact same args) beyond 2 times.

**False-positive penalties / anti-hallucination** 🧯
[PROPOSAL] A “hallucination” is operationally defined as: the report asserts a finding that is not in the environment’s `verified_findings` list. This is easy to compute deterministically and maps directly to your stated goal (“avoid hallucinating findings”).
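Because the definition is pure set membership, the check is a few lines (report and finding schema assumed):

```python
def count_hallucinations(report: dict, verified_findings: list) -> int:
    """A hallucination = a reported finding whose ID is absent from verified_findings."""
    verified_ids = {f["id"] for f in verified_findings}
    return sum(
        1 for f in report.get("findings", []) if f.get("id") not in verified_ids
    )
```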

### Avoiding reward hacking

[PROPOSAL] Hardening rules:
- Cap rewards from verbosity: extra words do not add points.
- Make evidence IDs required for high scores (prevents purely rhetorical “security speak”).
- Penalise calling `validate_finding()` repeatedly without new evidence.
- Reject “kitchen sink” reporting by penalising extra unverified findings.

### Binary vs shaped reward

[PROPOSAL] **Binary-only** (0/1) is easy to implement but brittle for multi-step tool use; the agent gets no gradient for *how* to investigate efficiently.

[PROPOSAL] **Lightly shaped** (recommended) keeps correctness deterministic while providing enough signal to train the investigation workflow (evidence collection, validation order, loop avoidance). This mirrors the broader lesson from reward engineering research: shaping and tuning can significantly alter learning outcomes.

### Deterministic judge vs hybrid judge

[PROPOSAL]
- **Strict deterministic judge (recommended V1):** all correctness via verifiers + string/structure checks.
- **Hybrid (stretch):** add a small LLM-based style score (e.g., clarity), heavily downweighted (≤0.05 of total) and never affecting pass/fail correctness.

## Baseline inference pipeline and strict stdout logging

**SECTION 11 — Baseline Inference Pipeline**

[SOURCED] Hackathon requirements include: a reproducible `inference.py`, the OpenAI client requirement for LLM calls (using provided env vars), and strict stdout logging.

[SOURCED] A concrete, hackathon-aligned stdout format has been used by validated submissions (example):
- `[START] task=<name> env=<benchmark> model=<model_name>`
- `[STEP] step=<n> action=<str> reward=<0.00> done=<true|false> error=<msg|null>`
- `[END] task=<name> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>`
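The three line shapes can be produced by tiny helpers so the format never drifts; the helper names are assumptions, only the output strings matter:

```python
def log_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def log_step(step: int, action: str, reward: float, done: bool, error=None) -> str:
    done_s = "true" if done else "false"
    err_s = error if error else "null"
    return f"[STEP] step={step} action={action} reward={reward:.2f} done={done_s} error={err_s}"


def log_end(task: str, success: bool, steps: int, score: float, rewards: list) -> str:
    success_s = "true" if success else "false"
    rewards_s = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] task={task} success={success_s} steps={steps} score={score:.2f} rewards={rewards_s}"
```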

[SOURCED] The same example inference uses the OpenAI SDK, reading `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.

### Responsibilities of `inference.py`

[PROPOSAL] `inference.py` should:
- read env vars: `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`, `ENV_URL` (and optionally a `TASK_NAME` override),
- connect to the env via the `.sync()` client,
- run tasks in a fixed order (easy → medium → hard),
- execute a bounded number of steps per task,
- log exactly one `[START]...` per task, one `[END]...` per task, and a `[STEP]...` per environment step,
- always exit with code 0 (even on failures) and log errors in the `[STEP] error=` field to avoid hard crashes.

### Control flow (V1 baseline strategy)

[PROPOSAL] Use a **hybrid baseline** that is reliable under time constraints:
- a scripted tool sequence per task (fast, deterministic),
- one LLM call (optional) to draft the final report from gathered evidence (so the demo shows “agentic reasoning”),
- temperature fixed to 0 for reproducibility (and lower variance).

[SOURCED] Deterministic inference settings like `TEMPERATURE=0.0` are used in competitive OpenEnv hackathon baselines.

### Minimum viable baseline (must ship)

[PROPOSAL] For each task:
1) `reset(task_id=<tier>)`
2) run 2–4 tool calls that are always relevant (e.g., `check_security_headers`, `search_repo`, etc.)
3) `create_finding(...)` using evidence IDs
4) `validate_finding(finding_id)`
5) `submit_report(report_json)`
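The five steps above can be scripted end to end. The client below is assumed to expose `reset`/`step` and return plain dicts, which will not match the real typed client exactly:

```python
def run_baseline_episode(client, task_id: str) -> float:
    """Scripted V1 baseline: gather evidence, create + validate a finding, then report."""
    client.reset(task_id=task_id)
    evidence = []
    for tool, args in [
        ("check_security_headers", {"service_id": "svc-1"}),
        ("search_repo", {"query": "api key"}),
    ]:
        obs = client.step(tool_name=tool, args=args)
        evidence.extend(obs.get("evidence_ids", []))
    obs = client.step(tool_name="create_finding", args={
        "finding_type": "secret_exposure",
        "evidence_ids": evidence,
        "severity_guess": "high",
        "remediation": "rotate the key and remove it from the repo",
    })
    finding_id = obs.get("finding_id")
    client.step(tool_name="validate_finding", args={"finding_id": finding_id})
    final = client.step(tool_name="submit_report",
                        args={"report_json": {"findings": [{"id": finding_id}]}})
    return final.get("reward", 0.0)
```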

### Stronger baseline (only if time permits)

[PROPOSAL] Add one planning LLM call that chooses among tools based on the alert type, but still keep a hard step limit, and always include verifier validation before reporting.

## Complete build, validation, deployment, and submission pipeline

**SECTION 5 — Complete End-to-End Pipeline**

[SOURCED] This pipeline is built to satisfy both OpenEnv conventions (init/push, typed models, FastAPI server) and hackathon validation constraints (tasks/graders, inference logging, runtime budgets).

### Phase goals, deliverables, verification (execution-ready)

[PROPOSAL] The table below is the “do-this-in-order” execution plan. It is intentionally validator-first.

| Phase | Goal | Deliverables | Files touched | Acceptance criteria | Main risks | How to verify |
|---|---|---|---|---|---|---|
| Scope lock | Freeze V1 to 3 tasks + bounded tools | 1-page spec + non-goals | README.md | No pentest/exploit scope; 3 tasks defined | Scope creep | Manual checklist |
| Scaffold | Generate OpenEnv skeleton | Working importable package | openenv.yaml, models.py, client.py, server/* | `python -c "import ..."` succeeds | Wrong template/paths | Local import smoke test |
| Environment core | Implement reset/step/state; tool router | Simulator runs end-to-end | server/environment.py | reset+step returns typed observation; no crashes | Action validation crashes | manual `curl` + python client |
| Tasks + graders | Implement 3 graders + strict (0,1) clamp | `grade_easy/medium/hard` | server/graders.py, openenv.yaml | tasks discoverable; scores strictly in (0,1) | Validator fail-fast | unit tests + manual checks |
| Baseline inference | Make inference reproducible + strict logs | inference.py | inference.py | prints correct `[START]/[STEP]/[END]` | log-parser failure | run script locally |
| Local validation | Run OpenEnv build & validate | passes `openenv validate` | Dockerfile, server/app.py | validate passes locally | port mismatch | `openenv validate --url ...` |
| Docker + HF | Deploy to Spaces | live endpoint | openenv push output | `/health` 200; reset+step works remotely | HF port/env mismatch | curl + python client |
| Submission | Final narrative + demo | polished README + screenshots | README.md | demo works in <2 min | unclear story | run “demo script” |

### Concrete build plan with commands

[SOURCED] OpenEnv supports `openenv init` and `openenv push` and documents this as the standard creator workflow.
[SOURCED] The OpenEnv course also provides a grounded dev loop: `uv sync`, `uv run server`, `curl /health`, and Docker build/run commands.

[PROPOSAL] Commands (copy/paste order):

1) **Scaffold**
```
pip install openenv-core
openenv init secops_evidence_gym
cd secops_evidence_gym
```
[SOURCED] `openenv init` is the documented way to scaffold a new environment.

2) **Local dev install + run**
```
uv sync
uv run server
curl http://localhost:8000/health
```
[SOURCED] `uv run server` and `/health` checks are part of the recommended iteration loop in OpenEnv course materials.

3) **Implement core files (edit)**
- `models.py`: define `Action/Observation/State` dataclasses
- `server/environment.py`: implement reset/step/state + tool routing
- `server/graders.py`: implement `grade_easy/grade_medium/grade_hard` + `safe_reward()`
- `openenv.yaml`: add `tasks:` with grader import paths

[SOURCED] OpenEnv’s environment-building guide explicitly directs you to define models and implement `reset/step/state`, then wire a FastAPI app.
[SOURCED] A validator-aligned `openenv.yaml` with `spec_version`, `runtime`, `app`, `port`, and `tasks` exists in deep-validation passing examples.

4) **Build + validate (local)**
```
openenv build
openenv validate --verbose
```
[SOURCED] `openenv build` and `openenv validate` are part of OpenEnv’s recommended validation workflow.

5) **Docker build/run smoke test**
```
docker build -t secops-evidence-gym:latest -f server/Dockerfile .
docker run -p 8000:8000 secops-evidence-gym:latest
curl http://localhost:8000/health
```
[SOURCED] This `docker build -f server/Dockerfile .` pattern is directly shown in OpenEnv deployment course material.

6) **Run inference locally**
```
export HF_TOKEN="..."
export API_BASE_URL="..."
export MODEL_NAME="..."
export ENV_URL="http://localhost:8000"
python inference.py
```
[SOURCED] These env var names and OpenAI SDK usage are consistent with hackathon guidance and existing inference implementations.

7) **Deploy to Hugging Face Spaces**
```
openenv push --repo-id <your-hf-username>/secops-evidence-gym
```
[SOURCED] `openenv push` is described as the fastest path to deploy to **Hugging Face Spaces**.

### Testing and validation plan (high-signal)

[SOURCED] OpenEnv stresses predictable API behaviour and type-safe contracts; hackathon validation is fail-fast.

[PROPOSAL] Test layers (in priority order):
- **API contract smoke tests:** reset/step/state return valid JSON; never crash on an invalid tool name (should return an observation with an error field).
- **Grader tests:** for each task, verify (a) correctness cases score high, (b) hallucination cases score low, (c) the score is always ∈ (0,1).
- **Seed determinism tests:** the same `seed` produces the same evidence IDs and the same verifier outputs.
- **Runtime test:** run `inference.py` end-to-end and assert wall-clock < 2 minutes locally; assume < 20 minutes on grader infra even with cold starts.
- **Reward sanity tests:** ensure reward increases monotonically with verified correctness; fail if verbosity alone increases reward.
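The first few layers reduce to reusable predicates that the test files can assert on; the callable signatures here are assumptions:

```python
def reward_in_open_range(grade, reports) -> bool:
    """Grader outputs must lie strictly inside (0, 1) for every case."""
    return all(0.0 < grade(report) < 1.0 for report in reports)


def seed_is_deterministic(make_scenario, seeds) -> bool:
    """Building the same seed twice must yield identical scenarios."""
    return all(make_scenario(s) == make_scenario(s) for s in seeds)
```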

## Submission packaging, execution roadmap, real-world usefulness, and failure modes

**SECTION 14 — README / Demo / Submission Narrative**
[SOURCED] Judges likely assess both the environment’s technical correctness (programmatic checks) and qualitative merit (LLM scoring / narrative).

[PROPOSAL] README structure that “feels like a winner” 🏆:
- **Hero block:** one-paragraph pitch + why it’s real-world + safety claim.
- **Two-minute demo:** copy/paste commands + expected output snippet with `[START]/[STEP]/[END]`.
- **Environment contract:** action schema, observation schema, task list.
- **Grading:** explain deterministic verifiers + hallucination penalties.
- **Safety & isolation:** explicit exclusions (no egress, no shell, synthetic artefacts).
- **Real-world relevance:** how this benchmark’s evidence-first reporting maps to security workflows (triage, evidence, remediation).
- **Screenshots:** web UI (optional) + an evidence trace + one scored report example.

**SECTION 15 — Project Management Plan**
[PROPOSAL] Day-by-day (assuming a hackathon-style sprint):

- **Day 0 (scope lock + scaffold):** environment skeleton, `openenv.yaml` with 3 tasks, stub graders returning 0.5 (clamped), server runs locally.
- **Day 1 (determinism + validator):** implement the scenario generator, evidence registry, verifiers, and strict (0,1) scoring; pass `openenv validate`.
- **Day 2 (baseline + polish):** implement `inference.py` strict logs; deploy to Spaces; polish README + demo artefacts.

[PROPOSAL] Critical path: `openenv.yaml` tasks+graders → grader clamp `(0,1)` → inference stdout format → Docker+Spaces deployment. (Everything else is secondary.)

**SECTION 16 — Real-World Usefulness Plan**
[SOURCED] NIST’s testing guide emphasises planning, conducting tests, analysing findings, and developing mitigation strategies; your environment’s “evidence → remediation” focus aligns with that lifecycle without requiring offensive exploitation.

[PROPOSAL] Who would care after the hackathon:
- security engineering teams evaluating agentic “triage + reporting” reliability,
- LLM tooling teams wanting benchmarks for **non-hallucinating, evidence-grounded** outputs,
- training teams building safe cyber ranges (without weaponisation).

[PROPOSAL] Post-hackathon upgrades (highest leverage):
- export trajectories as JSONL for offline training,
- add more scenario families (still safe) and a held-out split for generalisation,
- integrate with RL trainers (e.g., TRL’s OpenEnv integration) to show real training curves.

[SOURCED] PenGym provides evidence that the realism/faithfulness of environments can affect transfer and stability when moving from simulation to more realistic settings, so you should roadmap a “higher fidelity mode” (still safe) later, not in V1.

**SECTION 17 — Why the naive version would fail**
[PROPOSAL] Top failure patterns (and why they kill submissions):
- Too broad (full cyber range, live services): fails time/infra constraints.
- Fuzzy grading (LLM-only judging): non-deterministic, easy to game.
- Unbounded tools (shell/network): unsafe + untrainable action space.
- Scores at exactly 0.0 or 1.0: fail-fast “out of range” validator.
- Inference logs not parseable: phase-1 failure even if the env is good.
- Port / health issues on Spaces: container “works locally” but fails remotely.

**SECTION 18 — Final Recommendation**

[PROPOSAL] **What should you build?**
Build **SecOps Evidence Gym**: a deterministic, safe, sandbox-only cyber analyst environment focused on evidence collection, verifier validation, and remediation reporting.

[PROPOSAL] **What should V1 include? (minimum winning set)**
- OpenEnv-compliant FastAPI env with typed models and `reset/step/state`.
- `openenv.yaml` with **3 tasks + graders**.
- Deterministic verifiers + a strict score clamp to `(0,1)`.
- Baseline `inference.py` with strict `[START]/[STEP]/[END]` logging + OpenAI SDK usage for any LLM calls.
- HF Spaces deployment with a working `/health`.

[PROPOSAL] **What should you cut?**
- Any real pentesting/offensive content, any arbitrary command execution, any live targets, any correctness scoring via an LLM judge.

[PROPOSAL] **Top 5 implementation decisions that matter most**
1) Validator-safe `openenv.yaml` tasks+graders wiring.
2) Score-range compliance: clamp to `(0,1)` everywhere.
3) Strict stdout format in `inference.py`.
4) Deterministic verifiers as the source of truth.
5) Bounded tool set (≤8 tools) with anti-loop penalties.

[PROPOSAL] **Minimum viable winning submission**
A V1 with 3 tasks, deterministic graders, bounded tools, strict inference logging, and a polished README + demo trace.

[PROPOSAL] **Minimum viable real-world useful submission**
The same V1, plus: seed determinism, trajectory export, and a clear “how to add new scenarios” contributor guide.

[PROPOSAL] **If you only have time for 20% of the ambition, do this exact 20%:**
- Implement **one** robust multi-step loop (tools → validate → report)
- Implement **exactly 3** tasks (easy/medium/hard)
- Make graders deterministic and validator-safe
- Make deployment + inference bulletproof
Everything else is stretch.

**Confidence (my estimate): 8.4/10** ✅🔥

## Sources and credibility ratings (with exact links)

[SOURCED] Ratings are my judgement of authority + relevance for this hackathon context (0–10). URLs are provided verbatim in code form.

### Tier 1 (official OpenEnv + hackathon dashboard)
- Credibility **9.5/10** — `https://github.com/meta-pytorch/OpenEnv`
- Credibility **9.0/10** — `https://github.com/meta-pytorch/OpenEnv/blob/main/envs/README.md`
- Credibility **8.5/10** — `https://github.com/meta-pytorch/OpenEnv/blob/main/.claude/skills/generate-openenv-env/SKILL.md`
- Credibility **9.0/10** — `https://www.scaler.com/school-of-technology/meta-pytorch-hackathon/dashboard`

### Tier 2 (strong community exemplars)
- Credibility **8.5/10** — `https://github.com/sid-rp/kube-sre-gym`
- Credibility **8.0/10** — `https://huggingface.co/openenv-community`
- Credibility **7.5/10** — `https://github.com/Harikishanth/Incident-Triage-Environment`

### Tier 3 (peer-reviewed / primary references for design constraints)
- Credibility **8.5/10** — PenGym (Computers & Security, open access): `https://www.sciencedirect.com/science/article/pii/S0167404824004450`
- Credibility **8.0/10** — Reward design + generalisation (Scientific Reports, 2025): `https://www.nature.com/articles/s41598-025-27702-6`
- Credibility **8.5/10** — AMaze (JOSS, 2025): `https://joss.theoj.org/papers/10.21105/joss.07208`
- Credibility **9.5/10** — NIST SP 800-115: `https://csrc.nist.gov/pubs/sp/800/115/final`
- Credibility **9.0/10** — NIST “Cyber Range: A Guide” (PDF landing): `https://www.nist.gov/document/cyber-range`
- Credibility **7.5/10** — “Cybersecurity of Cyber Ranges: Threats and Mitigations” (IJISR, 2022 PDF): `https://infonomics-society.org/wp-content/uploads/Cybersecurity-of-Cyber-Ranges.pdf`

### Tier 4 (useful validator “ground truth” signals from the field)
- Credibility **6.5/10** — Validator failure-mode discussion (score must be strictly between 0 and 1): `https://www.reddit.com/r/pytorch/comments/1shi767/meta_x_pytorch_x_sst_x_openenv_hackathon_phase_2/`
- Credibility **7.0/10** — Strict logging format reference via a verified submission’s `inference.py`: `https://github.com/Harikishanth/Incident-Triage-Environment/blob/main/inference.py`

### Uploaded reference you provided
- Credibility **7.0/10** (useful as a design draft; not independently authoritative) — `deep-research-report (2).md`
 
# OpenEnv Hackathon Execution Pipeline for a Safe Cybersecurity Analyst Environment

## Executive decision and scope selection

**SECTION 1 — Executive Decision**

[SOURCED] The hackathon’s *validator + judging* constraints strongly favour environments that: (a) simulate a real-world task (not “games/toys”), (b) are fully OpenEnv-compliant, (c) ship with **≥3 tasks with graders**, (d) produce **scores/rewards in the 0–1 range**, (e) include a reproducible `inference.py` that uses the **OpenAI client** (for any LLM calls) and prints strict `[START]/[STEP]/[END]` logs, and (f) run within a ~**20 minute** budget on ~**2 vCPU / 8 GB** infra. citeturn3view6turn3view7turn22view1turn22view2

[INFERENCE] Under these realities, your cybersecurity-analyst direction is *not* automatically the best, but it *can* become a high-probability-to-win choice if—and only if—you narrow to a deterministic, bounded, “investigate → cite evidence → verify → remediate” loop where (i) tools are tightly sandboxed, (ii) graders are deterministic, and (iii) the action space is small enough to be learnable and demo-able under the runtime cap.

[PROPOSAL] **Decision:** keep the cybersecurity direction, but **narrow aggressively** to a V1 environment that benchmarks **disciplined security triage + evidence-grounded reporting**, not pentesting/exploitation. The V1 I recommend building is:

[PROPOSAL] **“SecOps Evidence Gym”** — a safe, isolated OpenEnv environment where an agent investigates a *synthetic* microservice “organisation” via a **bounded tool API**, collects **evidence IDs**, validates candidate findings through **deterministic verifiers**, and submits a structured remediation report.

[SOURCED] This matches strong “winner DNA” seen in `kube-sre-gym` (realistic professional workflow + verification + narrative clarity) while remaining implementable in a hackathon budget. citeturn10view0turn18view0

[PROPOSAL] **What to cut entirely in V1 (non-negotiable):**
- “Live target” behaviour; no external network targets; no arbitrary HTTP to the internet. 🔒
- Any exploit payload recipes, exploit chains, privilege-escalation playbooks, or “how to hack X”. 🔒
- Arbitrary shell access (`bash`, `kubectl`, `nmap`, etc.) inside the environment. (Action space explosion + safety risk.)
- LLM-only grading/judging for correctness. (Reward hacking + non-determinism.)

[SOURCED] **What to keep (but narrow):** tool-using investigation, multi-step interaction, and deterministic verification—these are consistent with what OpenEnv is designed to support (typed `reset/step/state`, isolated server, type-safe schemas). citeturn18view0turn19search1

**SECTION 3 — Candidate Scope Comparison**

[SOURCED] The scoring below is anchored on hackathon validator requirements (3+ graded tasks, 0–1 scoring, strict inference logging, runtime limits) plus OpenEnv’s scaffolding/CLI/deployment model. citeturn3view6turn18view0turn22view1

[PROPOSAL] Weighted criteria (sum=1.00): judging fit 0.14, OpenEnv fit 0.12, grader determinism 0.14, implementation risk 0.12, runtime feasibility 0.10, demoability 0.10, real-world usefulness 0.10, novelty 0.08, training usefulness 0.06, shipping-on-time likelihood 0.04.

| Candidate scope | Judging fit | OpenEnv fit | Determinism | Impl risk (lower=better) | Runtime | Demo | Real-world use | Novelty | Training use | Ship-on-time | Weighted total |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| **A. Your original direction:** “disciplined cyber analyst investigating a sandbox” (broad) | 8 | 7 | 6 | 4 | 6 | 8 | 8 | 8 | 8 | 4 | **6.7** |
| **B. Narrow cyber variant (recommended):** evidence-first triage lab with bounded tools + deterministic verifiers + structured report | 9 | 8 | 9 | 7 | 9 | 9 | 8 | 7 | 8 | 8 | **8.4** |
| **C. Adjacent: SRE incident triage (single-turn, deterministic logs → RCA)** | 9 | 8 | 10 | 9 | 10 | 8 | 8 | 5 | 6 | 9 | **8.3** |
| **D. Adjacent: prompt-injection “WAF” benchmark** | 7 | 8 | 8 | 7 | 9 | 9 | 7 | 7 | 6 | 7 | **7.6** |

[INFERENCE] Candidate C (SRE triage) is extremely validator-safe (many examples already pass deep validation), but it is likely more saturated and less “new” in judging. Candidate B keeps your cybersecurity theme while retaining the determinism and boundedness that the validator and time budget demand, so it is the best balance for **winning + real-world usefulness**.

**SECTION 4 — Final V1 Problem Statement**

[PROPOSAL] **One-sentence version:** Build a safe OpenEnv environment that trains and benchmarks an agent to perform **evidence-grounded security triage** on a bounded synthetic system and produce a **remediation-oriented report** without hallucinating.

[PROPOSAL] **Short pitch (demo-ready):** “SecOps Evidence Gym” gives the model an alert and a constrained set of investigation tools. The model must gather evidence, validate findings with deterministic verifiers, and submit a structured report. Scores reward verified correctness, penalise hallucinated claims and wasted steps, and remain strictly within (0,1) to satisfy the hackathon validator. ✅🔒

[PROPOSAL] **Precise implementation version:** A FastAPI OpenEnv server exposing typed `reset()`, `step()`, `state()` and a manifest `openenv.yaml` with at least three tasks (easy/medium/hard), each associated with a grader. The environment implements multi-step tool calls inside `step()`, and uses deterministic verifiers plus a strict score clamp to `(0,1)` for validator compatibility. citeturn18view0turn19search1turn23view0turn26view0turn27search0

## Winner patterns and judging fit extraction

**SECTION 2 — Winner Pattern Extraction**

[SOURCED] `kube-sre-gym` (OpenEnv Hackathon SF winner) demonstrates several patterns that appear strongly aligned with what judges value: a **realistic professional task**, an explicit **multi-step workflow** (triage → investigate → fix → verify), **multi-layer verification** (programmatic health checks + judge), a strong narrative that explains learning dynamics, and deployment on **Hugging Face Spaces**. citeturn10view0turn14view0

[SOURCED] Concretely, `kube-sre-gym` highlights: “real cluster/tool interaction”, verification layers to prevent false success, curriculum progression, and strong documentation that makes the environment’s value obvious quickly. citeturn10view0

[SOURCED] A 2026 hackathon submission that explicitly claims Phase-2 deep-validation success (`Incident-Triage-Environment`) reveals particularly transferable “validator-winning” patterns:
- A manifest with `spec_version`, `runtime`, `app`, `port`, and a **`tasks:` list where each task has a `grader:` pointing to importable Python functions**. citeturn23view0
- Deterministic graders that clamp outputs to a validator-friendly range. citeturn26view0turn26view3
- An `inference.py` that uses the OpenAI SDK and prints the strict stdout protocol with `[START]`, `[STEP]`, `[END]` lines in a stable key=value format. citeturn22view1turn22view2turn22view4

[SOURCED] The official OpenEnv repo reinforces what is transferable: type-safe action/observation/state contracts, containerised isolation, and standard Gym-like APIs. It also explicitly says OpenEnv is experimental and APIs can change, which increases the value of a minimal, validator-first build loop. citeturn18view0turn19search2

[INFERENCE] Given hackathon evaluation combines programmatic checks with LLM scoring, you must optimise for **deterministic correctness** *and* a compelling narrative/demo. citeturn3view7turn3view6

[PROPOSAL] Transferable “winner patterns” you should copy (selectively):
- **Strong “professional workflow” framing** (SRE, security analyst, triage) with clear step boundaries.
- **Small, discoverable tool set** that mirrors real practice (logs, config, policy checks) while staying bounded.
- **Deterministic verification** (programmatic checks) as the source of truth for correctness.
- **Narrative traceability**: logs, episode IDs, and a short “watch the agent work” demo.
- **Deployment excellence**: clean Docker build, working `/health`, working `/web` UI if enabled, and reproducible inference.

[PROPOSAL] What *not* to copy blindly:
- The “real cluster” dependency (e.g., GKE) is high operational burden and can fail the hackathon’s limited infra budget. citeturn10view0turn3view6
- LLM-as-judge for correctness (too easy to reward-hack; non-deterministic). (Use it, at most, for *format/style*, not correctness.)

## Core V1 environment design and benchmark tasks

**SECTION 8 — Core Environment Design**

**V1 concept (aggressively narrow).**
[PROPOSAL] Your environment is a **synthetic organisation** with a small, fixed topology (three “services” + artefacts). The agent receives an alert. It can only interact via **approved tools** (implemented inside the simulator). It must (a) gather evidence IDs, (b) validate candidate findings, and (c) submit a report.

**Topology (V1).**
[PROPOSAL] Fixed components (no containers inside containers):
- `gateway` (public entry), `profile-service`, `admin-service`
- `repo_snapshot` (static code/config excerpts)
- `telemetry` (sanitised logs + “header snapshot” + “dependency manifest snapshot”)

**Reset logic.**
[PROPOSAL] `reset(task_id=..., seed=...)` selects a scenario variant and initialises:
- episode ID, step count
- scenario ground truth (one injected issue per episode in V1)
- tool budgets + “allowed scope” banner
- an evidence registry mapping `EVID-### → artefact snippet`
Return an initial observation containing the alert, the tool catalogue, and an empty “verified findings” list.
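
The reset contract above can be sketched as a small deterministic initialiser. The names here (`EpisodeState`, `evidence_registry`, the hash-derived episode ID) are illustrative assumptions, not OpenEnv API:

```python
import hashlib
import random
from dataclasses import dataclass, field

@dataclass
class EpisodeState:
    """Illustrative episode state created by reset(); field names are assumptions."""
    episode_id: str
    task_id: str
    seed: int
    step_count: int = 0
    tool_budget: int = 20
    evidence_registry: dict = field(default_factory=dict)   # EVID-### -> artefact snippet
    verified_findings: list = field(default_factory=list)

def reset(task_id: str, seed: int) -> EpisodeState:
    # Deterministic episode ID derived from task + seed, so reruns are reproducible.
    episode_id = hashlib.sha256(f"{task_id}:{seed}".encode()).hexdigest()[:12]
    rng = random.Random(seed)  # all scenario variation must flow through this RNG
    state = EpisodeState(episode_id=episode_id, task_id=task_id, seed=seed)
    # One injected issue per episode in V1: register its synthetic evidence snippet.
    evid_id = f"EVID-{rng.randint(100, 999)}"
    state.evidence_registry[evid_id] = "synthetic artefact snippet (lab data only)"
    return state
```

The same `(task_id, seed)` pair always yields the same episode ID and evidence registry, which is what the verifiers and determinism tests later depend on.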
98

**Randomisation strategy.**
[PROPOSAL] Use seed-driven, deterministic randomisation:
- rename services/routes/IDs (`profile-service` might become `user-profile`),
- shuffle benign log lines around the key evidence,
- vary exact header sets / dependency versions within a small closed set,
- keep each scenario **fully reproducible from the seed**.
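
A minimal sketch of that seeded variation, assuming hypothetical closed sets (`SERVICE_ALIASES`, `BENIGN_HEADERS`); the real scenario families would define their own:

```python
import random

# Hypothetical closed sets; purely illustrative lab data.
SERVICE_ALIASES = ["profile-service", "user-profile", "accounts-svc"]
BENIGN_HEADERS = ["X-Request-Id", "Via", "Accept-Ranges"]

def make_variant(seed: int) -> dict:
    """Derive a scenario variant deterministically from the seed."""
    rng = random.Random(seed)
    return {
        "service_name": rng.choice(SERVICE_ALIASES),        # renamed per seed
        "extra_headers": sorted(rng.sample(BENIGN_HEADERS, k=2)),  # benign noise
        "log_shuffle_order": rng.sample(range(5), k=5),     # reorders benign log lines
    }
```

Because all variation flows through `random.Random(seed)`, the same seed always reproduces the same variant, which keeps graders and tests deterministic.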
105

[SOURCED] Benchmark generators (e.g., AMaze) exist specifically to create diverse but controlled environments for evaluating generalisation, supporting the idea of seeded procedural variation rather than a single static scenario. citeturn16search7turn16search1

**Safety boundaries.**
[PROPOSAL] The sandbox contains **no live targets**, no real secrets, and no outbound network. “Secrets” are synthetic strings with an explicit “DO NOT USE OUTSIDE LAB” marker. Tools return synthetic results only. 🔒

[SOURCED] NIST’s cyber range guidance emphasises cyber ranges as safe and legal environments for training and assessment; separate research also discusses that cyber ranges themselves have security risks that must be mitigated (e.g., leakage/misuse), reinforcing the need for strict isolation and artefact sanitisation. citeturn29search1turn29search2

**How state is exposed to the agent.**
[PROPOSAL] Expose only a concise state summary: current phase, step budget remaining, tools remaining, verified findings count, and recent evidence IDs. Keep full ground truth hidden.

**Tool/action design (bounded action space).**
[PROPOSAL] V1 tool list (keep it ≤8 tools):
1) `list_assets()` → returns asset IDs and route IDs
2) `get_log_events(service_id, query)` → returns evidence IDs
3) `check_security_headers(service_id)` → returns evidence IDs + pass/fail list
4) `search_repo(query)` → returns evidence IDs from code snippets
5) `scan_dependencies()` → returns evidence IDs from a lockfile excerpt
6) `create_finding(finding_type, evidence_ids, severity_guess, remediation)` → stores candidate finding
7) `validate_finding(finding_id)` → deterministic verifier; returns `(verified, matching_gt_id)`
8) `submit_report(report_json)` → terminal action
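
A minimal sketch of the bounded tool router behind that list, assuming the dataclass-based action contract; the handler outputs are placeholder synthetic data, and unknown tool names return an error observation rather than raising:

```python
from dataclasses import dataclass, field

@dataclass
class CyberAnalystAction:
    tool_name: str
    args: dict = field(default_factory=dict)

# Allowlist dispatch: only approved tools exist; everything else is rejected safely.
TOOL_HANDLERS = {
    "list_assets": lambda args: {"assets": ["gateway", "profile-service", "admin-service"]},
    "search_repo": lambda args: {"evidence_ids": ["EVID-101"], "query": args.get("query", "")},
}

def route(action: CyberAnalystAction) -> dict:
    handler = TOOL_HANDLERS.get(action.tool_name)
    if handler is None:
        # Never crash the server: report the bad tool name in the observation.
        return {"error": f"unknown tool: {action.tool_name}"}
    return handler(action.args)
```

The allowlist is the safety boundary: a request for `bash` or `nmap` simply is not in `TOOL_HANDLERS`, so it can never execute anything.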
126

**Anti-loop logic.**
[PROPOSAL] Track action signatures `(tool_name, args_hash)` and:
- apply increasing penalties for repeats,
- hard-stop an episode if identical actions repeat ≥6 times, returning `done=True` with a low score,
- always return a valid observation (never a server crash) to preserve training rollouts.
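
The signature-tracking idea can be sketched as a small guard object; the penalty magnitude and repeat threshold below are illustrative, not tuned values:

```python
import hashlib
import json
from collections import Counter

class LoopGuard:
    """Tracks (tool_name, args_hash) signatures; penalises repeats and hard-stops loops."""

    def __init__(self, max_repeats: int = 6):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def record(self, tool_name: str, args: dict) -> tuple:
        # Canonical signature: sorted-key JSON makes the hash argument-order independent.
        sig = hashlib.sha256(
            f"{tool_name}:{json.dumps(args, sort_keys=True)}".encode()
        ).hexdigest()
        self.counts[sig] += 1
        repeats = self.counts[sig] - 1          # first call is free
        penalty = -0.01 * repeats               # increasing penalty per exact repeat
        done = self.counts[sig] >= self.max_repeats  # hard-stop the episode
        return penalty, done
```

The environment would add `penalty` to the step reward and, when `done` is True, end the episode with a low score instead of letting the rollout spin.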
132

[SOURCED] OpenEnv’s environment-creation guidance strongly implies you should implement robust behaviour around `reset/step/state` with typed contracts and predictable server behaviour. citeturn19search1turn18view0

**SECTION 9 — Tasks / Benchmarks**

[SOURCED] The hackathon requires **at least 3 tasks with graders** and explicitly checks the tasks registry. citeturn3view6turn27search0

[PROPOSAL] V1 ships exactly **3 flagship tasks**, difficulty-tiered, each with deterministic success criteria and intermediate milestones.

**Flagship tasks (easy/medium/hard).**
[PROPOSAL] Each task is a *family* with small seeded variants.

**Easy: Secret exposure in repo snapshot**
- Goal: identify a leaked synthetic API key in a config file excerpt; propose rotation/removal.
- Deterministic success: report includes the correct finding type `secret_exposure`, includes ≥1 correct evidence ID, and remediation mentions rotation + removal.
- Intermediate rewards: `search_repo()` surfaces the evidence ID; `create_finding()` with correct type gets partial credit; `validate_finding()` confirms.
- False-positive check: claiming *additional* vulnerabilities not verified triggers penalty.
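
A sketch of what the deterministic grader for this easy task could look like; the component weights, clamp bounds, and report/ground-truth field names are illustrative assumptions:

```python
def grade_easy(report: dict, ground_truth: dict) -> float:
    """Deterministic grader sketch for the secret-exposure task; weights illustrative."""
    score = 0.0
    findings = report.get("findings", [])
    correct = [f for f in findings if f.get("type") == "secret_exposure"]
    if correct:
        score += 0.60  # correct finding type identified
        if set(correct[0].get("evidence_ids", [])) & set(ground_truth["evidence_ids"]):
            score += 0.15  # cites at least one correct evidence ID
        remediation = correct[0].get("remediation", "").lower()
        if "rotat" in remediation and "remov" in remediation:
            score += 0.15  # remediation mentions rotation + removal
    # Penalise each hallucinated extra finding.
    score -= 0.40 * max(0, len(findings) - len(correct))
    # Clamp strictly inside (0, 1) for the fail-fast validator.
    return min(0.99, max(0.01, score))
```

A fully correct report scores 0.90; a report that only asserts an unverified claim bottoms out at the 0.01 clamp.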
149

**Medium: Missing security headers**
- Goal: detect missing/weak security headers in a service “header snapshot”; propose remediation.
- Deterministic success: correct missing header set identification (from a fixed list), plus remediation mapping (e.g., add HSTS, CSP) within the environment’s rubric.
- Intermediate rewards: correct tool usage (`check_security_headers()`), correct mapping to finding type, successful verifier validation.
- Generalisation: header ordering/extra benign headers vary by seed.

**Hard: Authorisation boundary misconfiguration**
- Goal: detect an access control policy bug in a route/role matrix (modelled safely, without exploitation).
- Deterministic success: evidence IDs must show the policy mismatch; report must describe impact and remediation (principle of least privilege + policy fix + regression test).
- Intermediate rewards: `list_assets()` + `get_log_events()` reveal the mismatch pattern; candidate finding validated.
- False-positive guardrail: generic “SQLi/RCE” claims penalised unless evidence supports (it won’t, by design).

**Stretch tasks (post-V1, not for hackathon critical path).**
[PROPOSAL] Dependency-risk identification (synthetic CVE mapping), error-handling info leak, prioritisation under strict budget, and multi-finding episodes (2 findings) — but only once the validator-safe V1 is shipped.

## OpenEnv compliance blueprint and repo plan

**SECTION 6 — OpenEnv Compliance Blueprint**

[SOURCED] OpenEnv’s core contract is Gymnasium-like APIs (`reset()`, `step()`, `state()`), with type-safe models, packaged behind a FastAPI server and typically accessed via an EnvClient. citeturn18view0turn19search1

[SOURCED] For environment creators, OpenEnv explicitly supports `openenv init`, and documents a canonical structure: `models.py`, `client.py`, `server/app.py`, `server/<environment>.py`, plus `openenv.yaml` and packaging metadata. citeturn18view0turn18view1

[SOURCED] OpenEnv provides CLI commands including `openenv init` and `openenv push` for deploying to **Hugging Face Spaces**. citeturn18view0turn17view0

[SOURCED] The OpenEnv repo’s environment-building guide demonstrates typed models (Action/Observation/State) as Python dataclasses and a `create_fastapi_app(...)` helper to serve the environment. citeturn19search1

[SOURCED] The OpenEnv repo explicitly warns *not* to copy outdated manifest patterns; current examples use `spec_version`, `type`, `runtime`, `app`, `port`. citeturn19search2turn23view0

**Validator-sensitive details you must implement (non-negotiable).**
[PROPOSAL] Based on official requirements + observed validator behaviour:
- Provide `openenv.yaml` with `spec_version: 1`, `name`, `runtime: fastapi`, `app: server.app:app`, `port: <int>`, and a `tasks:` list with **≥3 tasks each having `id`, `description`, `grader`**. citeturn23view0turn19search2
- Ensure each task’s final score is **strictly within (0,1)** to avoid fail-fast validation errors. citeturn27search0turn26view0
- Implement an `inference.py` that prints `[START]/[STEP]/[END]` lines exactly and uses the OpenAI SDK for LLM calls (if any), reading `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`. citeturn3view6turn22view1turn22view2
- Provide a `/health` endpoint that returns 200 once ready (commonly used in examples and deployment docs). citeturn17view0turn20view0
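
Putting those manifest requirements together, a sketch of what this environment’s `openenv.yaml` could look like; the exact field shapes and the `module:function` grader path syntax are assumptions based on the cited examples, so verify against the current spec before shipping:

```yaml
spec_version: 1
name: secops-evidence-gym
runtime: fastapi
app: server.app:app
port: 8000
tasks:
  - id: secret_exposure_easy
    description: Find a synthetic leaked API key in a repo snapshot; propose rotation plus removal.
    grader: server.graders:grade_easy
  - id: missing_security_headers_medium
    description: Detect missing HSTS/CSP headers in a synthetic gateway header snapshot.
    grader: server.graders:grade_medium
  - id: authz_boundary_hard
    description: Detect an admin route role-policy mismatch without exploitation.
    grader: server.graders:grade_hard
```

The three task IDs mirror the easy/medium/hard families above, satisfying the ≥3-tasks-with-graders requirement.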
185

**Sync vs async.**
[SOURCED] OpenEnv supports async-first clients with a `.sync()` wrapper for synchronous usage. For hackathon inference scripts, synchronous control flow is often simpler and widely used in examples. citeturn18view0turn22view4

**What not to copy from older examples.**
[SOURCED] Some course material shows a simplified `openenv.yaml` (`name/version/description`), but the repo’s skill guidance explicitly warns against outdated manifests; follow the current spec-style manifest used in validated examples. citeturn19search2turn19search11turn23view0

**SECTION 7 — Repo / File Tree Plan**

[SOURCED] OpenEnv’s scaffold and common community submissions converge on a predictable repository layout and file naming. citeturn18view0turn20view0turn23view0

[PROPOSAL] Recommended repo structure (submission-ready):

```
secops_evidence_gym/
  openenv.yaml             # REQUIRED: spec_version, runtime, app, port, tasks+graders
  pyproject.toml           # REQUIRED: package metadata + deps
  README.md                # REQUIRED: judging narrative + quickstart + safety boundaries
  inference.py             # REQUIRED: strict stdout logs + OpenAI client usage
  models.py                # REQUIRED: typed Action/Observation/State dataclasses
  client.py                # REQUIRED: EnvClient wrapper (sync + async)
  __init__.py              # REQUIRED: export Env + models for pip install

  server/
    app.py                 # REQUIRED: create_fastapi_app(...) wiring + /health
    environment.py         # REQUIRED: SecOpsEvidenceGymEnvironment(reset/step/state)
    graders.py             # REQUIRED: grade_easy/medium/hard + safe_reward clamp
    tasks.py               # OPTIONAL (high-leverage): scenario registry + seed sampling
    safety.py              # OPTIONAL (high-leverage): tool allowlist + sanitisation helpers
    requirements.txt       # OPTIONAL (if Docker build uses it)
    Dockerfile             # REQUIRED (practically): HF Spaces docker build

  tests/
    test_api_contract.py     # smoke: reset/step/state doesn’t crash; reward range
    test_graders.py          # unit: deterministic scoring + strict (0,1) clamp
    test_seed_determinism.py # unit: same seed → same evidence IDs
```

[PROPOSAL] Mandatory for hackathon success: `openenv.yaml`, server app wiring, three tasks+graders, Docker build success, `inference.py` with strict logs, and a README that makes the environment’s value obvious in <60 seconds.

## Reward, grading, and anti-hallucination design

**SECTION 10 — Reward Design**

[SOURCED] OpenEnv leaves reward semantics to the environment; you are responsible for correctness scoring and determinism. citeturn18view0turn19search1

[SOURCED] Hackathon validation has shown strict “score must be between 0 and 1 (not 0.0 and not 1.0)” behaviour, and teams clamp rewards (e.g., 0.01–0.99). citeturn27search0turn26view0

[SOURCED] Empirical RL research in other domains (e.g., autonomous racing) shows reward design choices materially affect performance and generalisation, supporting the need for careful shaping rather than a single sparse terminal reward. citeturn15view2

[PROPOSAL] **Core principle:** correctness is **verifier-gated**, not language-judged. You can optionally add *format/style* checks, but never allow style to dominate correctness reward.

### Reward structure (practical V1)

[PROPOSAL] Normalise the final *task score* into `(0,1)` and keep per-step rewards small enough that summed episode reward stays in `(0,1)` as well (or only final reward is used, depending on your environment semantics). Use a single “score” to satisfy the validator and expose detailed breakdowns in `observation.metadata`.
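
That normalisation step is a one-liner worth getting exactly right; a sketch, with the 0.01–0.99 bounds taken from the clamping practice cited above and the `final_score` helper as an assumed convenience:

```python
def safe_reward(raw: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Clamp any grader output strictly inside (0, 1) for the fail-fast validator."""
    return min(hi, max(lo, raw))

def final_score(components: dict) -> tuple:
    """Sum named reward components, clamp the total, keep the breakdown for metadata."""
    total = sum(components.values())
    return safe_reward(total), {"breakdown": components}
```

Exposing the unclamped `breakdown` in `observation.metadata` keeps the single validator-facing score simple while preserving debuggability.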
240

**Terminal (sparse) components** ✅
[PROPOSAL]
- `+0.60` if at least one ground-truth finding is verified and correctly described (type + impact).
- `+0.15` if the report includes **≥1 valid evidence ID** per finding and those IDs correspond to the right artefacts.
- `+0.15` if remediation is actionable (specific control, config, test).
- `-0.40` per hallucinated/unverified finding claimed in the report.
- `-0.20` if the agent fails to run `validate_finding()` before `submit_report()`.

**Intermediate (dense) components** 🧭
[PROPOSAL]
- `+0.02` for discovering a *new* relevant evidence ID (first time only).
- `+0.03` for creating a well-formed candidate finding that references evidence IDs.
- `-0.01` per step (efficiency pressure).
- `-0.03` for repeating the same tool call (exact same args) beyond 2 times.

**False-positive penalties / anti-hallucination** 🧯
[PROPOSAL] A “hallucination” is operationally defined as: the report asserts a finding that is not in the environment’s `verified_findings` list. This is easy to compute deterministically and maps directly to your stated goal (“avoid hallucinating findings”).
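
That operational definition reduces to a set-membership check; a sketch, assuming findings carry an `id` field and reusing the `-0.40`-per-claim penalty from the terminal components:

```python
def hallucinated_findings(report: dict, verified_findings: list) -> list:
    """Findings asserted in the report but absent from the environment's verified list."""
    verified = set(verified_findings)
    return [f["id"] for f in report.get("findings", []) if f.get("id") not in verified]

def hallucination_penalty(report: dict, verified_findings: list,
                          per_claim: float = 0.40) -> float:
    # Deterministic: no judge model, just membership against environment state.
    return -per_claim * len(hallucinated_findings(report, verified_findings))
```

Because the check runs against environment-held state rather than report text, verbose or persuasive phrasing cannot evade it.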
258

### Avoiding reward hacking

[PROPOSAL] Hardening rules:
- Cap rewards from verbosity: extra words do not add points.
- Make evidence IDs required for high scores (prevents purely rhetorical “security speak”).
- Penalise calling `validate_finding()` repeatedly without new evidence.
- Reject “kitchen sink” reporting by penalising extra unverified findings.

### Binary vs shaped reward

[PROPOSAL] **Binary-only** (0/1) will be easy to implement but brittle for multi-step tool use; the agent gets no gradient for *how* to investigate efficiently.

[PROPOSAL] **Lightly shaped** (recommended) keeps correctness deterministic while providing enough signal to train investigation workflow (evidence collection, validation order, loop avoidance). This mirrors the broader lesson from reward engineering research: shaping and tuning can significantly alter learning outcomes. citeturn15view2

### Deterministic judge vs hybrid judge

[PROPOSAL]
- **Strict deterministic judge (recommended V1):** all correctness via verifiers + string/structure checks.
- **Hybrid (stretch):** add a small LLM-based style score (e.g., clarity), heavily downweighted (≤0.05 of total) and never affecting pass/fail correctness.

## Baseline inference pipeline and strict stdout logging

**SECTION 11 — Baseline Inference Pipeline**

[SOURCED] Hackathon requirements include: a reproducible `inference.py`, the OpenAI client requirement for LLM calls (using provided env vars), and strict stdout logging. citeturn3view6

[SOURCED] A concrete, hackathon-aligned stdout format has been used by validated submissions (example):
- `[START] task=<name> env=<benchmark> model=<model_name>`
- `[STEP] step=<n> action=<str> reward=<0.00> done=<true|false> error=<msg|null>`
- `[END] task=<name> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>` citeturn22view1turn22view2
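
Because the validator parses these lines, it pays to centralise the formatting; a sketch of three helpers that emit the format above (the lowercase booleans, `error=null`, and two-decimal floats follow the cited examples, but verify against the current validator before relying on them):

```python
from typing import Optional

def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)

def log_step(step: int, action: str, reward: float, done: bool,
             error: Optional[str]) -> None:
    done_s = "true" if done else "false"
    print(f"[STEP] step={step} action={action} reward={reward:.2f} "
          f"done={done_s} error={error or 'null'}", flush=True)

def log_end(task: str, success: bool, steps: int, score: float,
            rewards: list) -> None:
    success_s = "true" if success else "false"
    rewards_s = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] task={task} success={success_s} steps={steps} "
          f"score={score:.2f} rewards={rewards_s}", flush=True)
```

Routing every log line through these helpers makes it impossible for one ad hoc `print` elsewhere in `inference.py` to break the parser.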
289

[SOURCED] The same example inference uses the OpenAI SDK, reading `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`. citeturn22view1turn22view4

### Responsibilities of `inference.py`

[PROPOSAL] `inference.py` should:
- read env vars: `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`, `ENV_URL` (and optionally `TASK_NAME` override),
- connect to the env via `.sync()` client,
- run tasks in a fixed order (easy → medium → hard),
- execute a bounded number of steps per task,
- log exactly one `[START]...` per task, one `[END]...` per task, and a `[STEP]...` per environment step,
- always exit with code 0 (even on failures) and log errors in the `[STEP] error=` field to avoid hard crashes.

### Control flow (V1 baseline strategy)

[PROPOSAL] Use a **hybrid baseline** that is reliable under time constraints:
- scripted tool sequence per task (fast, deterministic),
- one LLM call (optional) to draft the final report from gathered evidence (so the demo shows “agentic reasoning”),
- temperature fixed to 0 for reproducibility (and lower variance).

[SOURCED] Deterministic inference settings like `TEMPERATURE=0.0` are used in competitive OpenEnv hackathon baselines. citeturn20view0turn22view4

### Minimum viable baseline (must ship)

[PROPOSAL] For each task:
1) `reset(task_id=<tier>)`
2) run 2–4 tool calls that are always relevant (e.g., `check_security_headers`, `search_repo`, etc.)
3) `create_finding(...)` using evidence IDs
4) `validate_finding(finding_id)`
5) `submit_report(report_json)`
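
The five steps above can be sketched against a duck-typed client; the `reset`/`step` call shape and the response dictionary keys here are assumptions, since the real EnvClient surface may differ:

```python
def run_easy_task(client) -> float:
    """Scripted baseline for the easy task: reset, investigate, validate, report."""
    client.reset(task_id="secret_exposure_easy", seed=0)
    # 1-2) investigate with an always-relevant tool call
    search = client.step("search_repo", {"query": "api key"})
    evid = search["observation"].get("evidence_ids", [])
    # 3) candidate finding built from the gathered evidence IDs
    created = client.step("create_finding", {
        "finding_type": "secret_exposure",
        "evidence_ids": evid,
        "severity_guess": "high",
        "remediation": "rotate the key and remove it from the repository",
    })
    finding_id = created["observation"]["finding_id"]
    # 4) verifier gate before reporting (avoids the no-validation penalty)
    client.step("validate_finding", {"finding_id": finding_id})
    # 5) terminal action
    final = client.step("submit_report", {"report_json": {
        "findings": [{"type": "secret_exposure", "evidence_ids": evid}],
    }})
    return final["reward"]
```

Keeping the sequence scripted makes the baseline deterministic; an optional LLM call could replace only the hard-coded remediation text without touching the control flow.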
319

### Stronger baseline (only if time permits)

[PROPOSAL] Add one planning LLM call that chooses among tools based on the alert type, but still keep a hard step limit, and always include verifier validation before reporting.

## Complete build, validation, deployment, and submission pipeline

**SECTION 5 — Complete End-to-End Pipeline**

[SOURCED] This pipeline is built to satisfy both OpenEnv conventions (init/push, typed models, FastAPI server) and hackathon validation constraints (tasks/graders, inference logging, runtime budgets). citeturn18view0turn19search2turn3view6turn22view1

### Phase goals, deliverables, verification (execution-ready)

[PROPOSAL] The table below is the “do-this-in-order” execution plan. It is intentionally validator-first.

| Phase | Goal | Deliverables | Files touched | Acceptance criteria | Main risks | How to verify |
|---|---|---|---|---|---|---|
| Scope lock | Freeze V1 to 3 tasks + bounded tools | 1-page spec + non-goals | README.md | No pentest/exploit scope; 3 tasks defined | Scope creep | Manual checklist |
| Scaffold | Generate OpenEnv skeleton | Working importable package | openenv.yaml, models.py, client.py, server/* | `python -c "import ..."` succeeds | Wrong template/paths | Local import smoke test |
| Environment core | Implement reset/step/state; tool router | Simulator runs end-to-end | server/environment.py | reset+step returns typed observation; no crashes | Action validation crashes | manual `curl` + python client |
| Tasks + graders | Implement 3 graders + strict (0,1) clamp | `grade_easy/medium/hard` | server/graders.py, openenv.yaml | tasks discoverable; scores strictly in (0,1) | Validator fail-fast | unit tests + manual checks |
| Baseline inference | Make inference reproducible + strict logs | inference.py | inference.py | prints correct `[START]/[STEP]/[END]` | log-parser failure | run script locally |
| Local validation | Run OpenEnv build & validate | passes `openenv validate` | Dockerfile, server/app.py | validate passes locally | port mismatch | `openenv validate --url ...` |
| Docker + HF | Deploy to Spaces | live endpoint | openenv push output | `/health` 200; reset+step works remotely | HF port/env mismatch | curl + python client |
| Submission | Final narrative + demo | polished README + screenshots | README.md | demo works in <2 min | unclear story | run “demo script” |

### Concrete build plan with commands

[SOURCED] OpenEnv supports `openenv init` and `openenv push` and documents this as the standard creator workflow. citeturn18view0turn17view0
[SOURCED] The OpenEnv course also provides a grounded dev loop: `uv sync`, `uv run server`, `curl /health`, and Docker build/run commands. citeturn17view0

[PROPOSAL] Commands (copy/paste order):

1) **Scaffold**
```bash
pip install openenv-core
openenv init secops_evidence_gym
cd secops_evidence_gym
```
[SOURCED] `openenv init` is the documented way to scaffold a new environment. citeturn18view0turn18view2

2) **Local dev install + run**
```bash
uv sync
uv run server
curl http://localhost:8000/health
```
[SOURCED] `uv run server` and `/health` checks are part of the recommended iteration loop in OpenEnv course materials. citeturn17view0

3) **Implement core files (edit)**
- `models.py`: define `Action/Observation/State` dataclasses
- `server/environment.py`: implement reset/step/state + tool routing
- `server/graders.py`: implement `grade_easy/grade_medium/grade_hard` + `safe_reward()`
- `openenv.yaml`: add `tasks:` with grader import paths
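
For the `models.py` step, a sketch of the observation and state dataclasses (the action contract is shown earlier in this report); the field names are assumptions chosen to match the state-exposure design, not the OpenEnv spec:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class CyberAnalystObservation:
    """What the agent sees after each step; metadata carries score breakdowns."""
    message: str
    evidence_ids: list = field(default_factory=list)
    reward: float = 0.0
    done: bool = False
    metadata: dict = field(default_factory=dict)

@dataclass
class CyberAnalystState:
    """Concise state summary; full ground truth stays hidden server-side."""
    episode_id: str
    step_count: int = 0
    steps_remaining: int = 20
    verified_findings_count: int = 0
```

Typed dataclasses like these are what the FastAPI wiring serialises, which is how the type-safe contract the OpenEnv docs describe gets enforced in practice.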
373

[SOURCED] OpenEnv’s environment-building guide explicitly directs you to define models and implement `reset/step/state`, then wire a FastAPI app. citeturn19search1
[SOURCED] A validator-aligned `openenv.yaml` with `spec_version`, `runtime`, `app`, `port`, and `tasks` exists in deep-validation passing examples. citeturn23view0

4) **Build + validate (local)**
```bash
openenv build
openenv validate --verbose
```
[SOURCED] `openenv build` and `openenv validate` are part of OpenEnv’s recommended validation workflow. citeturn19search2

5) **Docker build/run smoke test**
```bash
docker build -t secops-evidence-gym:latest -f server/Dockerfile .
docker run -p 8000:8000 secops-evidence-gym:latest
curl http://localhost:8000/health
```
[SOURCED] This `docker build -f server/Dockerfile .` pattern is directly shown in OpenEnv deployment course material. citeturn17view0

6) **Run inference locally**
```bash
export HF_TOKEN="..."
export API_BASE_URL="..."
export MODEL_NAME="..."
export ENV_URL="http://localhost:8000"
python inference.py
```
[SOURCED] These env var names and OpenAI SDK usage are consistent with hackathon guidance and existing inference implementations. citeturn3view6turn22view4

7) **Deploy to Hugging Face Spaces**
```bash
openenv push --repo-id <your-hf-username>/secops-evidence-gym
```
[SOURCED] `openenv push` is described as the fastest path to deploy to **Hugging Face Spaces**. citeturn17view0turn18view0

### Testing and validation plan (high-signal)

[SOURCED] OpenEnv stresses predictable API behaviour and type-safe contracts; hackathon validation is fail-fast. citeturn18view0turn27search0

[PROPOSAL] Test layers (in priority order):
- **API contract smoke tests:** reset/step/state return valid JSON; never crash on invalid tool name (should return an observation with an error field).
- **Grader tests:** for each task, verify (a) correctness cases score high, (b) hallucination cases score low, (c) score always ∈ (0,1).
- **Seed determinism tests:** same `seed` produces same evidence IDs and same verifier outputs.
- **Runtime test:** run `inference.py` end-to-end and assert wall-clock < 2 minutes locally; assume < 20 minutes on grader infra even with cold starts. citeturn3view6turn22view4
- **Reward sanity tests:** ensure reward increases monotonically with verified correctness; fails if verbosity alone increases reward.
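
The grader-range and seed-determinism layers can be sketched as plain-assert, pytest-compatible tests; `fake_grade` here is a stand-in for the real graders, with illustrative weights:

```python
import random

def fake_grade(report: dict) -> float:
    """Stand-in grader: rewards any finding, then clamps strictly inside (0, 1)."""
    raw = 0.9 if report.get("findings") else -0.5
    return min(0.99, max(0.01, raw))

def test_score_range():
    # Both correct-ish and empty reports must land strictly inside (0, 1).
    for report in ({"findings": [{"type": "secret_exposure"}]}, {"findings": []}):
        score = fake_grade(report)
        assert 0.0 < score < 1.0

def test_seed_determinism():
    # Same seed must reproduce the same scenario draw, per the randomisation strategy.
    a = random.Random(42).sample(range(100), k=5)
    b = random.Random(42).sample(range(100), k=5)
    assert a == b
```

Running these under `pytest` (or even bare `python`) before `openenv validate` catches the two most common fail-fast errors, out-of-range scores and non-reproducible seeds, while they are still cheap to fix.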
418
+
## Submission packaging, execution roadmap, real-world usefulness, and failure modes

**SECTION 14 — README / Demo / Submission Narrative**
[SOURCED] Judges likely assess both the environment's technical correctness (programmatic checks) and its qualitative merit (LLM scoring / narrative).

[PROPOSAL] README structure that "feels like a winner" 🏆:
- **Hero block:** a one-paragraph pitch + why it's real-world + the safety claim.
- **Two-minute demo:** copy/paste commands + an expected-output snippet with `[START]/[STEP]/[END]`.
- **Environment contract:** action schema, observation schema, task list.
- **Grading:** explain the deterministic verifiers + hallucination penalties.
- **Safety & isolation:** explicit exclusions (no egress, no shell, synthetic artefacts).
- **Real-world relevance:** how this benchmark's evidence-based reporting maps to security workflows (triage, evidence, remediation).
- **Screenshots:** the web UI (optional) + an evidence trace + one scored report example.

**SECTION 15 — Project Management Plan**
[PROPOSAL] Day-by-day (assuming a hackathon-style sprint):

- **Day 0 (scope lock + scaffold):** environment skeleton, `openenv.yaml` with 3 tasks, stub graders returning 0.5 (clamped), server runs locally.
- **Day 1 (determinism + validator):** implement the scenario generator, evidence registry, verifiers, and strict (0, 1) scoring; pass `openenv validate`.
- **Day 2 (baseline + polish):** implement strict logging in `inference.py`; deploy to Spaces; polish the README + demo artefacts.

[PROPOSAL] Critical path: `openenv.yaml` tasks + graders → grader clamp to `(0, 1)` → inference stdout format → Docker + Spaces deployment. (Everything else is secondary.)

**SECTION 16 — Real-World Usefulness Plan**
[SOURCED] NIST's testing guide emphasises planning, conducting tests, analysing findings, and developing mitigation strategies; this environment's "evidence → remediation" focus aligns with that lifecycle without requiring offensive exploitation.

[PROPOSAL] Who would care after the hackathon:
- security engineering teams evaluating the reliability of agentic "triage + reporting",
- LLM tooling teams wanting benchmarks for **non-hallucinating, evidence-grounded** outputs,
- training teams building safe cyber ranges (without weaponisation).

[PROPOSAL] Post-hackathon upgrades (highest leverage):
- export trajectories as JSONL for offline training,
- add more scenario families (still safe) and a held-out split for generalisation,
- integrate with RL trainers (e.g., TRL's OpenEnv integration) to show real training curves.

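The JSONL trajectory export upgrade can be sketched as follows; `export_trajectory` and the record fields are illustrative assumptions, not an existing API.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def export_trajectory(steps: list[dict[str, Any]], path: Path) -> None:
    # One JSON object per line, so trainers can stream episodes lazily.
    with path.open("w", encoding="utf-8") as handle:
        for record in steps:
            handle.write(json.dumps(record, sort_keys=True) + "\n")


steps = [
    {"step": 1, "tool_name": "search_repo", "reward": 0.10},
    {"step": 2, "tool_name": "submit_report", "reward": 0.85},
]
out_path = Path(tempfile.mkdtemp()) / "trajectory.jsonl"
export_trajectory(steps, out_path)
```

One file per episode keeps the export trivially shardable for offline RL or SFT pipelines.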
[SOURCED] PenGym provides evidence that the realism and faithfulness of an environment can affect transfer and stability when moving from simulation to more realistic settings, so roadmap a "higher-fidelity mode" (still safe) for later, not for V1.

**SECTION 17 — Why the naive version would fail**
[PROPOSAL] Top failure patterns (and why they kill submissions):
- Too broad (a full cyber range, live services): fails time/infra constraints.
- Fuzzy grading (LLM-only judging): non-deterministic and easy to game.
- Unbounded tools (shell/network): unsafe and an untrainable action space.
- Scores at exactly 0.0 or 1.0: trip the fail-fast "out of range" validator.
- Inference logs that are not parseable: a phase-1 failure even if the env is good.
- Port / health issues on Spaces: the container "works locally" but fails remotely.

**SECTION 18 — Final Recommendation**

[PROPOSAL] **What should you build?**
Build **SecOps Evidence Gym**: a deterministic, safe, sandbox-only cyber analyst environment focused on evidence collection, verifier validation, and remediation reporting.

[PROPOSAL] **What should V1 include? (minimum winning set)**
- An OpenEnv-compliant FastAPI env with typed models and `reset`/`step`/`state`.
- `openenv.yaml` with **3 tasks + graders**.
- Deterministic verifiers + a strict score clamp to `(0, 1)`.
- A baseline `inference.py` with strict `[START]/[STEP]/[END]` logging + OpenAI SDK usage for any LLM calls.
- HF Spaces deployment with a working `/health`.

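The strict logging contract can be machine-checked. This sketch's regex mirrors the `[STEP]` line format emitted by this repo's `inference.py`; the sample line itself is an illustrative example.

```python
import re

# One capture group per field of the strict [STEP] log line.
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) "
    r"action=(?P<action>\{.*\}) "
    r"reward=(?P<reward>-?\d+\.\d{2}) "
    r"done=(?P<done>true|false) "
    r"error=(?P<error>.+)"
)

line = (
    '[STEP] step=3 action={"args":{},"tool_name":"scan_dependencies"} '
    "reward=0.10 done=false error=null"
)
match = STEP_RE.fullmatch(line)
assert match is not None
assert int(match.group("step")) == 3
assert match.group("done") == "false"
```

Running every emitted line through a parser like this in CI catches phase-1 log-format failures before the graders do.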
[PROPOSAL] **What should you cut?**
- Any real pentesting/offensive content, any arbitrary command execution, any live targets, and any correctness scoring via an LLM judge.

[PROPOSAL] **Top 5 implementation decisions that matter most**
1) Validator-safe `openenv.yaml` tasks + graders wiring.
2) Score-range compliance: clamp to `(0, 1)` everywhere.
3) A strict stdout format in `inference.py`.
4) Deterministic verifiers as the source of truth.
5) A bounded tool set (≤ 8 tools) with anti-loop penalties.

[PROPOSAL] **Minimum viable winning submission**
A V1 with 3 tasks, deterministic graders, bounded tools, strict inference logging, and a polished README + demo trace.

[PROPOSAL] **Minimum viable real-world useful submission**
The same V1, plus seed determinism, trajectory export, and a clear "how to add new scenarios" contributor guide.

[PROPOSAL] **If you only have time for 20% of the ambition, do exactly this 20%:**
- Implement **one** robust multi-step loop (tools → validate → report).
- Implement **exactly 3** tasks (easy/medium/hard).
- Make the graders deterministic and validator-safe.
- Make deployment + inference bulletproof.

Everything else is stretch.

**Confidence (my estimate): 8.4/10** ✅🔥

## Sources and credibility ratings (with exact links)

[SOURCED] Ratings are my judgement of authority + relevance for this hackathon context (0–10). URLs are provided verbatim in code form.

### Tier 1 (official OpenEnv + hackathon dashboard)
- Credibility **9.5/10** — `https://github.com/meta-pytorch/OpenEnv`
- Credibility **9.0/10** — `https://github.com/meta-pytorch/OpenEnv/blob/main/envs/README.md`
- Credibility **8.5/10** — `https://github.com/meta-pytorch/OpenEnv/blob/main/.claude/skills/generate-openenv-env/SKILL.md`
- Credibility **9.0/10** — `https://www.scaler.com/school-of-technology/meta-pytorch-hackathon/dashboard`

### Tier 2 (strong community exemplars)
- Credibility **8.5/10** — `https://github.com/sid-rp/kube-sre-gym`
- Credibility **8.0/10** — `https://huggingface.co/openenv-community`
- Credibility **7.5/10** — `https://github.com/Harikishanth/Incident-Triage-Environment`

### Tier 3 (peer-reviewed / primary references for design constraints)
- Credibility **8.5/10** — PenGym (Computers & Security, open access): `https://www.sciencedirect.com/science/article/pii/S0167404824004450`
- Credibility **8.0/10** — Reward design + generalisation (Scientific Reports, 2025): `https://www.nature.com/articles/s41598-025-27702-6`
- Credibility **8.5/10** — AMaze (JOSS, 2025): `https://joss.theoj.org/papers/10.21105/joss.07208`
- Credibility **9.5/10** — NIST SP 800-115: `https://csrc.nist.gov/pubs/sp/800/115/final`
- Credibility **9.0/10** — NIST "Cyber Range: A Guide" (PDF landing): `https://www.nist.gov/document/cyber-range`
- Credibility **7.5/10** — "Cybersecurity of Cyber Ranges: Threats and Mitigations" (IJISR, 2022 PDF): `https://infonomics-society.org/wp-content/uploads/Cybersecurity-of-Cyber-Ranges.pdf`

### Tier 4 (useful validator "ground truth" signals from the field)
- Credibility **6.5/10** — Validator failure-mode discussion (score must be strictly between 0 and 1): `https://www.reddit.com/r/pytorch/comments/1shi767/meta_x_pytorch_x_sst_x_openenv_hackathon_phase_2/`
- Credibility **7.0/10** — Strict logging format reference via a verified submission's `inference.py`: `https://github.com/Harikishanth/Incident-Triage-Environment/blob/main/inference.py`

### Uploaded reference you provided
- Credibility **7.0/10** (useful as a design draft; not independently authoritative) — `deep-research-report (2).md`
inference.py CHANGED
#!/usr/bin/env python3
"""Model-backed baseline inference for the Cyber Analyst OpenEnv environment."""

from __future__ import annotations

import json
import os
import sys
import textwrap
from dataclasses import dataclass
from pathlib import Path
from typing import Any

from openai import OpenAI

PACKAGE_PARENT = Path(__file__).resolve().parent.parent
if str(PACKAGE_PARENT) not in sys.path:
    sys.path.insert(0, str(PACKAGE_PARENT))

from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv, CyberAnalystObservation


ENV_NAME = "Cyber_analyst"
ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "google/gemma-4-31B-it:fastest")
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.0"))
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "512"))
MAX_STEPS = int(os.getenv("MAX_STEPS", "12"))
SEED = int(os.getenv("SEED", "7"))
TASK_IDS = [
    "secret_exposure_easy",
    "missing_security_headers_medium",
    "authz_boundary_hard",
]

SYSTEM_PROMPT = textwrap.dedent(
    """
    You are running a bounded Cyber Analyst benchmark. You may only choose one
    tool call from the provided tool catalog per turn. All evidence is synthetic;
    do not request shell access, live network access, or external investigation.

    Return exactly one compact JSON object and no surrounding text:
    {"tool_name":"<tool name>","args":{...}}

    To complete an episode, first discover relevant evidence, then create and
    validate a finding, then submit a report_json with findings that include
    finding_type, evidence_ids, impact, and remediation.
    """
).strip()


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    model_name: str
    temperature: float
    max_tokens: int


class ModelActionError(RuntimeError):
    """Raised when the model cannot provide a valid benchmark action."""


def build_llm_config() -> LLMConfig:
    return LLMConfig(
        base_url=API_BASE_URL,
        model_name=MODEL_NAME,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
    )


def build_openai_client() -> OpenAI:
    """Return an OpenAI-compatible client for the Hugging Face Router."""

    return OpenAI(base_url=API_BASE_URL, api_key=os.environ["HF_TOKEN"])


def single_line(value: str) -> str:
    return " ".join(str(value).split())


def action_to_log(action: CyberAnalystAction) -> str:
    payload = {"tool_name": action.tool_name, "args": action.args}
    return single_line(json.dumps(payload, sort_keys=True, separators=(",", ":")))


def log_start(task_id: str, llm_config: LLMConfig) -> None:
    print(
        f"[START] task={task_id} env={ENV_NAME} model={llm_config.model_name}",
        flush=True,
    )


def log_step(
    step: int, action: CyberAnalystAction, reward: float | None, done: bool, error: str
) -> None:
    reward_value = 0.0 if reward is None else float(reward)
    error_value = single_line(error) if error else "null"
    print(
        f"[STEP] step={step} action={action_to_log(action)} "
        f"reward={reward_value:.2f} done={str(done).lower()} error={error_value}",
        flush=True,
    )


def log_end(task_id: str, success: bool, steps: int, score: float, rewards: list[float]) -> None:
    rewards_text = ",".join(f"{reward:.2f}" for reward in rewards)
    print(
        f"[END] task={task_id} success={str(success).lower()} "
        f"steps={steps} score={score:.2f} rewards={rewards_text}",
        flush=True,
    )


def observation_payload(obs: CyberAnalystObservation) -> dict[str, Any]:
    return {
        "task_id": obs.task_id,
        "alert": obs.alert,
        "phase": obs.phase,
        "tool_catalog": obs.tool_catalog,
        "tool_result": obs.tool_result,
        "evidence_ids": obs.evidence_ids,
        "candidate_findings": obs.candidate_findings,
        "verified_findings": obs.verified_findings,
        "step_budget_remaining": obs.step_budget_remaining,
        "score_breakdown": obs.score_breakdown,
        "error": obs.error,
    }


def build_user_prompt(task_id: str, step: int, obs: CyberAnalystObservation) -> str:
    payload = {
        "task_id": task_id,
        "step": step,
        "observation": observation_payload(obs),
    }
    return textwrap.dedent(
        f"""
        Choose the next benchmark tool call.
        Current state JSON:
        {json.dumps(payload, sort_keys=True)}
        """
    ).strip()


def extract_json_object(text: str) -> dict[str, Any]:
    content = text.strip()
    if content.startswith("```"):
        lines = content.splitlines()
        if lines and lines[0].startswith("```"):
            lines = lines[1:]
        if lines and lines[-1].startswith("```"):
            lines = lines[:-1]
        content = "\n".join(lines).strip()

    try:
        decoded = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ModelActionError(f"model_parse_error: {exc.msg}") from exc

    if not isinstance(decoded, dict):
        raise ModelActionError("model_parse_error: response is not a JSON object")
    return decoded


def parse_model_action(text: str) -> CyberAnalystAction:
    payload = extract_json_object(text)
    tool_name = payload.get("tool_name")
    args = payload.get("args", {})

    if not isinstance(tool_name, str) or not tool_name:
        raise ModelActionError("model_parse_error: missing tool_name")
    if not isinstance(args, dict):
        raise ModelActionError("model_parse_error: args must be an object")

    return CyberAnalystAction(tool_name=tool_name, args=args)


def get_model_action(
    client: OpenAI,
    llm_config: LLMConfig,
    task_id: str,
    step: int,
    obs: CyberAnalystObservation,
) -> CyberAnalystAction:
    try:
        completion = client.chat.completions.create(
            model=llm_config.model_name,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": build_user_prompt(task_id, step, obs)},
            ],
            temperature=llm_config.temperature,
            max_tokens=llm_config.max_tokens,
            stream=False,
        )
    except Exception as exc:
        raise ModelActionError(f"model_request_error: {exc}") from exc

    text = (completion.choices[0].message.content or "").strip()
    if not text:
        raise ModelActionError("model_parse_error: empty response")
    return parse_model_action(text)


def error_action(error: Exception) -> CyberAnalystAction:
    message = single_line(str(error))
    if message.startswith("model_request_error"):
        tool_name = "model_request_error"
    elif message.startswith("model_parse_error"):
        tool_name = "model_parse_error"
    else:
        tool_name = "model_action_error"
    return CyberAnalystAction(
        tool_name=tool_name,
        args={"message": message[:500]},
    )


def run_task(task_id: str, client: OpenAI, llm_config: LLMConfig) -> None:
    log_start(task_id, llm_config)
    rewards: list[float] = []
    steps_taken = 0
    final_score = 0.01
    success = False

    try:
        with CyberAnalystEnv(base_url=ENV_URL).sync() as env:
            reset_result = env.reset(task_id=task_id, seed=SEED)
            obs = reset_result.observation

            for step in range(1, MAX_STEPS + 1):
                if obs.done:
                    break

                model_failed = False
                try:
                    action = get_model_action(client, llm_config, task_id, step, obs)
                except ModelActionError as exc:
                    action = error_action(exc)
                    model_failed = True

                result = env.step(action)
                obs = result.observation
                reward = float(result.reward or 0.0)
                rewards.append(reward)
                steps_taken = step

                log_step(step, action, result.reward, result.done, obs.error)

                if isinstance(obs.tool_result, dict) and "score" in obs.tool_result:
                    final_score = float(obs.tool_result["score"])

                if result.done or model_failed:
                    success = final_score > 0.5
                    break

    except Exception as exc:
        action = CyberAnalystAction(
            tool_name="runtime_error",
            args={"message": single_line(str(exc))[:500]},
        )
        steps_taken = max(steps_taken, 1)
        rewards.append(0.01)
        log_step(steps_taken, action, 0.01, True, single_line(str(exc)))

    log_end(task_id, success, steps_taken, final_score, rewards)


def selected_task_ids() -> list[str]:
    task_override = os.getenv("TASK_NAME")
    return [task_override] if task_override else TASK_IDS


def main() -> None:
    llm_config = build_llm_config()
    try:
        client = build_openai_client()
    except KeyError:
        print("HF_TOKEN must be set for inference.", file=sys.stderr, flush=True)
        raise SystemExit(2) from None

    for task_id in selected_task_ids():
        run_task(task_id, client, llm_config)


if __name__ == "__main__":
    main()
models.py CHANGED
@@ -1,64 +1,64 @@
1
- # Copyright (c) Meta Platforms, Inc. and affiliates.
2
- # All rights reserved.
3
- #
4
- # This source code is licensed under the BSD-style license found in the
5
- # LICENSE file in the root directory of this source tree.
6
-
7
- """Typed models for the Cyber Analyst OpenEnv environment."""
8
-
9
- from typing import Any
10
-
11
- from openenv.core.env_server.types import Action, Observation, State
12
- from pydantic import Field
13
-
14
-
15
- class CyberAnalystAction(Action):
16
- """A bounded simulator tool call."""
17
-
18
- tool_name: str = Field(..., description="Name of the approved simulator tool")
19
- args: dict[str, Any] = Field(
20
- default_factory=dict,
21
- description="Tool arguments. The environment ignores unsupported keys.",
22
- )
23
-
24
-
25
- class CyberAnalystObservation(Observation):
26
- """Observation returned after reset or an environment step."""
27
-
28
- task_id: str = Field(default="", description="Current benchmark task id")
29
- alert: str = Field(default="", description="Initial alert or task prompt")
30
- phase: str = Field(default="investigate", description="Current episode phase")
31
- tool_catalog: list[dict[str, Any]] = Field(
32
- default_factory=list, description="Approved tools and their schemas"
33
- )
34
- tool_result: dict[str, Any] = Field(
35
- default_factory=dict, description="Result returned by the latest tool call"
36
- )
37
- evidence_ids: list[str] = Field(
38
- default_factory=list, description="Evidence ids discovered so far"
39
- )
40
- verified_findings: list[dict[str, Any]] = Field(
41
- default_factory=list, description="Verifier-confirmed findings"
42
- )
43
- candidate_findings: list[dict[str, Any]] = Field(
44
- default_factory=list, description="Candidate findings created by the agent"
45
- )
46
- step_budget_remaining: int = Field(
47
- default=0, ge=0, description="Steps remaining before timeout"
48
- )
49
- score_breakdown: dict[str, Any] = Field(
50
- default_factory=dict, description="Deterministic reward/score explanation"
51
- )
52
- error: str = Field(default="", description="Non-fatal environment error, if any")
53
-
54
-
55
- class CyberAnalystState(State):
56
- """State summary exposed via the OpenEnv state endpoint."""
57
-
58
- task_id: str = Field(default="", description="Current benchmark task id")
59
- seed: int | None = Field(default=None, description="Current deterministic seed")
60
- phase: str = Field(default="investigate", description="Current episode phase")
61
- step_budget_remaining: int = Field(default=0, ge=0)
62
- recent_evidence_ids: list[str] = Field(default_factory=list)
63
- verified_finding_ids: list[str] = Field(default_factory=list)
64
- done: bool = Field(default=False)
 
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the BSD-style license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ """Typed models for the Cyber Analyst OpenEnv environment."""
8
+
9
+ from typing import Any
10
+
11
+ from openenv.core.env_server.types import Action, Observation, State
from pydantic import Field


class CyberAnalystAction(Action):
    """A bounded simulator tool call."""

    tool_name: str = Field(..., description="Name of the approved simulator tool")
    args: dict[str, Any] = Field(
        default_factory=dict,
        description="Tool arguments. The environment ignores unsupported keys.",
    )


class CyberAnalystObservation(Observation):
    """Observation returned after reset or an environment step."""

    task_id: str = Field(default="", description="Current benchmark task id")
    alert: str = Field(default="", description="Initial alert or task prompt")
    phase: str = Field(default="investigate", description="Current episode phase")
    tool_catalog: list[dict[str, Any]] = Field(
        default_factory=list, description="Approved tools and their schemas"
    )
    tool_result: dict[str, Any] = Field(
        default_factory=dict, description="Result returned by the latest tool call"
    )
    evidence_ids: list[str] = Field(
        default_factory=list, description="Evidence ids discovered so far"
    )
    verified_findings: list[dict[str, Any]] = Field(
        default_factory=list, description="Verifier-confirmed findings"
    )
    candidate_findings: list[dict[str, Any]] = Field(
        default_factory=list, description="Candidate findings created by the agent"
    )
    step_budget_remaining: int = Field(
        default=0, ge=0, description="Steps remaining before timeout"
    )
    score_breakdown: dict[str, Any] = Field(
        default_factory=dict, description="Deterministic reward/score explanation"
    )
    error: str = Field(default="", description="Non-fatal environment error, if any")


class CyberAnalystState(State):
    """State summary exposed via the OpenEnv state endpoint."""

    task_id: str = Field(default="", description="Current benchmark task id")
    seed: int | None = Field(default=None, description="Current deterministic seed")
    phase: str = Field(default="investigate", description="Current episode phase")
    step_budget_remaining: int = Field(default=0, ge=0)
    recent_evidence_ids: list[str] = Field(default_factory=list)
    verified_finding_ids: list[str] = Field(default_factory=list)
    done: bool = Field(default=False)
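The action contract above is a small pydantic model: a tool name plus an arguments dict, serialized as JSON for the HTTP step endpoint. As a minimal client-side sketch (the `build_step_payload` helper is hypothetical, not part of the environment), the wire payload for one bounded tool call would look like:

```python
import json


def build_step_payload(tool_name: str, **args) -> str:
    """Serialize one bounded tool call in the CyberAnalystAction shape."""
    return json.dumps({"tool_name": tool_name, "args": args}, sort_keys=True)


# e.g. the search_repo call from the README action example
payload = build_step_payload("search_repo", query="api key")
```

Unsupported keys in `args` are ignored by the environment, so a client can pass extra metadata without breaking a call.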
openenv.yaml CHANGED
spec_version: 1
name: Cyber_analyst
type: space
runtime: fastapi
app: server.app:app
port: 8000
tasks:
  - id: secret_exposure_easy
    description: Detect a leaked synthetic API key in a repo snapshot and submit rotation/removal remediation.
    grader: server.graders:grade_secret_exposure_easy
  - id: missing_security_headers_medium
    description: Detect missing HSTS/CSP headers in a synthetic gateway header snapshot and submit remediation.
    grader: server.graders:grade_missing_security_headers_medium
  - id: authz_boundary_hard
    description: Detect an admin route role-policy mismatch and submit least-privilege remediation.
    grader: server.graders:grade_authz_boundary_hard
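Each `grader` entry in the manifest uses a `module:function` spec (e.g. `server.graders:grade_secret_exposure_easy`). The OpenEnv runtime presumably resolves these itself; a minimal sketch of that convention, with `resolve_grader` as a hypothetical helper, is:

```python
import importlib


def resolve_grader(spec: str):
    """Resolve a 'module:function' spec like those in openenv.yaml."""
    module_name, func_name = spec.split(":", 1)
    return getattr(importlib.import_module(module_name), func_name)


# Demonstrated with a stdlib stand-in, since server.graders is not importable here
fn = resolve_grader("json:dumps")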
pyproject.toml CHANGED
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

[build-system]
requires = ["setuptools>=45", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "openenv-Cyber_analyst"
version = "0.1.0"
description = "Cyber Analyst environment for OpenEnv"
requires-python = ">=3.10"
dependencies = [
    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
    # install from GitHub:
    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
    "openenv-core[core]>=0.2.2",
    # Environment-specific dependencies
    "openai>=1.0.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-cov>=4.0.0",
]

[project.scripts]
# Server entry point - enables running via: uv run --project . server
# or: python -m Cyber_analyst.server.app
server = "Cyber_analyst.server.app:main"

[tool.setuptools]
include-package-data = true
packages = ["Cyber_analyst", "Cyber_analyst.server"]
package-dir = { "Cyber_analyst" = ".", "Cyber_analyst.server" = "server" }
sample_inference_script.py ADDED
"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
    API_BASE_URL      The API endpoint for the LLM.
    MODEL_NAME        The model identifier to use for inference.
    HF_TOKEN          Your Hugging Face / API key.
    LOCAL_IMAGE_NAME  The name of the local image to use for the environment if you are using the
                      from_docker_image() method.

- Defaults are set only for API_BASE_URL and MODEL_NAME
  (and should reflect your active inference setup):
    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")

- The inference script must be named `inference.py` and placed in the root directory of the project.
- Participants must use the OpenAI client for all LLM calls, using the variables above.

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:

    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

  Rules:
  - One [START] line at episode begin.
  - One [STEP] line per step, immediately after env.step() returns.
  - One [END] line after env.close(), always emitted (even on exception).
  - reward and rewards are formatted to 2 decimal places.
  - done and success are lowercase booleans: true or false.
  - error is the raw last_action_error string, or null if none.
  - All fields on a single line with no newlines within a line.
  - Each task should return a score in [0, 1].

  Example:
    [START] task=click-test env=miniwob model=Qwen3-VL-30B
    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

IMAGE_NAME = os.getenv("IMAGE_NAME")  # If you are using a docker image
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")

API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7
MAX_TOKENS = 150
SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]

# Max possible reward: each token contributes 0.1, across all steps
_MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP

SYSTEM_PROMPT = textwrap.dedent(
    """
    You are interacting with a simple echo environment.
    Each turn you must send a message. The environment will echo it back.
    Reward is proportional to message length: reward = len(message) * 0.1
    Your goal is to maximize total reward by sending meaningful, substantive messages.
    Reply with exactly one message string — no quotes, no prefixes, just the message text.
    """
).strip()


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    error_val = error if error else "null"
    done_val = str(done).lower()
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)


def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    history_block = "\n".join(history[-4:]) if history else "None"
    return textwrap.dedent(
        f"""
        Step: {step}
        Last echoed message: {last_echoed!r}
        Last reward: {last_reward:.2f}
        Previous steps:
        {history_block}
        Send your next message.
        """
    ).strip()


def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
    try:
        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
            temperature=TEMPERATURE,
            max_tokens=MAX_TOKENS,
            stream=False,
        )
        text = (completion.choices[0].message.content or "").strip()
        return text if text else "hello"
    except Exception as exc:
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        return "hello"


async def main() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

    env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)

    history: List[str] = []
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)

    try:
        result = await env.reset()  # OpenEnv reset()
        last_echoed = result.observation.echoed_message
        last_reward = 0.0

        for step in range(1, MAX_STEPS + 1):
            if result.done:
                break

            message = get_model_message(client, step, last_echoed, last_reward, history)

            result = await env.step(MyEnvV4Action(message=message))
            obs = result.observation

            reward = result.reward or 0.0
            done = result.done
            error = None

            rewards.append(reward)
            steps_taken = step
            last_echoed = obs.echoed_message
            last_reward = reward

            log_step(step=step, action=message, reward=reward, done=done, error=error)

            history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")

            if done:
                break

        score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        success = score >= SUCCESS_SCORE_THRESHOLD

    finally:
        try:
            await env.close()
        except Exception as e:
            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)


if __name__ == "__main__":
    asyncio.run(main())
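The `[STEP]` line format specified in the docstring above is machine-parseable. A minimal sketch of a consumer-side parser for that format (the `parse_step_line` helper is hypothetical, not part of the harness):

```python
import re

# One capture group per documented [STEP] field: step, action, reward, done, error
STEP_RE = re.compile(
    r"\[STEP\] step=(\d+) action=(.*) reward=([\d.]+) done=(true|false) error=(.*)"
)


def parse_step_line(line: str):
    """Parse one '[STEP] ...' stdout line into typed fields, or None if malformed."""
    m = STEP_RE.match(line)
    if not m:
        return None
    step, action, reward, done, error = m.groups()
    return {
        "step": int(step),
        "action": action,
        "reward": float(reward),
        "done": done == "true",       # lowercase booleans per the spec
        "error": None if error == "null" else error,
    }
```

This mirrors the rules stated above: two-decimal rewards, lowercase booleans, and `null` for a missing error.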
scripts/validate-submission.sh ADDED
#!/usr/bin/env bash
#
# validate-submission.sh — OpenEnv Submission Validator
#
# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
#
# Prerequisites:
#   - Docker: https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
#   chmod +x validate-submission.sh
#   ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./validate-submission.sh https://my-team.hf.space
#   ./validate-submission.sh https://my-team.hf.space ./my-repo
#

set -uo pipefail

DOCKER_BUILD_TIMEOUT=600
if [ -t 1 ]; then
    RED='\033[0;31m'
    GREEN='\033[0;32m'
    YELLOW='\033[1;33m'
    BOLD='\033[1m'
    NC='\033[0m'
else
    RED='' GREEN='' YELLOW='' BOLD='' NC=''
fi

run_with_timeout() {
    local secs="$1"; shift
    if command -v timeout &>/dev/null; then
        timeout "$secs" "$@"
    elif command -v gtimeout &>/dev/null; then
        gtimeout "$secs" "$@"
    else
        "$@" &
        local pid=$!
        ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
        local watcher=$!
        wait "$pid" 2>/dev/null
        local rc=$?
        kill "$watcher" 2>/dev/null
        wait "$watcher" 2>/dev/null
        return $rc
    fi
}

portable_mktemp() {
    local prefix="${1:-validate}"
    mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
}

CLEANUP_FILES=()
cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
trap cleanup EXIT

PING_URL="${1:-}"
REPO_DIR="${2:-.}"

if [ -z "$PING_URL" ]; then
    printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
    printf "\n"
    printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
    printf "  repo_dir   Path to your repo (default: current directory)\n"
    exit 1
fi

if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
    printf "Error: directory '%s' not found\n" "${2:-.}"
    exit 1
fi
PING_URL="${PING_URL%/}"
export PING_URL
PASS=0

log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
fail() { log "${RED}FAILED${NC} -- $1"; }
hint() { printf "       ${YELLOW}Hint:${NC} %b\n" "$1"; }
stop_at() {
    printf "\n"
    printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
    exit 1
}

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
printf "${BOLD}========================================${NC}\n"
log "Repo:     $REPO_DIR"
log "Ping URL: $PING_URL"
printf "\n"

log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."

CURL_OUTPUT=$(portable_mktemp "validate-curl")
CLEANUP_FILES+=("$CURL_OUTPUT")
HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
    -H "Content-Type: application/json" -d '{}' \
    "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")

if [ "$HTTP_CODE" = "200" ]; then
    pass "HF Space is live and responds to /reset"
elif [ "$HTTP_CODE" = "000" ]; then
    fail "HF Space not reachable (connection failed or timed out)"
    hint "Check your network connection and that the Space is running."
    hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
    stop_at "Step 1"
else
    fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
    hint "Make sure your Space is running and the URL is correct."
    hint "Try opening $PING_URL in your browser first."
    stop_at "Step 1"
fi

log "${BOLD}Step 2/3: Running docker build${NC} ..."

if ! command -v docker &>/dev/null; then
    fail "docker command not found"
    hint "Install Docker: https://docs.docker.com/get-docker/"
    stop_at "Step 2"
fi

if [ -f "$REPO_DIR/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR"
    DOCKERFILE="$REPO_DIR/Dockerfile"
elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR"
    DOCKERFILE="$REPO_DIR/server/Dockerfile"
else
    fail "No Dockerfile found in repo root or server/ directory"
    stop_at "Step 2"
fi

log "  Found Dockerfile at $DOCKERFILE"
log "  Docker context: $DOCKER_CONTEXT"

BUILD_OK=false
BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build -f "$DOCKERFILE" "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true

if [ "$BUILD_OK" = true ]; then
    pass "Docker build succeeded"
else
    fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
    printf "%s\n" "$BUILD_OUTPUT" | tail -20
    stop_at "Step 2"
fi

log "${BOLD}Step 3/3: Running openenv validate${NC} ..."

if ! command -v openenv &>/dev/null; then
    fail "openenv command not found"
    hint "Install it: pip install openenv-core"
    stop_at "Step 3"
fi

VALIDATE_OK=false
VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true

if [ "$VALIDATE_OK" = true ]; then
    pass "openenv validate passed"
    [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
else
    fail "openenv validate failed"
    printf "%s\n" "$VALIDATE_OUTPUT"
    stop_at "Step 3"
fi

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
printf "${BOLD}========================================${NC}\n"
printf "\n"

exit 0
server/Cyber_analyst_environment.py CHANGED
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""SecOps Evidence Gym environment implementation."""

from __future__ import annotations

import hashlib
import json
from collections import Counter
from typing import Any
from uuid import uuid4

from openenv.core.env_server.interfaces import Environment

try:
    from ..models import (
        CyberAnalystAction,
        CyberAnalystObservation,
        CyberAnalystState,
    )
    from .graders import safe_reward, score_report
    from .tasks import DEFAULT_TASK_ID, TOOL_CATALOG, build_scenario
except ImportError:  # pragma: no cover - supports direct module execution
    from models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState
    from server.graders import safe_reward, score_report
    from server.tasks import DEFAULT_TASK_ID, TOOL_CATALOG, build_scenario


class CyberAnalystEnvironment(
    Environment[CyberAnalystAction, CyberAnalystObservation, CyberAnalystState]
):
    """A safe, deterministic evidence-grounded cyber analyst benchmark."""

    SUPPORTS_CONCURRENT_SESSIONS: bool = True
    MAX_STEPS = 12
    REPEAT_HARD_STOP = 6

    def __init__(self):
        super().__init__()
        self._scenario: dict[str, Any] = {}
        self._state = CyberAnalystState()
        self._discovered_evidence: set[str] = set()
        self._candidate_findings: dict[str, dict[str, Any]] = {}
        self._verified_findings: list[dict[str, Any]] = []
        self._validated_finding_ids: set[str] = set()
        self._action_counts: Counter[str] = Counter()
        self._last_score_breakdown: dict[str, Any] = {}
        self._trajectory_events: list[dict[str, Any]] = []
        self._initialize_episode(DEFAULT_TASK_ID, seed=None, episode_id=None)

    def reset(
        self,
        seed: int | None = None,
        episode_id: str | None = None,
        task_id: str = DEFAULT_TASK_ID,
        **_: Any,
    ) -> CyberAnalystObservation:
        """Reset the selected deterministic task."""

        self._initialize_episode(task_id=task_id, seed=seed, episode_id=episode_id)
        tool_result = {
            "message": "Cyber Analyst environment ready.",
            "allowed_scope": "Synthetic artifacts only. No live targets or shell.",
        }
        obs = self._observation(
            tool_result={
                **tool_result,
                "trajectory_jsonl": self.export_trajectory_jsonl(),
            },
            reward=0.01,
        )
        self._record_trajectory("reset", None, tool_result, obs.reward, obs.done, obs.error)
        return obs

    def step(  # type: ignore[override]
        self,
        action: CyberAnalystAction,
        timeout_s: float | None = None,
        **_: Any,
    ) -> CyberAnalystObservation:
        """Execute one bounded simulator tool call."""

        del timeout_s

        if self._state.done:
            tool_result = {"message": "Episode is already complete."}
            obs = self._observation(
                tool_result=tool_result,
                reward=0.01,
                done=True,
                error="episode_already_done",
            )
            self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
            return obs

        self._state.step_count += 1
        self._state.step_budget_remaining = max(
            0, self.MAX_STEPS - self._state.step_count
        )

        signature = self._action_signature(action)
        self._action_counts[signature] += 1
        repeat_count = self._action_counts[signature]

        if repeat_count >= self.REPEAT_HARD_STOP:
            self._state.phase = "done"
            self._state.done = True
            self._last_score_breakdown = {
                "score": 0.03,
                "repeat_hard_stop": True,
                "signature": signature,
            }
            tool_result = {"message": "Episode stopped after repeated identical actions."}
            obs = self._observation(
                tool_result=tool_result,
                reward=0.03,
                done=True,
                error="repeat_hard_stop",
            )
            self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
            return obs

        handler = getattr(self, f"_tool_{action.tool_name}", None)
        if handler is None:
            tool_result = {
                "ok": False,
                "message": f"Unsupported tool: {action.tool_name}",
                "available_tools": [tool["name"] for tool in TOOL_CATALOG],
            }
            obs = self._step_observation(
                tool_result=tool_result,
                repeat_count=repeat_count,
                error="unsupported_tool",
            )
            self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
            return obs

        try:
            result, reward_delta, done = handler(action.args)
            error = ""
        except Exception as exc:  # pragma: no cover - defensive rollout guard
            result = {"ok": False, "message": str(exc)}
            reward_delta = -0.05
            done = False
            error = exc.__class__.__name__

        if self._state.step_budget_remaining <= 0 and not done:
            done = True
            self._state.phase = "done"
            self._state.done = True
            result = {
                **result,
                "timeout": True,
                "message": "Step budget exhausted before report submission.",
            }
            reward_delta -= 0.10

        obs = self._step_observation(
            tool_result=result,
            repeat_count=repeat_count,
            reward_delta=reward_delta,
            done=done,
            error=error,
        )
        self._record_trajectory("step", action, result, obs.reward, obs.done, obs.error)
        return obs

    @property
    def state(self) -> CyberAnalystState:
        """Return the current episode state summary."""

        return self._state

    def _initialize_episode(
        self, task_id: str, seed: int | None, episode_id: str | None
    ) -> None:
        self._scenario = build_scenario(task_id, seed)
        self._discovered_evidence = set()
        self._candidate_findings = {}
        self._verified_findings = []
        self._validated_finding_ids = set()
        self._action_counts = Counter()
        self._last_score_breakdown = {}
        self._trajectory_events = []
        self._state = CyberAnalystState(
            episode_id=episode_id or str(uuid4()),
            step_count=0,
            task_id=self._scenario["task_id"],
            seed=seed,
            phase="investigate",
            step_budget_remaining=self.MAX_STEPS,
            recent_evidence_ids=[],
            verified_finding_ids=[],
            done=False,
        )

    def export_trajectory_jsonl(self) -> str:
        """Return the current episode trajectory as JSONL for offline analysis."""

        return "\n".join(
            json.dumps(event, sort_keys=True, default=str)
            for event in self._trajectory_events
        )

    def _record_trajectory(
        self,
        event_type: str,
        action: CyberAnalystAction | None,
        tool_result: dict[str, Any],
        reward: float | int | None,
        done: bool,
        error: str,
    ) -> None:
        action_payload = None
        if action is not None:
            action_payload = action.model_dump(exclude_none=True)
        self._trajectory_events.append(
            {
                "episode_id": self._state.episode_id,
                "task_id": self._state.task_id,
                "seed": self._state.seed,
                "event_type": event_type,
                "step": self._state.step_count,
                "phase": self._state.phase,
                "action": action_payload,
                "tool_result": tool_result,
                "evidence_ids": sorted(self._discovered_evidence),
                "verified_finding_ids": list(self._state.verified_finding_ids),
                "reward": reward,
                "done": done,
                "error": error,
            }
        )

    def _observation(
        self,
        tool_result: dict[str, Any] | None = None,
        reward: float = 0.01,
        done: bool | None = None,
        error: str = "",
    ) -> CyberAnalystObservation:
        done_value = self._state.done if done is None else done
        return CyberAnalystObservation(
            task_id=self._scenario.get("task_id", ""),
            alert=self._scenario.get("alert", ""),
            phase=self._state.phase,
            tool_catalog=TOOL_CATALOG,
            tool_result=tool_result or {},
            evidence_ids=sorted(self._discovered_evidence),
            verified_findings=list(self._verified_findings),
            candidate_findings=list(self._candidate_findings.values()),
            step_budget_remaining=self._state.step_budget_remaining,
            score_breakdown=dict(self._last_score_breakdown),
            error=error,
            done=done_value,
            reward=safe_reward(reward),
        )

    def _step_observation(
        self,
        tool_result: dict[str, Any],
        repeat_count: int,
        reward_delta: float = 0.0,
        done: bool = False,
        error: str = "",
    ) -> CyberAnalystObservation:
        reward = 0.04 + reward_delta - 0.01
        if repeat_count > 2:
            reward -= 0.03 * (repeat_count - 2)

        if done:
            self._state.phase = "done"
            self._state.done = True

        self._state.recent_evidence_ids = sorted(self._discovered_evidence)[-5:]
        self._state.verified_finding_ids = [
            finding["finding_id"] for finding in self._verified_findings
        ]

        return self._observation(
            tool_result=tool_result,
            reward=safe_reward(reward),
            done=self._state.done,
            error=error,
        )

    def _action_signature(self, action: CyberAnalystAction) -> str:
        payload = {
            "tool_name": action.tool_name,
            "args": action.args,
        }
        encoded = json.dumps(payload, sort_keys=True, default=str)
        return hashlib.sha256(encoded.encode("utf-8")).hexdigest()[:16]

    def _record_evidence(self, evidence_ids: list[str]) -> int:
        relevant = set(self._scenario.get("required_evidence", [])) | set(
            self._scenario.get("supporting_evidence", [])
        )
        new_relevant = 0
        for evidence_id in evidence_ids:
            if evidence_id not in self._discovered_evidence and evidence_id in relevant:
                new_relevant += 1
            self._discovered_evidence.add(evidence_id)
        return new_relevant

    def _filter_entries(
        self, entries: list[dict[str, Any]], service_id: str = "", query: str = ""
    ) -> list[dict[str, Any]]:
        normalized_service = self._resolve_service_id(service_id).lower()
        normalized_query = query.strip().lower()
        matches: list[dict[str, Any]] = []
        for entry in entries:
            service_matches = (
                not normalized_service
                or str(entry.get("service_id", "")).lower() == normalized_service
            )
            search_blob = " ".join(
                [
                    str(entry.get("text", "")),
                    str(entry.get("source", "")),
                    " ".join(str(tag) for tag in entry.get("tags", [])),
                ]
            ).lower()
            query_matches = not normalized_query or normalized_query in search_blob
            if service_matches and query_matches:
                matches.append(entry)
        return matches

    def _resolve_service_id(self, service_id: str) -> str:
        normalized = service_id.strip()
        aliases = self._scenario.get("service_aliases", {})
        return str(aliases.get(normalized, normalized))

    def _evidence_payload(self, entries: list[dict[str, Any]]) -> dict[str, Any]:
        evidence_ids = [entry["evidence_id"] for entry in entries]
        new_relevant = self._record_evidence(evidence_ids)
        return {
            "ok": True,
            "evidence_ids": evidence_ids,
            "new_relevant_evidence": new_relevant,
            "entries": [
                {
                    "evidence_id": entry["evidence_id"],
                    "service_id": entry.get("service_id", ""),
                    "source": entry.get("source", ""),
                    "text": entry.get("text", ""),
                }
                for entry in entries
            ],
        }

    def _tool_list_assets(self, args: dict[str, Any]) -> tuple[dict[str, Any], float, bool]:
        del args
        return {"ok": True, "assets": self._scenario["assets"]}, 0.0, False

    def _tool_get_log_events(
        self, args: dict[str, Any]
    ) -> tuple[dict[str, Any], float, bool]:
        entries = self._filter_entries(
            self._scenario.get("logs", []),
            service_id=str(args.get("service_id", "")),
            query=str(args.get("query", "")),
        )
        payload = self._evidence_payload(entries)
        return payload, 0.02 * payload["new_relevant_evidence"], False

    def _tool_check_security_headers(
        self, args: dict[str, Any]
    ) -> tuple[dict[str, Any], float, bool]:
        requested_service = self._resolve_service_id(str(args.get("service_id", ""))).lower()
        snapshots = self._scenario.get("headers", {})
        results = []
        evidence_ids = []
        for service_id, snapshot in snapshots.items():
            if requested_service and service_id.lower() != requested_service:
                continue
            evidence_ids.append(snapshot["evidence_id"])
            results.append(
                {
                    "service_id": service_id,
                    "evidence_id": snapshot["evidence_id"],
                    "present": snapshot.get("present", []),
                    "missing": snapshot.get("missing", []),
                    "passed": not snapshot.get("missing"),
                }
            )
        new_relevant = self._record_evidence(evidence_ids)
        return (
            {
                "ok": True,
                "evidence_ids": evidence_ids,
                "new_relevant_evidence": new_relevant,
                "header_results": results,
            },
            0.02 * new_relevant,
            False,
        )

    def _tool_search_repo(self, args: dict[str, Any]) -> tuple[dict[str, Any], float, bool]:
        entries = self._filter_entries(
            self._scenario.get("repo", []), query=str(args.get("query", ""))
        )
        payload = self._evidence_payload(entries)
        return payload, 0.02 * payload["new_relevant_evidence"], False

    def _tool_scan_dependencies(
        self, args: dict[str, Any]
    ) -> tuple[dict[str, Any], float, bool]:
        del args
        payload = self._evidence_payload(self._scenario.get("dependencies", []))
        return payload, 0.02 * payload["new_relevant_evidence"], False

    def _tool_create_finding(
        self, args: dict[str, Any]
    ) -> tuple[dict[str, Any], float, bool]:
        evidence_ids = args.get("evidence_ids", [])
        if isinstance(evidence_ids, str):
            evidence_ids = [evidence_ids]
        evidence_ids = [str(evidence_id) for evidence_id in evidence_ids]

        finding_id = f"FND-{len(self._candidate_findings) + 1:03d}"
        finding = {
            "finding_id": finding_id,
            "finding_type": str(args.get("finding_type", "")),
            "evidence_ids": evidence_ids,
            "severity_guess": str(args.get("severity_guess", "")),
            "remediation": str(args.get("remediation", "")),
            "validated": False,
            "matching_gt_id": None,
        }
        self._candidate_findings[finding_id] = finding

        well_formed = bool(
            finding["finding_type"] and evidence_ids and finding["remediation"]
        )
        return (
            {"ok": True, "finding_id": finding_id, "finding": finding},
            0.03 if well_formed else 0.0,
            False,
        )

    def _tool_validate_finding(
        self, args: dict[str, Any]
    ) -> tuple[dict[str, Any], float, bool]:
        finding_id = str(args.get("finding_id", ""))
        finding = self._candidate_findings.get(finding_id)
450
- finding = self._candidate_findings.get(finding_id)
451
- if finding is None:
452
- return (
453
- {"ok": False, "message": f"Unknown finding_id: {finding_id}"},
454
- -0.03,
455
- False,
456
- )
457
-
458
- expected_type = self._scenario["finding_type"]
459
- required_evidence = set(self._scenario.get("required_evidence", []))
460
- supplied_evidence = set(finding.get("evidence_ids", []))
461
- verified = (
462
- finding.get("finding_type") == expected_type
463
- and bool(required_evidence & supplied_evidence)
464
- )
465
- self._validated_finding_ids.add(finding_id)
466
- finding["validated"] = verified
467
- finding["matching_gt_id"] = self._scenario["ground_truth_id"] if verified else None
468
-
469
- if verified and not any(
470
- item["finding_id"] == finding_id for item in self._verified_findings
471
- ):
472
- self._verified_findings.append(dict(finding))
473
-
474
- return (
475
- {
476
- "ok": True,
477
- "finding_id": finding_id,
478
- "verified": verified,
479
- "matching_gt_id": finding["matching_gt_id"],
480
- },
481
- 0.08 if verified else -0.02,
482
- False,
483
- )
484
-
485
- def _tool_submit_report(
486
- self, args: dict[str, Any]
487
- ) -> tuple[dict[str, Any], float, bool]:
488
- report = args.get("report_json", {})
489
- score, breakdown = score_report(
490
- self._scenario["task_id"],
491
- report,
492
- verified_findings=self._verified_findings,
493
- validation_attempted=bool(self._validated_finding_ids),
494
- )
495
- self._last_score_breakdown = breakdown
496
- return (
497
- {
498
- "ok": True,
499
- "submitted": True,
500
- "score": score,
501
- "score_breakdown": breakdown,
502
- "trajectory_jsonl": self.export_trajectory_jsonl(),
503
- },
504
- score,
505
- True,
506
- )
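The `_evidence_payload` helper above rewards only newly discovered, task-relevant evidence via `_record_evidence`. A minimal standalone sketch of that counting rule (the function name and sample ids here are illustrative, not part of the module's API):

```python
def count_new_relevant(discovered: set, relevant: set, evidence_ids: list) -> int:
    """Count ids that are both task-relevant and not yet seen; record every id."""
    new_relevant = 0
    for evidence_id in evidence_ids:
        if evidence_id not in discovered and evidence_id in relevant:
            new_relevant += 1
        discovered.add(evidence_id)
    return new_relevant


discovered: set = set()
relevant = {"EV-001", "EV-002"}
first = count_new_relevant(discovered, relevant, ["EV-001", "EV-999"])   # one new relevant id
second = count_new_relevant(discovered, relevant, ["EV-001", "EV-999"])  # repeats score zero
```

Because the reward scales with this count (`0.02 * new_relevant_evidence`), re-running the same query yields no further investigation reward.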
 
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """SecOps Evidence Gym environment implementation."""
+
+ from __future__ import annotations
+
+ import hashlib
+ import json
+ from collections import Counter
+ from typing import Any
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+
+ try:
+     from ..models import (
+         CyberAnalystAction,
+         CyberAnalystObservation,
+         CyberAnalystState,
+     )
+     from .graders import safe_reward, score_report
+     from .tasks import DEFAULT_TASK_ID, TOOL_CATALOG, build_scenario
+ except ImportError:  # pragma: no cover - supports direct module execution
+     from models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState
+     from server.graders import safe_reward, score_report
+     from server.tasks import DEFAULT_TASK_ID, TOOL_CATALOG, build_scenario
+
+
+ class CyberAnalystEnvironment(
+     Environment[CyberAnalystAction, CyberAnalystObservation, CyberAnalystState]
+ ):
+     """A safe, deterministic evidence-grounded cyber analyst benchmark."""
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+     MAX_STEPS = 12
+     REPEAT_HARD_STOP = 6
+
+     def __init__(self):
+         super().__init__()
+         self._scenario: dict[str, Any] = {}
+         self._state = CyberAnalystState()
+         self._discovered_evidence: set[str] = set()
+         self._candidate_findings: dict[str, dict[str, Any]] = {}
+         self._verified_findings: list[dict[str, Any]] = []
+         self._validated_finding_ids: set[str] = set()
+         self._action_counts: Counter[str] = Counter()
+         self._last_score_breakdown: dict[str, Any] = {}
+         self._trajectory_events: list[dict[str, Any]] = []
+         self._initialize_episode(DEFAULT_TASK_ID, seed=None, episode_id=None)
+
+     def reset(
+         self,
+         seed: int | None = None,
+         episode_id: str | None = None,
+         task_id: str = DEFAULT_TASK_ID,
+         **_: Any,
+     ) -> CyberAnalystObservation:
+         """Reset the selected deterministic task."""
+
+         self._initialize_episode(task_id=task_id, seed=seed, episode_id=episode_id)
+         tool_result = {
+             "message": "Cyber Analyst environment ready.",
+             "allowed_scope": "Synthetic artifacts only. No live targets or shell.",
+         }
+         obs = self._observation(
+             tool_result={
+                 **tool_result,
+                 "trajectory_jsonl": self.export_trajectory_jsonl(),
+             },
+             reward=0.01,
+         )
+         self._record_trajectory("reset", None, tool_result, obs.reward, obs.done, obs.error)
+         return obs
+
+     def step(  # type: ignore[override]
+         self,
+         action: CyberAnalystAction,
+         timeout_s: float | None = None,
+         **_: Any,
+     ) -> CyberAnalystObservation:
+         """Execute one bounded simulator tool call."""
+
+         del timeout_s
+
+         if self._state.done:
+             tool_result = {"message": "Episode is already complete."}
+             obs = self._observation(
+                 tool_result=tool_result,
+                 reward=0.01,
+                 done=True,
+                 error="episode_already_done",
+             )
+             self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
+             return obs
+
+         self._state.step_count += 1
+         self._state.step_budget_remaining = max(
+             0, self.MAX_STEPS - self._state.step_count
+         )
+
+         signature = self._action_signature(action)
+         self._action_counts[signature] += 1
+         repeat_count = self._action_counts[signature]
+
+         if repeat_count >= self.REPEAT_HARD_STOP:
+             self._state.phase = "done"
+             self._state.done = True
+             self._last_score_breakdown = {
+                 "score": 0.03,
+                 "repeat_hard_stop": True,
+                 "signature": signature,
+             }
+             tool_result = {"message": "Episode stopped after repeated identical actions."}
+             obs = self._observation(
+                 tool_result=tool_result,
+                 reward=0.03,
+                 done=True,
+                 error="repeat_hard_stop",
+             )
+             self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
+             return obs
+
+         handler = getattr(self, f"_tool_{action.tool_name}", None)
+         if handler is None:
+             tool_result = {
+                 "ok": False,
+                 "message": f"Unsupported tool: {action.tool_name}",
+                 "available_tools": [tool["name"] for tool in TOOL_CATALOG],
+             }
+             obs = self._step_observation(
+                 tool_result=tool_result,
+                 repeat_count=repeat_count,
+                 error="unsupported_tool",
+             )
+             self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
+             return obs
+
+         try:
+             result, reward_delta, done = handler(action.args)
+             error = ""
+         except Exception as exc:  # pragma: no cover - defensive rollout guard
+             result = {"ok": False, "message": str(exc)}
+             reward_delta = -0.05
+             done = False
+             error = exc.__class__.__name__
+
+         if self._state.step_budget_remaining <= 0 and not done:
+             done = True
+             self._state.phase = "done"
+             self._state.done = True
+             result = {
+                 **result,
+                 "timeout": True,
+                 "message": "Step budget exhausted before report submission.",
+             }
+             reward_delta -= 0.10
+
+         obs = self._step_observation(
+             tool_result=result,
+             repeat_count=repeat_count,
+             reward_delta=reward_delta,
+             done=done,
+             error=error,
+         )
+         self._record_trajectory("step", action, result, obs.reward, obs.done, obs.error)
+         return obs
+
+     @property
+     def state(self) -> CyberAnalystState:
+         """Return the current episode state summary."""
+
+         return self._state
+
+     def _initialize_episode(
+         self, task_id: str, seed: int | None, episode_id: str | None
+     ) -> None:
+         self._scenario = build_scenario(task_id, seed)
+         self._discovered_evidence = set()
+         self._candidate_findings = {}
+         self._verified_findings = []
+         self._validated_finding_ids = set()
+         self._action_counts = Counter()
+         self._last_score_breakdown = {}
+         self._trajectory_events = []
+         self._state = CyberAnalystState(
+             episode_id=episode_id or str(uuid4()),
+             step_count=0,
+             task_id=self._scenario["task_id"],
+             seed=seed,
+             phase="investigate",
+             step_budget_remaining=self.MAX_STEPS,
+             recent_evidence_ids=[],
+             verified_finding_ids=[],
+             done=False,
+         )
+
+     def export_trajectory_jsonl(self) -> str:
+         """Return the current episode trajectory as JSONL for offline analysis."""
+
+         return "\n".join(
+             json.dumps(event, sort_keys=True, default=str)
+             for event in self._trajectory_events
+         )
+
+     def _record_trajectory(
+         self,
+         event_type: str,
+         action: CyberAnalystAction | None,
+         tool_result: dict[str, Any],
+         reward: float | int | None,
+         done: bool,
+         error: str,
+     ) -> None:
+         action_payload = None
+         if action is not None:
+             action_payload = action.model_dump(exclude_none=True)
+         self._trajectory_events.append(
+             {
+                 "episode_id": self._state.episode_id,
+                 "task_id": self._state.task_id,
+                 "seed": self._state.seed,
+                 "event_type": event_type,
+                 "step": self._state.step_count,
+                 "phase": self._state.phase,
+                 "action": action_payload,
+                 "tool_result": tool_result,
+                 "evidence_ids": sorted(self._discovered_evidence),
+                 "verified_finding_ids": list(self._state.verified_finding_ids),
+                 "reward": reward,
+                 "done": done,
+                 "error": error,
+             }
+         )
+
+     def _observation(
+         self,
+         tool_result: dict[str, Any] | None = None,
+         reward: float = 0.01,
+         done: bool | None = None,
+         error: str = "",
+     ) -> CyberAnalystObservation:
+         done_value = self._state.done if done is None else done
+         return CyberAnalystObservation(
+             task_id=self._scenario.get("task_id", ""),
+             alert=self._scenario.get("alert", ""),
+             phase=self._state.phase,
+             tool_catalog=TOOL_CATALOG,
+             tool_result=tool_result or {},
+             evidence_ids=sorted(self._discovered_evidence),
+             verified_findings=list(self._verified_findings),
+             candidate_findings=list(self._candidate_findings.values()),
+             step_budget_remaining=self._state.step_budget_remaining,
+             score_breakdown=dict(self._last_score_breakdown),
+             error=error,
+             done=done_value,
+             reward=safe_reward(reward),
+         )
+
+     def _step_observation(
+         self,
+         tool_result: dict[str, Any],
+         repeat_count: int,
+         reward_delta: float = 0.0,
+         done: bool = False,
+         error: str = "",
+     ) -> CyberAnalystObservation:
+         reward = 0.04 + reward_delta - 0.01
+         if repeat_count > 2:
+             reward -= 0.03 * (repeat_count - 2)
+
+         if done:
+             self._state.phase = "done"
+             self._state.done = True
+
+         self._state.recent_evidence_ids = sorted(self._discovered_evidence)[-5:]
+         self._state.verified_finding_ids = [
+             finding["finding_id"] for finding in self._verified_findings
+         ]
+
+         return self._observation(
+             tool_result=tool_result,
+             reward=safe_reward(reward),
+             done=self._state.done,
+             error=error,
+         )
+
+     def _action_signature(self, action: CyberAnalystAction) -> str:
+         payload = {
+             "tool_name": action.tool_name,
+             "args": action.args,
+         }
+         encoded = json.dumps(payload, sort_keys=True, default=str)
+         return hashlib.sha256(encoded.encode("utf-8")).hexdigest()[:16]
+
+     def _record_evidence(self, evidence_ids: list[str]) -> int:
+         relevant = set(self._scenario.get("required_evidence", [])) | set(
+             self._scenario.get("supporting_evidence", [])
+         )
+         new_relevant = 0
+         for evidence_id in evidence_ids:
+             if evidence_id not in self._discovered_evidence and evidence_id in relevant:
+                 new_relevant += 1
+             self._discovered_evidence.add(evidence_id)
+         return new_relevant
+
+     def _filter_entries(
+         self, entries: list[dict[str, Any]], service_id: str = "", query: str = ""
+     ) -> list[dict[str, Any]]:
+         normalized_service = self._resolve_service_id(service_id).lower()
+         normalized_query = query.strip().lower()
+         matches: list[dict[str, Any]] = []
+         for entry in entries:
+             service_matches = (
+                 not normalized_service
+                 or str(entry.get("service_id", "")).lower() == normalized_service
+             )
+             search_blob = " ".join(
+                 [
+                     str(entry.get("text", "")),
+                     str(entry.get("source", "")),
+                     " ".join(str(tag) for tag in entry.get("tags", [])),
+                 ]
+             ).lower()
+             query_matches = not normalized_query or normalized_query in search_blob
+             if service_matches and query_matches:
+                 matches.append(entry)
+         return matches
+
+     def _resolve_service_id(self, service_id: str) -> str:
+         normalized = service_id.strip()
+         aliases = self._scenario.get("service_aliases", {})
+         return str(aliases.get(normalized, normalized))
+
+     def _evidence_payload(self, entries: list[dict[str, Any]]) -> dict[str, Any]:
+         evidence_ids = [entry["evidence_id"] for entry in entries]
+         new_relevant = self._record_evidence(evidence_ids)
+         return {
+             "ok": True,
+             "evidence_ids": evidence_ids,
+             "new_relevant_evidence": new_relevant,
+             "entries": [
+                 {
+                     "evidence_id": entry["evidence_id"],
+                     "service_id": entry.get("service_id", ""),
+                     "source": entry.get("source", ""),
+                     "text": entry.get("text", ""),
+                 }
+                 for entry in entries
+             ],
+         }
+
+     def _tool_list_assets(self, args: dict[str, Any]) -> tuple[dict[str, Any], float, bool]:
+         del args
+         return {"ok": True, "assets": self._scenario["assets"]}, 0.0, False
+
+     def _tool_get_log_events(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         entries = self._filter_entries(
+             self._scenario.get("logs", []),
+             service_id=str(args.get("service_id", "")),
+             query=str(args.get("query", "")),
+         )
+         payload = self._evidence_payload(entries)
+         return payload, 0.02 * payload["new_relevant_evidence"], False
+
+     def _tool_check_security_headers(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         requested_service = self._resolve_service_id(str(args.get("service_id", ""))).lower()
+         snapshots = self._scenario.get("headers", {})
+         results = []
+         evidence_ids = []
+         for service_id, snapshot in snapshots.items():
+             if requested_service and service_id.lower() != requested_service:
+                 continue
+             evidence_ids.append(snapshot["evidence_id"])
+             results.append(
+                 {
+                     "service_id": service_id,
+                     "evidence_id": snapshot["evidence_id"],
+                     "present": snapshot.get("present", []),
+                     "missing": snapshot.get("missing", []),
+                     "passed": not snapshot.get("missing"),
+                 }
+             )
+         new_relevant = self._record_evidence(evidence_ids)
+         return (
+             {
+                 "ok": True,
+                 "evidence_ids": evidence_ids,
+                 "new_relevant_evidence": new_relevant,
+                 "header_results": results,
+             },
+             0.02 * new_relevant,
+             False,
+         )
+
+     def _tool_search_repo(self, args: dict[str, Any]) -> tuple[dict[str, Any], float, bool]:
+         entries = self._filter_entries(
+             self._scenario.get("repo", []), query=str(args.get("query", ""))
+         )
+         payload = self._evidence_payload(entries)
+         return payload, 0.02 * payload["new_relevant_evidence"], False
+
+     def _tool_scan_dependencies(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         del args
+         payload = self._evidence_payload(self._scenario.get("dependencies", []))
+         return payload, 0.02 * payload["new_relevant_evidence"], False
+
+     def _tool_create_finding(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         evidence_ids = args.get("evidence_ids", [])
+         if isinstance(evidence_ids, str):
+             evidence_ids = [evidence_ids]
+         evidence_ids = [str(evidence_id) for evidence_id in evidence_ids]
+
+         finding_id = f"FND-{len(self._candidate_findings) + 1:03d}"
+         finding = {
+             "finding_id": finding_id,
+             "finding_type": str(args.get("finding_type", "")),
+             "evidence_ids": evidence_ids,
+             "severity_guess": str(args.get("severity_guess", "")),
+             "remediation": str(args.get("remediation", "")),
+             "validated": False,
+             "matching_gt_id": None,
+         }
+         self._candidate_findings[finding_id] = finding
+
+         well_formed = bool(
+             finding["finding_type"] and evidence_ids and finding["remediation"]
+         )
+         return (
+             {"ok": True, "finding_id": finding_id, "finding": finding},
+             0.03 if well_formed else 0.0,
+             False,
+         )
+
+     def _tool_validate_finding(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         finding_id = str(args.get("finding_id", ""))
+         finding = self._candidate_findings.get(finding_id)
+         if finding is None:
+             return (
+                 {"ok": False, "message": f"Unknown finding_id: {finding_id}"},
+                 -0.03,
+                 False,
+             )
+
+         expected_type = self._scenario["finding_type"]
+         required_evidence = set(self._scenario.get("required_evidence", []))
+         supplied_evidence = set(finding.get("evidence_ids", []))
+         verified = (
+             finding.get("finding_type") == expected_type
+             and bool(required_evidence & supplied_evidence)
+         )
+         self._validated_finding_ids.add(finding_id)
+         finding["validated"] = verified
+         finding["matching_gt_id"] = self._scenario["ground_truth_id"] if verified else None
+
+         if verified and not any(
+             item["finding_id"] == finding_id for item in self._verified_findings
+         ):
+             self._verified_findings.append(dict(finding))
+
+         return (
+             {
+                 "ok": True,
+                 "finding_id": finding_id,
+                 "verified": verified,
+                 "matching_gt_id": finding["matching_gt_id"],
+             },
+             0.08 if verified else -0.02,
+             False,
+         )
+
+     def _tool_submit_report(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         report = args.get("report_json", {})
+         score, breakdown = score_report(
+             self._scenario["task_id"],
+             report,
+             verified_findings=self._verified_findings,
+             validation_attempted=bool(self._validated_finding_ids),
+         )
+         self._last_score_breakdown = breakdown
+         return (
+             {
+                 "ok": True,
+                 "submitted": True,
+                 "score": score,
+                 "score_breakdown": breakdown,
+                 "trajectory_jsonl": self.export_trajectory_jsonl(),
+             },
+             score,
+             True,
+         )
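The environment's repeat hard-stop keys on a canonical hash of each tool call (`_action_signature` above). A standalone sketch of the same idea, assuming plain dicts in place of the `CyberAnalystAction` model:

```python
import hashlib
import json
from collections import Counter


def action_signature(tool_name: str, args: dict) -> str:
    # Canonical JSON (sorted keys) makes the hash insensitive to key order.
    payload = {"tool_name": tool_name, "args": args}
    encoded = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()[:16]


counts: Counter = Counter()
sig_a = action_signature("search_repo", {"query": "api key"})
sig_b = action_signature("search_repo", {"query": "api key"})
counts[sig_a] += 1
counts[sig_b] += 1  # identical calls collapse to one signature
```

Once a signature's count reaches `REPEAT_HARD_STOP`, the episode ends with a floor reward, which discourages degenerate tool-call loops.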
server/__init__.py CHANGED
@@ -1,11 +1,11 @@
- # Copyright (c) Meta Platforms, Inc. and affiliates.
- # All rights reserved.
- #
- # This source code is licensed under the BSD-style license found in the
- # LICENSE file in the root directory of this source tree.
-
- """Cyber Analyst environment server components."""
-
- from .Cyber_analyst_environment import CyberAnalystEnvironment
-
- __all__ = ["CyberAnalystEnvironment"]
 
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Cyber Analyst environment server components."""
+
+ from .Cyber_analyst_environment import CyberAnalystEnvironment
+
+ __all__ = ["CyberAnalystEnvironment"]
server/app.py CHANGED
@@ -1,79 +1,79 @@
- # Copyright (c) Meta Platforms, Inc. and affiliates.
- # All rights reserved.
- #
- # This source code is licensed under the BSD-style license found in the
- # LICENSE file in the root directory of this source tree.
-
- """
- FastAPI application for the Cyber Analyst Environment.
-
- This module creates an HTTP server that exposes the CyberAnalystEnvironment
- over HTTP and WebSocket endpoints, compatible with EnvClient.
-
- Endpoints:
- - POST /reset: Reset the environment
- - POST /step: Execute an action
- - GET /state: Get current environment state
- - GET /schema: Get action/observation schemas
- - WS /ws: WebSocket endpoint for persistent sessions
-
- Usage:
-     # Development (with auto-reload):
-     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
-
-     # Production:
-     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
-
-     # Or run directly:
-     python -m server.app
- """
-
- try:
-     from openenv.core.env_server.http_server import create_app
- except Exception as e:  # pragma: no cover
-     raise ImportError(
-         "openenv is required for the web interface. Install dependencies with 'uv sync'."
-     ) from e
-
- try:
-     from ..models import CyberAnalystAction, CyberAnalystObservation
-     from .Cyber_analyst_environment import CyberAnalystEnvironment
- except ImportError:
-     from models import CyberAnalystAction, CyberAnalystObservation
-     from server.Cyber_analyst_environment import CyberAnalystEnvironment
-
-
- # Create the app with web interface and README integration
- app = create_app(
-     CyberAnalystEnvironment,
-     CyberAnalystAction,
-     CyberAnalystObservation,
-     env_name="Cyber_analyst",
-     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
- )
-
-
- def main(host: str = "0.0.0.0", port: int = 8000):
-     """
-     Entry point for direct execution via uv run or python -m.
-
-     This function enables running the server without Docker:
-         uv run --project . server
-         uv run --project . server --port 8001
-         python -m Cyber_analyst.server.app
-
-     Args:
-         host: Host address to bind to (default: "0.0.0.0")
-         port: Port number to listen on (default: 8000)
-
-     For production deployments, consider using uvicorn directly with
-     multiple workers:
-         uvicorn Cyber_analyst.server.app:app --workers 4
-     """
-     import uvicorn
-
-     uvicorn.run(app, host=host, port=port)
-
-
- if __name__ == "__main__":
-     main()
 
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the Cyber Analyst Environment.
+
+ This module creates an HTTP server that exposes the CyberAnalystEnvironment
+ over HTTP and WebSocket endpoints, compatible with EnvClient.
+
+ Endpoints:
+ - POST /reset: Reset the environment
+ - POST /step: Execute an action
+ - GET /state: Get current environment state
+ - GET /schema: Get action/observation schemas
+ - WS /ws: WebSocket endpoint for persistent sessions
+
+ Usage:
+     # Development (with auto-reload):
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+
+     # Production:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+
+     # Or run directly:
+     python -m server.app
+ """
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv is required for the web interface. Install dependencies with 'uv sync'."
+     ) from e
+
+ try:
+     from ..models import CyberAnalystAction, CyberAnalystObservation
+     from .Cyber_analyst_environment import CyberAnalystEnvironment
+ except ImportError:
+     from models import CyberAnalystAction, CyberAnalystObservation
+     from server.Cyber_analyst_environment import CyberAnalystEnvironment
+
+
+ # Create the app with web interface and README integration
+ app = create_app(
+     CyberAnalystEnvironment,
+     CyberAnalystAction,
+     CyberAnalystObservation,
+     env_name="Cyber_analyst",
+     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+ )
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """
+     Entry point for direct execution via uv run or python -m.
+
+     This function enables running the server without Docker:
+         uv run --project . server
+         uv run --project . server --port 8001
+         python -m Cyber_analyst.server.app
+
+     Args:
+         host: Host address to bind to (default: "0.0.0.0")
+         port: Port number to listen on (default: 8000)
+
+     For production deployments, consider using uvicorn directly with
+     multiple workers:
+         uvicorn Cyber_analyst.server.app:app --workers 4
+     """
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     main()
server/graders.py CHANGED
@@ -1,169 +1,169 @@
- """Deterministic graders for the Cyber Analyst OpenEnv tasks."""
-
- from __future__ import annotations
-
- import json
- from typing import Any
-
- try:
-     from .tasks import SCENARIOS
- except ImportError:  # pragma: no cover - supports direct module execution
-     from tasks import SCENARIOS
-
-
- MIN_SCORE = 0.01
- MAX_SCORE = 0.99
-
-
- def safe_reward(score: float | int | None) -> float:
-     """Clamp validator-facing scores to the strict open interval (0, 1)."""
-
-     try:
-         value = float(score if score is not None else 0.0)
-     except (TypeError, ValueError):
-         value = 0.0
-     return max(MIN_SCORE, min(MAX_SCORE, value))
-
-
- def _coerce_report(report: Any) -> dict[str, Any]:
-     if isinstance(report, dict):
-         return report
-     if isinstance(report, str):
-         try:
-             decoded = json.loads(report)
-         except json.JSONDecodeError:
-             return {"summary": report, "findings": []}
-         return decoded if isinstance(decoded, dict) else {"findings": []}
-     return {"findings": []}
-
-
- def _text_contains_any(text: str, keywords: list[str]) -> bool:
-     lowered = text.lower()
-     return any(keyword.lower() in lowered for keyword in keywords)
-
-
- def _report_findings(report: dict[str, Any]) -> list[dict[str, Any]]:
-     findings = report.get("findings", [])
-     if isinstance(findings, dict):
-         findings = [findings]
-     return [finding for finding in findings if isinstance(finding, dict)]
-
-
- def score_report(
-     task_id: str,
-     report: Any,
-     verified_findings: list[dict[str, Any]] | None = None,
-     validation_attempted: bool = False,
- ) -> tuple[float, dict[str, Any]]:
-     """Score a submitted report against one task's deterministic ground truth."""
-
-     scenario = SCENARIOS.get(task_id)
-     report_dict = _coerce_report(report)
-     report_findings = _report_findings(report_dict)
-     verified_findings = verified_findings or []
-
-     if scenario is None:
-         return MIN_SCORE, {"unknown_task": task_id}
-
-     expected_type = scenario["finding_type"]
-     expected_evidence = set(scenario.get("required_evidence", [])) | set(
-         scenario.get("supporting_evidence", [])
-     )
-
-     matching_verified = [
-         finding
-         for finding in verified_findings
-         if finding.get("finding_type") == expected_type
-     ]
-     matching_report = [
-         finding for finding in report_findings if finding.get("finding_type") == expected_type
-     ]
-
-     score = 0.05
-     breakdown: dict[str, Any] = {
-         "base": 0.05,
-         "verified_correct": 0.0,
-         "valid_evidence": 0.0,
-         "actionable_remediation": 0.0,
-         "hallucination_penalty": 0.0,
-         "validation_penalty": 0.0,
-     }
-
-     if matching_verified and matching_report:
-         impact_text = " ".join(
-             str(finding.get("impact", "")) + " " + str(finding.get("description", ""))
-             for finding in matching_report
-         )
-         if _text_contains_any(impact_text, scenario.get("impact_keywords", [])):
-             score += 0.60
-             breakdown["verified_correct"] = 0.60
-
-     report_evidence: set[str] = set()
-     for finding in matching_report:
-         evidence_ids = finding.get("evidence_ids", [])
-         if isinstance(evidence_ids, str):
-             evidence_ids = [evidence_ids]
-         report_evidence.update(str(evidence_id) for evidence_id in evidence_ids)
-
-     if report_evidence & expected_evidence:
-         score += 0.15
-         breakdown["valid_evidence"] = 0.15
-
-     remediation_text = " ".join(
-         str(finding.get("remediation", "")) for finding in matching_report
-     )
-     if _text_contains_any(remediation_text, scenario.get("remediation_keywords", [])):
-         score += 0.15
-         breakdown["actionable_remediation"] = 0.15
-
-     verified_types = {finding.get("finding_type") for finding in verified_findings}
-     hallucinated = [
-         finding
-         for finding in report_findings
-         if finding.get("finding_type") not in verified_types
-     ]
-     if hallucinated:
-         penalty = 0.40 * len(hallucinated)
-         score -= penalty
-         breakdown["hallucination_penalty"] = -penalty
-
-     if not validation_attempted:
-         score -= 0.20
-         breakdown["validation_penalty"] = -0.20
-
-     final_score = safe_reward(score)
-     breakdown["raw_score"] = round(score, 4)
-     breakdown["score"] = final_score
-     return final_score, breakdown
-
-
- def _payload_from_args(*args: Any, **kwargs: Any) -> dict[str, Any]:
-     if args and isinstance(args[0], dict):
-         payload = dict(args[0])
-     else:
-         payload = {}
-     payload.update(kwargs)
-     return payload
-
-
- def grade_task(task_id: str, *args: Any, **kwargs: Any) -> float:
-     """Manifest-friendly grader adapter."""
-
-     payload = _payload_from_args(*args, **kwargs)
-     report = payload.get("report") or payload.get("report_json") or payload
-     verified_findings = payload.get("verified_findings", [])
-     validation_attempted = bool(payload.get("validation_attempted", False))
-     score, _ = score_report(task_id, report, verified_findings, validation_attempted)
-     return score
-
-
- def grade_secret_exposure_easy(*args: Any, **kwargs: Any) -> float:
-     return grade_task("secret_exposure_easy", *args, **kwargs)
-
-
- def grade_missing_security_headers_medium(*args: Any, **kwargs: Any) -> float:
-     return grade_task("missing_security_headers_medium", *args, **kwargs)
-
-
- def grade_authz_boundary_hard(*args: Any, **kwargs: Any) -> float:
-     return grade_task("authz_boundary_hard", *args, **kwargs)
 
```python
"""Deterministic graders for the Cyber Analyst OpenEnv tasks."""

from __future__ import annotations

import json
from typing import Any

try:
    from .tasks import SCENARIOS
except ImportError:  # pragma: no cover - supports direct module execution
    from tasks import SCENARIOS


MIN_SCORE = 0.01
MAX_SCORE = 0.99


def safe_reward(score: float | int | None) -> float:
    """Clamp validator-facing scores to the strict open interval (0, 1)."""

    try:
        value = float(score if score is not None else 0.0)
    except (TypeError, ValueError):
        value = 0.0
    return max(MIN_SCORE, min(MAX_SCORE, value))


def _coerce_report(report: Any) -> dict[str, Any]:
    if isinstance(report, dict):
        return report
    if isinstance(report, str):
        try:
            decoded = json.loads(report)
        except json.JSONDecodeError:
            return {"summary": report, "findings": []}
        return decoded if isinstance(decoded, dict) else {"findings": []}
    return {"findings": []}


def _text_contains_any(text: str, keywords: list[str]) -> bool:
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def _report_findings(report: dict[str, Any]) -> list[dict[str, Any]]:
    findings = report.get("findings", [])
    if isinstance(findings, dict):
        findings = [findings]
    return [finding for finding in findings if isinstance(finding, dict)]


def score_report(
    task_id: str,
    report: Any,
    verified_findings: list[dict[str, Any]] | None = None,
    validation_attempted: bool = False,
) -> tuple[float, dict[str, Any]]:
    """Score a submitted report against one task's deterministic ground truth."""

    scenario = SCENARIOS.get(task_id)
    report_dict = _coerce_report(report)
    report_findings = _report_findings(report_dict)
    verified_findings = verified_findings or []

    if scenario is None:
        return MIN_SCORE, {"unknown_task": task_id}

    expected_type = scenario["finding_type"]
    expected_evidence = set(scenario.get("required_evidence", [])) | set(
        scenario.get("supporting_evidence", [])
    )

    matching_verified = [
        finding
        for finding in verified_findings
        if finding.get("finding_type") == expected_type
    ]
    matching_report = [
        finding for finding in report_findings if finding.get("finding_type") == expected_type
    ]

    score = 0.05
    breakdown: dict[str, Any] = {
        "base": 0.05,
        "verified_correct": 0.0,
        "valid_evidence": 0.0,
        "actionable_remediation": 0.0,
        "hallucination_penalty": 0.0,
        "validation_penalty": 0.0,
    }

    if matching_verified and matching_report:
        impact_text = " ".join(
            str(finding.get("impact", "")) + " " + str(finding.get("description", ""))
            for finding in matching_report
        )
        if _text_contains_any(impact_text, scenario.get("impact_keywords", [])):
            score += 0.60
            breakdown["verified_correct"] = 0.60

        report_evidence: set[str] = set()
        for finding in matching_report:
            evidence_ids = finding.get("evidence_ids", [])
            if isinstance(evidence_ids, str):
                evidence_ids = [evidence_ids]
            report_evidence.update(str(evidence_id) for evidence_id in evidence_ids)

        if report_evidence & expected_evidence:
            score += 0.15
            breakdown["valid_evidence"] = 0.15

        remediation_text = " ".join(
            str(finding.get("remediation", "")) for finding in matching_report
        )
        if _text_contains_any(remediation_text, scenario.get("remediation_keywords", [])):
            score += 0.15
            breakdown["actionable_remediation"] = 0.15

    verified_types = {finding.get("finding_type") for finding in verified_findings}
    hallucinated = [
        finding
        for finding in report_findings
        if finding.get("finding_type") not in verified_types
    ]
    if hallucinated:
        penalty = 0.40 * len(hallucinated)
        score -= penalty
        breakdown["hallucination_penalty"] = -penalty

    if not validation_attempted:
        score -= 0.20
        breakdown["validation_penalty"] = -0.20

    final_score = safe_reward(score)
    breakdown["raw_score"] = round(score, 4)
    breakdown["score"] = final_score
    return final_score, breakdown


def _payload_from_args(*args: Any, **kwargs: Any) -> dict[str, Any]:
    if args and isinstance(args[0], dict):
        payload = dict(args[0])
    else:
        payload = {}
    payload.update(kwargs)
    return payload


def grade_task(task_id: str, *args: Any, **kwargs: Any) -> float:
    """Manifest-friendly grader adapter."""

    payload = _payload_from_args(*args, **kwargs)
    report = payload.get("report") or payload.get("report_json") or payload
    verified_findings = payload.get("verified_findings", [])
    validation_attempted = bool(payload.get("validation_attempted", False))
    score, _ = score_report(task_id, report, verified_findings, validation_attempted)
    return score


def grade_secret_exposure_easy(*args: Any, **kwargs: Any) -> float:
    return grade_task("secret_exposure_easy", *args, **kwargs)


def grade_missing_security_headers_medium(*args: Any, **kwargs: Any) -> float:
    return grade_task("missing_security_headers_medium", *args, **kwargs)


def grade_authz_boundary_hard(*args: Any, **kwargs: Any) -> float:
    return grade_task("authz_boundary_hard", *args, **kwargs)
```
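The grader's arithmetic is purely additive with a final clamp into (0, 1). A minimal standalone sketch of the best and worst cases (it reimplements the clamp and weight constants locally rather than importing the module, so the exact figures assume the weights shown above):

```python
# Standalone sketch of the grader's score arithmetic: additive sub-scores,
# then a clamp into the strict open interval (0, 1).
MIN_SCORE, MAX_SCORE = 0.01, 0.99


def clamp(score: float) -> float:
    """Mirror of safe_reward's clamp (local copy for illustration)."""
    return max(MIN_SCORE, min(MAX_SCORE, score))


# Fully correct report: base + verified_correct + valid_evidence + remediation.
best = clamp(0.05 + 0.60 + 0.15 + 0.15)
# Hallucinated, unvalidated report: base - one hallucination - no validation.
worst = clamp(0.05 - 0.40 - 0.20)
```

Note the raw negative total in the second case never reaches the caller: the clamp floors it at 0.01, which is why the hallucination test below expects exactly that score.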
server/requirements.txt CHANGED
```
openenv[core]>=0.2.0
fastapi>=0.115.0
uvicorn>=0.24.0
openai>=1.0.0
```
server/tasks.py CHANGED
```python
"""Deterministic scenario registry for SecOps Evidence Gym."""

from __future__ import annotations

from copy import deepcopy
from random import Random
from typing import Any


DEFAULT_TASK_ID = "secret_exposure_easy"

TOOL_CATALOG: list[dict[str, Any]] = [
    {
        "name": "list_assets",
        "description": "List synthetic services, routes, and artifact collections.",
        "args": {},
    },
    {
        "name": "get_log_events",
        "description": "Return sanitized telemetry evidence ids for a service/query.",
        "args": {"service_id": "str", "query": "str"},
    },
    {
        "name": "check_security_headers",
        "description": "Inspect a service header snapshot and return pass/fail evidence.",
        "args": {"service_id": "str"},
    },
    {
        "name": "search_repo",
        "description": "Search synthetic repo/config snippets for evidence ids.",
        "args": {"query": "str"},
    },
    {
        "name": "scan_dependencies",
        "description": "Inspect a synthetic dependency manifest excerpt.",
        "args": {},
    },
    {
        "name": "create_finding",
        "description": "Store a candidate finding for verifier review.",
        "args": {
            "finding_type": "str",
            "evidence_ids": "list[str]",
            "severity_guess": "str",
            "remediation": "str",
        },
    },
    {
        "name": "validate_finding",
        "description": "Run the deterministic verifier for a candidate finding.",
        "args": {"finding_id": "str"},
    },
    {
        "name": "submit_report",
        "description": "Submit the final structured report and end the episode.",
        "args": {"report_json": "dict"},
    },
]


SCENARIOS: dict[str, dict[str, Any]] = {
    "secret_exposure_easy": {
        "task_id": "secret_exposure_easy",
        "difficulty": "easy",
        "title": "Secret exposure in repo snapshot",
        "alert": (
            "Repository hygiene alert: a synthetic config snapshot may contain "
            "credential-like material. Investigate, cite evidence, validate, and "
            "submit a remediation report."
        ),
        "ground_truth_id": "GT-SECRET-001",
        "finding_type": "secret_exposure",
        "severity": "high",
        "required_evidence": ["EVID-101"],
        "remediation_keywords": ["rotate", "remove"],
        "impact_keywords": ["secret", "credential", "api key", "leak"],
        "assets": {
            "services": ["gateway", "profile-service", "admin-service"],
            "routes": ["/login", "/profile", "/admin/export"],
            "collections": ["repo_snapshot", "telemetry", "headers", "dependencies"],
        },
        "repo": [
            {
                "evidence_id": "EVID-101",
                "source": "repo_snapshot/config/profile-service.env",
                "service_id": "profile-service",
                "tags": ["api key", "secret", "credential", "config"],
                "text": (
                    "PROFILE_EXPORT_API_KEY=CYBERLAB_DO_NOT_USE_API_KEY_7F3A "
                    "# synthetic lab credential"
                ),
            },
            {
                "evidence_id": "EVID-102",
                "source": "repo_snapshot/config/gateway.env",
                "service_id": "gateway",
                "tags": ["config", "benign"],
                "text": "LOG_LEVEL=info; FEATURE_SAFE_HEADERS=true",
            },
        ],
        "logs": [
            {
                "evidence_id": "EVID-103",
                "service_id": "profile-service",
                "tags": ["benign", "startup"],
                "text": "profile-service boot completed for synthetic tenant secops-lab",
            }
        ],
        "headers": {
            "gateway": {
                "evidence_id": "EVID-104",
                "present": ["Strict-Transport-Security", "Content-Security-Policy"],
                "missing": [],
            }
        },
        "dependencies": [
            {
                "evidence_id": "EVID-105",
                "source": "repo_snapshot/requirements.lock",
                "tags": ["dependency", "benign"],
                "text": "fastapi==0.115.0; pydantic==2.8.2",
            }
        ],
    },
    "missing_security_headers_medium": {
        "task_id": "missing_security_headers_medium",
        "difficulty": "medium",
        "title": "Missing security headers",
        "alert": (
            "Gateway response-hardening alert: verify whether required security "
            "headers are missing or weak and submit evidence-backed remediation."
        ),
        "ground_truth_id": "GT-HEADERS-001",
        "finding_type": "missing_security_headers",
        "severity": "medium",
        "required_evidence": ["EVID-201"],
        "remediation_keywords": ["hsts", "csp"],
        "impact_keywords": ["header", "hsts", "csp", "clickjacking"],
        "assets": {
            "services": ["gateway", "profile-service", "admin-service"],
            "routes": ["/login", "/profile", "/admin/export"],
            "collections": ["repo_snapshot", "telemetry", "headers", "dependencies"],
        },
        "repo": [
            {
                "evidence_id": "EVID-202",
                "source": "repo_snapshot/gateway/security_headers.py",
                "service_id": "gateway",
                "tags": ["headers", "config"],
                "text": "X-Frame-Options is set, but HSTS and CSP are not configured.",
            }
        ],
        "logs": [
            {
                "evidence_id": "EVID-203",
                "service_id": "gateway",
                "tags": ["benign", "response"],
                "text": "GET /profile 200 request_id=req-442 synthetic header audit",
            }
        ],
        "headers": {
            "gateway": {
                "evidence_id": "EVID-201",
                "present": ["X-Frame-Options", "X-Content-Type-Options"],
                "missing": ["Strict-Transport-Security", "Content-Security-Policy"],
            }
        },
        "dependencies": [
            {
                "evidence_id": "EVID-204",
                "source": "repo_snapshot/requirements.lock",
                "tags": ["dependency", "benign"],
                "text": "starlette==0.38.2; uvicorn==0.30.1",
            }
        ],
    },
    "authz_boundary_hard": {
        "task_id": "authz_boundary_hard",
        "difficulty": "hard",
        "title": "Authorisation boundary misconfiguration",
        "alert": (
            "Access-control drift alert: investigate a route/role mismatch in the "
            "synthetic admin service and submit a validated remediation report."
        ),
        "ground_truth_id": "GT-AUTHZ-001",
        "finding_type": "authz_boundary_misconfiguration",
        "severity": "critical",
        "required_evidence": ["EVID-301"],
        "supporting_evidence": ["EVID-302"],
        "remediation_keywords": ["least privilege", "policy", "regression"],
        "impact_keywords": ["authorization", "authorisation", "role", "admin"],
        "assets": {
            "services": ["gateway", "profile-service", "admin-service"],
            "routes": ["/login", "/profile", "/admin/export"],
            "collections": ["repo_snapshot", "telemetry", "headers", "dependencies"],
        },
        "repo": [
            {
                "evidence_id": "EVID-301",
                "source": "repo_snapshot/admin-service/policy_matrix.yaml",
                "service_id": "admin-service",
                "tags": ["authorization", "role", "policy", "admin export"],
                "text": (
                    "route=/admin/export allowed_roles=[admin, analyst] "
                    "expected_roles=[admin]"
                ),
            }
        ],
        "logs": [
            {
                "evidence_id": "EVID-302",
                "service_id": "admin-service",
                "tags": ["authorization", "role", "admin export"],
                "text": (
                    "request_id=req-913 route=/admin/export role=analyst "
                    "decision=allow synthetic boundary-check event"
                ),
            },
            {
                "evidence_id": "EVID-303",
                "service_id": "gateway",
                "tags": ["benign", "auth"],
                "text": "request_id=req-912 route=/profile role=user decision=allow",
            },
        ],
        "headers": {
            "admin-service": {
                "evidence_id": "EVID-304",
                "present": ["Strict-Transport-Security", "Content-Security-Policy"],
                "missing": [],
            }
        },
        "dependencies": [
            {
                "evidence_id": "EVID-305",
                "source": "repo_snapshot/requirements.lock",
                "tags": ["dependency", "benign"],
                "text": "pyyaml==6.0.2; fastapi==0.115.0",
            }
        ],
    },
}


def list_task_ids() -> list[str]:
    return list(SCENARIOS)


def build_scenario(task_id: str | None, seed: int | None = None) -> dict[str, Any]:
    """Return a deep-copied scenario with deterministic benign variation."""

    selected_task_id = task_id if task_id in SCENARIOS else DEFAULT_TASK_ID
    scenario = deepcopy(SCENARIOS[selected_task_id])
    scenario["seed"] = seed

    rng = Random(seed if seed is not None else 0)
    service_alias_sets = [
        ["gateway", "profile-service", "admin-service"],
        ["edge-gateway", "user-profile", "admin-service"],
        ["public-gateway", "profile-api", "backoffice-admin"],
    ]
    aliases = service_alias_sets[rng.randrange(len(service_alias_sets))]
    original_services = scenario["assets"]["services"]
    alias_map = dict(zip(original_services, aliases, strict=True))

    scenario["service_aliases"] = alias_map
    scenario["assets"]["services"] = [alias_map.get(s, s) for s in original_services]

    for collection_name in ("repo", "logs"):
        for item in scenario.get(collection_name, []):
            service_id = item.get("service_id")
            if service_id in alias_map:
                item["service_id"] = alias_map[service_id]

    scenario["headers"] = {
        alias_map.get(service_id, service_id): snapshot
        for service_id, snapshot in scenario.get("headers", {}).items()
    }

    for entries_name in ("repo", "logs", "dependencies"):
        rng.shuffle(scenario.get(entries_name, []))

    return scenario
```
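The seeded variation in `build_scenario` hinges on constructing a `random.Random(seed)` before any draw, so identical seeds always yield identical alias picks and shuffles. A minimal standalone sketch of just the alias selection (reusing the literal alias sets from the module; `pick_aliases` is an illustrative helper, not part of the package):

```python
from random import Random

# The same literal alias sets used by build_scenario above.
ALIAS_SETS = [
    ["gateway", "profile-service", "admin-service"],
    ["edge-gateway", "user-profile", "admin-service"],
    ["public-gateway", "profile-api", "backoffice-admin"],
]


def pick_aliases(seed):
    # A fresh Random(seed) makes every draw reproducible; seed=None falls
    # back to 0, matching build_scenario's behaviour.
    rng = Random(seed if seed is not None else 0)
    return ALIAS_SETS[rng.randrange(len(ALIAS_SETS))]
```

Because the generator is re-seeded per call, `pick_aliases(22) == pick_aliases(22)` always holds, which is what the `test_seed_determinism_for_assets` test below relies on.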
tests/conftest.py CHANGED
```python
import sys
from pathlib import Path


PACKAGE_PARENT = Path(__file__).resolve().parents[2]
if str(PACKAGE_PARENT) not in sys.path:
    sys.path.insert(0, str(PACKAGE_PARENT))
```
tests/test_environment.py CHANGED
```python
from Cyber_analyst.models import CyberAnalystAction
from Cyber_analyst.server.Cyber_analyst_environment import CyberAnalystEnvironment
from Cyber_analyst.server.graders import (
    grade_authz_boundary_hard,
    grade_missing_security_headers_medium,
    grade_secret_exposure_easy,
    safe_reward,
)


def _run_success_path(task_id, actions):
    env = CyberAnalystEnvironment()
    obs = env.reset(task_id=task_id, seed=7)
    assert obs.task_id == task_id

    for action in actions:
        obs = env.step(action)

    assert obs.done is True
    assert obs.tool_result["score"] > 0.5
    assert 0.01 <= obs.tool_result["score"] <= 0.99
    assert obs.error == ""
    return obs


def test_secret_exposure_success_path():
    report = {
        "findings": [
            {
                "finding_type": "secret_exposure",
                "evidence_ids": ["EVID-101"],
                "impact": "A synthetic API key secret is exposed in config.",
                "remediation": "Remove the key and rotate the credential.",
            }
        ]
    }
    obs = _run_success_path(
        "secret_exposure_easy",
        [
            CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}),
            CyberAnalystAction(
                tool_name="create_finding",
                args={
                    "finding_type": "secret_exposure",
                    "evidence_ids": ["EVID-101"],
                    "severity_guess": "high",
                    "remediation": "Remove and rotate the synthetic credential.",
                },
            ),
            CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
            CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
        ],
    )
    assert obs.verified_findings[0]["matching_gt_id"] == "GT-SECRET-001"
    assert "trajectory_jsonl" in obs.tool_result
    assert "search_repo" in obs.tool_result["trajectory_jsonl"]


def test_missing_security_headers_success_path():
    report = {
        "findings": [
            {
                "finding_type": "missing_security_headers",
                "evidence_ids": ["EVID-201"],
                "impact": "The gateway is missing HSTS and CSP headers.",
                "remediation": "Add HSTS and CSP at the gateway.",
            }
        ]
    }
    obs = _run_success_path(
        "missing_security_headers_medium",
        [
            CyberAnalystAction(
                tool_name="check_security_headers", args={"service_id": "gateway"}
            ),
            CyberAnalystAction(
                tool_name="create_finding",
                args={
                    "finding_type": "missing_security_headers",
                    "evidence_ids": ["EVID-201"],
                    "severity_guess": "medium",
                    "remediation": "Add HSTS and CSP headers.",
                },
            ),
            CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
            CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
        ],
    )
    assert obs.score_breakdown["valid_evidence"] == 0.15


def test_authz_boundary_success_path_with_alias_compatible_service_ids():
    report = {
        "findings": [
            {
                "finding_type": "authz_boundary_misconfiguration",
                "evidence_ids": ["EVID-301", "EVID-302"],
                "impact": "The admin route authorization policy allows an analyst role.",
                "remediation": "Apply least privilege in the policy and add a regression test.",
            }
        ]
    }
    obs = _run_success_path(
        "authz_boundary_hard",
        [
            CyberAnalystAction(tool_name="list_assets", args={}),
            CyberAnalystAction(
                tool_name="get_log_events",
                args={"service_id": "admin-service", "query": "admin export"},
            ),
            CyberAnalystAction(tool_name="search_repo", args={"query": "admin export"}),
            CyberAnalystAction(
                tool_name="create_finding",
                args={
                    "finding_type": "authz_boundary_misconfiguration",
                    "evidence_ids": ["EVID-301", "EVID-302"],
                    "severity_guess": "critical",
                    "remediation": "Apply least privilege and add a regression test.",
                },
            ),
            CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
            CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
        ],
    )
    assert obs.score_breakdown["actionable_remediation"] == 0.15


def test_invalid_tool_returns_observation_error():
    env = CyberAnalystEnvironment()
    env.reset(task_id="secret_exposure_easy", seed=1)
    obs = env.step(CyberAnalystAction(tool_name="shell", args={"cmd": "whoami"}))
    assert obs.done is False
    assert obs.error == "unsupported_tool"
    assert obs.tool_result["ok"] is False


def test_hallucinated_report_scores_low_but_in_range():
    env = CyberAnalystEnvironment()
    env.reset(task_id="secret_exposure_easy", seed=1)
    obs = env.step(
        CyberAnalystAction(
            tool_name="submit_report",
            args={
                "report_json": {
                    "findings": [
                        {
                            "finding_type": "remote_code_execution",
                            "evidence_ids": [],
                            "impact": "Unsupported claim.",
                            "remediation": "Unsupported remediation.",
                        }
                    ]
                }
            },
        )
    )
    assert obs.done is True
    assert obs.tool_result["score"] == 0.01


def test_repeated_action_hard_stops_episode():
    env = CyberAnalystEnvironment()
    env.reset(task_id="secret_exposure_easy", seed=1)
    obs = None
    for _ in range(6):
        obs = env.step(CyberAnalystAction(tool_name="list_assets", args={}))
    assert obs is not None
    assert obs.done is True
    assert obs.error == "repeat_hard_stop"


def test_seed_determinism_for_assets():
    env_one = CyberAnalystEnvironment()
    env_two = CyberAnalystEnvironment()
    env_one.reset(task_id="authz_boundary_hard", seed=22)
    env_two.reset(task_id="authz_boundary_hard", seed=22)
    obs_one = env_one.step(CyberAnalystAction(tool_name="list_assets", args={}))
    obs_two = env_two.step(CyberAnalystAction(tool_name="list_assets", args={}))
    assert obs_one.tool_result == obs_two.tool_result


def test_grader_adapters_and_clamp_are_strictly_in_range():
    assert safe_reward(-1) == 0.01
    assert safe_reward(2) == 0.99
    assert 0.01 <= grade_secret_exposure_easy() <= 0.99
    assert 0.01 <= grade_missing_security_headers_medium() <= 0.99
    assert 0.01 <= grade_authz_boundary_hard() <= 0.99
```
+ from Cyber_analyst.models import CyberAnalystAction
+ from Cyber_analyst.server.Cyber_analyst_environment import CyberAnalystEnvironment
+ from Cyber_analyst.server.graders import (
+     grade_authz_boundary_hard,
+     grade_missing_security_headers_medium,
+     grade_secret_exposure_easy,
+     safe_reward,
+ )
+ 
+ 
+ def _run_success_path(task_id, actions):
+     env = CyberAnalystEnvironment()
+     obs = env.reset(task_id=task_id, seed=7)
+     assert obs.task_id == task_id
+ 
+     for action in actions:
+         obs = env.step(action)
+ 
+     assert obs.done is True
+     assert obs.tool_result["score"] > 0.5
+     assert 0.01 <= obs.tool_result["score"] <= 0.99
+     assert obs.error == ""
+     return obs
+ 
+ 
+ def test_secret_exposure_success_path():
+     report = {
+         "findings": [
+             {
+                 "finding_type": "secret_exposure",
+                 "evidence_ids": ["EVID-101"],
+                 "impact": "A synthetic API key secret is exposed in config.",
+                 "remediation": "Remove the key and rotate the credential.",
+             }
+         ]
+     }
+     obs = _run_success_path(
+         "secret_exposure_easy",
+         [
+             CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}),
+             CyberAnalystAction(
+                 tool_name="create_finding",
+                 args={
+                     "finding_type": "secret_exposure",
+                     "evidence_ids": ["EVID-101"],
+                     "severity_guess": "high",
+                     "remediation": "Remove and rotate the synthetic credential.",
+                 },
+             ),
+             CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
+             CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
+         ],
+     )
+     assert obs.verified_findings[0]["matching_gt_id"] == "GT-SECRET-001"
+     assert "trajectory_jsonl" in obs.tool_result
+     assert "search_repo" in obs.tool_result["trajectory_jsonl"]
+ 
+ 
+ def test_missing_security_headers_success_path():
+     report = {
+         "findings": [
+             {
+                 "finding_type": "missing_security_headers",
+                 "evidence_ids": ["EVID-201"],
+                 "impact": "The gateway is missing HSTS and CSP headers.",
+                 "remediation": "Add HSTS and CSP at the gateway.",
+             }
+         ]
+     }
+     obs = _run_success_path(
+         "missing_security_headers_medium",
+         [
+             CyberAnalystAction(
+                 tool_name="check_security_headers", args={"service_id": "gateway"}
+             ),
+             CyberAnalystAction(
+                 tool_name="create_finding",
+                 args={
+                     "finding_type": "missing_security_headers",
+                     "evidence_ids": ["EVID-201"],
+                     "severity_guess": "medium",
+                     "remediation": "Add HSTS and CSP headers.",
+                 },
+             ),
+             CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
+             CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
+         ],
+     )
+     assert obs.score_breakdown["valid_evidence"] == 0.15
+ 
+ 
+ def test_authz_boundary_success_path_with_alias_compatible_service_ids():
+     report = {
+         "findings": [
+             {
+                 "finding_type": "authz_boundary_misconfiguration",
+                 "evidence_ids": ["EVID-301", "EVID-302"],
+                 "impact": "The admin route authorization policy allows an analyst role.",
+                 "remediation": "Apply least privilege in the policy and add a regression test.",
+             }
+         ]
+     }
+     obs = _run_success_path(
+         "authz_boundary_hard",
+         [
+             CyberAnalystAction(tool_name="list_assets", args={}),
+             CyberAnalystAction(
+                 tool_name="get_log_events",
+                 args={"service_id": "admin-service", "query": "admin export"},
+             ),
+             CyberAnalystAction(tool_name="search_repo", args={"query": "admin export"}),
+             CyberAnalystAction(
+                 tool_name="create_finding",
+                 args={
+                     "finding_type": "authz_boundary_misconfiguration",
+                     "evidence_ids": ["EVID-301", "EVID-302"],
+                     "severity_guess": "critical",
+                     "remediation": "Apply least privilege and add a regression test.",
+                 },
+             ),
+             CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
+             CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
+         ],
+     )
+     assert obs.score_breakdown["actionable_remediation"] == 0.15
+ 
+ 
+ def test_invalid_tool_returns_observation_error():
+     env = CyberAnalystEnvironment()
+     env.reset(task_id="secret_exposure_easy", seed=1)
+     obs = env.step(CyberAnalystAction(tool_name="shell", args={"cmd": "whoami"}))
+     assert obs.done is False
+     assert obs.error == "unsupported_tool"
+     assert obs.tool_result["ok"] is False
+ 
+ 
+ def test_hallucinated_report_scores_low_but_in_range():
+     env = CyberAnalystEnvironment()
+     env.reset(task_id="secret_exposure_easy", seed=1)
+     obs = env.step(
+         CyberAnalystAction(
+             tool_name="submit_report",
+             args={
+                 "report_json": {
+                     "findings": [
+                         {
+                             "finding_type": "remote_code_execution",
+                             "evidence_ids": [],
+                             "impact": "Unsupported claim.",
+                             "remediation": "Unsupported remediation.",
+                         }
+                     ]
+                 }
+             },
+         )
+     )
+     assert obs.done is True
+     assert obs.tool_result["score"] == 0.01
+ 
+ 
+ def test_repeated_action_hard_stops_episode():
+     env = CyberAnalystEnvironment()
+     env.reset(task_id="secret_exposure_easy", seed=1)
+     obs = None
+     for _ in range(6):
+         obs = env.step(CyberAnalystAction(tool_name="list_assets", args={}))
+     assert obs is not None
+     assert obs.done is True
+     assert obs.error == "repeat_hard_stop"
+ 
+ 
+ def test_seed_determinism_for_assets():
+     env_one = CyberAnalystEnvironment()
+     env_two = CyberAnalystEnvironment()
+     env_one.reset(task_id="authz_boundary_hard", seed=22)
+     env_two.reset(task_id="authz_boundary_hard", seed=22)
+     obs_one = env_one.step(CyberAnalystAction(tool_name="list_assets", args={}))
+     obs_two = env_two.step(CyberAnalystAction(tool_name="list_assets", args={}))
+     assert obs_one.tool_result == obs_two.tool_result
+ 
+ 
+ def test_grader_adapters_and_clamp_are_strictly_in_range():
+     assert safe_reward(-1) == 0.01
+     assert safe_reward(2) == 0.99
+     assert 0.01 <= grade_secret_exposure_easy() <= 0.99
+     assert 0.01 <= grade_missing_security_headers_medium() <= 0.99
+     assert 0.01 <= grade_authz_boundary_hard() <= 0.99
tests/test_inference.py CHANGED
@@ -1,47 +1,47 @@
- import pytest
- 
- from Cyber_analyst.inference import (
-     ModelActionError,
-     action_to_log,
-     error_action,
-     parse_model_action,
- )
- 
- 
- def test_parse_model_action_accepts_compact_json():
-     action = parse_model_action('{"tool_name":"search_repo","args":{"query":"api key"}}')
- 
-     assert action.tool_name == "search_repo"
-     assert action.args == {"query": "api key"}
- 
- 
- def test_parse_model_action_accepts_fenced_json():
-     action = parse_model_action(
-         """```json
- {"tool_name":"list_assets","args":{}}
- ```"""
-     )
- 
-     assert action.tool_name == "list_assets"
-     assert action.args == {}
- 
- 
- def test_parse_model_action_rejects_malformed_json():
-     with pytest.raises(ModelActionError, match="model_parse_error"):
-         parse_model_action("search the repo for api keys")
- 
- 
- def test_action_to_log_is_single_line_json():
-     action = parse_model_action('{"tool_name":"search_repo","args":{"query":"api\\nkey"}}')
- 
-     logged = action_to_log(action)
- 
-     assert "\n" not in logged
-     assert logged == '{"args":{"query":"api\\nkey"},"tool_name":"search_repo"}'
- 
- 
- def test_error_action_uses_strict_diagnostic_tool_name():
-     action = error_action(ModelActionError("model_parse_error: empty response"))
- 
-     assert action.tool_name == "model_parse_error"
-     assert action.args == {"message": "model_parse_error: empty response"}

+ import pytest
+ 
+ from Cyber_analyst.inference import (
+     ModelActionError,
+     action_to_log,
+     error_action,
+     parse_model_action,
+ )
+ 
+ 
+ def test_parse_model_action_accepts_compact_json():
+     action = parse_model_action('{"tool_name":"search_repo","args":{"query":"api key"}}')
+ 
+     assert action.tool_name == "search_repo"
+     assert action.args == {"query": "api key"}
+ 
+ 
+ def test_parse_model_action_accepts_fenced_json():
+     action = parse_model_action(
+         """```json
+ {"tool_name":"list_assets","args":{}}
+ ```"""
+     )
+ 
+     assert action.tool_name == "list_assets"
+     assert action.args == {}
+ 
+ 
+ def test_parse_model_action_rejects_malformed_json():
+     with pytest.raises(ModelActionError, match="model_parse_error"):
+         parse_model_action("search the repo for api keys")
+ 
+ 
+ def test_action_to_log_is_single_line_json():
+     action = parse_model_action('{"tool_name":"search_repo","args":{"query":"api\\nkey"}}')
+ 
+     logged = action_to_log(action)
+ 
+     assert "\n" not in logged
+     assert logged == '{"args":{"query":"api\\nkey"},"tool_name":"search_repo"}'
+ 
+ 
+ def test_error_action_uses_strict_diagnostic_tool_name():
+     action = error_action(ModelActionError("model_parse_error: empty response"))
+ 
+     assert action.tool_name == "model_parse_error"
+     assert action.args == {"message": "model_parse_error: empty response"}
uv.lock CHANGED
The diff for this file is too large to render. See raw diff