Humanlearning committed
Commit 9852074 · 2 Parent(s): b2ee80d63a6397

chore: sync local CyberSecurity_OWASP with HF Space history
.gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
docs/deep-research-report.md ADDED
@@ -0,0 +1,531 @@
# OpenEnv Hackathon Execution Pipeline for a Safe Cybersecurity Analyst Environment

## Executive decision and scope selection

**SECTION 1 — Executive Decision**

[SOURCED] The hackathon’s *validator + judging* constraints strongly favour environments that: (a) simulate a real-world task (not “games/toys”), (b) are fully OpenEnv-compliant, (c) ship with **≥3 tasks with graders**, (d) produce **scores/rewards in the 0–1 range**, (e) include a reproducible `inference.py` that uses the **OpenAI client** (for any LLM calls) and prints strict `[START]/[STEP]/[END]` logs, and (f) run within a ~**20 minute** budget on ~**2 vCPU / 8 GB** infra.

[INFERENCE] Under these realities, your cybersecurity-analyst direction is *not* automatically the best, but it *can* become a high-probability-to-win choice if—and only if—you narrow to a deterministic, bounded, “investigate → cite evidence → verify → remediate” loop where (i) tools are tightly sandboxed, (ii) graders are deterministic, and (iii) the action space is small enough to be learnable and demo-able under the runtime cap.

[PROPOSAL] **Decision:** keep the cybersecurity direction, but **narrow aggressively** to a V1 environment that benchmarks **disciplined security triage + evidence-grounded reporting**, not pentesting/exploitation. The V1 I recommend building is:

[PROPOSAL] **“SecOps Evidence Gym”** — a safe, isolated OpenEnv environment where an agent investigates a *synthetic* microservice “organisation” via a **bounded tool API**, collects **evidence IDs**, validates candidate findings through **deterministic verifiers**, and submits a structured remediation report.

[SOURCED] This matches strong “winner DNA” seen in `kube-sre-gym` (realistic professional workflow + verification + narrative clarity) while remaining implementable in a hackathon budget.

[PROPOSAL] **What to cut entirely in V1 (non-negotiable):**
- “Live target” behaviour; no external network targets; no arbitrary HTTP to the internet. 🔒
- Any exploit payload recipes, exploit chains, privilege-escalation playbooks, or “how to hack X”. 🔒
- Arbitrary shell access (`bash`, `kubectl`, `nmap`, etc.) inside the environment. (Action space explosion + safety risk.)
- LLM-only grading/judging for correctness. (Reward hacking + non-determinism.)

[SOURCED] **What to keep (but narrow):** tool-using investigation, multi-step interaction, and deterministic verification—these are consistent with what OpenEnv is designed to support (typed `reset/step/state`, isolated server, type-safe schemas).

**SECTION 3 — Candidate Scope Comparison**

[SOURCED] The scoring below is anchored on hackathon validator requirements (3+ graded tasks, 0–1 scoring, strict inference logging, runtime limits) plus OpenEnv’s scaffolding/CLI/deployment model.

[PROPOSAL] Weighted criteria (sum=1.00): judging fit 0.14, OpenEnv fit 0.12, grader determinism 0.14, implementation risk 0.12, runtime feasibility 0.10, demoability 0.10, real-world usefulness 0.10, novelty 0.08, training usefulness 0.06, shipping-on-time likelihood 0.04.

| Candidate scope | Judging fit | OpenEnv fit | Determinism | Impl risk (lower=better) | Runtime | Demo | Real-world use | Novelty | Training use | Ship-on-time | Weighted total |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| **A. Your original direction:** “disciplined cyber analyst investigating a sandbox” (broad) | 8 | 7 | 6 | 4 | 6 | 8 | 8 | 8 | 8 | 4 | **6.7** |
| **B. Narrow cyber variant (recommended):** evidence-first triage lab with bounded tools + deterministic verifiers + structured report | 9 | 8 | 9 | 7 | 9 | 9 | 8 | 7 | 8 | 8 | **8.4** |
| **C. Adjacent: SRE incident triage (single-turn, deterministic logs → RCA)** | 9 | 8 | 10 | 9 | 10 | 8 | 8 | 5 | 6 | 9 | **8.3** |
| **D. Adjacent: prompt-injection “WAF” benchmark** | 7 | 8 | 8 | 7 | 9 | 9 | 7 | 7 | 6 | 7 | **7.6** |

[INFERENCE] Candidate C (SRE triage) is extremely validator-safe (many examples already pass deep validation), but it is likely more saturated and less “new” in judging. Candidate B keeps your cybersecurity theme while retaining the determinism and boundedness that the validator and time budget demand, so it is the best balance for **winning + real-world usefulness**.

**SECTION 4 — Final V1 Problem Statement**

[PROPOSAL] **One-sentence version:** Build a safe OpenEnv environment that trains and benchmarks an agent to perform **evidence-grounded security triage** on a bounded synthetic system and produce a **remediation-oriented report** without hallucinating.

[PROPOSAL] **Short pitch (demo-ready):** “SecOps Evidence Gym” gives the model an alert and a constrained set of investigation tools. The model must gather evidence, validate findings with deterministic verifiers, and submit a structured report. Scores reward verified correctness, penalise hallucinated claims and wasted steps, and remain strictly within (0,1) to satisfy the hackathon validator. ✅🔒

[PROPOSAL] **Precise implementation version:** A FastAPI OpenEnv server exposing typed `reset()`, `step()`, `state()` and a manifest `openenv.yaml` with at least three tasks (easy/medium/hard), each associated with a grader. The environment implements multi-step tool calls inside `step()`, and uses deterministic verifiers plus a strict score clamp to `(0,1)` for validator compatibility.

## Winner patterns and judging fit extraction

**SECTION 2 — Winner Pattern Extraction**

[SOURCED] `kube-sre-gym` (OpenEnv Hackathon SF winner) demonstrates several patterns that appear strongly aligned with what judges value: a **realistic professional task**, an explicit **multi-step workflow** (triage → investigate → fix → verify), **multi-layer verification** (programmatic health checks + judge), a strong narrative that explains learning dynamics, and deployment on **Hugging Face Spaces**.

[SOURCED] Concretely, `kube-sre-gym` highlights: “real cluster/tool interaction”, verification layers to prevent false success, curriculum progression, and strong documentation that makes the environment’s value obvious quickly.

[SOURCED] A 2026 hackathon submission that explicitly claims Phase-2 deep-validation success (`Incident-Triage-Environment`) reveals particularly transferable “validator-winning” patterns:
- A manifest with `spec_version`, `runtime`, `app`, `port`, and a **`tasks:` list where each task has a `grader:` pointing to importable Python functions**.
- Deterministic graders that clamp outputs to a validator-friendly range.
- An `inference.py` that uses the OpenAI SDK and prints the strict stdout protocol with `[START]`, `[STEP]`, `[END]` lines in a stable key=value format.

[SOURCED] The official OpenEnv repo reinforces what is transferable: type-safe action/observation/state contracts, containerised isolation, and standard Gym-like APIs. It also explicitly says OpenEnv is experimental and APIs can change, which increases the value of a minimal, validator-first build loop.

[INFERENCE] Given that hackathon evaluation combines programmatic checks with LLM scoring, you must optimise for **deterministic correctness** *and* a compelling narrative/demo.

[PROPOSAL] Transferable “winner patterns” you should copy (selectively):
- **Strong “professional workflow” framing** (SRE, security analyst, triage) with clear step boundaries.
- **Small, discoverable tool set** that mirrors real practice (logs, config, policy checks) while staying bounded.
- **Deterministic verification** (programmatic checks) as the source of truth for correctness.
- **Narrative traceability**: logs, episode IDs, and a short “watch the agent work” demo.
- **Deployment excellence**: clean Docker build, working `/health`, working `/web` UI if enabled, and reproducible inference.

[PROPOSAL] What *not* to copy blindly:
- The “real cluster” dependency (e.g., GKE) is a high operational burden and can exceed the hackathon’s limited infra budget.
- LLM-as-judge for correctness (too easy to reward-hack; non-deterministic). Use it, at most, for *format/style*, not correctness.

## Core V1 environment design and benchmark tasks

**SECTION 8 — Core Environment Design**

**V1 concept (aggressively narrow).**
[PROPOSAL] Your environment is a **synthetic organisation** with a small, fixed topology (three “services” + artefacts). The agent receives an alert. It can only interact via **approved tools** (implemented inside the simulator). It must (a) gather evidence IDs, (b) validate candidate findings, and (c) submit a report.

**Topology (V1).**
[PROPOSAL] Fixed components (no containers inside containers):
- `gateway` (public entry), `profile-service`, `admin-service`
- `repo_snapshot` (static code/config excerpts)
- `telemetry` (sanitised logs + “header snapshot” + “dependency manifest snapshot”)

**Reset logic.**
[PROPOSAL] `reset(task_id=..., seed=...)` selects a scenario variant and initialises:
- episode ID, step count
- scenario ground truth (one injected issue per episode in V1)
- tool budgets + “allowed scope” banner
- an evidence registry mapping `EVID-### → artefact snippet`

It then returns an initial observation containing the alert, the tool catalogue, and an empty “verified findings” list.

**Randomisation strategy.**
[PROPOSAL] Use seed-driven, deterministic randomisation:
- rename services/routes/IDs (`profile-service` might become `user-profile`),
- shuffle benign log lines around the key evidence,
- vary exact header sets / dependency versions within a small closed set,
- keep each scenario **fully reproducible from the seed**.
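These bullets can be sketched in plain Python by deriving every random choice from a single seeded RNG. The `ALIAS_POOL`, `BENIGN_LOGS`, and `make_scenario` names below are illustrative assumptions, not OpenEnv API:

```python
# Hypothetical sketch of seed-driven scenario variation.
import random

ALIAS_POOL = {
    "profile-service": ["profile-service", "user-profile", "account-svc"],
    "admin-service": ["admin-service", "admin-api", "ops-console"],
}
BENIGN_LOGS = ["GET /healthz 200", "cache warmup complete", "metrics flushed"]

def make_scenario(task_id: str, seed: int) -> dict:
    """Derive every random choice from one RNG, so the whole scenario
    is reproducible from (task_id, seed)."""
    rng = random.Random(f"{task_id}:{seed}")
    names = {svc: rng.choice(aliases) for svc, aliases in ALIAS_POOL.items()}
    logs = BENIGN_LOGS[:]
    rng.shuffle(logs)
    # The key evidence is inserted at a seed-chosen position among benign noise.
    key_line = f"api_key=SYNTH-{rng.randrange(10**6):06d}  # DO NOT USE OUTSIDE LAB"
    logs.insert(rng.randrange(len(logs) + 1), key_line)
    evidence = {f"EVID-{i:03d}": line for i, line in enumerate(logs, start=1)}
    return {"service_names": names, "evidence": evidence}

# Same seed -> identical scenario; a different seed varies names and layout.
assert make_scenario("easy_secret", seed=7) == make_scenario("easy_secret", seed=7)
```

A nice side effect is that episodes never need to be stored: persisting `(task_id, seed)` is enough to replay any scenario exactly.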

[SOURCED] Benchmark generators (e.g., AMaze) exist specifically to create diverse but controlled environments for evaluating generalisation, supporting the idea of seeded procedural variation rather than a single static scenario.

**Safety boundaries.**
[PROPOSAL] The sandbox contains **no live targets**, no real secrets, and no outbound network. “Secrets” are synthetic strings with an explicit “DO NOT USE OUTSIDE LAB” marker. Tools return synthetic results only. 🔒

[SOURCED] NIST’s cyber range guidance emphasises cyber ranges as safe and legal environments for training and assessment; separate research also discusses that cyber ranges themselves have security risks that must be mitigated (e.g., leakage/misuse), reinforcing the need for strict isolation and artefact sanitisation.

**How state is exposed to the agent.**
[PROPOSAL] Expose only a concise state summary: current phase, step budget remaining, tools remaining, verified findings count, and recent evidence IDs. Keep full ground truth hidden.

**Tool/action design (bounded action space).**
[PROPOSAL] V1 tool list (keep it ≤8 tools):
1) `list_assets()` → returns asset IDs and route IDs
2) `get_log_events(service_id, query)` → returns evidence IDs
3) `check_security_headers(service_id)` → returns evidence IDs + pass/fail list
4) `search_repo(query)` → returns evidence IDs from code snippets
5) `scan_dependencies()` → returns evidence IDs from a lockfile excerpt
6) `create_finding(finding_type, evidence_ids, severity_guess, remediation)` → stores candidate finding
7) `validate_finding(finding_id)` → deterministic verifier; returns `(verified, matching_gt_id)`
8) `submit_report(report_json)` → terminal action
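To keep the action space closed, `step()` can route every tool call through an explicit allowlist. A sketch under that assumption — the `tool` registry, `dispatch` helper, and `state` shape here are hypothetical, not OpenEnv API:

```python
# Illustrative bounded tool router: only registered tools can run.
TOOLS = {}

def tool(name):
    """Register a function in the allowlist under a stable tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("list_assets")
def list_assets(state, **_):
    return {"asset_ids": sorted(state["assets"])}

@tool("search_repo")
def search_repo(state, query="", **_):
    # state["repo"] maps evidence IDs to synthetic code/config lines.
    return {"evidence_ids": [eid for eid, line in state["repo"].items() if query in line]}

def dispatch(state, tool_name, args):
    """Anything outside the allowlist is a soft error, never a server crash."""
    fn = TOOLS.get(tool_name)
    if fn is None:
        return {"error": f"unknown tool: {tool_name}", "allowed": sorted(TOOLS)}
    return fn(state, **args)

state = {
    "assets": {"gateway", "profile-service"},
    "repo": {"EVID-001": "api_key=SYNTH-123  # leaked", "EVID-002": "timeout=30"},
}
print(dispatch(state, "nmap", {}))                       # rejected: not allowlisted
print(dispatch(state, "search_repo", {"query": "api_key"}))
```

Because unknown tools return a structured error observation instead of raising, rollouts survive malformed model actions, which matters for both training and the validator run.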

**Anti-loop logic.**
[PROPOSAL] Track action signatures `(tool_name, args_hash)` and:
- apply increasing penalties for repeats,
- hard-stop an episode if identical actions repeat ≥6 times, returning `done=True` with a low score,
- always return a valid observation (never a server crash) to preserve training rollouts.
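These rules fit in a small guard object. A sketch, assuming the `(tool_name, args_hash)` signature scheme above; the exact escalation schedule (`-0.03` per repeat beyond the second call, mirroring the reward table later in this report) is an illustrative choice:

```python
# Hypothetical anti-loop guard keyed on a hash of (tool_name, args).
import hashlib
import json
from collections import Counter

class LoopGuard:
    HARD_STOP = 6  # identical calls this many times end the episode

    def __init__(self):
        self.counts = Counter()

    def observe(self, tool_name: str, args: dict):
        """Return (penalty, hard_stop) for one action."""
        sig = hashlib.sha256(
            json.dumps([tool_name, args], sort_keys=True).encode()
        ).hexdigest()
        self.counts[sig] += 1
        n = self.counts[sig]
        penalty = round(-0.03 * max(0, n - 2), 2)  # escalates per extra repeat
        return penalty, n >= self.HARD_STOP

guard = LoopGuard()
results = [guard.observe("get_log_events", {"service_id": "gateway"}) for _ in range(6)]
print(results[-1])  # (-0.12, True): the sixth identical call triggers the hard stop
```

Sorting the JSON keys before hashing keeps the signature stable when the model reorders equivalent arguments.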

[SOURCED] OpenEnv’s environment-creation guidance strongly implies you should implement robust behaviour around `reset/step/state` with typed contracts and predictable server behaviour.

**SECTION 9 — Tasks / Benchmarks**

[SOURCED] The hackathon requires **at least 3 tasks with graders** and explicitly checks the tasks registry.

[PROPOSAL] V1 ships exactly **3 flagship tasks**, difficulty-tiered, each with deterministic success criteria and intermediate milestones.

**Flagship tasks (easy/medium/hard).**
[PROPOSAL] Each task is a *family* with small seeded variants.

**Easy: Secret exposure in repo snapshot**
- Goal: identify a leaked synthetic API key in a config file excerpt; propose rotation/removal.
- Deterministic success: the report includes the correct finding type `secret_exposure`, includes ≥1 correct evidence ID, and the remediation mentions rotation + removal.
- Intermediate rewards: `search_repo()` surfaces the evidence ID; `create_finding()` with the correct type gets partial credit; `validate_finding()` confirms.
- False-positive check: claiming *additional* vulnerabilities that were never verified triggers a penalty.

**Medium: Missing security headers**
- Goal: detect missing/weak security headers in a service “header snapshot”; propose remediation.
- Deterministic success: correct identification of the missing header set (from a fixed list), plus a remediation mapping (e.g., add HSTS, CSP) within the environment’s rubric.
- Intermediate rewards: correct tool usage (`check_security_headers()`), correct mapping to finding type, successful verifier validation.
- Generalisation: header ordering/extra benign headers vary by seed.

**Hard: Authorisation boundary misconfiguration**
- Goal: detect an access-control policy bug in a route/role matrix (modelled safely, without exploitation).
- Deterministic success: evidence IDs must show the policy mismatch; the report must describe impact and remediation (principle of least privilege + policy fix + regression test).
- Intermediate rewards: `list_assets()` + `get_log_events()` reveal the mismatch pattern; candidate finding validated.
- False-positive guardrail: generic “SQLi/RCE” claims are penalised unless evidence supports them (it won’t, by design).

**Stretch tasks (post-V1, not on the hackathon critical path).**
[PROPOSAL] Dependency-risk identification (synthetic CVE mapping), error-handling info leak, prioritisation under a strict budget, and multi-finding episodes (2 findings) — but only once the validator-safe V1 is shipped.

## OpenEnv compliance blueprint and repo plan

**SECTION 6 — OpenEnv Compliance Blueprint**

[SOURCED] OpenEnv’s core contract is Gymnasium-like APIs (`reset()`, `step()`, `state()`), with type-safe models, packaged behind a FastAPI server and typically accessed via an EnvClient.

[SOURCED] For environment creators, OpenEnv explicitly supports `openenv init`, and documents a canonical structure: `models.py`, `client.py`, `server/app.py`, `server/<environment>.py`, plus `openenv.yaml` and packaging metadata.

[SOURCED] OpenEnv provides CLI commands including `openenv init` and `openenv push` for deploying to **Hugging Face Spaces**.

[SOURCED] The OpenEnv repo’s environment-building guide demonstrates typed models (Action/Observation/State) as Python dataclasses and a `create_fastapi_app(...)` helper to serve the environment.

[SOURCED] The OpenEnv repo explicitly warns *not* to copy outdated manifest patterns; current examples use `spec_version`, `type`, `runtime`, `app`, `port`.

**Validator-sensitive details you must implement (non-negotiable).**
[PROPOSAL] Based on official requirements + observed validator behaviour:
- Provide `openenv.yaml` with `spec_version: 1`, `name`, `runtime: fastapi`, `app: server.app:app`, `port: <int>`, and a `tasks:` list with **≥3 tasks, each having `id`, `description`, `grader`**.
- Ensure each task’s final score is **strictly within (0,1)** to avoid fail-fast validation errors.
- Implement an `inference.py` that prints `[START]/[STEP]/[END]` lines exactly and uses the OpenAI SDK for LLM calls (if any), reading `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`.
- Provide a `/health` endpoint that returns 200 once ready (commonly used in examples and deployment docs).
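Put together, a manifest along these lines would satisfy the checklist. The field names follow the validated examples cited above; the task ids, descriptions, and the `module:function` grader-path syntax are illustrative assumptions to double-check against the current validator:

```yaml
spec_version: 1
name: secops_evidence_gym
type: environment
runtime: fastapi
app: server.app:app
port: 8000
tasks:
  - id: easy_secret_exposure
    description: Find the leaked synthetic API key and propose rotation/removal.
    grader: server.graders:grade_easy
  - id: medium_missing_headers
    description: Identify missing/weak security headers and map remediations.
    grader: server.graders:grade_medium
  - id: hard_authz_boundary
    description: Detect the role/route policy mismatch and propose a least-privilege fix.
    grader: server.graders:grade_hard
```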

**Sync vs async.**
[SOURCED] OpenEnv supports async-first clients with a `.sync()` wrapper for synchronous usage. For hackathon inference scripts, synchronous control flow is often simpler and is widely used in examples.

**What not to copy from older examples.**
[SOURCED] Some course material shows a simplified `openenv.yaml` (`name/version/description`), but the repo’s skill guidance explicitly warns against outdated manifests; follow the current spec-style manifest used in validated examples.

**SECTION 7 — Repo / File Tree Plan**

[SOURCED] OpenEnv’s scaffold and common community submissions converge on a predictable repository layout and file naming.

[PROPOSAL] Recommended repo structure (submission-ready):

```
secops_evidence_gym/
  openenv.yaml               # REQUIRED: spec_version, runtime, app, port, tasks+graders
  pyproject.toml             # REQUIRED: package metadata + deps
  README.md                  # REQUIRED: judging narrative + quickstart + safety boundaries
  inference.py               # REQUIRED: strict stdout logs + OpenAI client usage
  models.py                  # REQUIRED: typed Action/Observation/State dataclasses
  client.py                  # REQUIRED: EnvClient wrapper (sync + async)
  __init__.py                # REQUIRED: export Env + models for pip install

  server/
    app.py                   # REQUIRED: create_fastapi_app(...) wiring + /health
    environment.py           # REQUIRED: SecOpsEvidenceGymEnvironment (reset/step/state)
    graders.py               # REQUIRED: grade_easy/medium/hard + safe_reward clamp
    tasks.py                 # OPTIONAL (high-leverage): scenario registry + seed sampling
    safety.py                # OPTIONAL (high-leverage): tool allowlist + sanitisation helpers
  requirements.txt           # OPTIONAL (if the Docker build uses it)
  Dockerfile                 # REQUIRED (practically): HF Spaces docker build

  tests/
    test_api_contract.py     # smoke: reset/step/state doesn’t crash; reward range
    test_graders.py          # unit: deterministic scoring + strict (0,1) clamp
    test_seed_determinism.py # unit: same seed → same evidence IDs
```

[PROPOSAL] Mandatory for hackathon success: `openenv.yaml`, server app wiring, three tasks+graders, Docker build success, `inference.py` with strict logs, and a README that makes the environment’s value obvious in <60 seconds.

## Reward, grading, and anti-hallucination design

**SECTION 10 — Reward Design**

[SOURCED] OpenEnv leaves reward semantics to the environment; you are responsible for correctness scoring and determinism.

[SOURCED] Hackathon validation has shown strict “score must be between 0 and 1 (not 0.0 and not 1.0)” behaviour, and teams clamp rewards accordingly (e.g., to 0.01–0.99).

[SOURCED] Empirical RL research in other domains (e.g., autonomous racing) shows reward design choices materially affect performance and generalisation, supporting the need for careful shaping rather than a single sparse terminal reward.

[PROPOSAL] **Core principle:** correctness is **verifier-gated**, not language-judged. You can optionally add *format/style* checks, but never allow style to dominate the correctness reward.

### Reward structure (practical V1)

[PROPOSAL] Normalise the final *task score* into `(0,1)` and keep per-step rewards small enough that the summed episode reward stays in `(0,1)` as well (or use only the final reward, depending on your environment semantics). Use a single “score” to satisfy the validator and expose detailed breakdowns in `observation.metadata`.

**Terminal (sparse) components** ✅
[PROPOSAL]
- `+0.60` if at least one ground-truth finding is verified and correctly described (type + impact).
- `+0.15` if the report includes **≥1 valid evidence ID** per finding and those IDs correspond to the right artefacts.
- `+0.15` if remediation is actionable (specific control, config, test).
- `-0.40` per hallucinated/unverified finding claimed in the report.
- `-0.20` if the agent fails to run `validate_finding()` before `submit_report()`.

**Intermediate (dense) components** 🧭
[PROPOSAL]
- `+0.02` for discovering a *new* relevant evidence ID (first time only).
- `+0.03` for creating a well-formed candidate finding that references evidence IDs.
- `-0.01` per step (efficiency pressure).
- `-0.03` for repeating the same tool call (exact same args) beyond 2 times.

**False-positive penalties / anti-hallucination** 🧯
[PROPOSAL] A “hallucination” is operationally defined as: the report asserts a finding that is not in the environment’s `verified_findings` list. This is easy to compute deterministically and maps directly to your stated goal (“avoid hallucinating findings”).
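With that definition, the terminal grader stays a few lines of deterministic Python. A sketch using the weights above; the `safe_reward` helper and the report/ground-truth shapes are assumptions for illustration:

```python
# Hypothetical terminal grader: verifier-gated scoring plus a strict (0,1) clamp.
def safe_reward(score: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Clamp into (0,1) so the validator's strict range check passes."""
    return max(lo, min(hi, score))

def grade_report(report: dict, verified: set, ground_truth: dict) -> float:
    score = 0.0
    hallucinated = 0
    for finding in report.get("findings", []):
        if finding["id"] not in verified:
            hallucinated += 1            # asserted but never verified
            continue
        if finding["type"] == ground_truth["type"]:
            score += 0.60
        if set(finding.get("evidence_ids", [])) & set(ground_truth["evidence_ids"]):
            score += 0.15
        if finding.get("remediation"):   # V1 proxy for "actionable"
            score += 0.15
    score -= 0.40 * hallucinated
    return safe_reward(round(score, 2))

gt = {"type": "secret_exposure", "evidence_ids": ["EVID-003"]}
good = {"findings": [{"id": "F1", "type": "secret_exposure",
                      "evidence_ids": ["EVID-003"], "remediation": "rotate key"}]}
print(grade_report(good, verified={"F1"}, ground_truth=gt))   # 0.9
print(grade_report(good, verified=set(), ground_truth=gt))    # 0.01 (clamped)
```

Note how the same report scores 0.9 when its finding was verified but drops to the clamp floor when it was not: the reward is gated on the verifier, not on the report's wording.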

### Avoiding reward hacking

[PROPOSAL] Hardening rules:
- Cap rewards from verbosity: extra words do not add points.
- Make evidence IDs required for high scores (prevents purely rhetorical “security speak”).
- Penalise calling `validate_finding()` repeatedly without new evidence.
- Reject “kitchen sink” reporting by penalising extra unverified findings.

### Binary vs shaped reward

[PROPOSAL] **Binary-only** (0/1) is easy to implement but brittle for multi-step tool use; the agent gets no gradient for *how* to investigate efficiently.

[PROPOSAL] **Lightly shaped** (recommended) keeps correctness deterministic while providing enough signal to train the investigation workflow (evidence collection, validation order, loop avoidance). This mirrors the broader lesson from reward-engineering research: shaping and tuning can significantly alter learning outcomes.

### Deterministic judge vs hybrid judge

[PROPOSAL]
- **Strict deterministic judge (recommended V1):** all correctness via verifiers + string/structure checks.
- **Hybrid (stretch):** add a small LLM-based style score (e.g., clarity), heavily downweighted (≤0.05 of total) and never affecting pass/fail correctness.

## Baseline inference pipeline and strict stdout logging

**SECTION 11 — Baseline Inference Pipeline**

[SOURCED] Hackathon requirements include: a reproducible `inference.py`, the OpenAI-client requirement for LLM calls (using provided env vars), and strict stdout logging.

[SOURCED] A concrete, hackathon-aligned stdout format has been used by validated submissions (example):
- `[START] task=<name> env=<benchmark> model=<model_name>`
- `[STEP] step=<n> action=<str> reward=<0.00> done=<true|false> error=<msg|null>`
- `[END] task=<name> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>`

[SOURCED] The same example inference script uses the OpenAI SDK, reading `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.

### Responsibilities of `inference.py`

[PROPOSAL] `inference.py` should:
- read env vars: `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`, `ENV_URL` (and optionally a `TASK_NAME` override),
- connect to the env via the `.sync()` client,
- run tasks in a fixed order (easy → medium → hard),
- execute a bounded number of steps per task,
- log exactly one `[START]...` per task, one `[END]...` per task, and one `[STEP]...` per environment step,
- always exit with code 0 (even on failures) and log errors in the `[STEP] error=` field to avoid hard crashes.
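Keeping the three line formats in one place makes the log parser hard to break. A minimal sketch of such helpers; the function names are assumptions, while the field layout follows the format quoted above:

```python
# Hypothetical logging helpers for the strict [START]/[STEP]/[END] protocol.
def log_start(task, env, model):
    print(f"[START] task={task} env={env} model={model}")

def log_step(step, action, reward, done, error=None):
    print(f"[STEP] step={step} action={action} reward={reward:.2f} "
          f"done={str(done).lower()} error={error if error else 'null'}")

def log_end(task, success, steps, score, rewards):
    print(f"[END] task={task} success={str(success).lower()} steps={steps} "
          f"score={score:.2f} rewards={','.join(f'{r:.2f}' for r in rewards)}")

log_start("easy_secret_exposure", "secops_evidence_gym", "demo-model")
log_step(1, "search_repo", 0.02, False)
log_end("easy_secret_exposure", True, 1, 0.90, [0.02, 0.88])
```

Routing every print through these helpers also makes it trivial to unit-test the exact byte-for-byte output before the validator ever sees it.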

### Control flow (V1 baseline strategy)

[PROPOSAL] Use a **hybrid baseline** that is reliable under time constraints:
- a scripted tool sequence per task (fast, deterministic),
- one LLM call (optional) to draft the final report from gathered evidence (so the demo shows “agentic reasoning”),
- temperature fixed to 0 for reproducibility (and lower variance).

[SOURCED] Deterministic inference settings like `TEMPERATURE=0.0` are used in competitive OpenEnv hackathon baselines.

### Minimum viable baseline (must ship)

[PROPOSAL] For each task:
1) `reset(task_id=<tier>)`
2) run 2–4 tool calls that are always relevant (e.g., `check_security_headers`, `search_repo`, etc.)
3) `create_finding(...)` using evidence IDs
4) `validate_finding(finding_id)`
5) `submit_report(report_json)`
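The five steps above can be sketched against a stub client; `StubEnv` and its `step_tool` method are stand-ins for the real environment client, not the OpenEnv API:

```python
# Hypothetical scripted baseline for the easy task, driven against a stub.
class StubEnv:
    def __init__(self):
        self.findings, self.verified = {}, set()

    def reset(self, task_id):
        return {"alert": f"alert for {task_id}", "evidence_ids": []}

    def step_tool(self, name, **args):
        if name == "search_repo":
            return {"evidence_ids": ["EVID-001"]}
        if name == "create_finding":
            fid = f"F{len(self.findings) + 1}"
            self.findings[fid] = args
            return {"finding_id": fid}
        if name == "validate_finding":
            self.verified.add(args["finding_id"])
            return {"verified": True}
        if name == "submit_report":
            return {"done": True, "score": 0.9}
        return {"error": f"unknown tool {name}"}

def run_easy(env):
    env.reset(task_id="easy_secret_exposure")
    ev = env.step_tool("search_repo", query="api_key")["evidence_ids"]
    fid = env.step_tool("create_finding", finding_type="secret_exposure",
                        evidence_ids=ev, severity_guess="high",
                        remediation="rotate and remove key")["finding_id"]
    env.step_tool("validate_finding", finding_id=fid)  # always verify before reporting
    return env.step_tool("submit_report", report_json={"findings": [fid]})

print(run_easy(StubEnv()))  # {'done': True, 'score': 0.9}
```

Writing the baseline against a stub first means the control flow, step budget, and logging can be exercised before the server exists, then swapped onto the real client.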
319
+
320
+ ### Stronger baseline (only if time permits)
321
+
322
+ [PROPOSAL] Add one planning LLM call that chooses among tools based on the alert type, but still keep a hard step limit, and always include verifier validation before reporting.
323
+
324
+ ## Complete build, validation, deployment, and submission pipeline
325
+
326
+ **SECTION 5 — Complete End-to-End Pipeline**
327
+
328
+ [SOURCED] This pipeline is built to satisfy both OpenEnv conventions (init/push, typed models, FastAPI server) and hackathon validation constraints (tasks/graders, inference logging, runtime budgets). citeturn18view0turn19search2turn3view6turn22view1
329
+
330
+ ### Phase goals, deliverables, verification (execution-ready)
331
+
332
+ [PROPOSAL] The table below is the “do-this-in-order” execution plan. It is intentionally validator-first.
333
+
334
+ | Phase | Goal | Deliverables | Files touched | Acceptance criteria | Main risks | How to verify |
335
+ |---|---|---|---|---|---|---|
336
+ | Scope lock | Freeze V1 to 3 tasks + bounded tools | 1-page spec + non-goals | README.md | No pentest/exploit scope; 3 tasks defined | Scope creep | Manual checklist |
337
+ | Scaffold | Generate OpenEnv skeleton | Working importable package | openenv.yaml, models.py, client.py, server/* | `python -c "import ..."` succeeds | Wrong template/paths | Local import smoke test |
338
+ | Environment core | Implement reset/step/state; tool router | Simulator runs end-to-end | server/environment.py | reset+step returns typed observation; no crashes | Action validation crashes | manual `curl` + python client |
339
+ | Tasks + graders | Implement 3 graders + strict (0,1) clamp | `grade_easy/medium/hard` | server/graders.py, openenv.yaml | tasks discoverable; scores strictly in (0,1) | Validator fail-fast | unit tests + manual checks |
340
+ | Baseline inference | Make inference reproducible + strict logs | inference.py | inference.py | prints correct `[START]/[STEP]/[END]` | log-parser failure | run script locally |
341
+ | Local validation | Run OpenEnv build & validate | passes `openenv validate` | Dockerfile, server/app.py | validate passes locally | port mismatch | `openenv validate --url ...` |
342
+ | Docker + HF | Deploy to Spaces | live endpoint | openenv push output | `/health` 200; reset+step works remotely | HF port/env mismatch | curl + python client |
343
+ | Submission | Final narrative + demo | polished README + screenshots | README.md | demo works in <2 min | unclear story | run “demo script” |
344
+
345
+ ### Concrete build plan with commands
346
+
347
+ [SOURCED] OpenEnv supports `openenv init` and `openenv push` and documents this as the standard creator workflow.
348
+ [SOURCED] The OpenEnv course also provides a grounded dev loop: `uv sync`, `uv run server`, `curl /health`, and Docker build/run commands.
349
+
350
+ [PROPOSAL] Commands (copy/paste order):
351
+
352
+ 1) **Scaffold**
353
+ ```bash
354
+ pip install openenv-core
355
+ openenv init secops_evidence_gym
356
+ cd secops_evidence_gym
357
+ ```
358
+ [SOURCED] `openenv init` is the documented way to scaffold a new environment.
359
+
360
+ 2) **Local dev install + run**
361
+ ```bash
362
+ uv sync
363
+ uv run server
364
+ curl http://localhost:8000/health
365
+ ```
366
+ [SOURCED] `uv run server` and `/health` checks are part of the recommended iteration loop in OpenEnv course materials.
367
+
368
+ 3) **Implement core files (edit)**
369
+ - `models.py`: define `Action/Observation/State` dataclasses
370
+ - `server/environment.py`: implement reset/step/state + tool routing
371
+ - `server/graders.py`: implement `grade_easy/grade_medium/grade_hard` + `safe_reward()`
372
+ - `openenv.yaml`: add `tasks:` with grader import paths
373
+
374
+ [SOURCED] OpenEnv’s environment-building guide explicitly directs you to define models and implement `reset/step/state`, then wire a FastAPI app.
375
+ [SOURCED] A validator-aligned `openenv.yaml` with `spec_version`, `runtime`, `app`, `port`, and `tasks` exists in deep-validation passing examples.
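The fields listed above can be sketched as a minimal `openenv.yaml`; the task IDs and grader import paths below are illustrative assumptions for this draft, not a verified schema:

```yaml
# Sketch only: key names follow the validator-aligned example described above;
# task IDs and grader module paths are hypothetical.
spec_version: 1
runtime: python
app: server.app:app
port: 8000
tasks:
  - id: secret_exposure_easy
    grader: server.graders:grade_easy
  - id: missing_security_headers_medium
    grader: server.graders:grade_medium
  - id: authz_boundary_hard
    grader: server.graders:grade_hard
```

Keeping the grader paths importable from the package root avoids the most common task-discovery failure during validation.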
376
+
377
+ 4) **Build + validate (local)**
378
+ ```bash
379
+ openenv build
380
+ openenv validate --verbose
381
+ ```
382
+ [SOURCED] `openenv build` and `openenv validate` are part of OpenEnv’s recommended validation workflow.
383
+
384
+ 5) **Docker build/run smoke test**
385
+ ```bash
386
+ docker build -t secops-evidence-gym:latest -f server/Dockerfile .
387
+ docker run -p 8000:8000 secops-evidence-gym:latest
388
+ curl http://localhost:8000/health
389
+ ```
390
+ [SOURCED] This `docker build -f server/Dockerfile .` pattern is directly shown in OpenEnv deployment course material.
391
+
392
+ 6) **Run inference locally**
393
+ ```bash
394
+ export HF_TOKEN="..."
395
+ export API_BASE_URL="..."
396
+ export MODEL_NAME="..."
397
+ export ENV_URL="http://localhost:8000"
398
+ python inference.py
399
+ ```
400
+ [SOURCED] These env var names and OpenAI SDK usage are consistent with hackathon guidance and existing inference implementations.
401
+
402
+ 7) **Deploy to Hugging Face Spaces**
403
+ ```bash
404
+ openenv push --repo-id <your-hf-username>/secops-evidence-gym
405
+ ```
406
+ [SOURCED] `openenv push` is described as the fastest path to deploy to **Hugging Face Spaces**.
407
+
408
+ ### Testing and validation plan (high-signal)
409
+
410
+ [SOURCED] OpenEnv stresses predictable API behaviour and type-safe contracts; hackathon validation is fail-fast.
411
+
412
+ [PROPOSAL] Test layers (in priority order):
413
+ - **API contract smoke tests:** reset/step/state return valid JSON; never crash on invalid tool name (should return an observation with an error field).
414
+ - **Grader tests:** for each task, verify (a) correctness cases score high, (b) hallucination cases score low, (c) score always ∈ (0,1).
415
+ - **Seed determinism tests:** same `seed` produces same evidence IDs and same verifier outputs.
416
+ - **Runtime test:** run `inference.py` end-to-end and assert wall-clock < 2 minutes locally; assume < 20 minutes on grader infra even with cold starts.
417
+ - **Reward sanity tests:** ensure reward increases monotonically with verified correctness; fails if verbosity alone increases reward.
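The clamp and grader-range checks above can be sketched in Python; `safe_reward` and the epsilon value are assumptions of this draft, not a validator API:

```python
EPS = 1e-3  # keeps scores strictly inside (0, 1), never exactly 0.0 or 1.0


def safe_reward(raw: float) -> float:
    """Clamp any raw grader output into the open interval (0, 1)."""
    return min(max(float(raw), EPS), 1.0 - EPS)


def test_scores_stay_strictly_inside_open_interval() -> None:
    # Covers negative, boundary, and overflow inputs.
    for raw in (-5.0, 0.0, 0.5, 1.0, 7.0):
        assert 0.0 < safe_reward(raw) < 1.0


def test_hallucination_scores_below_verified_scores() -> None:
    # A report citing unverified evidence should always rank below a verified one.
    assert safe_reward(0.1) < safe_reward(0.9)
```

Running these under pytest gives the grader-range layer of the test plan essentially for free.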
418
+
419
+ ## Submission packaging, execution roadmap, real-world usefulness, and failure modes
420
+
421
+ **SECTION 14 — README / Demo / Submission Narrative**
422
+ [SOURCED] Judges likely assess both the environment’s technical correctness (programmatic checks) and qualitative merit (LLM scoring / narrative).
423
+
424
+ [PROPOSAL] README structure that “feels like a winner” 🏆:
425
+ - **Hero block:** one-paragraph pitch + why it’s real-world + safety claim.
426
+ - **Two-minute demo:** copy/paste commands + expected output snippet with `[START]/[STEP]/[END]`.
427
+ - **Environment contract:** action schema, observation schema, task list.
428
+ - **Grading:** explain deterministic verifiers + hallucination penalties.
429
+ - **Safety & isolation:** explicit exclusions (no egress, no shell, synthetic artefacts).
430
+ - **Real-world relevance:** how this benchmark’s reporting maps to security workflows (triage, evidence, remediation).
431
+ - **Screenshots:** web UI (optional) + an evidence trace + one scored report example.
432
+
433
+ **SECTION 15 — Project Management Plan**
434
+ [PROPOSAL] Day-by-day (assuming a hackathon-style sprint):
435
+
436
+ - **Day 0 (scope lock + scaffold):** environment skeleton, `openenv.yaml` with 3 tasks, stub graders returning 0.5 (clamped), server runs locally.
437
+ - **Day 1 (determinism + validator):** implement scenario generator, evidence registry, verifiers, and strict (0,1) scoring; pass `openenv validate`.
438
+ - **Day 2 (baseline + polish):** implement `inference.py` strict logs; deploy to Spaces; polish README + demo artefacts.
439
+
440
+ [PROPOSAL] Critical path: `openenv.yaml tasks+graders` → grader clamp `(0,1)` → inference stdout format → Docker+Spaces deployment. (Everything else is secondary.)
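The seed-determinism goal from Day 1 can be sketched with a hash-derived RNG; the function name and evidence-ID format here are hypothetical:

```python
import hashlib
import random


def make_scenario(task_id: str, seed: int) -> dict:
    """Derive a per-(task, seed) RNG so replays always see the same evidence IDs."""
    digest = hashlib.sha256(f"{task_id}:{seed}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))  # stable across processes and runs
    evidence_ids = [f"EV-{rng.randrange(10_000):04d}" for _ in range(3)]
    return {"task_id": task_id, "seed": seed, "evidence_ids": evidence_ids}
```

Because the RNG is derived from a cryptographic hash rather than Python's process-salted `hash()`, results stay identical across interpreter restarts, which is exactly what the seed-determinism tests assert.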
441
+
442
+ **SECTION 16 — Real-World Usefulness Plan**
443
+ [SOURCED] NIST’s testing guide emphasises planning, conducting tests, analysing findings, and developing mitigation strategies; your environment’s “evidence → remediation” focus aligns with that lifecycle without requiring offensive exploitation.
444
+
445
+ [PROPOSAL] Who would care after the hackathon:
446
+ - security engineering teams evaluating agentic “triage + reporting” reliability,
447
+ - LLM tooling teams wanting benchmarks for **non-hallucinating, evidence-grounded** outputs,
448
+ - training teams building safe cyber ranges (without weaponisation).
449
+
450
+ [PROPOSAL] Post-hackathon upgrades (highest leverage):
451
+ - export trajectories as JSONL for offline training,
452
+ - add more scenario families (still safe) and a held-out split for generalisation,
453
+ - integrate with RL trainers (e.g., TRL’s OpenEnv integration) to show real training curves.
454
+
455
+ [SOURCED] PenGym provides evidence that realism/faithfulness of environments can affect transfer and stability when moving from simulation to more realistic settings—so you should roadmap a “higher fidelity mode” (still safe) later, not in V1.
456
+
457
+ **SECTION 17 — Why the naive version would fail**
458
+ [PROPOSAL] Top failure patterns (and why they kill submissions):
459
+ - Too broad (full cyber range, live services): fails time/infra constraints.
460
+ - Fuzzy grading (LLM-only judging): non-deterministic, easy to game.
461
+ - Unbounded tools (shell/network): unsafe + untrainable action space.
462
+ - Scores at exactly 0.0 or 1.0: fail-fast “out of range” validator.
463
+ - Inference logs not parseable: phase-1 failure even if env is good.
464
+ - Port / health issues on Spaces: container “works locally” but fails remotely.
465
+
466
+ **SECTION 18 — Final Recommendation**
467
+
468
+ [PROPOSAL] **What should you build?**
469
+ Build **SecOps Evidence Gym**: a deterministic, safe, sandbox-only cyber analyst environment focused on evidence collection, verifier validation, and remediation reporting.
470
+
471
+ [PROPOSAL] **What should V1 include? (minimum winning set)**
472
+ - OpenEnv-compliant FastAPI env with typed models and `reset/step/state`.
473
+ - `openenv.yaml` with **3 tasks + graders**.
474
+ - Deterministic verifiers + strict score clamp to `(0,1)`.
475
+ - Baseline `inference.py` with strict `[START]/[STEP]/[END]` logging + OpenAI SDK usage for any LLM calls.
476
+ - HF Spaces deployment with a working `/health`.
477
+
478
+ [PROPOSAL] **What should you cut?**
479
+ - Any real pentesting/offensive content, any arbitrary command execution, any live targets, any correctness scoring via an LLM judge.
480
+
481
+ [PROPOSAL] **Top 5 implementation decisions that matter most**
482
+ 1) Validator-safe `openenv.yaml` tasks+graders wiring.
483
+ 2) Score/range compliance: clamp to `(0,1)` everywhere.
484
+ 3) Strict stdout format in `inference.py`.
485
+ 4) Deterministic verifiers as the source of truth.
486
+ 5) Bounded tool set (≤8 tools) with anti-loop penalties.
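Decision 5's bounded tool set can be sketched as a small registry router that returns an error observation instead of raising; the tool names and payloads below are illustrative:

```python
from typing import Any, Callable, Dict

# Hypothetical bounded registry (keep it <= 8 tools); all results are synthetic.
TOOLS: Dict[str, Callable[[dict], Any]] = {
    "list_evidence": lambda args: ["EV-0001", "EV-0002"],
    "read_evidence": lambda args: {"id": args.get("id"), "body": "synthetic artefact"},
}


def route(tool_name: str, args: dict) -> dict:
    """Dispatch one tool call; never let a bad action crash the /step endpoint."""
    if tool_name not in TOOLS:
        return {"tool_result": None, "error": f"unknown_tool:{tool_name}"}
    try:
        return {"tool_result": TOOLS[tool_name](args), "error": ""}
    except Exception as exc:  # a buggy tool still yields a valid observation
        return {"tool_result": None, "error": f"tool_error:{exc}"}
```

Returning an error field rather than raising is what keeps the API-contract smoke tests green when the model emits an invalid tool name.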
487
+
488
+ [PROPOSAL] **Minimum viable winning submission**
489
+ A V1 with 3 tasks, deterministic graders, bounded tools, strict inference logging, and a polished README + demo trace.
490
+
491
+ [PROPOSAL] **Minimum viable real-world useful submission**
492
+ The same V1, plus: seed determinism, trajectory export, and a clear “how to add new scenarios” contributor guide.
493
+
494
+ [PROPOSAL] **If you only have time for 20% of ambition—do this exact 20%:**
495
+ - Implement **one** robust multi-step loop (tools → validate → report)
496
+ - Implement **exactly 3** tasks (easy/medium/hard)
497
+ - Make graders deterministic and validator-safe
498
+ - Make deployment + inference bulletproof
499
+ Everything else is stretch.
500
+
501
+ **Confidence (my estimate): 8.4/10** ✅🔥
502
+
503
+ ## Sources and credibility ratings (with exact links)
504
+
505
+ [SOURCED] Ratings are my judgement of authority + relevance for this hackathon context (0–10). URLs are provided verbatim in code form.
506
+
507
+ ### Tier 1 (official OpenEnv + hackathon dashboard)
508
+ - Credibility **9.5/10** — `https://github.com/meta-pytorch/OpenEnv`
509
+ - Credibility **9.0/10** — `https://github.com/meta-pytorch/OpenEnv/blob/main/envs/README.md`
510
+ - Credibility **8.5/10** — `https://github.com/meta-pytorch/OpenEnv/blob/main/.claude/skills/generate-openenv-env/SKILL.md`
511
+ - Credibility **9.0/10** — `https://www.scaler.com/school-of-technology/meta-pytorch-hackathon/dashboard`
512
+
513
+ ### Tier 2 (strong community exemplars)
514
+ - Credibility **8.5/10** — `https://github.com/sid-rp/kube-sre-gym`
515
+ - Credibility **8.0/10** — `https://huggingface.co/openenv-community`
516
+ - Credibility **7.5/10** — `https://github.com/Harikishanth/Incident-Triage-Environment`
517
+
518
+ ### Tier 3 (peer-reviewed / primary references for design constraints)
519
+ - Credibility **8.5/10** — PenGym (Computers & Security, open access): `https://www.sciencedirect.com/science/article/pii/S0167404824004450`
520
+ - Credibility **8.0/10** — Reward design + generalisation (Scientific Reports, 2025): `https://www.nature.com/articles/s41598-025-27702-6`
521
+ - Credibility **8.5/10** — AMaze (JOSS, 2025): `https://joss.theoj.org/papers/10.21105/joss.07208`
522
+ - Credibility **9.5/10** — NIST SP 800-115: `https://csrc.nist.gov/pubs/sp/800/115/final`
523
+ - Credibility **9.0/10** — NIST “Cyber Range: A Guide” (PDF landing): `https://www.nist.gov/document/cyber-range`
524
+ - Credibility **7.5/10** — “Cybersecurity of Cyber Ranges: Threats and Mitigations” (IJISR, 2022 PDF): `https://infonomics-society.org/wp-content/uploads/Cybersecurity-of-Cyber-Ranges.pdf`
525
+
526
+ ### Tier 4 (useful validator “ground truth” signals from the field)
527
+ - Credibility **6.5/10** — Validator failure mode discussion (score must be strictly between 0 and 1): `https://www.reddit.com/r/pytorch/comments/1shi767/meta_x_pytorch_x_sst_x_openenv_hackathon_phase_2/`
528
+ - Credibility **7.0/10** — Strict logging format reference via a verified submission’s `inference.py`: `https://github.com/Harikishanth/Incident-Triage-Environment/blob/main/inference.py`
529
+
530
+ ### Uploaded reference you provided
531
+ - Credibility **7.0/10** (useful as a design draft; not independently authoritative) — `deep-research-report (2).md`
inference.py ADDED
@@ -0,0 +1,290 @@
1
+ #!/usr/bin/env python3
2
+ """Model-backed baseline inference for the Cyber Analyst OpenEnv environment."""
3
+
4
+ from __future__ import annotations
5
+
6
+ import json
7
+ import os
8
+ import sys
9
+ import textwrap
10
+ from dataclasses import dataclass
11
+ from pathlib import Path
12
+ from typing import Any
13
+
14
+ from openai import OpenAI
15
+
16
+ PACKAGE_PARENT = Path(__file__).resolve().parent.parent
17
+ if str(PACKAGE_PARENT) not in sys.path:
18
+ sys.path.insert(0, str(PACKAGE_PARENT))
19
+
20
+ from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv, CyberAnalystObservation
21
+
22
+
23
+ ENV_NAME = "Cyber_analyst"
24
+ ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
25
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
26
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
27
+ TEMPERATURE = float(os.getenv("TEMPERATURE", "0.0"))
28
+ MAX_TOKENS = int(os.getenv("MAX_TOKENS", "512"))
29
+ MAX_STEPS = int(os.getenv("MAX_STEPS", "12"))
30
+ SEED = int(os.getenv("SEED", "7"))
31
+ TASK_IDS = [
32
+ "secret_exposure_easy",
33
+ "missing_security_headers_medium",
34
+ "authz_boundary_hard",
35
+ ]
36
+
37
+ SYSTEM_PROMPT = textwrap.dedent(
38
+ """
39
+ You are running a bounded Cyber Analyst benchmark. You may only choose one
40
+ tool call from the provided tool catalog per turn. All evidence is synthetic;
41
+ do not request shell access, live network access, or external investigation.
42
+
43
+ Return exactly one compact JSON object and no surrounding text:
44
+ {"tool_name":"<tool name>","args":{...}}
45
+
46
+ To complete an episode, first discover relevant evidence, then create and
47
+ validate a finding, then submit a report_json with findings that include
48
+ finding_type, evidence_ids, impact, and remediation.
49
+ """
50
+ ).strip()
51
+
52
+
53
+ @dataclass(frozen=True)
54
+ class LLMConfig:
55
+ base_url: str
56
+ model_name: str
57
+ temperature: float
58
+ max_tokens: int
59
+
60
+
61
+ class ModelActionError(RuntimeError):
62
+ """Raised when the model cannot provide a valid benchmark action."""
63
+
64
+
65
+ def build_llm_config() -> LLMConfig:
66
+ return LLMConfig(
67
+ base_url=API_BASE_URL,
68
+ model_name=MODEL_NAME,
69
+ temperature=TEMPERATURE,
70
+ max_tokens=MAX_TOKENS,
71
+ )
72
+
73
+
74
+ def build_openai_client() -> OpenAI:
75
+ """Return an OpenAI-compatible client for the Hugging Face Router."""
76
+
77
+ return OpenAI(base_url=API_BASE_URL, api_key=os.environ["HF_TOKEN"])
78
+
79
+
80
+ def single_line(value: str) -> str:
81
+ return " ".join(str(value).split())
82
+
83
+
84
+ def action_to_log(action: CyberAnalystAction) -> str:
85
+ payload = {"tool_name": action.tool_name, "args": action.args}
86
+ return single_line(json.dumps(payload, sort_keys=True, separators=(",", ":")))
87
+
88
+
89
+ def log_start(task_id: str, llm_config: LLMConfig) -> None:
90
+ print(
91
+ f"[START] task={task_id} env={ENV_NAME} model={llm_config.model_name}",
92
+ flush=True,
93
+ )
94
+
95
+
96
+ def log_step(
97
+ step: int, action: CyberAnalystAction, reward: float | None, done: bool, error: str
98
+ ) -> None:
99
+ reward_value = 0.0 if reward is None else float(reward)
100
+ error_value = single_line(error) if error else "null"
101
+ print(
102
+ f"[STEP] step={step} action={action_to_log(action)} "
103
+ f"reward={reward_value:.2f} done={str(done).lower()} error={error_value}",
104
+ flush=True,
105
+ )
106
+
107
+
108
+ def log_end(task_id: str, success: bool, steps: int, score: float, rewards: list[float]) -> None:
109
+ rewards_text = ",".join(f"{reward:.2f}" for reward in rewards)
110
+ print(
111
+ f"[END] task={task_id} success={str(success).lower()} "
112
+ f"steps={steps} score={score:.2f} rewards={rewards_text}",
113
+ flush=True,
114
+ )
115
+
116
+
117
+ def observation_payload(obs: CyberAnalystObservation) -> dict[str, Any]:
118
+ return {
119
+ "task_id": obs.task_id,
120
+ "alert": obs.alert,
121
+ "phase": obs.phase,
122
+ "tool_catalog": obs.tool_catalog,
123
+ "tool_result": obs.tool_result,
124
+ "evidence_ids": obs.evidence_ids,
125
+ "candidate_findings": obs.candidate_findings,
126
+ "verified_findings": obs.verified_findings,
127
+ "step_budget_remaining": obs.step_budget_remaining,
128
+ "score_breakdown": obs.score_breakdown,
129
+ "error": obs.error,
130
+ }
131
+
132
+
133
+ def build_user_prompt(task_id: str, step: int, obs: CyberAnalystObservation) -> str:
134
+ payload = {
135
+ "task_id": task_id,
136
+ "step": step,
137
+ "observation": observation_payload(obs),
138
+ }
139
+ return textwrap.dedent(
140
+ f"""
141
+ Choose the next benchmark tool call.
142
+ Current state JSON:
143
+ {json.dumps(payload, sort_keys=True)}
144
+ """
145
+ ).strip()
146
+
147
+
148
+ def extract_json_object(text: str) -> dict[str, Any]:
149
+ content = text.strip()
150
+ if content.startswith("```"):
151
+ lines = content.splitlines()
152
+ if lines and lines[0].startswith("```"):
153
+ lines = lines[1:]
154
+ if lines and lines[-1].startswith("```"):
155
+ lines = lines[:-1]
156
+ content = "\n".join(lines).strip()
157
+
158
+ try:
159
+ decoded = json.loads(content)
160
+ except json.JSONDecodeError as exc:
161
+ raise ModelActionError(f"model_parse_error: {exc.msg}") from exc
162
+
163
+ if not isinstance(decoded, dict):
164
+ raise ModelActionError("model_parse_error: response is not a JSON object")
165
+ return decoded
166
+
167
+
168
+ def parse_model_action(text: str) -> CyberAnalystAction:
169
+ payload = extract_json_object(text)
170
+ tool_name = payload.get("tool_name")
171
+ args = payload.get("args", {})
172
+
173
+ if not isinstance(tool_name, str) or not tool_name:
174
+ raise ModelActionError("model_parse_error: missing tool_name")
175
+ if not isinstance(args, dict):
176
+ raise ModelActionError("model_parse_error: args must be an object")
177
+
178
+ return CyberAnalystAction(tool_name=tool_name, args=args)
179
+
180
+
181
+ def get_model_action(
182
+ client: OpenAI,
183
+ llm_config: LLMConfig,
184
+ task_id: str,
185
+ step: int,
186
+ obs: CyberAnalystObservation,
187
+ ) -> CyberAnalystAction:
188
+ try:
189
+ completion = client.chat.completions.create(
190
+ model=llm_config.model_name,
191
+ messages=[
192
+ {"role": "system", "content": SYSTEM_PROMPT},
193
+ {"role": "user", "content": build_user_prompt(task_id, step, obs)},
194
+ ],
195
+ temperature=llm_config.temperature,
196
+ max_tokens=llm_config.max_tokens,
197
+ stream=False,
198
+ )
199
+ except Exception as exc:
200
+ raise ModelActionError(f"model_request_error: {exc}") from exc
201
+
202
+ text = (completion.choices[0].message.content or "").strip()
203
+ if not text:
204
+ raise ModelActionError("model_parse_error: empty response")
205
+ return parse_model_action(text)
206
+
207
+
208
+ def error_action(error: Exception) -> CyberAnalystAction:
209
+ message = single_line(str(error))
210
+ if message.startswith("model_request_error"):
211
+ tool_name = "model_request_error"
212
+ elif message.startswith("model_parse_error"):
213
+ tool_name = "model_parse_error"
214
+ else:
215
+ tool_name = "model_action_error"
216
+ return CyberAnalystAction(
217
+ tool_name=tool_name,
218
+ args={"message": message[:500]},
219
+ )
220
+
221
+
222
+ def run_task(task_id: str, client: OpenAI, llm_config: LLMConfig) -> None:
223
+ log_start(task_id, llm_config)
224
+ rewards: list[float] = []
225
+ steps_taken = 0
226
+ final_score = 0.01
227
+ success = False
228
+
229
+ try:
230
+ with CyberAnalystEnv(base_url=ENV_URL).sync() as env:
231
+ reset_result = env.reset(task_id=task_id, seed=SEED)
232
+ obs = reset_result.observation
233
+
234
+ for step in range(1, MAX_STEPS + 1):
235
+ if obs.done:
236
+ break
237
+
238
+ model_failed = False
239
+ try:
240
+ action = get_model_action(client, llm_config, task_id, step, obs)
241
+ except ModelActionError as exc:
242
+ action = error_action(exc)
243
+ model_failed = True
244
+
245
+ result = env.step(action)
246
+ obs = result.observation
247
+ reward = float(result.reward or 0.0)
248
+ rewards.append(reward)
249
+ steps_taken = step
250
+
251
+ log_step(step, action, result.reward, result.done, obs.error)
252
+
253
+ if isinstance(obs.tool_result, dict) and "score" in obs.tool_result:
254
+ final_score = float(obs.tool_result["score"])
255
+
256
+ if result.done or model_failed:
257
+ success = final_score > 0.5
258
+ break
259
+
260
+ except Exception as exc:
261
+ action = CyberAnalystAction(
262
+ tool_name="runtime_error",
263
+ args={"message": single_line(str(exc))[:500]},
264
+ )
265
+ steps_taken = max(steps_taken, 1)
266
+ rewards.append(0.01)
267
+ log_step(steps_taken, action, 0.01, True, single_line(str(exc)))
268
+
269
+ log_end(task_id, success, steps_taken, final_score, rewards)
270
+
271
+
272
+ def selected_task_ids() -> list[str]:
273
+ task_override = os.getenv("TASK_NAME")
274
+ return [task_override] if task_override else TASK_IDS
275
+
276
+
277
+ def main() -> None:
278
+ llm_config = build_llm_config()
279
+ try:
280
+ client = build_openai_client()
281
+ except KeyError:
282
+ print("HF_TOKEN must be set for inference.", file=sys.stderr, flush=True)
283
+ raise SystemExit(2) from None
284
+
285
+ for task_id in selected_task_ids():
286
+ run_task(task_id, client, llm_config)
287
+
288
+
289
+ if __name__ == "__main__":
290
+ main()
openenv_Cyber_analyst.egg-info/PKG-INFO ADDED
@@ -0,0 +1,10 @@
1
+ Metadata-Version: 2.4
2
+ Name: openenv-Cyber_analyst
3
+ Version: 0.1.0
4
+ Summary: Cyber Analyst environment for OpenEnv
5
+ Requires-Python: >=3.10
6
+ Requires-Dist: openenv-core[core]>=0.2.2
7
+ Requires-Dist: openai>=1.0.0
8
+ Provides-Extra: dev
9
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
10
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_Cyber_analyst.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,22 @@
1
+ README.md
2
+ __init__.py
3
+ client.py
4
+ inference.py
5
+ models.py
6
+ pyproject.toml
7
+ ./__init__.py
8
+ ./client.py
9
+ ./inference.py
10
+ ./models.py
11
+ openenv_Cyber_analyst.egg-info/PKG-INFO
12
+ openenv_Cyber_analyst.egg-info/SOURCES.txt
13
+ openenv_Cyber_analyst.egg-info/dependency_links.txt
14
+ openenv_Cyber_analyst.egg-info/entry_points.txt
15
+ openenv_Cyber_analyst.egg-info/requires.txt
16
+ openenv_Cyber_analyst.egg-info/top_level.txt
17
+ server/Cyber_analyst_environment.py
18
+ server/__init__.py
19
+ server/app.py
20
+ server/graders.py
21
+ server/tasks.py
22
+ tests/test_environment.py
openenv_Cyber_analyst.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
1
+
openenv_Cyber_analyst.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ server = Cyber_analyst.server.app:main
openenv_Cyber_analyst.egg-info/requires.txt ADDED
@@ -0,0 +1,6 @@
1
+ openenv-core[core]>=0.2.2
2
+ openai>=1.0.0
3
+
4
+ [dev]
5
+ pytest>=8.0.0
6
+ pytest-cov>=4.0.0
openenv_Cyber_analyst.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
1
+ Cyber_analyst
sample_inference_script.py ADDED
@@ -0,0 +1,188 @@
1
+ """
2
+ Inference Script Example
3
+ ===================================
4
+ MANDATORY
5
+ - Before submitting, ensure the following variables are defined in your environment configuration:
6
+ API_BASE_URL The API endpoint for the LLM.
7
+ MODEL_NAME The model identifier to use for inference.
8
+ HF_TOKEN Your Hugging Face / API key.
9
+ LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
10
+ method
11
+
12
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
13
+ (and should reflect your active inference setup):
14
+ API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
15
+ MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
16
+
17
+ - The inference script must be named `inference.py` and placed in the root directory of the project
18
+ - Participants must use OpenAI Client for all LLM calls using above variables
19
+
20
+ STDOUT FORMAT
21
+ - The script must emit exactly three line types to stdout, in this order:
22
+
23
+ [START] task=<task_name> env=<benchmark> model=<model_name>
24
+ [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
25
+ [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
26
+
27
+ Rules:
28
+ - One [START] line at episode begin.
29
+ - One [STEP] line per step, immediately after env.step() returns.
30
+ - One [END] line after env.close(), always emitted (even on exception).
31
+ - reward and rewards are formatted to 2 decimal places.
32
+ - done and success are lowercase booleans: true or false.
33
+ - error is the raw last_action_error string, or null if none.
34
+ - All fields on a single line with no newlines within a line.
35
+ - Each task should return a score in [0, 1]
36
+
37
+ Example:
38
+ [START] task=click-test env=miniwob model=Qwen3-VL-30B
39
+ [STEP] step=1 action=click('123') reward=0.00 done=false error=null
40
+ [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
41
+ [STEP] step=3 action=click('789') reward=1.00 done=true error=null
42
+ [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
43
+ """
44
+
45
+ import asyncio
46
+ import os
47
+ import textwrap
48
+ from typing import List, Optional
49
+
50
+ from openai import OpenAI
51
+
52
+ from my_env_v4 import MyEnvV4Action, MyEnvV4Env
53
+ IMAGE_NAME = os.getenv("IMAGE_NAME") # If you are using docker image
54
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
55
+
56
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
57
+ MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
58
+ TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
59
+ BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
60
+ MAX_STEPS = 8
61
+ TEMPERATURE = 0.7
62
+ MAX_TOKENS = 150
63
+ SUCCESS_SCORE_THRESHOLD = 0.1 # normalized score in [0, 1]
64
+
65
+ # Upper bound on reward: the env pays len(message) * 0.1 per step, so the
+ # MAX_TOKENS cap is used as a rough per-step character budget
66
+ _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
67
+ MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP
68
+
69
+ SYSTEM_PROMPT = textwrap.dedent(
70
+ """
71
+ You are interacting with a simple echo environment.
72
+ Each turn you must send a message. The environment will echo it back.
73
+ Reward is proportional to message length: reward = len(message) * 0.1
74
+ Your goal is to maximize total reward by sending meaningful, substantive messages.
75
+ Reply with exactly one message string — no quotes, no prefixes, just the message text.
76
+ """
77
+ ).strip()
78
+
79
+
80
+ def log_start(task: str, env: str, model: str) -> None:
81
+ print(f"[START] task={task} env={env} model={model}", flush=True)
82
+
83
+
84
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
85
+ error_val = error if error else "null"
86
+ done_val = str(done).lower()
87
+ print(
88
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
89
+ flush=True,
90
+ )
91
+
92
+
93
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
94
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
95
+ print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
96
+
97
+
98
+ def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
99
+ history_block = "\n".join(history[-4:]) if history else "None"
100
+ return textwrap.dedent(
101
+ f"""
102
+ Step: {step}
103
+ Last echoed message: {last_echoed!r}
104
+ Last reward: {last_reward:.2f}
105
+ Previous steps:
106
+ {history_block}
107
+ Send your next message.
108
+ """
109
+ ).strip()
110
+
111
+
112
+ def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
+     user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
+     try:
+         completion = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": user_prompt},
+             ],
+             temperature=TEMPERATURE,
+             max_tokens=MAX_TOKENS,
+             stream=False,
+         )
+         text = (completion.choices[0].message.content or "").strip()
+         return text if text else "hello"
+     except Exception as exc:
+         print(f"[DEBUG] Model request failed: {exc}", flush=True)
+         return "hello"
+
+
+ async def main() -> None:
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
+
+     history: List[str] = []
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+
+     log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
+
+     try:
+         result = await env.reset()  # OpenEnv reset()
+         last_echoed = result.observation.echoed_message
+         last_reward = 0.0
+
+         for step in range(1, MAX_STEPS + 1):
+             if result.done:
+                 break
+
+             message = get_model_message(client, step, last_echoed, last_reward, history)
+
+             result = await env.step(MyEnvV4Action(message=message))
+             obs = result.observation
+
+             reward = result.reward or 0.0
+             done = result.done
+             error = None
+
+             rewards.append(reward)
+             steps_taken = step
+             last_echoed = obs.echoed_message
+             last_reward = reward
+
+             log_step(step=step, action=message, reward=reward, done=done, error=error)
+
+             history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
+
+             if done:
+                 break
+
+         score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
+         score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
+         success = score >= SUCCESS_SCORE_THRESHOLD
+
+     finally:
+         try:
+             await env.close()
+         except Exception as e:
+             print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
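The final scoring step of `main()` can be checked in isolation. A minimal sketch (the `normalize_score` helper is hypothetical, not part of the diff) of the same sum-normalize-clamp logic, assuming a positive `MAX_TOTAL_REWARD` budget:

```python
from typing import Iterable

def normalize_score(rewards: Iterable[float], max_total_reward: float) -> float:
    """Sum per-step rewards, normalize by the reward budget, clamp to [0, 1]."""
    if max_total_reward <= 0:
        # Mirrors the guard in main(): no budget means no meaningful score.
        return 0.0
    score = sum(rewards) / max_total_reward
    return min(max(score, 0.0), 1.0)
```

The clamp matters because per-step rewards can overshoot the nominal budget (or go negative), and the success threshold comparison assumes a score in [0, 1].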
scripts/validate-submission.sh ADDED
@@ -0,0 +1,188 @@
+ #!/usr/bin/env bash
+ #
+ # validate-submission.sh — OpenEnv Submission Validator
+ #
+ # Checks that your HF Space is live, Docker image builds, and openenv validate passes.
+ #
+ # Prerequisites:
+ #   - Docker: https://docs.docker.com/get-docker/
+ #   - openenv-core: pip install openenv-core
+ #   - curl (usually pre-installed)
+ #
+ # Run:
+ #   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+ #
+ # Or download and run locally:
+ #   chmod +x validate-submission.sh
+ #   ./validate-submission.sh <ping_url> [repo_dir]
+ #
+ # Arguments:
+ #   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
+ #   repo_dir   Path to your repo (default: current directory)
+ #
+ # Examples:
+ #   ./validate-submission.sh https://my-team.hf.space
+ #   ./validate-submission.sh https://my-team.hf.space ./my-repo
+ #
+
+ set -uo pipefail
+
+ DOCKER_BUILD_TIMEOUT=600
+ if [ -t 1 ]; then
+     RED='\033[0;31m'
+     GREEN='\033[0;32m'
+     YELLOW='\033[1;33m'
+     BOLD='\033[1m'
+     NC='\033[0m'
+ else
+     RED='' GREEN='' YELLOW='' BOLD='' NC=''
+ fi
+
+ run_with_timeout() {
+     local secs="$1"; shift
+     if command -v timeout &>/dev/null; then
+         timeout "$secs" "$@"
+     elif command -v gtimeout &>/dev/null; then
+         gtimeout "$secs" "$@"
+     else
+         "$@" &
+         local pid=$!
+         ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+         local watcher=$!
+         wait "$pid" 2>/dev/null
+         local rc=$?
+         kill "$watcher" 2>/dev/null
+         wait "$watcher" 2>/dev/null
+         return $rc
+     fi
+ }
+
+ portable_mktemp() {
+     local prefix="${1:-validate}"
+     mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+ }
+
+ CLEANUP_FILES=()
+ cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+ trap cleanup EXIT
+
+ PING_URL="${1:-}"
+ REPO_DIR="${2:-.}"
+
+ if [ -z "$PING_URL" ]; then
+     printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+     printf "\n"
+     printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
+     printf "  repo_dir   Path to your repo (default: current directory)\n"
+     exit 1
+ fi
+
+ if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+     printf "Error: directory '%s' not found\n" "${2:-.}"
+     exit 1
+ fi
+ PING_URL="${PING_URL%/}"
+ export PING_URL
+ PASS=0
+
+ log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+ pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+ fail() { log "${RED}FAILED${NC} -- $1"; }
+ hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
+ stop_at() {
+     printf "\n"
+     printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+     exit 1
+ }
+
+ printf "\n"
+ printf "${BOLD}========================================${NC}\n"
+ printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+ printf "${BOLD}========================================${NC}\n"
+ log "Repo:     $REPO_DIR"
+ log "Ping URL: $PING_URL"
+ printf "\n"
+
+ log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+
+ CURL_OUTPUT=$(portable_mktemp "validate-curl")
+ CLEANUP_FILES+=("$CURL_OUTPUT")
+ HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+     -H "Content-Type: application/json" -d '{}' \
+     "$PING_URL/reset" --max-time 30 2>/dev/null) || HTTP_CODE="000"
+
+ if [ "$HTTP_CODE" = "200" ]; then
+     pass "HF Space is live and responds to /reset"
+ elif [ "$HTTP_CODE" = "000" ]; then
+     fail "HF Space not reachable (connection failed or timed out)"
+     hint "Check your network connection and that the Space is running."
+     hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset"
+     stop_at "Step 1"
+ else
+     fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+     hint "Make sure your Space is running and the URL is correct."
+     hint "Try opening $PING_URL in your browser first."
+     stop_at "Step 1"
+ fi
+
+ log "${BOLD}Step 2/3: Running docker build${NC} ..."
+
+ if ! command -v docker &>/dev/null; then
+     fail "docker command not found"
+     hint "Install Docker: https://docs.docker.com/get-docker/"
+     stop_at "Step 2"
+ fi
+
+ if [ -f "$REPO_DIR/Dockerfile" ]; then
+     DOCKER_CONTEXT="$REPO_DIR"
+     DOCKERFILE="$REPO_DIR/Dockerfile"
+ elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+     DOCKER_CONTEXT="$REPO_DIR"
+     DOCKERFILE="$REPO_DIR/server/Dockerfile"
+ else
+     fail "No Dockerfile found in repo root or server/ directory"
+     stop_at "Step 2"
+ fi
+
+ log "  Found Dockerfile at $DOCKERFILE"
+ log "  Docker context: $DOCKER_CONTEXT"
+
+ BUILD_OK=false
+ BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build -f "$DOCKERFILE" "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+
+ if [ "$BUILD_OK" = true ]; then
+     pass "Docker build succeeded"
+ else
+     fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+     printf "%s\n" "$BUILD_OUTPUT" | tail -20
+     stop_at "Step 2"
+ fi
+
+ log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+
+ if ! command -v openenv &>/dev/null; then
+     fail "openenv command not found"
+     hint "Install it: pip install openenv-core"
+     stop_at "Step 3"
+ fi
+
+ VALIDATE_OK=false
+ VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+
+ if [ "$VALIDATE_OK" = true ]; then
+     pass "openenv validate passed"
+     [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
+ else
+     fail "openenv validate failed"
+     printf "%s\n" "$VALIDATE_OUTPUT"
+     stop_at "Step 3"
+ fi
+
+ printf "\n"
+ printf "${BOLD}========================================${NC}\n"
+ printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
+ printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
+ printf "${BOLD}========================================${NC}\n"
+ printf "\n"
+
+ exit 0
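The `run_with_timeout` fallback (for machines with neither GNU `timeout` nor `gtimeout`) is a background-watcher pattern that stands on its own. A minimal sketch, with the helper renamed `with_watchdog` to make clear it is illustrative rather than the script's exact function:

```shell
#!/usr/bin/env bash
# Run "$@" but kill it if it exceeds "$1" seconds, without GNU timeout.
with_watchdog() {
  local secs="$1"; shift
  "$@" &                                           # start the real command
  local pid=$!
  ( sleep "$secs" && kill "$pid" 2>/dev/null ) &   # watcher kills it on timeout
  local watcher=$!
  wait "$pid" 2>/dev/null                          # propagate the command's status
  local rc=$?
  kill "$watcher" 2>/dev/null                      # cancel the watcher if done early
  wait "$watcher" 2>/dev/null
  return $rc
}

with_watchdog 5 true && echo "fast command passed"
with_watchdog 1 sleep 10 || echo "slow command was killed"
```

A command killed by the watcher exits with a signal status (nonzero), which is why the validator can treat a timed-out `docker build` the same as a failed one.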
server/Cyber_analyst_environment.py ADDED
@@ -0,0 +1,506 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """SecOps Evidence Gym environment implementation."""
+
+ from __future__ import annotations
+
+ import hashlib
+ import json
+ from collections import Counter
+ from typing import Any
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+
+ try:
+     from ..models import (
+         CyberAnalystAction,
+         CyberAnalystObservation,
+         CyberAnalystState,
+     )
+     from .graders import safe_reward, score_report
+     from .tasks import DEFAULT_TASK_ID, TOOL_CATALOG, build_scenario
+ except ImportError:  # pragma: no cover - supports direct module execution
+     from models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState
+     from server.graders import safe_reward, score_report
+     from server.tasks import DEFAULT_TASK_ID, TOOL_CATALOG, build_scenario
+
+
+ class CyberAnalystEnvironment(
+     Environment[CyberAnalystAction, CyberAnalystObservation, CyberAnalystState]
+ ):
+     """A safe, deterministic evidence-grounded cyber analyst benchmark."""
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+     MAX_STEPS = 12
+     REPEAT_HARD_STOP = 6
+
+     def __init__(self):
+         super().__init__()
+         self._scenario: dict[str, Any] = {}
+         self._state = CyberAnalystState()
+         self._discovered_evidence: set[str] = set()
+         self._candidate_findings: dict[str, dict[str, Any]] = {}
+         self._verified_findings: list[dict[str, Any]] = []
+         self._validated_finding_ids: set[str] = set()
+         self._action_counts: Counter[str] = Counter()
+         self._last_score_breakdown: dict[str, Any] = {}
+         self._trajectory_events: list[dict[str, Any]] = []
+         self._initialize_episode(DEFAULT_TASK_ID, seed=None, episode_id=None)
+
+     def reset(
+         self,
+         seed: int | None = None,
+         episode_id: str | None = None,
+         task_id: str = DEFAULT_TASK_ID,
+         **_: Any,
+     ) -> CyberAnalystObservation:
+         """Reset the selected deterministic task."""
+
+         self._initialize_episode(task_id=task_id, seed=seed, episode_id=episode_id)
+         tool_result = {
+             "message": "Cyber Analyst environment ready.",
+             "allowed_scope": "Synthetic artifacts only. No live targets or shell.",
+         }
+         obs = self._observation(
+             tool_result={
+                 **tool_result,
+                 "trajectory_jsonl": self.export_trajectory_jsonl(),
+             },
+             reward=0.01,
+         )
+         self._record_trajectory("reset", None, tool_result, obs.reward, obs.done, obs.error)
+         return obs
+
+     def step(  # type: ignore[override]
+         self,
+         action: CyberAnalystAction,
+         timeout_s: float | None = None,
+         **_: Any,
+     ) -> CyberAnalystObservation:
+         """Execute one bounded simulator tool call."""
+
+         del timeout_s
+
+         if self._state.done:
+             tool_result = {"message": "Episode is already complete."}
+             obs = self._observation(
+                 tool_result=tool_result,
+                 reward=0.01,
+                 done=True,
+                 error="episode_already_done",
+             )
+             self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
+             return obs
+
+         self._state.step_count += 1
+         self._state.step_budget_remaining = max(
+             0, self.MAX_STEPS - self._state.step_count
+         )
+
+         signature = self._action_signature(action)
+         self._action_counts[signature] += 1
+         repeat_count = self._action_counts[signature]
+
+         if repeat_count >= self.REPEAT_HARD_STOP:
+             self._state.phase = "done"
+             self._state.done = True
+             self._last_score_breakdown = {
+                 "score": 0.03,
+                 "repeat_hard_stop": True,
+                 "signature": signature,
+             }
+             tool_result = {"message": "Episode stopped after repeated identical actions."}
+             obs = self._observation(
+                 tool_result=tool_result,
+                 reward=0.03,
+                 done=True,
+                 error="repeat_hard_stop",
+             )
+             self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
+             return obs
+
+         handler = getattr(self, f"_tool_{action.tool_name}", None)
+         if handler is None:
+             tool_result = {
+                 "ok": False,
+                 "message": f"Unsupported tool: {action.tool_name}",
+                 "available_tools": [tool["name"] for tool in TOOL_CATALOG],
+             }
+             obs = self._step_observation(
+                 tool_result=tool_result,
+                 repeat_count=repeat_count,
+                 error="unsupported_tool",
+             )
+             self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
+             return obs
+
+         try:
+             result, reward_delta, done = handler(action.args)
+             error = ""
+         except Exception as exc:  # pragma: no cover - defensive rollout guard
+             result = {"ok": False, "message": str(exc)}
+             reward_delta = -0.05
+             done = False
+             error = exc.__class__.__name__
+
+         if self._state.step_budget_remaining <= 0 and not done:
+             done = True
+             self._state.phase = "done"
+             self._state.done = True
+             result = {
+                 **result,
+                 "timeout": True,
+                 "message": "Step budget exhausted before report submission.",
+             }
+             reward_delta -= 0.10
+
+         obs = self._step_observation(
+             tool_result=result,
+             repeat_count=repeat_count,
+             reward_delta=reward_delta,
+             done=done,
+             error=error,
+         )
+         self._record_trajectory("step", action, result, obs.reward, obs.done, obs.error)
+         return obs
+
+     @property
+     def state(self) -> CyberAnalystState:
+         """Return the current episode state summary."""
+
+         return self._state
+
+     def _initialize_episode(
+         self, task_id: str, seed: int | None, episode_id: str | None
+     ) -> None:
+         self._scenario = build_scenario(task_id, seed)
+         self._discovered_evidence = set()
+         self._candidate_findings = {}
+         self._verified_findings = []
+         self._validated_finding_ids = set()
+         self._action_counts = Counter()
+         self._last_score_breakdown = {}
+         self._trajectory_events = []
+         self._state = CyberAnalystState(
+             episode_id=episode_id or str(uuid4()),
+             step_count=0,
+             task_id=self._scenario["task_id"],
+             seed=seed,
+             phase="investigate",
+             step_budget_remaining=self.MAX_STEPS,
+             recent_evidence_ids=[],
+             verified_finding_ids=[],
+             done=False,
+         )
+
+     def export_trajectory_jsonl(self) -> str:
+         """Return the current episode trajectory as JSONL for offline analysis."""
+
+         return "\n".join(
+             json.dumps(event, sort_keys=True, default=str)
+             for event in self._trajectory_events
+         )
+
+     def _record_trajectory(
+         self,
+         event_type: str,
+         action: CyberAnalystAction | None,
+         tool_result: dict[str, Any],
+         reward: float | int | None,
+         done: bool,
+         error: str,
+     ) -> None:
+         action_payload = None
+         if action is not None:
+             action_payload = action.model_dump(exclude_none=True)
+         self._trajectory_events.append(
+             {
+                 "episode_id": self._state.episode_id,
+                 "task_id": self._state.task_id,
+                 "seed": self._state.seed,
+                 "event_type": event_type,
+                 "step": self._state.step_count,
+                 "phase": self._state.phase,
+                 "action": action_payload,
+                 "tool_result": tool_result,
+                 "evidence_ids": sorted(self._discovered_evidence),
+                 "verified_finding_ids": list(self._state.verified_finding_ids),
+                 "reward": reward,
+                 "done": done,
+                 "error": error,
+             }
+         )
+
+     def _observation(
+         self,
+         tool_result: dict[str, Any] | None = None,
+         reward: float = 0.01,
+         done: bool | None = None,
+         error: str = "",
+     ) -> CyberAnalystObservation:
+         done_value = self._state.done if done is None else done
+         return CyberAnalystObservation(
+             task_id=self._scenario.get("task_id", ""),
+             alert=self._scenario.get("alert", ""),
+             phase=self._state.phase,
+             tool_catalog=TOOL_CATALOG,
+             tool_result=tool_result or {},
+             evidence_ids=sorted(self._discovered_evidence),
+             verified_findings=list(self._verified_findings),
+             candidate_findings=list(self._candidate_findings.values()),
+             step_budget_remaining=self._state.step_budget_remaining,
+             score_breakdown=dict(self._last_score_breakdown),
+             error=error,
+             done=done_value,
+             reward=safe_reward(reward),
+         )
+
+     def _step_observation(
+         self,
+         tool_result: dict[str, Any],
+         repeat_count: int,
+         reward_delta: float = 0.0,
+         done: bool = False,
+         error: str = "",
+     ) -> CyberAnalystObservation:
+         reward = 0.04 + reward_delta - 0.01
+         if repeat_count > 2:
+             reward -= 0.03 * (repeat_count - 2)
+
+         if done:
+             self._state.phase = "done"
+             self._state.done = True
+
+         self._state.recent_evidence_ids = sorted(self._discovered_evidence)[-5:]
+         self._state.verified_finding_ids = [
+             finding["finding_id"] for finding in self._verified_findings
+         ]
+
+         return self._observation(
+             tool_result=tool_result,
+             reward=safe_reward(reward),
+             done=self._state.done,
+             error=error,
+         )
+
+     def _action_signature(self, action: CyberAnalystAction) -> str:
+         payload = {
+             "tool_name": action.tool_name,
+             "args": action.args,
+         }
+         encoded = json.dumps(payload, sort_keys=True, default=str)
+         return hashlib.sha256(encoded.encode("utf-8")).hexdigest()[:16]
+
+     def _record_evidence(self, evidence_ids: list[str]) -> int:
+         relevant = set(self._scenario.get("required_evidence", [])) | set(
+             self._scenario.get("supporting_evidence", [])
+         )
+         new_relevant = 0
+         for evidence_id in evidence_ids:
+             if evidence_id not in self._discovered_evidence and evidence_id in relevant:
+                 new_relevant += 1
+             self._discovered_evidence.add(evidence_id)
+         return new_relevant
+
+     def _filter_entries(
+         self, entries: list[dict[str, Any]], service_id: str = "", query: str = ""
+     ) -> list[dict[str, Any]]:
+         normalized_service = self._resolve_service_id(service_id).lower()
+         normalized_query = query.strip().lower()
+         matches: list[dict[str, Any]] = []
+         for entry in entries:
+             service_matches = (
+                 not normalized_service
+                 or str(entry.get("service_id", "")).lower() == normalized_service
+             )
+             search_blob = " ".join(
+                 [
+                     str(entry.get("text", "")),
+                     str(entry.get("source", "")),
+                     " ".join(str(tag) for tag in entry.get("tags", [])),
+                 ]
+             ).lower()
+             query_matches = not normalized_query or normalized_query in search_blob
+             if service_matches and query_matches:
+                 matches.append(entry)
+         return matches
+
+     def _resolve_service_id(self, service_id: str) -> str:
+         normalized = service_id.strip()
+         aliases = self._scenario.get("service_aliases", {})
+         return str(aliases.get(normalized, normalized))
+
+     def _evidence_payload(self, entries: list[dict[str, Any]]) -> dict[str, Any]:
+         evidence_ids = [entry["evidence_id"] for entry in entries]
+         new_relevant = self._record_evidence(evidence_ids)
+         return {
+             "ok": True,
+             "evidence_ids": evidence_ids,
+             "new_relevant_evidence": new_relevant,
+             "entries": [
+                 {
+                     "evidence_id": entry["evidence_id"],
+                     "service_id": entry.get("service_id", ""),
+                     "source": entry.get("source", ""),
+                     "text": entry.get("text", ""),
+                 }
+                 for entry in entries
+             ],
+         }
+
+     def _tool_list_assets(self, args: dict[str, Any]) -> tuple[dict[str, Any], float, bool]:
+         del args
+         return {"ok": True, "assets": self._scenario["assets"]}, 0.0, False
+
+     def _tool_get_log_events(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         entries = self._filter_entries(
+             self._scenario.get("logs", []),
+             service_id=str(args.get("service_id", "")),
+             query=str(args.get("query", "")),
+         )
+         payload = self._evidence_payload(entries)
+         return payload, 0.02 * payload["new_relevant_evidence"], False
+
+     def _tool_check_security_headers(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         requested_service = self._resolve_service_id(str(args.get("service_id", ""))).lower()
+         snapshots = self._scenario.get("headers", {})
+         results = []
+         evidence_ids = []
+         for service_id, snapshot in snapshots.items():
+             if requested_service and service_id.lower() != requested_service:
+                 continue
+             evidence_ids.append(snapshot["evidence_id"])
+             results.append(
+                 {
+                     "service_id": service_id,
+                     "evidence_id": snapshot["evidence_id"],
+                     "present": snapshot.get("present", []),
+                     "missing": snapshot.get("missing", []),
+                     "passed": not snapshot.get("missing"),
+                 }
+             )
+         new_relevant = self._record_evidence(evidence_ids)
+         return (
+             {
+                 "ok": True,
+                 "evidence_ids": evidence_ids,
+                 "new_relevant_evidence": new_relevant,
+                 "header_results": results,
+             },
+             0.02 * new_relevant,
+             False,
+         )
+
+     def _tool_search_repo(self, args: dict[str, Any]) -> tuple[dict[str, Any], float, bool]:
+         entries = self._filter_entries(
+             self._scenario.get("repo", []), query=str(args.get("query", ""))
+         )
+         payload = self._evidence_payload(entries)
+         return payload, 0.02 * payload["new_relevant_evidence"], False
+
+     def _tool_scan_dependencies(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         del args
+         payload = self._evidence_payload(self._scenario.get("dependencies", []))
+         return payload, 0.02 * payload["new_relevant_evidence"], False
+
+     def _tool_create_finding(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         evidence_ids = args.get("evidence_ids", [])
+         if isinstance(evidence_ids, str):
+             evidence_ids = [evidence_ids]
+         evidence_ids = [str(evidence_id) for evidence_id in evidence_ids]
+
+         finding_id = f"FND-{len(self._candidate_findings) + 1:03d}"
+         finding = {
+             "finding_id": finding_id,
+             "finding_type": str(args.get("finding_type", "")),
+             "evidence_ids": evidence_ids,
+             "severity_guess": str(args.get("severity_guess", "")),
+             "remediation": str(args.get("remediation", "")),
+             "validated": False,
+             "matching_gt_id": None,
+         }
+         self._candidate_findings[finding_id] = finding
+
+         well_formed = bool(
+             finding["finding_type"] and evidence_ids and finding["remediation"]
+         )
+         return (
+             {"ok": True, "finding_id": finding_id, "finding": finding},
+             0.03 if well_formed else 0.0,
+             False,
+         )
+
+     def _tool_validate_finding(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         finding_id = str(args.get("finding_id", ""))
+         finding = self._candidate_findings.get(finding_id)
+         if finding is None:
+             return (
+                 {"ok": False, "message": f"Unknown finding_id: {finding_id}"},
+                 -0.03,
+                 False,
+             )
+
+         expected_type = self._scenario["finding_type"]
+         required_evidence = set(self._scenario.get("required_evidence", []))
+         supplied_evidence = set(finding.get("evidence_ids", []))
+         verified = (
+             finding.get("finding_type") == expected_type
+             and bool(required_evidence & supplied_evidence)
+         )
+         self._validated_finding_ids.add(finding_id)
+         finding["validated"] = verified
+         finding["matching_gt_id"] = self._scenario["ground_truth_id"] if verified else None
+
+         if verified and not any(
+             item["finding_id"] == finding_id for item in self._verified_findings
+         ):
+             self._verified_findings.append(dict(finding))
+
+         return (
+             {
+                 "ok": True,
+                 "finding_id": finding_id,
+                 "verified": verified,
+                 "matching_gt_id": finding["matching_gt_id"],
+             },
+             0.08 if verified else -0.02,
+             False,
+         )
+
+     def _tool_submit_report(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         report = args.get("report_json", {})
+         score, breakdown = score_report(
+             self._scenario["task_id"],
+             report,
+             verified_findings=self._verified_findings,
+             validation_attempted=bool(self._validated_finding_ids),
+         )
+         self._last_score_breakdown = breakdown
+         return (
+             {
+                 "ok": True,
+                 "submitted": True,
+                 "score": score,
+                 "score_breakdown": breakdown,
+                 "trajectory_jsonl": self.export_trajectory_jsonl(),
+             },
+             score,
+             True,
+         )
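The repeat-action hard stop relies on `_action_signature` producing a stable fingerprint of a tool call, so that semantically identical calls always count as repeats. The same canonical-JSON-plus-SHA-256 idea, sketched as a standalone function (the free-standing `action_signature` name is illustrative, mirroring the method above):

```python
import hashlib
import json

def action_signature(tool_name: str, args: dict) -> str:
    """Stable 16-hex-char fingerprint of a tool call; arg key order is irrelevant."""
    payload = {"tool_name": tool_name, "args": args}
    # sort_keys canonicalizes dict ordering; default=str tolerates odd arg values.
    encoded = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()[:16]
```

Because `sort_keys=True` canonicalizes the JSON, `{"query": "jwt", "service_id": ""}` and `{"service_id": "", "query": "jwt"}` hash identically, which is what lets the environment count them against `REPEAT_HARD_STOP`.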
server/graders.py ADDED
@@ -0,0 +1,169 @@
+ """Deterministic graders for the Cyber Analyst OpenEnv tasks."""
+
+ from __future__ import annotations
+
+ import json
+ from typing import Any
+
+ try:
+     from .tasks import SCENARIOS
+ except ImportError:  # pragma: no cover - supports direct module execution
+     from tasks import SCENARIOS
+
+
+ MIN_SCORE = 0.01
+ MAX_SCORE = 0.99
+
+
+ def safe_reward(score: float | int | None) -> float:
+     """Clamp validator-facing scores to the strict open interval (0, 1)."""
+
+     try:
+         value = float(score if score is not None else 0.0)
+     except (TypeError, ValueError):
+         value = 0.0
+     return max(MIN_SCORE, min(MAX_SCORE, value))
+
+
+ def _coerce_report(report: Any) -> dict[str, Any]:
+     if isinstance(report, dict):
+         return report
+     if isinstance(report, str):
+         try:
+             decoded = json.loads(report)
+         except json.JSONDecodeError:
+             return {"summary": report, "findings": []}
+         return decoded if isinstance(decoded, dict) else {"findings": []}
+     return {"findings": []}
+
+
+ def _text_contains_any(text: str, keywords: list[str]) -> bool:
+     lowered = text.lower()
+     return any(keyword.lower() in lowered for keyword in keywords)
+
+
+ def _report_findings(report: dict[str, Any]) -> list[dict[str, Any]]:
+     findings = report.get("findings", [])
+     if isinstance(findings, dict):
+         findings = [findings]
+     return [finding for finding in findings if isinstance(finding, dict)]
+
+
+ def score_report(
+     task_id: str,
+     report: Any,
+     verified_findings: list[dict[str, Any]] | None = None,
+     validation_attempted: bool = False,
+ ) -> tuple[float, dict[str, Any]]:
+     """Score a submitted report against one task's deterministic ground truth."""
+
+     scenario = SCENARIOS.get(task_id)
+     report_dict = _coerce_report(report)
+     report_findings = _report_findings(report_dict)
+     verified_findings = verified_findings or []
+
+     if scenario is None:
+         return MIN_SCORE, {"unknown_task": task_id}
+
+     expected_type = scenario["finding_type"]
+     expected_evidence = set(scenario.get("required_evidence", [])) | set(
+         scenario.get("supporting_evidence", [])
+     )
+
+     matching_verified = [
+         finding
+         for finding in verified_findings
+         if finding.get("finding_type") == expected_type
+     ]
+     matching_report = [
+         finding for finding in report_findings if finding.get("finding_type") == expected_type
+     ]
+
+     score = 0.05
+     breakdown: dict[str, Any] = {
+         "base": 0.05,
+         "verified_correct": 0.0,
+         "valid_evidence": 0.0,
+         "actionable_remediation": 0.0,
+         "hallucination_penalty": 0.0,
+         "validation_penalty": 0.0,
+     }
+
+     if matching_verified and matching_report:
+         impact_text = " ".join(
+             str(finding.get("impact", "")) + " " + str(finding.get("description", ""))
+             for finding in matching_report
+         )
+         if _text_contains_any(impact_text, scenario.get("impact_keywords", [])):
+             score += 0.60
+             breakdown["verified_correct"] = 0.60
+
+     report_evidence: set[str] = set()
+     for finding in matching_report:
+         evidence_ids = finding.get("evidence_ids", [])
+         if isinstance(evidence_ids, str):
+             evidence_ids = [evidence_ids]
+         report_evidence.update(str(evidence_id) for evidence_id in evidence_ids)
+
+     if report_evidence & expected_evidence:
+         score += 0.15
+         breakdown["valid_evidence"] = 0.15
+
+     remediation_text = " ".join(
+         str(finding.get("remediation", "")) for finding in matching_report
+     )
+     if _text_contains_any(remediation_text, scenario.get("remediation_keywords", [])):
+         score += 0.15
+         breakdown["actionable_remediation"] = 0.15
+
+     verified_types = {finding.get("finding_type") for finding in verified_findings}
+     hallucinated = [
+         finding
+         for finding in report_findings
+         if finding.get("finding_type") not in verified_types
+     ]
+     if hallucinated:
+         penalty = 0.40 * len(hallucinated)
+         score -= penalty
+         breakdown["hallucination_penalty"] = -penalty
+
+     if not validation_attempted:
+         score -= 0.20
+         breakdown["validation_penalty"] = -0.20
+
+     final_score = safe_reward(score)
+     breakdown["raw_score"] = round(score, 4)
+     breakdown["score"] = final_score
+     return final_score, breakdown
+
+
+ def _payload_from_args(*args: Any, **kwargs: Any) -> dict[str, Any]:
+     if args and isinstance(args[0], dict):
+         payload = dict(args[0])
+     else:
+         payload = {}
+     payload.update(kwargs)
+     return payload
+
+
+ def grade_task(task_id: str, *args: Any, **kwargs: Any) -> float:
+     """Manifest-friendly grader adapter."""
+
+     payload = _payload_from_args(*args, **kwargs)
+     report = payload.get("report") or payload.get("report_json") or payload
+     verified_findings = payload.get("verified_findings", [])
+     validation_attempted = bool(payload.get("validation_attempted", False))
+     score, _ = score_report(task_id, report, verified_findings, validation_attempted)
+     return score
+
+
+ def grade_secret_exposure_easy(*args: Any, **kwargs: Any) -> float:
+     return grade_task("secret_exposure_easy", *args, **kwargs)
+
+
+ def grade_missing_security_headers_medium(*args: Any, **kwargs: Any) -> float:
+     return grade_task("missing_security_headers_medium", *args, **kwargs)
+
+
+ def grade_authz_boundary_hard(*args: Any, **kwargs: Any) -> float:
+     return grade_task("authz_boundary_hard", *args, **kwargs)
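The grader's additive budget is easy to sanity-check in isolation. A hypothetical `compose_score` helper (not part of the module, but using the same constants as the `score_report` breakdown: base 0.05, verified 0.60, evidence 0.15, remediation 0.15, 0.40 per hallucinated finding, 0.20 for skipping validation, clamped like `safe_reward`):

```python
MIN_SCORE, MAX_SCORE = 0.01, 0.99

def compose_score(verified: bool = False, evidence: bool = False,
                  remediation: bool = False, hallucinated: int = 0,
                  validated: bool = True) -> float:
    """Recombine the grader's additive budget, then clamp to (0, 1)."""
    score = 0.05                      # base credit for submitting a report
    if verified:
        score += 0.60                 # verified finding whose impact text matches keywords
    if evidence:
        score += 0.15                 # cites at least one expected evidence id
    if remediation:
        score += 0.15                 # remediation mentions an expected keyword
    score -= 0.40 * hallucinated      # per finding of an unverified type
    if not validated:
        score -= 0.20                 # validate_finding was never called
    return max(MIN_SCORE, min(MAX_SCORE, score))
```

Note the ceiling: a perfect run tops out at 0.95, and any score below the floor is pulled up to 0.01, matching the open-interval contract of `safe_reward`.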
server/tasks.py ADDED
@@ -0,0 +1,283 @@
+"""Deterministic scenario registry for SecOps Evidence Gym."""
+
+from __future__ import annotations
+
+from copy import deepcopy
+from random import Random
+from typing import Any
+
+
+DEFAULT_TASK_ID = "secret_exposure_easy"
+
+TOOL_CATALOG: list[dict[str, Any]] = [
+    {
+        "name": "list_assets",
+        "description": "List synthetic services, routes, and artifact collections.",
+        "args": {},
+    },
+    {
+        "name": "get_log_events",
+        "description": "Return sanitized telemetry evidence ids for a service/query.",
+        "args": {"service_id": "str", "query": "str"},
+    },
+    {
+        "name": "check_security_headers",
+        "description": "Inspect a service header snapshot and return pass/fail evidence.",
+        "args": {"service_id": "str"},
+    },
+    {
+        "name": "search_repo",
+        "description": "Search synthetic repo/config snippets for evidence ids.",
+        "args": {"query": "str"},
+    },
+    {
+        "name": "scan_dependencies",
+        "description": "Inspect a synthetic dependency manifest excerpt.",
+        "args": {},
+    },
+    {
+        "name": "create_finding",
+        "description": "Store a candidate finding for verifier review.",
+        "args": {
+            "finding_type": "str",
+            "evidence_ids": "list[str]",
+            "severity_guess": "str",
+            "remediation": "str",
+        },
+    },
+    {
+        "name": "validate_finding",
+        "description": "Run the deterministic verifier for a candidate finding.",
+        "args": {"finding_id": "str"},
+    },
+    {
+        "name": "submit_report",
+        "description": "Submit the final structured report and end the episode.",
+        "args": {"report_json": "dict"},
+    },
+]
+
+
+SCENARIOS: dict[str, dict[str, Any]] = {
+    "secret_exposure_easy": {
+        "task_id": "secret_exposure_easy",
+        "difficulty": "easy",
+        "title": "Secret exposure in repo snapshot",
+        "alert": (
+            "Repository hygiene alert: a synthetic config snapshot may contain "
+            "credential-like material. Investigate, cite evidence, validate, and "
+            "submit a remediation report."
+        ),
+        "ground_truth_id": "GT-SECRET-001",
+        "finding_type": "secret_exposure",
+        "severity": "high",
+        "required_evidence": ["EVID-101"],
+        "remediation_keywords": ["rotate", "remove"],
+        "impact_keywords": ["secret", "credential", "api key", "leak"],
+        "assets": {
+            "services": ["gateway", "profile-service", "admin-service"],
+            "routes": ["/login", "/profile", "/admin/export"],
+            "collections": ["repo_snapshot", "telemetry", "headers", "dependencies"],
+        },
+        "repo": [
+            {
+                "evidence_id": "EVID-101",
+                "source": "repo_snapshot/config/profile-service.env",
+                "service_id": "profile-service",
+                "tags": ["api key", "secret", "credential", "config"],
+                "text": (
+                    "PROFILE_EXPORT_API_KEY=CYBERLAB_DO_NOT_USE_API_KEY_7F3A "
+                    "# synthetic lab credential"
+                ),
+            },
+            {
+                "evidence_id": "EVID-102",
+                "source": "repo_snapshot/config/gateway.env",
+                "service_id": "gateway",
+                "tags": ["config", "benign"],
+                "text": "LOG_LEVEL=info; FEATURE_SAFE_HEADERS=true",
+            },
+        ],
+        "logs": [
+            {
+                "evidence_id": "EVID-103",
+                "service_id": "profile-service",
+                "tags": ["benign", "startup"],
+                "text": "profile-service boot completed for synthetic tenant secops-lab",
+            }
+        ],
+        "headers": {
+            "gateway": {
+                "evidence_id": "EVID-104",
+                "present": ["Strict-Transport-Security", "Content-Security-Policy"],
+                "missing": [],
+            }
+        },
+        "dependencies": [
+            {
+                "evidence_id": "EVID-105",
+                "source": "repo_snapshot/requirements.lock",
+                "tags": ["dependency", "benign"],
+                "text": "fastapi==0.115.0; pydantic==2.8.2",
+            }
+        ],
+    },
+    "missing_security_headers_medium": {
+        "task_id": "missing_security_headers_medium",
+        "difficulty": "medium",
+        "title": "Missing security headers",
+        "alert": (
+            "Gateway response-hardening alert: verify whether required security "
+            "headers are missing or weak and submit evidence-backed remediation."
+        ),
+        "ground_truth_id": "GT-HEADERS-001",
+        "finding_type": "missing_security_headers",
+        "severity": "medium",
+        "required_evidence": ["EVID-201"],
+        "remediation_keywords": ["hsts", "csp"],
+        "impact_keywords": ["header", "hsts", "csp", "clickjacking"],
+        "assets": {
+            "services": ["gateway", "profile-service", "admin-service"],
+            "routes": ["/login", "/profile", "/admin/export"],
+            "collections": ["repo_snapshot", "telemetry", "headers", "dependencies"],
+        },
+        "repo": [
+            {
+                "evidence_id": "EVID-202",
+                "source": "repo_snapshot/gateway/security_headers.py",
+                "service_id": "gateway",
+                "tags": ["headers", "config"],
+                "text": "X-Frame-Options is set, but HSTS and CSP are not configured.",
+            }
+        ],
+        "logs": [
+            {
+                "evidence_id": "EVID-203",
+                "service_id": "gateway",
+                "tags": ["benign", "response"],
+                "text": "GET /profile 200 request_id=req-442 synthetic header audit",
+            }
+        ],
+        "headers": {
+            "gateway": {
+                "evidence_id": "EVID-201",
+                "present": ["X-Frame-Options", "X-Content-Type-Options"],
+                "missing": ["Strict-Transport-Security", "Content-Security-Policy"],
+            }
+        },
+        "dependencies": [
+            {
+                "evidence_id": "EVID-204",
+                "source": "repo_snapshot/requirements.lock",
+                "tags": ["dependency", "benign"],
+                "text": "starlette==0.38.2; uvicorn==0.30.1",
+            }
+        ],
+    },
+    "authz_boundary_hard": {
+        "task_id": "authz_boundary_hard",
+        "difficulty": "hard",
+        "title": "Authorisation boundary misconfiguration",
+        "alert": (
+            "Access-control drift alert: investigate a route/role mismatch in the "
+            "synthetic admin service and submit a validated remediation report."
+        ),
+        "ground_truth_id": "GT-AUTHZ-001",
+        "finding_type": "authz_boundary_misconfiguration",
+        "severity": "critical",
+        "required_evidence": ["EVID-301"],
+        "supporting_evidence": ["EVID-302"],
+        "remediation_keywords": ["least privilege", "policy", "regression"],
+        "impact_keywords": ["authorization", "authorisation", "role", "admin"],
+        "assets": {
+            "services": ["gateway", "profile-service", "admin-service"],
+            "routes": ["/login", "/profile", "/admin/export"],
+            "collections": ["repo_snapshot", "telemetry", "headers", "dependencies"],
+        },
+        "repo": [
+            {
+                "evidence_id": "EVID-301",
+                "source": "repo_snapshot/admin-service/policy_matrix.yaml",
+                "service_id": "admin-service",
+                "tags": ["authorization", "role", "policy", "admin export"],
+                "text": (
+                    "route=/admin/export allowed_roles=[admin, analyst] "
+                    "expected_roles=[admin]"
+                ),
+            }
+        ],
+        "logs": [
+            {
+                "evidence_id": "EVID-302",
+                "service_id": "admin-service",
+                "tags": ["authorization", "role", "admin export"],
+                "text": (
+                    "request_id=req-913 route=/admin/export role=analyst "
+                    "decision=allow synthetic boundary-check event"
+                ),
+            },
+            {
+                "evidence_id": "EVID-303",
+                "service_id": "gateway",
+                "tags": ["benign", "auth"],
+                "text": "request_id=req-912 route=/profile role=user decision=allow",
+            },
+        ],
+        "headers": {
+            "admin-service": {
+                "evidence_id": "EVID-304",
+                "present": ["Strict-Transport-Security", "Content-Security-Policy"],
+                "missing": [],
+            }
+        },
+        "dependencies": [
+            {
+                "evidence_id": "EVID-305",
+                "source": "repo_snapshot/requirements.lock",
+                "tags": ["dependency", "benign"],
+                "text": "pyyaml==6.0.2; fastapi==0.115.0",
+            }
+        ],
+    },
+}
+
+
+def list_task_ids() -> list[str]:
+    return list(SCENARIOS)
+
+
+def build_scenario(task_id: str | None, seed: int | None = None) -> dict[str, Any]:
+    """Return a deep-copied scenario with deterministic benign variation."""
+
+    selected_task_id = task_id if task_id in SCENARIOS else DEFAULT_TASK_ID
+    scenario = deepcopy(SCENARIOS[selected_task_id])
+    scenario["seed"] = seed
+
+    rng = Random(seed if seed is not None else 0)
+    service_alias_sets = [
+        ["gateway", "profile-service", "admin-service"],
+        ["edge-gateway", "user-profile", "admin-service"],
+        ["public-gateway", "profile-api", "backoffice-admin"],
+    ]
+    aliases = service_alias_sets[rng.randrange(len(service_alias_sets))]
+    original_services = scenario["assets"]["services"]
+    alias_map = dict(zip(original_services, aliases, strict=True))
+
+    scenario["service_aliases"] = alias_map
+    scenario["assets"]["services"] = [alias_map.get(s, s) for s in original_services]
+
+    for collection_name in ("repo", "logs"):
+        for item in scenario.get(collection_name, []):
+            service_id = item.get("service_id")
+            if service_id in alias_map:
+                item["service_id"] = alias_map[service_id]
+
+    scenario["headers"] = {
+        alias_map.get(service_id, service_id): snapshot
+        for service_id, snapshot in scenario.get("headers", {}).items()
+    }
+
+    for entries_name in ("repo", "logs", "dependencies"):
+        rng.shuffle(scenario.get(entries_name, []))
+
+    return scenario
tests/conftest.py ADDED
@@ -0,0 +1,7 @@
+import sys
+from pathlib import Path
+
+
+PACKAGE_PARENT = Path(__file__).resolve().parents[2]
+if str(PACKAGE_PARENT) not in sys.path:
+    sys.path.insert(0, str(PACKAGE_PARENT))
tests/test_environment.py ADDED
@@ -0,0 +1,187 @@
+from Cyber_analyst.models import CyberAnalystAction
+from Cyber_analyst.server.Cyber_analyst_environment import CyberAnalystEnvironment
+from Cyber_analyst.server.graders import (
+    grade_authz_boundary_hard,
+    grade_missing_security_headers_medium,
+    grade_secret_exposure_easy,
+    safe_reward,
+)
+
+
+def _run_success_path(task_id, actions):
+    env = CyberAnalystEnvironment()
+    obs = env.reset(task_id=task_id, seed=7)
+    assert obs.task_id == task_id
+
+    for action in actions:
+        obs = env.step(action)
+
+    assert obs.done is True
+    assert obs.tool_result["score"] > 0.5
+    assert 0.01 <= obs.tool_result["score"] <= 0.99
+    assert obs.error == ""
+    return obs
+
+
+def test_secret_exposure_success_path():
+    report = {
+        "findings": [
+            {
+                "finding_type": "secret_exposure",
+                "evidence_ids": ["EVID-101"],
+                "impact": "A synthetic API key secret is exposed in config.",
+                "remediation": "Remove the key and rotate the credential.",
+            }
+        ]
+    }
+    obs = _run_success_path(
+        "secret_exposure_easy",
+        [
+            CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}),
+            CyberAnalystAction(
+                tool_name="create_finding",
+                args={
+                    "finding_type": "secret_exposure",
+                    "evidence_ids": ["EVID-101"],
+                    "severity_guess": "high",
+                    "remediation": "Remove and rotate the synthetic credential.",
+                },
+            ),
+            CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
+            CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
+        ],
+    )
+    assert obs.verified_findings[0]["matching_gt_id"] == "GT-SECRET-001"
+    assert "trajectory_jsonl" in obs.tool_result
+    assert "search_repo" in obs.tool_result["trajectory_jsonl"]
+
+
+def test_missing_security_headers_success_path():
+    report = {
+        "findings": [
+            {
+                "finding_type": "missing_security_headers",
+                "evidence_ids": ["EVID-201"],
+                "impact": "The gateway is missing HSTS and CSP headers.",
+                "remediation": "Add HSTS and CSP at the gateway.",
+            }
+        ]
+    }
+    obs = _run_success_path(
+        "missing_security_headers_medium",
+        [
+            CyberAnalystAction(
+                tool_name="check_security_headers", args={"service_id": "gateway"}
+            ),
+            CyberAnalystAction(
+                tool_name="create_finding",
+                args={
+                    "finding_type": "missing_security_headers",
+                    "evidence_ids": ["EVID-201"],
+                    "severity_guess": "medium",
+                    "remediation": "Add HSTS and CSP headers.",
+                },
+            ),
+            CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
+            CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
+        ],
+    )
+    assert obs.score_breakdown["valid_evidence"] == 0.15
+
+
+def test_authz_boundary_success_path_with_alias_compatible_service_ids():
+    report = {
+        "findings": [
+            {
+                "finding_type": "authz_boundary_misconfiguration",
+                "evidence_ids": ["EVID-301", "EVID-302"],
+                "impact": "The admin route authorization policy allows an analyst role.",
+                "remediation": "Apply least privilege in the policy and add a regression test.",
+            }
+        ]
+    }
+    obs = _run_success_path(
+        "authz_boundary_hard",
+        [
+            CyberAnalystAction(tool_name="list_assets", args={}),
+            CyberAnalystAction(
+                tool_name="get_log_events",
+                args={"service_id": "admin-service", "query": "admin export"},
+            ),
+            CyberAnalystAction(tool_name="search_repo", args={"query": "admin export"}),
+            CyberAnalystAction(
+                tool_name="create_finding",
+                args={
+                    "finding_type": "authz_boundary_misconfiguration",
+                    "evidence_ids": ["EVID-301", "EVID-302"],
+                    "severity_guess": "critical",
+                    "remediation": "Apply least privilege and add a regression test.",
+                },
+            ),
+            CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
+            CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
+        ],
+    )
+    assert obs.score_breakdown["actionable_remediation"] == 0.15
+
+
+def test_invalid_tool_returns_observation_error():
+    env = CyberAnalystEnvironment()
+    env.reset(task_id="secret_exposure_easy", seed=1)
+    obs = env.step(CyberAnalystAction(tool_name="shell", args={"cmd": "whoami"}))
+    assert obs.done is False
+    assert obs.error == "unsupported_tool"
+    assert obs.tool_result["ok"] is False
+
+
+def test_hallucinated_report_scores_low_but_in_range():
+    env = CyberAnalystEnvironment()
+    env.reset(task_id="secret_exposure_easy", seed=1)
+    obs = env.step(
+        CyberAnalystAction(
+            tool_name="submit_report",
+            args={
+                "report_json": {
+                    "findings": [
+                        {
+                            "finding_type": "remote_code_execution",
+                            "evidence_ids": [],
+                            "impact": "Unsupported claim.",
+                            "remediation": "Unsupported remediation.",
+                        }
+                    ]
+                }
+            },
+        )
+    )
+    assert obs.done is True
+    assert obs.tool_result["score"] == 0.01
+
+
+def test_repeated_action_hard_stops_episode():
+    env = CyberAnalystEnvironment()
+    env.reset(task_id="secret_exposure_easy", seed=1)
+    obs = None
+    for _ in range(6):
+        obs = env.step(CyberAnalystAction(tool_name="list_assets", args={}))
+    assert obs is not None
+    assert obs.done is True
+    assert obs.error == "repeat_hard_stop"
+
+
+def test_seed_determinism_for_assets():
+    env_one = CyberAnalystEnvironment()
+    env_two = CyberAnalystEnvironment()
+    env_one.reset(task_id="authz_boundary_hard", seed=22)
+    env_two.reset(task_id="authz_boundary_hard", seed=22)
+    obs_one = env_one.step(CyberAnalystAction(tool_name="list_assets", args={}))
+    obs_two = env_two.step(CyberAnalystAction(tool_name="list_assets", args={}))
+    assert obs_one.tool_result == obs_two.tool_result
+
+
+def test_grader_adapters_and_clamp_are_strictly_in_range():
+    assert safe_reward(-1) == 0.01
+    assert safe_reward(2) == 0.99
+    assert 0.01 <= grade_secret_exposure_easy() <= 0.99
+    assert 0.01 <= grade_missing_security_headers_medium() <= 0.99
+    assert 0.01 <= grade_authz_boundary_hard() <= 0.99
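The final test pins `safe_reward` to a clamp band of [0.01, 0.99]. Its body is not shown in this hunk, so as an assumption, an implementation consistent with those assertions is a one-line clamp:

```python
def safe_reward(score: float) -> float:
    """Clamp any raw score into [0.01, 0.99] so rewards never saturate at 0 or 1."""
    return min(max(float(score), 0.01), 0.99)
```

Keeping rewards strictly inside (0, 1) means even a hallucinated report yields a nonzero floor (0.01) and a perfect run never reports exactly 1.0, which matches the in-range assertions in the success-path helper above.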
tests/test_inference.py ADDED
@@ -0,0 +1,47 @@
+import pytest
+
+from Cyber_analyst.inference import (
+    ModelActionError,
+    action_to_log,
+    error_action,
+    parse_model_action,
+)
+
+
+def test_parse_model_action_accepts_compact_json():
+    action = parse_model_action('{"tool_name":"search_repo","args":{"query":"api key"}}')
+
+    assert action.tool_name == "search_repo"
+    assert action.args == {"query": "api key"}
+
+
+def test_parse_model_action_accepts_fenced_json():
+    action = parse_model_action(
+        """```json
+{"tool_name":"list_assets","args":{}}
+```"""
+    )
+
+    assert action.tool_name == "list_assets"
+    assert action.args == {}
+
+
+def test_parse_model_action_rejects_malformed_json():
+    with pytest.raises(ModelActionError, match="model_parse_error"):
+        parse_model_action("search the repo for api keys")
+
+
+def test_action_to_log_is_single_line_json():
+    action = parse_model_action('{"tool_name":"search_repo","args":{"query":"api\\nkey"}}')
+
+    logged = action_to_log(action)
+
+    assert "\n" not in logged
+    assert logged == '{"args":{"query":"api\\nkey"},"tool_name":"search_repo"}'
+
+
+def test_error_action_uses_strict_diagnostic_tool_name():
+    action = error_action(ModelActionError("model_parse_error: empty response"))
+
+    assert action.tool_name == "model_parse_error"
+    assert action.args == {"message": "model_parse_error: empty response"}
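These tests fix `parse_model_action`'s contract: accept compact JSON or a ```json fenced block, and raise a `model_parse_error` on anything else. A rough standalone sketch of that fenced-or-raw parsing (hypothetical `parse_action` returning a plain dict, not the repo's implementation):

```python
import json
import re


def parse_action(text: str) -> dict:
    """Parse a tool action from raw or ```json-fenced model output."""
    # Prefer the body of a fenced block if one is present.
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    candidate = match.group(1) if match else text.strip()
    try:
        action = json.loads(candidate)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model_parse_error: {exc}") from exc
    if not isinstance(action, dict) or "tool_name" not in action:
        raise ValueError("model_parse_error: missing tool_name")
    return action
```

Routing every failure through one diagnostic prefix (`model_parse_error`) is what lets `error_action` downstream convert parse failures into a strict, greppable tool name instead of crashing the episode.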