---
title: VulnOps Reasoning Benchmark
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
- openenv
---
# VulnOps OpenEnv
`vulnops` is an OpenEnv benchmark for open-source vulnerability operations. The agent plays the role of a maintainer or security analyst working through incoming vulnerability cases, revealing supporting evidence, filling a structured draft, and submitting the correct next maintainer action.
This benchmark is intentionally not a bug-fixing environment and not a generic classifier. It models a real workflow: validating advisories, identifying affected packages and versions, weighing severity versus exploitability, and deciding whether to patch or publish an advisory.
## Data sources
The benchmark now pulls case data from live public vulnerability feeds at runtime:
- OSV for package identity, advisory details, affected ranges, and references
- NVD for normalized CVE descriptions and CVSS severity metadata
- EPSS for exploitability scoring signals
The environment normalizes those live responses into hidden ground truth on `reset()`. To keep tests, local development, and offline execution stable, each task also includes a bundled fallback snapshot that is used when the APIs are unavailable.
In addition to the task-specific fallbacks, the container now ships with a broader cache of 200 provider-backed fallback snapshots under `data/snapshots/`. That keeps the image self-sufficient and gives us room to expand the benchmark without depending entirely on live API availability.
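The normalization itself happens server-side on `reset()`; as a rough illustration of the fetch-with-fallback pattern, the sketch below queries the public OSV API and falls back to a bundled snapshot on any network failure. The OSV endpoint is real, but the helper name and snapshot filename convention are assumptions, not the environment's actual code.

```python
import json
from pathlib import Path

import requests

# Real OSV endpoint; the snapshot layout below is illustrative only.
OSV_URL = "https://api.osv.dev/v1/vulns/{vuln_id}"


def load_case_data(vuln_id: str, snapshot_dir: Path = Path("data/snapshots")) -> dict:
    """Fetch advisory data from OSV, falling back to a bundled snapshot when offline."""
    try:
        resp = requests.get(OSV_URL.format(vuln_id=vuln_id), timeout=10)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Offline or rate-limited: fall back to the snapshot shipped in the image.
        return json.loads((snapshot_dir / f"{vuln_id}.json").read_text())
```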
## Why this is useful
- Real-world utility: OSS maintainers triage reports like these every week.
- Deterministic grading: each case has hidden ground truth and a reproducible scorer.
- Multi-step rewards: the agent earns signal for revealing good evidence and filling the draft correctly before final submission.
- Lightweight deployment: no VM, browser, or external datasets are required at runtime.
## Environment interface
The environment implements the standard OpenEnv APIs:
- `reset(task_id=...) -> VulnTriageObservation`
- `step(VulnTriageAction) -> VulnTriageObservation`
- `state -> VulnTriageState`
### Action space
`VulnTriageAction` has these fields:
- `action_type`: one of `read_report`, `inspect_evidence`, `search_nvd_database`, `fetch_commit_diff`, `message_maintainer`, `set_validity`, `set_affected_package`, `set_affected_versions`, `set_severity`, `set_exploitability`, `set_next_action`, `set_missing_information`, `request_more_info`, `submit_triage`
- `evidence_id`: used with `inspect_evidence`
- `value`: generic value for label-setting and missing-information actions
- `rationale`: optional free-form note
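As an illustration, the sketch below constructs a few actions directly from these fields. The import path assumes the dataclasses live in `models.py` (per the project structure below); the evidence id and label values are purely hypothetical.

```python
from models import VulnTriageAction  # assumed location per the project structure

# Reveal one piece of evidence by id (the id here is hypothetical).
inspect = VulnTriageAction(action_type="inspect_evidence", evidence_id="osv_advisory")

# Fill a single draft field; `value` carries the label, `rationale` is optional.
set_severity = VulnTriageAction(
    action_type="set_severity",
    value="high",
    rationale="CVSS vector from NVD indicates a network-exploitable XSS",
)

# Commit the draft for grading.
submit = VulnTriageAction(action_type="submit_triage")
```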
### Observation space
`VulnTriageObservation` returns:
- task metadata: `task_id`, `difficulty`, `objective`
- `report_summary`
- `visible_evidence`
- `available_evidence`
- `draft`
- `action_history`
- `steps_remaining`
- `score_breakdown`
- `final_score`
- standard OpenEnv fields: `reward`, `done`, `metadata`
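A minimal episode loop, assuming `client.py` exposes a client with the `reset`/`step` interface above (the class name, base URL, and task id below are placeholders):

```python
from client import VulnTriageEnv          # assumed client class name
from models import VulnTriageAction

env = VulnTriageEnv(base_url="http://localhost:8000")

obs = env.reset(task_id="guarddog-path-traversal")   # illustrative task id
print(obs.objective, obs.steps_remaining)

# Read the report, then reveal each remaining piece of evidence once.
obs = env.step(VulnTriageAction(action_type="read_report"))
for evidence_id in obs.available_evidence:
    obs = env.step(VulnTriageAction(action_type="inspect_evidence", evidence_id=evidence_id))

# ... set the draft fields here, then submit for the final grade.
obs = env.step(VulnTriageAction(action_type="submit_triage"))
print(obs.final_score, obs.reward, obs.done)
```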
## Task ladder
### 1. GuardDog Path Traversal
- Difficulty: easy
- Goal: Validate the report, identify the package and fixed range, and choose `patch`.
### 2. Invenio Multi-Branch XSS
- Difficulty: medium
- Goal: Resolve affected versions across multiple release lines and extract truth despite decoy severity signals.
### 3. Requests Auth Header Leak
- Difficulty: medium
- Goal: Ignore severe threat-intel decoys and use `fetch_commit_diff` to read the Python fix manually.
### 4. Gradio Upload XSS
- Difficulty: hard
- Goal: Actively `message_maintainer` to discover the lack of a patch and avoid catastrophic penalties by choosing `request_info`.
## Baseline Scores
The benchmark includes a baseline evaluation script (`inference.py`). Tested against **Qwen3:30b** using the interactive action space:
- **Average Score (0-1.0):** `0.3104`
- **Reasoning Gap:** `68.96%`
*Frontier models default to passive reading instead of proactive tool use (`search_nvd_database`, `fetch_commit_diff`, `message_maintainer`), creating a large optimization valley for RL evaluation.*
## Reward design
Per-step reward is shaped to encourage realistic behavior:
- positive reward for reading the report, revealing new relevant evidence, and setting a draft field correctly
- negative reward for repeated evidence inspection, empty or incorrect updates, and premature or low-evidence submission
- final submission reward equals the normalized grader score in `[0.0, 1.0]`, with a small penalty for submitting with too little evidence
### Grader weights
- validity: `0.20`
- affected package: `0.10`
- affected versions: `0.10`
- severity: `0.20`
- exploitability: `0.15`
- next action: `0.15`
- missing-information handling: `0.10`
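To make the arithmetic concrete, the sketch below applies these weights to a field-by-field comparison between the submitted draft and the hidden ground truth. It is not the bundled grader; the dictionary keys and the size of the low-evidence penalty are illustrative.

```python
GRADER_WEIGHTS = {
    "validity": 0.20,
    "affected_package": 0.10,
    "affected_versions": 0.10,
    "severity": 0.20,
    "exploitability": 0.15,
    "next_action": 0.15,
    "missing_information": 0.10,
}


def grade_submission(draft: dict, truth: dict, low_evidence: bool = False) -> float:
    """Weighted exact-match grade in [0.0, 1.0], with an illustrative low-evidence penalty."""
    score = sum(
        weight
        for field, weight in GRADER_WEIGHTS.items()
        if draft.get(field) == truth.get(field)
    )
    if low_evidence:
        score -= 0.05  # the real penalty size is defined by the environment, not here
    return max(0.0, min(1.0, score))
```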
## Project structure
```text
.
├── __init__.py
├── client.py
├── inference.py
├── models.py
├── openenv.yaml
├── pyproject.toml
└── server
    ├── app.py
    ├── cases.py
    ├── Dockerfile
    ├── graders.py
    └── vuln_triage_env_environment.py
```
## Setup
### Local Python setup
```bash
python -m pip install -e ".[dev]"
```
### Run the environment locally
```bash
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Validate the environment
```bash
openenv validate .
```
## Inference baseline
The required root-level `inference.py` supports two modes:
- `--policy openai`: uses the OpenAI Python client, reading credentials from `OPENAI_API_KEY` or `HF_TOKEN`, model name from `MODEL_NAME`, and optional base URL from `API_BASE_URL`
- `--policy heuristic`: deterministic offline smoke test for local development
### Local direct benchmark run
```bash
python inference.py --policy heuristic
```
### Against a running local or remote server
```bash
export ENV_BASE_URL=http://localhost:8000
python inference.py --policy openai --model "$MODEL_NAME"
```
## Docker
Build and run:
```bash
docker build -t vulnops .
docker run -p 8000:8000 vulnops
```
## Hugging Face Space deployment
This project is packaged for a container-based FastAPI Space. The Space should be tagged with `openenv` and pointed at the provided `Dockerfile`.
## Expected baseline behavior
The heuristic policy should score `1.0` on all bundled fallback snapshots. The OpenAI baseline is intended as the hackathon submission baseline and should be reproducible with `temperature=0`.
## Local LoRA learnability check
This repo now includes a local LoRA pipeline for a quick "is the environment learnable?" check with `Qwen/Qwen3.5-4B`.
On Apple Silicon, the recommended path is now `MLX`, not the older PyTorch `MPS` path.
### What it does
- generates deterministic heuristic transitions from the environment
- expands them into prompt-variant SFT examples
- runs LoRA SFT with checkpointing
- evaluates the base model and adapted model back on `vulnops`
- writes append-only logs so interrupted runs still leave useful evidence
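The exact record format is whatever `data/train.jsonl` expects; as a rough sketch of the expansion step, the helper below serializes (observation, action) transitions into prompt/completion JSONL. The field names and prompt template are assumptions, not the repo's actual data builder.

```python
import json
from pathlib import Path


def transitions_to_sft_jsonl(transitions: list[dict], out_path: Path) -> None:
    """Write heuristic (observation, action) transitions as prompt/completion SFT records."""
    with out_path.open("w") as fh:
        for t in transitions:
            record = {
                "prompt": f"Observation:\n{json.dumps(t['observation'], indent=2)}\n\nNext action:",
                "completion": " " + json.dumps(t["action"]),
            }
            fh.write(json.dumps(record) + "\n")
```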
### Install the training extra
```bash
python -m pip install -e ".[train]"
```
### Recommended MLX path
```bash
python -m pip install mlx mlx-lm
./scripts/start_mlx_training.sh
```
Artifacts are written under `artifacts/mlx_qwen3_4b/`:
- `run_manifest.json`: current status and latest known checkpoint
- `data/train.jsonl`: MLX-ready SFT records
- `logs/mlx_train.log`: main training log
- `logs/nohup.out`: launcher stdout/stderr
- `metrics/speed_mlx.json`: parsed speed summary
- `adapters/`: MLX adapter artifacts
- `training_summary.json`: final run status
If you stop the run midway, rerun `python scripts/run_mlx_training.py --model Qwen/Qwen3.5-4B --output-root artifacts/mlx_qwen3_4b`.
It will reuse the prepared dataset and resume from the saved adapter file when present.
### Current speed comparison
On this Mac, the saved local benchmark showed:
- PyTorch `MPS`: about `72.5s/step`
- MLX: about `16.4s/step`
See [artifacts/speed_comparison.json](artifacts/speed_comparison.json).