---
title: VulnOps Reasoning Benchmark
emoji: πŸ›‘οΈ
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
---
# VulnOps OpenEnv

`vulnops` is an OpenEnv benchmark for open-source vulnerability operations. The agent plays the role of a maintainer or security analyst working through incoming vulnerability cases, revealing supporting evidence, filling a structured draft, and submitting the correct next maintainer action.

This benchmark is intentionally not a bug-fixing environment and not a generic classifier. It models a real workflow: validating advisories, identifying affected packages and versions, weighing severity versus exploitability, and deciding whether to patch or publish an advisory.

## Data sources

The benchmark now pulls case data from live public vulnerability feeds at runtime:

- OSV for package identity, advisory details, affected ranges, and references
- NVD for normalized CVE descriptions and CVSS severity metadata
- EPSS for exploitability scoring signals

The environment normalizes those live responses into hidden ground truth on `reset()`. To keep tests, local development, and offline execution stable, each task also includes a bundled fallback snapshot that is used when the APIs are unavailable.

In addition to the task-specific fallbacks, the container now ships with a broader cache of 200 provider-backed fallback snapshots under `data/snapshots/`. That keeps the image self-sufficient and gives us room to expand the benchmark without depending entirely on live API availability.
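
For illustration, the live-fetch-with-fallback pattern boils down to something like the sketch below. The OSV endpoint is the real public API; the helper name and per-advisory snapshot layout are assumptions, not the actual server implementation.

```python
import json
import urllib.request
from pathlib import Path

OSV_URL = "https://api.osv.dev/v1/vulns/{vuln_id}"  # public OSV advisory endpoint


def load_case_data(vuln_id: str, snapshot_dir: Path = Path("data/snapshots")) -> dict:
    """Fetch advisory data from OSV, falling back to a bundled snapshot when offline."""
    try:
        with urllib.request.urlopen(OSV_URL.format(vuln_id=vuln_id), timeout=10) as resp:
            return json.load(resp)
    except OSError:
        # API unreachable or rate-limited: use the bundled snapshot instead
        # (the per-advisory file layout shown here is hypothetical).
        return json.loads((snapshot_dir / f"{vuln_id}.json").read_text())
```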

## Why this is useful

- Real-world utility: OSS maintainers triage reports like these every week.
- Deterministic grading: each case has hidden ground truth and a reproducible scorer.
- Multi-step rewards: the agent earns signal for revealing good evidence and filling the draft correctly before final submission.
- Lightweight deployment: no VM, browser, or external datasets are required at runtime.

## Environment interface

The environment implements the standard OpenEnv APIs:

- `reset(task_id=...) -> VulnTriageObservation`
- `step(VulnTriageAction) -> VulnTriageObservation`
- `state -> VulnTriageState`

### Action space

`VulnTriageAction` has these fields:

- `action_type`: one of `read_report`, `inspect_evidence`, `search_nvd_database`, `fetch_commit_diff`, `message_maintainer`, `set_validity`, `set_affected_package`, `set_affected_versions`, `set_severity`, `set_exploitability`, `set_next_action`, `set_missing_information`, `request_more_info`, `submit_triage`
- `evidence_id`: used with `inspect_evidence`
- `value`: generic value for label-setting and missing-information actions
- `rationale`: optional free-form note
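
Assuming `VulnTriageAction` accepts these fields as keyword arguments, typical actions in an episode look roughly like this (the `evidence_id`, `value`, and `rationale` strings are illustrative):

```python
from models import VulnTriageAction  # models.py lives at the repo root

# Reveal one piece of supporting evidence.
reveal = VulnTriageAction(action_type="inspect_evidence", evidence_id="osv_advisory")

# Fill one field of the structured draft, with an optional free-form note.
set_severity = VulnTriageAction(
    action_type="set_severity",
    value="high",
    rationale="CVSS metadata from the NVD record",
)

# Submit the completed draft for grading.
submit = VulnTriageAction(action_type="submit_triage")
```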

### Observation space

`VulnTriageObservation` returns:

- task metadata: `task_id`, `difficulty`, `objective`
- `report_summary`
- `visible_evidence`
- `available_evidence`
- `draft`
- `action_history`
- `steps_remaining`
- `score_breakdown`
- `final_score`
- standard OpenEnv fields: `reward`, `done`, `metadata`
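
Putting action and observation together, a minimal episode might look like the sketch below. The client class name, its constructor, the `task_id`, and the label values are assumptions for illustration; only `reset`, `step`, and the observation fields listed above come from the documented interface.

```python
from client import VulnTriageEnv      # class name assumed; see client.py for the real API
from models import VulnTriageAction

env = VulnTriageEnv(base_url="http://localhost:8000")

obs = env.reset(task_id="guarddog_path_traversal")  # task_id value is illustrative
print(obs.objective, obs.steps_remaining)

obs = env.step(VulnTriageAction(action_type="read_report"))
obs = env.step(VulnTriageAction(action_type="set_validity", value="valid"))
obs = env.step(VulnTriageAction(action_type="submit_triage"))

print(obs.done, obs.final_score, obs.score_breakdown)
```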

## Task ladder

### 1. GuardDog Path Traversal
- Difficulty: easy
- Goal: Validate the report, identify the package and fixed range, and choose `patch`.

### 2. Invenio Multi-Branch XSS
- Difficulty: medium
- Goal: Resolve affected versions across multiple release lines and extract truth despite decoy severity signals.

### 3. Requests Auth Header Leak
- Difficulty: medium
- Goal: Ignore severe threat-intel decoys and use `fetch_commit_diff` to read the Python fix manually.

### 4. Gradio Upload XSS
- Difficulty: hard
- Goal: Actively `message_maintainer` to discover the lack of a patch and avoid catastrophic penalties by choosing `request_info`.

## Baseline Scores

The benchmark includes a baseline evaluation script (`inference.py`). Running it against **Qwen3:30b** with the interactive action space gives:

- **Average Score (0-1.0):** `0.3104`
- **Reasoning Gap:** `68.96%`
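
The gap appears to be measured against a perfect score: 1.0 - 0.3104 = 0.6896, i.e. 68.96%.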

*Models tend to default to passive reading instead of proactive tool use (`search_nvd_database`, `fetch_commit_diff`, `message_maintainer`), leaving a large optimization gap for RL.*

## Reward design

Per-step reward is shaped to encourage realistic behavior:

- positive reward for reading the report, revealing new relevant evidence, and setting a draft field correctly
- negative reward for repeated evidence inspection, empty or incorrect updates, and premature or low-evidence submission
- final submission reward equals the normalized grader score in `[0.0, 1.0]`, with a small penalty for submitting with too little evidence

### Grader weights

- validity: `0.20`
- affected package: `0.10`
- affected versions: `0.10`
- severity: `0.20`
- exploitability: `0.15`
- next action: `0.15`
- missing-information handling: `0.10`
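
A sketch of how those weights combine into the final grade (the field keys and boolean correctness checks are simplified here; the real logic lives in `server/graders.py`):

```python
# Weights as listed above; they sum to 1.0, so the score is already normalized.
WEIGHTS = {
    "validity": 0.20,
    "affected_package": 0.10,
    "affected_versions": 0.10,
    "severity": 0.20,
    "exploitability": 0.15,
    "next_action": 0.15,
    "missing_information": 0.10,
}


def grade(draft: dict, truth: dict) -> float:
    """Weighted agreement between the submitted draft and the hidden ground truth."""
    return sum(w for field, w in WEIGHTS.items() if draft.get(field) == truth.get(field))
```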

## Project structure

```text
.
β”œβ”€β”€ __init__.py
β”œβ”€β”€ client.py
β”œβ”€β”€ inference.py
β”œβ”€β”€ models.py
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ pyproject.toml
└── server
    β”œβ”€β”€ app.py
    β”œβ”€β”€ cases.py
    β”œβ”€β”€ Dockerfile
    β”œβ”€β”€ graders.py
    └── vuln_triage_env_environment.py
```

## Setup

### Local Python setup

```bash
python -m pip install -e ".[dev]"
```

### Run the environment locally

```bash
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Validate the environment

```bash
openenv validate .
```

## Inference baseline

The required root-level `inference.py` supports two modes:

- `--policy openai`: uses the OpenAI Python client, reading credentials from `OPENAI_API_KEY` or `HF_TOKEN`, model name from `MODEL_NAME`, and optional base URL from `API_BASE_URL`
- `--policy heuristic`: deterministic offline smoke test for local development

### Local direct benchmark run

```bash
python inference.py --policy heuristic
```

### Against a running local or remote server

```bash
export ENV_BASE_URL=http://localhost:8000
python inference.py --policy openai --model "$MODEL_NAME"
```

## Docker

Build and run:

```bash
docker build -t vulnops .
docker run -p 8000:8000 vulnops
```

## Hugging Face Space deployment

This project is packaged for a container-based FastAPI Space. The Space should be tagged with `openenv` and pointed at the provided `Dockerfile`.

## Expected baseline behavior

The heuristic policy should score `1.0` on every bundled fallback snapshot. The OpenAI policy is the intended hackathon submission baseline and should be reproducible with `temperature=0`.

## Local LoRA learnability check

This repo now includes a local LoRA pipeline for a quick "is the environment learnable?" check with `Qwen/Qwen3.5-4B`.

On Apple Silicon, the recommended path is now `MLX`, not the older PyTorch `MPS` path.

### What it does

- generates deterministic heuristic transitions from the environment
- expands them into prompt-variant SFT examples
- runs LoRA SFT with checkpointing
- evaluates the base model and adapted model back on `vulnops`
- writes append-only logs so interrupted runs still leave useful evidence
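
As a rough sketch of the "prompt-variant SFT examples" step, each record in `data/train.jsonl` is one JSON object per line; the exact schema below is an assumption, based on mlx-lm's plain-text LoRA data format rather than this repo's code:

```python
import json

# Hypothetical shape of one SFT record: an observation rendered as a prompt
# plus the heuristic policy's action as the completion, in a single "text" field.
record = {
    "text": (
        "Observation: report_summary=..., available_evidence=[...]\n"
        "Action: {\"action_type\": \"inspect_evidence\", \"evidence_id\": \"...\"}"
    )
}

with open("data/train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```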

### Install the training extra

```bash
python -m pip install -e ".[train]"
```

### Recommended MLX path

```bash
python -m pip install mlx mlx-lm
./scripts/start_mlx_training.sh
```

Artifacts are written under `artifacts/mlx_qwen3_4b/`:

- `run_manifest.json`: current status and latest known checkpoint
- `data/train.jsonl`: MLX-ready SFT records
- `logs/mlx_train.log`: main training log
- `logs/nohup.out`: launcher stdout/stderr
- `metrics/speed_mlx.json`: parsed speed summary
- `adapters/`: MLX adapter artifacts
- `training_summary.json`: final run status

If you stop the run midway, rerun `python scripts/run_mlx_training.py --model Qwen/Qwen3.5-4B --output-root artifacts/mlx_qwen3_4b`.
It will reuse the prepared dataset and resume from the saved adapter file when present.

### Current speed comparison

On an Apple Silicon development Mac, the saved local benchmark showed:

- PyTorch `MPS`: about `72.5s/step`
- MLX: about `16.4s/step`
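
That works out to roughly a 4.4x reduction in wall-clock time per step.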

See [artifacts/speed_comparison.json](artifacts/speed_comparison.json).