---
title: VulnOps Reasoning Benchmark
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
---

# VulnOps OpenEnv

`vulnops` is an OpenEnv benchmark for open-source vulnerability operations. The agent plays the role of a maintainer or security analyst working through incoming vulnerability cases, revealing supporting evidence, filling a structured draft, and submitting the correct next maintainer action.

This benchmark is intentionally not a bug-fixing environment and not a generic classifier. It models a real workflow: validating advisories, identifying affected packages and versions, weighing severity versus exploitability, and deciding whether to patch or publish an advisory.

## Data sources

The benchmark now pulls case data from live public vulnerability feeds at runtime:

- OSV for package identity, advisory details, affected ranges, and references
- NVD for normalized CVE descriptions and CVSS severity metadata
- EPSS for exploitability scoring signals

The environment normalizes those live responses into hidden ground truth on `reset()`. To keep tests, local development, and offline execution stable, each task also includes a bundled fallback snapshot that is used when the APIs are unavailable.

In addition to the task-specific fallbacks, the container now ships with a broader cache of 200 provider-backed fallback snapshots under `data/snapshots/`. That keeps the image self-sufficient and gives us room to expand the benchmark without depending entirely on live API availability.

## Why this is useful

- Real-world utility: OSS maintainers triage reports like these every week.
- Deterministic grading: each case has hidden ground truth and a reproducible scorer.
- Multi-step rewards: the agent earns signal for revealing good evidence and filling the draft correctly before final submission.
- Lightweight deployment: no VM, browser, or external datasets are required at runtime.

## Environment interface

The environment implements the standard OpenEnv APIs:

- `reset(task_id=...) -> VulnTriageObservation`
- `step(VulnTriageAction) -> VulnTriageObservation`
- `state -> VulnTriageState`

### Action space

`VulnTriageAction` has these fields:

- `action_type`: one of `read_report`, `inspect_evidence`, `search_nvd_database`, `fetch_commit_diff`, `message_maintainer`, `set_validity`, `set_affected_package`, `set_affected_versions`, `set_severity`, `set_exploitability`, `set_next_action`, `set_missing_information`, `request_more_info`, `submit_triage`
- `evidence_id`: used with `inspect_evidence`
- `value`: generic value for label-setting and missing-information actions
- `rationale`: optional free-form note

### Observation space

`VulnTriageObservation` returns:

- task metadata: `task_id`, `difficulty`, `objective`
- `report_summary`
- `visible_evidence`
- `available_evidence`
- `draft`
- `action_history`
- `steps_remaining`
- `score_breakdown`
- `final_score`
- standard OpenEnv fields: `reward`, `done`, `metadata`
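To make the loop concrete, here is a minimal interaction sketch against a locally running server. The wrapper name `VulnTriageEnv`, its constructor, and the example `task_id` are illustrative assumptions; the real client lives in `client.py` and the dataclasses in `models.py`, so check those for the exact names.

```python
# Minimal interaction sketch. VulnTriageEnv, its constructor, and the task_id
# below are illustrative assumptions; see client.py / models.py for the real API.
import os

from client import VulnTriageEnv      # hypothetical wrapper class name
from models import VulnTriageAction

env = VulnTriageEnv(base_url=os.environ.get("ENV_BASE_URL", "http://localhost:8000"))
obs = env.reset(task_id="guarddog-path-traversal")  # illustrative task id

# Read the report, reveal one piece of evidence, fill a draft field, then submit.
for action in [
    VulnTriageAction(action_type="read_report"),
    VulnTriageAction(action_type="inspect_evidence",
                     evidence_id=obs.available_evidence[0]),  # assumes a list of evidence ids
    VulnTriageAction(action_type="set_validity", value="valid",
                     rationale="advisory and fix commit confirm the issue"),
    VulnTriageAction(action_type="submit_triage"),
]:
    obs = env.step(action)
    if obs.done:
        break

print("final score:", obs.final_score, obs.score_breakdown)
```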
## Task ladder

### 1. GuardDog Path Traversal

- Difficulty: easy
- Goal: Validate the report, identify the package and fixed range, and choose `patch`.

### 2. Invenio Multi-Branch XSS

- Difficulty: medium
- Goal: Resolve affected versions across multiple release lines and extract the truth despite decoy severity signals.

### 3. Requests Auth Header Leak

- Difficulty: medium
- Goal: Ignore severe threat-intel decoys and use `fetch_commit_diff` to read the Python fix manually.

### 4. Gradio Upload XSS

- Difficulty: hard
- Goal: Actively `message_maintainer` to discover the lack of a patch and avoid catastrophic penalties by choosing `request_info`.

## Baseline Scores

The benchmark includes a baseline evaluation script (`inference.py`). Tested against **Qwen3:30b** using the interactive action space:

- **Average Score (0-1.0):** `0.3104`
- **Reasoning Gap:** `68.96%` (the shortfall from a perfect score of `1.0`)

*Frontier models struggle with proactive tool use (`search_nvd_database`, `fetch_commit_diff`, `message_maintainer`) instead of passive reading, creating a massive optimization valley for RL evaluation.*

## Reward design

Per-step reward is shaped to encourage realistic behavior:

- positive reward for reading the report, revealing new relevant evidence, and setting a draft field correctly
- negative reward for repeated evidence inspection, empty or incorrect updates, and premature or low-evidence submission
- final submission reward equals the normalized grader score in `[0.0, 1.0]`, with a small penalty for submitting with too little evidence

### Grader weights

The submitted draft is scored as a weighted combination of its fields (a worked sketch follows the list):

- validity: `0.20`
- affected package: `0.10`
- affected versions: `0.10`
- severity: `0.20`
- exploitability: `0.15`
- next action: `0.15`
- missing-information handling: `0.10`
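As a rough sketch of that weighting, the snippet below assumes simple exact-match credit per field; the authoritative logic lives in `server/graders.py` and may award partial credit, and the field names and example values here are made up for illustration.

```python
# Illustrative grader arithmetic only: assumes binary per-field credit and
# made-up field names/values; the authoritative logic is in server/graders.py.
GRADER_WEIGHTS = {
    "validity": 0.20,
    "affected_package": 0.10,
    "affected_versions": 0.10,
    "severity": 0.20,
    "exploitability": 0.15,
    "next_action": 0.15,
    "missing_information": 0.10,
}

def grade(draft: dict, truth: dict) -> float:
    """Weighted exact-match score in [0.0, 1.0] for a submitted draft."""
    return sum(weight for field, weight in GRADER_WEIGHTS.items()
               if draft.get(field) == truth.get(field))

# Example: everything correct except severity and exploitability -> 0.65.
truth = {"validity": "valid", "affected_package": "guarddog",
         "affected_versions": "< 1.5.0", "severity": "high",
         "exploitability": "medium", "next_action": "patch",
         "missing_information": "none"}
draft = dict(truth, severity="low", exploitability="high")
print(round(grade(draft, truth), 2))  # 0.65
```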
## Project structure

```text
.
├── __init__.py
├── client.py
├── inference.py
├── models.py
├── openenv.yaml
├── pyproject.toml
└── server
    ├── app.py
    ├── cases.py
    ├── Dockerfile
    ├── graders.py
    └── vuln_triage_env_environment.py
```

## Setup

### Local Python setup

```bash
python -m pip install -e ".[dev]"
```

### Run the environment locally

```bash
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Validate the environment

```bash
openenv validate .
```

## Inference baseline

The required root-level `inference.py` supports two modes:

- `--policy openai`: uses the OpenAI Python client, reading credentials from `OPENAI_API_KEY` or `HF_TOKEN`, the model name from `MODEL_NAME`, and an optional base URL from `API_BASE_URL`
- `--policy heuristic`: a deterministic offline smoke test for local development

### Local direct benchmark run

```bash
python inference.py --policy heuristic
```

### Against a running local or remote server

```bash
export ENV_BASE_URL=http://localhost:8000
python inference.py --policy openai --model "$MODEL_NAME"
```

## Docker

Build and run:

```bash
docker build -t vulnops .
docker run -p 8000:8000 vulnops
```

## Hugging Face Space deployment

This project is packaged for a container-based FastAPI Space. The Space should be tagged with `openenv` and pointed at the provided `Dockerfile`.

## Expected baseline behavior

The heuristic policy should score `1.0` on every bundled fallback snapshot. The OpenAI baseline is intended as the hackathon submission baseline and should be reproducible with `temperature=0`.

## Local LoRA learnability check

This repo now includes a local LoRA pipeline for a quick "is the environment learnable?" check with `Qwen/Qwen3.5-4B`. On Apple Silicon, the recommended path is now `MLX`, not the older PyTorch `MPS` path.

### What it does

- generates deterministic heuristic transitions from the environment
- expands them into prompt-variant SFT examples
- runs LoRA SFT with checkpointing
- evaluates the base model and the adapted model back on `vulnops`
- writes append-only logs so interrupted runs still leave useful evidence

### Install the training extra

```bash
python -m pip install -e ".[train]"
```

### Recommended MLX path

```bash
python -m pip install mlx mlx-lm
./scripts/start_mlx_training.sh
```

Artifacts are written under `artifacts/mlx_qwen3_4b/`:

- `run_manifest.json`: current status and latest known checkpoint
- `data/train.jsonl`: MLX-ready SFT records
- `logs/mlx_train.log`: main training log
- `logs/nohup.out`: launcher stdout/stderr
- `metrics/speed_mlx.json`: parsed speed summary
- `adapters/`: MLX adapter artifacts
- `training_summary.json`: final run status

If you stop the run midway, rerun `python scripts/run_mlx_training.py --model Qwen/Qwen3.5-4B --output-root artifacts/mlx_qwen3_4b`. It will reuse the prepared dataset and resume from the saved adapter file when present (a small status-check sketch appears at the end of this README).

### Current speed comparison

On this Mac, the saved local benchmark showed:

- PyTorch `MPS`: about `72.5s/step`
- MLX: about `16.4s/step`

See [artifacts/speed_comparison.json](artifacts/speed_comparison.json).
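Since interrupted runs are expected, a quick status check before re-launching can help. This is only a sketch under stated assumptions: the `status` and `latest_checkpoint` keys are guesses about the manifest layout, so inspect your own `run_manifest.json` for the real field names.

```python
# Sketch: report whether a previous MLX LoRA run left something to resume.
# The "status" and "latest_checkpoint" keys are assumptions about
# run_manifest.json; check the actual file for its real field names.
import json
from pathlib import Path

root = Path("artifacts/mlx_qwen3_4b")
manifest_path = root / "run_manifest.json"

if manifest_path.exists():
    manifest = json.loads(manifest_path.read_text())
    print("status:", manifest.get("status", "unknown"))
    print("latest checkpoint:", manifest.get("latest_checkpoint", "none recorded"))
    adapters = sorted(p.name for p in (root / "adapters").glob("*"))
    print("adapters on disk:", adapters or "none yet")
else:
    print("No run manifest yet; start with ./scripts/start_mlx_training.sh")
```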