---
title: VulnOps Reasoning Benchmark
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
---
# VulnOps OpenEnv
VulnOps is an OpenEnv benchmark for open-source vulnerability operations. The agent plays the role of a maintainer or security analyst working through incoming vulnerability cases: revealing supporting evidence, filling a structured draft, and submitting the correct next maintainer action.
This benchmark is intentionally not a bug-fixing environment and not a generic classifier. It models a real workflow: validating advisories, identifying affected packages and versions, weighing severity versus exploitability, and deciding whether to patch or publish an advisory.
## Data sources
The benchmark now pulls case data from live public vulnerability feeds at runtime:
- OSV for package identity, advisory details, affected ranges, and references
- NVD for normalized CVE descriptions and CVSS severity metadata
- EPSS for exploitability scoring signals
The environment normalizes those live responses into hidden ground truth on reset(). To keep tests, local development, and offline execution stable, each task also includes a bundled fallback snapshot that is used when the APIs are unavailable.
In addition to the task-specific fallbacks, the container now ships with a broader cache of 200 provider-backed fallback snapshots under data/snapshots/. That keeps the image self-sufficient and gives us room to expand the benchmark without depending entirely on live API availability.
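The live-then-fallback lookup can be sketched roughly like this; the function itself and the one-JSON-file-per-case snapshot layout are illustrative assumptions, not the environment's actual API:

```python
import json
from pathlib import Path

# Bundled fallback cache; the directory name comes from this README,
# but the per-case file layout (one JSON file per id) is an assumption.
SNAPSHOT_DIR = Path("data/snapshots")

def load_case_data(case_id: str, fetch_live) -> dict:
    """Try the live provider first; fall back to the bundled snapshot.

    `fetch_live` is any callable that returns the normalized record or
    raises on failure -- a stand-in for the real OSV/NVD/EPSS calls.
    """
    try:
        return fetch_live(case_id)
    except Exception:
        snapshot = SNAPSHOT_DIR / f"{case_id}.json"
        if snapshot.exists():
            return json.loads(snapshot.read_text())
        raise RuntimeError(f"no live data and no snapshot for {case_id}")
```

This keeps `reset()` deterministic offline: the same snapshot produces the same hidden ground truth on every run.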
## Why this is useful
- Real-world utility: OSS maintainers triage reports like these every week.
- Deterministic grading: each case has hidden ground truth and a reproducible scorer.
- Multi-step rewards: the agent earns signal for revealing good evidence and filling the draft correctly before final submission.
- Lightweight deployment: no VM, browser, or external datasets are required at runtime.
## Environment interface
The environment implements the standard OpenEnv APIs:
```
reset(task_id=...)     -> VulnTriageObservation
step(VulnTriageAction) -> VulnTriageObservation
state                  -> VulnTriageState
```
### Action space
`VulnTriageAction` has these fields:

- `action_type`: one of `read_report`, `inspect_evidence`, `search_nvd_database`, `fetch_commit_diff`, `message_maintainer`, `set_validity`, `set_affected_package`, `set_affected_versions`, `set_severity`, `set_exploitability`, `set_next_action`, `set_missing_information`, `request_more_info`, `submit_triage`
- `evidence_id`: used with `inspect_evidence`
- `value`: generic value for label-setting and missing-information actions
- `rationale`: optional free-form note
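As a sketch, the action can be modeled as a dataclass over those four fields; the concrete types, defaults, and validation below are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

# Action names taken verbatim from the list above.
ACTION_TYPES = {
    "read_report", "inspect_evidence", "search_nvd_database",
    "fetch_commit_diff", "message_maintainer", "set_validity",
    "set_affected_package", "set_affected_versions", "set_severity",
    "set_exploitability", "set_next_action", "set_missing_information",
    "request_more_info", "submit_triage",
}

@dataclass
class VulnTriageAction:
    action_type: str
    evidence_id: Optional[str] = None   # used with inspect_evidence
    value: Optional[str] = None         # label-setting / missing-info actions
    rationale: Optional[str] = None     # optional free-form note

    def __post_init__(self):
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")
```

For example, setting a draft field looks like `VulnTriageAction("set_severity", value="high", rationale="CVSS 8.1 per NVD")`.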
### Observation space
`VulnTriageObservation` returns:

- task metadata: `task_id`, `difficulty`, `objective`
- `report_summary`
- `visible_evidence`
- `available_evidence`
- `draft`
- `action_history`
- `steps_remaining`
- `score_breakdown`
- `final_score`
- standard OpenEnv fields: `reward`, `done`, `metadata`
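Putting the action and observation contracts together, a driver loop against any OpenEnv-style client might look like the sketch below; only the `reset`/`step` shape and the `reward`/`done` fields come from this README, the rest is illustrative:

```python
def run_episode(env, policy, max_steps=40):
    """Drive one triage episode: reset, then step until done.

    `env` is anything exposing reset()/step() whose return values carry
    `reward` and `done` (the standard OpenEnv fields); `policy` maps an
    observation to the next action.
    """
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs = env.step(action)
        total_reward += obs.reward
        if obs.done:
            break
    return obs, total_reward
```

A real policy would read `visible_evidence` and `draft` from each observation before choosing the next action; the loop itself stays the same.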
## Task ladder
1. GuardDog Path Traversal
- Difficulty: easy
- Goal: Validate the report, identify the package and fixed range, and choose `patch`.
2. Invenio Multi-Branch XSS
- Difficulty: medium
- Goal: Resolve affected versions across multiple release lines and extract truth despite decoy severity signals.
3. Requests Auth Header Leak
- Difficulty: medium
- Goal: Ignore severe threat-intel decoys and use `fetch_commit_diff` to read the Python fix manually.
4. Gradio Upload XSS
- Difficulty: hard
- Goal: Actively `message_maintainer` to discover the lack of a patch and avoid catastrophic penalties by choosing `request_info`.
## Baseline Scores
The benchmark includes a baseline evaluation script (`inference.py`). Tested against Qwen3:30b using the interactive action space:

- Average Score (0-1.0): `0.3104`
- Reasoning Gap: `68.96%`
Frontier models struggle with proactive tool use (`search_nvd_database`, `fetch_commit_diff`, `message_maintainer`) and default to passive reading instead, creating a large optimization valley for RL evaluation.
## Reward design
Per-step reward is shaped to encourage realistic behavior:
- positive reward for reading the report, revealing new relevant evidence, and setting a draft field correctly
- negative reward for repeated evidence inspection, empty or incorrect updates, and premature or low-evidence submission
- final submission reward equals the normalized grader score in `[0.0, 1.0]`, with a small penalty for submitting with too little evidence
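The shaping rules above can be sketched as a single function. The reward magnitudes below are illustrative assumptions; only the signs and the triggering events come from the list:

```python
def step_reward(event: str, grader_score: float = 0.0,
                evidence_revealed: int = 0, min_evidence: int = 3) -> float:
    """Illustrative shaped per-step reward (magnitudes are assumptions)."""
    if event == "read_report":
        return 0.02                  # positive: reading the report
    if event == "new_evidence":
        return 0.05                  # positive: revealing new relevant evidence
    if event == "correct_field":
        return 0.05                  # positive: draft field set correctly
    if event in ("repeat_evidence", "bad_update"):
        return -0.05                 # negative: repeated inspection, wrong/empty update
    if event == "submit":
        reward = grader_score        # normalized grader score in [0.0, 1.0]
        if evidence_revealed < min_evidence:
            reward -= 0.1            # small low-evidence submission penalty
        return reward
    return 0.0
```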
### Grader weights
- validity: 0.20
- affected package: 0.10
- affected versions: 0.10
- severity: 0.20
- exploitability: 0.15
- next action: 0.15
- missing-information handling: 0.10
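With those weights (which sum to 1.0), the final score is a weighted sum of per-field correctness. A minimal sketch, assuming exact-match scoring per field (the real grader may award partial credit):

```python
# Weights taken verbatim from the list above; field keys are assumptions.
GRADER_WEIGHTS = {
    "validity": 0.20,
    "affected_package": 0.10,
    "affected_versions": 0.10,
    "severity": 0.20,
    "exploitability": 0.15,
    "next_action": 0.15,
    "missing_information": 0.10,
}

def grade(draft: dict, truth: dict) -> float:
    """Weighted exact-match score in [0.0, 1.0]."""
    return sum(weight for field, weight in GRADER_WEIGHTS.items()
               if draft.get(field) == truth.get(field))
```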
## Project structure
```
.
├── __init__.py
├── client.py
├── inference.py
├── models.py
├── openenv.yaml
├── pyproject.toml
└── server
    ├── app.py
    ├── cases.py
    ├── Dockerfile
    ├── graders.py
    └── vuln_triage_env_environment.py
```
## Setup

### Local Python setup

```shell
python -m pip install -e ".[dev]"
```

### Run the environment locally

```shell
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Validate the environment

```shell
openenv validate .
```
## Inference baseline
The required root-level `inference.py` supports two modes:

- `--policy openai`: uses the OpenAI Python client, reading credentials from `OPENAI_API_KEY` or `HF_TOKEN`, the model name from `MODEL_NAME`, and an optional base URL from `API_BASE_URL`
- `--policy heuristic`: deterministic offline smoke test for local development
### Local direct benchmark run

```shell
python inference.py --policy heuristic
```

### Against a running local or remote server

```shell
export ENV_BASE_URL=http://localhost:8000
python inference.py --policy openai --model "$MODEL_NAME"
```
## Docker

Build and run:

```shell
docker build -t vulnops .
docker run -p 8000:8000 vulnops
```
## Hugging Face Space deployment
This project is packaged for a container-based FastAPI Space. The Space should be tagged with `openenv` and pointed at the provided `Dockerfile`.
## Expected baseline behavior
The heuristic policy should score `1.0` on all three bundled fallback snapshots. The OpenAI baseline is intended as the hackathon submission baseline and should be reproducible with `temperature=0`.
## Local LoRA learnability check
This repo now includes a local LoRA pipeline for a quick "is the environment learnable?" check with `Qwen/Qwen3.5-4B`.
On Apple Silicon, the recommended path is now MLX, not the older PyTorch MPS path.
### What it does
- generates deterministic heuristic transitions from the environment
- expands them into prompt-variant SFT examples
- runs LoRA SFT with checkpointing
- evaluates the base model and adapted model back on `vulnops`
- writes append-only logs so interrupted runs still leave useful evidence
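The SFT expansion step writes plain JSONL. A minimal sketch of serializing one transition as a prompt/completion record (the exact schema the MLX trainer in this repo expects is an assumption; prompt/completion is just a common SFT convention):

```python
import json

def transition_to_record(observation_text: str, action: dict) -> str:
    """Serialize one heuristic transition as a single JSONL line.

    The `prompt`/`completion` field names are illustrative, not
    necessarily what data/train.jsonl actually contains.
    """
    record = {
        "prompt": f"Observation:\n{observation_text}\nNext action:",
        "completion": json.dumps(action),
    }
    return json.dumps(record)
```

One line per transition, appended to the training file, keeps the dataset build resumable after an interrupted run.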
### Install the training extra

```shell
python -m pip install -e ".[train]"
```
### Recommended MLX path

```shell
python -m pip install mlx mlx-lm
./scripts/start_mlx_training.sh
```
Artifacts are written under `artifacts/mlx_qwen3_4b/`:

- `run_manifest.json`: current status and latest known checkpoint
- `data/train.jsonl`: MLX-ready SFT records
- `logs/mlx_train.log`: main training log
- `logs/nohup.out`: launcher stdout/stderr
- `metrics/speed_mlx.json`: parsed speed summary
- `adapters/`: MLX adapter artifacts
- `training_summary.json`: final run status
If you stop the run midway, rerun:

```shell
python scripts/run_mlx_training.py --model Qwen/Qwen3.5-4B --output-root artifacts/mlx_qwen3_4b
```

It will reuse the prepared dataset and resume from the saved adapter file when present.
### Current speed comparison

On this Mac, the saved local benchmark showed:

- PyTorch MPS: about 72.5 s/step
- MLX: about 16.4 s/step (roughly a 4.4x speedup)