---
title: VulnOps Reasoning Benchmark
emoji: πŸ›‘οΈ
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
---

VulnOps OpenEnv

VulnOps is an OpenEnv benchmark for open-source vulnerability operations. The agent plays the role of a maintainer or security analyst working through incoming vulnerability cases: revealing supporting evidence, filling out a structured triage draft, and submitting the correct next maintainer action.

This benchmark is intentionally not a bug-fixing environment and not a generic classifier. It models a real workflow: validating advisories, identifying affected packages and versions, weighing severity versus exploitability, and deciding whether to patch or publish an advisory.

Data sources

The benchmark now pulls case data from live public vulnerability feeds at runtime:

  • OSV for package identity, advisory details, affected ranges, and references
  • NVD for normalized CVE descriptions and CVSS severity metadata
  • EPSS for exploitability scoring signals

The environment normalizes those live responses into hidden ground truth on reset(). To keep tests, local development, and offline execution stable, each task also includes a bundled fallback snapshot that is used when the APIs are unavailable.

In addition to the task-specific fallbacks, the container now ships with a broader cache of 200 provider-backed fallback snapshots under data/snapshots/. That keeps the image self-sufficient and gives us room to expand the benchmark without depending entirely on live API availability.
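
The provider client lives in the server code; the sketch below only illustrates the fetch-with-fallback pattern described above. The function name, endpoint template, and snapshot file layout are assumptions for illustration, not the actual implementation.

```python
import json
import urllib.request
from pathlib import Path

OSV_API = "https://api.osv.dev/v1/vulns/{vuln_id}"  # public OSV endpoint

def fetch_osv_record(vuln_id: str, snapshot_dir: Path = Path("data/snapshots")) -> dict:
    """Try the live OSV API first; fall back to the bundled snapshot on any failure."""
    try:
        with urllib.request.urlopen(OSV_API.format(vuln_id=vuln_id), timeout=5) as resp:
            return json.load(resp)
    except Exception:
        # Offline or rate-limited: use the snapshot shipped in the image.
        # Snapshot filename layout is assumed for this sketch.
        return json.loads((snapshot_dir / f"{vuln_id}.json").read_text())
```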

Why this is useful

  • Real-world utility: OSS maintainers triage reports like these every week.
  • Deterministic grading: each case has hidden ground truth and a reproducible scorer.
  • Multi-step rewards: the agent earns signal for revealing good evidence and filling the draft correctly before final submission.
  • Lightweight deployment: no VM, browser, or external datasets are required at runtime.

Environment interface

The environment implements the standard OpenEnv APIs:

  • reset(task_id=...) -> VulnTriageObservation
  • step(VulnTriageAction) -> VulnTriageObservation
  • state -> VulnTriageState

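A minimal interaction loop looks roughly like the sketch below. The class name, import paths, and task_id value are assumptions based on the project layout (client.py, models.py); check the repo for the exact names.

```python
# Minimal interaction sketch; class and import names are assumed from client.py / models.py.
from client import VulnTriageEnv      # hypothetical wrapper around the HTTP API
from models import VulnTriageAction

env = VulnTriageEnv(base_url="http://localhost:8000")

obs = env.reset(task_id="guarddog_path_traversal")    # task_id value is illustrative
obs = env.step(VulnTriageAction(action_type="read_report"))
obs = env.step(VulnTriageAction(action_type="submit_triage"))

print(obs.done, obs.reward, obs.final_score)
```
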
Action space

VulnTriageAction has these fields:

  • action_type: one of read_report, inspect_evidence, search_nvd_database, fetch_commit_diff, message_maintainer, set_validity, set_affected_package, set_affected_versions, set_severity, set_exploitability, set_next_action, set_missing_information, request_more_info, submit_triage
  • evidence_id: used with inspect_evidence
  • value: generic value for label-setting and missing-information actions
  • rationale: optional free-form note

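For illustration, a few representative actions might be constructed as below. The field names follow the list above, but the constructor signature and the evidence_id value are assumptions.

```python
from models import VulnTriageAction  # assumed import path

# Reveal a specific piece of evidence (evidence_id value is illustrative).
reveal = VulnTriageAction(action_type="inspect_evidence", evidence_id="advisory_osv")

# Fill one field of the structured draft.
set_sev = VulnTriageAction(
    action_type="set_severity",
    value="high",
    rationale="CVSS vector from the advisory maps to High.",
)

# Submit the finished triage for grading.
submit = VulnTriageAction(action_type="submit_triage")
```
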
Observation space

VulnTriageObservation returns:

  • task metadata: task_id, difficulty, objective
  • report_summary
  • visible_evidence
  • available_evidence
  • draft
  • action_history
  • steps_remaining
  • score_breakdown
  • final_score
  • standard OpenEnv fields: reward, done, metadata

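Assuming the observation is a plain dataclass/Pydantic model and available_evidence is a list of string identifiers, a policy can read these fields directly; the snippet below is only a sketch of how they might be folded into a prompt.

```python
def summarize(obs) -> str:
    """Build a compact prompt fragment from a VulnTriageObservation (field names per the list above)."""
    lines = [
        f"Objective: {obs.objective} (difficulty: {obs.difficulty})",
        f"Report: {obs.report_summary}",
        f"Evidence available: {', '.join(obs.available_evidence)}",
        f"Draft so far: {obs.draft}",
        f"Steps remaining: {obs.steps_remaining}",
    ]
    return "\n".join(lines)
```
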
Task ladder

1. GuardDog Path Traversal

  • Difficulty: easy
  • Goal: Validate the report, identify the package and fixed range, and choose patch.

2. Invenio Multi-Branch XSS

  • Difficulty: medium
  • Goal: Resolve affected versions across multiple release lines and extract truth despite decoy severity signals.

3. Requests Auth Header Leak

  • Difficulty: medium
  • Goal: Ignore severe threat-intel decoys and use fetch_commit_diff to read the Python fix manually.

4. Gradio Upload XSS

  • Difficulty: hard
  • Goal: Actively message_maintainer to discover the lack of a patch and avoid catastrophic penalties by choosing request_info.

Baseline scores

The benchmark includes a baseline evaluation script (inference.py). Tested against Qwen3:30b using the interactive action space:

  • Average Score (0-1.0): 0.3104
  • Reasoning Gap (1.0 minus the average score): 68.96%

Frontier models struggle with proactive tool use (search_nvd_database, fetch_commit_diff, message_maintainer), defaulting to passive reading instead, which leaves a large optimization gap for RL training.

Reward design

Per-step reward is shaped to encourage realistic behavior:

  • positive reward for reading the report, revealing new relevant evidence, and setting a draft field correctly
  • negative reward for repeated evidence inspection, empty or incorrect updates, and premature or low-evidence submission
  • final submission reward equals the normalized grader score in [0.0, 1.0], with a small penalty for submitting with too little evidence

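The actual shaping constants live in the server's environment code; the sketch below only illustrates the structure described above. All state attributes and reward magnitudes here are placeholders, not the real values.

```python
def step_reward(action, state, grader_score=None) -> float:
    """Illustrative reward shaping; constants and state fields are placeholders."""
    if action.action_type == "submit_triage":
        # Final reward is the grader score, minus a small low-evidence penalty.
        penalty = 0.1 if state.evidence_revealed < state.min_evidence else 0.0
        return max(0.0, grader_score - penalty)
    if action.action_type == "inspect_evidence":
        # Reward new evidence, discourage repeated inspection.
        return 0.05 if action.evidence_id not in state.seen else -0.02
    if action.action_type.startswith("set_"):
        # Reward correct draft updates, discourage empty or wrong ones.
        return 0.05 if state.draft_field_correct(action) else -0.02
    return 0.0
```
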
Grader weights

  • validity: 0.20
  • affected package: 0.10
  • affected versions: 0.10
  • severity: 0.20
  • exploitability: 0.15
  • next action: 0.15
  • missing-information handling: 0.10

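Using the weights above, the final grade is a weighted sum over the draft fields. A minimal sketch, assuming each field contributes a partial-credit score in [0, 1]:

```python
GRADER_WEIGHTS = {
    "validity": 0.20,
    "affected_package": 0.10,
    "affected_versions": 0.10,
    "severity": 0.20,
    "exploitability": 0.15,
    "next_action": 0.15,
    "missing_information": 0.10,
}  # weights sum to 1.0

def grade(field_scores: dict[str, float]) -> float:
    """Weighted sum of per-field scores (each in [0, 1]); missing fields score 0."""
    return sum(w * field_scores.get(field, 0.0) for field, w in GRADER_WEIGHTS.items())

# Example: everything correct except exploitability and missing-information handling.
grade({"validity": 1, "affected_package": 1, "affected_versions": 1,
       "severity": 1, "next_action": 1})   # -> 0.75
```
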
Project structure

.
β”œβ”€β”€ __init__.py
β”œβ”€β”€ client.py
β”œβ”€β”€ inference.py
β”œβ”€β”€ models.py
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ pyproject.toml
└── server
    β”œβ”€β”€ app.py
    β”œβ”€β”€ cases.py
    β”œβ”€β”€ Dockerfile
    β”œβ”€β”€ graders.py
    └── vuln_triage_env_environment.py

Setup

Local Python setup

python -m pip install -e ".[dev]"

Run the environment locally

uvicorn server.app:app --host 0.0.0.0 --port 8000

Validate the environment

openenv validate .

Inference baseline

The required root-level inference.py supports two modes:

  • --policy openai: uses the OpenAI Python client, reading credentials from OPENAI_API_KEY or HF_TOKEN, model name from MODEL_NAME, and optional base URL from API_BASE_URL
  • --policy heuristic: deterministic offline smoke test for local development

Local direct benchmark run

python inference.py --policy heuristic

Against a running local or remote server

export ENV_BASE_URL=http://localhost:8000
python inference.py --policy openai --model "$MODEL_NAME"

Docker

Build and run:

docker build -t vulnops .
docker run -p 8000:8000 vulnops

Hugging Face Space deployment

This project is packaged for a container-based FastAPI Space. The Space should be tagged with openenv and pointed at the provided Dockerfile.

Expected baseline behavior

The heuristic policy should score 1.0 on every bundled fallback snapshot. The OpenAI baseline is intended as the hackathon submission baseline and should be reproducible with temperature=0.

Local LoRA learnability check

This repo now includes a local LoRA pipeline for a quick "is the environment learnable?" check with Qwen/Qwen3.5-4B.

On Apple Silicon, the recommended path is now MLX, not the older PyTorch MPS path.

What it does

  • generates deterministic heuristic transitions from the environment
  • expands them into prompt-variant SFT examples
  • runs LoRA SFT with checkpointing
  • evaluates the base model and adapted model back on vulnops
  • writes append-only logs so interrupted runs still leave useful evidence

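The record schema written to data/train.jsonl is defined by the pipeline scripts; the shape below is only an assumption of what one prompt/completion SFT record might look like.

```python
import json

# Hypothetical shape of one SFT record; the real schema comes from the pipeline scripts.
record = {
    "prompt": "Observation: <report summary, visible evidence, draft so far>\nNext action?",
    "completion": '{"action_type": "set_severity", "value": "high"}',
}

with open("data/train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")   # one JSON object per line
```
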
Install the training extra

python -m pip install -e ".[train]"

Recommended MLX path

python -m pip install mlx mlx-lm
./scripts/start_mlx_training.sh

Artifacts are written under artifacts/mlx_qwen3_4b/:

  • run_manifest.json: current status and latest known checkpoint
  • data/train.jsonl: MLX-ready SFT records
  • logs/mlx_train.log: main training log
  • logs/nohup.out: launcher stdout/stderr
  • metrics/speed_mlx.json: parsed speed summary
  • adapters/: MLX adapter artifacts
  • training_summary.json: final run status

If you stop the run midway, rerun python scripts/run_mlx_training.py --model Qwen/Qwen3.5-4B --output-root artifacts/mlx_qwen3_4b. It will reuse the prepared dataset and resume from the saved adapter file when present.

Current speed comparison

On this Mac, the saved local benchmark showed:

  • PyTorch MPS: about 72.5s/step
  • MLX: about 16.4s/step (roughly a 4.4x speedup)

See artifacts/speed_comparison.json.