---
title: VulnOps Reasoning Benchmark
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
---
# VulnOps OpenEnv
VulnOps is an OpenEnv benchmark for open-source vulnerability operations. The agent plays the role of a maintainer or security analyst working through incoming vulnerability cases: revealing supporting evidence, filling a structured draft, and submitting the correct next maintainer action.
This benchmark is intentionally not a bug-fixing environment and not a generic classifier. It models a real workflow: validating advisories, identifying affected packages and versions, weighing severity versus exploitability, and deciding whether to patch or publish an advisory.
## Data sources
The benchmark now pulls case data from live public vulnerability feeds at runtime:
- OSV for package identity, advisory details, affected ranges, and references
- NVD for normalized CVE descriptions and CVSS severity metadata
- EPSS for exploitability scoring signals
The environment normalizes those live responses into hidden ground truth on reset(). To keep tests, local development, and offline execution stable, each task also includes a bundled fallback snapshot that is used when the APIs are unavailable.
In addition to the task-specific fallbacks, the container now ships with a broader cache of 200 provider-backed fallback snapshots under data/snapshots/. That keeps the image self-sufficient and gives us room to expand the benchmark without depending entirely on live API availability.
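The live-then-fallback lookup can be sketched roughly like this; the function itself and the one-JSON-file-per-case snapshot layout are illustrative assumptions, not the environment's actual API:

```python
import json
from pathlib import Path

# Bundled fallback cache; the directory name comes from this README,
# but the per-case file layout (one JSON file per id) is an assumption.
SNAPSHOT_DIR = Path("data/snapshots")

def load_case_data(case_id: str, fetch_live) -> dict:
    """Try the live provider first; fall back to the bundled snapshot.

    `fetch_live` is any callable that returns the normalized record or
    raises on failure -- a stand-in for the real OSV/NVD/EPSS calls.
    """
    try:
        return fetch_live(case_id)
    except Exception:
        snapshot = SNAPSHOT_DIR / f"{case_id}.json"
        if snapshot.exists():
            return json.loads(snapshot.read_text())
        raise RuntimeError(f"no live data and no snapshot for {case_id}")
```

This keeps `reset()` deterministic offline: the same snapshot produces the same hidden ground truth on every run.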
## Why this is useful
- Real-world utility: OSS maintainers triage reports like these every week.
- Deterministic grading: each case has hidden ground truth and a reproducible scorer.
- Multi-step rewards: the agent earns signal for revealing good evidence and filling the draft correctly before final submission.
- Lightweight deployment: no VM, browser, or external datasets are required at runtime.
## Environment interface
The environment implements the standard OpenEnv APIs:
```
reset(task_id=...)     -> VulnTriageObservation
step(VulnTriageAction) -> VulnTriageObservation
state                  -> VulnTriageState
```
### Action space
`VulnTriageAction` has these fields:

- `action_type`: one of `read_report`, `inspect_evidence`, `search_nvd_database`, `fetch_commit_diff`, `message_maintainer`, `set_validity`, `set_affected_package`, `set_affected_versions`, `set_severity`, `set_exploitability`, `set_next_action`, `set_missing_information`, `request_more_info`, `submit_triage`
- `evidence_id`: used with `inspect_evidence`
- `value`: generic value for label-setting and missing-information actions
- `rationale`: optional free-form note
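As a sketch, the action can be modeled as a dataclass over those four fields; the concrete types, defaults, and validation below are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

# Action names taken verbatim from the list above.
ACTION_TYPES = {
    "read_report", "inspect_evidence", "search_nvd_database",
    "fetch_commit_diff", "message_maintainer", "set_validity",
    "set_affected_package", "set_affected_versions", "set_severity",
    "set_exploitability", "set_next_action", "set_missing_information",
    "request_more_info", "submit_triage",
}

@dataclass
class VulnTriageAction:
    action_type: str
    evidence_id: Optional[str] = None   # used with inspect_evidence
    value: Optional[str] = None         # label-setting / missing-info actions
    rationale: Optional[str] = None     # optional free-form note

    def __post_init__(self):
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")
```

For example, setting a draft field looks like `VulnTriageAction("set_severity", value="high", rationale="CVSS 8.1 per NVD")`.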
### Observation space
`VulnTriageObservation` returns:

- task metadata: `task_id`, `difficulty`, `objective`
- `report_summary`
- `visible_evidence`
- `available_evidence`
- `draft`
- `action_history`
- `steps_remaining`
- `score_breakdown`
- `final_score`
- standard OpenEnv fields: `reward`, `done`, `metadata`
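Putting the action and observation contracts together, a driver loop against any OpenEnv-style client might look like the sketch below; only the `reset`/`step` shape and the `reward`/`done` fields come from this README, the rest is illustrative:

```python
def run_episode(env, policy, max_steps=40):
    """Drive one triage episode: reset, then step until done.

    `env` is anything exposing reset()/step() whose return values carry
    `reward` and `done` (the standard OpenEnv fields); `policy` maps an
    observation to the next action.
    """
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs = env.step(action)
        total_reward += obs.reward
        if obs.done:
            break
    return obs, total_reward
```

A real policy would read `visible_evidence` and `draft` from each observation before choosing the next action; the loop itself stays the same.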
## Task ladder
1. GuardDog Path Traversal
- Difficulty: easy
- Goal: Validate the report, identify the package and fixed range, and choose `patch`.
2. Invenio Multi-Branch XSS
- Difficulty: medium
- Goal: Resolve affected versions across multiple release lines and extract truth despite decoy severity signals.
3. Requests Auth Header Leak
- Difficulty: medium
- Goal: Ignore severe threat-intel decoys and use `fetch_commit_diff` to read the Python fix manually.
4. Gradio Upload XSS
- Difficulty: hard
- Goal: Actively `message_maintainer` to discover the lack of a patch and avoid catastrophic penalties by choosing `request_info`.
## Baseline Scores
The benchmark includes a baseline evaluation script (`inference.py`). Tested against Qwen3:30b using the interactive action space:

- Average Score (0-1.0): `0.3104`
- Reasoning Gap: `68.96%`
Frontier models struggle with proactive tool use (`search_nvd_database`, `fetch_commit_diff`, `message_maintainer`) and default to passive reading instead, creating a large optimization valley for RL evaluation.
## Reward design
Per-step reward is shaped to encourage realistic behavior:
- positive reward for reading the report, revealing new relevant evidence, and setting a draft field correctly
- negative reward for repeated evidence inspection, empty or incorrect updates, and premature or low-evidence submission
- final submission reward equals the normalized grader score in `[0.0, 1.0]`, with a small penalty for submitting with too little evidence
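The shaping rules above can be sketched as a single function. The reward magnitudes below are illustrative assumptions; only the signs and the triggering events come from the list:

```python
def step_reward(event: str, grader_score: float = 0.0,
                evidence_revealed: int = 0, min_evidence: int = 3) -> float:
    """Illustrative shaped per-step reward (magnitudes are assumptions)."""
    if event == "read_report":
        return 0.02                  # positive: reading the report
    if event == "new_evidence":
        return 0.05                  # positive: revealing new relevant evidence
    if event == "correct_field":
        return 0.05                  # positive: draft field set correctly
    if event in ("repeat_evidence", "bad_update"):
        return -0.05                 # negative: repeated inspection, wrong/empty update
    if event == "submit":
        reward = grader_score        # normalized grader score in [0.0, 1.0]
        if evidence_revealed < min_evidence:
            reward -= 0.1            # small low-evidence submission penalty
        return reward
    return 0.0
```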
### Grader weights
- validity: 0.20
- affected package: 0.10
- affected versions: 0.10
- severity: 0.20
- exploitability: 0.15
- next action: 0.15
- missing-information handling: 0.10
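With those weights (which sum to 1.0), the final score is a weighted sum of per-field correctness. A minimal sketch, assuming exact-match scoring per field (the real grader may award partial credit):

```python
# Weights taken verbatim from the list above; field keys are assumptions.
GRADER_WEIGHTS = {
    "validity": 0.20,
    "affected_package": 0.10,
    "affected_versions": 0.10,
    "severity": 0.20,
    "exploitability": 0.15,
    "next_action": 0.15,
    "missing_information": 0.10,
}

def grade(draft: dict, truth: dict) -> float:
    """Weighted exact-match score in [0.0, 1.0]."""
    return sum(weight for field, weight in GRADER_WEIGHTS.items()
               if draft.get(field) == truth.get(field))
```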
## Project structure
```
.
├── __init__.py
├── client.py
├── inference.py
├── models.py
├── openenv.yaml
├── pyproject.toml
└── server
    ├── app.py
    ├── cases.py
    ├── Dockerfile
    ├── graders.py
    └── vuln_triage_env_environment.py
```
## Setup

### Local Python setup

```shell
python -m pip install -e ".[dev]"
```

### Run the environment locally

```shell
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Validate the environment

```shell
openenv validate .
```
## Inference baseline
The required root-level `inference.py` supports two modes:

- `--policy openai`: uses the OpenAI Python client, reading credentials from `OPENAI_API_KEY` or `HF_TOKEN`, the model name from `MODEL_NAME`, and an optional base URL from `API_BASE_URL`
- `--policy heuristic`: deterministic offline smoke test for local development
### Local direct benchmark run

```shell
python inference.py --policy heuristic
```

### Against a running local or remote server

```shell
export ENV_BASE_URL=http://localhost:8000
python inference.py --policy openai --model "$MODEL_NAME"
```
## Docker

Build and run:

```shell
docker build -t vulnops .
docker run -p 8000:8000 vulnops
```
## Hugging Face Space deployment
This project is packaged for a container-based FastAPI Space. The Space should be tagged with `openenv` and pointed at the provided `Dockerfile`.
## Expected baseline behavior
The heuristic policy should score `1.0` on all three bundled fallback snapshots. The OpenAI baseline is intended as the hackathon submission baseline and should be reproducible with `temperature=0`.
## Local LoRA learnability check
This repo now includes a local LoRA pipeline for a quick "is the environment learnable?" check with `Qwen/Qwen3.5-4B`.
On Apple Silicon, the recommended path is now MLX, not the older PyTorch MPS path.
### What it does
- generates deterministic heuristic transitions from the environment
- expands them into prompt-variant SFT examples
- runs LoRA SFT with checkpointing
- evaluates the base model and adapted model back on `vulnops`
- writes append-only logs so interrupted runs still leave useful evidence
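The SFT expansion step writes plain JSONL. A minimal sketch of serializing one transition as a prompt/completion record (the exact schema the MLX trainer in this repo expects is an assumption; prompt/completion is just a common SFT convention):

```python
import json

def transition_to_record(observation_text: str, action: dict) -> str:
    """Serialize one heuristic transition as a single JSONL line.

    The `prompt`/`completion` field names are illustrative, not
    necessarily what data/train.jsonl actually contains.
    """
    record = {
        "prompt": f"Observation:\n{observation_text}\nNext action:",
        "completion": json.dumps(action),
    }
    return json.dumps(record)
```

One line per transition, appended to the training file, keeps the dataset build resumable after an interrupted run.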
### Install the training extra

```shell
python -m pip install -e ".[train]"
```
### Recommended MLX path

```shell
python -m pip install mlx mlx-lm
./scripts/start_mlx_training.sh
```
Artifacts are written under `artifacts/mlx_qwen3_4b/`:

- `run_manifest.json`: current status and latest known checkpoint
- `data/train.jsonl`: MLX-ready SFT records
- `logs/mlx_train.log`: main training log
- `logs/nohup.out`: launcher stdout/stderr
- `metrics/speed_mlx.json`: parsed speed summary
- `adapters/`: MLX adapter artifacts
- `training_summary.json`: final run status
If you stop the run midway, rerun:

```shell
python scripts/run_mlx_training.py --model Qwen/Qwen3.5-4B --output-root artifacts/mlx_qwen3_4b
```

It will reuse the prepared dataset and resume from the saved adapter file when present.
### Current speed comparison

On this Mac, the saved local benchmark showed:

- PyTorch MPS: about 72.5 s/step
- MLX: about 16.4 s/step (roughly a 4.4x speedup)