Spaces:

obversarystudios
/

agent-threat-map

Running

App Files Files Community

agent-threat-map / README.md

obversarystudios

Threat-map metrics + observable geometry (embed/cluster/MI)

6c3043e verified 1 day ago

preview code

raw

history blame contribute delete

5.15 kB

A newer version of the Gradio SDK is available: 6.14.0

Upgrade

metadata

title: Agent Threat Map Observatory
emoji: 🧭
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Threat-map benchmark with metrics and geometry

Agent Threat Map

Agent Threat Map is a research benchmark and observability scaffold for mapping fragile behavior in model-agent systems.

Instead of asking only whether a model answered correctly, this project asks where and how a model breaks under agent-like pressure.

The benchmark focuses on:

Prompt injection
Tool-output injection
Retrieval poisoning
Memory poisoning
Secret exfiltration
Unauthorized action

The goal is to make hidden model-agent failure modes visible, structured, and easier to compare across runs.

What this is

A JSONL probe set (data/threat_probe_seed.jsonl).
A rule-based evaluator and expanded run metrics (distributions, severity weighting, composite indices, failure-mode histograms, per-category rollups).
Observable geometry (same family as failure-geometry-demo): TF-IDF + SVD embeddings of scored cases → KMeans → mutual information vs category, severity, and pass/fail (agent_threat_map/observability.py). Results appear as observability in exported JSON and in the Gradio Observable geometry tab.
A Gradio Space UI (app.py) for scoring pasted responses and exporting JSON reports.

What this is not

This is not a production vulnerability scanner, security certification system, or complete red-team suite. See docs/limitations.md.

Dataset schema (JSONL)

Each line is one probe:

Field	Description
`id`	Stable identifier
`category`	One of the six families above
`fragility_area`	Finer-grained label
`severity`	`low` / `medium` / `high` / `critical`
`task`	What the agent is asked to do
`input`	User/tool/context text (may contain adversarial content)
`expected_behavior`	What safe handling looks like
`failure_modes`	Strings describing ways the run can go wrong
`observable_signal`	What a reviewer should look for
`safe_response_pattern`	Optional hints for benign completions

Regenerate the seed file (60 probes) with:

python3 scripts/generate_threat_seed.py

Run locally

From this directory (agent-threat-map/):

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3 examples/run_local_eval.py
python3 app.py

pip install -r requirements.txt installs scikit-learn (needed for observability / geometry and for examples/run_local_eval.py). examples/run_local_eval.py writes reports/sample_report.json using a canned safe-ish response over all probes.

If pip install fails or .venv/bin/python is missing, remove the broken env (rm -rf .venv), ensure PyPI is reachable (DNS/network), recreate the venv, and run pip install -r requirements.txt again. Do not commit .venv/ (it is gitignored).

Hugging Face Space

The YAML front matter above is what Hugging Face reads when this README lives at the Space repo root. Deploy by copying this folder to the Space (or hf upload — see scripts/push_spaces.sh in the parent monorepo).

Runtime: Python 3.10+ supported; Python 3.13 needs audioop-lts (already listed in requirements.txt).
No API keys required for the threat-map UI (manual paste only).

Metrics overview

Run-level metrics are documented in docs/scoring.md. Highlights:

Distribution: mean / median / P90 / max risk; weighted risk stats.
Severity-aware: severity-weighted pass rate; high-stakes failure rate.
Signals: boundary-language rate; safe vs unsafe signal totals / ratio.
Composites: resilience index, exposure index, fragility spread (risk std dev).
Slices: by category, by severity tier, failure-mode histogram, worst cases.
Observable geometry: MI(cluster, category), MI(cluster, severity), MI(cluster, pass_fail) plus 2-D scatter coordinates per case (needs ≥5 cases by default).

Related Spaces

failure-geometry-demo — CARB failure geometry with sklearn baselines (no API key).
carb-observability-space — same observability shape via HF Inference API (HF_TOKEN secret required).
obversarystudios.org — research narrative.

What you should do on your machine

Git: Commit and push agent-threat-map/ from your monorepo; merge any remote drift on GitHub first.
Hub: Create Space obversarystudios/agent-threat-map (or your namespace) if it does not exist, then run bash scripts/push_spaces.sh from the repo root (after hf auth login).
Smoke-test: After pip install -r requirements.txt, run python3 examples/run_local_eval.py and confirm reports/sample_report.json contains "observability".

License

See LICENSE.