	---
	title: Agent Threat Map Observatory
	emoji: 🧭
	colorFrom: gray
	colorTo: purple
	sdk: gradio
	sdk_version: 5.50.0
	app_file: app.py
	pinned: false
	license: mit
	short_description: Threat-map benchmark with metrics and geometry
	---

	# Agent Threat Map

	Agent Threat Map is a research benchmark and observability scaffold for mapping fragile behavior in model-agent systems.

	Instead of asking only whether a model answered correctly, this project asks where and how a model breaks under agent-like pressure.

	The benchmark focuses on:

	- Prompt injection
	- Tool-output injection
	- Retrieval poisoning
	- Memory poisoning
	- Secret exfiltration
	- Unauthorized action

	The goal is to make hidden model-agent failure modes visible, structured, and easier to compare across runs.

	## What this is

	- A JSONL probe set (`data/threat_probe_seed.jsonl`).
	- A rule-based evaluator and expanded run metrics (distributions, severity weighting, composite indices, failure-mode histograms, per-category rollups).
- Observable geometry (same family as `failure-geometry-demo`): TF-IDF + SVD embeddings of scored cases → KMeans clustering → mutual information between cluster assignments and category, severity, and pass/fail (`agent_threat_map/observability.py`). Results appear under `observability` in the exported JSON and in the Gradio **Observable geometry** tab.
	- A Gradio Space UI (`app.py`) for scoring pasted responses and exporting JSON reports.
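The last stage of the geometry pipeline reduces each clustering to one number per label column. Below is a minimal pure-Python sketch of that mutual-information step, in nats, using the same definition as scikit-learn's `mutual_info_score`. The name `mutual_information` and the toy labels are hypothetical, and the TF-IDF, SVD, and KMeans steps are assumed to happen upstream (scikit-learn in the project).

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(clusters: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """MI(cluster; label) in nats, from the empirical joint distribution.

    Hypothetical helper for illustration; agent_threat_map/observability.py
    may compute this differently.
    """
    if len(clusters) != len(labels):
        raise ValueError("clusters and labels must have the same length")
    n = len(clusters)
    if n == 0:
        return 0.0
    joint = Counter(zip(clusters, labels))
    cluster_counts = Counter(clusters)
    label_counts = Counter(labels)
    mi = 0.0
    for (c, lab), n_cl in joint.items():
        # p(c, l) * log(p(c, l) / (p(c) * p(l))), rewritten with raw counts.
        mi += (n_cl / n) * math.log(n * n_cl / (cluster_counts[c] * label_counts[lab]))
    return mi


# Toy example: KMeans cluster ids vs pass/fail outcome per scored case.
clusters = [0, 0, 1, 1, 2, 2]
pass_fail = ["pass", "pass", "fail", "fail", "fail", "pass"]
print(round(mutual_information(clusters, pass_fail), 4))  # prints 0.4621
```

Running it three times, against category, severity, and pass/fail, gives the three `MI(cluster, ·)` values. A value near 0 means the clusters carry little information about that label; the upper bound is the label's own entropy.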

	## What this is not

	This is not a production vulnerability scanner, security certification system, or complete red-team suite. See [docs/limitations.md](docs/limitations.md).

	## Dataset schema (JSONL)

	Each line is one probe:

| Field | Description |
| --- | --- |
| `id` | Stable identifier |
| `category` | One of the six families above |
| `fragility_area` | Finer-grained label |
| `severity` | `low` / `medium` / `high` / `critical` |
| `task` | What the agent is asked to do |
| `input` | User/tool/context text (may contain adversarial content) |
| `expected_behavior` | What safe handling looks like |
| `failure_modes` | Strings describing ways the run can go wrong |
| `observable_signal` | What a reviewer should look for |
| `safe_response_pattern` | Optional hints for benign completions |

	Regenerate the seed file (60 probes) with:

	```bash
	python3 scripts/generate_threat_seed.py
	```
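After regenerating, each line can be checked against the schema table. This is a minimal standard-library sketch, not the project's validator. It assumes `failure_modes` is a JSON list of strings and treats only `safe_response_pattern` as optional. The exact `category` strings are not listed in this README, so they are not checked. `validate_probe` and `validate_file` are hypothetical names.

```python
import json

# Required keys from the schema table; safe_response_pattern is optional.
REQUIRED_FIELDS = (
    "id", "category", "fragility_area", "severity", "task",
    "input", "expected_behavior", "failure_modes", "observable_signal",
)
SEVERITIES = {"low", "medium", "high", "critical"}


def validate_probe(line: str) -> list[str]:
    """Return the problems found in one JSONL probe line; an empty list means it passed."""
    try:
        probe = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(probe, dict):
        return ["probe must be a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in probe]
    severity = probe.get("severity")
    if "severity" in probe and (not isinstance(severity, str) or severity not in SEVERITIES):
        problems.append(f"unknown severity: {severity!r}")
    # Assumption: failure_modes is stored as a JSON list of strings.
    modes = probe.get("failure_modes")
    if modes is not None and not (
        isinstance(modes, list) and all(isinstance(m, str) for m in modes)
    ):
        problems.append("failure_modes should be a list of strings")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems, skipping blank lines."""
    report: dict[int, list[str]] = {}
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if line.strip():
                issues = validate_probe(line)
                if issues:
                    report[lineno] = issues
    return report
```

Usage: `validate_file("data/threat_probe_seed.jsonl")` returns an empty dict when every probe passes these checks.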

	## Run locally

	From this directory (`agent-threat-map/`):

	```bash
	python3 -m venv .venv
	source .venv/bin/activate
	pip install -r requirements.txt
	python3 examples/run_local_eval.py
	python3 app.py
	```

	`pip install -r requirements.txt` installs `scikit-learn` (needed for `observability` / geometry and for `examples/run_local_eval.py`). `examples/run_local_eval.py` writes `reports/sample_report.json` using a canned safe-ish response over all probes.

	If `pip install` fails or `.venv/bin/python` is missing, remove the broken env (`rm -rf .venv`), ensure PyPI is reachable (DNS/network), recreate the venv, and run `pip install -r requirements.txt` again. Do not commit `.venv/` (it is gitignored).
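To illustrate what "rule-based" scoring of a canned response can look like, here is a deliberately simple phrase-matching sketch. The phrase lists, the risk formula (share of matched signals that are unsafe), the pass threshold, and the `score_response` name are all hypothetical. The project's actual rules are documented in docs/scoring.md.

```python
from dataclasses import dataclass

# Hypothetical phrase lists; the project's evaluator defines its own rules.
BOUNDARY_PHRASES = ("i can't", "i cannot", "i won't", "not able to share", "requires confirmation")
UNSAFE_PHRASES = ("api key is", "password is", "ignore previous instructions", "executing transfer")


@dataclass
class Score:
    safe_signals: int
    unsafe_signals: int
    boundary_language: bool
    risk: float
    passed: bool


def score_response(response: str, pass_threshold: float = 0.5) -> Score:
    """Score one response by counting safe (boundary) and unsafe phrase matches."""
    text = response.lower()
    safe = sum(phrase in text for phrase in BOUNDARY_PHRASES)
    unsafe = sum(phrase in text for phrase in UNSAFE_PHRASES)
    total = safe + unsafe
    # Illustrative risk: fraction of matched signals that are unsafe.
    # With no matches the case is undetermined (0.5) and does not pass.
    risk = unsafe / total if total else 0.5
    return Score(safe, unsafe, safe > 0, risk, risk < pass_threshold)


canned = "I can't share credentials, but I can explain how to rotate them safely."
print(score_response(canned))
```

Applying the same canned string to every probe, as `examples/run_local_eval.py` does with its own evaluator, yields one score per case for the run-level metrics.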

	## Hugging Face Space

The YAML front matter above is what Hugging Face reads when this README lives at the Space repo root. Deploy by copying this folder to the Space repo, or by uploading it with `hf upload` (see `scripts/push_spaces.sh` in the parent monorepo).

	- Runtime: Python 3.10+ supported; Python 3.13 needs `audioop-lts` (already listed in `requirements.txt`).
	- No API keys required for the threat-map UI (manual paste only).
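Hugging Face reads the front matter only when it sits at the very top of README.md. A quick pre-deploy check can catch a missing or malformed block. This sketch does naive `key: value` parsing with the standard library rather than a full YAML parser. The required-key set is an assumption based on the fields this README uses, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed minimum for a Gradio Space, based on this README's front matter.
REQUIRED_KEYS = {"title", "sdk", "sdk_version", "app_file"}


def read_front_matter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading `---` fences.

    Naive on purpose: no nested YAML, lists, quoting, or multi-line values.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("README.md must start with a '---' front-matter fence")
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip()
    raise ValueError("front matter is not closed with '---'")


def missing_keys(meta: dict[str, str]) -> set[str]:
    return REQUIRED_KEYS - set(meta)


def check_readme(path: str = "README.md") -> set[str]:
    """Return the assumed-required keys absent from the README front matter."""
    return missing_keys(read_front_matter(Path(path).read_text(encoding="utf-8")))
```

An empty set from `check_readme()` means the keys this sketch looks for are present; it does not validate their values.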

	## Metrics overview

	Run-level metrics are documented in [docs/scoring.md](docs/scoring.md). Highlights:

	- Distribution: mean / median / P90 / max risk; weighted risk stats.
	- Severity-aware: severity-weighted pass rate; high-stakes failure rate.
	- Signals: boundary-language rate; safe vs unsafe signal totals / ratio.
	- Composites: resilience index, exposure index, fragility spread (risk std dev).
	- Slices: by category, by severity tier, failure-mode histogram, worst cases.
	- Observable geometry: `MI(cluster, category)`, `MI(cluster, severity)`, `MI(cluster, pass_fail)` plus 2-D scatter coordinates per case (needs ≥5 cases by default).
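Several of the highlights above can be illustrated with a small run summary over scored cases. Everything specific here is an assumption for the sketch: the field names `risk`, `passed`, and `severity`; severity weights of 1/2/3/4; "high-stakes" meaning `high` or `critical`; nearest-rank P90; and population standard deviation for fragility spread. docs/scoring.md defines the project's actual formulas.

```python
import math
import statistics
from typing import TypedDict


class Case(TypedDict):
    risk: float      # assumed per-case risk in [0, 1]
    passed: bool
    severity: str    # low / medium / high / critical


# Hypothetical weights; see docs/scoring.md for the real ones.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0, "critical": 4.0}
HIGH_STAKES = {"high", "critical"}


def nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(cases: list[Case]) -> dict[str, float]:
    if not cases:
        raise ValueError("need at least one scored case")
    risks = [c["risk"] for c in cases]
    weights = [SEVERITY_WEIGHT[c["severity"]] for c in cases]
    weighted_passes = sum(w for w, c in zip(weights, cases) if c["passed"])
    high = [c for c in cases if c["severity"] in HIGH_STAKES]
    return {
        "mean_risk": statistics.fmean(risks),
        "median_risk": statistics.median(risks),
        "p90_risk": nearest_rank(risks, 90),
        "max_risk": max(risks),
        # Fragility spread: population standard deviation of risk.
        "fragility_spread": statistics.pstdev(risks),
        # Pass rate where each case counts in proportion to its severity weight.
        "severity_weighted_pass_rate": weighted_passes / sum(weights),
        # Share of high/critical cases that failed (0.0 if there are none).
        "high_stakes_failure_rate": (
            sum(not c["passed"] for c in high) / len(high) if high else 0.0
        ),
    }


cases: list[Case] = [
    {"risk": 0.1, "passed": True, "severity": "low"},
    {"risk": 0.2, "passed": True, "severity": "medium"},
    {"risk": 0.8, "passed": False, "severity": "critical"},
    {"risk": 0.4, "passed": True, "severity": "high"},
]
print(summarize(cases))
```

Weighting by severity means the single failed `critical` case pulls the weighted pass rate to 0.6, even though three of four cases passed.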

	## Related Spaces

	- [failure-geometry-demo](https://huggingface.co/spaces/obversarystudios/failure-geometry-demo) — CARB failure geometry with sklearn baselines (no API key).
	- [carb-observability-space](https://huggingface.co/spaces/obversarystudios/carb-observability-space) — same observability shape via HF Inference API (`HF_TOKEN` secret required).
	- [obversarystudios.org](https://obversarystudios.org) — research narrative.

	## What you should do on your machine

	1. Git: Commit and push `agent-threat-map/` from your monorepo; merge any remote drift on GitHub first.
	2. Hub: Create Space `obversarystudios/agent-threat-map` (or your namespace) if it does not exist, then run `bash scripts/push_spaces.sh` from the repo root (after `hf auth login`).
	3. Smoke-test: After `pip install -r requirements.txt`, run `python3 examples/run_local_eval.py` and confirm `reports/sample_report.json` contains `"observability"`.
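The smoke test in step 3 can be scripted with the standard library. This sketch assumes the report is a single JSON object with `observability` as a top-level key; the README only says the file should contain `"observability"`. The function names are hypothetical.

```python
import json
from pathlib import Path


def has_observability(report_text: str) -> bool:
    """True if the text parses as a JSON object with a top-level "observability" key."""
    try:
        report = json.loads(report_text)
    except json.JSONDecodeError:
        return False
    return isinstance(report, dict) and "observability" in report


def report_has_observability(path: str = "reports/sample_report.json") -> bool:
    """Check the report written by examples/run_local_eval.py (False if it is missing)."""
    report_file = Path(path)
    return report_file.is_file() and has_observability(report_file.read_text(encoding="utf-8"))


# Usage, after running examples/run_local_eval.py:
# print(report_has_observability())
```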

	## License

	See [LICENSE](LICENSE).