---
title: Agent Threat Map Observatory
emoji: 🧭
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Threat-map benchmark with metrics and geometry
---

# Agent Threat Map

Agent Threat Map is a research benchmark and observability scaffold for mapping fragile behavior in model-agent systems.

Instead of asking only whether a model answered correctly, this project asks **where and how** a model breaks under agent-like pressure.

The benchmark focuses on:

- Prompt injection
- Tool-output injection
- Retrieval poisoning
- Memory poisoning
- Secret exfiltration
- Unauthorized action

The goal is to make hidden model-agent failure modes **visible**, **structured**, and easier to compare across runs.

## What this is

- A JSONL probe set (`data/threat_probe_seed.jsonl`).
- A rule-based evaluator and **expanded run metrics** (distributions, severity weighting, composite indices, failure-mode histograms, per-category rollups).
- **Observable geometry** (same *family* as `failure-geometry-demo`): TF-IDF + SVD embeddings of scored cases → KMeans → mutual information vs category, severity, and pass/fail (`agent_threat_map/observability.py`); a minimal sketch of the pipeline follows this list. Results appear as `observability` in exported JSON and in the Gradio **Observable geometry** tab.
- A Gradio Space UI (`app.py`) for scoring pasted responses and exporting JSON reports.
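
The geometry step can be reproduced with plain scikit-learn. A minimal sketch, assuming each scored case carries a response text plus `category`, `severity`, and `pass_fail` labels (the field names here are illustrative, not the exact API of `agent_threat_map/observability.py`):

```python
# Sketch of the observable-geometry pipeline: TF-IDF + SVD embeddings
# of scored cases -> KMeans clusters -> mutual information vs labels.
# Field names are illustrative; see agent_threat_map/observability.py.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mutual_info_score

def geometry_summary(cases, n_clusters=4):
    texts = [c["response_text"] for c in cases]      # scored responses
    tfidf = TfidfVectorizer().fit_transform(texts)   # sparse TF-IDF matrix
    coords = TruncatedSVD(n_components=2).fit_transform(tfidf)  # 2-D scatter
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(coords)
    return {
        "coords": coords.tolist(),
        "mi_category": mutual_info_score([c["category"] for c in cases], clusters),
        "mi_severity": mutual_info_score([c["severity"] for c in cases], clusters),
        "mi_pass_fail": mutual_info_score([c["pass_fail"] for c in cases], clusters),
    }
```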

## What this is not

This is **not** a production vulnerability scanner, security certification system, or complete red-team suite. See [docs/limitations.md](docs/limitations.md).

## Dataset schema (JSONL)

Each line is one probe:

| Field | Description |
| --- | --- |
| `id` | Stable identifier |
| `category` | One of the six families above |
| `fragility_area` | Finer-grained label |
| `severity` | `low` / `medium` / `high` / `critical` |
| `task` | What the agent is asked to do |
| `input` | User/tool/context text (may contain adversarial content) |
| `expected_behavior` | What safe handling looks like |
| `failure_modes` | Strings describing ways the run can go wrong |
| `observable_signal` | What a reviewer should look for |
| `safe_response_pattern` | Optional hints for benign completions |
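
A minimal sketch for loading the probes, assuming only the fields in the table above:

```python
import json

# Each line of the seed file is one probe with the schema described above.
with open("data/threat_probe_seed.jsonl", encoding="utf-8") as f:
    probes = [json.loads(line) for line in f if line.strip()]

# Quick look at a few records.
for p in probes[:3]:
    print(p["id"], p["category"], p["severity"], "-", p["task"])
```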

Regenerate the seed file (60 probes) with:

```bash
python3 scripts/generate_threat_seed.py
```

## Run locally

From this directory (`agent-threat-map/`):

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3 examples/run_local_eval.py
python3 app.py
```

`pip install -r requirements.txt` installs **`scikit-learn`** (needed for `observability` / geometry and for `examples/run_local_eval.py`). `examples/run_local_eval.py` writes `reports/sample_report.json` using a canned safe-ish response over all probes.
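
To confirm the geometry results landed in the export, a quick check (this mirrors the smoke test described later in this README):

```python
from pathlib import Path

# Smoke-test: the exported report should mention the observability results.
text = Path("reports/sample_report.json").read_text(encoding="utf-8")
print("observability present:", '"observability"' in text)
```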

If **`pip install` fails** or **`.venv/bin/python` is missing**, remove the broken env (`rm -rf .venv`), ensure **PyPI is reachable** (DNS/network), recreate the venv, and run `pip install -r requirements.txt` again. Do not commit `.venv/` (it is gitignored).

## Hugging Face Space

The YAML **front matter above** is what Hugging Face reads when this README lives at the **Space repo root**. Deploy by copying this folder to the Space (or `hf upload` — see `scripts/push_spaces.sh` in the parent monorepo).

- **Runtime:** Python 3.10+ supported; **Python 3.13** needs `audioop-lts` (already listed in `requirements.txt`).
- **No API keys** required for the threat-map UI (manual paste only).

## Metrics overview

Run-level metrics are documented in [docs/scoring.md](docs/scoring.md). Highlights:

- **Distribution:** mean / median / P90 / max risk; weighted risk stats.
- **Severity-aware:** severity-weighted pass rate; high-stakes failure rate (illustrative sketch after this list).
- **Signals:** boundary-language rate; safe vs unsafe signal totals / ratio.
- **Composites:** resilience index, exposure index, fragility spread (risk std dev).
- **Slices:** by category, by severity tier, failure-mode histogram, worst cases.
- **Observable geometry:** `MI(cluster, category)`, `MI(cluster, severity)`, `MI(cluster, pass_fail)` plus 2-D scatter coordinates per case (needs ≥5 cases by default).
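
As one illustration of the severity-aware metrics, here is a plausible severity-weighted pass rate. The weight table is an assumption made for this sketch; the authoritative definitions are in [docs/scoring.md](docs/scoring.md):

```python
# Illustrative severity-weighted pass rate. The weights below are an
# assumption for this sketch; docs/scoring.md holds the real definitions.
SEVERITY_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 4.0, "critical": 8.0}

def severity_weighted_pass_rate(cases):
    """cases: dicts with a 'severity' tier and a boolean 'passed' flag."""
    total = sum(SEVERITY_WEIGHTS[c["severity"]] for c in cases)
    passed = sum(SEVERITY_WEIGHTS[c["severity"]] for c in cases if c["passed"])
    return passed / total if total else 0.0
```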

## Related Spaces

- **[failure-geometry-demo](https://huggingface.co/spaces/obversarystudios/failure-geometry-demo)** — CARB failure geometry with sklearn baselines (no API key).
- **[carb-observability-space](https://huggingface.co/spaces/obversarystudios/carb-observability-space)** — same observability shape via HF Inference API (`HF_TOKEN` secret required).
- **[obversarystudios.org](https://obversarystudios.org)** — research narrative.

## What you should do on your machine

1. **Git:** Commit and push `agent-threat-map/` from your monorepo; merge any remote drift on GitHub first.
2. **Hub:** Create Space `obversarystudios/agent-threat-map` (or your namespace) if it does not exist, then run `bash scripts/push_spaces.sh` from the repo root (after `hf auth login`).
3. **Smoke-test:** After `pip install -r requirements.txt`, run `python3 examples/run_local_eval.py` and confirm `reports/sample_report.json` contains `"observability"`.

## License

See [LICENSE](LICENSE).