agent-threat-map / README.md
obversarystudios's picture
Threat-map metrics + observable geometry (embed/cluster/MI)
6c3043e verified

A newer version of the Gradio SDK is available: 6.14.0

Upgrade
metadata
title: Agent Threat Map Observatory
emoji: 🧭
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Threat-map benchmark with metrics and geometry

Agent Threat Map

Agent Threat Map is a research benchmark and observability scaffold for mapping fragile behavior in model-agent systems.

Instead of asking only whether a model answered correctly, this project asks where and how a model breaks under agent-like pressure.

The benchmark focuses on:

  • Prompt injection
  • Tool-output injection
  • Retrieval poisoning
  • Memory poisoning
  • Secret exfiltration
  • Unauthorized action

The goal is to make hidden model-agent failure modes visible, structured, and easier to compare across runs.

What this is

  • A JSONL probe set (data/threat_probe_seed.jsonl).
  • A rule-based evaluator and expanded run metrics (distributions, severity weighting, composite indices, failure-mode histograms, per-category rollups).
  • Observable geometry (same family as failure-geometry-demo): TF-IDF + SVD embeddings of scored cases → KMeans → mutual information vs category, severity, and pass/fail (agent_threat_map/observability.py). Results appear as observability in exported JSON and in the Gradio Observable geometry tab.
  • A Gradio Space UI (app.py) for scoring pasted responses and exporting JSON reports.

What this is not

This is not a production vulnerability scanner, security certification system, or complete red-team suite. See docs/limitations.md.

Dataset schema (JSONL)

Each line is one probe:

Field Description
id Stable identifier
category One of the six families above
fragility_area Finer-grained label
severity low / medium / high / critical
task What the agent is asked to do
input User/tool/context text (may contain adversarial content)
expected_behavior What safe handling looks like
failure_modes Strings describing ways the run can go wrong
observable_signal What a reviewer should look for
safe_response_pattern Optional hints for benign completions

Regenerate the seed file (60 probes) with:

python3 scripts/generate_threat_seed.py

Run locally

From this directory (agent-threat-map/):

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3 examples/run_local_eval.py
python3 app.py

pip install -r requirements.txt installs scikit-learn (needed for observability / geometry and for examples/run_local_eval.py). examples/run_local_eval.py writes reports/sample_report.json using a canned safe-ish response over all probes.

If pip install fails or .venv/bin/python is missing, remove the broken env (rm -rf .venv), ensure PyPI is reachable (DNS/network), recreate the venv, and run pip install -r requirements.txt again. Do not commit .venv/ (it is gitignored).

Hugging Face Space

The YAML front matter above is what Hugging Face reads when this README lives at the Space repo root. Deploy by copying this folder to the Space (or hf upload — see scripts/push_spaces.sh in the parent monorepo).

  • Runtime: Python 3.10+ supported; Python 3.13 needs audioop-lts (already listed in requirements.txt).
  • No API keys required for the threat-map UI (manual paste only).

Metrics overview

Run-level metrics are documented in docs/scoring.md. Highlights:

  • Distribution: mean / median / P90 / max risk; weighted risk stats.
  • Severity-aware: severity-weighted pass rate; high-stakes failure rate.
  • Signals: boundary-language rate; safe vs unsafe signal totals / ratio.
  • Composites: resilience index, exposure index, fragility spread (risk std dev).
  • Slices: by category, by severity tier, failure-mode histogram, worst cases.
  • Observable geometry: MI(cluster, category), MI(cluster, severity), MI(cluster, pass_fail) plus 2-D scatter coordinates per case (needs ≥5 cases by default).

Related Spaces

What you should do on your machine

  1. Git: Commit and push agent-threat-map/ from your monorepo; merge any remote drift on GitHub first.
  2. Hub: Create Space obversarystudios/agent-threat-map (or your namespace) if it does not exist, then run bash scripts/push_spaces.sh from the repo root (after hf auth login).
  3. Smoke-test: After pip install -r requirements.txt, run python3 examples/run_local_eval.py and confirm reports/sample_report.json contains "observability".

License

See LICENSE.