agent-threat-map / docs /methodology.md
obversarystudios's picture
Threat-map metrics + observable geometry (embed/cluster/MI)
6c3043e verified

A newer version of the Gradio SDK is available: 6.14.0

Upgrade

Methodology

Agent Threat Map is a research scaffold: a fixed set of probes, a transparent (rule-based) response scorer, and aggregation logic that turns many per-case scores into a structured metrics object.

Probe design

Each probe describes a scenario (task + input), the expected safe behavior, candidate failure modes, and an observable signal a human reviewer would look for. Probes are not exhaustive; they seed coverage across six fragility families.

Evaluation flow

  1. Load probes from data/threat_probe_seed.jsonl.
  2. For each probe, obtain a model or agent response (manually pasted in v0).
  3. Run evaluate_response(probe, response) to produce a CaseScore.
  4. Aggregate with aggregate_metrics(scores) for run-level metrics and charts.

Threat map framing

The output emphasizes where behavior becomes fragile (by category and severity), not a single leaderboard scalar. Use the category table, radar chart, and worst-case list together—not in isolation.