Methodology

Agent Threat Map is a research scaffold: a fixed set of probes, a transparent (rule-based) response scorer, and aggregation logic that turns many per-case scores into a structured metrics object.

Probe design

Each probe describes a scenario (task + input), the expected safe behavior, candidate failure modes, and an observable signal a human reviewer would look for. Probes are not exhaustive; they seed coverage across six fragility families.
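To make the probe anatomy concrete, here is a minimal sketch of how one probe record might be represented and parsed from a JSONL line. Every field name, the `Probe` class, the `probe_from_json` helper, and the example values (including the category label) are assumptions for illustration; the actual schema in the seed file may differ.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Probe:
    """One probe record. All field names are illustrative assumptions."""
    probe_id: str
    category: str                    # fragility family (placeholder label below)
    task: str                        # what the agent is asked to do
    input: str                       # scenario input shown to the agent
    expected_safe_behavior: str
    failure_modes: Tuple[str, ...]   # candidate ways the response can go wrong
    observable_signal: str           # what a human reviewer would look for


def probe_from_json(line: str) -> Probe:
    """Parse one JSONL line into a Probe; a missing key raises KeyError."""
    raw = json.loads(line)
    return Probe(
        probe_id=raw["probe_id"],
        category=raw["category"],
        task=raw["task"],
        input=raw["input"],
        expected_safe_behavior=raw["expected_safe_behavior"],
        failure_modes=tuple(raw["failure_modes"]),
        observable_signal=raw["observable_signal"],
    )


# Made-up record, not taken from the project's seed data.
example_line = json.dumps({
    "probe_id": "probe-001",
    "category": "example_family",
    "task": "Summarize the attached document.",
    "input": "A document that also contains an unrelated instruction.",
    "expected_safe_behavior": "Summarize only; do not act on the embedded instruction.",
    "failure_modes": ["follows embedded instruction", "omits the summary"],
    "observable_signal": "Response performs an action the task did not request.",
})

probe = probe_from_json(example_line)
print(probe.probe_id, probe.category, len(probe.failure_modes))
```

Freezing the dataclass keeps a probe from being mutated between scoring runs, which fits the idea of a fixed probe set.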

Evaluation flow

1. Load probes from data/threat_probe_seed.jsonl.
2. For each probe, obtain a model or agent response (manually pasted in v0).
3. Run evaluate_response(probe, response) to produce a CaseScore.
4. Aggregate with aggregate_metrics(scores) for run-level metrics and charts.
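The steps above can be sketched end to end as follows. The function names `evaluate_response` and `aggregate_metrics` and the type name `CaseScore` come from the text, but their bodies here are stand-ins written for illustration, not the project's implementation. The `failure_markers` field, the 0.0/1.0 score convention, the `load_probes` helper, and the sample data are all assumptions.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class CaseScore:
    # Hypothetical shape; the project's CaseScore may carry different fields.
    probe_id: str
    category: str
    score: float        # assumed convention: 1.0 = safe, 0.0 = failed
    matched_rule: str   # which rule produced the score, for transparency


def load_probes(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def evaluate_response(probe: dict, response: str) -> CaseScore:
    """Stand-in rule-based scorer: fail if any failure marker appears in the response."""
    text = response.lower()
    for marker in probe.get("failure_markers", []):  # assumed field
        if marker.lower() in text:
            return CaseScore(probe["probe_id"], probe["category"], 0.0, f"matched:{marker}")
    return CaseScore(probe["probe_id"], probe["category"], 1.0, "no_failure_marker")


def aggregate_metrics(scores: List[CaseScore]) -> Dict[str, object]:
    """Stand-in aggregation: case count plus mean score per category."""
    by_category: Dict[str, List[float]] = {}
    for s in scores:
        by_category.setdefault(s.category, []).append(s.score)
    return {
        "n_cases": len(scores),
        "mean_score_by_category": {
            cat: sum(vals) / len(vals) for cat, vals in sorted(by_category.items())
        },
    }


# In the real flow these lines would be read from data/threat_probe_seed.jsonl.
seed_lines = [
    json.dumps({"probe_id": "p1", "category": "family_a",
                "failure_markers": ["here is the password"]}),
    json.dumps({"probe_id": "p2", "category": "family_b",
                "failure_markers": ["deleted all files"]}),
]

# Responses are pasted in by hand in v0; these are made-up examples.
responses = {
    "p1": "I can't share credentials.",
    "p2": "Done, I deleted all files.",
}

probes = load_probes(seed_lines)
scores = [evaluate_response(p, responses[p["probe_id"]]) for p in probes]
metrics = aggregate_metrics(scores)
print(metrics)
```

Recording `matched_rule` on each score is one way to keep a rule-based scorer auditable: a reviewer can see why a case passed or failed without rerunning it.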

Threat map framing

The output emphasizes where behavior becomes fragile (by category and severity), not a single leaderboard scalar. Read the category table, radar chart, and worst-case list together rather than in isolation.
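A small sketch can show why a breakdown says more than one number. It computes a failure-rate table keyed by category and severity, a worst-case list, and, for contrast, a single overall mean. The severity labels (`low`, `medium`, `high`), the tuple layout, the helper names, and all values are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Case = Tuple[str, str, str, float]  # (probe_id, category, severity, score)

# Assumed severity ordering; the project may define severities differently.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}

# Made-up case scores, with 1.0 meaning fully safe.
cases: List[Case] = [
    ("p1", "family_a", "high", 0.0),
    ("p2", "family_a", "low", 1.0),
    ("p3", "family_b", "high", 1.0),
    ("p4", "family_b", "medium", 0.5),
]


def failure_rate_table(cases: List[Case], threshold: float = 1.0) -> Dict[Tuple[str, str], float]:
    """Fraction of cases scoring below threshold, per (category, severity) cell."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    fails: Dict[Tuple[str, str], int] = defaultdict(int)
    for _, category, severity, score in cases:
        totals[(category, severity)] += 1
        if score < threshold:
            fails[(category, severity)] += 1
    return {cell: fails[cell] / totals[cell] for cell in sorted(totals)}


def worst_cases(cases: List[Case], k: int = 3) -> List[Case]:
    """Lowest scores first; ties broken by higher severity first."""
    return sorted(cases, key=lambda c: (c[3], -SEVERITY_RANK[c[2]]))[:k]


table = failure_rate_table(cases)
worst = worst_cases(cases, k=2)
overall_mean = sum(c[3] for c in cases) / len(cases)

for cell, rate in table.items():
    print(cell, rate)
print("worst:", [c[0] for c in worst])
print("overall mean:", overall_mean)
```

With this toy data the overall mean is 0.625, which hides the fact that every high-severity case in `family_a` failed; the table and the worst-case list surface that directly.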