mindbomber/aana

@@ -30,6 +30,7 @@ datasets:
 - mindbomber/aana-head-to-head-permissive-vs-aana
 - mindbomber/aana-head-to-head-single-classifier-vs-aana
 - mindbomber/aana-head-to-head-prompt-policy-vs-aana
+- mindbomber/aana-head-to-head-llm-judge-vs-aana
 metrics:
 - accuracy
 - f_beta
@@ -652,6 +653,35 @@ classifier, but still misses unsafe rows and over-blocks many safe rows. AANA
 improves unsafe recall, block precision, and safe allow in this run by using the
 typed contract and hard-blocker route surface.
+### Head-to-Head: LLM-as-Judge Safety Checker vs AANA
+Public validation artifact:
+https://huggingface.co/datasets/mindbomber/aana-head-to-head-llm-judge-vs-aana
+Source dataset:
+https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
+Rows:
+`360` external trace rows with moderate noisy-evidence stressors
+LLM judge:
+`gpt-4o-mini`
+Status:
+head-to-head architecture diagnostic, policy-derived labels, not an official
+leaderboard
+| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| LLM-as-judge safety checker | `73.33%` | `100.00%` | `65.22%` | `46.67%` | `0.00%` | `96` | `0` |
+| AANA schema gate | `92.78%` | `100.00%` | `87.38%` | `85.56%` | `0.00%` | `26` | `0` |
+The live LLM-as-judge baseline is conservative: it blocks all unsafe rows, but
+also blocks many safe identity lookup and authenticated/private-read calls when
+the evidence is noisy or flattened. AANA preserves the same unsafe recall while
+allowing substantially more safe calls by using explicit tool category,
+authorization state, evidence refs, schema validation, and hard blockers.
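
The percentages in the table above can be cross-checked from the reported false-positive and false-negative counts. The sketch below is illustrative only and is not part of the AANA evaluation code. It assumes the `360` rows split into 180 unsafe and 180 safe rows, which is inferred from the reported block precision and is not stated in the source; the names `Confusion`, `metrics`, `LLM_JUDGE`, and `AANA_GATE` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts where "positive" means the call was blocked."""
    tp: int  # unsafe rows blocked
    fp: int  # safe rows blocked (false positives)
    tn: int  # safe rows allowed
    fn: int  # unsafe rows allowed (false negatives)


def metrics(c: Confusion) -> dict[str, float]:
    """Return the table's metrics as fractions in [0, 1]."""
    total = c.tp + c.fp + c.tn + c.fn
    unsafe = c.tp + c.fn
    safe = c.tn + c.fp
    return {
        "accuracy": (c.tp + c.tn) / total,
        "unsafe_recall": c.tp / unsafe,
        "block_precision": c.tp / (c.tp + c.fp),
        "safe_allow": c.tn / safe,
        "unsafe_accept": c.fn / unsafe,
    }


# ASSUMPTION: 180 unsafe + 180 safe rows, inferred from the reported numbers.
# Only the FP and FN counts (96/0 and 26/0) come directly from the table.
LLM_JUDGE = Confusion(tp=180, fp=96, tn=180 - 96, fn=0)
AANA_GATE = Confusion(tp=180, fp=26, tn=180 - 26, fn=0)

if __name__ == "__main__":
    for name, counts in [("LLM-as-judge", LLM_JUDGE), ("AANA schema gate", AANA_GATE)]:
        row = {k: f"{v * 100:.2f}%" for k, v in metrics(counts).items()}
        print(name, row)
```

Under that assumed split, the computed values agree with every percentage in the table. Both architectures block all unsafe rows, so the accuracy gap comes entirely from safe rows that were blocked.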
 ### PIIMB: Presidio + AANA
 Official PIIMB submission: