@@ -32,6 +32,7 @@ datasets:
 - mindbomber/aana-head-to-head-prompt-policy-vs-aana
 - mindbomber/aana-head-to-head-llm-judge-vs-aana
 - mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
+- mindbomber/aana-external-validity-hermes-head-to-head
 metrics:
 - accuracy
 - f_beta
@@ -711,6 +712,39 @@ support them, preserves true missing-authorization stressors, and corrects the
 runtime route before final gating. The recovery pass does not read expected
 labels, but the trace features are produced by the included transform scripts.
+### External Validity: Hermes Function-Calling Head-to-Head
+Public validation artifact:
+https://huggingface.co/datasets/mindbomber/aana-external-validity-hermes-head-to-head
+Second source dataset:
+https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1
+Rows:
+`360` transformed Hermes function-calling rows with moderate noisy-evidence
+stressors
+Status:
+second-source architecture diagnostic with policy-derived labels; not an
+official leaderboard submission
+| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| Permissive agent | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `100.00%` | `0` | `180` |
+| Single classifier | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `0.00%` | `180` | `0` |
+| Prompt-only policy guardrail | `93.06%` | `97.22%` | `89.74%` | `88.89%` | `2.78%` | `20` | `5` |
+| LLM-as-judge safety checker | `85.28%` | `99.44%` | `77.49%` | `71.11%` | `0.56%` | `52` | `1` |
+| Structured contract gate without recovery | `92.22%` | `100.00%` | `86.54%` | `84.44%` | `0.00%` | `28` | `0` |
+| AANA with evidence recovery | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `0.00%` | `0` | `0` |
+This run broadens source diversity by using an independent function-calling
+dataset with different domains, schemas, and conversation formats. It does not
+provide human-reviewed safety labels: both the labels and the counterfactual
+missing-authorization rows are generated by the included transform scripts. The
+main replicated pattern is that AANA's evidence-recovery loop keeps unsafe
+recall at `100.00%` while raising safe allow to `100.00%`, above the
+prompt-only guardrail (`88.89%`), the static contract gate (`84.44%`), the LLM
+judge (`71.11%`), and the single classifier (`0.00%`).
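The rate columns in the table can be cross-checked against its two count columns. The sketch below is not part of the included transform scripts. It assumes the 360 rows split into 180 unsafe and 180 safe rows, which is inferred from the permissive-agent and single-classifier baselines, and it assumes a false positive is a blocked safe row and a false negative is an accepted unsafe row. The names `GateMetrics` and `gate_metrics` are hypothetical.

```python
from dataclasses import dataclass

# Assumption: 180 unsafe and 180 safe rows (360 total), inferred from the
# baseline rows of the table rather than stated in the README.
N_UNSAFE = 180
N_SAFE = 180


@dataclass
class GateMetrics:
    accuracy: float
    unsafe_recall: float
    block_precision: float
    safe_allow: float
    unsafe_accept: float


def gate_metrics(false_positives: int, false_negatives: int) -> GateMetrics:
    """Derive the percentage columns from the two count columns.

    false_positives: safe rows that were blocked (assumed meaning).
    false_negatives: unsafe rows that were accepted (assumed meaning).
    """
    true_blocks = N_UNSAFE - false_negatives   # unsafe rows correctly blocked
    true_allows = N_SAFE - false_positives     # safe rows correctly allowed
    total_blocks = true_blocks + false_positives
    # Precision is undefined when nothing is blocked; report 0.0 in that case,
    # which matches the permissive-agent row.
    precision = true_blocks / total_blocks if total_blocks else 0.0
    return GateMetrics(
        accuracy=round(100 * (true_blocks + true_allows) / (N_UNSAFE + N_SAFE), 2),
        unsafe_recall=round(100 * true_blocks / N_UNSAFE, 2),
        block_precision=round(100 * precision, 2),
        safe_allow=round(100 * true_allows / N_SAFE, 2),
        unsafe_accept=round(100 * false_negatives / N_UNSAFE, 2),
    )


if __name__ == "__main__":
    # Counts copied from the table: (false positives, false negatives).
    rows = {
        "Prompt-only policy guardrail": (20, 5),
        "LLM-as-judge safety checker": (52, 1),
        "Structured contract gate without recovery": (28, 0),
        "AANA with evidence recovery": (0, 0),
    }
    for name, (fp, fn) in rows.items():
        print(name, gate_metrics(fp, fn))
```

Under these assumptions the derived percentages agree with every row of the table, so the rate columns carry no information beyond the two counts and the class split.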
 ### PIIMB: Presidio + AANA
 Official PIIMB submission: