Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -31,6 +31,7 @@ datasets:
|
|
| 31 |
- mindbomber/aana-head-to-head-single-classifier-vs-aana
|
| 32 |
- mindbomber/aana-head-to-head-prompt-policy-vs-aana
|
| 33 |
- mindbomber/aana-head-to-head-llm-judge-vs-aana
|
|
|
|
| 34 |
metrics:
|
| 35 |
- accuracy
|
| 36 |
- f_beta
|
|
@@ -682,6 +683,34 @@ the evidence is noisy or flattened. AANA preserves the same unsafe recall while
|
|
| 682 |
allowing substantially more safe calls by using explicit tool category,
|
| 683 |
authorization state, evidence refs, schema validation, and hard blockers.
|
| 684 |
|
| 685 |
### PIIMB: Presidio + AANA
|
| 686 |
|
| 687 |
Official PIIMB submission:
|
|
|
|
| 31 |
- mindbomber/aana-head-to-head-single-classifier-vs-aana
|
| 32 |
- mindbomber/aana-head-to-head-prompt-policy-vs-aana
|
| 33 |
- mindbomber/aana-head-to-head-llm-judge-vs-aana
|
| 34 |
+
- mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
|
| 35 |
metrics:
|
| 36 |
- accuracy
|
| 37 |
- f_beta
|
|
|
|
| 683 |
allowing substantially more safe calls by using explicit tool category,
|
| 684 |
authorization state, evidence refs, schema validation, and hard blockers.
|
| 685 |
|
| 686 |
+
### Head-to-Head: Contract Gate Without Recovery vs AANA
|
| 687 |
+
|
| 688 |
+
Public validation artifact:
|
| 689 |
+
https://huggingface.co/datasets/mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
|
| 690 |
+
|
| 691 |
+
Source dataset:
|
| 692 |
+
https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
|
| 693 |
+
|
| 694 |
+
Rows:
|
| 695 |
+
`360` external trace rows with moderate noisy-evidence stressors
|
| 696 |
+
|
| 697 |
+
Status:
|
| 698 |
+
head-to-head architecture diagnostic, policy-derived labels, not an official
|
| 699 |
+
leaderboard
|
| 700 |
+
|
| 701 |
+
| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
|
| 702 |
+
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|
| 703 |
+
| Structured contract gate without recovery | `92.78%` | `100.00%` | `87.38%` | `85.56%` | `0.00%` | `26` | `0` |
|
| 704 |
+
| AANA with evidence recovery | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `0.00%` | `0` | `0` |
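The percentages in the table can be reproduced from raw confusion counts. The sketch below is illustrative only: it assumes a split of 180 unsafe and 180 safe rows, which is inferred from the reported 360 rows, 26 false positives, and 85.56% safe-allow rate rather than stated in the README. The function name `gate_metrics` is hypothetical, and "positive" here means an unsafe call that should be blocked.

```python
def gate_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's metrics from confusion counts.

    tp: unsafe calls blocked      fp: safe calls blocked
    tn: safe calls allowed        fn: unsafe calls accepted
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "unsafe_recall": tp / (tp + fn),
        "block_precision": tp / (tp + fp),
        "safe_allow": tn / (tn + fp),
        "unsafe_accept": fn / (tp + fn),
    }


# Assumption: 180 unsafe + 180 safe rows (inferred, not stated in the source).
contract_gate = gate_metrics(tp=180, fp=26, tn=154, fn=0)
aana = gate_metrics(tp=180, fp=0, tn=180, fn=0)

for name, metrics in (("contract gate", contract_gate), ("AANA", aana)):
    print(name, {key: f"{value:.2%}" for key, value in metrics.items()})
```

Under that assumed split, the 26 false positives account for the whole gap between the two rows: accuracy, block precision, and safe allow all drop, while unsafe recall and unsafe accept are unaffected.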
|
| 705 |
+
|
| 706 |
+
The bare contract gate consumes the noisy emitted event as-is. AANA adds a
|
| 707 |
+
correction/evidence-recovery pass that reconstructs recoverable auth,
|
| 708 |
+
validation, and confirmation evidence from source trace features, removes
|
| 709 |
+
injected noisy missing-authorization refs when the source trace does not
|
| 710 |
+
support them, preserves true missing-authorization stressors, and corrects the
|
| 711 |
+
runtime route before final gating. The recovery pass does not read expected
|
| 712 |
+
labels, but the trace features are produced by the included transform scripts.
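The recovery behavior described above can be sketched roughly as follows. This is not the AANA implementation: the ref names (`auth`, `validation`, `confirmation`, `missing_authorization`), the dict-based trace features, and the functions `gate` and `recover` are all hypothetical stand-ins chosen for illustration. The sketch only shows the shape of the idea, namely repairing evidence from the source trace before gating, without consulting expected labels.

```python
# Hypothetical evidence refs; the real schema is not given in this README.
REQUIRED = frozenset({"auth", "validation", "confirmation"})
MISSING_AUTH = "missing_authorization"


def gate(refs: frozenset) -> str:
    """Bare contract gate: block on a missing-auth ref or incomplete evidence."""
    if MISSING_AUTH in refs or not REQUIRED <= refs:
        return "block"
    return "accept"


def recover(refs: frozenset, trace: dict) -> frozenset:
    """Repair noisy evidence refs using source trace features only."""
    recovered = set(refs)
    # Reconstruct evidence the source trace supports but the event dropped.
    for name in REQUIRED:
        if trace.get(name, False):
            recovered.add(name)
    # Drop a missing-auth ref only when the trace shows authorization;
    # otherwise it is a true stressor and is preserved.
    if trace.get("auth", False):
        recovered.discard(MISSING_AUTH)
    return frozenset(recovered)


noisy_event = frozenset({"validation", MISSING_AUTH})
authorized_trace = {"auth": True, "validation": True, "confirmation": True}
unauthorized_trace = {"auth": False, "validation": True, "confirmation": True}

print(gate(noisy_event))                                # bare gate
print(gate(recover(noisy_event, authorized_trace)))     # noise removed
print(gate(recover(noisy_event, unauthorized_trace)))   # stressor preserved
```

In this toy setup the bare gate blocks the noisy event, the recovered event is accepted when the trace supports authorization, and the event stays blocked when the trace does not.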
|
| 713 |
+
|
| 714 |
### PIIMB: Presidio + AANA
|
| 715 |
|
| 716 |
Official PIIMB submission:
|