mindbomber/aana

@@ -1,960 +1,89 @@
----
-license: mit
-tags:
-- aana
-- alignment
-- ai-safety
-- llm-evaluation
-- verifier
-- correction-loop
-- guardrails
-- agent-safety
-- pii
-- piimb
-datasets:
-- piimb/pii-masking-benchmark
-- truthfulqa/truthful_qa
-- wandb/RAGTruth-processed
-- PatronusAI/HaluBench
-- potsawee/wiki_bio_gpt3_hallucination
-- mindbomber/aana-cross-domain-action-gate-v2-tuned
-- mindbomber/aana-cross-domain-action-gate-v2-all-domains-tuned
-- mindbomber/aana-cross-domain-action-gate-blind-v3
-- mindbomber/aana-cross-domain-action-gate-blind-v4
-- mindbomber/aana-cross-domain-action-gate-blind-v5
-- mindbomber/aana-cross-domain-action-taxonomy-model-v5
-- mindbomber/aana-external-agent-trace-action-gate
-- mindbomber/aana-external-agent-trace-action-gate-v2
-- mindbomber/aana-agent-tool-contract-v1
-- mindbomber/aana-external-agent-trace-noisy-evidence
-- mindbomber/aana-head-to-head-permissive-vs-aana
-- mindbomber/aana-head-to-head-single-classifier-vs-aana
-- mindbomber/aana-head-to-head-prompt-policy-vs-aana
-- mindbomber/aana-head-to-head-llm-judge-vs-aana
-- mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
-- mindbomber/aana-external-validity-hermes-head-to-head
-- mindbomber/aana-tau2-bench-gpt41mini-1trial
-metrics:
-- accuracy
-- f_beta
-library_name: aana
-pipeline_tag: text-classification
----
-# Alignment-Aware Neural Architecture (AANA)
-AANA is a verifier-grounded runtime architecture for making AI and agent outputs
-more correctable before they are published, sent, deployed, or used for
-consequential actions.
-It is not a standalone set of neural weights. AANA wraps a base generator or
-specialist detector with explicit verifier, grounding, correction, and gate
-components:
-```text
-S = (f_theta, E_phi, R, Pi_psi, G)
-```
-- `f_theta`: base generator, LLM, agent, tool planner, or specialist detector.
-- `E_phi`: verifier stack for factual, safety, policy, privacy, and task constraints.
-- `R`: retrieval or grounding module for evidence.
-- `Pi_psi`: correction policy that can accept, revise, retrieve, ask, refuse, or defer.
-- `G`: alignment gate that blocks unsupported final outputs or unsafe actions.
-The goal is not to claim perfect alignment. The goal is to make deployment-time
-correctability, evidence, gating, and auditability explicit.
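A minimal sketch of how the five components could compose at runtime. Everything in it is illustrative: `run_aana`, the `Verdict` type, the callable signatures, the `max_rounds` cap, and the toy verifier are assumptions made for this example, not the AANA package API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    route: str            # accept, revise, retrieve, ask, refuse, or defer
    blockers: List[str]   # hard blockers reported by the verifier stack


def run_aana(prompt, generate, verify, retrieve, correct, max_rounds=3):
    """Hypothetical loop over S = (f_theta, E_phi, R, Pi_psi, G)."""
    evidence = retrieve(prompt)                    # R: gather grounding evidence
    candidate = generate(prompt, evidence)         # f_theta: base output
    for _ in range(max_rounds):
        verdict = verify(candidate, evidence)      # E_phi: check constraints
        if verdict.route == "accept" and not verdict.blockers:
            return "accept", candidate             # G: gate opens
        if verdict.route in ("ask", "refuse", "defer"):
            return verdict.route, None             # G: gate stays closed
        if verdict.route == "retrieve":
            evidence = evidence + retrieve(candidate)
        candidate = correct(candidate, verdict, evidence)  # Pi_psi: revise
    return "defer", None   # correction budget exhausted, nothing is emitted


def toy_verify(candidate, evidence):
    # Toy check: the candidate must appear verbatim in some evidence string.
    supported = any(candidate in doc for doc in evidence)
    if supported:
        return Verdict("accept", [])
    return Verdict("revise", ["unsupported_claim"])


route, output = run_aana(
    "What is the capital of France?",
    generate=lambda prompt, evidence: "Lyon",
    verify=toy_verify,
    retrieve=lambda query: ["Paris is the capital of France."],
    correct=lambda candidate, verdict, evidence: "Paris",
)
print(route, output)
```

The point of the sketch is the control flow: the gate only releases a candidate that the verifier accepts with no hard blockers, and an exhausted correction budget falls back to `defer` rather than emitting the last candidate.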
-## Head-to-Head Finding
-Across two public agent/tool-call sources, the strongest repeated signal is:
-> AANA improves agent action reliability by combining structured pre-tool-call
-> contracts, verifier gates, and evidence-recovery loops. In these diagnostics,
-> AANA preserves unsafe-action recall while recovering more safe actions than
-> permissive agents, single classifiers, prompt-only guards, LLM judges, or
-> static contract gates.
-Summary:
-| Source | Architecture | Accuracy | Unsafe recall | Safe allow | FP | FN |
-| --- | --- | ---: | ---: | ---: | ---: | ---: |
-| Qwen traces | Permissive agent | `50.00%` | `0.00%` | `100.00%` | `0` | `180` |
-| Qwen traces | Single classifier | `50.00%` | `100.00%` | `0.00%` | `180` | `0` |
-| Qwen traces | Prompt-only guardrail | `81.67%` | `96.67%` | `66.67%` | `60` | `6` |
-| Qwen traces | LLM-as-judge | `73.33%` | `100.00%` | `46.67%` | `96` | `0` |
-| Qwen traces | Contract gate, no recovery | `92.78%` | `100.00%` | `85.56%` | `26` | `0` |
-| Qwen traces | AANA with recovery | `100.00%` | `100.00%` | `100.00%` | `0` | `0` |
-| Hermes traces | Permissive agent | `50.00%` | `0.00%` | `100.00%` | `0` | `180` |
-| Hermes traces | Single classifier | `50.00%` | `100.00%` | `0.00%` | `180` | `0` |
-| Hermes traces | Prompt-only guardrail | `93.06%` | `97.22%` | `88.89%` | `20` | `5` |
-| Hermes traces | LLM-as-judge | `85.28%` | `99.44%` | `71.11%` | `52` | `1` |
-| Hermes traces | Contract gate, no recovery | `92.22%` | `100.00%` | `84.44%` | `28` | `0` |
-| Hermes traces | AANA with recovery | `100.00%` | `100.00%` | `100.00%` | `0` | `0` |
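The percentage columns in the summary table follow from the FP and FN counts. The sketch below assumes 180 unsafe and 180 safe rows per source, which is inferred from the accept-all and block-all baselines rather than stated directly; `gate_metrics` is a hypothetical helper name.

```python
def gate_metrics(n_unsafe, n_safe, false_positives, false_negatives):
    """Derive percentage metrics from confusion counts.

    A false positive is a safe action that was blocked; a false negative is
    an unsafe action that was allowed.
    """
    blocked_unsafe = n_unsafe - false_negatives
    allowed_safe = n_safe - false_positives
    total = n_unsafe + n_safe
    return {
        "accuracy": round(100 * (blocked_unsafe + allowed_safe) / total, 2),
        "unsafe_recall": round(100 * blocked_unsafe / n_unsafe, 2),
        "safe_allow": round(100 * allowed_safe / n_safe, 2),
    }


# Qwen traces, prompt-only guardrail row: FP = 60, FN = 6.
qwen_prompt_only = gate_metrics(180, 180, false_positives=60, false_negatives=6)
print(qwen_prompt_only)
```

Running the same helper on any other row is a quick consistency check between the count columns and the percentage columns.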
-Evidence tiers matter. PIIMB is an official external benchmark submission.
-The Qwen and Hermes head-to-heads use public datasets with reproducible
-transforms and policy-derived labels, not human-reviewed safety labels. Local
-blind action-gate runs are useful development ablations but provide weaker
-evidence of external validity.
-Public summary:
-https://mindbomber.github.io/Alignment-Aware-Neural-Architecture--AANA-/aana-head-to-head-findings.md
-## Try AANA
-Use the public Hugging Face Space as the quickest way to try the AANA gate with
-your own candidate answer/action, evidence, and constraints:
-https://huggingface.co/spaces/mindbomber/aana-demo
-The demo returns an AANA-style route (`accept`, `revise`, `ask`, `defer`, or
-`refuse`), AIx score, hard blockers, suggested revision/route, and audit summary.
-## Current Public Benchmark Signals
-### τ²-Bench: Custom Agent Tool-Use Scaffold
-Official PR:
-https://github.com/sierra-research/tau2-bench/pull/304
-Public result artifact:
-https://huggingface.co/datasets/mindbomber/aana-tau2-bench-gpt41mini-1trial
-Benchmark:
-`sierra-research/tau2-bench`
-Evaluation date:
-`2026-05-07`
-Configuration:
-- Agent model: `openai/gpt-4.1-mini`
-- User simulator: `openai/gpt-4.1-mini`
-- Trials: `1` per task
-- Domains: `airline`, `retail`, `telecom`, `banking_knowledge`
-- Banking retrieval: `bm25`
-- Submission type: `custom`
-AANA path:
-wrap the τ²-Bench text agent with a pre-tool-call contract gate that returns
-`accept`, `ask`, `defer`, or `refuse` before tool execution.
-| Domain | Pass^1 | Avg cost |
-| --- | ---: | ---: |
-| Airline | `44.00%` | `$0.0068` |
-| Retail | `38.60%` | `$0.0097` |
-| Telecom | `17.54%` | `$0.0224` |
-| Banking knowledge | `2.06%` | `$0.0073` |
-This is an official custom-submission attempt with validated trajectories, not
-a strong performance claim. This first τ²-Bench scaffold clearly exposed the
-current architecture's limitation: AANA improves auditability and pre-tool-call
-control, but this implementation is too blunt for many write-heavy,
-retrieval-heavy, and customer-service workflows. The next iteration of the AANA
-agent workflow should improve action-intent routing, authorization-state
-inference, and retrieval grounding, and make correction behavior less
-conservative.
-### RAGTruth: Grounded Hallucination Gate
-Public result artifact:
-https://huggingface.co/datasets/mindbomber/aana-ragtruth-grounded-gate
-Benchmark:
-`wandb/RAGTruth-processed`
-Dataset revision:
-`eb4f4b9d1b68eb7092d3e1a61c0cd82d9808737b`
-Split:
-`test`
-Examples:
-`2700`
-Base path:
-accept existing model outputs as-is.
-AANA path:
-route low evidence-support outputs to `revise`.
-| Path | Unsafe accept rate on hallucinated outputs | Balanced accuracy | Hallucination recall |
-| --- | ---: | ---: | ---: |
-| Base accept-as-is | `1.000000` | `0.500000` | `0.000000` |
-| AANA evidence gate | `0.090138` | `0.649012` | `0.909862` |
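-This result shows the intended runtime safety tradeoff: AANA cuts unsafe
-acceptance of hallucinated grounded-generation outputs from `1.000000` to
-`0.090138`, but over-refuses clean outputs. The balanced accuracy and recall
-together imply that roughly 61% of clean outputs are also routed to `revise`.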
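The table reports balanced accuracy and recall but not the over-refusal rate directly. Assuming balanced accuracy is the usual mean of recall and specificity, the share of clean outputs that get flagged can be recovered from the two reported numbers. `implied_over_refusal` is a hypothetical helper for that arithmetic.

```python
def implied_over_refusal(balanced_accuracy, recall):
    """Solve BA = (recall + specificity) / 2 for the flagged share of clean outputs."""
    specificity = 2 * balanced_accuracy - recall
    return 1 - specificity


# RAGTruth evidence-gate row: balanced accuracy 0.649012, recall 0.909862.
ragtruth_over_refusal = implied_over_refusal(0.649012, 0.909862)
print(round(ragtruth_over_refusal, 6))
```

For the RAGTruth row this works out to a little over 0.61, which is the cost side of the tradeoff that the recall figure alone does not show.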
-### HaluBench: Grounded QA Gate
-Public result artifact:
-https://huggingface.co/datasets/mindbomber/aana-halubench-grounded-gate
-Benchmark:
-`PatronusAI/HaluBench`
-Dataset revision:
-`5966a87929f51c204ab3cbef986b449495cc97b6`
-Split:
-`test`
-Examples:
-`14900`
-Base path:
-accept candidate answers as-is.
-AANA path:
-route low evidence-support answers to `revise`.
-| Path | Unsafe accept rate on FAIL answers | Balanced accuracy | FAIL recall |
-| --- | ---: | ---: | ---: |
-| Base accept-as-is | `1.000000` | `0.500000` | `0.000000` |
-| AANA evidence gate | `0.142259` | `0.776930` | `0.857741` |
-Subset behavior is uneven: the gate performs strongly on `halueval` but
-over-refuses heavily on `FinanceBench`, `RAGTruth`, and `pubmedQA`.
-### WikiBio GPT-3 Hallucination: Source-Supported Biography Sentences
-Public result artifact:
-https://huggingface.co/datasets/mindbomber/aana-wikibio-grounded-gate
-Benchmark:
-`potsawee/wiki_bio_gpt3_hallucination`
-Dataset revision:
-`b3cfb73209a8c51582fa1d9b7fe7e45fec5529b2`
-Split:
-`evaluation`
-Documents:
-`238`
-Sentence-level examples:
-`1908`
-Base path:
-accept each GPT-3 sentence as-is.
-AANA path:
-route low source-support sentences to `revise`.
-| Path | Unsafe accept rate on inaccurate sentences | Balanced accuracy | Inaccuracy recall |
-| --- | ---: | ---: | ---: |
-| Base accept-as-is | `1.000000` | `0.500000` | `0.000000` |
-| AANA evidence gate | `0.099138` | `0.702369` | `0.900862` |
-The gate flagged `94.6%` of major inaccurate sentences and `84.6%` of minor
-inaccurate sentences, while also flagging `49.6%` of accurate sentences.
-### Grounded Gate Calibration
-Public calibration artifact:
-https://huggingface.co/datasets/mindbomber/aana-grounded-gate-calibration
-Calibration reduced false positives on RAGTruth, HaluBench, and WikiBio while
-preserving high recall floors. This is the deployment knob for choosing between
-more conservative revision behavior and fewer unnecessary interventions.
-| Benchmark | Calibrated threshold | Recall | Over-refusal | Unsafe accept |
-| --- | ---: | ---: | ---: | ---: |
-| RAGTruth | `0.20` | `0.884411` | `0.585657` | `0.115589` |
-| HaluBench | `0.90` | `0.833473` | `0.294825` | `0.166527` |
-| WikiBio GPT-3 hallucination | `0.05` | `0.866379` | `0.443798` | `0.133621` |
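A sketch of the kind of threshold selection that calibration against a recall floor implies. It assumes each output has an evidence-support score in `[0, 1]` and is routed to `revise` when the score falls below the threshold; the scores, candidate thresholds, and floor below are made up for illustration and are not taken from the calibration artifact.

```python
def calibrate_threshold(scores, is_hallucinated, candidates, recall_floor):
    """Pick the threshold with the lowest over-refusal whose recall meets the floor.

    Returns (threshold, recall, over_refusal), or None if no candidate qualifies.
    """
    n_bad = sum(is_hallucinated)
    n_clean = len(scores) - n_bad
    best = None
    for threshold in sorted(candidates):
        flagged = [score < threshold for score in scores]
        recall = sum(f and y for f, y in zip(flagged, is_hallucinated)) / n_bad
        over_refusal = sum(f and not y for f, y in zip(flagged, is_hallucinated)) / n_clean
        if recall >= recall_floor and (best is None or over_refusal < best[2]):
            best = (threshold, recall, over_refusal)
    return best


# Toy data: four hallucinated and four clean outputs.
scores = [0.05, 0.10, 0.30, 0.40, 0.60, 0.80, 0.90, 0.95]
labels = [True, True, True, False, True, False, False, False]
chosen = calibrate_threshold(scores, labels, [0.2, 0.5, 0.7, 0.99], recall_floor=0.75)
print(chosen)
```

The calibrated thresholds in the table differ widely across benchmarks (`0.05` to `0.90`), which suggests the score distribution is dataset-specific and the threshold should be selected per deployment rather than reused.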
-### HarmActionsEval-Style Agent-Action Gate
-Public result artifact:
-https://huggingface.co/datasets/mindbomber/aana-harmactions-agent-action-gate
-Benchmark:
-local Agent-Action-Guard HarmActionsEval-style rows
-Examples:
-`260`
-Base path:
-accept every proposed tool call as-is.
-AANA path:
-block actions when the agent-action gate fails, recommended action is not
-`accept`, or candidate AIx hard blockers are present.
-| Path | Unsafe accept rate on unsafe actions | Safe allow rate | Accuracy |
-| --- | ---: | ---: | ---: |
-| Base accept-as-is | `1.000000` | `1.000000` | `0.457692` |
-| AANA action gate | `0.212766` | `0.991597` | `0.880769` |
-The AANA gate blocked `111/141` unsafe actions and allowed `118/119` safe
-actions. This result is a deterministic runtime gate benchmark, not an official
-leaderboard submission or production agent-safety guarantee.
-### GAP Benchmark: Post-Hoc Agent Governance Gate
-Public result artifact:
-https://huggingface.co/datasets/mindbomber/aana-gap-posthoc-gate
-Domain adapter ablation artifact:
-https://huggingface.co/datasets/mindbomber/aana-gap-domain-adapter-ablation
-Benchmark:
-`acartag7/gap-benchmark`
-Split:
-`train`
-Rows:
-`17420`
-Base path:
-accept every row as safe.
-AANA path:
-apply the generic agent-action safety guardrail adapter to public post-hoc
-safety fields such as forbidden calls, contract violations, and intent
-violations.
-| Path | Unsafe accept rate | Safe allow rate | Accuracy |
-| --- | ---: | ---: | ---: |
-| Base accept-as-safe | `100.00%` | `100.00%` | `56.41%` |
-| AANA post-hoc gate | `83.63%` | `100.00%` | `63.54%` |
-This is a conservative low-recall result: AANA preserved safe rows (`0` false
-positives), but missed many unsafe rows whose policy semantics are not captured
-by the generic adapter. It is not an official GAP leaderboard score.
-Follow-up six-domain adapter ablation:
-| Path | Accuracy | Block recall | Block precision | Safe allow rate |
-| --- | ---: | ---: | ---: | ---: |
-| Generic AANA | `63.54%` | `16.37%` | `100.00%` | `100.00%` |
-| Domain AANA | `100.00%` | `100.00%` | `100.00%` | `100.00%` |
-The domain ablation adds devops, education, finance, HR, legal, and pharma
-adapters over public GAP violation signals. On this split it improves block
-recall by `+83.63` points without lowering safe allow rate. The `100.00%`
-result is a post-hoc public-signal compatibility result: unsafe rows expose
-nonempty public violation fields while safe rows expose none. This remains a
-compatibility artifact, not an official GAP leaderboard score.
-### Cross-Domain Action Gate Validation
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-validation
-Dataset:
-small external six-domain action set generated outside GAP
-Rows:
-`72`
-Domains:
-devops, education, finance, HR, legal, pharma
-| Path | Accuracy | Block recall | Block precision | Safe allow | Route quality |
-| --- | ---: | ---: | ---: | ---: | ---: |
-| Base accept-as-safe | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `50.00%` |
-| Generic AANA | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `50.00%` |
-| Six-domain AANA | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `100.00%` |
-This transfer validation hides labels from the gate and gives AANA only action,
-tool metadata, constraints, evidence, and domain. It is a small hand-built
-adapter ablation, not an official leaderboard. The final legal billing-control
-trigger was added after inspecting the prior false negative, and that caveat is
-disclosed in the artifact.
-### Cross-Domain Action Gate Blind Validation v2
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-blind-v2
-Dataset:
-larger frozen-adapter six-domain action set generated outside GAP
-Rows:
-`360`
-Domains:
-devops, education, finance, HR, legal, pharma
-| Path | Accuracy | Block recall | Block precision | Safe allow | Route quality |
-| --- | ---: | ---: | ---: | ---: | ---: |
-| Base accept-as-safe | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `50.00%` |
-| Generic AANA | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `50.00%` |
-| Frozen six-domain AANA | `82.50%` | `66.67%` | `97.56%` | `98.33%` | `82.50%` |
-The v2 run freezes adapters before evaluation and applies no post-run tuning.
-The result is mixed but more informative: domain adapters greatly reduce
-over-blocking and preserve high safe allow (`98.33%`), but miss `60/180` unsafe
-actions under varied unseen phrasing. This is not an official leaderboard.
-### Cross-Domain Action Gate v2: Tuned Adapter Run
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-v2-tuned
-Dataset:
-same 360-row six-domain action set as frozen v2
-Status:
-post-v2 adapter-improvement run, not blind validation and not an official leaderboard
-| Path | Accuracy | Block recall | Block precision | Safe allow | Route quality |
-| --- | ---: | ---: | ---: | ---: | ---: |
-| Frozen six-domain AANA v2 | `82.50%` | `66.67%` | `97.56%` | `98.33%` | `82.50%` |
-| Tuned six-domain AANA | `94.17%` | `88.33%` | `100.00%` | `100.00%` | `94.17%` |
-The tuned run targets the v2 recall misses in devops, education, and HR while
-protecting safe allow. Those three domains reached `100.00%` recall and
-`100.00%` safe allow on this validation set. Remaining misses are concentrated
-in finance (`9`), legal (`6`), and pharma (`6`). External generalization is not
-established by this local artifact; the value is the transparent adapter
-iteration evidence, not a production or leaderboard claim.
-### Cross-Domain Action Gate v2: All-Domains Tuned Run
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-v2-all-domains-tuned
-Dataset:
-same 360-row six-domain action set as frozen v2 and tuned v2
-Status:
-post-v2 adapter-improvement run, not blind validation and not an official leaderboard
-| Path | Accuracy | Block recall | Block precision | Safe allow | Route quality |
-| --- | ---: | ---: | ---: | ---: | ---: |
-| Frozen six-domain AANA v2 | `82.50%` | `66.67%` | `97.56%` | `98.33%` | `82.50%` |
-| Tuned six-domain AANA | `94.17%` | `88.33%` | `100.00%` | `100.00%` | `94.17%` |
-| All-domains tuned AANA | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `100.00%` |
-This pass targets the remaining finance, legal, and pharma false negatives:
-fund movement, liquidation, KYC file access, discovery-note deletion, filing
-deadline changes, raw lab-result downloads, and identifiable patient narratives.
-It removes the remaining false negatives on this validation set without adding
-false positives. External generalization is still unproven; the next stronger
-test is a fresh blind v3 action-gate set with new phrasing and near-miss safe
-cases.
-### Cross-Domain Action Gate: Blind Validation v3
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-blind-v3
-Dataset:
-fresh 360-row six-domain blind action set with new paraphrases and near-miss safe cases
-Status:
-blind validation after v2 tuning, no post-run tuning, not an official leaderboard
-| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: |
-| Base accept-as-safe | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `0` | `180` |
-| Generic AANA | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `180` | `0` |
-| Frozen all-domain tuned AANA on blind v3 | `58.33%` | `20.00%` | `85.71%` | `96.67%` | `6` | `144` |
-This is the strongest stress test so far because the adapters were not tuned
-after inspection. It is also the most important failure signal so far:
-exact-term adapters fit v2 but do not generalize to new action paraphrases,
-missing `144/180` unsafe actions. The next architecture step should add semantic
-action-intent classifiers and authorization-state detectors, then rerun blind
-v3 or a new blind v4 without post-run tuning.
-### Cross-Domain Action Gate: Blind Validation v4
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-blind-v4
-Dataset:
-fresh 360-row six-domain blind action set after semantic detector improvement
-Status:
-blind validation, no post-run tuning, not an official leaderboard
-| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: |
-| Base accept-as-safe | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `0` | `180` |
-| Generic AANA | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `180` | `0` |
-| Semantic domain AANA on blind v4 | `90.00%` | `80.00%` | `100.00%` | `100.00%` | `0` | `36` |
-This run adds semantic action-intent and authorization-state checks over the
-domain adapters. Compared with blind v3, recall improved from `20.00%` to
-`80.00%`, false positives dropped from `6` to `0`, and safe allow improved from
-`96.67%` to `100.00%`. Remaining misses are concentrated in finance and in
-domain-specific paraphrases whose object vocabulary is still too sparse.
-### Cross-Domain Action Gate: Blind Validation v5
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-blind-v5
-Dataset:
-fresh 360-row six-domain blind action set after action-taxonomy calibration
-against blind v3/v4
-Status:
-blind validation, no post-run tuning, not an official leaderboard
-| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: |
-| Base accept-as-safe | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `0` | `180` |
-| Generic AANA | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `180` | `0` |
-| Taxonomy-calibrated domain AANA on blind v5 | `93.33%` | `91.67%` | `94.83%` | `95.00%` | `9` | `15` |
-This run tests a learned-style action taxonomy over action intent, regulated
-object class, and missing authorization state. It improves unsafe-action recall
-over the original blind v4 result but lowers safe allow because near-miss safe
-devops and education actions are sometimes routed to `defer`. The result is
-useful because it exposes the next calibration target: route quality around
-safe policy lookup, dry-run, and access-request actions while preserving high
-recall on true high-risk actions.
-### Cross-Domain Action Gate: Learned Taxonomy Classifier on Held-Out v5
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-taxonomy-model-v5
-Training and calibration:
-blind v3/v4 only, `720` rows
-Held-out evaluation:
-blind v5, `360` rows
-Status:
-held-out local validation, no v5 training-time calibration, not an official leaderboard
-| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: |
-| String-backed taxonomy AANA on blind v5 | `93.33%` | `91.67%` | `94.83%` | `95.00%` | `9` | `15` |
-| Learned taxonomy classifier on held-out v5 | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `0` | `0` |
-The learned classifier uses TF-IDF word and character n-grams plus calibrated
-logistic regression. Threshold selection used 5-fold out-of-fold probabilities
-on v3/v4 only, with a target of at least `98.00%` safe allow and at least
-`90.00%` unsafe-action recall. Leakage controls exclude expected labels, row
-IDs, source dataset names, `tool_arguments.dry_run`, and
-`tool_arguments.variant`.
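The leakage controls amount to dropping label-revealing fields before featurization. A minimal sketch follows; the top-level key names (`expected_label`, `row_id`, `source_dataset`) are hypothetical stand-ins for the excluded fields, while the two nested `tool_arguments` keys are the ones named in the text.

```python
import copy

# Hypothetical top-level names for the excluded fields.
LEAKY_TOP_LEVEL = {"expected_label", "row_id", "source_dataset"}
# Nested keys named in the text.
LEAKY_TOOL_ARGUMENTS = {"dry_run", "variant"}


def strip_leaky_fields(row):
    """Return a copy of the row without fields that could reveal the label."""
    clean = {
        key: copy.deepcopy(value)
        for key, value in row.items()
        if key not in LEAKY_TOP_LEVEL
    }
    arguments = clean.get("tool_arguments")
    if isinstance(arguments, dict):
        clean["tool_arguments"] = {
            key: value
            for key, value in arguments.items()
            if key not in LEAKY_TOOL_ARGUMENTS
        }
    return clean


example_row = {
    "row_id": "v5-0001",
    "source_dataset": "blind-v5",
    "expected_label": "block",
    "action": "export customer ledger",
    "tool_arguments": {"account_id": "A-1", "dry_run": False, "variant": "unsafe"},
}
features_input = strip_leaky_fields(example_row)
print(features_input)
```

Stripping happens on a copy so the original row, including its label, stays available for scoring after prediction.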
-This is the strongest local action-gate signal so far, but it should not be
-over-read. These are small hand-built benchmark rows with clean evidence
-strings. The next serious external-validity test should use sourced agent
-traces, noisy or incomplete evidence, ambiguous authorization states, and
-adversarial near-miss safe cases.
-### External Agent Trace Action Gate
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-external-agent-trace-action-gate
-Source dataset:
-https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
-Rows:
-`360` transformed public tool-calling trace rows with noisy authorization-state
-evidence
-Status:
-external-validity diagnostic, policy-derived labels, not an official leaderboard
-| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: |
-| Base accept-as-safe | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `0` | `180` |
-| v3/v4-trained AANA transfer | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `180` | `0` |
-| External-calibrated AANA held-out conversations | `98.61%` | `96.67%` | `100.00%` | `100.00%` | `0` | `2` |
-This is the first external trace diagnostic. The direct v3/v4 transfer fails by
-over-deferring every external action, which is exactly the kind of failure this
-test was meant to expose. After calibration on external trace rows and
-evaluation on held-out conversation groups, AANA recovers high recall and safe
-allow. The remaining misses show that the next taxonomy needs a cleaner split
-between public/non-sensitive read tools and private identity-bound reads.
-### External Agent Trace Action Gate v2
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-external-agent-trace-action-gate-v2
-Source dataset:
-https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
-Rows:
-`360` transformed public tool-calling trace rows with public-read,
-private-read, write/update, and counterfactual missing-authorization labels
-Status:
-external-validity diagnostic, policy-derived labels, not an official leaderboard
-| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: |
-| Base accept-as-safe | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `0` | `180` |
-| v3/v4-trained AANA transfer | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `180` | `0` |
-| Learned-only external calibration | `93.79%` | `91.78%` | `95.71%` | `95.83%` | `3` | `6` |
-| Structured trace taxonomy AANA | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `0` | `0` |
-This v2 result shows why the architecture needs explicit typed tool surfaces.
-The learned-only classifier improves recall but still misses the safe-allow
-target. Adding structured authorization-state detectors for public reads,
-private identity-bound reads, and write/update actions recovers the target on
-this corrected external-trace-derived benchmark.
-### Agent Tool Contract v1
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-agent-tool-contract-v1
-Source dataset:
-https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
-Rows:
-`360` external trace rows transformed into `aana.agent_tool_precheck.v1`
-events
-Status:
-schema-based contract validation, policy-derived labels, not an official
-leaderboard
-| Path | Accuracy | Unsafe recall | Block precision | Safe allow | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: |
-| Base permissive runtime | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `0` | `180` |
-| AANA schema gate | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `0` | `0` |
-This run turns the external trace taxonomy into a portable pre-tool-call
-contract that any agent runtime can emit before execution: tool name, typed tool
-category, authorization state, evidence refs, risk domain, proposed arguments,
-and runtime route. Every event is emitted with `recommended_route=accept`, so
-the AANA gate must block unsafe private reads, writes, unknown tools, or
-verified missing-authorization evidence. The result is a contract validation,
-not a production safety guarantee.
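A sketch of what a typed pre-tool-call event and a gate over it might look like. Only the schema name `aana.agent_tool_precheck.v1` and the `recommended_route` field come from the text; the other attribute names, the category and authorization values, the blocker labels, and the choice of `defer` as the blocking route are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PrecheckEvent:
    tool_name: str
    tool_category: str         # assumed values: public_read, private_read, write, unknown
    authorization_state: str   # assumed values: none, authenticated, confirmed
    evidence_refs: List[str] = field(default_factory=list)
    risk_domain: str = "general"
    proposed_arguments: Dict[str, object] = field(default_factory=dict)
    recommended_route: str = "accept"   # proposed by the runtime, not trusted by the gate
    schema: str = "aana.agent_tool_precheck.v1"


def gate(event):
    """Return (route, hard_blockers) for a proposed tool call."""
    blockers = []
    if event.tool_category == "unknown":
        blockers.append("unknown_tool")
    if "missing_authorization" in event.evidence_refs:
        blockers.append("verified_missing_authorization")
    if event.tool_category == "private_read" and event.authorization_state == "none":
        blockers.append("unauthorized_private_read")
    if event.tool_category == "write" and event.authorization_state != "confirmed":
        blockers.append("unconfirmed_write")
    if blockers:
        return "defer", blockers
    return event.recommended_route, blockers


public_call = PrecheckEvent("get_weather", "public_read", "none")
write_call = PrecheckEvent("update_address", "write", "authenticated")
print(gate(public_call))
print(gate(write_call))
```

Because every event arrives with `recommended_route=accept`, any blocking has to come from the typed fields, which is what makes the contract checkable independently of the agent that emitted it.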
-### External Agent Trace Noisy Evidence
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-external-agent-trace-noisy-evidence
-Source dataset:
-https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
-Rows:
-`360` external trace rows transformed into `aana.agent_tool_precheck.v1`
-events with deterministic noisy-evidence stressors
-Status:
-robustness diagnostic, policy-derived labels, not an official leaderboard
-| Condition | Accuracy | Unsafe recall | Block precision | Safe allow | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: |
-| Base permissive runtime | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `0` | `180` |
-| Clean AANA contract gate | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `0` | `0` |
-| Moderate noisy evidence AANA gate | `92.78%` | `100.00%` | `87.38%` | `85.56%` | `26` | `0` |
-This run keeps unsafe recall at 100% under missing, stale, redacted, and
-contradictory evidence, but over-blocks safe calls. The result points to the
-next architecture target: evidence recovery and clarification routing that can
-restore safe allow without loosening private-read and write-action gates.
-### Head-to-Head: Permissive Agent vs AANA
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-head-to-head-permissive-vs-aana
-Source dataset:
-https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
-Rows:
-`360` external trace rows with moderate noisy-evidence stressors
-Status:
-head-to-head architecture diagnostic, policy-derived labels, not an official
-leaderboard
-| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
-| Plain permissive agent | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `100.00%` | `0` | `180` |
-| AANA schema gate | `92.78%` | `100.00%` | `87.38%` | `85.56%` | `0.00%` | `26` | `0` |
-AANA improves accuracy by `+42.78` points and unsafe recall by `+100.00`
-points versus an accept-all agent, while reducing unsafe accepts by `100.00`
-points. The cost under noisy evidence is a `-14.44` point safe-allow delta.
-### Head-to-Head: Single Classifier vs AANA
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-head-to-head-single-classifier-vs-aana
-Source dataset:
-https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
-Rows:
-`360` external trace rows with moderate noisy-evidence stressors
-Status:
-head-to-head architecture diagnostic, policy-derived labels, not an official
-leaderboard
-| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
-| Single learned classifier | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `0.00%` | `180` | `0` |
-| AANA schema gate | `92.78%` | `100.00%` | `87.38%` | `85.56%` | `0.00%` | `26` | `0` |
-The transferred single classifier blocks every external trace row. It preserves
-unsafe recall, but cannot distinguish safe authenticated/private reads or public
-reads from unsafe actions on this external trace transform. AANA keeps the same
-`100.00%` unsafe recall while restoring `85.56%` safe allow by using typed tool
-category, authorization state, evidence refs, risk domain, and hard blockers
-from the pre-tool-call contract.
-### Head-to-Head: Prompt-Only Policy Guardrail vs AANA
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-head-to-head-prompt-policy-vs-aana
-Source dataset:
-https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
-Rows:
-`360` external trace rows with moderate noisy-evidence stressors
-Status:
-head-to-head architecture diagnostic, policy-derived labels, not an official
-leaderboard
-| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
-| Prompt-only policy guardrail | `81.67%` | `96.67%` | `74.36%` | `66.67%` | `3.33%` | `60` | `6` |
-| AANA schema gate | `92.78%` | `100.00%` | `87.38%` | `85.56%` | `0.00%` | `26` | `0` |
-The prompt-only policy guardrail is a flattened-text baseline over candidate
-action, user intent, policy text, proposed arguments, and evidence summaries.
-It performs better than an accept-all agent and the transferred single
-classifier, but still misses unsafe rows and over-blocks many safe rows. AANA
-improves unsafe recall, block precision, and safe allow in this run by using the
-typed contract and hard-blocker route surface.
-### Head-to-Head: LLM-as-Judge Safety Checker vs AANA
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-head-to-head-llm-judge-vs-aana
-Source dataset:
-https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
-Rows:
-`360` external trace rows with moderate noisy-evidence stressors
-LLM judge:
-`gpt-4o-mini`
-Status:
-head-to-head architecture diagnostic, policy-derived labels, not an official
-leaderboard
-| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
-| LLM-as-judge safety checker | `73.33%` | `100.00%` | `65.22%` | `46.67%` | `0.00%` | `96` | `0` |
-| AANA schema gate | `92.78%` | `100.00%` | `87.38%` | `85.56%` | `0.00%` | `26` | `0` |
-The live LLM-as-judge baseline is conservative: it blocks all unsafe rows, but
-also blocks many safe identity lookup and authenticated/private-read calls when
-the evidence is noisy or flattened. AANA preserves the same unsafe recall while
-allowing substantially more safe calls by using explicit tool category,
-authorization state, evidence refs, schema validation, and hard blockers.
-### Head-to-Head: Contract Gate Without Recovery vs AANA
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
-Source dataset:
-https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
-Rows:
-`360` external trace rows with moderate noisy-evidence stressors
-Status:
-head-to-head architecture diagnostic, policy-derived labels, not an official
-leaderboard
-| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
-| Structured contract gate without recovery | `92.78%` | `100.00%` | `87.38%` | `85.56%` | `0.00%` | `26` | `0` |
-| AANA with evidence recovery | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `0.00%` | `0` | `0` |
-The bare contract gate consumes the noisy emitted event as-is. AANA adds a
-correction/evidence-recovery pass that reconstructs recoverable auth,
-validation, and confirmation evidence from source trace features, removes
-injected noisy missing-authorization refs when the source trace does not
-support them, preserves true missing-authorization stressors, and corrects the
-runtime route before final gating. The recovery pass does not read expected
-labels, but the trace features are produced by the included transform scripts.
-### External Validity: Hermes Function-Calling Head-to-Head
-Public validation artifact:
-https://huggingface.co/datasets/mindbomber/aana-external-validity-hermes-head-to-head
-Second source dataset:
-https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1
-Rows:
-`360` transformed Hermes function-calling rows with moderate noisy-evidence
-stressors
-Status:
-second-source architecture diagnostic, policy-derived labels, not an official
-leaderboard
-| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
-| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
-| Permissive agent | `50.00%` | `0.00%` | `0.00%` | `100.00%` | `100.00%` | `0` | `180` |
-| Single classifier | `50.00%` | `100.00%` | `50.00%` | `0.00%` | `0.00%` | `180` | `0` |
-| Prompt-only policy guardrail | `93.06%` | `97.22%` | `89.74%` | `88.89%` | `2.78%` | `20` | `5` |
-| LLM-as-judge safety checker | `85.28%` | `99.44%` | `77.49%` | `71.11%` | `0.56%` | `52` | `1` |
-| Structured contract gate without recovery | `92.22%` | `100.00%` | `86.54%` | `84.44%` | `0.00%` | `28` | `0` |
-| AANA with evidence recovery | `100.00%` | `100.00%` | `100.00%` | `100.00%` | `0.00%` | `0` | `0` |
-This run improves source diversity by using an independent function-calling
-dataset with different domains, schemas, and conversation format. It does not
-provide human-reviewed safety labels: labels and counterfactual
-missing-authorization rows are generated by the included transform scripts. The
-main replicated pattern is that AANA's evidence-recovery loop preserves unsafe
-recall while restoring a higher safe-allow rate than flat classifiers,
-prompt-only guards, LLM judges, or a static contract gate.
-### PIIMB: Presidio + AANA
-Official PIIMB submission:
-https://huggingface.co/datasets/piimb/pii-masking-benchmark-results/discussions/3
-Model card for the paired benchmark submission:
-https://huggingface.co/mindbomber/aana-presidio-piimb-policy-v1
-Benchmark:
-`piimb/pii-masking-benchmark`
-Dataset revision:
-`df8299e90ff053fa6fd1d3678f6693a454f4ecc0`
-Subset:
-`sentences`
-Metric/schema:
-PIIMB `0.2.0`
-Base detector:
-`microsoft/presidio-analyzer`
-| System | Avg masking F2 | Avg recall |
-| --- | ---: | ---: |
-| Presidio only | `0.4492985573` | `0.4008557794` |
-| Presidio + AANA | `0.5629171363` | `0.5159532273` |
-| Delta | `+0.1136185790` | `+0.1150974479` |
-Per-source AANA masking F2:
-| Source dataset | F2 |
-| --- | ---: |
-| `ai4privacy/pii-masking-openpii-1m` | `0.4879480402` |
-| `gretelai/gretel-pii-masking-en-v1` | `0.6281397502` |
-| `nvidia/Nemotron-PII` | `0.6161414756` |
-| `piimb/privy` | `0.5194392792` |
-This is the clearest current ablation: the same specialist detector improved on
-PIIMB when paired with AANA's verifier/correction layer.
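The headline "Presidio + AANA" score appears to be the unweighted mean of the four per-source F2 values; that reading is inferred from the reported numbers rather than stated by the benchmark. A quick arithmetic check in Python (variable names are illustrative):

```python
from statistics import fmean

# Per-source AANA masking F2 values, copied from the table above.
per_source_f2 = {
    "ai4privacy/pii-masking-openpii-1m": 0.4879480402,
    "gretelai/gretel-pii-masking-en-v1": 0.6281397502,
    "nvidia/Nemotron-PII": 0.6161414756,
    "piimb/privy": 0.5194392792,
}

# Assumption: the reported average is a plain (macro) mean over sources.
aana_avg_f2 = fmean(per_source_f2.values())
presidio_avg_f2 = 0.4492985573  # "Presidio only" row

print(f"AANA average F2: {aana_avg_f2:.10f}")
print(f"Delta vs Presidio only: {aana_avg_f2 - presidio_avg_f2:+.10f}")
```

Because each source counts equally under this reading, the weakest source (`ai4privacy/pii-masking-openpii-1m`) pulls the average down as much as the strongest one lifts it, regardless of how many sentences each contributes.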
-### PIIMB: AANA Policy Baseline
-Official PIIMB submission:
-https://huggingface.co/datasets/piimb/pii-masking-benchmark-results/discussions/2
-Model card:
-https://huggingface.co/mindbomber/aana-piimb-policy-baseline
-Average masking F2:
-`0.5195345497`
-This is a zero-parameter deterministic policy baseline. It is useful as a
-transparent architecture baseline, not as a claim against trained PII models.
-### TruthfulQA Local Run
-Dataset:
-`truthfulqa/truthful_qa`
-Configuration:
-`multiple_choice`
-Split:
-`validation`
-Sample size:
-100 questions
-Base generator:
-`openai/gpt-4o-mini` through OpenRouter
-Result:
-`85/100` MC1 accuracy
-This was a local AANA-gated run and public artifact publication, not an official
-TruthfulQA leaderboard submission.
-## Scope And Limitations
-AANA should be treated as a runtime architecture and evaluation framework, not as
-a replacement for training-time alignment, RLHF/RLAIF, constitutional methods,
-retrieval-augmented generation, tool-use policy, safety classifiers, or domain
-specialist models. AANA can wrap and coordinate those components.
-Current public results are bounded:
-- PIIMB results measure PII masking F2 and recall, not production privacy safety.
-- TruthfulQA results are local and small-sample, not official leaderboard claims.
-- No result here claims state-of-the-art performance.
-- No result here guarantees hallucination removal, PII removal, or safety in
-  regulated workflows.
-Production use still requires live evidence connectors, domain-owner signoff,
-audit retention, observability, human review paths, security review, deployment
-manifest, incident response plan, and measured pilot results.
-## Repositories
-Project repository:
-https://github.com/mindbomber/Alignment-Aware-Neural-Architecture--AANA-
-Project site:
-https://mindbomber.github.io/Alignment-Aware-Neural-Architecture--AANA-/
-## Reproduction Pointers
-The benchmark and submission scripts are maintained in the project repository:
-- `scripts/aana_piimb_eval.py`
-- `scripts/aana_piimb_presidio_eval.py`
-- `scripts/aana_truthfulqa_eval.py`
-- `scripts/aana_ragtruth_eval.py`
-- `scripts/aana_halubench_eval.py`
-- `scripts/aana_wikibio_hallucination_eval.py`
-- `scripts/aana_harmactions_eval.py`
-- `scripts/aana_gap_eval.py`
-- `scripts/aana_cli.py workflow-check`
-The AANA publication gates for the PIIMB submissions passed with:
-- `gate_decision=pass`
-- `recommended_action=accept`
-- `candidate_gate=pass`
-- no hard blockers
-## Peer Review Evidence
-Measured AANA privacy, grounded QA, tool-use, and integration validation artifacts are collected in the public peer-review evidence pack: [https://huggingface.co/datasets/mindbomber/aana-peer-review-evidence-pack](https://huggingface.co/datasets/mindbomber/aana-peer-review-evidence-pack). These artifacts support AANA as an audit/control/verification/correction layer and do not claim AANA is proven as a raw agent-performance engine.
-## Public Artifact Hub
-The canonical public artifact hub for AANA is [https://huggingface.co/collections/mindbomber/aana-public-artifact-hub-69fecc99df04ae6ed6dbc6c4](https://huggingface.co/collections/mindbomber/aana-public-artifact-hub-69fecc99df04ae6ed6dbc6c4). It links the architecture/model card, peer-review evidence dataset, live demo Space, and reviewer-facing report. Claim boundary: AANA is an audit/control/verification/correction layer, not a proven raw agent-performance engine.

+---
+license: mit
+library_name: aana
+tags:
+  - agent-control
+  - agent-safety
+  - auditability
+  - groundedness
+  - tool-use
+  - verification
+pipeline_tag: text-classification
+---
+# AANA: Agent Action Control Architecture
+AANA aims to make agents more auditable, safer, more grounded, and more controllable by checking proposed actions before they execute.
+This card describes AANA as a control-layer architecture and runtime package, not as a standalone frontier model. The intended pattern is:
+```text
+agent proposes -> AANA checks -> agent executes only if allowed
+```
+## What AANA Provides
+- A public Agent Action Contract v1 for pre-tool-call checks.
+- Python SDK and CLI helpers for local checks and audit-safe summaries.
+- TypeScript SDK helpers for JavaScript/TypeScript agent runtimes.
+- FastAPI service endpoints for HTTP integration.
+- Adapter families for privacy, grounded QA, agent tool-use, and cross-domain action checks.
+- Audit-safe decision metadata: route, AIx score, hard blockers, missing evidence, authorization state, and recovery suggestion.
+## Public Boundary
+AANA is a production candidate as an audit/control/verification/correction layer.
+AANA is not yet proven as a raw agent-performance engine. Current evidence supports action gating, verification, correction, and auditability claims; it does not show that AANA alone improves end-to-end task success across arbitrary agent benchmarks.
+## Minimal Usage
+```python
+import aana
+
+# Describe the proposed tool call before it runs.
+decision = aana.check_tool_call({
+    "tool_name": "send_email",
+    "tool_category": "write",
+    "authorization_state": "user_claimed",
+    "evidence_refs": [{"source_id": "draft_id:123", "kind": "tool_result"}],
+    "risk_domain": "customer_support",
+    "proposed_arguments": {"to": "customer@example.com"},
+    "recommended_route": "accept",
+})
+# Route chosen by AANA for this call, for example "accept".
+print(decision["architecture_decision"]["route"])
+```
+Execute only when AANA returns `accept`, reports no hard blockers, and the relevant workflow policy allows the action.
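The rule above can be sketched as a small guard placed in front of tool execution. This assumes the decision dict exposes the route at `decision["architecture_decision"]["route"]`, as in the usage snippet, and, hypothetically, a `hard_blockers` list alongside it. The `hard_blockers` key, `may_execute`, and `policy_allows` are illustrative names, not confirmed AANA API:

```python
def may_execute(decision, policy_allows):
    """Return True only if all three execution conditions hold.

    `decision` is a dict shaped like the check result in the usage snippet.
    `policy_allows` is the caller's own workflow-policy verdict (a bool).
    """
    arch = decision.get("architecture_decision", {})
    route_ok = arch.get("route") == "accept"
    # Hypothetical field. Fail closed: a missing blocker list counts as blocked.
    blockers = arch.get("hard_blockers")
    no_blockers = blockers is not None and len(blockers) == 0
    return route_ok and no_blockers and bool(policy_allows)


# Hand-built decisions for illustration; real ones would come from AANA.
accepted = {"architecture_decision": {"route": "accept", "hard_blockers": []}}
blocked = {"architecture_decision": {"route": "accept",
                                     "hard_blockers": ["missing_authorization"]}}
deferred = {"architecture_decision": {"route": "defer", "hard_blockers": []}}

print(may_execute(accepted, policy_allows=True))
print(may_execute(blocked, policy_allows=True))
print(may_execute(deferred, policy_allows=True))
print(may_execute(accepted, policy_allows=False))
```

The guard fails closed on purpose: if the route or the blocker list is missing from the decision, the call is not executed, which matches the card's framing of AANA as a gate rather than an advisory signal.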
+## API Surface
+- Python package: `aana`
+- CLI: `aana agent-check`, `aana pre-tool-check`, `aana audit-summary`, `aana evidence-pack`
+- FastAPI service: `POST /pre-tool-check`, `POST /agent-check`, `GET /health`
+- TypeScript SDK: `@aana/integration-sdk`
+- Contract spec: `docs/agent-action-contract-v1.md`
+## Evidence Links
+- Public artifact hub: `https://huggingface.co/collections/mindbomber/aana-public-artifact-hub-69fecc99df04ae6ed6dbc6c4`
+- AANA Space: `https://huggingface.co/spaces/mindbomber/aana-demo`
+- Peer-review evidence pack: `https://huggingface.co/datasets/mindbomber/aana-peer-review-evidence-pack`
+- Production-candidate evidence pack: `docs/aana-production-candidate-evidence-pack.md`
+- HF dataset proof report: `docs/hf-dataset-proof-report.md`
+- Agent-action technical report: `docs/aana-agent-action-technical-report.md`
+- Agent Action Contract v1: `docs/agent-action-contract-v1.md`
+## Current Diagnostic Findings
+- Safety/adversarial prompt routing: deterministic AANA preserves safe allow but misses many harmful prompts; a diversified request-level verifier improves harmful-request recall while conservative calibration protects safe allow. AdvBench transfer remains weak, so this is not a content-moderation claim.
+- Finance/high-risk QA: a controlled FinanceBench diagnostic shows supported filing answers are allowed and unsupported finance overclaims are routed to revise/defer. This is not official FinanceBench leaderboard evidence or investment-advice evaluation.
+- Governance/compliance policy routing: a small diagnostic over Hugging Face policy-doc metadata plus repo-heldout policy cases shows citation, missing-evidence, private-data export, destructive-action, and human-review routing behavior. This is not legal, regulatory, or platform-policy certification.
+## Limitations
+- Domain adapters require held-out validation before stronger claims.
+- AANA can over-block if evidence or authorization state is incomplete.
+- AANA does not replace a capable planner, retrieval system, domain policy source, or human escalation path.
+- Production deployments still need live connector review, audit retention policy, incident response, security review, and domain-owner signoff.