@@ -33,6 +33,7 @@ datasets:
 - mindbomber/aana-head-to-head-llm-judge-vs-aana
 - mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
 - mindbomber/aana-external-validity-hermes-head-to-head
+- mindbomber/aana-tau2-bench-gpt41mini-1trial
 metrics:
 - accuracy
 - f_beta
@@ -111,6 +112,48 @@ The demo returns an AANA-style route (`accept`, `revise`, `ask`, `defer`, or
 ## Current Public Benchmark Signals
+### τ²-Bench: Custom Agent Tool-Use Scaffold
+Official PR:
+https://github.com/sierra-research/tau2-bench/pull/304
+Public result artifact:
+https://huggingface.co/datasets/mindbomber/aana-tau2-bench-gpt41mini-1trial
+Benchmark:
+`sierra-research/tau2-bench`
+Evaluation date:
+`2026-05-07`
+Configuration:
+- Agent model: `openai/gpt-4.1-mini`
+- User simulator: `openai/gpt-4.1-mini`
+- Trials: `1` per task
+- Domains: `airline`, `retail`, `telecom`, `banking_knowledge`
+- Banking retrieval: `bm25`
+- Submission type: `custom`
+AANA path:
+The τ²-Bench text agent is wrapped with a pre-tool-call contract gate that
+returns `accept`, `ask`, `defer`, or `refuse` before any tool is executed.
+| Domain | Pass^1 | Avg cost |
+| --- | ---: | ---: |
+| Airline | `44.00%` | `$0.0068` |
+| Retail | `38.60%` | `$0.0097` |
+| Telecom | `17.54%` | `$0.0224` |
+| Banking knowledge | `2.06%` | `$0.0073` |
+This is an official custom-submission attempt with validated trajectories, not
+a strong performance claim. This first τ²-Bench scaffold clearly exposed a
+limitation of the current architecture: AANA improves auditability and
+pre-tool-call control, but this implementation is too blunt for many
+write-heavy, retrieval-heavy, and customer-service workflows. Follow-up work on
+AANA agent workflows should improve action-intent routing, authorization-state
+inference, and retrieval grounding, and make correction behavior less
+conservative.
 ### RAGTruth: Grounded Hallucination Gate
 Public result artifact:
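
For readers of this diff, the pre-tool-call contract gate described under "AANA path" can be illustrated with a minimal Python sketch. It is not the AANA implementation or the τ²-Bench agent API. The names `ToolCall`, `GateDecision`, `contract_gate`, and `gated_execute`, the tool sets, and the rules that map a call to `ask`, `defer`, or `refuse` are all hypothetical and chosen only to show the control flow. The only details taken from the README text are the four routes and the fact that the gate runs before tool execution.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class Route(str, Enum):
    """The four routes named in the README text."""
    ACCEPT = "accept"
    ASK = "ask"
    DEFER = "defer"
    REFUSE = "refuse"


@dataclass
class ToolCall:
    """Hypothetical representation of a tool call proposed by the agent."""
    name: str
    arguments: Dict[str, Any]


@dataclass
class GateDecision:
    route: Route
    reason: str


# Assumed policy inputs for illustration; a real contract would be richer.
WRITE_TOOLS = {"cancel_reservation", "update_address"}
BLOCKED_TOOLS = {"delete_account"}


def contract_gate(call: ToolCall, user_confirmed: bool) -> GateDecision:
    """Decide a route for a proposed tool call before anything executes."""
    if call.name in BLOCKED_TOOLS:
        return GateDecision(Route.REFUSE, "tool is not permitted by the contract")
    if any(value is None for value in call.arguments.values()):
        return GateDecision(Route.ASK, "a required argument is missing")
    if call.name in WRITE_TOOLS and not user_confirmed:
        return GateDecision(Route.DEFER, "write action lacks explicit confirmation")
    return GateDecision(Route.ACCEPT, "no contract violation found")


def gated_execute(
    call: ToolCall,
    tools: Dict[str, Callable[..., Any]],
    user_confirmed: bool = False,
) -> Dict[str, Any]:
    """Run the tool only when the gate returns `accept`; always log the route."""
    decision = contract_gate(call, user_confirmed)
    record = {"route": decision.route.value, "reason": decision.reason, "result": None}
    if decision.route is Route.ACCEPT:
        record["result"] = tools[call.name](**call.arguments)
    return record
```

In this sketch every call produces an auditable record whether or not the tool runs, which mirrors the auditability benefit the README claims. The fixed rule that defers every unconfirmed write is also an example of the bluntness the README attributes to write-heavy workflows.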