Add tau2-bench custom scaffold result
README.md
CHANGED
@@ -33,6 +33,7 @@ datasets:
 - mindbomber/aana-head-to-head-llm-judge-vs-aana
 - mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
 - mindbomber/aana-external-validity-hermes-head-to-head
+- mindbomber/aana-tau2-bench-gpt41mini-1trial
 metrics:
 - accuracy
 - f_beta
@@ -111,6 +112,48 @@ The demo returns an AANA-style route (`accept`, `revise`, `ask`, `defer`, or
 
 ## Current Public Benchmark Signals
 
+### τ²-Bench: Custom Agent Tool-Use Scaffold
+
+Official PR:
+https://github.com/sierra-research/tau2-bench/pull/304
+
+Public result artifact:
+https://huggingface.co/datasets/mindbomber/aana-tau2-bench-gpt41mini-1trial
+
+Benchmark:
+`sierra-research/tau2-bench`
+
+Evaluation date:
+`2026-05-07`
+
+Configuration:
+
+- Agent model: `openai/gpt-4.1-mini`
+- User simulator: `openai/gpt-4.1-mini`
+- Trials: `1` per task
+- Domains: `airline`, `retail`, `telecom`, `banking_knowledge`
+- Banking retrieval: `bm25`
+- Submission type: `custom`
+
+AANA path:
+wrap the τ²-Bench text agent with a pre-tool-call contract gate that returns
+`accept`, `ask`, `defer`, or `refuse` before tool execution.
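The pre-tool-call gate described in the "AANA path" lines could look roughly like the minimal Python sketch below. It is not taken from the AANA or τ²-Bench code. The tool names, the `READ_ONLY_TOOLS` and `WRITE_TOOLS` tables, the `user_confirmed` flag, and the rule mapping each situation to a route are all hypothetical, chosen only to show a decision being made before a tool runs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, Optional, Tuple


class Route(str, Enum):
    ACCEPT = "accept"
    ASK = "ask"
    DEFER = "defer"
    REFUSE = "refuse"


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]


@dataclass
class GateDecision:
    route: Route
    reason: str


# Hypothetical contract tables (assumptions, not the real AANA contract).
READ_ONLY_TOOLS = {"get_reservation_details"}
WRITE_TOOLS = {"cancel_reservation": ("reservation_id",)}


def contract_gate(call: ToolCall, user_confirmed: bool) -> GateDecision:
    """Decide a route for a proposed tool call without executing it."""
    if call.name in READ_ONLY_TOOLS:
        return GateDecision(Route.ACCEPT, "read-only tool")
    required = WRITE_TOOLS.get(call.name)
    if required is None:
        return GateDecision(Route.REFUSE, "tool is not covered by the contract")
    missing = [key for key in required if not call.arguments.get(key)]
    if missing:
        return GateDecision(Route.ASK, f"missing arguments: {missing}")
    if not user_confirmed:
        return GateDecision(Route.DEFER, "write action lacks user confirmation")
    return GateDecision(Route.ACCEPT, "contract satisfied")


def gated_execute(
    call: ToolCall,
    execute: Callable[[ToolCall], Any],
    user_confirmed: bool = False,
) -> Tuple[GateDecision, Optional[Any]]:
    """Run the tool only when the gate returns `accept`."""
    decision = contract_gate(call, user_confirmed)
    if decision.route is Route.ACCEPT:
        return decision, execute(call)
    return decision, None
```

Returning the decision alongside the result keeps every routed call available for audit, including calls that were never executed.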
+
+| Domain | Pass^1 | Avg cost |
+| --- | ---: | ---: |
+| Airline | `44.00%` | `$0.0068` |
+| Retail | `38.60%` | `$0.0097` |
+| Telecom | `17.54%` | `$0.0224` |
+| Banking knowledge | `2.06%` | `$0.0073` |
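For readers unfamiliar with the `Pass^1` column, the sketch below shows one way such a score can be computed. It assumes the column follows the pass^k estimator from the τ-bench paper: for each task with `c` successes in `n` trials, take `C(c, k) / C(n, k)`, then average over tasks. The function name and the sample inputs are hypothetical and are not the benchmark's own code or data.

```python
from math import comb
from typing import Sequence


def pass_hat_k(successes_per_task: Sequence[int], num_trials: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), where c is that task's success count."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("at least one task is required")
    denominator = comb(num_trials, k)
    total = sum(comb(c, k) / denominator for c in successes_per_task)
    return total / len(successes_per_task)


# Illustrative input only: four tasks, one trial each, three of them solved.
print(pass_hat_k([1, 0, 1, 1], num_trials=1, k=1))  # 0.75
```

With one trial per task, as in the configuration listed in this diff, the estimator reduces to the fraction of tasks solved.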
+
+This is an official custom-submission attempt with validated trajectories, not
+a strong performance claim. The first τ²-Bench scaffold exposed the current
+architecture limitation clearly: AANA improves auditability and pre-tool-call
+control, but this implementation is too blunt for many write-heavy,
+retrieval-heavy, and customer-service workflows. Follow-up AANA agent-workflow
+work should improve action-intent routing, authorization-state inference, and
+retrieval grounding, and make correction behavior less conservative.
+
 ### RAGTruth: Grounded Hallucination Gate
 
 Public result artifact: