mindbomber committed on
Commit
a0fb451
·
verified ·
1 Parent(s): 2646360

Add tau2-bench custom scaffold result

  1. README.md +43 -0
README.md CHANGED
@@ -33,6 +33,7 @@ datasets:
33
  - mindbomber/aana-head-to-head-llm-judge-vs-aana
34
  - mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
35
  - mindbomber/aana-external-validity-hermes-head-to-head
 
36
  metrics:
37
  - accuracy
38
  - f_beta
@@ -111,6 +112,48 @@ The demo returns an AANA-style route (`accept`, `revise`, `ask`, `defer`, or
111
 
112
  ## Current Public Benchmark Signals
113
 
114
  ### RAGTruth: Grounded Hallucination Gate
115
 
116
  Public result artifact:
 
33
  - mindbomber/aana-head-to-head-llm-judge-vs-aana
34
  - mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
35
  - mindbomber/aana-external-validity-hermes-head-to-head
36
+ - mindbomber/aana-tau2-bench-gpt41mini-1trial
37
  metrics:
38
  - accuracy
39
  - f_beta
 
112
 
113
  ## Current Public Benchmark Signals
114
 
115
+ ### τ²-Bench: Custom Agent Tool-Use Scaffold
116
+
117
+ Official PR:
118
+ https://github.com/sierra-research/tau2-bench/pull/304
119
+
120
+ Public result artifact:
121
+ https://huggingface.co/datasets/mindbomber/aana-tau2-bench-gpt41mini-1trial
122
+
123
+ Benchmark:
124
+ `sierra-research/tau2-bench`
125
+
126
+ Evaluation date:
127
+ `2026-05-07`
128
+
129
+ Configuration:
130
+
131
+ - Agent model: `openai/gpt-4.1-mini`
132
+ - User simulator: `openai/gpt-4.1-mini`
133
+ - Trials: `1` per task
134
+ - Domains: `airline`, `retail`, `telecom`, `banking_knowledge`
135
+ - Banking retrieval: `bm25`
136
+ - Submission type: `custom`
137
+
138
+ AANA path:
139
+ wrap the τ²-Bench text agent with a pre-tool-call contract gate that returns
140
+ `accept`, `ask`, `defer`, or `refuse` before tool execution.
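
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the implementation in the linked PR: the tool names, the `REQUIRED_ARGS` table, the `user_confirmed` flag, and the rule that maps each condition to a route are all assumptions made for the example. Only the four route names come from the README text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class Route(str, Enum):
    """The four routes named in the README."""
    ACCEPT = "accept"
    ASK = "ask"
    DEFER = "defer"
    REFUSE = "refuse"


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]


@dataclass
class GateDecision:
    route: Route
    reason: str


# Hypothetical policy tables, for illustration only. A real gate would
# derive these from the benchmark domain's policy and tool schema.
WRITE_TOOLS = {"cancel_reservation", "update_order"}
REQUIRED_ARGS = {
    "cancel_reservation": ["reservation_id"],
    "update_order": ["order_id"],
}
KNOWN_TOOLS = WRITE_TOOLS | {"get_order_details"}


def contract_gate(call: ToolCall, user_confirmed: bool) -> GateDecision:
    """Decide a route for a proposed tool call before it is executed."""
    if call.name not in KNOWN_TOOLS:
        return GateDecision(Route.REFUSE, "tool is not in the allowed set")
    missing = [a for a in REQUIRED_ARGS.get(call.name, []) if not call.arguments.get(a)]
    if missing:
        return GateDecision(Route.ASK, "missing arguments: " + ", ".join(missing))
    if call.name in WRITE_TOOLS and not user_confirmed:
        return GateDecision(Route.DEFER, "write action lacks explicit user confirmation")
    return GateDecision(Route.ACCEPT, "contract checks passed")


def gated_execute(
    call: ToolCall,
    user_confirmed: bool,
    execute: Callable[[ToolCall], Any],
) -> Dict[str, Any]:
    """Run the tool only when the gate accepts; otherwise return the route and reason."""
    decision = contract_gate(call, user_confirmed)
    if decision.route is Route.ACCEPT:
        return {"route": decision.route.value, "result": execute(call)}
    return {"route": decision.route.value, "reason": decision.reason}
```

Every non-`accept` route returns before `execute` is called, so a blocked call never reaches the tool. A gate this simple also shows how an overly strict rule (for example, deferring every write) could depress task completion on write-heavy domains.
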
141
+
142
+ | Domain | Pass^1 | Avg cost |
143
+ | --- | ---: | ---: |
144
+ | Airline | `44.00%` | `$0.0068` |
145
+ | Retail | `38.60%` | `$0.0097` |
146
+ | Telecom | `17.54%` | `$0.0224` |
147
+ | Banking knowledge | `2.06%` | `$0.0073` |
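
For readers unfamiliar with the Pass^1 column, the sketch below shows one way to compute pass^k from per-task trial outcomes. It assumes the definition used in the τ-bench line of work, where pass^k is the task-averaged value of C(c, k) / C(n, k) for c successes in n trials. The function name and the input layout are hypothetical, and the sample outcomes are made up for illustration; they are not taken from the result artifact.

```python
from math import comb
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), where c of n trials succeeded."""
    if not results:
        raise ValueError("results must contain at least one task")
    total = 0.0
    for task_id, outcomes in results.items():
        n = len(outcomes)
        if k < 1 or k > n:
            raise ValueError(f"k={k} is out of range for task {task_id} with {n} trials")
        c = sum(outcomes)  # number of successful trials for this task
        total += comb(c, k) / comb(n, k)
    return total / len(results)


# Hypothetical outcomes with one trial per task, as in the configuration above.
example = {"task_1": [True], "task_2": [False], "task_3": [True], "task_4": [False]}
print(pass_hat_k(example, k=1))  # 0.5
```

With one trial per task, C(c, 1) / C(1, 1) is 1 for a passed task and 0 for a failed one, so Pass^1 is just the fraction of tasks that passed on their single attempt. It carries no information about run-to-run consistency.
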
148
+
149
+ This is an official custom-submission attempt with validated trajectories, not
150
+ a strong performance claim. The first τ²-Bench scaffold exposed the current
151
+ architecture limitation clearly: AANA improves auditability and pre-tool-call
152
+ control, but this implementation is too blunt for many write-heavy,
153
+ retrieval-heavy, and customer-service workflows. The next AANA agent-workflow
154
+ work should improve action-intent routing, authorization-state inference,
155
+ retrieval grounding, and correction behavior, which is currently too conservative.
156
+
157
  ### RAGTruth: Grounded Hallucination Gate
158
 
159
  Public result artifact: