Seeking arXiv cs.AI Endorser for HalluMaze Paper
arXiv Endorsement Request
Hi HuggingFace community!
I'm the author of HalluMaze, a maze navigation benchmark that measures metacognitive error recovery in LLMs. Our key finding: all 10 tested LLMs score significantly below a random walk baseline (p<0.001), revealing a systematic deficit across all model families and cost tiers.
I'm seeking an arXiv endorser for cs.AI to submit the paper.
Paper Summary
HalluMaze: A Maze Navigation Benchmark for LLM Metacognitive Error Recovery
- 10 LLMs × 60 trials = 600 total trials
- Frontier cost does NOT predict metacognitive recovery: GPT-4o ranks last (MEI=0.315), Claude-3.7-Sonnet ranks first (MEI=0.774)
- Random Walk baseline (MEI=0.900) outperforms all LLMs (Glass's δ = 0.6–2.1, all p<0.001)
| Rank | Model | MEI |
|---|---|---|
| - | Random Walk (baseline) | 0.900 |
| 1 | Claude-3.7-Sonnet | 0.774 |
| 2 | GLM-4.7 | 0.615 |
| 10 | GPT-4o | 0.315 |
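For readers unfamiliar with the effect size quoted above, here is a minimal sketch of how Glass's δ is commonly computed: the difference in group means divided by the sample standard deviation of one reference group. This is not the code used for the paper. The function name `glass_delta`, the choice of which group supplies the standard deviation, and the scores below are all hypothetical and for illustration only.

```python
from statistics import mean, stdev


def glass_delta(group, reference):
    """Glass's delta: mean difference scaled by the reference group's sample SD.

    Unlike Cohen's d, only one group's standard deviation is used,
    which is useful when the two groups have very different variances.
    """
    return (mean(group) - mean(reference)) / stdev(reference)


# Hypothetical per-trial scores, NOT data from the HalluMaze paper.
random_walk_scores = [1.0, 0.8, 0.9, 1.0, 0.8]
model_scores = [0.5, 0.7, 0.6, 0.8, 0.4]

# Assumption: the model's trials serve as the reference group here.
effect = glass_delta(random_walk_scores, model_scores)
print(f"Glass's delta = {effect:.2f}")
```

A positive value means the first group scored higher on average than the reference group, measured in units of the reference group's standard deviation.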
Demo: https://huggingface.co/spaces/Be2Jay/hallumaze
Dataset: https://huggingface.co/datasets/Be2Jay/hallumaze-benchmark
GitHub: https://github.com/jaytoone/HalluMaze
Endorsement Code
arXiv endorsement code: UCHUCB (cs.AI)
If you are qualified to endorse (3+ cs.* papers on arXiv within the last 5 years), I would be very grateful for your support!
To endorse: please email me at be2jay67@gmail.com and I'll forward the arXiv endorsement request email to you.
Thank you!