Seeking arXiv cs.AI Endorser for HalluMaze Paper

#1
by Be2Jay - opened

arXiv Endorsement Request

Hi HuggingFace community! 👋

I'm the author of HalluMaze, a maze navigation benchmark that measures metacognitive error recovery in LLMs. Our key finding: all 10 tested LLMs score significantly below a random walk baseline (p<0.001), revealing a systematic deficit across all model families and cost tiers.

I'm seeking an arXiv endorser for cs.AI to submit the paper.


Paper Summary

HalluMaze: A Maze Navigation Benchmark for LLM Metacognitive Error Recovery

  • 10 LLMs × 60 trials = 600 total trials
  • Frontier cost does NOT predict metacognitive recovery: GPT-4o ranks last (MEI=0.315), Claude-3.7-Sonnet ranks first (MEI=0.774)
  • Random Walk baseline (MEI=0.900) outperforms all LLMs (Glass's δ = 0.6–2.1, all p<0.001)
Rank  Model              MEI
-     Random Walk ★      0.900
1     Claude-3.7-Sonnet  0.774
2     GLM-4.7            0.615
10    GPT-4o             0.315
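
For anyone unfamiliar with the effect size quoted above: Glass's δ is the difference between two group means divided by the standard deviation of the control group only. Here is a minimal sketch of that calculation. It is not the HalluMaze analysis code. The function name `glass_delta`, the choice of which group counts as the control, and the scores are all placeholders made up for illustration.

```python
from statistics import mean, stdev


def glass_delta(treatment, control):
    """Glass's delta: mean difference scaled by the control group's sample SD."""
    sd_control = stdev(control)  # sample standard deviation (n - 1 denominator)
    if sd_control == 0:
        raise ValueError("control group has zero variance")
    return (mean(treatment) - mean(control)) / sd_control


# Placeholder scores for illustration only; these are NOT HalluMaze results.
control_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
treatment_scores = [3.0, 4.0, 5.0, 6.0, 7.0]

delta = glass_delta(treatment_scores, control_scores)
print(f"Glass's delta = {delta:.3f}")
```

Unlike Cohen's d, this uses only one group's spread as the yardstick, which is convenient when the two groups have very different variances.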

📊 Demo: https://huggingface.co/spaces/Be2Jay/hallumaze
📦 Dataset: https://huggingface.co/datasets/Be2Jay/hallumaze-benchmark
💻 GitHub: https://github.com/jaytoone/HalluMaze


Endorsement Code

arXiv endorsement code: UCHUCB (cs.AI)

If you are qualified to endorse (3+ cs.* papers on arXiv within the last 5 years), I would be very grateful for your support!

To endorse: please email me at be2jay67@gmail.com and I'll forward the arXiv endorsement request email to you.

Thank you! 🙏
