Seeking arXiv cs.AI Endorser for HalluMaze Paper

by Be2Jay - opened 26 days ago

Be2Jay

Owner 26 days ago

arXiv Endorsement Request

Hi HuggingFace community! 👋

I'm the author of HalluMaze — a maze navigation benchmark that measures metacognitive error recovery in LLMs. Our key finding: all 10 tested LLMs score significantly below a random walk baseline (p<0.001), revealing a systematic deficit across all model families and cost tiers.

I'm seeking an arXiv endorser for cs.AI to submit the paper.

Paper Summary

HalluMaze: A Maze Navigation Benchmark for LLM Metacognitive Error Recovery

10 LLMs × 60 trials = 600 total trials
Frontier cost does NOT predict metacognitive recovery: GPT-4o ranks last (MEI=0.315), Claude-3.7-Sonnet ranks first (MEI=0.774)
Random Walk baseline (MEI=0.900) outperforms all LLMs (Glass's δ = 0.6–2.1, all p<0.001)

Rank	Model	MEI
—	Random Walk ★	0.900
1	Claude-3.7-Sonnet	0.774
2	GLM-4.7	0.615
10	GPT-4o	0.315
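
For readers unfamiliar with the effect size quoted above, here is a minimal sketch of how Glass's δ is conventionally computed: the difference in group means divided by the sample standard deviation of one reference group only. The function name `glass_delta`, the choice of which group serves as the reference, and the per-trial scores below are all illustrative assumptions. They are not taken from the HalluMaze code or dataset.

```python
from statistics import mean, stdev


def glass_delta(group, reference):
    """Glass's delta: mean difference scaled by the reference group's sample SD.

    Unlike Cohen's d, only the reference group's standard deviation is used,
    which is the usual choice when the two groups have unequal variances.
    """
    sd_reference = stdev(reference)  # sample SD (n - 1 denominator)
    if sd_reference == 0:
        raise ValueError("reference group has zero variance")
    return (mean(group) - mean(reference)) / sd_reference


# Hypothetical per-trial scores, invented purely for illustration.
random_walk_scores = [0.9, 1.0, 0.8, 0.9, 1.0, 0.8]
model_scores = [0.7, 0.8, 0.6, 0.9, 0.7, 0.8]

# Assumption: the LLM's trials are treated as the reference group here.
delta = glass_delta(random_walk_scores, model_scores)
print(f"Glass's delta (random walk vs. model): {delta:.2f}")
```

A positive value under this convention means the random walk scored higher on average than the model, measured in units of the model's own trial-to-trial spread.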

📊 Demo: https://huggingface.co/spaces/Be2Jay/hallumaze
📦 Dataset: https://huggingface.co/datasets/Be2Jay/hallumaze-benchmark
💻 GitHub: https://github.com/jaytoone/HalluMaze

Endorsement Code

arXiv endorsement code: UCHUCB (cs.AI)

If you are qualified to endorse (3+ cs.* papers on arXiv within the last 5 years), I would be very grateful for your support!

To endorse: please email me at be2jay67@gmail.com and I'll forward the arXiv endorsement request email to you.

Thank you! 🙏
