ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs • Paper • 2510.00857 • Published Oct 1, 2025
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures • Paper • 2510.24081 • Published Oct 28, 2025
Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens • Paper • 2108.11193 • Published Jun 8, 2022
ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery • Paper • 2604.09237 • Published 8 days ago
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs • Paper • 2507.07186 • Published Jul 9, 2025
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation • Paper • 2503.01622 • Published Mar 3, 2025
Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models • Paper • 2502.08130 • Published Feb 12, 2025
Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs • Paper • 2502.12964 • Published Feb 18, 2025