LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
Abstract
A Chinese logical reasoning benchmark for large language models is introduced, featuring expert-verified natural-language items with formal annotations and adversarial hardening to better evaluate rule-governed reasoning capabilities.
Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.
Community
LLMEval-Logic is a forward-authored Chinese logical reasoning benchmark, Z3-audited end-to-end. Items are written from real situational scenarios (not back-templated from formulas), every Base item is paired with a gold first-order-logic formalization that is double-checked by the Z3 SMT solver plus an expert rubric, and selected items are then adversarially hardened by a closed-loop agent workflow.
We evaluate 14 frontier LLMs across 7 families under thinking / no-thinking. Headline result: the strongest model (Gemini 3.1 Pro, thinking) reaches only 37.5% Item Accuracy on the Hard subset, with substantial Sub-Q → Item gaps showing models can solve individual sub-questions but fail to maintain a coherent closed candidate space across chained queries. We also find a clear Base ↔ Hard rank inversion among thinking variants (Spearman ρ = −0.61).
Released as 80% public + 20% private contamination-resistant holdout, with code, dataset, and rubrics:
Get this paper in your agent:
hf papers read 2605.19597 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper