Spaces:
Running
New Benchmark Dataset
Are you maintaining an evaluation benchmark, and would you like it to be included in the eval results shortlist so that reported results appear as a leaderboard?
⭐️ Comment and link to your dataset repo and the sources using the benchmark.
Not sure what the specific requirements for inclusion are, but we would like to have this functionality for these language-specific benchmarks we've built. They're quite recent, so we don't have many sources yet beyond our own benchmarking efforts and EuroEval.
Manually translated and culturally adapted IFEval for Estonian.
https://huggingface.co/datasets/tartuNLP/ifeval_et
Manually translated and culturally adapted WinoGrande for Estonian.
https://huggingface.co/datasets/tartuNLP/winogrande_et
I'm not completely sure yet how to port the configs from LM Evaluation Harness to eval.yaml though.
Hi, we maintain Encyclo-K, a benchmark for evaluating LLMs with dynamically composed knowledge statements.
Dataset: https://huggingface.co/datasets/m-a-p/Encyclo-K
Paper: https://arxiv.org/abs/2512.24867
Leaderboard: https://encyclo-k.github.io/
We've added the eval.yaml file and would like to be included in the shortlist.
hey @yimingliang ! Everything looks great; we will add you to the shortlist and all should be set. Very impressive work on the evals! Do you think it would be possible to open PRs on the models you evaluated with the results from your leaderboard?
hey @adorkin ! Thanks for reaching out. IFEval would require custom code to run; this feature is not available yet, but will be in the future. For WinoGrande, you could absolutely make an eval.yaml file and turn it into a benchmark. You would need a small modification, though: the answer field should be either A or B instead of 1 or 2, and instead of having two columns for the choices, it would be easier to use one column with a list of choices. Then your benchmark would simply be a multichoice benchmark :)
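The column changes described above could be sketched like this (a minimal sketch; the field names `sentence`, `option1`, `option2`, and `answer` are assumed from the original WinoGrande schema and may differ in winogrande_et):

```python
def to_multichoice(row: dict) -> dict:
    """Merge the two option columns into one list of choices and
    map the "1"/"2" answer label to "A"/"B"."""
    return {
        "question": row["sentence"],
        "choices": [row["option1"], row["option2"]],  # one column, list of choices
        "answer": "A" if str(row["answer"]) == "1" else "B",
    }

# Hypothetical WinoGrande-style row for illustration.
row = {
    "sentence": "The trophy didn't fit in the suitcase because _ was too big.",
    "option1": "the trophy",
    "option2": "the suitcase",
    "answer": "1",
}
print(to_multichoice(row))
```

A transform like this could be applied once with `datasets.Dataset.map` before pushing the converted split back to the Hub.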
@SaylorTwift I see, thanks! Is the yaml expected to contain the prompt itself? It works well as a multiple-choice problem, but the formulation is a bit non-standard, because you're filling in a gap rather than answering a question.
@adorkin yes, you can set the prompt in the yaml file like so: https://huggingface.co/datasets/cais/hle/blob/main/eval.yaml, using the multiple_choice solver instead of the system prompt. Here are the docs from Inspect.
@SaylorTwift I've added the eval.yaml and a custom dataset config to work with it. The dataset viewer seems to be stuck now which may or may not be related.
https://huggingface.co/datasets/tartuNLP/winogrande_et/blob/main/eval.yaml
📋 New Benchmark: FINAL Bench — Functional Metacognitive Reasoning
Dataset: https://huggingface.co/datasets/FINAL-Bench/Metacognitive
Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in Large Language Models
(Taebong Kim, Minsik Kim, Sunyoung Choi, Jaewon Jang — currently under review)
Blog: https://huggingface.co/blog/FINAL-Bench/metacognitive
Leaderboard: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
What it measures
FINAL Bench is the first benchmark for evaluating functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Unlike MMLU/GPQA, which measure final-answer accuracy, FINAL Bench asks: "What did you do when you got it wrong?"
Key specs
- 100 tasks | 15 domains | 8 TICOS metacognitive types | 3 difficulty grades
- 5-axis rubric: MA (Metacognitive Accuracy), ER (Error Recovery), FA (Factual Accuracy), CO (Coherence), SP (Specificity)
- Hidden cognitive traps (confirmation bias, anchoring, base-rate neglect) embedded in every task
- 9 SOTA models evaluated: GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, etc.
- DOI: 10.57967/hf/7873
eval.yaml
eval.yaml has been added to the dataset repo.
We would love to be included in the benchmark shortlist! 🚀
Hey @SeaWolf-AI ! Sorry, I missed your message. We are limiting the number of benchmarks on the hub for now so that we can grow the ones we already have before adding more. However, we just noticed your "All bench leaderboard," and it's great! This is exactly what we have in mind when pushing leaderboards on the hub. Would you be up for a quick chat?
Hi @SaylorTwift ,
Thank you for the kind message. I’d be very happy to have a quick chat.
Just to clarify our setup: FINAL Bench is our standalone benchmark for functional metacognitive reasoning, while ALL Bench is our unified leaderboard that brings FINAL Bench together with other major benchmarks in one comparable view.
I’m glad to hear that ALL Bench resonates with your vision for leaderboards on the Hub. I’d love to discuss how it could fit with the OpenEvals / community evals direction, and also whether FINAL Bench itself might eventually be considered for the shortlist as the ecosystem expands.
Happy to coordinate here or by email, whichever is easier for you.
Email is best! What email can I reach you at?
Hi @SaylorTwift and the OpenEvals team,
I'd like to request that pdf-parse-bench be added to the official benchmark allowlist.
What it benchmarks: PDF parsing quality for mathematical formula and table extraction, evaluated via LLM-as-a-Judge on synthetically generated PDFs with automatic ground truth from LaTeX source.
- Dataset: https://huggingface.co/datasets/piushorn/pdf-parse-bench
- GitHub: https://github.com/phorn1/pdf-parse-bench
Why LLM-as-a-Judge? Rule-based metrics correlate poorly with human judgment. We validated this in two dedicated human annotation studies:
- Formula extraction (750 ratings): best rule-based metric r = 0.31, LLM judge r = 0.77 (https://arxiv.org/abs/2512.09874)
- Table extraction (1,500+ ratings): rule-based TEDS/GriTS top at r = 0.70, LLM judge r = 0.93 (https://arxiv.org/abs/2603.18652)
Current state:
- 22 models benchmarked on OCR parsing
- eval.yaml already present in the dataset repo
- pip-installable evaluation package: pip install pdf-parse-bench
I'm happy to submit a PR to huggingface.js to register pdf-parse-bench as a framework identifier. Please let me know if there's anything else needed.
Thanks!
Hi, we'd like to register OrgForge EpistemicBench as an official benchmark. Dataset: aeriesec/orgforge. The eval.yaml is in the repo root. We also request that orgforge-epistemicbench be added to the evaluation_framework enumerable in eval.ts.
What it measures
EpistemicBench evaluates agentic reasoning over a causally grounded enterprise corpus: not what a model knows, but how it reasons under constrained information access. Three tracks:
- PERSPECTIVE - Can a model stay within an actor's visibility cone and knowledge horizon while answering correctly? Out-of-cone tool calls are penalized even when they produce the right answer.
- COUNTERFACTUAL - Can a model identify the correct causal mechanism and traverse a cause-effect chain in the correct order?
- SILENCE - Can a model prove something didn't happen by searching the right artifact space before concluding absence? A correct "no" without evidence of search scores zero on trajectory regardless of answer correctness.
Why the scoring design is intentionally different
The primary metric is violation_adjusted_combined_score = combined_score × (1 − violation_rate)². Trajectory quality is weighted at 60–70% on two of the three tracks, so a model cannot overcome epistemic gate violations through high answer accuracy alone. This was a deliberate design decision: we think outcome-only scoring is a structural weakness of existing benchmarks and didn't want to replicate it.
Three evaluation conditions are defined in eval.yaml: gated (primary), ungated (establishes the Epistemic Tax ceiling), and zero-shot (establishes the hallucination floor). The delta between ungated and gated combined_score is the Epistemic Tax — a derived metric that has no equivalent in current benchmarks.
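As a minimal sketch (assuming violation_rate and the combined scores are plain floats in [0, 1]), the two derived metrics above reduce to:

```python
def violation_adjusted_combined_score(combined_score: float, violation_rate: float) -> float:
    """Primary metric: the combined answer+trajectory score, discounted
    quadratically by the rate of epistemic gate violations."""
    return combined_score * (1.0 - violation_rate) ** 2

def epistemic_tax(ungated_combined: float, gated_combined: float) -> float:
    """Derived metric: how much combined_score the epistemic gates cost a model."""
    return ungated_combined - gated_combined

# A model with combined_score 0.8 but a 30% violation rate keeps only
# 0.8 * 0.7^2 ≈ 0.392 of its score.
print(violation_adjusted_combined_score(0.8, 0.3))
```

The quadratic discount means even moderate violation rates dominate the final score, which matches the stated intent that answer accuracy alone cannot compensate for gate violations.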
Early results across two models (Claude Sonnet 4.6 and DeepSeek via AWS Bedrock, identical settings)
The results demonstrate exactly the kind of differentiation the benchmark was designed to surface.
| Condition | Sonnet 4.6 | DeepSeek |
|---|---|---|
| Zero-shot combined_score | 0.2483 | 0.2707 |
| Zero-shot accuracy | 0.6047 | 0.6860 |
| Gated combined_score | 0.6326 | 0.3582 |
| Gated accuracy | 0.4419 | 0.0116 |
| Gated budget_exceeded | 0/86 | 85/86 |
| Epistemic Tax | 0.0068 | 0.0060 |
DeepSeek has a stronger parametric knowledge floor, with higher zero-shot accuracy across all three tracks. Under gated agentic conditions with identical settings, Sonnet 4.6 scores 0.6326 vs DeepSeek's 0.3582. The reversal is not subtle.
DeepSeek's budget_exceeded rate (85/86 questions) is the proximate cause of its answer score collapsing to ~0.01 in the gated condition. Notably, its trajectory scores remain reasonable (0.57 overall, 0.80 on PERSPECTIVE), meaning it navigates to the right artifacts but fails to produce structured final answers within the step budget. That's a qualitatively different failure mode from low answer quality, and one that answer-only scoring would obscure entirely.
This is the benchmark's core argument in concrete form: parametric knowledge scores don't predict agentic reasoning quality.
Reproducibility
The corpus is derived from a deterministic simulation; ground truth is verifiable from the simulation state, not from human labels. We are releasing a Docker Compose file that stands up the full evaluation stack, so any third party can regenerate the corpus from the simulation seed and reproduce scores end-to-end. The paper is at arXiv:2603.14997.
Alignment with your goals
The Community Evals announcement identified a clear gap between benchmark scores and real-world agentic performance. This benchmark was built specifically to address that gap, and the Sonnet/DeepSeek results above show it surfaces signal that existing benchmarks don't.
Happy to answer questions on methodology, the harness, or the scoring design.
