Paper
• 2501.14249
• Published • 77
Note Currently the hardest benchmark
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Paper
• 2206.04615
• Published • 6
Note *BB* => BBH => BBEH
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Paper
• 2210.09261
• Published • 1
Note BB => *BBH* => BBEH
BIG-Bench Extra Hard
Paper
• 2502.19187
• Published • 10
Note BB => BBH => *BBEH*
Measuring Massive Multitask Language Understanding
Paper
• 2009.03300
• Published • 3
Note The original MMLU
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Paper
• 2406.01574
• Published • 54
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Paper
• 2311.12022
• Published • 36
Instruction-Following Evaluation for Large Language Models
Paper
• 2311.07911
• Published • 22
Note IFEval
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper
• 2310.06770
• Published • 9
Note Coding benchmark
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Paper
• 2412.15204
• Published • 39
Note Best for long context (as of July 2025). Long context here means at least 8K.