Croc-Prog-HF's Collections
Benchmark datasets
Updated 1 day ago
cais/hle • Benchmark • Updated Jan 20 • 2.5k • 46.3k • 778
Qwen/DeepPlanning • Viewer • Updated Mar 3 • 2.14k • 688 • 195
gaia-benchmark/GAIA • Viewer • Updated Oct 28, 2025 • 932 • 20.4k • 644
BLINK-Benchmark/BLINK • Viewer • Updated Sep 3, 2025 • 3.81k • 16.3k • 41
openai/gsm8k • Benchmark • Updated 25 days ago • 17.6k • 788k • 1.26k
allenai/olmOCR-bench • Benchmark • Updated Feb 19 • 4.39k • 188
TIGER-Lab/MMLU-Pro • Benchmark • Updated Mar 11 • 12.1k • 108k • 467
openai/openai_humaneval • Viewer • Updated Jan 4, 2024 • 164 • 216k • 379
Muennighoff/mbpp • Viewer • Updated Oct 20, 2022 • 1.4k • 1.89k • 23
bigcode/bigcodebench • Viewer • Updated Apr 30, 2025 • 5.7k • 46.8k • 77
livecodebench/test_generation • Viewer • Updated Jun 13, 2024 • 442 • 1.2k • 7
ScaleAI/SWE-bench_Pro • Benchmark • Updated Feb 23 • 731 • 413k • 88
SWE-bench/SWE-bench_Verified • Benchmark • Updated Feb 27 • 500 • 131k • 31
mteb/arguana • Benchmark • Updated about 4 hours ago • 11.5k • 15.3k • 5
MathArena/hmmt_feb_2026 • Benchmark • Updated Feb 19 • 33 • 2.32k • 1
Idavidrein/gpqa • Benchmark • Updated Mar 5 • 1.25k • 100k • 415
likaixin/ScreenSpot-Pro • Benchmark • Updated about 1 month ago • 10.8k • 60
harborframework/terminal-bench-2.0 • Benchmark • Updated Feb 17 • 3.84k • 24
FutureMa/EvasionBench • Benchmark • Updated Feb 19 • 16.7k • 262 • 85
internlm/WildClawBench • Updated 15 days ago • 12.1k • 50
FINAL-Bench/World-Model • Viewer • Updated 19 days ago • 100 • 2.36k • 28
llamaindex/ParseBench • Benchmark • Updated 3 days ago • 169k • 5.61k • 46
mteb/BRIGHT • Benchmark • Updated 15 days ago • 1.35M • 924 • 2
Delores-Lin/MDPBench • Benchmark • Updated 8 days ago • 23.3k • 12
collinear-ai/yc-bench • Benchmark • Updated 25 days ago • 157 • 15
nvidia/compute-eval • Benchmark • Updated 28 days ago • 2.46k • 4.49k • 22
Jackrong/Qwen3.5-reasoning-700x • Viewer • Updated Mar 2 • 633 • 6.88k • 107