Collections
Discover the best community collections!
Collections including paper arxiv:2502.13595

- Self-Boosting Large Language Models with Synthetic Preference Data
  Paper • 2410.06961 • Published • 16
- Qwen2.5 Technical Report
  Paper • 2412.15115 • Published • 377
- SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
  Paper • 2412.13649 • Published • 21
- NeoBERT: A Next-Generation BERT
  Paper • 2502.19587 • Published • 38

- CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
  Paper • 2406.08587 • Published • 16
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
  Paper • 2406.09170 • Published • 27
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
  Paper • 2407.18901 • Published • 35
- Benchmarking Agentic Workflow Generation
  Paper • 2410.07869 • Published • 29

- EuroEval Leaderboard
  📊 Space • 7 likes • The robust European language model benchmark.
- ScandEval: A Benchmark for Scandinavian Natural Language Processing
  Paper • 2304.00906 • Published • 4
- MTEB Leaderboard
  🥇 Space • 7.28k likes • Embedding Leaderboard
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding
  Paper • 2406.02396 • Published

- MAEB: Massive Audio Embedding Benchmark
  Paper • 2602.16008 • Published • 22
- HUME: Measuring the Human-Model Performance Gap in Text Embedding Task
  Paper • 2510.10062 • Published • 10
- MMTEB: Massive Multilingual Text Embedding Benchmark
  Paper • 2502.13595 • Published • 48
- MIEB: Massive Image Embedding Benchmark
  Paper • 2504.10471 • Published • 21

- Offline Reinforcement Learning for LLM Multi-Step Reasoning
  Paper • 2412.16145 • Published • 38
- SPLADE-v3: New baselines for SPLADE
  Paper • 2403.06789 • Published • 5
- MMTEB: Massive Multilingual Text Embedding Benchmark
  Paper • 2502.13595 • Published • 48
- Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
  Paper • 2409.14683 • Published • 11

- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 247
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 40
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  Paper • 2404.06654 • Published • 40

- LoRA+: Efficient Low Rank Adaptation of Large Models
  Paper • 2402.12354 • Published • 7
- The FinBen: An Holistic Financial Benchmark for Large Language Models
  Paper • 2402.12659 • Published • 24
- TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
  Paper • 2402.13249 • Published • 15
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 69