ORBIT-v0.1 Collection A collection of ORBIT training datasets and search agents • 4 items • Updated 10 days ago • 2
Article Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries +7 Mar 10 • 124
🤏 Smol-Data Collection Tried and tested mixes for strong pretraining. Inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing • 14 items • Updated Mar 2 • 12
RexBERT: Context Specialized Bidirectional Encoders for E-commerce Paper • 2602.04605 • Published Feb 4 • 2
Article Nano-BEIR: A Multilingual Information Retrieval Benchmark with Quality-Enhanced Queries Dec 22, 2025 • 9
Bharat-NanoBEIR: Indian Language Retrieval Benchmarks Collection NanoBEIR retrieval benchmarks translated into 22 Indian languages across 13 datasets. • 22 items • Updated Dec 13, 2025 • 5
Bharat-NanoBEIR Collection Indian Language Information Retrieval Dataset • 286 items • Updated Jan 26, 2025 • 2
M3DR: Towards Universal Multilingual Multimodal Document Retrieval Paper • 2512.03514 • Published Dec 3, 2025 • 9
Ministral 3 Collection A collection of edge models, with Base, Instruct and Reasoning variants, in 3 different sizes: 3B, 8B and 14B. All with vision capabilities. • 9 items • Updated Dec 2, 2025 • 164
Article Introducing MTEB v2: Evaluation of embedding and retrieval systems for more than just text Oct 20, 2025 • 38
Article Welcome EmbeddingGemma, Google's new efficient embedding model +4 Sep 4, 2025 • 273
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent Paper • 2508.06600 • Published Aug 8, 2025 • 42
Article Training and Finetuning Sparse Embedding Models with Sentence Transformers v5 Jul 1, 2025 • 137
Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval Paper • 2505.16967 • Published May 22, 2025 • 24
RLHN Datasets Collection RLHN: Cleaned Training Datasets with False Negatives Identified & Relabeled as ground truth. • 5 items • Updated May 23, 2025 • 4
Multilingual SFT & DPO Datasets Collection These SFT or DPO datasets were translated from English using Mistral-7B-Instruct-v0.2 or taken from other sources. • 8 items • Updated Mar 31, 2025 • 3