Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs Paper • 2604.10480 • Published 7 days ago • 20
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale Paper • 2604.04771 • Published 13 days ago • 120
Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training Paper • 2603.07223 • Published Mar 7 • 13
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning Paper • 2602.11089 • Published Feb 11 • 18
Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility Paper • 2601.17027 • Published Jan 17 • 42
Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets Paper • 2601.09733 • Published Dec 30, 2025 • 9
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies Paper • 2512.19673 • Published Dec 22, 2025 • 66
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value Paper • 2512.14051 • Published Dec 16, 2025 • 47
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models Paper • 2511.11134 • Published Nov 14, 2025 • 33
Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning Paper • 2507.17512 • Published Jul 23, 2025 • 37
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once Paper • 2507.10541 • Published Jul 14, 2025 • 30
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Article • Mar 20, 2024 • 113
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts Paper • 2504.21117 • Published Apr 29, 2025 • 26
CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges Paper • 2504.19093 • Published Apr 27, 2025 • 18
A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis Paper • 2504.12322 • Published Apr 11, 2025 • 28
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding Paper • 2504.09925 • Published Apr 14, 2025 • 39
MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion Paper • 2503.16212 • Published Mar 20, 2025 • 25
MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer Paper • 2503.14891 • Published Mar 19, 2025 • 22
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception Paper • 2410.12628 • Published Oct 16, 2024 • 41