Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs Paper • 2604.10480 • Published 6 days ago • 19
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models Paper • 2604.04707 • Published 12 days ago • 200
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale Paper • 2604.04771 • Published 12 days ago • 120
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models Paper • 2603.26164 • Published 21 days ago • 353
Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training Paper • 2603.07223 • Published Mar 7 • 13
MMFineReason Collection • High-quality STEM reasoning dataset for multimodal LLM post-training • 8 items • Updated 17 days ago • 22
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods Paper • 2601.21821 • Published Jan 29 • 62
Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility Paper • 2601.17027 • Published Jan 17 • 42
ODA-Mixture Collection • High-quality mixture datasets for post-training covering multiple domains • 7 items • Updated 17 days ago • 5
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch Paper • 2601.13606 • Published Jan 20 • 12
Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets Paper • 2601.09733 • Published Dec 30, 2025 • 9
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI Paper • 2512.16676 • Published Dec 18, 2025 • 222
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value Paper • 2512.14051 • Published Dec 16, 2025 • 47
Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning Paper • 2510.04081 • Published Oct 5, 2025 • 23
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Paper • 2509.22186 • Published Sep 26, 2025 • 159
From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature Paper • 2509.16591 • Published Sep 20, 2025 • 2
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning Paper • 2506.01939 • Published Jun 2, 2025 • 190
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles Paper • 2505.23590 • Published May 29, 2025 • 25