cabinet-data_curation
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Paper
• 2507.01352
• Published • 60
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
Paper
• 2507.13563
• Published • 53
Scaling Laws for Optimal Data Mixtures
Paper
• 2507.09404
• Published • 38
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper
• 2511.14993
• Published • 233
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Paper
• 2512.16676
• Published • 222
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Paper
• 2512.04324
• Published • 159
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Paper
• 2512.14051
• Published • 47
DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
Paper
• 2511.06307
• Published • 53
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Paper
• 2511.18050
• Published • 38
FineVision: Open Data Is All You Need
Paper
• 2510.17269
• Published • 80
A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Paper
• 2510.23587
• Published • 67
RAG-Anything: All-in-One RAG Framework
Paper
• 2510.12323
• Published • 73
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
Paper
• 2508.21148
• Published • 142
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Paper
• 2509.18154
• Published • 56
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Paper
• 2508.01191
• Published • 240
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Paper
• 2508.10975
• Published • 60
Alchemist: Turning Public Text-to-Image Data into Generative Gold
Paper
• 2505.19297
• Published • 84
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Paper
• 2509.09680
• Published • 44
Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
Paper
• 2512.16905
• Published • 32
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Paper
• 2511.12609
• Published • 106
Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
Paper
• 2511.16043
• Published • 111
Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models
Paper
• 2512.00590
• Published • 52
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
Paper
• 2510.16872
• Published • 112
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Paper
• 2509.12201
• Published • 107
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Paper
• 2503.18406
• Published • 3