C-Code-Large is a large-scale corpus of C source code comprising more than 4 million code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, and software engineering automation for the C ecosystem.
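As a minimal sketch of working with the .jsonl format, the snippet below streams the first few records without loading the whole corpus into memory. The file name `c-code-large.jsonl` and the `code` field are assumptions for illustration; check the actual file name and record schema of your copy of the dataset.

```python
import json
from itertools import islice

# "c-code-large.jsonl" and the "code" field are assumed names; the actual
# file name and record schema may differ in your copy of the dataset.
with open("c-code-large.jsonl", "r", encoding="utf-8") as f:
    for line in islice(f, 3):  # stream a few records; never load all 4M+
        record = json.loads(line)
        print(record.get("code", "")[:200])  # preview the first 200 chars
```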
By offering a high-volume, language-focused dataset, C-Code-Large enables targeted experimentation in low-level programming, memory-constrained environments, and performance-critical systems, where C continues to be a dominant language.
C-Code-Large addresses the lack of large, curated, C-specific datasets, making it possible to conduct focused research on procedural programming paradigms, manual memory management, and system-level abstractions.
These samples were created using reservoir sampling, an algorithm that guarantees a uniform, statistically unbiased random sample from a massive source dataset. This means that results you get at the 1B-token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational cost of full-scale runs.
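The section doesn't show the sampling code itself, but the classic single-pass form of reservoir sampling (Algorithm R) is short enough to sketch. This is an illustrative implementation, not the exact pipeline used to build the samples:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length, in one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k/(i+1),
            # so every stream item ends up sampled with probability k/n.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. a uniform 5-element sample from a million-element stream
print(reservoir_sample(range(1_000_000), 5))
```

Because every item survives with equal probability regardless of stream length, a 1B-token reservoir preserves the statistical profile of the full source corpus, which is what makes small-scale results transfer.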
The collection includes:

- finePDFs-1B: high-quality, textbook-style educational content
- DCLM-baseline-1B: filtered, diverse web content
- FineWeb-Edu-1B: curated educational web resources
We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.
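To make the 50-30-20 mixture concrete, here is a minimal, hypothetical sketch of weighted stream interleaving. The function name and toy streams are illustrative only, and the actual experiments may have mixed data differently (for example by token counts rather than example counts):

```python
import random

def mix_streams(streams, weights, seed=0):
    """Interleave examples from several dataset streams in fixed
    proportions (e.g. 0.5 / 0.3 / 0.2). A sketch of the mixing idea,
    not the code behind the experiments described above."""
    rng = random.Random(seed)
    iters = [iter(s) for s in streams]
    weights = list(weights)  # copy so we can drop exhausted streams
    while iters:
        i = rng.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[i])
        except StopIteration:
            # Drop an exhausted stream and keep mixing the rest.
            del iters[i]
            del weights[i]

# Toy usage with stand-in streams for finePDFs / DCLM-baseline / FineWeb-Edu:
mixed = mix_streams(
    [["pdf"] * 5, ["dclm"] * 3, ["web"] * 2],
    weights=[0.5, 0.3, 0.2],
)
print(list(mixed))
```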
Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.