C-Code-Large is a large-scale corpus of C source code comprising more than 4 million code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, and software engineering automation for the C ecosystem.
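As a minimal sketch of working with the .jsonl format, the snippet below streams the first few records without loading the whole corpus into memory. The file name `c-code-large.jsonl` and the `code` field are assumptions for illustration; check the actual file name and record schema of your copy of the dataset.

```python
import json
from itertools import islice

# "c-code-large.jsonl" and the "code" field are assumed names; the actual
# file name and record schema may differ in your copy of the dataset.
with open("c-code-large.jsonl", "r", encoding="utf-8") as f:
    for line in islice(f, 3):  # stream a few records; never load all 4M+
        record = json.loads(line)
        print(record.get("code", "")[:200])  # preview the first 200 chars
```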
By offering a high-volume, language-focused dataset, C-Code-Large enables targeted experimentation in low-level programming, memory-constrained environments, and performance-critical systems, where C continues to be a dominant language.
C-Code-Large addresses the lack of large, curated, C-specific datasets, making it possible to conduct focused research on procedural programming paradigms, manual memory management, and system-level abstractions.
These samples were created using reservoir sampling, an algorithm that guarantees a uniform, statistically unbiased random sample from a massive source dataset. This means that results you get at the 1B-token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational cost of full-scale runs.
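The section doesn't show the sampling code itself, but the classic single-pass form of reservoir sampling (Algorithm R) is short enough to sketch. This is an illustrative implementation, not the exact pipeline used to build the samples:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length, in one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k/(i+1),
            # so every stream item ends up sampled with probability k/n.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. a uniform 5-element sample from a million-element stream
print(reservoir_sample(range(1_000_000), 5))
```

Because every item survives with equal probability regardless of stream length, a 1B-token reservoir preserves the statistical profile of the full source corpus, which is what makes small-scale results transfer.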
The collection includes:

- finePDFs-1B: high-quality, textbook-style educational content
- DCLM-baseline-1B: filtered, diverse web content
- FineWeb-Edu-1B: curated educational web resources
We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.
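To make the 50-30-20 mixture concrete, here is a minimal, hypothetical sketch of weighted stream interleaving. The function name and toy streams are illustrative only, and the actual experiments may have mixed data differently (for example by token counts rather than example counts):

```python
import random

def mix_streams(streams, weights, seed=0):
    """Interleave examples from several dataset streams in fixed
    proportions (e.g. 0.5 / 0.3 / 0.2). A sketch of the mixing idea,
    not the code behind the experiments described above."""
    rng = random.Random(seed)
    iters = [iter(s) for s in streams]
    weights = list(weights)  # copy so we can drop exhausted streams
    while iters:
        i = rng.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[i])
        except StopIteration:
            # Drop an exhausted stream and keep mixing the rest.
            del iters[i]
            del weights[i]

# Toy usage with stand-in streams for finePDFs / DCLM-baseline / FineWeb-Edu:
mixed = mix_streams(
    [["pdf"] * 5, ["dclm"] * 3, ["web"] * 2],
    weights=[0.5, 0.3, 0.2],
)
print(list(mixed))
```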
Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.