ajibawa-2023 (Feynman Innovations)

upvoted a paper 17 days ago

Qwen3-Coder-Next Technical Report

Paper • 2603.00729 • Published Feb 28 • 64

reacted to their post with 👍🔥🚀 about 1 month ago

Post

2778

C-Code-Large
Dataset: ajibawa-2023/C-Code-Large

C-Code-Large is a large-scale corpus of C programming language source code comprising more than 4 million code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, and software engineering automation for the C ecosystem.

By offering a high-volume, language-focused dataset, C-Code-Large enables targeted experimentation in low-level programming, memory-constrained environments, and performance-critical systems, where C continues to be a dominant language.

C-Code-Large addresses the lack of large, curated, C-specific datasets, making it possible to conduct focused research on procedural programming paradigms, manual memory management, and system-level abstractions.
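Since the post says the samples are stored in .jsonl format, here is a minimal sketch of how such a file could be streamed one record at a time. The field name `code` and the two sample records are assumptions made for illustration; the post does not state the dataset's actual schema.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


# Two made-up records standing in for a real shard.
# "code" is a hypothetical field name, not confirmed by the post.
sample_shard = io.StringIO(
    '{"code": "int main(void) { return 0; }"}\n'
    '\n'
    '{"code": "#include <stdio.h>\\nint x = 1;"}\n'
)

records = list(iter_jsonl(sample_shard))
total_chars = sum(len(record["code"]) for record in records)
print(len(records), total_chars)
```

Reading line by line keeps memory use flat, which matters for a corpus of this size; a real file opened with `open(path, encoding="utf-8")` can be passed in place of the in-memory stream.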

liked a dataset about 1 month ago

ajibawa-2023/C-Code-Large

Viewer • Updated Mar 17 • 2.87M • 948 • 16

updated a dataset about 1 month ago

ajibawa-2023/C-Code-Large

Viewer • Updated Mar 17 • 2.87M • 948 • 16

published a dataset about 1 month ago

ajibawa-2023/C-Code-Large

Viewer • Updated Mar 17 • 2.87M • 948 • 16

liked a Space about 1 month ago

The Synthetic Data Playbook: Generating Trillions of the Finest Tokens

📝

220

Explore synthetic data experiments on a virtual bookshelf

replied to their post about 1 month ago

In this case, unfortunately no.

reacted to their post with 👍🚀🔥 about 1 month ago

Post

3852

Cpp-Code-Large
Dataset: ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.

Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.

3 replies


updated a dataset about 1 month ago

ajibawa-2023/Cpp-Code-Large

Viewer • Updated Mar 4 • 3.54M • 675 • 16

liked a dataset about 1 month ago

ajibawa-2023/Cpp-Code-Large

Viewer • Updated Mar 4 • 3.54M • 675 • 16

published a dataset about 1 month ago

ajibawa-2023/Cpp-Code-Large

Viewer • Updated Mar 4 • 3.54M • 675 • 16

New activity in ajibawa-2023/Python-Code-Large about 2 months ago

Nice Work

1

#2 opened about 2 months ago by

Ujjwal-Tyagi

reacted to their post with 🚀🔥 about 2 months ago

Post

3513

Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
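As one illustration of the program-analysis use case mentioned above, the sketch below filters rows down to those that parse as valid Python and counts function definitions in each. The `code` field name and the sample rows are assumptions for the example, not the dataset's documented schema.

```python
import ast


def parses_as_python(source: str) -> bool:
    """Return True if the source text parses as Python."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def count_functions(source: str) -> int:
    """Count sync and async function definitions anywhere in the source."""
    tree = ast.parse(source)
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )


# Made-up rows; "code" is a hypothetical field name.
rows = [
    {"code": "def add(a, b):\n    return a + b\n"},
    {"code": "def broken(:\n    pass\n"},
]

valid_rows = [row for row in rows if parses_as_python(row["code"])]
function_counts = [count_functions(row["code"]) for row in valid_rows]
print(len(valid_rows), function_counts)
```

Note that `ast.parse` checks against the grammar of the interpreter running the script, so code written for an older or newer Python version may be rejected even though it was valid where it was written.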

1 reply

