Zh-Pythia • Collection • A series of Chinese language models trained on 3B tokens • 5 items • Updated Nov 13, 2024 • 1
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures • Paper • 2510.24081 • Published Oct 28, 2025 • 22
Tokenizer Study (LLaMA 130M) • Collection • Correlating tokenizer properties on pre-trained LLMs with their downstream performance. • 84 items • Updated Mar 13 • 3
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text • Paper • 2506.05209 • Published Jun 5, 2025 • 61