stukenov's Collections

SozKZ Corpora: Kazakh Training Datasets

Training corpora for Kazakh LLMs: raw, cleaned, deduplicated, tokenized, synthetic, and parallel datasets