Training corpora for Kazakh LLMs — raw, cleaned, deduplicated, tokenized, synthetic, and parallel datasets