SozKZ Corpora: Kazakh Training Datasets

stukenov's Collections

updated 29 days ago

Training corpora for Kazakh LLMs — raw, cleaned, deduplicated, tokenized, synthetic, and parallel datasets

stukenov/sozkz-corpus-raw-kk-multi-v1

Viewer • Updated 29 days ago • 13.1M • 8
stukenov/sozkz-corpus-raw-kk-gazeta-v1

Viewer • Updated 29 days ago • 74.1k • 14
stukenov/sozkz-corpus-clean-kk-pretrain-v2

Viewer • Updated 29 days ago • 1.02M • 11
stukenov/sozkz-corpus-clean-kk-text-v2

Viewer • Updated Feb 11 • 19M • 30
stukenov/sozkz-corpus-clean-v3

Viewer • Updated 9 days ago • 13.5M • 72
stukenov/sozkz-corpus-clean-enkk-fineweb-edu-v1

Viewer • Updated 29 days ago • 18M • 31
stukenov/sozkz-corpus-dedup-kk-web-v1

Viewer • Updated 29 days ago • 9.48M • 7
stukenov/sozkz-corpus-balanced-kk-gpt2-v1

Viewer • Updated 29 days ago • 480k • 9
stukenov/sozkz-corpus-tokenized-kk-llama50k-v3

Viewer • Updated 29 days ago • 8.88M • 11
stukenov/sozkz-corpus-tokenized-kk-llama50m-v1

Viewer • Updated 29 days ago • 5.9M • 7
stukenov/sozkz-corpus-tokenized-kk-200k-v1

Viewer • Updated Feb 19 • 422k • 6
stukenov/sozkz-corpus-tokenized-enkk-200k-v1

Viewer • Updated Feb 20 • 10.4M • 5
stukenov/sozkz-corpus-tokenized-enkk-fineweb-edu-v1

Viewer • Updated Feb 19 • 9.02M • 8
stukenov/sozkz-corpus-tokenized-kk-multidomain-200k-v1

Viewer • Updated Feb 19 • 1.18M • 7
stukenov/sozkz-corpus-tokenized-kk-t5-50m-v1

Viewer • Updated Feb 12 • 2.06M • 6
stukenov/sozkz-corpus-chatml-kk-instruct-mix-v1

Viewer • Updated Feb 17 • 369k • 18
stukenov/sozkz-corpus-synthetic-kk-instruct-v1

Viewer • Updated Feb 17 • 49.5k • 8
stukenov/sozkz-corpus-kazsandra-sentiment-v1

Viewer • Updated Mar 18 • 60.3k • 8
stukenov/sozkz-instruct-chatml-kk-v1

Viewer • Updated Feb 17 • 1.29M • 12
stukenov/sozkz-instruct-chatml-en-v1

Viewer • Updated Feb 17 • 1.31M • 6
stukenov/sozkz-fineweb-edu-10bt-en-kk

Updated Feb 17 • 5
stukenov/sozkz-fineweb-edu-10bt-parallel-en-kk

Updated Feb 17 • 5
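
The repository names above appear to encode a processing stage (raw, clean, dedup, tokenized, and so on) that matches the categories in the collection description. The sketch below groups repo ids by that stage and shows one way to stream a dataset with the Hugging Face `datasets` library. The stage keywords are inferred from the names only, the helper names (`stage_of`, `group_by_stage`, `stream_dataset`) are hypothetical, and the `"train"` split is an assumption that may not hold for every dataset.

```python
from collections import defaultdict

# Stage keywords inferred from the repo names in this collection.
# Treating them as a naming convention is an assumption, not a documented rule.
KNOWN_STAGES = {"raw", "clean", "dedup", "balanced", "tokenized", "chatml", "synthetic"}


def stage_of(repo_id: str) -> str:
    """Return the stage keyword found in a repo id, or "other" if none matches."""
    name = repo_id.split("/", 1)[-1]          # drop the "stukenov/" owner prefix
    parts = name.split("-")
    if parts[:2] == ["sozkz", "corpus"] and len(parts) > 2 and parts[2] in KNOWN_STAGES:
        return parts[2]
    return "other"


def group_by_stage(repo_ids):
    """Group repo ids by inferred stage, preserving input order within each group."""
    groups = defaultdict(list)
    for repo_id in repo_ids:
        groups[stage_of(repo_id)].append(repo_id)
    return dict(groups)


def stream_dataset(repo_id: str, split: str = "train"):
    """Stream a dataset from the Hub instead of downloading it in full.

    Requires `pip install datasets` and network access.
    The default split name "train" is an assumption.
    """
    from datasets import load_dataset  # lazy import: helpers above work without it
    return load_dataset(repo_id, split=split, streaming=True)
```

For example, `stream_dataset("stukenov/sozkz-corpus-clean-v3")` would return an iterable dataset if that repository exposes a `train` split; check the dataset viewer for the actual split and column names before relying on them.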
