# SozKZ Vocab: Kazakh Tokenizers

A collection of BPE and SentencePiece tokenizers trained on Kazakh text, each with a 32K vocabulary.
SentencePiece unigram tokenizer trained on Kazakh text for T5 pretraining.
| Property | Value |
|---|---|
| Algorithm | SentencePiece Unigram |
| Base vocab | 32,000 |
| Sentinel tokens | 128 (`<extra_id_0>` … `<extra_id_127>`) |
| Total vocab | 32,128 |
| Special tokens | `<pad>`=0, `</s>`=1, `<unk>`=2 |
| Character coverage | 99.95% |
| Byte fallback | Yes |
| Normalization | Identity (no NFKC) |
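The vocabulary layout in the table can be sketched in a few lines of plain Python. This is an illustration of the assumed layout (128 T5-style sentinels appended after the 32,000 SentencePiece pieces), not code read from the tokenizer files:

```python
# Assumed layout: 32,000 base pieces + 128 sentinel tokens = 32,128 total.
BASE_VOCAB = 32_000

# T5 sentinel tokens used to mark corrupted spans during pretraining.
sentinels = [f"<extra_id_{i}>" for i in range(128)]

total_vocab = BASE_VOCAB + len(sentinels)
print(sentinels[0], sentinels[-1], total_vocab)  # <extra_id_0> <extra_id_127> 32128
```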
Trained on 5M samples from `stukenov/kazakh-clean-pretrain-text`.
```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("stukenov/kazakh-t5-sp-32k")

text = "Kazakh text here"
token_ids = tokenizer.encode(text)   # token IDs, with </s> (EOS) appended
print(tokenizer.decode(token_ids))   # round-trip back to text
```
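Since the tokenizer is intended for T5 pretraining, the sentinel tokens exist to mark masked spans in the span-corruption objective. The sketch below simulates that format on whitespace-split words rather than real token IDs, and the `corrupt` helper and its span choices are hypothetical, purely to show how `<extra_id_N>` markers pair inputs with targets:

```python
# Hypothetical sketch of T5 span corruption: each masked span in the input is
# replaced by one sentinel; the target lists each sentinel followed by the
# text it replaced, ending with a final closing sentinel.
def corrupt(words, spans):
    inp, tgt, prev = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(words[prev:start])   # keep uncorrupted words
        inp.append(sentinel)            # mark the masked span
        tgt.append(sentinel)
        tgt.extend(words[start:end])    # target reveals the masked words
        prev = end
    inp.extend(words[prev:])
    tgt.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inp), " ".join(tgt)

words = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = corrupt(words, [(1, 3), (6, 7)])
print(inp)  # the <extra_id_0> fox jumps over <extra_id_1> lazy dog
print(tgt)  # <extra_id_0> quick brown <extra_id_1> the <extra_id_2>
```

In real pretraining this operates on the encoded IDs, but the input/target format is the same.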
Note: `<unk>` covers out-of-vocabulary pieces (rare, given byte fallback); `</s>` is used as EOS only.