SozKZ Vocab: Kazakh Tokenizers
A SentencePiece BPE tokenizer trained on Kazakh text for use in TTS (text-to-speech) models. Part of the KZ-CALM project — a Kazakh consistency-latent TTS system.
| Property | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 4,096 tokens |
| Character coverage | 100% |
| Training data | 232,350 Kazakh utterances from stukenov/kzcalm-tts-kk-v1 |
| Library | SentencePiece |
| License | Apache 2.0 |
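BPE builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus. A minimal pure-Python illustration of that merge loop (a toy sketch, not the SentencePiece implementation — the real trainer operates on whitespace-marked pieces and a far larger corpus):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy illustration)."""
    # Start with each word as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = bpe_merges(["қазақ", "қазақстан", "қазақша"], num_merges=4)
print(merges)
# [('қ', 'а'), ('қа', 'з'), ('қаз', 'а'), ('қаза', 'қ')]
```

With a shared stem across the three words, the first merges assemble the common prefix "қазақ" piece by piece, which is exactly how frequent Kazakh morphemes end up as single tokens in the 4,096-entry vocabulary.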
| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<s>` | 1 | Beginning of sequence (BOS) |
| `</s>` | 2 | End of sequence (EOS) |
| `<unk>` | 3 | Unknown token |
| File | Description |
|---|---|
| `tokenizer.model` | SentencePiece binary model (load with `SentencePieceProcessor`) |
| `tokenizer.vocab` | Human-readable vocabulary file (token + log-probability per line) |
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Encode text to token IDs
text = "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."
ids = sp.Encode(text)
print(ids)
# Example output: [142, 87, 12, 305, 8, ...]

# Decode back to text
decoded = sp.Decode(ids)
print(decoded)
# "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."

# Encode to subword pieces
pieces = sp.EncodeAsPieces(text)
print(pieces)
# Example: ['▁Сәлем', ',', '▁әлем', '!', '▁Бұл', '▁қазақ', ...]
```
Download the model directly from the Hub with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

model_path = hf_hub_download("stukenov/kzcalm-sp-tokenizer-4k-kk-v1", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)
print(sp.GetPieceSize())  # 4096
print(sp.Encode("Қазақстан"))
```
```python
from kzcalm.tokenizer.sp_tokenizer import KazakhTokenizer

tok = KazakhTokenizer("tokenizer.model")
ids = tok.encode("Сәлем, әлем!", add_bos=True, add_eos=True)
# [1, 142, 87, 12, 305, 2] (BOS + tokens + EOS)
text = tok.decode(ids)
# "Сәлем, әлем!"
```
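When batching sequences for TTS training, each ID list is typically wrapped with BOS/EOS and right-padded with the `<pad>` ID so the batch forms a rectangle. A minimal sketch using the special-token IDs from the table above (the content token IDs here are illustrative, and `collate` is a hypothetical helper, not part of KZ-CALM):

```python
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2  # from the special-tokens table

def collate(batch):
    """Wrap each ID sequence with BOS/EOS and right-pad to the batch maximum."""
    wrapped = [[BOS_ID] + ids + [EOS_ID] for ids in batch]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in wrapped]

batch = collate([[142, 87, 12], [305, 8]])
print(batch)
# [[1, 142, 87, 12, 2], [1, 305, 8, 2, 0]]
```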
Training data: stukenov/kzcalm-tts-kk-v1 — a unified Kazakh TTS dataset (232K samples, 438.8h of Kazakh speech) combining KazakhTTS (177K samples) and KazEmoTTS (55K samples).

Training was run with `SentencePieceTrainer.Train()` using `input_sentence_size=5,000,000`, `shuffle_input_sentence=True`, and multi-threading.

This tokenizer is designed for Kazakh TTS models in the KZ-CALM project.