KZ-CALM SentencePiece Tokenizer (Kazakh, 4096 vocab)

A SentencePiece BPE tokenizer trained on Kazakh text for use in TTS (text-to-speech) models. Part of the KZ-CALM project — a Kazakh consistency-latent TTS system.

Model Details

| Property | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 4,096 tokens |
| Character coverage | 100% |
| Training data | 232,350 Kazakh utterances from stukenov/kzcalm-tts-kk-v1 |
| Library | SentencePiece |
| License | Apache 2.0 |
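To make the "Byte-Pair Encoding" entry concrete, here is a minimal toy sketch of the greedy BPE merge loop (in Latin letters for readability; the real tokenizer is trained by SentencePiece on Kazakh text and this is not its implementation):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for pair in zip(w, w[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the merge in place across every word.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == (a, b):
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words

merges, segmented = bpe_train(["low", "lower", "lowest"], num_merges=2)
# merges: [('l', 'o'), ('lo', 'w')]; "low" becomes a single piece
```

With 4,096 merge targets instead of 2, frequent Kazakh morphemes end up as single vocabulary entries.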

Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<s>` | 1 | Beginning of sequence (BOS) |
| `</s>` | 2 | End of sequence (EOS) |
| `<unk>` | 3 | Unknown token |
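A typical way these IDs are used downstream (the helper names here are illustrative, not part of this repository) is to wrap each sequence in BOS/EOS and right-pad a batch with `<pad>`:

```python
# Special-token IDs from the table above.
PAD_ID, BOS_ID, EOS_ID, UNK_ID = 0, 1, 2, 3

def add_specials(ids):
    """Wrap a token-ID sequence in BOS/EOS markers."""
    return [BOS_ID] + ids + [EOS_ID]

def pad_batch(batch):
    """Right-pad all sequences in a batch to the longest length."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

batch = pad_batch([add_specials([142, 87]), add_specials([305])])
# [[1, 142, 87, 2], [1, 305, 2, 0]]
```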

Files

| File | Description |
|---|---|
| `tokenizer.model` | SentencePiece binary model (load with `SentencePieceProcessor`) |
| `tokenizer.vocab` | Human-readable vocabulary file (token + log-probability per line) |
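The `tokenizer.vocab` format (one tab-separated `token<TAB>log-probability` pair per line, in ID order) can be parsed with a few lines of plain Python; the sample entries and log-probability values below are illustrative, not taken from the actual file:

```python
def load_vocab(lines):
    """Parse tokenizer.vocab-style lines into {token: (id, log_prob)}."""
    vocab = {}
    for i, line in enumerate(lines):
        token, logprob = line.rstrip("\n").split("\t")
        vocab[token] = (i, float(logprob))
    return vocab

sample = ["<pad>\t0", "<s>\t0", "</s>\t0", "<unk>\t0", "▁қазақ\t-7.1234"]
vocab = load_vocab(sample)
# vocab["▁қазақ"] -> (4, -7.1234)
```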

Usage

Basic Encoding/Decoding

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Encode text to token IDs
text = "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."
ids = sp.Encode(text)
print(ids)
# Example output: [142, 87, 12, 305, 8, ...]

# Decode back to text
decoded = sp.Decode(ids)
print(decoded)
# "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."

# Encode to subword pieces
pieces = sp.EncodeAsPieces(text)
print(pieces)
# Example: ['▁Сәлем', ',', '▁әлем', '!', '▁Бұл', '▁қазақ', ...]
```

Download and Use with huggingface_hub

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

model_path = hf_hub_download("stukenov/kzcalm-sp-tokenizer-4k-kk-v1", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)

print(sp.GetPieceSize())  # 4096
print(sp.Encode("Қазақстан"))
```

Use with KZ-CALM Wrapper

```python
from kzcalm.tokenizer.sp_tokenizer import KazakhTokenizer

tok = KazakhTokenizer("tokenizer.model")
ids = tok.encode("Сәлем, әлем!", add_bos=True, add_eos=True)
# [1, 142, 87, 12, 305, 2]  (BOS + tokens + EOS)

text = tok.decode(ids)
# "Сәлем, әлем!"
```

Training Details

  • Source corpus: All transcription texts from stukenov/kzcalm-tts-kk-v1 — a unified Kazakh TTS dataset combining KazakhTTS (177K samples) and KazEmoTTS (55K samples).
  • Preprocessing: Texts extracted via DuckDB column pruning from remote Parquet shards (no audio download needed). Empty and whitespace-only lines excluded.
  • Training: SentencePieceTrainer.Train() with input_sentence_size=5,000,000, shuffle_input_sentence=True, multi-threaded.
  • Vocabulary size choice: 4,096 was selected as a balance between granularity and sequence length for TTS. Kazakh is agglutinative, with rich morphology, so a smaller vocabulary would produce very long token sequences, while a larger one would be sparsely trained given the ~232K training sentences.
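The bullets above can be sketched as a single trainer invocation. This is a hedged reconstruction from the parameters stated in this card, not the verbatim training script: "corpus.txt" stands in for the extracted transcription texts, and `num_threads` is an assumption (the card says only "multi-threaded"):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="corpus.txt",            # placeholder: extracted transcription texts
    model_prefix="tokenizer",
    model_type="bpe",
    vocab_size=4096,
    character_coverage=1.0,
    input_sentence_size=5_000_000,
    shuffle_input_sentence=True,
    num_threads=8,                 # assumption; card only says multi-threaded
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,
)
```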

Intended Use

This tokenizer is designed for:

  • KZ-CALM TTS pipeline: Converts Kazakh text into token IDs that feed into the Transformer backbone for speech synthesis.
  • Kazakh NLP experiments: General-purpose Kazakh subword tokenization.
  • Text preprocessing for any model consuming Kazakh text input.

Limitations

  • Trained only on TTS transcription data — may not cover specialized vocabulary (medical, legal, technical terms).
  • No language-specific normalization is applied (numbers, dates, abbreviations appear in their raw text form). A separate text normalizer should be used upstream.
  • The vocabulary is optimized for Kazakh; it will perform poorly on other languages (Russian, English text will be heavily fragmented).
