KZ-CALM SentencePiece Tokenizer (Kazakh, 4096 vocab)

A SentencePiece BPE tokenizer trained on Kazakh text for use in TTS (text-to-speech) models. Part of the KZ-CALM project — a Kazakh consistency-latent TTS system.

Model Details

| Property | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 4,096 tokens |
| Character coverage | 100% |
| Training data | 232,350 Kazakh utterances from stukenov/kzcalm-tts-kk-v1 |
| Library | SentencePiece |
| License | Apache 2.0 |
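To make the "Byte-Pair Encoding" entry concrete, here is a minimal toy sketch of the greedy BPE merge loop (in Latin letters for readability; the real tokenizer is trained by SentencePiece on Kazakh text and this is not its implementation):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for pair in zip(w, w[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the merge in place across every word.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == (a, b):
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words

merges, segmented = bpe_train(["low", "lower", "lowest"], num_merges=2)
# merges: [('l', 'o'), ('lo', 'w')]; "low" becomes a single piece
```

With 4,096 merge targets instead of 2, frequent Kazakh morphemes end up as single vocabulary entries.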

Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<s>` | 1 | Beginning of sequence (BOS) |
| `</s>` | 2 | End of sequence (EOS) |
| `<unk>` | 3 | Unknown token |
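A typical way these IDs are used downstream (the helper names here are illustrative, not part of this repository) is to wrap each sequence in BOS/EOS and right-pad a batch with `<pad>`:

```python
# Special-token IDs from the table above.
PAD_ID, BOS_ID, EOS_ID, UNK_ID = 0, 1, 2, 3

def add_specials(ids):
    """Wrap a token-ID sequence in BOS/EOS markers."""
    return [BOS_ID] + ids + [EOS_ID]

def pad_batch(batch):
    """Right-pad all sequences in a batch to the longest length."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

batch = pad_batch([add_specials([142, 87]), add_specials([305])])
# [[1, 142, 87, 2], [1, 305, 2, 0]]
```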

Files

| File | Description |
|---|---|
| `tokenizer.model` | SentencePiece binary model (load with `SentencePieceProcessor`) |
| `tokenizer.vocab` | Human-readable vocabulary file (token + log-probability per line) |
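The `tokenizer.vocab` format (one tab-separated `token<TAB>log-probability` pair per line, in ID order) can be parsed with a few lines of plain Python; the sample entries and log-probability values below are illustrative, not taken from the actual file:

```python
def load_vocab(lines):
    """Parse tokenizer.vocab-style lines into {token: (id, log_prob)}."""
    vocab = {}
    for i, line in enumerate(lines):
        token, logprob = line.rstrip("\n").split("\t")
        vocab[token] = (i, float(logprob))
    return vocab

sample = ["<pad>\t0", "<s>\t0", "</s>\t0", "<unk>\t0", "▁қазақ\t-7.1234"]
vocab = load_vocab(sample)
# vocab["▁қазақ"] -> (4, -7.1234)
```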

Usage

Basic Encoding/Decoding

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Encode text to token IDs
text = "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."
ids = sp.Encode(text)
print(ids)
# Example output: [142, 87, 12, 305, 8, ...]

# Decode back to text
decoded = sp.Decode(ids)
print(decoded)
# "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."

# Encode to subword pieces
pieces = sp.EncodeAsPieces(text)
print(pieces)
# Example: ['▁Сәлем', ',', '▁әлем', '!', '▁Бұл', '▁қазақ', ...]
```

Download and Use with huggingface_hub

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

model_path = hf_hub_download("stukenov/kzcalm-sp-tokenizer-4k-kk-v1", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)

print(sp.GetPieceSize())  # 4096
print(sp.Encode("Қазақстан"))
```

Use with KZ-CALM Wrapper

```python
from kzcalm.tokenizer.sp_tokenizer import KazakhTokenizer

tok = KazakhTokenizer("tokenizer.model")
ids = tok.encode("Сәлем, әлем!", add_bos=True, add_eos=True)
# [1, 142, 87, 12, 305, 2]  (BOS + tokens + EOS)

text = tok.decode(ids)
# "Сәлем, әлем!"
```

Training Details

  • Source corpus: All transcription texts from stukenov/kzcalm-tts-kk-v1 — a unified Kazakh TTS dataset combining KazakhTTS (177K samples) and KazEmoTTS (55K samples).
  • Preprocessing: Texts extracted via DuckDB column pruning from remote Parquet shards (no audio download needed). Empty and whitespace-only lines excluded.
  • Training: SentencePieceTrainer.Train() with input_sentence_size=5,000,000, shuffle_input_sentence=True, multi-threaded.
  • Vocabulary size choice: 4,096 was selected as a balance between granularity and sequence length for TTS. Kazakh is agglutinative, with rich morphology, so a smaller vocabulary would produce very long token sequences, while a larger one would be sparsely trained given the ~232K training sentences.
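The bullets above can be sketched as a single trainer invocation. This is a hedged reconstruction from the parameters stated in this card, not the verbatim training script: "corpus.txt" stands in for the extracted transcription texts, and `num_threads` is an assumption (the card says only "multi-threaded"):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="corpus.txt",            # placeholder: extracted transcription texts
    model_prefix="tokenizer",
    model_type="bpe",
    vocab_size=4096,
    character_coverage=1.0,
    input_sentence_size=5_000_000,
    shuffle_input_sentence=True,
    num_threads=8,                 # assumption; card only says multi-threaded
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,
)
```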

Intended Use

This tokenizer is designed for:

  • KZ-CALM TTS pipeline: Converts Kazakh text into token IDs that feed into the Transformer backbone for speech synthesis.
  • Kazakh NLP experiments: General-purpose Kazakh subword tokenization.
  • Text preprocessing for any model consuming Kazakh text input.

Limitations

  • Trained only on TTS transcription data — may not cover specialized vocabulary (medical, legal, technical terms).
  • No language-specific normalization is applied (numbers, dates, abbreviations appear in their raw text form). A separate text normalizer should be used upstream.
  • The vocabulary is optimized for Kazakh; it will perform poorly on other languages (Russian, English text will be heavily fragmented).
