
Kazakh T5 SentencePiece Tokenizer (32K)

SentencePiece unigram tokenizer trained on Kazakh text for T5 pretraining.

Details

| Property | Value |
|---|---|
| Algorithm | SentencePiece Unigram |
| Base vocab | 32,000 |
| Sentinel tokens | 128 (`<extra_id_0>` ... `<extra_id_127>`) |
| Total vocab | 32,128 |
| Special tokens | `<pad>`=0, `</s>`=1, `<unk>`=2 |
| Character coverage | 99.95% |
| Byte fallback | Yes |
| Normalization | Identity (no NFKC) |

Training Data

Trained on 5M samples from stukenov/kazakh-clean-pretrain-text.
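The card lists the training hyperparameters but not the training command. A plausible SentencePiece invocation matching the properties above would look like the following sketch; the corpus path and output prefix are assumptions, not taken from this repository:

```python
# Hypothetical reconstruction of the training configuration from the
# properties table; only the listed hyperparameters are from the card.
train_args = dict(
    input="kazakh_corpus.txt",          # assumed plain-text export of the dataset
    model_prefix="kk_t5_sp_32k",        # assumed output name
    model_type="unigram",               # T5-style unigram LM, not BPE
    vocab_size=32000,                   # base vocab; sentinels are added on top
    character_coverage=0.9995,          # 99.95%, as listed
    byte_fallback=True,                 # unknown chars fall back to byte tokens
    normalization_rule_name="identity", # no NFKC folding
    pad_id=0, eos_id=1, unk_id=2,
    bos_id=-1,                          # no BOS token, per T5 convention
)

# import sentencepiece as spm
# spm.SentencePieceTrainer.train(**train_args)
```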

Usage

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("stukenov/kazakh-t5-sp-32k")

text = "Қазақ тілі түркі тілдер тобына жатады."
token_ids = tokenizer.encode(text)  # appends </s> (id 1) automatically
print(tokenizer.decode(token_ids))
```
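The 128 sentinel tokens exist for T5's span-corruption objective: masked spans in the input are each replaced by one sentinel, and the target reproduces the spans behind the same sentinels. A minimal plain-Python sketch (token strings stand in for ids; no tokenizer download needed):

```python
# Illustrative sketch of T5-style span corruption using sentinel tokens.
# `spans` is a list of (start, end) index pairs to mask, in order.
def corrupt_spans(tokens, spans):
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[prev:start])  # keep the unmasked prefix
        inp.append(sentinel)            # sentinel replaces the span
        tgt.append(sentinel)            # target echoes the sentinel...
        tgt.extend(tokens[start:end])   # ...followed by the masked tokens
        prev = end
    inp.extend(tokens[prev:])
    tgt.append("</s>")
    return inp, tgt

tokens = ["Қазақ", "тілі", "түркі", "тілдер", "тобына", "жатады"]
inp, tgt = corrupt_spans(tokens, [(1, 2), (4, 5)])
# inp → ["Қазақ", "<extra_id_0>", "түркі", "тілдер", "<extra_id_1>", "жатады"]
# tgt → ["<extra_id_0>", "тілі", "<extra_id_1>", "тобына", "</s>"]
```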

Design Choices

  • Unigram model: matches the original T5 tokenizer (unigram LM, not BPE)
  • Identity normalization: preserves Kazakh-specific letters (ә, ғ, қ, ң, ө, ұ, ү, һ, і) without NFKC folding
  • Byte fallback: any Unicode character decomposes into byte tokens, so nothing maps to <unk>
  • No BOS token: follows the T5 convention of using only </s> as EOS
  • 128 sentinels: reserved for T5 span-corruption pretraining
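Byte fallback can be illustrated without SentencePiece at all: a character missing from the 32K vocab is emitted as its UTF-8 bytes, each as a byte token of the form `<0xNN>`, so no input ever maps to `<unk>`. A self-contained sketch of that decomposition:

```python
# Conceptual sketch of byte fallback: the UTF-8 bytes a tokenizer would
# fall back to for a character absent from the learned vocabulary.
def byte_fallback(ch):
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

print(byte_fallback("қ"))  # Cyrillic qa with descender → ['<0xD2>', '<0x9B>']
print(byte_fallback("A"))  # ASCII stays a single byte  → ['<0x41>']
```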