# SozKZ Vocab: Kazakh Tokenizers

A collection of BPE and SentencePiece tokenizers trained on Kazakh text, each with a 32K vocabulary.
SentencePiece unigram tokenizer trained on Kazakh text for T5 pretraining.
| Property | Value |
|---|---|
| Algorithm | SentencePiece Unigram |
| Base vocab | 32,000 |
| Sentinel tokens | 128 (`<extra_id_0>` … `<extra_id_127>`) |
| Total vocab | 32,128 |
| Special tokens | `<pad>`=0, `</s>`=1, `<unk>`=2 |
| Character coverage | 99.95% |
| Byte fallback | Yes |
| Normalization | Identity (no NFKC) |
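The vocabulary layout in the table can be sketched in a few lines of plain Python. This is an illustration of the assumed layout (128 T5-style sentinels appended after the 32,000 SentencePiece pieces), not code read from the tokenizer files:

```python
# Assumed layout: 32,000 base pieces + 128 sentinel tokens = 32,128 total.
BASE_VOCAB = 32_000

# T5 sentinel tokens used to mark corrupted spans during pretraining.
sentinels = [f"<extra_id_{i}>" for i in range(128)]

total_vocab = BASE_VOCAB + len(sentinels)
print(sentinels[0], sentinels[-1], total_vocab)  # <extra_id_0> <extra_id_127> 32128
```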
Trained on 5M samples from `stukenov/kazakh-clean-pretrain-text`.
```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("stukenov/kazakh-t5-sp-32k")

text = "Kazakh text here"
token_ids = tokenizer.encode(text)   # token IDs, with </s> (EOS) appended
print(tokenizer.decode(token_ids))   # round-trip back to text
```
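Since the tokenizer is intended for T5 pretraining, the sentinel tokens exist to mark masked spans in the span-corruption objective. The sketch below simulates that format on whitespace-split words rather than real token IDs, and the `corrupt` helper and its span choices are hypothetical, purely to show how `<extra_id_N>` markers pair inputs with targets:

```python
# Hypothetical sketch of T5 span corruption: each masked span in the input is
# replaced by one sentinel; the target lists each sentinel followed by the
# text it replaced, ending with a final closing sentinel.
def corrupt(words, spans):
    inp, tgt, prev = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(words[prev:start])   # keep uncorrupted words
        inp.append(sentinel)            # mark the masked span
        tgt.append(sentinel)
        tgt.extend(words[start:end])    # target reveals the masked words
        prev = end
    inp.extend(words[prev:])
    tgt.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inp), " ".join(tgt)

words = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = corrupt(words, [(1, 3), (6, 7)])
print(inp)  # the <extra_id_0> fox jumps over <extra_id_1> lazy dog
print(tgt)  # <extra_id_0> quick brown <extra_id_1> the <extra_id_2>
```

In real pretraining this operates on the encoded IDs, but the input/target format is the same.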
Note: `<unk>` covers out-of-vocabulary pieces (rare, given byte fallback); `</s>` is used as EOS only.