Byte-Level BPE Tokenizer: tur_Latn (32K)

A Byte-Level BPE tokenizer trained on Turkish (tur_Latn) data from Fineweb-2-HQ.

Training Details

Parameter              Value
Algorithm              Byte-Level BPE
Language               tur_Latn
Target Vocab Size      32,000
Final Vocab Size       32,855
Pre-tokenizer          custom:tur_Latn
Number handling        ltr_3digit
Contraction handling   True
Normalizer             NFC
Special Tokens         <s>, </s>, <pad>, <unk>
Training Shards        2
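
The settings above can be approximated locally with the tokenizers library. This is only a sketch: the toy corpus and trainer arguments below are illustrative stand-ins, not the actual training recipe or data.

```python
# Hypothetical sketch of training a byte-level BPE with settings similar to
# this card (NFC normalizer, byte-level pre-tokenization, the four special
# tokens). The corpus here is a toy stand-in for the Fineweb-2-HQ shards.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, decoders, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.NFC()                    # NFC normalization, as in the card
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,                                       # target size; the final size may land higher
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),   # all 256 byte symbols
    show_progress=False,
)
corpus = ["Merhaba dünya!", "Bu bir deneme cümlesidir."]    # toy stand-in corpus
tokenizer.train_from_iterator(corpus, trainer=trainer)
```

Because every byte is in the initial alphabet, a byte-level BPE has no out-of-vocabulary inputs, and decoding round-trips the original text exactly.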

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flexitok/bpe_ltr_tur_Latn_32000_v2")
tokens = tokenizer.encode("Hello, world!")
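
In the token strings this tokenizer produces (see the sample encoding below in this card), a leading Ġ marks a word that followed a space: byte 0x20 is remapped to a printable character by the byte-level scheme. A quick way to see this, using the tokenizers library's ByteLevel pre-tokenizer (a sketch of the standard byte-level behavior, not this specific tokenizer's custom pre-tokenizer):

```python
# The "Ġ" in byte-level token strings is the printable stand-in for a
# leading space. ByteLevel pre-tokenization makes this visible directly.
from tokenizers.pre_tokenizers import ByteLevel

pre = ByteLevel(add_prefix_space=False)
pieces = [piece for piece, span in pre.pre_tokenize_str("Hello, world!")]
# "world" comes out as "Ġworld" because it was preceded by a space.
print(pieces)
```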

Files

  • tokenizer.json — Full HuggingFace tokenizer
  • vocab.json — Vocabulary mapping
  • merges.txt — BPE merge rules

Sample Encoding

Text:      Hello, world! 12345 This is a test. こんにちは
Tokens:    H, ello, ,, Ġw, orld, !, Ġ, 123, 45, ĠThis, Ġis, Ġa, Ġtest, ., Ġ, ãģ, ĵ, ãĤ, ĵ, ãģ
Token IDs: 42, 18690, 14, 1129, 7565, 3, 223, 17430, 3899, 24258, 1109, 302, 1681, 16, 223, 5495, 244, 9807, 244, 5495
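
The non-Latin tail of the sample (こんにちは) encodes to tokens like ãģ and ĵ because byte-level BPE remaps each of the 256 byte values to a printable Unicode character before merging. Assuming this tokenizer uses the standard GPT-2-style byte-to-unicode mapping (which its token strings suggest), the mapping can be reconstructed as follows:

```python
# Byte-level BPE remaps each byte 0-255 to a printable character (the GPT-2
# bytes_to_unicode scheme): printable bytes map to themselves, the rest are
# shifted up past U+0100. That is why こんにちは appears as tokens like "ãģ"
# and "ĵ" -- they are remapped UTF-8 bytes, not mojibake.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # map non-printable bytes to U+0100 + n
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

mapping = bytes_to_unicode()
# "こ" is UTF-8 bytes e3 81 93, which remap to ã, ģ, ĵ:
print("".join(mapping[b] for b in "こ".encode("utf-8")))  # → ãģĵ
```

The space byte 0x20 maps to Ġ under the same scheme, which is where the Ġ prefixes in the token column come from.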