GPT-Style 50K Multilingual Tokenizer
A byte-level BPE tokenizer trained on a large mixed corpus of English + Urdu text, designed for chat-style LLMs with support for structured prompts and multimodal placeholders.
Overview
This tokenizer is built using Byte Pair Encoding (BPE) with a vocabulary size of 50,000 tokens, optimized for:
- English language text
- Urdu (Unicode script)
- Chat formatting (system/user/assistant)
- Multimodal placeholders (speech, frames)
It is compatible with Hugging Face Transformers.
Features
- 50,000 vocabulary size
- Multilingual support (English + Urdu)
- Chat-format tokens
- Byte-level BPE (robust to unknown characters)
- Fully reversible encoding/decoding
- Hugging Face compatible (PreTrainedTokenizerFast)
- Special token system for LLM training
Special Tokens
Core Tokens
<|pad|> <|unk|> <|bos|> <|eos|>
Chat Tokens
<|im_start|> <|im_end|>
<|system|> <|user|> <|assistant|>
Multimodal Tokens
<|speech|> <|frame_start|> <|frame_end|>
Reserved Tokens
<|reserved_1|> ... <|reserved_80|>
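The token inventory above can also be assembled programmatically, for example when registering special tokens with a trainer. The following is a minimal sketch: the variable names and the grouping order are illustrative assumptions, and it assumes the reserved range runs from `<|reserved_1|>` through `<|reserved_80|>` inclusive, as the listing suggests. The actual ID order in the published tokenizer may differ.

```python
# Illustrative reconstruction of the special-token list described above.
# Names and ordering are assumptions, not taken from the published tokenizer.
CORE_TOKENS = ["<|pad|>", "<|unk|>", "<|bos|>", "<|eos|>"]
CHAT_TOKENS = ["<|im_start|>", "<|im_end|>", "<|system|>", "<|user|>", "<|assistant|>"]
MULTIMODAL_TOKENS = ["<|speech|>", "<|frame_start|>", "<|frame_end|>"]

# Reserved slots 1..80, kept free for future use.
RESERVED_TOKENS = [f"<|reserved_{i}|>" for i in range(1, 81)]

SPECIAL_TOKENS = CORE_TOKENS + CHAT_TOKENS + MULTIMODAL_TOKENS + RESERVED_TOKENS

print(len(SPECIAL_TOKENS))  # 92
```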
Training Data
The tokenizer was trained on a large-scale mixed dataset:
- English corpora (web text)
- Urdu datasets
- FineWeb samples (filtered parquet files)
- Chat-formatted synthetic data
Total training size: ~3M+ rows
Training Method
- Algorithm: Byte Pair Encoding (BPE)
- Normalization: NFKC
- Pre-tokenization: ByteLevel
- Decoder: ByteLevelDecoder
- Trainer: BpeTrainer
- Vocab size: 50,000
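To make the training steps above concrete, here is a toy, standard-library-only sketch of byte-level BPE: NFKC-normalize the text, start from raw UTF-8 bytes, and repeatedly merge the most frequent adjacent pair. It is not the `BpeTrainer` used for this tokenizer. The function names are hypothetical, and real byte-level trainers also pre-tokenize into words and map bytes to printable characters, which this sketch omits.

```python
import unicodedata
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the most common one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_toy_bpe(texts, num_merges):
    """Learn up to `num_merges` merges over NFKC-normalized UTF-8 byte sequences."""
    sequences = [list(unicodedata.normalize("NFKC", t).encode("utf-8")) for t in texts]
    merges = []
    next_id = 256  # IDs 0..255 are the raw byte values
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        sequences = [merge_pair(s, pair, next_id) for s in sequences]
        merges.append((pair, next_id))
        next_id += 1
    return merges, sequences

merges, encoded = train_toy_bpe(["low lower lowest"], num_merges=2)
print(merges)
print(encoded[0])
```

A production vocabulary of 50,000 entries corresponds to running many thousands of such merges over the full corpus, with the special tokens added on top.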
Installation
pip install transformers tokenizers huggingface_hub
Loading the Tokenizer
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("Humair332/gpt-style-50k-tokenizer")
Example Usage
Encoding
text = "Hello world! آپ کیسے ہیں؟"
encoded = tokenizer(text)
print(encoded["input_ids"])
Decoding
decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)
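Because the tokenizer applies NFKC normalization, a round trip is best checked against the normalized input rather than the raw string. The helper below is a hedged sketch of such a check. `check_roundtrip` and the toy codec are hypothetical names, and the toy codec merely stands in for the real tokenizer so the snippet runs without downloading anything.

```python
import unicodedata

def check_roundtrip(encode, decode, text):
    """Return True if decode(encode(text)) equals the NFKC-normalized input."""
    expected = unicodedata.normalize("NFKC", text)
    return decode(encode(text)) == expected

# Stand-in codec for demonstration only: NFKC-normalize, then raw UTF-8 bytes.
def toy_encode(text):
    return list(unicodedata.normalize("NFKC", text).encode("utf-8"))

def toy_decode(ids):
    return bytes(ids).decode("utf-8")

# The ligature "ﬁ" becomes "fi" under NFKC, so the decoded text differs
# from the raw input while still matching its normalized form.
print(check_roundtrip(toy_encode, toy_decode, "Hello world! ﬁne"))
```

With the loaded tokenizer, one would presumably pass `lambda t: tokenizer(t)["input_ids"]` and `tokenizer.decode` instead. Results may still differ if the tokenizer adds special tokens or adjusts whitespace on decode.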
Chat Format Example
<|im_start|><|system|>
You are a helpful assistant.
<|im_end|>
<|im_start|><|user|>
ہیلو، آپ کیسے ہیں؟
<|im_end|>
<|im_start|><|assistant|>
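The layout above can be produced from a list of messages with a small helper. This is a sketch under stated assumptions: `build_chat_prompt` and the `role`/`content` message schema are hypothetical, and the exact whitespace around the special tokens is inferred from the example rather than confirmed by the tokenizer's configuration.

```python
def build_chat_prompt(messages, add_generation_prompt=True):
    """Render messages into the chat layout shown above.

    `messages` is a list of {"role": ..., "content": ...} dicts (assumed schema).
    An unknown role raises KeyError.
    """
    role_tokens = {
        "system": "<|system|>",
        "user": "<|user|>",
        "assistant": "<|assistant|>",
    }
    parts = []
    for message in messages:
        role_token = role_tokens[message["role"]]
        parts.append(f"<|im_start|>{role_token}\n{message['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|><|assistant|>")
    return "".join(parts)

prompt = build_chat_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```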
Performance Metrics
| Metric | English | Urdu |
|---|---|---|
| Chars/Token | ~5.1 | ~3.9 |
| Tokens/Word | ~1.0 | ~1.2 |
| Lossless Decode | Yes | Yes |
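The card does not state how these figures were measured. One common definition, shown here as an assumption, counts characters as Unicode code points and words as whitespace-separated chunks. `compression_stats` is a hypothetical helper, and the token IDs in the example are placeholders rather than real output.

```python
def compression_stats(text, token_ids):
    """Return (chars_per_token, tokens_per_word) for one encoded text."""
    n_tokens = len(token_ids)
    n_words = len(text.split())
    if n_tokens == 0 or n_words == 0:
        raise ValueError("text must contain at least one word and one token")
    return len(text) / n_tokens, n_tokens / n_words

# Placeholder example: pretend a tokenizer produced 3 tokens for this text.
chars_per_token, tokens_per_word = compression_stats("Hello big world", [11, 22, 33])
print(chars_per_token, tokens_per_word)  # 5.0 1.0
```

Averaging these ratios over a held-out sample of each language would give numbers comparable in spirit to the table.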
Notes on Tokenization Behavior
- English text is highly efficient (near word-level tokens)
- Urdu uses subword + byte-level encoding
- Some Urdu words may be split into multiple sub-tokens
- Tokenization is reversible: decoding reproduces the input text after NFKC normalization
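One reason Urdu tends to need more tokens than English at the byte level is that each Urdu letter occupies two bytes in UTF-8, while ASCII letters occupy one. The short sketch below illustrates this with the standard library only; `utf8_bytes_per_char` is a hypothetical helper, not part of the tokenizer.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)

print(utf8_bytes_per_char("hello"))  # 1.0
print(utf8_bytes_per_char("آپ"))     # 2.0
```

A byte-level BPE vocabulary has to spend merges just to reassemble those two-byte letters before it can learn longer Urdu subwords.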
Output Format
The tokenizer outputs:
{
"input_ids": [...],
"attention_mask": [...]
}
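When several sequences are batched, `attention_mask` marks real tokens with 1 and padding with 0. The following sketch shows that relationship in plain Python. `pad_batch` is a hypothetical helper, right-side padding is an assumption, and the `pad_id` value of 0 is a placeholder: the real ID of `<|pad|>` should be read from the loaded tokenizer.

```python
def pad_batch(batch_ids, pad_id):
    """Right-pad token ID lists to equal length and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Placeholder token IDs; pad_id=0 is an assumption for illustration.
batch = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(batch)
```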
Strengths
- Strong English performance
- Good Urdu coverage
- Stable chat formatting
- Robust to unseen characters
- Suitable for LLM pretraining and SFT
Limitations
- Urdu text may be split at the byte level internally, so a single token can cover only part of a character
- Not optimized for morphology-aware compression of Urdu
- The 50,000-token vocabulary may be too small for very large-scale multilingual use
Repository
Humair332/gpt-style-50k-tokenizer