📘 GPT-Style 50K Multilingual Tokenizer

A byte-level BPE tokenizer trained on a large mixed corpus of English + Urdu text, designed for chat-style LLMs with support for structured prompts and multimodal placeholders.

🚀 Overview

This tokenizer is built using Byte Pair Encoding (BPE) with a vocabulary size of 50,000 tokens, optimized for:

English language text
Urdu (Unicode script)
Chat formatting (system/user/assistant)
Multimodal placeholders (speech, frames)

It loads in Hugging Face Transformers as a PreTrainedTokenizerFast.

✨ Features

⚡ 50,000 vocabulary size
🌍 Multilingual support (English + Urdu)
💬 Chat-format tokens
🧠 Byte-level BPE (robust to unknown characters)
🔁 Reversible encoding/decoding (up to NFKC normalization)
📦 Hugging Face compatible (PreTrainedTokenizerFast)
🧩 Special token system for LLM training

🧠 Special Tokens

Core Tokens

<|pad|> <|unk|> <|bos|> <|eos|>

Chat Tokens

<|im_start|> <|im_end|>
<|system|> <|user|> <|assistant|>

Multimodal Tokens

<|speech|> <|frame_start|> <|frame_end|>

Reserved Tokens

<|reserved_1|> ... <|reserved_80|>
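The reserved range above is written with an ellipsis. The sketch below shows one way the full special-token list could be assembled, assuming the reserved tokens are numbered consecutively from 1 to 80 and all follow the same pattern. The helper names and the group order are illustrative and may not match the actual token IDs in the published vocabulary.

```python
# Special tokens as listed in this README; group order here is illustrative only.
CORE_TOKENS = ["<|pad|>", "<|unk|>", "<|bos|>", "<|eos|>"]
CHAT_TOKENS = ["<|im_start|>", "<|im_end|>", "<|system|>", "<|user|>", "<|assistant|>"]
MULTIMODAL_TOKENS = ["<|speech|>", "<|frame_start|>", "<|frame_end|>"]


def reserved_tokens(count=80):
    """Assumed naming: <|reserved_1|> through <|reserved_{count}|>, consecutive."""
    return [f"<|reserved_{i}|>" for i in range(1, count + 1)]


def all_special_tokens():
    """Concatenate every special-token group into one list."""
    return CORE_TOKENS + CHAT_TOKENS + MULTIMODAL_TOKENS + reserved_tokens()


tokens = all_special_tokens()
print(len(tokens))  # 4 + 5 + 3 + 80 = 92
```

To check a loaded tokenizer, each entry can be passed to tokenizer.convert_tokens_to_ids to see which ID it maps to.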

📊 Training Data

The tokenizer was trained on a large-scale mixed dataset:

English corpora (web text)
Urdu datasets
FineWeb samples (filtered parquet files)
Chat-formatted synthetic data

Total training size: over 3M rows

⚙️ Training Method

Algorithm: Byte Pair Encoding (BPE)
Normalization: NFKC
Pre-tokenization: ByteLevel
Decoder: ByteLevelDecoder
Trainer: BpeTrainer
Vocab size: 50,000
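The components listed above map onto the Hugging Face tokenizers library roughly as in the sketch below. It is not the author's training script: the build_tokenizer helper, the add_prefix_space=False setting, the byte-level initial alphabet, and the tiny demo corpus are all assumptions made for illustration.

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer


def build_tokenizer(corpus, vocab_size=50_000, special_tokens=()):
    """Train a byte-level BPE tokenizer on an iterable of strings (hypothetical helper)."""
    tokenizer = Tokenizer(BPE())
    tokenizer.normalizer = normalizers.NFKC()
    # add_prefix_space=False is an assumption; the README does not state it.
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=list(special_tokens),
        # Seed the vocabulary with all 256 byte symbols so any input can be encoded.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tokenizer.train_from_iterator(corpus, trainer=trainer)
    return tokenizer


# Tiny demo corpus and vocabulary size, only to show the call pattern.
demo = build_tokenizer(
    ["Hello world!", "آپ کیسے ہیں؟"],
    vocab_size=300,
    special_tokens=["<|pad|>", "<|unk|>", "<|bos|>", "<|eos|>"],
)
ids = demo.encode("Hello world!").ids
print(demo.decode(ids))
```

A tokenizer trained this way can be wrapped for Transformers with PreTrainedTokenizerFast(tokenizer_object=demo).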

📦 Installation

pip install transformers tokenizers huggingface_hub

📥 Loading the Tokenizer

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Humair332/gpt-style-50k-tokenizer")

🧪 Example Usage

Encoding

text = "Hello world! آپ کیسے ہیں؟"

encoded = tokenizer(text)
print(encoded["input_ids"])

Decoding

decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)

💬 Chat Format Example

<|im_start|><|system|>
You are a helpful assistant.
<|im_end|>

<|im_start|><|user|>
ہیلو، آپ کیسے ہیں؟
<|im_end|>

<|im_start|><|assistant|>
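A prompt in the layout above can be assembled with plain string formatting. The sketch below is a minimal illustration: build_chat_prompt is a hypothetical helper, and the exact whitespace (a blank line between turns, a newline after the opening assistant tag) is inferred from the example rather than confirmed by a published chat template.

```python
def build_chat_prompt(messages):
    """Render (role, content) pairs in the layout shown above, then open an assistant turn."""
    parts = []
    for role, content in messages:
        # Each turn: start marker + role tag, the content, then the end marker.
        parts.append(f"<|im_start|><|{role}|>\n{content}\n<|im_end|>\n\n")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|><|assistant|>\n")
    return "".join(parts)


prompt = build_chat_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "ہیلو، آپ کیسے ہیں؟"),
])
print(prompt)
```

The resulting string can be passed to the tokenizer like any other text; whether the special tokens are matched as single IDs depends on how they were registered in the published tokenizer.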

📊 Performance Metrics

Metric            English   Urdu
Chars/Token       ~5.1      ~3.9
Tokens/Word       ~1.0      ~1.2
Lossless Decode   ✅        ✅
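The README does not say how these figures were measured. One plausible way to compute the two ratios is sketched below, assuming characters are counted as Unicode code points and words are whitespace-separated; compression_stats and its encode parameter are hypothetical names.

```python
def compression_stats(texts, encode):
    """Return chars/token and tokens/word over a list of strings.

    `encode` is any callable mapping a string to a list of token IDs.
    """
    total_chars = sum(len(t) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    if total_tokens == 0 or total_words == 0:
        raise ValueError("need at least one token and one word")
    return {
        "chars_per_token": total_chars / total_tokens,
        "tokens_per_word": total_tokens / total_words,
    }


# Stand-in encoder (one "token" per whitespace word) just to show the call.
stats = compression_stats(["Hello world again"], encode=str.split)
print(stats)
```

With the real tokenizer, encode could be a small wrapper such as lambda t: tokenizer(t, add_special_tokens=False)["input_ids"].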

🧠 Notes on Tokenization Behavior

English text tokenizes efficiently, with most words mapping to a single token
Urdu is encoded as subword tokens built from byte-level pieces
Some Urdu words are split into multiple sub-tokens
Decoding reverses encoding, up to the NFKC normalization applied to the input
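Because the training method lists NFKC normalization, a round-trip check is best compared against the NFKC-normalized input rather than the raw string: NFKC rewrites some characters (for example, the ligature "ﬁ" becomes "fi"). The sketch below uses a stand-in byte codec so it runs without downloading anything; roundtrip_ok, demo_encode, and demo_decode are hypothetical names.

```python
import unicodedata


def roundtrip_ok(text, encode, decode):
    """True if decode(encode(text)) equals the NFKC-normalized input."""
    expected = unicodedata.normalize("NFKC", text)
    return decode(encode(text)) == expected


# Stand-in codec: NFKC, then raw UTF-8 bytes used as "token IDs".
def demo_encode(text):
    return list(unicodedata.normalize("NFKC", text).encode("utf-8"))


def demo_decode(ids):
    return bytes(ids).decode("utf-8")


print(roundtrip_ok("Hello world! آپ کیسے ہیں؟", demo_encode, demo_decode))
# NFKC is not the identity: the single-character ligature becomes two letters.
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")
```

To test the real tokenizer, pass wrappers around tokenizer(...)["input_ids"] and tokenizer.decode in place of the stand-ins.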

📁 Output Format

The tokenizer outputs:

{
  "input_ids": [...],
  "attention_mask": [...]
}
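In Transformers, attention_mask holds 1 for real tokens and 0 for padding. The sketch below reproduces that relationship for a batch of unequal-length sequences, assuming right-side padding; pad_batch is a hypothetical helper, and pad_id=0 is a placeholder rather than the actual ID of <|pad|>.

```python
def pad_batch(batch_ids, pad_id):
    """Right-pad each sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8]], pad_id=0)  # pad_id=0 is a placeholder
print(batch["input_ids"])
print(batch["attention_mask"])
```

For a single unpadded input, as in the usage example, the mask is simply all ones.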

🔥 Strengths

Strong English performance
Good Urdu coverage
Stable chat formatting
Robust to unseen characters
Suitable for LLM pretraining and SFT

⚠️ Limitations

Individual Urdu tokens may cover only part of a character's UTF-8 bytes, so they can look garbled when inspected one at a time
Not specifically optimized for Urdu morphology
The 50,000-token vocabulary may be too small for very large-scale multilingual use
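The byte-splitting limitation follows from UTF-8 itself: Urdu letters in the Arabic Unicode block take two bytes each, so a byte-level token boundary can fall in the middle of a character. The sketch below shows this with the standard library only; utf8_bytes_per_char is a hypothetical helper.

```python
def utf8_bytes_per_char(text):
    """Pair each character with the number of UTF-8 bytes it occupies."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


print(utf8_bytes_per_char("Hi"))  # ASCII letters use 1 byte each
print(utf8_bytes_per_char("آپ"))  # Arabic-script letters use 2 bytes each

# Half of a two-byte character is not valid UTF-8 on its own, which is why a
# single byte-split token can decode to the replacement character U+FFFD.
first_byte = "آ".encode("utf-8")[:1]
print(first_byte.decode("utf-8", errors="replace") == "\ufffd")
```

Decoding the full ID sequence at once, rather than token by token, reassembles the bytes into complete characters.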

📌 Repository

Humair332/gpt-style-50k-tokenizer
