📘 GPT-Style 50K Multilingual Tokenizer

A byte-level BPE tokenizer trained on a large mixed corpus of English + Urdu text, designed for chat-style LLMs with support for structured prompts and multimodal placeholders.


🚀 Overview

This tokenizer is built using Byte Pair Encoding (BPE) with a vocabulary size of 50,000 tokens, optimized for:

  • English language text
  • Urdu (Unicode script)
  • Chat formatting (system/user/assistant)
  • Multimodal placeholders (speech, frames)

It is compatible with Hugging Face Transformers.


✨ Features

  • ⚡ 50,000 vocabulary size
  • 🌍 Multilingual support (English + Urdu)
  • 💬 Chat-format tokens
  • 🧠 Byte-level BPE (robust to unknown characters)
  • 🔁 Reversible encoding/decoding (up to NFKC normalization)
  • 📦 Hugging Face compatible (PreTrainedTokenizerFast)
  • 🧩 Special token system for LLM training

🧠 Special Tokens

Core Tokens

<|pad|> <|unk|> <|bos|> <|eos|>

Chat Tokens

<|im_start|> <|im_end|>
<|system|> <|user|> <|assistant|>

Multimodal Tokens

<|speech|> <|frame_start|> <|frame_end|>

Reserved Tokens

<|reserved_1|> ... <|reserved_80|>
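
As a rough illustration, the special-token inventory listed above can be written out as a plain Python list. This is only a sketch: the group names and the ordering are assumptions made for readability, and the actual ID assignment in the published tokenizer may differ.

```python
# Special-token inventory as listed in this card.
# The ordering (and therefore any ID assignment) is an assumption.
CORE = ["<|pad|>", "<|unk|>", "<|bos|>", "<|eos|>"]
CHAT = ["<|im_start|>", "<|im_end|>", "<|system|>", "<|user|>", "<|assistant|>"]
MULTIMODAL = ["<|speech|>", "<|frame_start|>", "<|frame_end|>"]
RESERVED = [f"<|reserved_{i}|>" for i in range(1, 81)]  # reserved_1 .. reserved_80

SPECIAL_TOKENS = CORE + CHAT + MULTIMODAL + RESERVED

print(len(SPECIAL_TOKENS))  # 4 + 5 + 3 + 80 = 92
```

If these tokens are counted inside the 50,000-entry vocabulary, 92 entries are taken by special tokens and the rest are byte symbols and learned merges.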

📊 Training Data

The tokenizer was trained on a large-scale mixed dataset:

  • English corpora (web text)
  • Urdu datasets
  • FineWeb samples (filtered parquet files)
  • Chat-formatted synthetic data

Total training size: roughly 3M rows or more


⚙️ Training Method

  • Algorithm: Byte Pair Encoding (BPE)
  • Normalization: NFKC
  • Pre-tokenization: ByteLevel
  • Decoder: ByteLevel (decoders.ByteLevel in the tokenizers library)
  • Trainer: BpeTrainer
  • Vocab size: 50,000
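
The settings above map fairly directly onto the `tokenizers` library. The following is a minimal sketch of such a setup, not the author's actual training script: the toy corpus, the reduced `vocab_size`, the `add_prefix_space=False` choice, and the shortened special-token list are all assumptions made so the example runs in a few seconds.

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Tiny in-memory corpus standing in for the real English + Urdu data (assumption).
corpus = [
    "Hello world! How are you?",
    "آپ کیسے ہیں؟",
    "The tokenizer handles English and Urdu text.",
]

# Subset of the special tokens listed in this card.
special_tokens = ["<|pad|>", "<|unk|>", "<|bos|>", "<|eos|>", "<|im_start|>", "<|im_end|>"]

tokenizer = Tokenizer(BPE(unk_token="<|unk|>"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = BpeTrainer(
    vocab_size=400,  # the card states 50,000; kept small for the toy corpus
    special_tokens=special_tokens,
    # Seed the vocabulary with all 256 byte symbols so any input can be encoded.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

sample = "Hello آپ"
ids = tokenizer.encode(sample).ids
print(ids)
print(tokenizer.decode(ids))
```

Seeding the trainer with the full byte alphabet is what makes a byte-level tokenizer robust to unseen characters: any string can fall back to individual byte symbols.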

📦 Installation

pip install transformers tokenizers huggingface_hub

📥 Loading the Tokenizer

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Humair332/gpt-style-50k-tokenizer")

🧪 Example Usage

Encoding

text = "Hello world! آپ کیسے ہیں؟"

encoded = tokenizer(text)
print(encoded["input_ids"])

Decoding

decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)

💬 Chat Format Example

<|im_start|><|system|>
You are a helpful assistant.
<|im_end|>

<|im_start|><|user|>
ہیلو، آپ کیسے ہیں؟
<|im_end|>

<|im_start|><|assistant|>
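
A prompt in this layout can be assembled with a small helper. The sketch below is illustrative only: `build_prompt` is a hypothetical function, not part of the repository, and the exact whitespace between turns is an assumption that should match whatever the training data used.

```python
def build_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the layout shown above.

    `messages` is a list of {"role": ..., "content": ...} dicts.
    The whitespace between turns is an assumption.
    """
    allowed = {"system", "user", "assistant"}
    parts = []
    for msg in messages:
        role = msg["role"]
        if role not in allowed:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|im_start|><|{role}|>\n{msg['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|><|assistant|>\n")
    return "\n".join(parts)


prompt = build_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "ہیلو، آپ کیسے ہیں؟"},
])
print(prompt)
```

The resulting string can then be passed to the tokenizer like any other text.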

📊 Performance Metrics

| Metric               | English | Urdu |
|----------------------|---------|------|
| Characters per token | ~5.1    | ~3.9 |
| Tokens per word      | ~1.0    | ~1.2 |
| Lossless decode      | ✅      | ✅   |
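
For reference, figures of this kind can be computed along the following lines. This is a sketch under stated assumptions: `compression_stats` is a hypothetical helper, words are taken to be whitespace-separated, and the card does not say how its own numbers were measured.

```python
def compression_stats(texts, tokenize):
    """Return (chars_per_token, tokens_per_word) over a list of strings.

    `tokenize` is any callable mapping a string to a list of tokens or IDs.
    Words are whitespace-separated, which is an assumption.
    """
    n_chars = sum(len(t) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_chars / n_tokens, n_tokens / n_words


# Stand-in tokenizer for demonstration only: one token per whitespace word.
# With the real tokenizer, pass `lambda s: tokenizer(s)["input_ids"]` instead.
demo = ["Hello world", "How are you today"]
cpt, tpw = compression_stats(demo, str.split)
print(round(cpt, 2), tpw)  # 4.67 1.0
```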

🧠 Notes on Tokenization Behavior

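  • English text tokenizes efficiently, at close to one token per word
  • Urdu is encoded with a mix of learned subword tokens and byte-level pieces
  • Some Urdu words are split into multiple sub-tokens
  • Decoding reproduces the NFKC-normalized input exactly; characters that NFKC rewrites (such as compatibility ligatures or full-width forms) come back in their normalized form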
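
Because the normalizer listed under Training Method is NFKC, it helps to know which inputs it alters before tokenization. The snippet below uses only the standard library to show the effect. The sample strings are illustrative choices, not taken from the training data.

```python
import unicodedata

samples = {
    "ligature": "\ufb01le",             # U+FB01 LATIN SMALL LIGATURE FI + "le"
    "full-width": "\uff21\uff22\uff23",  # full-width A, B, C
    "plain English": "Hello world!",
    "plain Urdu": "آپ کیسے ہیں؟",
}

for name, text in samples.items():
    normalized = unicodedata.normalize("NFKC", text)
    print(f"{name}: changed={normalized != text}")
```

Ordinary English and Urdu text passes through unchanged, while compatibility characters are folded to their canonical equivalents and cannot be recovered after decoding.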

📁 Output Format

The tokenizer outputs:

{
  "input_ids": [...],
  "attention_mask": [...]
}
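
To make the role of `attention_mask` concrete, here is a small pure-Python sketch of right-padding a batch. `pad_batch` is a hypothetical helper and the token IDs and `pad_id` are made up. In practice the tokenizer's own padding options would do this, provided a pad token is configured.

```python
def pad_batch(sequences, pad_id):
    """Right-pad lists of token IDs and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8]], pad_id=0)  # IDs and pad_id are made up
print(batch)
```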

🔥 Strengths

  • Efficient English tokenization (~5.1 characters per token)
  • Good Urdu coverage (~3.9 characters per token)
  • Stable chat formatting
  • Robust to unseen characters
  • Suitable for LLM pretraining and SFT

⚠️ Limitations

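  • Urdu words can be split at the byte level, so an individual token may hold only part of a multi-byte UTF-8 character
  • Merges are frequency-based and not tuned to Urdu morphology, so token boundaries need not align with morphemes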
  • Vocabulary size may be small for very large-scale multilingual use
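
The byte-splitting limitation follows from how UTF-8 encodes Arabic-script text. The standard-library snippet below is only an illustration of the underlying byte counts, with sample words chosen for the example. It does not call the tokenizer itself.

```python
english = "Hello"
urdu = "کیسے"

for label, word in (("English", english), ("Urdu", urdu)):
    raw = word.encode("utf-8")
    print(label, len(word), "characters ->", len(raw), "bytes")

# Cutting the byte sequence mid-character leaves an incomplete code point,
# which is what a byte-split token looks like when decoded in isolation.
partial = urdu.encode("utf-8")[:3].decode("utf-8", errors="replace")
print(repr(partial))
```

Each Urdu letter here occupies two bytes, so a byte-level vocabulary needs learned merges just to reach whole characters, which is one reason Urdu yields fewer characters per token than English.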

📌 Repository

Humair332/gpt-style-50k-tokenizer
