# LilChatBot WordLevel Tokenizer
A WordLevel tokenizer trained for the LilChatBot project.
This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.
## Design choices
- WordLevel tokenization (no subword splitting)
- Lowercasing
- Unicode NFKC normalization
- Apostrophes preserved everywhere (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- Aggressive punctuation isolation, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- Repeated punctuation collapsed (`!!!` → `!`, `???` → `?`, `...` → `.`)
- English-focused
This tokenizer intentionally favors lexical transparency over vocabulary compactness.
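As a rough illustration of these rules, here is a stdlib-only sketch that approximates the normalization and splitting behavior described above. It is not the actual pipeline (the authoritative definition lives in `tokenizer.json`); the punctuation list and collapsing regex are simplified stand-ins.

```python
import re
import unicodedata

# Punctuation isolated by the tokenizer (simplified; the authoritative
# list lives in tokenizer.json). Apostrophes are deliberately absent.
PUNCT = r'[.,!?;:()\[\]{}/"“”–—]'

def pretokenize(text: str) -> list[str]:
    """Approximate the normalizer + pre-tokenizer pipeline in plain Python."""
    # 1. Unicode NFKC normalization, then lowercasing.
    text = unicodedata.normalize("NFKC", text).lower()
    # 2. Collapse runs of repeated punctuation: "!!!" -> "!", "..." -> ".".
    text = re.sub(r'([.,!?])\1+', r'\1', text)
    # 3. Surround punctuation with spaces; apostrophes stay attached.
    text = re.sub(f'({PUNCT})', r' \1 ', text)
    return text.split()

print(pretokenize("Don't forget!!!"))
print(pretokenize("The lion's den (yes)."))
```

Note how `don't` and `lion's` survive as single tokens while brackets, periods, and collapsed exclamation marks become standalone tokens.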
## Files
- `tokenizer.json`: complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)
The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.
## Usage

### With transformers
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")
print(tok.decode(tok("The lion's well-being matters — don’t forget that!").input_ids))
```