# LilChatBot WordLevel Tokenizer
A WordLevel tokenizer trained for the LilChatBot project.
This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.
## Design choices
- WordLevel tokenization (no subword splitting)
- Lowercasing
- Unicode NFKC normalization
- Apostrophes preserved everywhere (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- Aggressive punctuation isolation, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- Repeated punctuation collapsed (`!!!` → `!`, `???` → `?`, `...` → `.`)
- English-focused
This tokenizer intentionally favors lexical transparency over vocabulary compactness.
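As a rough illustration of these rules, here is a stdlib-only sketch that approximates the normalization and splitting behavior described above. It is not the actual pipeline (the authoritative definition lives in `tokenizer.json`); the punctuation list and collapsing regex are simplified stand-ins.

```python
import re
import unicodedata

# Punctuation isolated by the tokenizer (simplified; the authoritative
# list lives in tokenizer.json). Apostrophes are deliberately absent.
PUNCT = r'[.,!?;:()\[\]{}/"“”–—]'

def pretokenize(text: str) -> list[str]:
    """Approximate the normalizer + pre-tokenizer pipeline in plain Python."""
    # 1. Unicode NFKC normalization, then lowercasing.
    text = unicodedata.normalize("NFKC", text).lower()
    # 2. Collapse runs of repeated punctuation: "!!!" -> "!", "..." -> ".".
    text = re.sub(r'([.,!?])\1+', r'\1', text)
    # 3. Surround punctuation with spaces; apostrophes stay attached.
    text = re.sub(f'({PUNCT})', r' \1 ', text)
    return text.split()

print(pretokenize("Don't forget!!!"))
print(pretokenize("The lion's den (yes)."))
```

Note how `don't` and `lion's` survive as single tokens while brackets, periods, and collapsed exclamation marks become standalone tokens.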
## Files
- `tokenizer.json`: complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)
The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.
## Usage

### With transformers
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")
print(tok.decode(tok("The lion's well-being matters — don’t forget that!").input_ids))
```