CRITICAL FIX: Switch from ByteLevel to Whitespace pre-tokenizer — fixes 42% UNK rate on domain token sequences a9c4a62 verified rtferraz commited on 2 days ago
Add domain_tokenizer.py — DomainTokenizerBuilder (core assembler, HF integration) 818a2e9 verified rtferraz commited on 8 days ago