feat: add custom tokenizer with multi-char Chinese token splitting
#8
by linyueqian - opened
VoxCPM2 was trained with mask_multichar_chinese_tokens enabled, which splits
multi-character Chinese tokens (e.g. "你好" → ["你", "好"]) into individual
character IDs. The current HF repo ships a plain LlamaTokenizerFast without
this splitting, so downstream inference frameworks (vLLM-Omni, etc.) tokenize
Chinese text differently from training and produce garbled Chinese audio.
This PR adds VoxCPM2Tokenizer (a subclass of LlamaTokenizerFast) that applies
the character splitting transparently inside encode() and __call__().
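The core splitting step can be sketched roughly as below. This is a hypothetical standalone helper, not the actual VoxCPM2Tokenizer code, and the CJK range used (the basic Unified Ideographs block) is an assumption; the real tokenizer may cover extension blocks too.

```python
import re

# Assumption: a "multi-character Chinese token" is a token consisting
# entirely of 2+ characters from the basic CJK Unified Ideographs block.
_MULTI_CJK = re.compile(r"[\u4e00-\u9fff]{2,}")

def split_multichar_chinese(tokens):
    """Split tokens made of 2+ Chinese characters into single characters.

    All other tokens (Latin text, single Chinese characters, mixed tokens)
    pass through unchanged, mirroring the described training-time behavior.
    """
    out = []
    for tok in tokens:
        if _MULTI_CJK.fullmatch(tok):
            out.extend(tok)  # e.g. "你好" -> ["你", "好"]
        else:
            out.append(tok)
    return out

print(split_multichar_chinese(["你好", "world", "吗"]))
# -> ['你', '好', 'world', '吗']
```

In a tokenizer subclass, a step like this would run on the intermediate token list before it is converted to IDs, so encode() and __call__() return per-character IDs for Chinese spans without callers changing anything.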
EasonLiu changed pull request status to merged