feat: add custom tokenizer with multi-char Chinese token splitting

#8
by linyueqian - opened

VoxCPM2 was trained with mask_multichar_chinese_tokens, which splits
multi-character Chinese tokens (e.g. "你好" → ["你", "好"]) into individual
character IDs. The current HF repo ships a plain LlamaTokenizerFast without
this splitting, so downstream inference frameworks (vLLM-Omni, etc.) produce
garbled Chinese audio.
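The mismatch is easy to check on the shipped tokenizer. The repo id below is a placeholder for illustration, and the exact merged token depends on the vocabulary:

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual VoxCPM2 repo.
tok = AutoTokenizer.from_pretrained("openbmb/VoxCPM2")

# The plain LlamaTokenizerFast may encode a common bigram like "你好"
# as a single multi-character token.
print(tok.tokenize("你好"))

# The model was trained to see one ID per character instead.
print([tok.convert_tokens_to_ids(ch) for ch in "你好"])
```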

This PR adds VoxCPM2Tokenizer, a subclass of LlamaTokenizerFast that
transparently applies the character splitting inside encode() and __call__().
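For the gist before opening the diff: below is a minimal sketch of the idea, assuming the splitting is done by post-processing the Python-side token list. The class name and CJK range follow the description above, but the helper name and hook point are illustrative, and the actual PR also covers __call__() and batched inputs.

```python
import re
from typing import List

from transformers import LlamaTokenizerFast

# CJK Unified Ideographs only; the actual PR may cover further blocks.
_CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


class VoxCPM2Tokenizer(LlamaTokenizerFast):
    """LlamaTokenizerFast variant whose encode() splits multi-character
    Chinese tokens into per-character IDs (e.g. "你好" -> IDs of "你", "好")."""

    @staticmethod
    def _split_multichar_chinese(tokens: List[str]) -> List[str]:
        split: List[str] = []
        for tok in tokens:
            # Break apart tokens made entirely of 2+ CJK characters; each
            # single character is assumed to exist in the vocabulary, which
            # is what mask_multichar_chinese_tokens training implies.
            if len(tok) > 1 and all(_CJK_CHAR.fullmatch(ch) for ch in tok):
                split.extend(tok)
            else:
                split.append(tok)
        return split

    def encode(self, text: str, add_special_tokens: bool = True, **kwargs) -> List[int]:
        tokens = self._split_multichar_chinese(self.tokenize(text))
        ids = self.convert_tokens_to_ids(tokens)
        return self.build_inputs_with_special_tokens(ids) if add_special_tokens else ids
```

With something along these lines, encode("你好", add_special_tokens=False) yields one ID per character, matching the training-time masking, provided each character exists in the vocabulary.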

EasonLiu changed pull request status to merged
