Special Token Disaster: Your Tech Lead Has Zero Design Taste
chat_template.jinja
{#- ----------‑‑‑ special token variables ‑‑‑---------- -#}
{%- set bos_token = '<|hy_begin▁of▁sentence|>' %}
{%- set pad_token = '<|hy_▁pad▁|>' %}
{%- set user_token = '<|hy_User|>' %}
See <|hy_begin▁of▁sentence|> and <|hy_▁pad▁|>: | and | and _ and ▁
The special token design for Hunyuan is a visual disaster that screams "zero design taste" from leadership. Using a bloated mix of fullwidth pipes (|) and obscure geometric blocks (▁) doesn't make the model look high-tech; it makes it look like a corrupted encoding error.
A tech leader with an actual eye for polish understands that functional infrastructure should be clean and harmonious. Instead, we got a syntax that creates jagged, uneven visual noise.
Please take back your shit.
What the hell lol, thank you for sharing such a nice model <3 don't listen to this
Hi, thanks for the feedback!
The special token design using fullwidth pipes (|) and block characters (▁) is an intentional engineering decision rather than an oversight. During pretraining and continual training, the model is trained on massive, diverse corpora where conventional special tokens such as <|im_start|> frequently appear as plain text. These collisions make it ambiguous whether a token is a genuine control signal or just content, which can degrade model behavior. Using visually distinctive Unicode characters significantly reduces collision probability and keeps control tokens cleanly separated from content.
It's also worth noting that these special tokens are handled internally by the tokenizer and chat template, so they should be completely transparent to end users and developers during normal usage — you won't need to type or deal with them directly.
That said, we completely understand the ergonomic concerns. The current token set in Hy3-preview prioritizes robustness, but we're actively working on an optimized version in a future release that better balances collision resistance with readability and developer experience. Stay tuned!
Thanks again for the candid feedback — it's genuinely appreciated.
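To make the collision argument concrete, here is a minimal sketch of how one might measure how often a candidate control string shows up as ordinary text. The function name `count_literal_hits` and the three sample documents are made up for illustration; this is not the Hunyuan team's tooling, and a real corpus scan would stream files rather than hold strings in memory.

```python
def count_literal_hits(documents, candidates):
    """Count how many documents contain each candidate string as plain text."""
    hits = {candidate: 0 for candidate in candidates}
    for doc in documents:
        for candidate in candidates:
            if candidate in doc:
                hits[candidate] += 1
    return hits


# Hypothetical sample documents standing in for a pretraining corpus.
docs = [
    "Format each turn as <|im_start|>user ... <|im_end|>.",
    "ChatML uses <|im_start|> to open a message.",
    "An unrelated sentence with no markers.",
]

hits = count_literal_hits(docs, ["<|im_start|>", "<|hy_User|>"])
print(hits)
```

A string that already appears in many documents is a riskier choice for a control token than one that appears in none.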
Totally unconvincing. When you see "<|im_start|>" in your pre-training corpus, you should parse that data and convert it to conversational format.
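A rough sketch of what that parsing step could look like, assuming the corpus text follows the usual ChatML layout (`<|im_start|>role`, a newline, the content, then `<|im_end|>`). The helper `parse_chatml` is hypothetical, and it only handles well-formed turns: a document that merely mentions the marker in prose yields nothing and would still need separate handling.

```python
import re

# One ChatML turn: opening marker, role, newline, content, closing marker.
CHATML_TURN = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def parse_chatml(text):
    """Extract well-formed ChatML turns as a list of role/content dicts."""
    return [
        {"role": role, "content": content.strip()}
        for role, content in CHATML_TURN.findall(text)
    ]


sample = (
    "<|im_start|>user\nHi there<|im_end|>\n"
    "<|im_start|>assistant\nHello!<|im_end|>"
)
messages = parse_chatml(sample)
print(messages)
```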
@yiqichen01
It is possible to rename special tokens without impacting the model at all, by modifying the tokenizer.json files! It is the token ID that matters, as this is a special token without any merges.
As long as any downstream users do not use their own chat template, this is a non-breaking change.
It may be best to consider this for the next preview or the full release of the model, rather than breaking the model in a new revision.
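For illustration, a minimal sketch of the rename on a parsed tokenizer.json, assuming the common layout with an `added_tokens` list (entries carrying `id` and `content`) and, optionally, a `model.vocab` mapping. The helper `rename_added_tokens`, the token ID, and the replacement name are all made up for the example. Other files that repeat the token strings (the chat template, and possibly tokenizer_config.json) would need the same rename.

```python
import copy


def rename_added_tokens(tokenizer, renames):
    """Return a copy of a parsed tokenizer.json with token strings renamed.

    Token IDs are left untouched; only the surface strings change.
    """
    updated = copy.deepcopy(tokenizer)
    for entry in updated.get("added_tokens", []):
        if entry["content"] in renames:
            entry["content"] = renames[entry["content"]]
    # Some tokenizers also list added tokens in the model vocabulary.
    vocab = updated.get("model", {}).get("vocab")
    if isinstance(vocab, dict):
        for old, new in renames.items():
            if old in vocab:
                if new in vocab:
                    raise ValueError(f"{new!r} already exists in the vocab")
                vocab[new] = vocab.pop(old)
    return updated


# Hypothetical fragment; the ID and the new name are illustrative only.
example = {
    "added_tokens": [{"id": 0, "content": "<|hy_User|>", "special": True}],
    "model": {"vocab": {"<|hy_User|>": 0, "hello": 1}},
}
renamed = rename_added_tokens(example, {"<|hy_User|>": "<|hy_user|>"})
print(renamed["added_tokens"][0])
```

In practice the dict would come from `json.load` on the tokenizer.json file and be written back with `json.dump(..., ensure_ascii=False)` so the non-ASCII characters stay readable.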