Add chat template
This chat template looks good to me in pre-testing, but we might want to wait until the model is fully merged into Transformers for final testing + merging!
Is this expected?
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Flash", revision="refs/pr/16")

# One assistant turn that makes a tool call...
tool_calls = [{"type": "function", "function": {"name": "dummy", "arguments": "{}"}}]
messages1 = [
    {"role": "user", "content": "dummy"},
    {"role": "assistant", "content": "", "tool_calls": tool_calls},
]
# ...then the same conversation with the tool result appended.
messages2 = messages1 + [{"role": "tool", "name": "dummy", "content": "dummy"}]

s1 = tok.apply_chat_template(messages1, tokenize=False)
s2 = tok.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True)
print("s1:", repr(s1))
print("s2:", repr(s2))
print("s2 starts with s1?", s2.startswith(s1))
```
```
s1: '<|begin▁of▁sentence|><|User|>dummy<|Assistant|><think></think>\n\n<|DSML|tool_calls>\n<|DSML|invoke name="dummy">\n<|DSML|parameter name="arguments" string="true">{}</|DSML|parameter>\n</|DSML|invoke>\n</|DSML|tool_calls><|end▁of▁sentence|>'
s2: '<|begin▁of▁sentence|><|User|>dummy<|Assistant|></think>\n\n<|DSML|tool_calls>\n<|DSML|invoke name="dummy">\n<|DSML|parameter name="arguments" string="true">{}</|DSML|parameter>\n</|DSML|invoke>\n</|DSML|tool_calls><|end▁of▁sentence|><|User|><tool_result>dummy</tool_result><|Assistant|><think>'
s2 starts with s1? False
```
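The same mismatch should show up at the token level, which is what would break incremental prompt reuse across turns. A quick variant of the check above, reusing `messages1`/`messages2` with `tokenize=True` (the default):

```python
# Token-level version of the prefix check: if the rendering of earlier turns
# changes once a tool result is appended, cached prefixes can't be reused.
ids1 = tok.apply_chat_template(messages1)
ids2 = tok.apply_chat_template(messages2, add_generation_prompt=True)
print("token prefix match?", ids2[: len(ids1)] == ids1)
```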
This seems to work better: https://huggingface.co/trl-internal-testing/tiny-DeepseekV4ForCausalLM/discussions/6/files
@qgallouedec I think this is intended, or at least it matches the example tests; I mostly did "test-driven" development here. I realize that `<|Assistant|></think>` instead of something like `<|Assistant|><think>\n</think>` looks like a bug, but this pattern also appears in the expected test outputs when `drop_thinking=True` (which is the default): https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/encoding/tests/test_output_2.txt
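For anyone who wants to confirm that locally, here's a minimal sketch. It assumes the template reads `drop_thinking` as a Jinja variable (matching the name used in the expected test outputs above); extra keyword arguments to `apply_chat_template` are forwarded into the template context, so both renderings can be compared side by side:

```python
# Assumption: the template exposes `drop_thinking` as a template variable.
s1_keep = tok.apply_chat_template(messages1, tokenize=False, drop_thinking=False)
s2_keep = tok.apply_chat_template(
    messages2, tokenize=False, add_generation_prompt=True, drop_thinking=False
)
print("prefix holds with drop_thinking=False?", s2_keep.startswith(s1_keep))
```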
Hi! Sharing some downstream evidence that landing this template (or any tool-supporting variant) is load-bearing for the agent ecosystem.
We just shipped day-0 DeepSeek-V4-Flash support in rapid-mlx (an Apple Silicon MLX backend) by vendoring Blaizzy's mlx-lm PR #1192, and we tested both the 2-bit DQ and 8-bit mlx-community quants on a Mac Studio M3 Ultra:
| Suite | 2-bit DQ | 8-bit |
|---|---|---|
| Plain chat | ✅ works | ✅ works |
| Decode (tok/s) | 56 | 31 |
| Stress (8 scenarios) | 7/8 PASS | 7/8 PASS |
| Tool calling (30-scenario eval) | 0/30 | 0/30 |
| Hermes/OpenClaude agent integration | failing on tool tests | failing on tool tests |
The reason for the 0/30 is exactly what's being discussed here: the `chat_template.jinja` that mlx-community's quants ship today only handles the system/user/assistant roles. There is no tool-role rendering, no iteration over the `tools` array, and no `<tool_call>` markers. So tools passed via the OpenAI-compatible API are silently dropped before the model ever sees them.
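To make that failure mode concrete, here's an illustrative probe that checks whether a given tokenizer's template renders the tools array at all. The `get_weather` tool is a made-up example; `apply_chat_template` accepts a `tools` argument in recent transformers releases. With the template the quants ship today, the tool name never reaches the prompt:

```python
def template_renders_tools(tok) -> bool:
    """Return True if the tokenizer's chat template renders a passed-in tool."""
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # dummy tool; any schema works
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]
    messages = [{"role": "user", "content": "What's the weather in Paris?"}]
    rendered = tok.apply_chat_template(
        messages, tools=tools, tokenize=False, add_generation_prompt=True
    )
    # If the template never iterates over `tools`, the name won't appear anywhere.
    return "get_weather" in rendered
```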
So whichever variant of this PR lands first (the current draft or the trl-internal-testing alternative @qgallouedec referenced) is genuinely unblocking: the model itself is clearly capable, it just isn't being told that tools exist. Happy to re-run our evals and report numbers once a final template is merged.