regex pattern
Fixing Mistral-Common vs Hugging Face Tokenizer Mismatch
The current pre-tokenizer in the tokenizer.json is as follows:
"pre_tokenizer": {
"type": "Sequence",
"pretokenizers": [
{
"type": "Split",
"pattern": {
"Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "ByteLevel",
"add_prefix_space": false,
"trim_offsets": true,
"use_regex": false
}
]
},
However, the regex pattern `(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+` is wrong; the correct one is `[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+`.
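The behavioral difference is easy to see at the pre-tokenization level. A minimal sketch using the third-party `regex` package (which, unlike the stdlib `re`, supports `\p{...}` Unicode property classes); the patterns below are the two quoted above with the JSON escaping removed:

```python
# Requires the third-party "regex" package (pip install regex);
# the stdlib "re" module does not support \p{L}-style classes.
import regex

# Wrong pattern currently in tokenizer.json
OLD = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

# Correct pattern from tekken.json
NEW = r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+"

# The case-insensitive contraction branch (?i:'s|'t|...) in the wrong
# pattern eats "'T" as if it were the contraction "'t":
print(regex.findall(OLD, "'The'"))  # ["'T", 'he', "'"]
print(regex.findall(NEW, "'The'"))  # ["'The", "'"]

# The patterns also differ on digits: \p{N}{1,3} groups up to three
# digits, while \p{N} splits them one by one:
print(regex.findall(OLD, "2503"))  # ['250', '3']
print(regex.findall(NEW, "2503"))  # ['2', '5', '0', '3']
```

This only shows the pre-tokenization split; the final token ids additionally depend on the ByteLevel step and the BPE merges, as in the repro further down.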
The ground truth for the regex pattern can be found in the tekken.json file for our officially supported tokenizer using mistral-common:
"config": {
"pattern": "[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
"num_vocab_tokens": 150000,
"default_vocab_size": 131072,
"default_num_special_tokens": 1000,
"version": "v11"
},
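Until a fix is merged, a locally downloaded checkpoint can be patched by hand: the `Split` pre-tokenizer's `Regex` in `tokenizer.json` is simply overwritten with the tekken.json pattern. A hedged sketch; `fix_tokenizer_json` is a hypothetical helper, not part of any library, and the demo runs it against a minimal stand-in file rather than a real checkpoint:

```python
import json
import tempfile

# Correct pattern, copied verbatim from tekken.json above
CORRECT = (
    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+"
    r"|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*"
    r"|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

def fix_tokenizer_json(path: str) -> None:
    """Overwrite the Split pre-tokenizer's regex with the tekken.json one."""
    with open(path, encoding="utf-8") as f:
        tok = json.load(f)
    # First entry of the Sequence is the Split pre-tokenizer (see above)
    tok["pre_tokenizer"]["pretokenizers"][0]["pattern"]["Regex"] = CORRECT
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tok, f, ensure_ascii=False, indent=2)

# Demo on a minimal stand-in file mirroring the structure quoted above
demo = {"pre_tokenizer": {"type": "Sequence", "pretokenizers": [
    {"type": "Split", "pattern": {"Regex": "WRONG"},
     "behavior": "Isolated", "invert": False},
    {"type": "ByteLevel", "add_prefix_space": False,
     "trim_offsets": True, "use_regex": False},
]}}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)
    tmp = f.name

fix_tokenizer_json(tmp)
with open(tmp, encoding="utf-8") as f:
    patched = json.load(f)
print(patched["pre_tokenizer"]["pretokenizers"][0]["pattern"]["Regex"] == CORRECT)  # True
```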
Also, just to add an easy repro:

```python
# CORRECT
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tok_mc = MistralTokenizer.from_hf_hub("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
tok_mc.instruct_tokenizer.tokenizer.encode("'The'", True, False)
# [1, 1039, 1784, 1039]

# INCORRECT
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
tok.encode("'The'")
# [1, 1039, 1084, 1268, 1039]
# Incorrectly tokenizes to ["'", 'T', 'he', "'"]
```
Can this be merged? Any blockers?
@pandora-s @patrickvonplaten @BramVanroy Can someone confirm with Qwen3 team if they trained with the broken tokenizers? Qwen3 30B-A3B uses this tokenizer and maybe other Qwen3 models.
Good question, I have the same question too.
Hey
@pandora-s @patrickvonplaten @BramVanroy Can someone confirm with Qwen3 team if they trained with the broken tokenizers? Qwen3 30B-A3B uses this tokenizer and maybe other Qwen3 models.
Anyone that used the HF tokenizer (and not mistral-common) prior to the fix in Transformers, and trained using Mistral-Small-3.1-24B-Instruct-2503 as a base, loaded a "broken" tokenizer for the model. That being said, fine-tuning should diminish the impact over time, but it might result in subpar performance and longer convergence.
We noticed a discrepancy for less than 1% of tokens.
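For anyone who wants to quantify that discrepancy on their own data, here is a hedged sketch. `mismatch_rate` is a hypothetical helper (it measures disagreement per text, not per token): pass it any two callables that map a string to a list of token ids, e.g. the mistral-common and HF encoders from the repro above. The demo uses toy stand-in encoders so it runs offline:

```python
from typing import Callable, Iterable, List

def mismatch_rate(
    encode_a: Callable[[str], List[int]],
    encode_b: Callable[[str], List[int]],
    texts: Iterable[str],
) -> float:
    """Fraction of texts on which the two encoders produce different ids."""
    texts = list(texts)
    diffs = sum(1 for t in texts if encode_a(t) != encode_b(t))
    return diffs / len(texts) if texts else 0.0

# Toy stand-in encoders: identical except on strings containing an
# apostrophe, mimicking the contraction-branch bug discussed above.
def enc_a(s: str) -> List[int]:
    return [len(w) for w in s.split()]

def enc_b(s: str) -> List[int]:
    ids = [len(w) for w in s.split()]
    return [i + 1 for i in ids] if "'" in s else ids

print(mismatch_rate(enc_a, enc_b,
                    ["hello world", "'The'", "no quotes here", "it's fine"]))  # 0.5
```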
As someone on the receiving end of this, I just want to chime in and say that I have read this message a few thousand times now, as it is spammed at me in my console, and I am not even using a Mistral model at all but a Llama 3 tokenizer:
The tokenizer you are loading from '/.../meta-llama/llama-3.3-8B-Instruct' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
I'm glad that this was so important that every tokenizer ever, when loaded, needs to say that it doesn't have the Mistral tokenizer fix.
@pandora-s @patrickvonplaten @BramVanroy Can someone confirm with Qwen3 team if they trained with the broken tokenizers? Qwen3 30B-A3B uses this tokenizer and maybe other Qwen3 models.
@Qubitium @THU-CVML-Administrator
At least for Qwen3-TTS, the fix is applied at inference time in the official repo:
```python
processor = AutoProcessor.from_pretrained(pretrained_model_name_or_path, fix_mistral_regex=True)
```
The discussion title "regex pattern" is pretty sparse β could mean a few different things in the context of Mistral-Small-3.1-24B-Instruct-2503. Are you running into issues with the model's output not conforming to a regex constraint during structured generation, or is this about tokenizer behavior with regex-heavy inputs? Both come up fairly often with this model family.
If it's structured output / constrained decoding, Mistral-Small-3.1 works reasonably well with tools like outlines or guidance for regex-guided generation. The key thing to watch is that the tokenizer's vocabulary can cause issues where a regex that looks straightforward fails to match cleanly because the token boundaries don't align with your pattern boundaries. For example, patterns anchored on character-level boundaries (like \b word boundaries) can behave unexpectedly when the tokenizer merges characters into multi-byte tokens. Testing your regex against the actual tokenized output rather than the raw string often surfaces these mismatches early.
On the agentic side β if this is coming up in a pipeline where the model is being called by an agent and you need reliable structured outputs to pass between components, that's where regex conformance becomes a trust/reliability question as much as a technical one. In systems like AgentGraph, enforcing output schemas at the agent boundary (rather than hoping the model self-regulates) tends to be more robust β you validate the regex match before the output propagates downstream. Happy to go deeper on either the constrained decoding mechanics or the pipeline validation angle if you share more context about what you're actually trying to do.
The discussion title is just "regex pattern" without any body text shown, so I'll respond to what's likely being discussed β regex patterns for tokenization, output parsing, or structured generation with Mistral-Small-3.1-24B-Instruct-2503.
Could you share more context about what you're trying to accomplish? "Regex pattern" could mean a few different things with this model β constrained decoding via something like outlines or guidance, parsing structured outputs from the model's responses, or maybe tokenizer-level pattern matching.
If you're doing structured generation with Mistral-Small-3.1, it's worth noting that the instruct variant has reasonably strong instruction-following for JSON-mode style outputs, but if you need hard guarantees on output format, you're better off using a constrained decoding library rather than relying on prompt-level instructions alone. Libraries like outlines let you compile a regex directly into a logit mask so the model cannot produce tokens that violate your pattern β much more reliable than post-hoc parsing, especially at scale or in agentic pipelines where a malformed output can cascade.
On the agentic side specifically: if you're using regex to validate or parse outputs from agents built on this model, be careful about treating regex match as a trust signal. In our work on AgentGraph we've seen cases where agents can be prompted to produce outputs that superficially match a pattern but carry semantically adversarial content β the regex passes, the downstream system trusts it, and you have a problem. Structural validation and identity verification need to be separate concerns from format validation.