architecture: MistralForCausalLM
base_model: B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop
models:
  - model: B:/12B/models--allura-org--Tlacuilo-12B
  - model: B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop
  - model: B:/12B/models--EldritchLabs--Altair-Stock-12B-v1/MPOA_alt
  - model: B:/12B/models--inflatebot--MN-12B-Mag-Mell-R1
  - model: B:/12B/models--MuXodious--Irix-12B-Model_Stock-absolute-heresy
  - model: B:/12B/models--aixonlab--Aether-12b
  - model: B:/12B/models--SicariusSicariiStuff--Impish_Bloodmoon_12B
  - model: B:/12B/models--SicariusSicariiStuff--Sweet_Dreams_12B
  - model: B:/12B/models--Epiculous--Azure_Dusk-v0.2
  # - model: B:/12B/models--SicariusSicariiStuff--Impish_Bloodmoon_12B.backup
merge_method: karcher
parameters:
  max_iter: 30 # Maximum iterations for the Karcher mean algorithm (default 10)
  tol: 1e-9 # Convergence tolerance for the Karcher mean algorithm (default 1e-5)
dtype: float32
out_dtype: bfloat16
tokenizer:
  source: "union"
  tokens:
    # Force ChatML EOS tokens
    "<|im_start|>":
      source: "B:/12B/models--allura-org--Tlacuilo-12B"
      force: true
    "<|im_end|>":
      source: "B:/12B/models--allura-org--Tlacuilo-12B"
      force: true
    # Keep Mistral tokens
    "[INST]":
      source: "B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop"
      # source: "B:/12B/models--mistralai--Mistral-Nemo-Instruct-2407" # The tokenizer system requires all models referenced in token configurations to be present in the merge's model list to build proper embedding permutations.
    "[/INST]":
      source: "B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop"
    # Force </s> as fallback EOS
    "</s>":
      source: "B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop"
      force: true
chat_template: "chatml"
name: Nocturne-Nereid-12B-v1
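The karcher merge method computes a Karcher (Riemannian) mean of the models' weights, and `max_iter` and `tol` bound the fixed-point iteration it runs. As a minimal sketch of what those two knobs control — not mergekit's implementation, which operates on high-dimensional weight tensors — here is the same iteration on the simplest curved space, the unit circle:

```python
import math

def karcher_mean_angles(angles, max_iter=30, tol=1e-9):
    """Karcher (Riemannian) mean of points on the unit circle."""
    m = angles[0]
    for _ in range(max_iter):
        # Log map: signed geodesic offset from the current estimate to each point
        diffs = [math.atan2(math.sin(a - m), math.cos(a - m)) for a in angles]
        step = sum(diffs) / len(diffs)
        m += step  # Exp map: move along the averaged tangent direction
        if abs(step) < tol:  # `tol` plays the same role as the YAML's tol
            break
    return m

print(karcher_mean_angles([0.0, math.pi / 2]))  # ≈ pi/4
```

Raising `max_iter` and tightening `tol`, as the config above does, trades compute time for a closer approximation of the true mean.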
C:\mergekit-main>timeout /t 3 /nobreak && mergekit-yaml C:\mergekit-main\altair7.yaml C:\mergekit-main\merged_model_altair7 --copy-tokenizer --allow-crimes --out-shard-size 5B --trust-remote-code --lazy-unpickle --random-seed 420 --cuda
Waiting for 0 seconds, press CTRL+C to quit ...
`torch_dtype` is deprecated! Use `dtype` instead!
Warmup loader cache: 100%|████████████████████| 9/9 [00:00<00:00, 1499.99it/s]
Planning operations
Executing graph:   0%| | 0/4361 [00:00<?, ?it/s]
The tokenizer you are loading from 'B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Loading tokenizers
The tokenizer you are loading from 'B:/12B/models--allura-org--Tlacuilo-12B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from 'B:/12B/models--aixonlab--Aether-12b' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from 'B:/12B/models--Epiculous--Azure_Dusk-v0.2' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from 'B:/12B/models--inflatebot--MN-12B-Mag-Mell-R1' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from 'B:/12B/models--EldritchLabs--Altair-Stock-12B-v1/MPOA_alt' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from 'B:/12B/models--SicariusSicariiStuff--Impish_Bloodmoon_12B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from 'B:/12B/models--SicariusSicariiStuff--Sweet_Dreams_12B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Building output tokenizer
Token '<|im_start|>' added with multiple different settings, using first
Building permutations
Building tokenizer permutations: 100%|████████████████████| 9/9 [00:00<00:00, 9.54it/s]
Executing graph: 0%| | 3/4361 [00:07<2:44:37, 2.27s/it]
Based on the codebase, mergekit doesn't currently support passing fix_mistral_regex=True to tokenizer loading. The warnings you're seeing come from the transformers library, but mergekit only passes trust_remote_code to tokenizer loading calls, not the fix_mistral_regex parameter.
Current Tokenizer Loading in MergeKit
The main tokenizer loading happens in mergekit/tokenizer/build.py:
# In get_stripped_tokenizer()
tokenizer = transformers.AutoTokenizer.from_pretrained(
    path.path,
    revision=path.revision,
    trust_remote_code=trust_remote_code,
    use_fast=True,
)
And in build_tokenizer():
model_tok = transformers.AutoTokenizer.from_pretrained(
    model.model.path,
    revision=model.model.revision,
    trust_remote_code=trust_remote_code,
)
How to Add Support
To add fix_mistral_regex=True support, you would need to:
1. Add the parameter to MergeOptions - modify the options structure to include a new field for fix_mistral_regex.
2. Update all tokenizer loading calls - modify the AutoTokenizer.from_pretrained() calls throughout the codebase to pass this parameter:
   - mergekit/tokenizer/build.py (lines 49-54, 203-207, 155-157)
   - mergekit/merge.py (lines 271-275)
   - mergekit/scripts/tokensurgeon.py (lines 148-152)
3. Add a CLI flag - add a --fix-mistral-regex flag to the CLI that sets this option.
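For the CLI flag, mergekit derives its command-line flags from the option field names, so the snake_case field maps to a kebab-case flag, just as trust_remote_code maps to the --trust-remote-code flag already in use. A rough sketch of that convention (an assumption about the wiring; the exact mechanism in options.py may differ):

```python
def field_to_flag(name: str) -> str:
    # Boolean option fields conventionally become --kebab-case CLI flags
    # (with click, a toggleable pair would be f"--{flag}/--no-{flag}").
    return "--" + name.replace("_", "-")

print(field_to_flag("fix_mistral_regex"))  # --fix-mistral-regex
```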
Temporary Workaround
Until mergekit adds native support, you could:
1. Modify the source locally - add fix_mistral_regex=True directly to the AutoTokenizer.from_pretrained() calls in your local mergekit installation.
2. Use a different tokenizer source - if one of your models has a correct tokenizer, you could specify it as the tokenizer source in your YAML config.
The warnings won't prevent your merge from completing, but as noted, they may lead to incorrect tokenization behavior with the affected models.
Notes
The trust_remote_code parameter you're already using via --trust-remote-code follows the same pattern through the codebase - it's passed from CLI options to the tokenizer loading functions. Adding fix_mistral_regex support would require implementing the same pattern.
Explain how to enable this for merges on my local copy, step by step.
To enable fix_mistral_regex=True for your local mergekit installation, you'll need to modify several files to add this parameter throughout the tokenizer loading pipeline.
Step-by-Step Local Modifications
1. Add the option to MergeOptions
Edit mergekit/options.py and add the new field:
# In MergeOptions class (around line 41)
fix_mistral_regex: bool = False
Also add it to OPTION_HELP and OPTION_CATEGORIES dictionaries:
# Around line 114
OPTION_HELP = {
    # ... existing options ...
    "fix_mistral_regex": "Fix incorrect regex pattern in Mistral tokenizers",
}

# Around line 140
OPTION_CATEGORIES = {
    # ... existing categories ...
    "fix_mistral_regex": "Dangerous Options",
}
2. Update tokenizer loading in mergekit/tokenizer/build.py
Modify all AutoTokenizer.from_pretrained() calls to include the new parameter:
# Line 49-54 in get_stripped_tokenizer()
tokenizer = transformers.AutoTokenizer.from_pretrained(
    path.path,
    revision=path.revision,
    trust_remote_code=trust_remote_code,
    use_fast=True,
    fix_mistral_regex=fix_mistral_regex,  # Add this line
)

# Line 203-207 in build_tokenizer()
model_tok = transformers.AutoTokenizer.from_pretrained(
    model.model.path,
    revision=model.model.revision,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,  # Add this line
)

# Line 155-157 in build_union_tokenizer()
res = transformers.AutoTokenizer.from_pretrained(
    p,
    use_fast=True,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,  # Add this line
)
You'll also need to update the function signatures to accept this parameter and pass it through the call chain.
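The pass-through pattern can be sketched end to end with hypothetical stand-in functions (names simplified; this is the shape of the change, not mergekit's actual code):

```python
from dataclasses import dataclass

@dataclass
class Options:
    # Hypothetical miniature of MergeOptions; field names match the doc's flags
    trust_remote_code: bool = False
    fix_mistral_regex: bool = False

def load_tokenizer(path, trust_remote_code=False, fix_mistral_regex=False):
    # Stand-in for transformers.AutoTokenizer.from_pretrained(...)
    return {"path": path,
            "trust_remote_code": trust_remote_code,
            "fix_mistral_regex": fix_mistral_regex}

def build_tokenizer(path, trust_remote_code=False, fix_mistral_regex=False):
    # Every intermediate function must both accept the flag and forward it
    return load_tokenizer(path,
                          trust_remote_code=trust_remote_code,
                          fix_mistral_regex=fix_mistral_regex)

def run_merge(path, options: Options):
    # Entry point: unpack the options object once, then thread plain kwargs
    return build_tokenizer(path,
                           trust_remote_code=options.trust_remote_code,
                           fix_mistral_regex=options.fix_mistral_regex)

print(run_merge("some-model", Options(fix_mistral_regex=True)))
```

If any function in the middle omits the parameter, the flag silently stops propagating at that point, which is why every signature in the chain has to change.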
3. Update mergekit/merge.py
Modify the tokenizer loading in _copy_tokenizer():
# Line 271-275
tokenizer = transformers.AutoTokenizer.from_pretrained(
    donor_model.model.path,
    revision=donor_model.model.revision,
    trust_remote_code=options.trust_remote_code,
    fix_mistral_regex=options.fix_mistral_regex,  # Add this line
)
4. Update mergekit/scripts/tokensurgeon.py
Modify the tokenizer loading:
# Line 148-152 in get_stuff()
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model.model.path,
    revision=model.model.revision,
    trust_remote_code=options.trust_remote_code,
    fix_mistral_regex=options.fix_mistral_regex,  # Add this line
)
5. Update mergekit/scripts/moe.py
Modify the tokenizer loading:
# Line 36-38
tokenizer = transformers.AutoTokenizer.from_pretrained(
    base_model.model.path,
    revision=base_model.model.revision,
    fix_mistral_regex=merge_options.fix_mistral_regex,  # Add this line
)
6. Update function signatures
You'll need to update the function signatures to pass the fix_mistral_regex parameter through the call chain. The key functions to update are:
- get_stripped_tokenizer() in mergekit/tokenizer/build.py
- build_union_tokenizer() in mergekit/tokenizer/build.py
- build_tokenizer() in mergekit/tokenizer/build.py
- BuildTokenizer.execute() in mergekit/tokenizer/build.py
7. Use the new flag
After making these changes, you can use the new flag in your command:
mergekit-yaml C:\mergekit-main\altair7.yaml C:\mergekit-main\merged_model_altair7 --copy-tokenizer --allow-crimes --out-shard-size 5B --trust-remote-code --lazy-unpickle --random-seed 420 --cuda --fix-mistral-regex
Notes
The trust_remote_code parameter follows the same pattern through the codebase - it's passed from CLI options to the tokenizer loading functions. Adding fix_mistral_regex support requires implementing the same pattern throughout all tokenizer loading locations.
The warnings you're seeing come from the transformers library and indicate that the tokenizer regex pattern may lead to incorrect tokenization. While the merge will complete without this fix, enabling it will ensure proper tokenization behavior.
I did all of that and now it says this
C:\mergekit-main>timeout /t 3 /nobreak && mergekit-yaml C:\mergekit-main\altair7.yaml C:\mergekit-main\merged_model_altair7 --copy-tokenizer --allow-crimes --out-shard-size 5B --trust-remote-code --lazy-unpickle --random-seed 420 --cuda --fix-mistral-regex
Waiting for 0 seconds, press CTRL+C to quit ...
`torch_dtype` is deprecated! Use `dtype` instead!
Warmup loader cache: 100%|████████████████████| 9/9 [00:00<00:00, 1499.99it/s]
Planning operations
Executing graph: 0%| | 1/4361 [00:00<14:53, 4.88it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Scripts\mergekit-yaml.exe\__main__.py", line 5, in <module>
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\options.py", line 169, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\scripts\run_yaml.py", line 30, in main
    run_merge(
  File "C:\mergekit-main\mergekit\merge.py", line 85, in run_merge
    for _task, value in exec.run(quiet=options.quiet):
  File "C:\mergekit-main\mergekit\graph.py", line 747, in run
    for handle, value in self._run(quiet=quiet, desc=desc):
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\graph.py", line 712, in _run
    res = task.execute(**cpu_arguments)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\tokenizer\build.py", line 294, in execute
    return build_tokenizer(
           ^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\tokenizer\build.py", line 193, in build_tokenizer
    tokenizer_base = get_stripped_tokenizer(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\tokenizer\build.py", line 54, in get_stripped_tokenizer
    fix_mistral_regex=fix_mistral_regex, # Add this line
    ^^^^^^^^^^^^^^^^^
NameError: name 'fix_mistral_regex' is not defined
C:\mergekit-main>
The error occurs because you added fix_mistral_regex to the AutoTokenizer.from_pretrained() call in get_stripped_tokenizer() but didn't update the function signature to accept this parameter. The parameter needs to be passed through the entire call chain.
Fix the function signatures
1. Update get_stripped_tokenizer() in mergekit/tokenizer/build.py
def get_stripped_tokenizer(
    path: ModelPath, trust_remote_code: bool = False, fix_mistral_regex: bool = False
) -> transformers.PreTrainedTokenizerFast:
2. Update build_tokenizer() in mergekit/tokenizer/build.py
def build_tokenizer(
    base_model: Optional[ModelReference],
    referenced_models: List[ModelReference],
    tokenizer_source: Union[Literal["union"], Literal["base"], ModelReference],
    trust_remote_code: bool,
    fix_mistral_regex: bool = False,
    add_tokens: Optional[List[str]] = None,
) -> TokenizerInfo:
3. Update BuildTokenizer.execute() in mergekit/tokenizer/build.py
class BuildTokenizer(Task[TokenizerInfo]):
    base_model: Optional[ModelReference]
    referenced_models: Tuple[ModelReference, ...]
    tokenizer_source: Union[Literal["union"], Literal["base"], ModelReference]
    add_tokens: Optional[Tuple[str, ...]]
    trust_remote_code: bool = False
    fix_mistral_regex: bool = False

    def execute(self, **_kwargs) -> TokenizerInfo:
        return build_tokenizer(
            base_model=self.base_model,
            referenced_models=self.referenced_models,
            tokenizer_source=self.tokenizer_source,
            trust_remote_code=self.trust_remote_code,
            fix_mistral_regex=self.fix_mistral_regex,
            add_tokens=self.add_tokens,
        )
4. Update the call in build_tokenizer()
tokenizer_base = get_stripped_tokenizer(
    base_model.model,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,
)
5. Update the call in MergePlanner.__init__() in mergekit/plan.py
self._tokenizer_task = BuildTokenizer(
    base_model=config.base_model,
    referenced_models=tuple(config.referenced_models()),
    tokenizer_source=tokenizer_source,
    trust_remote_code=options.trust_remote_code,
    fix_mistral_regex=options.fix_mistral_regex,
    add_tokens=tuple(token_cfg.keys()),
)
6. Update build_union_tokenizer() in mergekit/tokenizer/build.py
def build_union_tokenizer(
    base_tok: transformers.PreTrainedTokenizerBase,
    tokenizers: Dict[ModelReference, transformers.PreTrainedTokenizerBase],
    trust_remote_code: bool = False,
    fix_mistral_regex: bool = False,
) -> transformers.PreTrainedTokenizerBase:
And update the call within it:
res = transformers.AutoTokenizer.from_pretrained(
    p,
    use_fast=True,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,
)
Also update the call to build_union_tokenizer() in build_tokenizer():
tokenizer_out = build_union_tokenizer(
    tokenizer_base,
    tokenizers,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,
)
Notes
The error occurs because get_stripped_tokenizer() doesn't have fix_mistral_regex in its parameter list, so when you try to use it in the function body, Python raises a NameError. The parameter must be added to each function in the call chain from BuildTokenizer.execute() down to the actual AutoTokenizer.from_pretrained() calls.
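The failure mode is plain Python name resolution and can be reproduced in isolation (toy functions, not mergekit code):

```python
def broken(path):
    # Uses a name that is neither a parameter nor defined in any enclosing
    # scope, so *calling* it raises NameError - as in the traceback above.
    return {"path": path, "fix_mistral_regex": fix_mistral_regex}

def fixed(path, fix_mistral_regex=False):
    # Declaring it as a parameter makes the name resolvable in the body.
    return {"path": path, "fix_mistral_regex": fix_mistral_regex}

try:
    broken("model")
except NameError as exc:
    print("NameError:", exc)
print(fixed("model", fix_mistral_regex=True))
```

Note that Python only raises the error when the function executes, not when it is defined, which is why the merge ran for a few graph steps before crashing.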
I now get this error
C:\mergekit-main>timeout /t 3 /nobreak && mergekit-yaml C:\mergekit-main\altair7.yaml C:\mergekit-main\merged_model_altair7 --copy-tokenizer --allow-crimes --out-shard-size 5B --trust-remote-code --lazy-unpickle --random-seed 420 --cuda --fix-mistral-regex
Waiting for 0 seconds, press CTRL+C to quit ...
`torch_dtype` is deprecated! Use `dtype` instead!
Warmup loader cache: 100%|████████████████████| 9/9 [00:00<00:00, 1809.80it/s]
Planning operations
Executing graph:   0%| | 0/4361 [00:00<?, ?it/s]
Loading tokenizers
Building output tokenizer
Token '<|im_start|>' added with multiple different settings, using first
Executing graph: 0%| | 1/4361 [00:04<5:27:21, 4.51s/it]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Scripts\mergekit-yaml.exe\__main__.py", line 5, in <module>
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\options.py", line 169, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\scripts\run_yaml.py", line 30, in main
    run_merge(
  File "C:\mergekit-main\mergekit\merge.py", line 85, in run_merge
    for _task, value in exec.run(quiet=options.quiet):
  File "C:\mergekit-main\mergekit\graph.py", line 747, in run
    for handle, value in self._run(quiet=quiet, desc=desc):
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\graph.py", line 712, in _run
    res = task.execute(**cpu_arguments)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\tokenizer\build.py", line 296, in execute
    return build_tokenizer(
           ^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\tokenizer\build.py", line 232, in build_tokenizer
    tokenizer_out = build_union_tokenizer(
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\tokenizer\build.py", line 158, in build_union_tokenizer
    fix_mistral_regex=fix_mistral_regex, # Add this line
    ^^^^^^^^^^^^^^^^^
NameError: name 'fix_mistral_regex' is not defined
C:\mergekit-main>
The error occurs because you haven't updated the function signature for build_union_tokenizer() to accept the fix_mistral_regex parameter. Here's the fix:
Fix for build_union_tokenizer()
Update the function signature in mergekit/tokenizer/build.py:
def build_union_tokenizer(
    base_tok: transformers.PreTrainedTokenizerBase,
    tokenizers: Dict[ModelReference, transformers.PreTrainedTokenizerBase],
    trust_remote_code: bool = False,
    fix_mistral_regex: bool = False,  # Add this parameter
) -> transformers.PreTrainedTokenizerBase:
And update the AutoTokenizer.from_pretrained() call within the same function (around line 155-157):
res = transformers.AutoTokenizer.from_pretrained(
    p,
    use_fast=True,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,  # This line already exists
)
The call to build_union_tokenizer() in build_tokenizer() (line 232) should already be passing the parameter correctly based on your previous updates.
Complete Function Update
Here's the complete updated function for build_union_tokenizer():
def build_union_tokenizer(
    base_tok: transformers.PreTrainedTokenizerBase,
    tokenizers: Dict[ModelReference, transformers.PreTrainedTokenizerBase],
    trust_remote_code: bool = False,
    fix_mistral_regex: bool = False,
) -> transformers.PreTrainedTokenizerBase:
    # ... rest of function remains the same ...
After this change, your merge should proceed without the NameError.
Notes
The error pattern shows that fix_mistral_regex needs to be passed through the entire call chain from the CLI options down to each AutoTokenizer.from_pretrained() call. You've successfully updated most of the chain, but this function signature was the missing piece.
Based on the logs, the Hugging Face issue linked in the warnings, and the author's notes, here is an explanation of what fix_mistral_regex=True does and why it solves the merge issues.
What the fix actually does (Technically)
Mistral models (specifically those using their newer "Tekken" tokenizer, like Mistral Nemo 12B and Mistral Small 3) rely on a specific regular expression (regex) pattern to pre-tokenize text, i.e., to split sentences into chunks before assigning token IDs.
Recently, a bug was discovered where the regex pattern saved inside the tokenizer_config.json of these models was slightly incorrect. This caused the tokenizer to misinterpret certain character boundaries, spaces, and special tokens.
When you pass fix_mistral_regex=True to AutoTokenizer.from_pretrained() in the transformers library, it tells the library to ignore the broken regex pattern in the model's files and dynamically replace it with the corrected pattern.
In the context of mergekit, if this fix isn't applied, mergekit tries to build a "union" tokenizer out of 9 different models using a broken set of rules. This results in a corrupted vocabulary mapping where special tokens get misaligned.
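The effect of a pre-tokenization regex on chunk boundaries can be shown with deliberately simplified patterns (illustrative only; neither is Mistral's actual Tekken pattern):

```python
import re

# Two slightly different pre-tokenization regexes split the same text into
# different chunks, so the downstream token IDs no longer line up.
text = "Hello world! <|im_end|>"
pattern_a = r"<\|im_end\|>|\w+|[^\w\s]"  # treats <|im_end|> as one chunk
pattern_b = r"\w+|[^\w\s]"               # shatters it into fragments

print(re.findall(pattern_a, text))  # ['Hello', 'world', '!', '<|im_end|>']
print(re.findall(pattern_b, text))  # ['Hello', 'world', '!', '<', '|', 'im_end', '|', '>']
```

Merging vocabularies produced under two such disagreeing split rules is exactly the misalignment the warning is about.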
Why it fixes the issues (According to the Author)
According to the author's notes, applying this fix during the merge process resolved several severe, model-breaking behaviors:
1. It stops the model from hallucinating <|im_end|> as plain text
Without the fix, the broken regex was likely causing the tokenizer to fail at recognizing <|im_end|> as a single, distinct "Special Token." Instead, the model was treating it as regular text. Because the text generation interface didn't see the actual End of Sequence ID, it just printed the literal characters <|im_end|> to the screen and kept generating. Fixing the regex ensures the union tokenizer correctly isolates and maps these ChatML special tokens.
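A toy stop-check makes the mechanism concrete (hypothetical IDs and frontend; real inference servers compare emitted IDs against the tokenizer's eos_token_id in the same way):

```python
# Generation stops only when the frontend sees the EOS token *ID*,
# never by scanning the decoded text for the literal string.
EOS_ID = 2

def render(stream, id_to_text):
    out = []
    for tok_id in stream:
        if tok_id == EOS_ID:  # recognized as a special token -> stop cleanly
            break
        out.append(id_to_text[tok_id])
    return "".join(out)

# When <|im_end|> is a single special token with ID 2, generation stops:
good = render([10, 11, 2, 12], {10: "Hi", 11: "!", 12: "junk"})
# When a broken regex shatters it into ordinary text tokens, the literal
# characters are printed and generation runs on:
bad = render([10, 11, 20, 21, 22, 12],
             {10: "Hi", 11: "!", 20: "<|", 21: "im_end", 22: "|>", 12: "junk"})
print(good)  # Hi!
print(bad)   # Hi!<|im_end|>junk
```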
2. It fixes endless repetition and early termination
Because the token mappings were corrupted by the bad regex, the merged model's language head was firing the wrong token IDs. It would either output an EOS (End of Sequence) token immediately (terminating early) or completely lose the ability to output an EOS token (repeating endlessly). The fix restores the correct token IDs for stopping generation.
3. It resolves "Tokenizer Incompatibility" across different fine-tunes
The author merged 9 different models. Some of these models (like Mistral-Nemo-Instruct-2407) use the default Mistral [INST] format, while others (like Tlacuilo-12B) were fine-tuned to use ChatML (<|im_start|>).
Normally, merging models with different prompt formats and added tokens causes severe brain damage to the model. By fixing the underlying regex, mergekit was able to accurately map the overlapping base tokens and cleanly inject the forced ChatML tokens (as defined in the YAML config), creating a stable hybrid tokenizer.
The Author's Final Result
By applying this fix, the author successfully created a 9-model Karcher merge that:
- Works perfectly with ChatML: It follows instructions, doesn't loop, and stops when it is supposed to.
- Accepts Jailbreaks: The control vectors from the "heretic/noslop" models successfully carried over.
- Has one remaining quirk: The author notes that if you try to use the Mistral Tekken prompt format instead of ChatML, the model terminates early and refuses prompts. This makes sense: the YAML config explicitly forced the ChatML tokens to overwrite the Mistral ones, meaning the model's brain is now permanently wired to expect ChatML.