```
architecture: MistralForCausalLM
base_model: B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop
models:
  - model: B:/12B/models--allura-org--Tlacuilo-12B
  - model: B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop
  - model: B:/12B/models--EldritchLabs--Altair-Stock-12B-v1/MPOA_alt
  - model: B:/12B/models--inflatebot--MN-12B-Mag-Mell-R1
  - model: B:/12B/models--MuXodious--Irix-12B-Model_Stock-absolute-heresy
  - model: B:/12B/models--aixonlab--Aether-12b
  - model: B:/12B/models--SicariusSicariiStuff--Impish_Bloodmoon_12B
  - model: B:/12B/models--SicariusSicariiStuff--Sweet_Dreams_12B
  - model: B:/12B/models--Epiculous--Azure_Dusk-v0.2
  # - model: B:/12B/models--SicariusSicariiStuff--Impish_Bloodmoon_12B.backup
merge_method: karcher
parameters:
  max_iter: 30 # Maximum iterations for Karcher mean algorithm (default 10)
  tol: 1e-9 # Convergence tolerance for Karcher mean algorithm (default 1e-5)
dtype: float32
out_dtype: bfloat16
tokenizer:
  source: "union"
  tokens:
    # Force ChatML tokens
    "<|im_start|>":
      source: "B:/12B/models--allura-org--Tlacuilo-12B"
      force: true
    "<|im_end|>":
      source: "B:/12B/models--allura-org--Tlacuilo-12B"
      force: true
    # Keep Mistral tokens
    "[INST]":
      source: "B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop"
      # source: "B:/12B/models--mistralai--Mistral-Nemo-Instruct-2407"
      # The tokenizer system requires all models referenced in token configurations
      # to be present in the merge's model list to build proper embedding permutations.
    "[/INST]":
      source: "B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop"
    # Force </s> as fallback EOS
    "</s>":
      source: "B:/12B/models--p-e-w--Mistral-Nemo-Instruct-2407-heretic-noslop"
      force: true
chat_template: "chatml"
name: 🧜 Nocturne-Nereid-12B-v1
```

```
C:\mergekit-main>timeout /t 3 /nobreak && mergekit-yaml C:\mergekit-main\altair7.yaml C:\mergekit-main\merged_model_altair7 --copy-tokenizer --allow-crimes --out-shard-size 5B --trust-remote-code --lazy-unpickle --random-seed 420 --cuda
Waiting for 0 seconds, press CTRL+C to quit ...
`torch_dtype` is deprecated! Use `dtype` instead!
Warmup loader cache: 100%|████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 1499.99it/s]
Planning operations
Executing graph:   0%|                                                                 | 0/4361 [00:00<?, ?it/s]
Token '</s>' added with multiple different settings, using first
Building permutations
Building tokenizer permutations: 100%|████████████████████████████████████████████████| 9/9 [00:00<00:00, 9.54it/s]
Executing graph:   0%|                                                                 | 3/4361 [00:07<2:44:37, 2.27s/it]
```

---

Based on the codebase, mergekit doesn't currently support passing `fix_mistral_regex=True` to tokenizer loading. The warnings you're seeing come from the transformers library, but mergekit only passes `trust_remote_code` to tokenizer loading calls, not the `fix_mistral_regex` parameter.

## Current Tokenizer Loading in MergeKit

The main tokenizer loading happens in `mergekit/tokenizer/build.py`:

```python
# In get_stripped_tokenizer()
tokenizer = transformers.AutoTokenizer.from_pretrained(
    path.path,
    revision=path.revision,
    trust_remote_code=trust_remote_code,
    use_fast=True,
)
```

And in `build_tokenizer()`:

```python
model_tok = transformers.AutoTokenizer.from_pretrained(
    model.model.path,
    revision=model.model.revision,
    trust_remote_code=trust_remote_code,
)
```

## How to Add Support

To add `fix_mistral_regex=True` support, you would need to:

1. **Add the parameter to MergeOptions** - This would require modifying the options structure to include a new field for `fix_mistral_regex`
2. **Update all tokenizer loading calls** - Modify the `AutoTokenizer.from_pretrained()` calls throughout the codebase to pass this parameter:
   - `mergekit/tokenizer/build.py` (lines 49-54, 155-157, 203-207)
   - `mergekit/merge.py` (lines 271-275)
   - `mergekit/scripts/tokensurgeon.py` (lines 148-152)
3. **Add CLI flag** - Add a `--fix-mistral-regex` flag to the CLI that sets this option (a sketch of this pattern follows the notes below)

## Temporary Workaround

Until mergekit adds native support, you could:

1. **Modify the source locally** - Add `fix_mistral_regex=True` directly to the `AutoTokenizer.from_pretrained()` calls in your local mergekit installation
2. **Use a different tokenizer source** - If one of your models has a correct tokenizer, you could specify it as the tokenizer source in your YAML config

The warnings won't prevent your merge from completing, but as noted, they may lead to incorrect tokenization behavior with the affected models.

## Notes

The `trust_remote_code` parameter you're already using via `--trust-remote-code` follows the same pattern through the codebase - it's passed from CLI options to the tokenizer loading functions. Adding `fix_mistral_regex` support would require implementing the same pattern.
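As a sketch of what step 3 involves: mergekit appears to generate its CLI flags directly from the `MergeOptions` fields - which is why `OPTION_HELP` and `OPTION_CATEGORIES` are plain dictionaries keyed by field name - so a boolean field surfaces as a `--flag/--no-flag` pair without a hand-written `click.option`. The snippet below is a simplified, self-contained illustration of that pattern, not mergekit's actual `add_merge_options` code:

```python
# Simplified illustration of options-to-CLI generation (not mergekit's exact code).
import click
from pydantic import BaseModel

class MergeOptions(BaseModel):
    trust_remote_code: bool = False
    fix_mistral_regex: bool = False  # the new field

OPTION_HELP = {
    "trust_remote_code": "Trust remote code when loading models/tokenizers",
    "fix_mistral_regex": "Fix incorrect regex pattern in Mistral tokenizers",
}

def add_merge_options(f):
    # Attach a --flag/--no-flag click option for every boolean field.
    for name, field in MergeOptions.model_fields.items():
        flag = name.replace("_", "-")
        f = click.option(
            f"--{flag}/--no-{flag}",
            name,
            default=field.default,
            help=OPTION_HELP.get(name, ""),
        )(f)
    return f

@click.command()
@add_merge_options
def main(**kwargs):
    options = MergeOptions(**kwargs)
    click.echo(options)  # options.fix_mistral_regex is now available downstream

if __name__ == "__main__":
    main()
```

If mergekit's CLI really is generated this way, adding the field plus its help/category entries is most of what step 3 requires; if the flags are declared by hand instead, a matching `click.option` would need to be added alongside the existing ones.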
---

> Explain how to enable this for merges on my local copy, step by step.

---

To enable `fix_mistral_regex=True` for your local mergekit installation, you'll need to modify several files to add this parameter throughout the tokenizer loading pipeline.

## Step-by-Step Local Modifications

### 1. Add the option to MergeOptions

Edit `mergekit/options.py` and add the new field:

```python
# In MergeOptions class (around line 41)
fix_mistral_regex: bool = False
```

Also add it to the `OPTION_HELP` and `OPTION_CATEGORIES` dictionaries:

```python
# Around line 114
OPTION_HELP = {
    # ... existing options ...
    "fix_mistral_regex": "Fix incorrect regex pattern in Mistral tokenizers",
}

# Around line 140
OPTION_CATEGORIES = {
    # ... existing categories ...
    "fix_mistral_regex": "Dangerous Options",
}
```

### 2. Update tokenizer loading in mergekit/tokenizer/build.py

Modify all `AutoTokenizer.from_pretrained()` calls to include the new parameter:

```python
# Lines 49-54 in get_stripped_tokenizer()
tokenizer = transformers.AutoTokenizer.from_pretrained(
    path.path,
    revision=path.revision,
    trust_remote_code=trust_remote_code,
    use_fast=True,
    fix_mistral_regex=fix_mistral_regex,  # Add this line
)

# Lines 203-207 in build_tokenizer()
model_tok = transformers.AutoTokenizer.from_pretrained(
    model.model.path,
    revision=model.model.revision,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,  # Add this line
)

# Lines 155-157 in build_union_tokenizer()
res = transformers.AutoTokenizer.from_pretrained(
    p,
    use_fast=True,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,  # Add this line
)
```

You'll also need to update the function signatures to accept this parameter and pass it through the call chain.

### 3. Update mergekit/merge.py

Modify the tokenizer loading in `_copy_tokenizer()`:

```python
# Lines 271-275
tokenizer = transformers.AutoTokenizer.from_pretrained(
    donor_model.model.path,
    revision=donor_model.model.revision,
    trust_remote_code=options.trust_remote_code,
    fix_mistral_regex=options.fix_mistral_regex,  # Add this line
)
```

### 4. Update mergekit/scripts/tokensurgeon.py

Modify the tokenizer loading:

```python
# Lines 148-152 in get_stuff()
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model.model.path,
    revision=model.model.revision,
    trust_remote_code=options.trust_remote_code,
    fix_mistral_regex=options.fix_mistral_regex,  # Add this line
)
```

### 5. Update mergekit/scripts/moe.py

Modify the tokenizer loading:

```python
# Lines 36-38
tokenizer = transformers.AutoTokenizer.from_pretrained(
    base_model.model.path,
    revision=base_model.model.revision,
    fix_mistral_regex=merge_options.fix_mistral_regex,  # Add this line
)
```

### 6. Update function signatures

You'll need to update the function signatures to pass the `fix_mistral_regex` parameter through the call chain. The key functions to update are:

- `get_stripped_tokenizer()` in `mergekit/tokenizer/build.py`
- `build_union_tokenizer()` in `mergekit/tokenizer/build.py`
- `build_tokenizer()` in `mergekit/tokenizer/build.py`
- `BuildTokenizer.execute()` in `mergekit/tokenizer/build.py`
### 7. Use the new flag

After making these changes, you can use the new flag in your command:

```bash
mergekit-yaml C:\mergekit-main\altair7.yaml C:\mergekit-main\merged_model_altair7 --copy-tokenizer --allow-crimes --out-shard-size 5B --trust-remote-code --lazy-unpickle --random-seed 420 --cuda --fix-mistral-regex
```

## Notes

The `trust_remote_code` parameter follows the same pattern through the codebase - it's passed from CLI options to the tokenizer loading functions [1](#0-0) [2](#0-1). Adding `fix_mistral_regex` support requires implementing the same pattern throughout all tokenizer loading locations [3](#0-2) [4](#0-3).

The warnings you're seeing come from the transformers library and indicate that the tokenizer regex pattern may lead to incorrect tokenization [5](#0-4). While the merge will complete without this fix, enabling it will ensure proper tokenization behavior.

---

> I did all of that and now it says this

```
C:\mergekit-main>timeout /t 3 /nobreak && mergekit-yaml C:\mergekit-main\altair7.yaml C:\mergekit-main\merged_model_altair7 --copy-tokenizer --allow-crimes --out-shard-size 5B --trust-remote-code --lazy-unpickle --random-seed 420 --cuda --fix-mistral-regex
Waiting for 0 seconds, press CTRL+C to quit ...
`torch_dtype` is deprecated! Use `dtype` instead!
Warmup loader cache: 100%|████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 1499.99it/s]
Planning operations
Executing graph:   0%|                                                                 | 1/4361 [00:00<14:53, 4.88it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Scripts\mergekit-yaml.exe\__main__.py", line 5, in <module>
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\options.py", line 169, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\scripts\run_yaml.py", line 30, in main
    run_merge(
  File "C:\mergekit-main\mergekit\merge.py", line 85, in run_merge
    for _task, value in exec.run(quiet=options.quiet):
  File "C:\mergekit-main\mergekit\graph.py", line 747, in run
    for handle, value in self._run(quiet=quiet, desc=desc):
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\graph.py", line 712, in _run
    res = task.execute(**cpu_arguments)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
"C:\mergekit-main\mergekit\tokenizer\build.py", line 294, in execute return build_tokenizer( ^^^^^^^^^^^^^^^^ File "C:\mergekit-main\mergekit\tokenizer\build.py", line 193, in build_tokenizer tokenizer_base = get_stripped_tokenizer( ^^^^^^^^^^^^^^^^^^^^^^^ File "C:\mergekit-main\mergekit\tokenizer\build.py", line 54, in get_stripped_tokenizer fix_mistral_regex=fix_mistral_regex, # Add this line ^^^^^^^^^^^^^^^^^ NameError: name 'fix_mistral_regex' is not defined C:\mergekit-main> ``` --- The error occurs because you added `fix_mistral_regex` to the `AutoTokenizer.from_pretrained()` call in `get_stripped_tokenizer()` but didn't update the function signature to accept this parameter. The parameter needs to be passed through the entire call chain. ## Fix the function signatures ### 1. Update `get_stripped_tokenizer()` in mergekit/tokenizer/build.py ```python def get_stripped_tokenizer( path: ModelPath, trust_remote_code: bool = False, fix_mistral_regex: bool = False ) -> transformers.PreTrainedTokenizerFast: ``` ### 2. Update `build_tokenizer()` in mergekit/tokenizer/build.py ```python def build_tokenizer( base_model: Optional[ModelReference], referenced_models: List[ModelReference], tokenizer_source: Union[Literal["union"], Literal["base"], ModelReference], trust_remote_code: bool, fix_mistral_regex: bool = False, add_tokens: Optional[List[str]] = None, ) -> TokenizerInfo: ``` ### 3. Update `BuildTokenizer.execute()` in mergekit/tokenizer/build.py ```python class BuildTokenizer(Task[TokenizerInfo]): base_model: Optional[ModelReference] referenced_models: Tuple[ModelReference, ...] tokenizer_source: Union[Literal["union"], Literal["base"], ModelReference] add_tokens: Optional[Tuple[str, ...]] trust_remote_code: bool = False fix_mistral_regex: bool = False def execute(self, **_kwargs) -> TokenizerInfo: return build_tokenizer( base_model=self.base_model, referenced_models=self.referenced_models, tokenizer_source=self.tokenizer_source, trust_remote_code=self.trust_remote_code, fix_mistral_regex=self.fix_mistral_regex, add_tokens=self.add_tokens, ) ``` ### 4. Update the call in `build_tokenizer()` ```python tokenizer_base = get_stripped_tokenizer( base_model.model, trust_remote_code=trust_remote_code, fix_mistral_regex=fix_mistral_regex ) ``` ### 5. Update the call in `MergePlanner.__init__()` in mergekit/plan.py ```python self._tokenizer_task = BuildTokenizer( base_model=config.base_model, referenced_models=tuple(config.referenced_models()), tokenizer_source=tokenizer_source, trust_remote_code=options.trust_remote_code, fix_mistral_regex=options.fix_mistral_regex, add_tokens=tuple(token_cfg.keys()), ) ``` ### 6. 
### 6. Update `build_union_tokenizer()` in mergekit/tokenizer/build.py

```python
def build_union_tokenizer(
    base_tok: transformers.PreTrainedTokenizerBase,
    tokenizers: Dict[ModelReference, transformers.PreTrainedTokenizerBase],
    trust_remote_code: bool = False,
    fix_mistral_regex: bool = False,
) -> transformers.PreTrainedTokenizerBase:
```

And update the call within it:

```python
res = transformers.AutoTokenizer.from_pretrained(
    p,
    use_fast=True,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,
)
```

Also update the call to `build_union_tokenizer()` in `build_tokenizer()`:

```python
tokenizer_out = build_union_tokenizer(
    tokenizer_base,
    tokenizers,
    trust_remote_code=trust_remote_code,
    fix_mistral_regex=fix_mistral_regex,
)
```

## Notes

The error occurs because `get_stripped_tokenizer()` doesn't have `fix_mistral_regex` in its parameter list, so when you try to use it in the function body, Python raises a `NameError` [1](#0-0). The parameter must be added to each function in the call chain from `BuildTokenizer.execute()` down to the actual `AutoTokenizer.from_pretrained()` calls [2](#0-1).

---

> I now get this error

```
C:\mergekit-main>timeout /t 3 /nobreak && mergekit-yaml C:\mergekit-main\altair7.yaml C:\mergekit-main\merged_model_altair7 --copy-tokenizer --allow-crimes --out-shard-size 5B --trust-remote-code --lazy-unpickle --random-seed 420 --cuda --fix-mistral-regex
Waiting for 0 seconds, press CTRL+C to quit ...
`torch_dtype` is deprecated! Use `dtype` instead!
Warmup loader cache: 100%|████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 1809.80it/s]
Planning operations
Executing graph:   0%|                                                                 | 0/4361 [00:00<?, ?it/s]
Token '</s>' added with multiple different settings, using first
Executing graph:   0%|                                                                 | 1/4361 [00:04<5:27:21, 4.51s/it]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Scripts\mergekit-yaml.exe\__main__.py", line 5, in <module>
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\options.py", line 169, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\mergekit-main\mergekit\scripts\run_yaml.py", line 30, in main
    run_merge(
  File "C:\mergekit-main\mergekit\merge.py", line 85, in run_merge
    for _task, value in exec.run(quiet=options.quiet):
"C:\mergekit-main\mergekit\graph.py", line 747, in run for handle, value in self._run(quiet=quiet, desc=desc): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\mergekit-main\mergekit\graph.py", line 712, in _run res = task.execute(**cpu_arguments) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\mergekit-main\mergekit\tokenizer\build.py", line 296, in execute return build_tokenizer( ^^^^^^^^^^^^^^^^ File "C:\mergekit-main\mergekit\tokenizer\build.py", line 232, in build_tokenizer tokenizer_out = build_union_tokenizer( ^^^^^^^^^^^^^^^^^^^^^^ File "C:\mergekit-main\mergekit\tokenizer\build.py", line 158, in build_union_tokenizer fix_mistral_regex=fix_mistral_regex, # Add this line ^^^^^^^^^^^^^^^^^ NameError: name 'fix_mistral_regex' is not defined C:\mergekit-main> ``` --- The error occurs because you haven't updated the function signature for `build_union_tokenizer()` to accept the `fix_mistral_regex` parameter. Here's the fix: ## Fix for `build_union_tokenizer()` Update the function signature in `mergekit/tokenizer/build.py`: ```python def build_union_tokenizer( base_tok: transformers.PreTrainedTokenizerBase, tokenizers: Dict[ModelReference, transformers.PreTrainedTokenizerBase], trust_remote_code: bool = False, fix_mistral_regex: bool = False, # Add this parameter ) -> transformers.PreTrainedTokenizerBase: ``` And update the `AutoTokenizer.from_pretrained()` call within the same function (around line 155-157): ```python res = transformers.AutoTokenizer.from_pretrained( p, use_fast=True, trust_remote_code=trust_remote_code, fix_mistral_regex=fix_mistral_regex # This line already exists ) ``` The call to `build_union_tokenizer()` in `build_tokenizer()` (line 232) should already be passing the parameter correctly based on your previous updates. ## Complete Function Update Here's the complete updated function for `build_union_tokenizer()`: ```python def build_union_tokenizer( base_tok: transformers.PreTrainedTokenizerBase, tokenizers: Dict[ModelReference, transformers.PreTrainedTokenizerBase], trust_remote_code: bool = False, fix_mistral_regex: bool = False, ) -> transformers.PreTrainedTokenizerBase: # ... rest of function remains the same ... ``` After this change, your merge should proceed without the `NameError` [1](#0-0) . ## Notes The error pattern shows that `fix_mistral_regex` needs to be passed through the entire call chain from the CLI options down to each `AutoTokenizer.from_pretrained()` call [2](#0-1) . You've successfully updated most of the chain, but this function signature was the missing piece. --- **Analyzing the Fix** I'm currently focused on dissecting the `fix_mistral_regex=True` parameter within `mergekit`. My primary goal is to determine the precise function of this fix in the context of merging Mistral models. I need to understand its impact and how it addresses a potential issue or enhances the merging process. **Investigating the Root** I'm now diving into the 'why' behind the regex fix, focusing on the author's explanation. The provided text details issues with Mistral tokenizers and chat templates, particularly around premature termination, endless repetition, and hallucinated end-of-sequence tokens. The fix seems to address these problems in a specific Karcher merge, notably resolving issues with chat templates and jailbreaks, while acknowledging some template-specific behavior. **Clarifying the Bugfix** I'm now zeroing in on the technical implementation. The core of `fix_mistral_regex=True` is overriding a faulty regex pattern in the tokenizer config. 
---

Based on the logs, the Hugging Face issue linked in the warnings, and the author's notes, here is an explanation of what `fix_mistral_regex=True` does and why it solves the merge issues.

### What the fix actually does (Technically)

Mistral models (specifically those using their newer "Tekken" tokenizer, like Mistral Nemo 12B and Mistral Small 3) rely on a specific regular expression (regex) pattern to pre-tokenize text - that is, to split sentences into chunks before assigning token IDs.

Recently, a bug was discovered where the regex pattern saved inside the `tokenizer_config.json` of these models was slightly incorrect. This caused the tokenizer to misinterpret certain character boundaries, spaces, and special tokens.

When you pass `fix_mistral_regex=True` to `AutoTokenizer.from_pretrained()` in the `transformers` library, it tells the library to **ignore the broken regex pattern in the model's files and dynamically replace it with the corrected pattern**.

In the context of `mergekit`, if this fix isn't applied, `mergekit` tries to build a "union" tokenizer out of 9 different models using a broken set of rules. This results in a corrupted vocabulary mapping where special tokens get misaligned.
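To make "pre-tokenize" concrete, here is a toy illustration. The two patterns below are invented for demonstration - they are not Mistral's real regex - but they show how a slightly different split rule changes chunk boundaries and can shatter a special token into ordinary punctuation pieces:

```python
# Toy demonstration of pre-tokenization splitting (patterns invented for illustration).
import re

text = "Hello there<|im_end|>"

# A rule that knows the special token keeps it as one chunk...
correct = r"<\|im_end\|>|\w+|\s+|[^\w\s]+"
# ...while a rule that doesn't shatters it into punctuation fragments.
broken = r"\w+|\s+|[^\w\s]+"

print(re.findall(correct, text))  # ['Hello', ' ', 'there', '<|im_end|>']
print(re.findall(broken, text))   # ['Hello', ' ', 'there', '<|', 'im_end', '|>']
```

Everything downstream - ID assignment, special-token matching, EOS detection - inherits whichever segmentation the splitting stage produces.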
### Why it fixes the issues (According to the Author)

According to the author's notes, applying this fix during the merge process resolved several severe, model-breaking behaviors:

**1. It stops the model from hallucinating `<|im_end|>` as plain text**

Without the fix, the broken regex was likely causing the tokenizer to fail at recognizing `<|im_end|>` as a single, distinct special token. Instead, the model was treating it as regular text. Because the text generation interface didn't see the actual end-of-sequence ID, it just printed the literal characters `<|im_end|>` to the screen and kept generating. Fixing the regex ensures the union tokenizer correctly isolates and maps these ChatML special tokens.

**2. It fixes endless repetition and early termination**

Because the token mappings were corrupted by the bad regex, the merged model's language head was firing the wrong token IDs. It would either output an EOS (end-of-sequence) token immediately (terminating early) or completely lose the ability to output an EOS token (repeating endlessly). The fix restores the correct token IDs for stopping generation.

**3. It resolves tokenizer incompatibility across different fine-tunes**

The author merged 9 different models. Some of these models (like `Mistral-Nemo-Instruct-2407`) use the default Mistral `[INST]` format, while others (like `Tlacuilo-12B`) were fine-tuned to use ChatML (`<|im_start|>`). Normally, merging models with different prompt formats and added tokens severely damages the result. By fixing the underlying regex, `mergekit` was able to accurately map the overlapping base tokens and cleanly inject the forced ChatML tokens (as defined in the YAML config), creating a stable hybrid tokenizer.

### The Author's Final Result

By applying this fix, the author successfully created a 9-model Karcher merge that:

* **Works perfectly with ChatML:** It follows instructions, doesn't loop, and stops when it is supposed to (the quick check below covers the tokenizer side of this).
* **Accepts jailbreaks:** The decensored behavior of the "heretic/noslop" models successfully carried over.
* **Has one remaining quirk:** The author notes that if you try to use the *Mistral Tekken* prompt format instead of *ChatML*, the model terminates early and refuses prompts. This makes sense: the YAML config explicitly forced the ChatML tokens to overwrite the Mistral ones, so the merged model is now permanently wired to expect ChatML.
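A quick way to check the tokenizer side of these results on the finished merge (the path matches the output directory from the command above; adjust to your setup):

```python
# Post-merge sanity check with plain transformers.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("merged_model_altair7")

# Healthy merge: <|im_end|> encodes to exactly one ID, not a run of fragments.
ids = tok.encode("<|im_end|>", add_special_tokens=False)
print(ids, tok.convert_ids_to_tokens(ids))

# The chat_template: "chatml" from the YAML should wrap turns in <|im_start|>/<|im_end|>.
messages = [{"role": "user", "content": "Hello"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```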