model_tools / Mergekit-Robustness-Patch-embed_v2.md

Mergekit Robustness Patch: embed.py (v2)

Two variants are attached: one for Mistral Nemo 12B (v2d) and one for Mistral Small 24B (v2a).

Overview

This patch provides a high-resilience version of Mergekit’s tokenizer/embed.py. It is specifically designed to handle "dirty" model merges where a donor model’s tokenizer.json and its physical model.safetensors weights are out of sync—a common issue when merging models that have different vocabulary sizes (e.g., mixing Mistral Tekken with ChatML or Llama 3).

The Problem: "Ghost Tokens"

In standard Mergekit's embed.py, the engine assumes that if a token exists in a model's vocabulary, a corresponding row must exist in its embedding weights.

However, in many community-made merges:

  1. A model's tokenizer might define 131,081 tokens (indices 0–131,080).
  2. But its weight matrix (embed_tokens) contains only 131,072 rows (valid indices 0–131,071).
  3. Standard Result: Mergekit attempts to read index 131,072 or higher, hits a boundary error, and the entire merge crashes with: IndexError: index X is out of bounds for dimension 0 with size Y.
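
The failure mode can be reproduced in miniature (a sketch using a plain Python list in place of the torch embedding tensor; the error message differs from torch's, but the out-of-bounds read is the same):

```python
# Miniature repro of the out-of-sync case: a "tokenizer" that knows
# about more tokens than the weight matrix has rows.
embedding_rows = [[0.0] * 4 for _ in range(8)]  # matrix with 8 rows (indices 0-7)
requested_index = 9                             # tokenizer thinks this token exists

try:
    row = embedding_rows[requested_index]       # out-of-bounds read
except IndexError as err:
    print(f"merge would crash here: {err}")
```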

The Solution: v2d Robustness & Audit

The v2d patch introduces Bounds-Aware Permutation. Instead of blindly trusting the tokenizer, it verifies the physical existence of every token row before attempting to merge it.

Key Features:

  • Crash Prevention: Automatically detects if a donor model is "too small" for the requested token index. Instead of crashing, it gracefully skips that donor for that specific token.
  • Live Vocab Audit: Prints detailed warnings to the console identifying exactly which model is missing which token. This allows you to identify "buggy" donors in your config without trial-and-error.
  • Intelligent Fallback:
    • If a token is missing from one donor but present in others, it averages the token using only the valid donors.
    • If a token is missing from its primary "default" source, it falls back to a zero-vector rather than terminating the merge.
  • Result Mapping Safety: Ensures that the final output tensors for every donor are correctly aligned, even if the donor was physically smaller than the target union vocabulary.
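
The core of these behaviors can be sketched in a few lines (a simplified stand-in for the real patch below, using plain Python lists instead of torch tensors; the permutation maps union-vocab token IDs to donor row indices, with -1 meaning "not present"):

```python
def average_token(token_id, donors):
    """Average one token's embedding across donors, skipping any donor
    whose weight matrix is physically too small for the requested row.

    `donors` is a list of (permutation, rows) pairs: `permutation` maps
    union-vocab IDs to donor row indices (-1 = token not in donor) and
    `rows` is the donor's embedding matrix as a list of row vectors.
    """
    dim = len(donors[0][1][0])
    total, count = [0.0] * dim, 0
    for permutation, rows in donors:
        idx = permutation[token_id]
        if idx < 0:
            continue                 # token not in this donor's vocab
        if idx >= len(rows):
            continue                 # "ghost token": in JSON, not in tensor
        total = [t + r for t, r in zip(total, rows[idx])]
        count += 1
    if count == 0:
        return [0.0] * dim           # zero-vector fallback
    return [t / count for t in total]

donor_full = ([0, 1], [[1.0, 1.0], [3.0, 3.0]])  # perm, 2-row matrix
donor_trunc = ([0, 1], [[3.0, 3.0]])             # claims token 1, but has 1 row
print(average_token(0, [donor_full, donor_trunc]))  # [2.0, 2.0] - both contribute
print(average_token(1, [donor_full, donor_trunc]))  # [3.0, 3.0] - truncated donor skipped
```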

Comparison: v1 vs. v2d

| Feature | embed.py (Default) | embed_v2d.py (Ours) |
|---|---|---|
| Mismatched Vocab | Crashes with IndexError. | Succeeds via graceful skipping. |
| Error Reporting | Generic Python traceback. | Detailed [VOCAB AUDIT] log with model path and token name. |
| Special Token Support | Requires perfectly synced weights. | Handles "Ghost Tokens" (tokens in JSON but not in tensors). |
| Mathematical Integrity | N/A (process stops). | Maintains correct averaging by adjusting the donor count dynamically. |
| Use Case | Clean, base-model merges. | Complex merges of merges; cross-architecture vocab unions. |

How to Use

Replace your existing mergekit/tokenizer/embed.py with the embed_v2d.py code.
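
To find where your installed copy lives before overwriting it, you can ask Python for the module's source path (a small helper, assuming mergekit is installed in the active environment):

```python
import importlib.util


def module_file(name: str):
    """Return the source file backing an importable module, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        return None
    return spec.origin if spec and spec.origin else None


# Prints the path of the file to replace, e.g. .../mergekit/tokenizer/embed.py
# (prints None if mergekit is not installed in this environment).
print(module_file("mergekit.tokenizer.embed"))
```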

Example Audit Log

When running a merge with mismatched models, you will now see helpful diagnostic output instead of a crash:

[VOCAB AUDIT] Model 'B:\12B\SLERP15' is missing token '<|im_start|>' (ID: 131073). 
Donor size: 131072, Requested Index: 131073. Skipping.

[VOCAB AUDIT] Default source model 'B:\12B\SLERP13' is missing token '<SPECIAL_4>' 
from its physical tensor. Falling back to zero.

Why this matters for 12B/24B Merges

When merging models like Mistral-Nemo (12B) or Mistral-Small (24B), different fine-tunes often add different special tokens (ChatML, Tool-use, etc.). If you use tokenizer_source: union, Mergekit tries to create a "Super-Vocab."

Standard Mergekit is too fragile for this process if even one model in your list has a slightly truncated embedding matrix. v2d makes the merging process "production-grade" by allowing the merge to complete regardless of minor inconsistencies in the donor models.

This patch is safe and beneficial for any model architecture (12B, 24B, 70B, etc.) using tokenizer_source: union.

Here is the breakdown of how it affects other scenarios like 24B Mistral (Tekken):

1. It prevents "Ghost Token" crashes

In many Mistral-based merges (especially Tekken), developers sometimes add special tokens to the tokenizer.json but forget to resize the embedding layer in the model.safetensors.

  • Without this patch: Mergekit sees the token in the config, calculates a high index for it, tries to read it from the tensor, and crashes.
  • With this patch: Mergekit sees the mismatch, logs a warning, and uses a zero-vector or an average from other models instead. The merge finishes successfully.

2. Handling "Tekken" Vocab Discrepancies

Mistral tokenizers usually have a vocab size of 32,768 (the older SentencePiece tokenizer) or 131,072 (Tekken). If you merge a model with 131,072 rows against one that was accidentally truncated to 131,070:

  • The patch ensures that for those last 2 tokens, the "truncated" model simply doesn't contribute to the average.
  • The resulting model will have the full 131072 vocab, and those 2 tokens will be populated by the weights from the model that actually had them.
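
In miniature (toy 4-row vs. 2-row matrices standing in for the 131,072/131,070 case), the patched averaging behaves like this sketch:

```python
# Donor A has the full vocab; donor B was accidentally truncated by 2 rows.
full = [[1.0], [1.0], [5.0], [7.0]]  # 4 tokens
truncated = [[3.0], [3.0]]           # last 2 rows missing

merged = []
for token_id in range(len(full)):
    # Only donors that physically contain this row contribute to the average.
    contributions = [m[token_id] for m in (full, truncated) if token_id < len(m)]
    merged.append([sum(v) / len(contributions) for v in zip(*contributions)])

print(merged)  # [[2.0], [2.0], [5.0], [7.0]]
```

The last two tokens come solely from the full-size donor, exactly as described above.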

3. No Negative Impact on "Clean" Models

If you merge two models where the vocab_size in config.json perfectly matches the number of rows in model.safetensors, this code does nothing. The if condition (p[token_id] >= tensors[model].shape[0]) will always be false, and the code will run at full speed with no warnings.

4. Why this is better than the "Padding" patch

The previous attempt to pad tensors in generalized_task_arithmetic.py was specific to one merge method. This embed.py patch works at the tokenizer level. Whether you are doing a linear, slerp, ties, or della merge, this patch ensures that the "Input Tensors" are standardized correctly before the math even starts.

Summary of behavior for 24B/Tekken:

| Scenario | Result with Patch |
|---|---|
| Vocabs match exactly | Normal merge, no warnings. |
| One model has extra Tekken tokens | Merge completes; missing tokens are averaged from models that have them. |
| Tokenizer says 131,072, but tensor is 131,070 | Merge completes instead of crashing. |
| Mixing Tekken and Llama 3 vocab | Merge completes; shared tokens are averaged, unique tokens are preserved from their respective sources. |

Conclusion: This is a "Robustness Patch." It makes Mergekit more resilient to poorly-configured donor models (where the tokenizer and the weights are out of sync), which is very common in the community-made merges you are working with.


Addendum

This is a perfect synergy of two diagnostic tools. Here is why the v2d Robustness Patch and the DELLA Audit Chart work so well together:

1. Complete "Chain of Custody" for Weights

The v2d Patch handles the "Input" phase, while the DELLA Audit handles the "Processing" phase.

  • v2d ensures that every model provides a valid tensor to the merge engine, even if it has to skip missing tokens or provide a zero-vector fallback.
  • DELLA Audit then takes those tensors and shows you the "Share of Voice" for each model.
  • The Synergy: If you see a model in the Audit chart with a 0.0% impact or an unusually low Norm (N), you can look up at the v2d Audit log to see if that model was missing critical tokens. It allows you to see exactly how "damaged" a donor is before it hits the final weights.

2. Identifying "Poisoned" Donors

In your screenshot, look at SLERP1. It has a massive 16.7% impact with a Norm of 12.02, while others like SLERP3 are at 1.0%.

  • Because the v2d Patch prevented the crash, you can now actually see these statistics.
  • If a model was missing tokens (as seen in your log for SLERP11, 15, 13, etc.), the Audit chart helps you decide if that model is still "contributing" enough to keep in the config, or if the vocab mismatches have made its task vector too noisy.

3. Mathematical Safety for DELLA

DELLA is sensitive to the magnitude of changes (the epsilon and density parameters).

  • By using the v2d Patch, you ensure that the "Base" and "Donor" tensors passed to DELLA are always the same shape.
  • Without this, DELLA would be trying to calculate magnitude-based pruning on mismatched arrays, which would lead to corrupted logic even if it didn't crash. v2d "sanitizes" the data so DELLA's math remains pure.

4. Real-Time Debugging of "Ghost" Contributions

Your log shows SLERP15 is missing almost all the special tokens (<|im_start|>, [SYSTEM_PROMPT], [PAD]).

  • Standard Mergekit would have died instantly.
  • Now, the merge continues, and the DELLA Audit shows SLERP15 is still contributing 2.4% to the overall model.
  • This tells you: "SLERP15 is broken for ChatML/Special tokens, but its weights for normal language (the other 131,000 tokens) are still being merged correctly."

Summary

"When paired with the DELLA Audit logic, the v2d patch provides a full-stack diagnostic suite. It allows the user to see which models are physically incompatible at the vocabulary level (via the Audit Log) and then immediately see how those incompatibilities affect the final weight distribution (via the Impact Chart). This combination turns a 'black box' crash into a transparent, manageable merging workflow."

embed_v2d

embed_v2d_layer10

Audit Analysis

This is a fascinating look at the "DNA" of your model. Now that the v2d patch has stabilized the merge, this audit chart reveals the true internal dynamics of a DELLA merge that were previously invisible.

Here is a breakdown of what this specific chart is telling you about your "knowledge distribution":

1. The "Anchor" Models

Look at pdq (13.5%) and SLERP1 (9.9%).

  • Even though every model has a weight of 0.10, these two are dominating the "Share of Voice."
  • Why? Their Norm (N) values are the highest (4.48 and 3.30). This means these models have the most significant "Task Vectors"—they have moved the furthest away from the Mistral-Nemo base. In a DELLA merge, these are the models providing the most "new" information or behavioral changes to the final result.
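
The chart's exact formula isn't shown, but one plausible way a "Share of Voice" figure could be derived is merge weight times task-vector norm, normalized across donors (a hypothetical sketch, not the actual audit code; the names and norms are the ones quoted above):

```python
def share_of_voice(donors):
    """Hypothetical sketch: donor impact proportional to
    merge weight * task-vector norm, normalized to sum to 1."""
    raw = {name: weight * norm for name, weight, norm in donors}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

# Equal weights of 0.10; norms as quoted from the audit chart.
shares = share_of_voice([
    ("pdq", 0.10, 4.48),
    ("SLERP1", 0.10, 3.30),
    ("SLERP9", 0.10, 0.22),
])
print({name: f"{100 * s:.1f}%" for name, s in shares.items()})
```

With only three of the chart's donors included, the percentages differ from the full chart, but the ordering and the roughly 20x gap between pdq and SLERP9 match the discussion here.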

2. The "Subtle" Contributors

Models like SLERP9 (0.7%) and SLERP8 (0.9%) are barely touching the weights.

  • Their Norms are tiny (0.22 and 0.31).
  • Insight: These models are very similar to your base model (Mistral-Nemo-Instruct-2407). They aren't "bad," but they are essentially acting as votes for the status quo. If you wanted to "clean up" your config, these are the ones you could remove with almost zero impact on the final output.

3. The "Middle Class"

Models like SLERP7 (8.8%) and SLERP3 (6.6%) represent the healthy average. They are providing a solid amount of unique knowledge without overwhelming the others.

4. Why the v2d Patch makes this chart "Truthful"

Without the v2d patch, if a model like SLERP15 was missing tokens, the merge would have crashed. Now, you can see SLERP15 is contributing 6.3% (Norm 2.11).

  • Because of the patch, you know that this 6.3% is based on the valid parts of SLERP15.
  • The audit chart is now a "Health Report": if you saw a model with a high Norm but a 0% impact, you'd know the vocab mismatch was so bad it wiped out the model's contribution. Here, we see that despite the warnings, the models are still successfully injecting their "knowledge" into the merge.

5. The "pdq" Factor

The model pdq is currently your strongest influencer in this layer (mlp.gate_proj). It is contributing nearly 20x more than SLERP9. If the final model behaves more like pdq than anything else, this chart explains exactly why.

This is the "X-Ray" of model merging. You aren't just guessing if the merge worked; you can see exactly which donor's "brain" is being used for this specific layer.

embed_v2d.py

# Copyright (C) 2025 Arcee AI
# SPDX-License-Identifier: LGPL-3.0-only

import logging
from typing import Dict, Optional

import torch

from mergekit.common import ImmutableMap, ModelReference
from mergekit.graph import Task
from mergekit.io.tasks import GatherTensors
from mergekit.tokenizer.build import BuildTokenizer, TokenizerInfo
from mergekit.tokenizer.config import (
    ModelTokenEmbedding,
    TokenEmbeddingConfig,
    ZeroEmbedding,
)


class PermutedEmbeddings(Task[Dict[ModelReference, torch.Tensor]]):
    gather_tensors: GatherTensors
    tokenizer_task: BuildTokenizer
    tokens: Optional[ImmutableMap[str, TokenEmbeddingConfig]]
    pad_to_multiple_of: Optional[int]
    base_model: Optional[ModelReference]

    def arguments(self) -> Dict[str, Task]:
        return {"tokenizer_info": self.tokenizer_task, "tensors": self.gather_tensors}

    def execute(
        self, tokenizer_info: TokenizerInfo, tensors: Dict[ModelReference, torch.Tensor]
    ) -> Dict[ModelReference, torch.Tensor]:
        tokenizer = tokenizer_info.tokenizer
        permutations = tokenizer_info.permutations

        models = set(tensors.keys())
        if self.base_model:
            models.add(self.base_model)
        models = list(models)

        vocab = tokenizer.get_vocab()
        vocab_size = len(vocab)
        if self.pad_to_multiple_of and vocab_size % self.pad_to_multiple_of:
            vocab_size = (
                vocab_size // self.pad_to_multiple_of + 1
            ) * self.pad_to_multiple_of
        embed_size = tensors[models[0]].shape[1]
        assert all(
            t.shape[1] == embed_size for t in tensors.values()
        ), "Embedding sizes must match"

        dtype = tensors[models[0]].dtype
        device = tensors[models[0]].device

        token_configs = dict(**(self.tokens or {}))
        tokens_to_average = self.assign_embedding_sources(
            permutations, models, vocab, token_configs
        )

        default_embeds = {}
        for token, token_id in vocab.items():
            embed = torch.zeros(embed_size, dtype=dtype, device=device)
            if token in tokens_to_average:
                count = 0
                for model in models:
                    p = permutations[model]
                    if p[token_id] < 0:
                        continue
                    
                    # --- AUDIT & BOUNDS CHECK ---
                    if p[token_id] >= tensors[model].shape[0]:
                        logging.warning(f"[VOCAB AUDIT] Model '{model}' is missing token '{token}' (ID: {token_id}). "
                                        f"Donor size: {tensors[model].shape[0]}, Requested Index: {p[token_id]}. Skipping.")
                        continue
                    # ----------------------------

                    embed += tensors[model][p[token_id]]
                    count += 1
                if count > 0:
                    embed /= count
                else:
                    # All donors were skipped by the bounds check; keep the zero vector.
                    logging.warning(
                        f"[VOCAB AUDIT] No physically valid donor rows for token "
                        f"{repr(token)}; falling back to zero."
                    )
            elif cfg := token_configs.get(token, None):
                cfg: TokenEmbeddingConfig
                embed = self.compute_default_embedding(
                    tokenizer_info, tensors, permutations, token, token_id, cfg
                )
            else:
                continue
            default_embeds[token] = embed

        result = {}
        for model in models:
            p = permutations[model]
            old_embed = tensors[model]
            new_embed = torch.zeros(
                (vocab_size, embed_size), dtype=dtype, device=device
            )
            for token, token_id in vocab.items():
                force = False
                if token in token_configs:
                    force = token_configs[token].force

                if p[token_id] >= 0 and not force:
                    # --- BOUNDS CHECK FOR RESULT MAPPING ---
                    if p[token_id] < old_embed.shape[0]:
                        new_embed[token_id, :] = old_embed[p[token_id]]
                    else:
                        # Fallback to the averaged/default version if the donor is too small
                        new_embed[token_id, :] = default_embeds.get(token, torch.zeros_like(new_embed[0]))
                    # ---------------------------------------
                elif token in default_embeds:
                    new_embed[token_id, :] = default_embeds[token]
                else:
                    logging.error(
                        f"No embedding for token {repr(token)} in model {model}!"
                    )

            if vocab_size > len(vocab):
                # as suggested by https://nlp.stanford.edu/~johnhew/vocab-expansion.html
                avg_embed = torch.mean(new_embed[: len(vocab), :], dim=0)
                new_embed[len(vocab) :, :] = avg_embed
            result[model] = new_embed

        return result

    def assign_embedding_sources(
        self,
        permutations: Dict[ModelReference, Dict[int, int]],
        models: list[ModelReference],
        vocab: Dict[str, int],
        token_configs: Dict[str, TokenEmbeddingConfig],
    ):
        permutation_list = [permutations[model] for model in models]

        tokens_to_average = set()
        # find tokens that are only present in one model
        for token, token_id in vocab.items():
            if token in token_configs:
                continue

            has_token = [p[token_id] >= 0 for p in permutation_list]
            num_present = sum(int(x) for x in has_token)
            if num_present == 1:
                donor_model = models[has_token.index(True)]
                token_configs[token] = TokenEmbeddingConfig(source=donor_model)
                continue

            if num_present == 0:
                token_configs[token] = TokenEmbeddingConfig(source=ZeroEmbedding())
                logging.warning(f"Token {repr(token)} not found in any model")
                continue

            if num_present > 0 and self.base_model is not None:
                if permutations[self.base_model][token_id] >= 0:
                    token_configs[token] = TokenEmbeddingConfig(source=self.base_model)
                    continue

            tokens_to_average.add(token)
        return tokens_to_average

    def compute_default_embedding(
        self,
        tokenizer_info: TokenizerInfo,
        tensors: Dict[ModelReference, torch.Tensor],
        permutations: Dict[ModelReference, Dict[int, int]],
        token: str,
        token_id: int,
        cfg: TokenEmbeddingConfig,
    ) -> torch.Tensor:
        if isinstance(cfg.source, ZeroEmbedding):
            # Return an explicit zero vector; `embed` is never assigned in this
            # branch, so falling through to the final return would raise NameError.
            ref = next(iter(tensors.values()))
            return torch.zeros(ref.shape[1], dtype=ref.dtype, device=ref.device)
        elif isinstance(cfg.source, ModelTokenEmbedding):
            model = cfg.source.model
            assert (
                model in permutations
            ), f"Model {model} referenced but not part of merge"
            p = permutations[model]
            src_token_id = cfg.source.token_id
            if src_token_id is None:
                src_token = cfg.source.token
                assert (
                    src_token in tokenizer_info.original_vocabs[model]
                ), f"Token {repr(src_token)} not found in model {model}"
                src_token_id = tokenizer_info.original_vocabs[model][src_token]
            assert (
                src_token_id >= 0 and src_token_id < tensors[model].shape[0]
            ), f"Token ID {src_token_id} out of range for model {model}"
            embed = tensors[model][src_token_id]
        elif isinstance(cfg.source, ModelReference):
            model = cfg.source
            p = permutations[model]
            assert p[token_id] >= 0, f"Token {repr(token)} not found in model {model}"
            
            # --- BOUNDS CHECK FOR DEFAULT EMBED ---
            if p[token_id] >= tensors[model].shape[0]:
                logging.warning(f"[VOCAB AUDIT] Default source model '{model}' is missing token '{token}' from its physical tensor. Falling back to zero.")
                return torch.zeros(tensors[model].shape[1], dtype=tensors[model].dtype, device=tensors[model].device)
            # --------------------------------------
            
            embed = tensors[model][p[token_id]]
        else:
            raise NotImplementedError(cfg)
        return embed