Mergekit Robustness Patch: embed.py (v2)
Attached are two variants: one for Mistral Nemo 12B (v2d) and one for Mistral Small 24B (v2a).
Overview
This patch provides a high-resilience version of Mergekit’s tokenizer/embed.py. It is specifically designed to handle "dirty" model merges where a donor model’s tokenizer.json and its physical model.safetensors weights are out of sync—a common issue when merging models that have different vocabulary sizes (e.g., mixing Mistral Tekken with ChatML or Llama 3).
The Problem: "Ghost Tokens"
In the standard Mergekit (embed.py), the engine assumes that if a token exists in a model's vocabulary, a corresponding row must exist in its embedding weights.
However, in many community-made merges:
- A model might have 131,081 tokens in its tokenizer.
- But its weight matrix (embed_tokens) only contains 131,072 rows.
- Standard Result: Mergekit attempts to read index 131,073, hits a boundary error, and the entire merge crashes with IndexError: index X is out of bounds for dimension 0 with size Y.
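The failure and the fix can be sketched in a few lines of PyTorch. The sizes mirror the example above; the variable names are illustrative, not Mergekit's:

```python
import torch

# Toy reproduction of the crash: the tokenizer claims a token at index
# 131,073, but the embedding matrix only has 131,072 rows.
embed_tokens = torch.zeros(131_072, 8)  # rows x hidden_size
token_index = 131_073

row = None
try:
    row = embed_tokens[token_index]  # out-of-bounds read: IndexError
except IndexError as err:
    print(f"standard mergekit would crash here: {err}")

# A bounds-aware read skips the donor instead of crashing:
if token_index < embed_tokens.shape[0]:
    row = embed_tokens[token_index]
else:
    row = None  # this donor is simply skipped for this token
```

The whole v2d patch is essentially this guard, applied at every point where a permuted token index is used to read a donor's embedding matrix.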
The Solution: v2d Robustness & Audit
The v2d patch introduces Bounds-Aware Permutation. Instead of blindly trusting the tokenizer, it verifies the physical existence of every token row before attempting to merge it.
Key Features:
- Crash Prevention: Automatically detects if a donor model is "too small" for the requested token index. Instead of crashing, it gracefully skips that donor for that specific token.
- Live Vocab Audit: Prints detailed warnings to the console identifying exactly which model is missing which token. This allows you to identify "buggy" donors in your config without trial-and-error.
- Intelligent Fallback:
- If a token is missing from one donor but present in others, it averages the token using only the valid donors.
- If a token is missing from its primary "default" source, it falls back to a zero-vector rather than terminating the merge.
- Result Mapping Safety: Ensures that the final output tensors for every donor are correctly aligned, even if the donor was physically smaller than the target union vocabulary.
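The averaging rule described above can be sketched as follows. The function name and data structures are illustrative stand-ins for the actual Mergekit internals, assuming a per-donor permutation map of `token_id -> row index` (`-1` when the token is absent from that donor's vocabulary):

```python
import torch

def merge_token_row(index_per_donor, tensors, embed_size):
    """Average a union-vocab token over only the donors that physically
    contain the row. index_per_donor: donor name -> row index (-1 if the
    token is not in that donor's vocabulary)."""
    embed = torch.zeros(embed_size)
    count = 0
    for name, idx in index_per_donor.items():
        if idx < 0:
            continue  # token not in this donor's vocabulary
        if idx >= tensors[name].shape[0]:
            continue  # "ghost token": in tokenizer.json, not in the weights
        embed += tensors[name][idx]
        count += 1
    # Zero-vector fallback when no donor physically has the row.
    return embed / count if count else embed

tensors = {
    "A": torch.ones(4, 2),      # donor with 4 rows
    "B": 3 * torch.ones(2, 2),  # truncated donor: only 2 rows
}
# Token claimed at row 3 in both donors -> only A is physically valid,
# so the result is A's row alone, not a corrupted average.
row = merge_token_row({"A": 3, "B": 3}, tensors, embed_size=2)
```

When every donor is skipped, the untouched zero vector is returned, matching the "Intelligent Fallback" behavior above.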
Comparison: v1 vs. v2d
| Feature | embed.py (Default) | embed_v2d.py (Ours) |
|---|---|---|
| Mismatched Vocab | Crashes with IndexError. | Succeeds via graceful skipping. |
| Error Reporting | Generic Python Traceback. | Detailed [VOCAB AUDIT] log with Model Path & Token Name. |
| Special Token Support | Requires perfectly synced weights. | Handles "Ghost Tokens" (tokens in JSON but not in Tensors). |
| Mathematical Integrity | N/A (Process stops). | Maintains correct averaging by adjusting the donor count dynamically. |
| Use Case | Clean, base-model merges. | Complex merges of merges, cross-architecture vocab unions. |
How to Use
Replace your existing mergekit/tokenizer/embed.py with the embed_v2d.py code.
Example Audit Log
When running a merge with mismatched models, you will now see helpful diagnostic output instead of a crash:
[VOCAB AUDIT] Model 'B:\12B\SLERP15' is missing token '<|im_start|>' (ID: 131073).
Donor size: 131072, Requested Index: 131073. Skipping.
[VOCAB AUDIT] Default source model 'B:\12B\SLERP13' is missing token '<SPECIAL_4>'
from its physical tensor. Falling back to zero.
Why this matters for 12B/24B Merges
When merging models like Mistral-Nemo (12B) or Mistral-Small (24B), different fine-tunes often add different special tokens (ChatML, Tool-use, etc.). If you use tokenizer_source: union, Mergekit tries to create a "Super-Vocab."
Standard Mergekit is too fragile for this process if even one model in your list has a slightly truncated embedding matrix. v2d makes the merging process "production-grade" by allowing the merge to complete regardless of minor inconsistencies in the donor models.
This patch is safe and beneficial for any model architecture (12B, 24B, 70B, etc.) using tokenizer_source: union.
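For reference, a minimal mergekit config that triggers the union "Super-Vocab" path might look like the following. The model paths are placeholders and the parameter values are purely illustrative:

```yaml
# Illustrative config: tokenizer_source: union is the scenario this patch targets.
merge_method: della
base_model: mistralai/Mistral-Nemo-Instruct-2407
tokenizer_source: union
models:
  - model: ./SLERP13
    parameters:
      weight: 0.10
  - model: ./SLERP15
    parameters:
      weight: 0.10
dtype: bfloat16
```

Any merge method works with a union tokenizer; della is used here only because it matches the merges discussed later in this document.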
Here is the breakdown of how it affects other scenarios like 24B Mistral (Tekken):
1. It prevents "Ghost Token" crashes
In many Mistral-based merges (especially Tekken), developers sometimes add special tokens to the tokenizer.json but forget to resize the embedding layer in the model.safetensors.
- Without this patch: Mergekit sees the token in the config, calculates a high index for it, tries to read it from the tensor, and crashes.
- With this patch: Mergekit sees the mismatch, logs a warning, and uses a zero-vector or an average from other models instead. The merge finishes successfully.
2. Handling "Tekken" Vocab Discrepancies
Mistral Tekken usually has a vocab size of 32768 or 131072. If you merge a model with 131072 and one that was accidentally truncated to 131070:
- The patch ensures that for those last 2 tokens, the "truncated" model simply doesn't contribute to the average.
- The resulting model will have the full 131,072-token vocab, and those 2 tokens will be populated by the weights from the model that actually had them.
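A scaled-down numeric version of this scenario, assuming toy 6-row and 4-row tensors in place of the real 131,072-row matrices:

```python
import torch

# Donor "full" has all 6 rows; donor "short" was truncated to 4.
full = torch.arange(12.0).reshape(6, 2)
short = torch.zeros(4, 2)
donors = [full, short]

vocab_size = 6
merged = torch.zeros(vocab_size, 2)
for token_id in range(vocab_size):
    # Average over only the donors that physically contain this row.
    valid = [d[token_id] for d in donors if token_id < d.shape[0]]
    merged[token_id] = torch.stack(valid).mean(dim=0)

# Rows 0-3 are averaged over both donors; rows 4-5 come solely from "full".
```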
3. No Negative Impact on "Clean" Models
If you merge two models where the vocab_size in config.json perfectly matches the number of rows in model.safetensors, this code does nothing. The if condition (p[token_id] >= tensors[model].shape[0]) will always be false, and the code will run at full speed with no warnings.
4. Why this is better than the "Padding" patch
The previous attempt to pad tensors in generalized_task_arithmetic.py was specific to one merge method. This embed.py patch works at the tokenizer level.
Whether you are doing a linear, slerp, ties, or della merge, this patch ensures that the "Input Tensors" are standardized correctly before the math even starts.
Summary of behavior for 24B/Tekken:
| Scenario | Result with Patch |
|---|---|
| Vocabs Match Exactly | Normal merge, no warnings. |
| One model has extra Tekken tokens | Merge completes; missing tokens are averaged from models that have them. |
| Tokenizer says 131072, but Tensor is 131070 | Merge completes instead of crashing. |
| Mixing Tekken and Llama3 Vocab | Merge completes; shared tokens are averaged, unique tokens are preserved from their respective sources. |
Conclusion: This is a "Robustness Patch." It makes Mergekit more resilient to poorly-configured donor models (where the tokenizer and the weights are out of sync), which is very common in the community-made merges you are working with.
Addendum
This is a perfect synergy of two diagnostic tools. Here is why the v2d Robustness Patch and the DELLA Audit Chart work so well together:
1. Complete "Chain of Custody" for Weights
The v2d Patch handles the "Input" phase, while the DELLA Audit handles the "Processing" phase.
- v2d ensures that every model provides a valid tensor to the merge engine, even if it has to skip missing tokens or provide a zero-vector fallback.
- DELLA Audit then takes those tensors and shows you the "Share of Voice" for each model.
- The Synergy: If you see a model in the Audit chart with a 0.0% impact or an unusually low Norm (N), you can look up at the v2d Audit log to see if that model was missing critical tokens. It allows you to see exactly how "damaged" a donor is before it hits the final weights.
2. Identifying "Poisoned" Donors
In your screenshot, look at SLERP1. It has a massive 16.7% impact with a Norm of 12.02, while others like SLERP3 are at 1.0%.
- Because the v2d Patch prevented the crash, you can now actually see these statistics.
- If a model was missing tokens (as seen in your log for SLERP11, 15, 13, etc.), the Audit chart helps you decide if that model is still "contributing" enough to keep in the config, or if the vocab mismatches have made its task vector too noisy.
3. Mathematical Safety for DELLA
DELLA is sensitive to the magnitude of changes (the epsilon and density parameters).
- By using the v2d Patch, you ensure that the "Base" and "Donor" tensors passed to DELLA are always the same shape.
- Without this, DELLA would be trying to calculate magnitude-based pruning on mismatched arrays, which would lead to corrupted logic even if it didn't crash. v2d "sanitizes" the data so DELLA's math remains pure.
4. Real-Time Debugging of "Ghost" Contributions
Your log shows SLERP15 is missing almost all the special tokens (<|im_start|>, [SYSTEM_PROMPT], [PAD]).
- Standard Mergekit would have died instantly.
- Now, the merge continues, and the DELLA Audit shows SLERP15 is still contributing 2.4% to the overall model.
- This tells you: "SLERP15 is broken for ChatML/Special tokens, but its weights for normal language (the other 131,000 tokens) are still being merged correctly."
Summary
"When paired with the DELLA Audit logic, the v2d patch provides a full-stack diagnostic suite. It allows the user to see which models are physically incompatible at the vocabulary level (via the Audit Log) and then immediately see how those incompatibilities affect the final weight distribution (via the Impact Chart). This combination turns a 'black box' crash into a transparent, manageable merging workflow."
Audit Analysis
This is a fascinating look at the "DNA" of your model. Now that the v2d patch has stabilized the merge, this audit chart reveals the true internal dynamics of a DELLA merge that were previously invisible.
Here is a breakdown of what this specific chart is telling you about your "knowledge distribution":
1. The "Anchor" Models
Look at pdq (13.5%) and SLERP1 (9.9%).
- Even though every model has a weight of 0.10, these two are dominating the "Share of Voice."
- Why? Their Norm (N) values are the highest (4.48 and 3.30). This means these models have the most significant "Task Vectors": they have moved the furthest away from the Mistral-Nemo base. In a DELLA merge, these are the models providing the most "new" information or behavioral changes to the final result.
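The audit chart's exact formula isn't shown in this document, but a plausible reconstruction of "Norm" and "Share of Voice", assuming impact is each donor's weighted task-vector norm as a fraction of the weighted total, would be:

```python
import torch

def share_of_voice(base, donors, weights):
    """Hypothetical audit metric: Norm = L2 distance of each donor's
    tensor from the base; impact = weighted norm as a % of the total.
    The real DELLA audit may compute this differently."""
    norms = {n: torch.linalg.vector_norm(t - base).item() for n, t in donors.items()}
    scores = {n: weights[n] * norms[n] for n in donors}
    total = sum(scores.values())
    return norms, {n: 100.0 * s / total for n, s in scores.items()}

# Toy tensors chosen so the norms echo the chart's pdq/SLERP9 values.
base = torch.zeros(4)
donors = {
    "pdq": torch.tensor([4.48, 0.0, 0.0, 0.0]),
    "SLERP9": torch.tensor([0.22, 0.0, 0.0, 0.0]),
}
norms, impact = share_of_voice(base, donors, {"pdq": 0.10, "SLERP9": 0.10})
```

Under this reading, equal config weights still produce very unequal impact whenever the task-vector norms differ, which is exactly the pattern the chart shows.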
2. The "Subtle" Contributors
Models like SLERP9 (0.7%) and SLERP8 (0.9%) are barely touching the weights.
- Their Norms are tiny (0.22 and 0.31).
- Insight: These models are very similar to your base model (Mistral-Nemo-Instruct-2407). They aren't "bad," but they are essentially acting as votes for the status quo. If you wanted to "clean up" your config, these are the ones you could remove with almost zero impact on the final output.
3. The "Middle Class"
Models like SLERP7 (8.8%) and SLERP3 (6.6%) represent the healthy average. They are providing a solid amount of unique knowledge without overwhelming the others.
4. Why the v2d Patch makes this chart "Truthful"
Without the v2d patch, if a model like SLERP15 was missing tokens, the merge would have crashed. Now, you can see SLERP15 is contributing 6.3% (Norm 2.11).
- Because of the patch, you know that this 6.3% is based on the valid parts of SLERP15.
- The audit chart is now a "Health Report": if you saw a model with a high Norm but a 0% impact, you'd know the vocab mismatch was so bad it wiped out the model's contribution. Here, we see that despite the warnings, the models are still successfully injecting their "knowledge" into the merge.
5. The "pdq" Factor
The model pdq is currently your strongest influencer in this layer (mlp.gate_proj). It is contributing nearly 20x more than SLERP9. If the final model behaves more like pdq than anything else, this chart explains exactly why.
This is the "X-Ray" of model merging. You aren't just guessing if the merge worked; you can see exactly which donor's "brain" is being used for this specific layer.
embed_v2d.py
```python
# Copyright (C) 2025 Arcee AI
# SPDX-License-Identifier: LGPL-3.0-only
import logging
from typing import Dict, Optional

import torch

from mergekit.common import ImmutableMap, ModelReference
from mergekit.graph import Task
from mergekit.io.tasks import GatherTensors
from mergekit.tokenizer.build import BuildTokenizer, TokenizerInfo
from mergekit.tokenizer.config import (
    ModelTokenEmbedding,
    TokenEmbeddingConfig,
    ZeroEmbedding,
)


class PermutedEmbeddings(Task[Dict[ModelReference, torch.Tensor]]):
    gather_tensors: GatherTensors
    tokenizer_task: BuildTokenizer
    tokens: Optional[ImmutableMap[str, TokenEmbeddingConfig]]
    pad_to_multiple_of: Optional[int]
    base_model: Optional[ModelReference]

    def arguments(self) -> Dict[str, Task]:
        return {"tokenizer_info": self.tokenizer_task, "tensors": self.gather_tensors}

    def execute(
        self, tokenizer_info: TokenizerInfo, tensors: Dict[ModelReference, torch.Tensor]
    ) -> Dict[ModelReference, torch.Tensor]:
        tokenizer = tokenizer_info.tokenizer
        permutations = tokenizer_info.permutations

        models = set(tensors.keys())
        if self.base_model:
            models.add(self.base_model)
        models = list(models)

        vocab = tokenizer.get_vocab()
        vocab_size = len(vocab)
        if self.pad_to_multiple_of and vocab_size % self.pad_to_multiple_of:
            vocab_size = (
                vocab_size // self.pad_to_multiple_of + 1
            ) * self.pad_to_multiple_of
        embed_size = tensors[models[0]].shape[1]
        assert all(
            t.shape[1] == embed_size for t in tensors.values()
        ), "Embedding sizes must match"
        dtype = tensors[models[0]].dtype
        device = tensors[models[0]].device

        token_configs = dict(**(self.tokens or {}))
        tokens_to_average = self.assign_embedding_sources(
            permutations, models, vocab, token_configs
        )

        default_embeds = {}
        for token, token_id in vocab.items():
            embed = torch.zeros(embed_size, dtype=dtype, device=device)
            if token in tokens_to_average:
                count = 0
                for model in models:
                    p = permutations[model]
                    if p[token_id] < 0:
                        continue
                    # --- AUDIT & BOUNDS CHECK ---
                    if p[token_id] >= tensors[model].shape[0]:
                        logging.warning(
                            f"[VOCAB AUDIT] Model '{model}' is missing token "
                            f"'{token}' (ID: {token_id}). "
                            f"Donor size: {tensors[model].shape[0]}, "
                            f"Requested Index: {p[token_id]}. Skipping."
                        )
                        continue
                    # ----------------------------
                    embed += tensors[model][p[token_id]]
                    count += 1
                if count > 0:
                    embed /= count
                # count == 0: every donor was skipped, so the zero vector stands.
            elif cfg := token_configs.get(token, None):
                cfg: TokenEmbeddingConfig
                embed = self.compute_default_embedding(
                    tokenizer_info, tensors, permutations, token, token_id, cfg
                )
            else:
                continue
            default_embeds[token] = embed

        result = {}
        for model in models:
            p = permutations[model]
            old_embed = tensors[model]
            new_embed = torch.zeros(
                (vocab_size, embed_size), dtype=dtype, device=device
            )
            for token, token_id in vocab.items():
                force = False
                if token in token_configs:
                    force = token_configs[token].force
                if p[token_id] >= 0 and not force:
                    # --- BOUNDS CHECK FOR RESULT MAPPING ---
                    if p[token_id] < old_embed.shape[0]:
                        new_embed[token_id, :] = old_embed[p[token_id]]
                    else:
                        # Fall back to the averaged/default version if the
                        # donor is physically too small.
                        new_embed[token_id, :] = default_embeds.get(
                            token, torch.zeros_like(new_embed[0])
                        )
                    # ---------------------------------------
                elif token in default_embeds:
                    new_embed[token_id, :] = default_embeds[token]
                else:
                    logging.error(
                        f"No embedding for token {repr(token)} in model {model}!"
                    )
            if vocab_size > len(vocab):
                # as suggested by https://nlp.stanford.edu/~johnhew/vocab-expansion.html
                avg_embed = torch.mean(new_embed[: len(vocab), :], dim=0)
                new_embed[len(vocab) :, :] = avg_embed
            result[model] = new_embed
        return result

    def assign_embedding_sources(
        self,
        permutations: Dict[ModelReference, Dict[int, int]],
        models: list[ModelReference],
        vocab: Dict[str, int],
        token_configs: Dict[str, TokenEmbeddingConfig],
    ):
        permutation_list = [permutations[model] for model in models]

        tokens_to_average = set()
        # find tokens that are only present in one model
        for token, token_id in vocab.items():
            if token in token_configs:
                continue
            has_token = [p[token_id] >= 0 for p in permutation_list]
            num_present = sum(int(x) for x in has_token)
            if num_present == 1:
                donor_model = models[has_token.index(True)]
                token_configs[token] = TokenEmbeddingConfig(source=donor_model)
                continue
            if num_present == 0:
                token_configs[token] = TokenEmbeddingConfig(source=ZeroEmbedding())
                logging.warning(f"Token {repr(token)} not found in any model")
                continue
            if num_present > 0 and self.base_model is not None:
                if permutations[self.base_model][token_id] >= 0:
                    token_configs[token] = TokenEmbeddingConfig(source=self.base_model)
                    continue
            tokens_to_average.add(token)
        return tokens_to_average

    def compute_default_embedding(
        self,
        tokenizer_info: TokenizerInfo,
        tensors: Dict[ModelReference, torch.Tensor],
        permutations: Dict[ModelReference, Dict[int, int]],
        token: str,
        token_id: int,
        cfg: TokenEmbeddingConfig,
    ) -> torch.Tensor:
        if isinstance(cfg.source, ZeroEmbedding):
            # Explicit zero embedding: return an all-zero row of the right width.
            ref = next(iter(tensors.values()))
            return torch.zeros(ref.shape[1], dtype=ref.dtype, device=ref.device)
        elif isinstance(cfg.source, ModelTokenEmbedding):
            model = cfg.source.model
            assert (
                model in permutations
            ), f"Model {model} referenced but not part of merge"
            src_token_id = cfg.source.token_id
            if src_token_id is None:
                src_token = cfg.source.token
                assert (
                    src_token in tokenizer_info.original_vocabs[model]
                ), f"Token {repr(src_token)} not found in model {model}"
                src_token_id = tokenizer_info.original_vocabs[model][src_token]
            assert (
                0 <= src_token_id < tensors[model].shape[0]
            ), f"Token ID {src_token_id} out of range for model {model}"
            embed = tensors[model][src_token_id]
        elif isinstance(cfg.source, ModelReference):
            model = cfg.source
            p = permutations[model]
            assert p[token_id] >= 0, f"Token {repr(token)} not found in model {model}"
            # --- BOUNDS CHECK FOR DEFAULT EMBED ---
            if p[token_id] >= tensors[model].shape[0]:
                logging.warning(
                    f"[VOCAB AUDIT] Default source model '{model}' is missing token "
                    f"'{token}' from its physical tensor. Falling back to zero."
                )
                return torch.zeros(
                    tensors[model].shape[1],
                    dtype=tensors[model].dtype,
                    device=tensors[model].device,
                )
            # --------------------------------------
            embed = tensors[model][p[token_id]]
        else:
            raise NotImplementedError(cfg)
        return embed
```