Mismatch between FlexOlmo experts and domain-specific models
Some of the expert weights in allenai/FlexOlmo-7x7B-1T do not match any of the released domain-specific models, which suggests that incorrect model versions were uploaded.
Specifically, the following models do not match any expert in allenai/FlexOlmo-7x7B-1T:
"allenai/Flex-math-2x7B-1T"
"allenai/Flex-code-2x7B-1T"
"allenai/Flex-reddit-2x7B-1T"
For these three models, even the closest expert in the merged model has a mean absolute relative error of over 5. The remaining domain-specific models (pes2o, creative, news) match experts 4, 5, and 6 exactly (zero-indexed).
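For reference, the "mean absolute relative error" figure above can be reproduced with a small helper like the following; the exact normalization (dividing by the reference magnitude plus a small epsilon) is my assumption:

```python
import torch

def mean_abs_rel_error(w: torch.Tensor, w_ref: torch.Tensor, eps: float = 1e-8) -> float:
    """Element-wise |w - w_ref| / (|w_ref| + eps), averaged over the whole matrix."""
    return torch.mean(torch.abs(w - w_ref) / (torch.abs(w_ref) + eps)).item()

# Identical weights give 0. Comparing, say, a domain expert's up_proj against
# candidate expert i of the merged model would look like:
#   mean_abs_rel_error(expert_model.model.layers[l].mlp.experts[1].up_proj.weight,
#                      model.model.layers[l].mlp.experts[i].up_proj.weight)
```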
The following script demonstrates the mismatch by finding, for each domain-specific model, the most similar expert (lowest mean squared error) in FlexOlmo-7x7B-1T:
```python
import torch
from transformers import AutoModelForCausalLM

REF_MODEL = "allenai/FlexOlmo-7x7B-1T"
EXPERT_MAP = {
    "math": "allenai/Flex-math-2x7B-1T",
    "code": "allenai/Flex-code-2x7B-1T",
    "reddit": "allenai/Flex-reddit-2x7B-1T",
    "academic": "allenai/Flex-pes2o-2x7B-1T",
    "creative": "allenai/Flex-creative-2x7B-1T",
    "news": "allenai/Flex-news-2x7B-1T",
}

model = AutoModelForCausalLM.from_pretrained(
    REF_MODEL, device_map="cpu", dtype=torch.bfloat16
).eval()
layer = 10

for name, model_name in EXPERT_MAP.items():
    print("=" * 40)
    print(f"Testing domain-specific model: {name}")
    expert_model = AutoModelForCausalLM.from_pretrained(
        model_name, device_map="cpu", dtype="auto"
    ).eval()

    # Each 2x7B model holds the shared base expert in slot 0 and the
    # domain-specific expert in slot 1; find the closest merged expert for each.
    for slot, label in ((0, "base"), (1, "domain")):
        best_err, best_i = float("inf"), -1
        for i in range(model.config.num_experts):
            err = torch.mean(
                (expert_model.model.layers[layer].mlp.experts[slot].up_proj.weight
                 - model.model.layers[layer].mlp.experts[i].up_proj.weight) ** 2
            )
            if err < best_err:
                best_err, best_i = float(err), i
        print(f"Best match for {label} expert: expert {best_i} with L2 error {best_err:.10f}")
```
resulting in:
```
========================================
Testing domain-specific model: math
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 1 with L2 error 0.0016101063
========================================
Testing domain-specific model: code
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 2 with L2 error 0.0019702788
========================================
Testing domain-specific model: reddit
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 3 with L2 error 0.0018725740
========================================
Testing domain-specific model: academic
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 4 with L2 error 0.0000000000
========================================
Testing domain-specific model: creative
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 5 with L2 error 0.0000000000
========================================
Testing domain-specific model: news
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 6 with L2 error 0.0000000000
```
The expert weights for the math, code, and reddit models don't match at any layer of the -RT model either.
The router (gate) weights also don't match those of any of the 2x7B models at all.
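A quick way to check the router weights is to search, row by row, for the closest match. A sketch, assuming the OLMoE-style attribute path `model.layers[l].mlp.gate.weight` with shape `[num_experts, hidden]`:

```python
import torch

def best_matching_row(query: torch.Tensor, candidates: torch.Tensor) -> tuple[int, float]:
    """Return the index and MSE of the row in `candidates` closest to `query`."""
    errs = torch.mean((candidates.float() - query.float()) ** 2, dim=1)
    i = int(torch.argmin(errs))
    return i, float(errs[i])

# For each router row of a 2x7B model at some layer l (paths assumed as above):
#   gate_2x = expert_model.model.layers[l].mlp.gate.weight   # shape [2, hidden]
#   gate_7x = model.model.layers[l].mlp.gate.weight          # shape [7, hidden]
#   for r in range(gate_2x.shape[0]):
#       print(best_matching_row(gate_2x[r], gate_7x))
```

A mismatched router would show a clearly nonzero MSE for every row, whereas a correctly merged model should reproduce at least the domain row exactly.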
Hello, thank you so much for bringing this to our attention! We had mistakenly uploaded the incorrect model. Both 7x7B models have now been updated, and the non-RT model should have the correct experts mapped.