Mismatch between FlexOlmo experts and domain-specific models
Some of the expert weights in allenai/FlexOlmo-7x7B-1T do not match any of the released domain-specific models, which suggests that incorrect model versions were uploaded.
Specifically, the following models do not match any expert in allenai/FlexOlmo-7x7B-1T:
"allenai/Flex-math-2x7B-1T"
"allenai/Flex-code-2x7B-1T"
"allenai/Flex-reddit-2x7B-1T"
For these three models, even the closest expert in the merged model has a mean absolute relative error of over 5. The remaining domain-specific models (pes2o, creative, news) match experts 4, 5, and 6 exactly (zero-indexed).
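For reference, the "mean absolute relative error" figure above can be reproduced with a small helper like the following; the exact normalization (dividing by the reference magnitude plus a small epsilon) is my assumption:

```python
import torch

def mean_abs_rel_error(w: torch.Tensor, w_ref: torch.Tensor, eps: float = 1e-8) -> float:
    """Element-wise |w - w_ref| / (|w_ref| + eps), averaged over the whole matrix."""
    return torch.mean(torch.abs(w - w_ref) / (torch.abs(w_ref) + eps)).item()

# Identical weights give 0. Comparing, say, a domain expert's up_proj against
# candidate expert i of the merged model would look like:
#   mean_abs_rel_error(expert_model.model.layers[l].mlp.experts[1].up_proj.weight,
#                      model.model.layers[l].mlp.experts[i].up_proj.weight)
```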
The following script demonstrates the mismatch by finding, for each domain-specific model, the most similar expert (lowest mean squared error) in FlexOlmo-7x7B-1T:
```python
import torch
from transformers import AutoModelForCausalLM

REF_MODEL = "allenai/FlexOlmo-7x7B-1T"
EXPERT_MAP = {
    "math": "allenai/Flex-math-2x7B-1T",
    "code": "allenai/Flex-code-2x7B-1T",
    "reddit": "allenai/Flex-reddit-2x7B-1T",
    "academic": "allenai/Flex-pes2o-2x7B-1T",
    "creative": "allenai/Flex-creative-2x7B-1T",
    "news": "allenai/Flex-news-2x7B-1T",
}

model = AutoModelForCausalLM.from_pretrained(
    REF_MODEL, device_map="cpu", dtype=torch.bfloat16
).eval()
layer = 10

for name, model_name in EXPERT_MAP.items():
    print("=" * 40)
    print(f"Testing domain-specific model: {name}")
    expert_model = AutoModelForCausalLM.from_pretrained(
        model_name, device_map="cpu", dtype="auto"
    ).eval()

    # Each 2x7B model holds the shared base expert in slot 0 and the
    # domain-specific expert in slot 1; find the closest merged expert for each.
    for slot, label in ((0, "base"), (1, "domain")):
        best_err, best_i = float("inf"), -1
        for i in range(model.config.num_experts):
            err = torch.mean(
                (expert_model.model.layers[layer].mlp.experts[slot].up_proj.weight
                 - model.model.layers[layer].mlp.experts[i].up_proj.weight) ** 2
            )
            if err < best_err:
                best_err, best_i = float(err), i
        print(f"Best match for {label} expert: expert {best_i} with L2 error {best_err:.10f}")
```
resulting in:
```
========================================
Testing domain-specific model: math
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 1 with L2 error 0.0016101063
========================================
Testing domain-specific model: code
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 2 with L2 error 0.0019702788
========================================
Testing domain-specific model: reddit
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 3 with L2 error 0.0018725740
========================================
Testing domain-specific model: academic
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 4 with L2 error 0.0000000000
========================================
Testing domain-specific model: creative
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 5 with L2 error 0.0000000000
========================================
Testing domain-specific model: news
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 6 with L2 error 0.0000000000
```
The expert weights for the math, code, and reddit models don't match at any layer of the -RT model either.
The router (gate) weights also don't match those of any of the 2x7B models at all.
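A quick way to check the router weights is to search, row by row, for the closest match. A sketch, assuming the OLMoE-style attribute path `model.layers[l].mlp.gate.weight` with shape `[num_experts, hidden]`:

```python
import torch

def best_matching_row(query: torch.Tensor, candidates: torch.Tensor) -> tuple[int, float]:
    """Return the index and MSE of the row in `candidates` closest to `query`."""
    errs = torch.mean((candidates.float() - query.float()) ** 2, dim=1)
    i = int(torch.argmin(errs))
    return i, float(errs[i])

# For each router row of a 2x7B model at some layer l (paths assumed as above):
#   gate_2x = expert_model.model.layers[l].mlp.gate.weight   # shape [2, hidden]
#   gate_7x = model.model.layers[l].mlp.gate.weight          # shape [7, hidden]
#   for r in range(gate_2x.shape[0]):
#       print(best_matching_row(gate_2x[r], gate_7x))
```

A mismatched router would show a clearly nonzero MSE for every row, whereas a correctly merged model should reproduce at least the domain row exactly.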
Hello, thank you so much for bringing this to our attention! We had mistakenly uploaded the incorrect model. Both 7x7B models have now been updated, and the non-RT model should have the correct experts mapped.