Mismatch between FlexOlmo experts and domain-specific models

#5
by kmg42 - opened

Some of the expert weights in allenai/FlexOlmo-7x7B-1T do not match any of the released domain-specific models, which suggests that incorrect model versions may have been released.

Specifically, the following models do not match any of the experts in allenai/FlexOlmo-7x7B-1T:

"allenai/Flex-math-2x7B-1T"
"allenai/Flex-code-2x7B-1T"
"allenai/Flex-reddit-2x7B-1T"

These domain experts have a mean absolute relative error of over 5 against their closest counterparts in the merged model. The remaining models (pes2o, creative, news) match experts 4, 5, and 6 perfectly, respectively (zero-indexed).
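For reference, a minimal sketch of the mean-absolute-relative-error metric used for the claim above (my own helper, not part of the released code; random tensors stand in for the actual weights):

```python
import torch

def mean_abs_rel_error(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> float:
    """Mean absolute relative error: mean(|a - b| / (|b| + eps))."""
    return torch.mean(torch.abs(a - b) / (torch.abs(b) + eps)).item()

torch.manual_seed(0)
w = torch.randn(16, 16)
print(mean_abs_rel_error(w, w))      # identical weights -> 0.0
print(mean_abs_rel_error(w, 2 * w))  # off by a factor of 2 -> ~0.5
```

A value over 5 on this scale means the weights are not merely quantization- or dtype-perturbed copies of each other; they are different tensors.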

Running this code shows the mismatch by trying to find the most similar expert (in terms of L2 error) in FlexOlmo-7x7B-1T for each of the domain-specific models:

import torch
from transformers import AutoModelForCausalLM

REF_MODEL = "allenai/FlexOlmo-7x7B-1T"
EXPERT_MAP = {
    "math": "allenai/Flex-math-2x7B-1T",
    "code": "allenai/Flex-code-2x7B-1T",
    "reddit": "allenai/Flex-reddit-2x7B-1T",
    "academic": "allenai/Flex-pes2o-2x7B-1T",
    "creative": "allenai/Flex-creative-2x7B-1T",
    "news": "allenai/Flex-news-2x7B-1T",
}

model = AutoModelForCausalLM.from_pretrained(
    REF_MODEL, device_map="cpu", torch_dtype=torch.bfloat16
).eval()

layer = 10
for name, model_name in EXPERT_MAP.items():
    print("=" * 40)
    print(f"Testing domain-specific model: {name}")
    expert_model = AutoModelForCausalLM.from_pretrained(
        model_name, device_map="cpu", torch_dtype="auto"
    ).eval()

    # In each 2x7B model, expert 0 is the shared/base expert and
    # expert 1 is the domain expert.
    for expert_idx, label in ((0, "base"), (1, "domain")):
        ref_weight = expert_model.model.layers[layer].mlp.experts[expert_idx].up_proj.weight
        best_err, best_i = float("inf"), -1
        for i in range(model.config.num_experts):
            # Mean squared error between the up-projection weights.
            err = torch.mean(
                (ref_weight - model.model.layers[layer].mlp.experts[i].up_proj.weight) ** 2
            )
            if err < best_err:
                best_err, best_i = float(err), i
        print(f"Best match for {label} expert: expert {best_i} with L2 error {best_err:.10f}")

    del expert_model  # free memory before loading the next model

resulting in:

========================================
Testing domain-specific model: math
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 1 with L2 error 0.0016101063
========================================
Testing domain-specific model: code
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 2 with L2 error 0.0019702788
========================================
Testing domain-specific model: reddit
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 3 with L2 error 0.0018725740
========================================
Testing domain-specific model: academic
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 4 with L2 error 0.0000000000
========================================
Testing domain-specific model: creative
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 5 with L2 error 0.0000000000
========================================
Testing domain-specific model: news
Best match for base expert: expert 0 with L2 error 0.0000000000
Best match for domain expert: expert 6 with L2 error 0.0000000000

The expert weights for the math, code, and reddit models don't match at any layer, and the same mismatch shows up in the -RT model.
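A sketch of how such an all-layers check could look. The attribute path `model.layers[i].mlp.experts[j].up_proj` follows the script above; the mock objects are stand-ins so the snippet runs without downloading the checkpoints:

```python
from types import SimpleNamespace

import torch

def max_layer_mse(model_a, model_b, expert_a: int, expert_b: int) -> float:
    """Worst-case MSE between two experts' up_proj weights across all layers."""
    worst = 0.0
    for la, lb in zip(model_a.model.layers, model_b.model.layers):
        mse = torch.mean(
            (la.mlp.experts[expert_a].up_proj.weight
             - lb.mlp.experts[expert_b].up_proj.weight) ** 2
        ).item()
        worst = max(worst, mse)
    return worst

def make_mock(num_layers: int, num_experts: int, d: int = 4):
    """Tiny object tree mimicking the HF module layout used above."""
    layers = [
        SimpleNamespace(mlp=SimpleNamespace(experts=[
            SimpleNamespace(up_proj=SimpleNamespace(weight=torch.randn(d, d)))
            for _ in range(num_experts)
        ]))
        for _ in range(num_layers)
    ]
    return SimpleNamespace(model=SimpleNamespace(layers=layers))

torch.manual_seed(0)
mock = make_mock(num_layers=2, num_experts=2)
print(max_layer_mse(mock, mock, 0, 0))      # same expert -> 0.0
print(max_layer_mse(mock, mock, 0, 1) > 0)  # different experts -> True
```

With the real checkpoints in place of the mocks, a worst-case MSE of exactly 0 across layers would confirm an expert was copied verbatim into the merged model.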

The router (gate) weights don't match those of the 2x7B models at all, either.
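The same nearest-match check can be applied to the router. A minimal sketch with synthetic tensors standing in for the router matrices (the real ones would come from something like `model.model.layers[l].mlp.gate.weight`, assuming OLMoE-style naming, which I have not verified for FlexOlmo):

```python
import torch

def rows_match(small_gate: torch.Tensor, big_gate: torch.Tensor, tol: float = 1e-6):
    """For each row of the small router matrix, find the closest row of the
    big router matrix (by MSE) and report whether it matches within tol."""
    matches = []
    for row in small_gate:
        errs = torch.mean((big_gate - row) ** 2, dim=-1)
        best = int(torch.argmin(errs))
        matches.append((best, float(errs[best]) < tol))
    return matches

# Synthetic stand-ins: a 7-row "merged" router that contains the
# 2-row router's rows verbatim at positions 0 and 3.
torch.manual_seed(0)
big = torch.randn(7, 8)
small = big[[0, 3]].clone()
print(rows_match(small, big))  # -> [(0, True), (3, True)]
```

If the released checkpoints were consistent, each 2x7B router row would map onto a merged-model row with near-zero error, as in this toy example.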

Ai2 org

Hello, thank you so much for bringing this to our attention! We had mistakenly uploaded the incorrect model. Both the 7x7B models have now been updated, and the non-RT model should have the correct experts mapped.

akshitab changed discussion status to closed
