Pruned MoEs (Mixtral-8x7B-Instruct-v0.1)
Part of a collection of pruned experts from Mixtral-8x7B-Instruct-v0.1, following the paper "A Provably Effective Method for Pruning Experts in Fine-tuned Sparse MoEs".
A LoRA fine-tuned version of mistralai/Mixtral-8x7B-Instruct-v0.1, targeting only the gate/router.
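For context, the gate being adapted is the small linear router in each Mixtral MoE layer: it scores the 8 experts for every token and keeps the top 2. The sketch below is a pure-Python illustration of that top-k routing step (illustrative values, not Mixtral's actual implementation):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(logits, k=2):
    """Top-k gating: keep the k highest-scoring experts and
    renormalize their softmax weights to sum to 1."""
    probs = softmax(logits)
    top = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One token's router logits over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.2]
print(route(logits))  # experts 1 and 4 win; their weights sum to 1
```

Fine-tuning only this router changes which experts get selected without touching the experts themselves, which is why `target_modules=["gate"]` keeps the trainable footprint tiny.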
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer

# Load the base model in 4-bit to fit it in memory.
quantization_config = transformers.BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1", truncation=True, padding=True, padding_side="right")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1", quantization_config=quantization_config)

# Add a pad token and resize the embeddings so the model accepts it.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

# Prepare the quantized model for k-bit (QLoRA-style) training.
model = prepare_model_for_kbit_training(model)

# LoRA applied only to the MoE gate/router projections.
config = LoraConfig(
    r=4,
    lora_alpha=4,
    target_modules=["gate"],
    lora_dropout=0.1,
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

dataset = load_dataset("Na0s/sft-ready-Text-Generation-Augmented-Data", split="train")

trainer = SFTTrainer(
    model=lora_model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        group_by_length=True,
        warmup_steps=5,
        bf16=True,
        max_steps=5000,
        learning_rate=2e-4,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        eval_strategy="no",
        do_eval=False,
        output_dir="./outputs",
        push_to_hub=True,
        remove_unused_columns=False,
    ),
)
trainer.train()
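As a sanity check on `print_trainable_parameters`, the LoRA footprint can be worked out by hand. Assuming Mixtral-8x7B's usual shapes (hidden size 4096, 8 experts per MoE layer, 32 decoder layers; verify against your checkpoint), each gate is a 4096-in, 8-out linear layer, and rank-r LoRA adds r·(in + out) parameters per adapted module:

```python
def lora_params_per_layer(in_features, out_features, r):
    # LoRA factorizes the weight update as B @ A:
    # A has shape (r, in_features), B has shape (out_features, r).
    return r * in_features + out_features * r

hidden = 4096   # assumed Mixtral hidden size
experts = 8     # experts per MoE layer
layers = 32     # assumed number of decoder layers
r = 4

per_gate = lora_params_per_layer(hidden, experts, r)
total = per_gate * layers
print(per_gate, total)  # 16416 per gate, 525312 (~0.5M) trainable parameters
```

Roughly half a million trainable parameters against a ~47B-parameter base model, which is what makes router-only fine-tuning cheap.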
Upcoming.
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
The objective of fine-tuning this MoE-based transformer is to implement the expert pruning detailed in the following paper: A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts.
Base model
mistralai/Mixtral-8x7B-v0.1