Mistral-NeMoE-12B-16E

Developed by: blascotobasco
Base model: mistralai/Mistral-Nemo-Instruct-2407
Architecture: Mixtral-style Sparse MoE (16 experts, top-4 routing)
Chat template: Mistral [INST] format
License: Inherited from Mistral-Nemo-Instruct-2407


Overview

Mistral-NeMoE-12B-16E is the first publicly released model produced via the shattering method — a dense-to-MoE structural transformation developed as part of the Nebula Structural Modification Suite. The dense FFN of Mistral-Nemo-Instruct-2407 was shattered into 16 experts using SVD-based weight carving with orthogonal router initialisation, then restored to full coherence through six sequential curriculum LoRA passes without any full fine-tuning.

Untrained, the shattered model emitted only random patterns of dots and commas with broken EOS tokens. By Phase 3, coherence was restored. The final phases focused on knowledge distillation, logical reasoning, and instruction alignment.

This model is best understood as a repaired pretrained base rather than a finished instruction-tuned model. The shattering process destroyed the original model's capabilities entirely, and the training curriculum rebuilt them from scratch under significant budget constraints — this is a student project, not a well-funded lab release. The result is a coherent, capable MoE base that responds well to instruction and character context, but has not received the depth of training a production model would. It is released primarily for the community to explore, fine-tune, and build upon. Further training will improve it substantially — the architecture is sound, the weights are coherent, and the hard work of recovery has been done.


Architecture

| Property | Value |
|---|---|
| Base model | Mistral-Nemo-Instruct-2407 |
| Parameters | ~12B (same shell as source) |
| Experts | 16 |
| Top-K routing | 4 |
| Expert FFN dim | 896 (carved from 14336) |
| Context length | 131,072 tokens |
| Vocab size | 131,072 (Tekken tokenizer) |

The model retains the parameter count and memory footprint of the original dense Mistral-Nemo-12B. The MoE structure increases representational capacity through specialisation rather than size.
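The parameter bookkeeping behind the carve can be checked directly. A minimal sketch, assuming the standard Mistral-Nemo hidden size of 5120 (not stated above) and a gated SwiGLU FFN with three projections:

```python
# Parameter bookkeeping for the dense-to-MoE carve. The hidden size of
# 5120 is an assumption (the standard Mistral-Nemo value); the FFN
# dimensions come from the table above.
hidden = 5120        # model hidden size (assumed)
dense_ffn = 14336    # dense intermediate size
expert_ffn = 896     # carved expert intermediate size
n_experts = 16

# A gated FFN (SwiGLU) has three projections: gate, up, down.
dense_params = 3 * hidden * dense_ffn
moe_params = n_experts * 3 * hidden * expert_ffn

# 16 experts of width 896 tile the original width of 14336 exactly,
# so the expert weights occupy the same footprint as the dense FFN.
assert expert_ffn * n_experts == dense_ffn
assert moe_params == dense_params
```

This is why the MoE variant fits in the same "shell": the experts partition the dense FFN width rather than add to it (router parameters, not shown, are a small extra).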


Training Curriculum

Six sequential LoRA passes were applied to restore and specialise the shattered model:

| Phase | Dataset | Purpose |
|---|---|---|
| 1 | General knowledge | Language repair, basic coherence |
| 2 | OpenHermes-2.5 | Structure restoration, instruction format |
| 3 | FineTome-100k | Factual grounding |
| 4 | MetaMathQA + Stheno + OpenHermes | Reasoning and roleplay register |
| 5 | (discarded — overfitting) | — |
| 6 | OpenHermes + MetaMathQA + Platypus + Magpie-Pro | Full recalibration, neutral default register |

All passes used LoRA with ranks between r=16 and r=512. No full fine-tuning was performed at any stage.
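The merge arithmetic behind sequential LoRA passes can be sketched in a few lines (shapes, scaling, and the choice of three passes here are illustrative, not the actual training configuration):

```python
import numpy as np

# Sketch of sequential LoRA passes: each phase learns a low-rank delta
# B @ A that is merged into the frozen base weight before the next
# phase begins. No full fine-tune ever touches W directly.
rng = np.random.default_rng(0)
d_out, d_in = 64, 64
W = rng.standard_normal((d_out, d_in))

for rank in (16, 64, 512):           # curriculum ranks ranged r=16..r=512
    A = rng.standard_normal((rank, d_in)) * 0.01   # stand-in for a trained adapter
    B = rng.standard_normal((d_out, rank)) * 0.01
    W = W + B @ A                    # merge this pass's delta

# W now carries all three deltas; its shape never changes.
```

Because each merged delta is low-rank, a pass can reshape behaviour substantially while leaving most of the weight structure from earlier phases intact, which is what makes a staged curriculum like the one above workable.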


Behavioral Profile

The model excels at:

  • Rich atmospheric prose — responds well to detailed character context
  • High-fidelity roleplay — adopts personas naturally from system prompt descriptions
  • Instruction following — neutral default register with creative voice available on demand
  • Long context — 128k token context inherited from Mistral-Nemo base

Known limitations:

  • Advanced mathematics — base Mistral-Nemo ceiling, complex arithmetic may be unreliable
  • Highly specific factual recall — may hallucinate on niche topics not well represented in training
  • Pretrained base state — benefits significantly from additional fine-tuning for specific use cases

Recommended Sampling Settings

The model responds differently across temperature ranges:

| Use case | Temperature | Min-P | Repeat Penalty |
|---|---|---|---|
| Factual / instruction | 0.3 – 0.5 | 0.1 | 1.15 |
| General roleplay | 0.5 – 0.7 | 0.1 | 1.15 – 1.18 |
| Creative / atmospheric | 0.7 – 0.9 | 0.1 | 1.18 |

Min-P sampling is strongly recommended over Top-P for this model. At high temperatures without Min-P the model may drift toward the training data's most dominant creative register.
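Min-P filtering itself is simple: keep only tokens whose probability is at least `min_p` times the probability of the most likely token, then renormalise. A minimal sketch using the recommended `min_p = 0.1`:

```python
import numpy as np

# Min-P filtering: the cutoff scales with the top token's probability,
# so a confident distribution keeps few candidates and a flat one
# keeps many. min_p = 0.1 matches the table above.
def min_p_filter(probs, min_p=0.1):
    probs = np.asarray(probs, dtype=float)
    keep = probs >= min_p * probs.max()      # threshold relative to the top token
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()         # renormalise survivors

p = min_p_filter([0.5, 0.3, 0.15, 0.04, 0.01])
# tokens below 0.1 * 0.5 = 0.05 are zeroed out
```

Unlike Top-P, the cutoff adapts to the model's confidence, which is why it holds up better at the high temperatures recommended for creative use.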


Chat Template

This model uses Mistral [INST] format, not ChatML. Using the wrong template will produce incoherent or looping output.

Correct format:

[INST]You are a helpful assistant.

What is the capital of France?[/INST]

With system prompt (folded into first turn):

[INST]You are a gruff Nord blacksmith. Speak in short direct sentences.

Do you have any swords for sale?[/INST]

In llama.cpp:

llama-cli -m model.gguf --ngl 99 -cnv --chat-template mistral --temp 0.5
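If you are building prompts by hand, the folding convention shown above (system text and first user message joined by a blank line inside one `[INST]` block) can be sketched as follows; the exact whitespace is an assumption, so match your inference stack's Mistral template if it differs:

```python
# Fold the system prompt into the first [INST] turn, as Mistral-format
# templates conventionally do. Whitespace handling is an assumption here.
def build_prompt(system: str, user: str) -> str:
    if system:
        return f"[INST]{system}\n\n{user}[/INST]"
    return f"[INST]{user}[/INST]"

prompt = build_prompt(
    "You are a gruff Nord blacksmith. Speak in short direct sentences.",
    "Do you have any swords for sale?",
)
```

The model's reply then follows the closing `[/INST]`; subsequent user turns each get their own `[INST]…[/INST]` block without repeating the system text.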

Upcycling

Using the Nebula Suite, this model can be upcycled to any number of experts to increase capacity. Untrained upcycled variants behave identically to this model until fine-tuned — the orthogonal router initialisation ensures all experts start with unique gradient signals rather than dead routing.
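The role of orthogonal router initialisation can be illustrated with a small sketch (dimensions are illustrative; the actual Nebula initialisation is not published in detail):

```python
import numpy as np

# Sketch of orthogonal router initialisation: give each expert a router
# row drawn from an orthonormal basis, so no two experts see identical
# routing scores (and hence identical gradients) at step one.
hidden, n_experts = 128, 16
rng = np.random.default_rng(0)

# QR of a random matrix yields orthonormal columns; transpose to rows.
Q, _ = np.linalg.qr(rng.standard_normal((hidden, n_experts)))
router = Q.T                       # (n_experts, hidden), orthonormal rows

# Mutual orthogonality: router @ router.T is the identity.
gram = router @ router.T
```

With identical (or zero) router rows, top-K selection would collapse onto the same few experts from the start; orthogonal rows break that symmetry, which is what lets an untrained upcycle train cleanly later.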

Planned releases:

  • Mistral-NeMoE-12B-32E-Untrained — 32 expert variant
  • Mistral-NeMoE-12B-64E-Untrained — 64 expert variant

Community fine-tuning of untrained upcycles is encouraged.


Nebula Suite

This model was produced entirely using the Nebula Structural Modification Suite, a self-designed framework for extensive structural modification of language models. Tools used in this model's production include:

  • Carver — SVD-based FFN shattering with orthogonal router initialisation
  • Upcycler — zero-memory expert count scaling
  • Relay — router recalibration for pruned models

The suite enables dense-to-MoE conversion, expert count scaling, and router recalibration without gradient-based training.
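The Carver procedure itself is not published in detail; the following is only a minimal sketch of the general idea behind SVD-based shattering, with illustrative dimensions: factor the dense weight and hand each expert a disjoint slice of the singular-value spectrum.

```python
import numpy as np

# Illustrative SVD shattering: split a dense FFN weight into expert
# pieces by partitioning its singular components. This is a sketch of
# the general technique, not the actual Carver implementation.
rng = np.random.default_rng(0)
hidden, dense_ffn, n_experts = 64, 128, 4
W = rng.standard_normal((dense_ffn, hidden))

U, S, Vt = np.linalg.svd(W, full_matrices=False)   # rank = hidden here
per_expert = len(S) // n_experts                   # components per expert

experts = []
for e in range(n_experts):
    sl = slice(e * per_expert, (e + 1) * per_expert)
    experts.append((U[:, sl] * S[sl]) @ Vt[sl])    # this expert's slice of W

# The expert slices are a partition: they sum back to the dense weight.
```

Each expert starts as a genuinely different low-rank piece of the original transformation, which is why the shattered model needs the recovery curriculum described above: no single expert carries the full dense behaviour.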


Intended Use

This model is suitable for general creative writing, atmospheric roleplay, and as a fine-tuning base for persona-specific applications. It is designed to receive rich character context at inference time and responds well to detailed system prompts.

It is particularly well suited as a starting point for researchers or hobbyists interested in MoE fine-tuning, given the recovered architecture and coherent pretrained weights.


Limitations and Responsible Use

  • This model is uncensored and will follow character instructions including dark or mature themes when given appropriate system prompts
  • Users are responsible for appropriate system prompt design
  • The model may produce factually incorrect information, especially on mathematical or highly technical topics
  • Not recommended for factual research or professional advice without verification