Mistral-NeMoE-12B-16E

Developed by: blascotobasco
Base model: mistralai/Mistral-Nemo-Instruct-2407
Architecture: Mixtral-style Sparse MoE (16 experts, top-4 routing)
Chat template: Mistral [INST] format
License: Inherited from Mistral-Nemo-Instruct-2407


Overview

Mistral-NeMoE-12B-16E is the first publicly released model produced via the shattering method — a dense-to-MoE structural transformation developed as part of the Nebula Structural Modification Suite. The dense FFN of Mistral-Nemo-Instruct-2407 was shattered into 16 experts using SVD-based weight carving with orthogonal router initialisation, then restored to full coherence through six sequential curriculum LoRA passes without any full fine-tuning.

Untrained, the shattered model emitted only random patterns of dots and commas with broken EOS tokens. By Phase 3, coherence was restored. The final phases focused on knowledge distillation, logical reasoning, and instruction alignment.

This model is best understood as a repaired pretrained base rather than a finished instruction-tuned model. The shattering process destroyed the original model's capabilities entirely, and the training curriculum rebuilt them from scratch under significant budget constraints — this is a student project, not a well-funded lab release. The result is a coherent, capable MoE base that responds well to instruction and character context, but has not received the depth of training a production model would. It is released primarily for the community to explore, fine-tune, and build upon. Further training will improve it substantially — the architecture is sound, the weights are coherent, and the hard work of recovery has been done.


Architecture

| Property | Value |
|---|---|
| Base model | Mistral-Nemo-Instruct-2407 |
| Parameters | ~12B (same shell as source) |
| Experts | 16 |
| Top-K routing | 4 |
| Expert FFN dim | 896 (carved from 14336) |
| Context length | 131,072 tokens |
| Vocab size | 131,072 (Tekken tokenizer) |

The model retains the parameter count and memory footprint of the original dense Mistral-Nemo-12B. The MoE structure increases representational capacity through specialisation rather than size.
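The parameter bookkeeping behind the carve can be checked directly. A minimal sketch, assuming the standard Mistral-Nemo hidden size of 5120 (not stated above) and a gated SwiGLU FFN with three projections:

```python
# Parameter bookkeeping for the dense-to-MoE carve. The hidden size of
# 5120 is an assumption (the standard Mistral-Nemo value); the FFN
# dimensions come from the table above.
hidden = 5120        # model hidden size (assumed)
dense_ffn = 14336    # dense intermediate size
expert_ffn = 896     # carved expert intermediate size
n_experts = 16

# A gated FFN (SwiGLU) has three projections: gate, up, down.
dense_params = 3 * hidden * dense_ffn
moe_params = n_experts * 3 * hidden * expert_ffn

# 16 experts of width 896 tile the original width of 14336 exactly,
# so the expert weights occupy the same footprint as the dense FFN.
assert expert_ffn * n_experts == dense_ffn
assert moe_params == dense_params
```

This is why the MoE variant fits in the same "shell": the experts partition the dense FFN width rather than add to it (router parameters, not shown, are a small extra).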


Training Curriculum

Six sequential LoRA passes were applied to restore and specialise the shattered model:

| Phase | Dataset | Purpose |
|---|---|---|
| 1 | General knowledge | Language repair, basic coherence |
| 2 | OpenHermes-2.5 | Structure restoration, instruction format |
| 3 | FineTome-100k | Factual grounding |
| 4 | MetaMathQA + Stheno + OpenHermes | Reasoning and roleplay register |
| 5 | (discarded — overfitting) | — |
| 6 | OpenHermes + MetaMathQA + Platypus + Magpie-Pro | Full recalibration, neutral default register |

All passes used LoRA with ranks between r=16 and r=512. No full fine-tuning was performed at any stage.
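The merge arithmetic behind sequential LoRA passes can be sketched in a few lines (shapes, scaling, and the choice of three passes here are illustrative, not the actual training configuration):

```python
import numpy as np

# Sketch of sequential LoRA passes: each phase learns a low-rank delta
# B @ A that is merged into the frozen base weight before the next
# phase begins. No full fine-tune ever touches W directly.
rng = np.random.default_rng(0)
d_out, d_in = 64, 64
W = rng.standard_normal((d_out, d_in))

for rank in (16, 64, 512):           # curriculum ranks ranged r=16..r=512
    A = rng.standard_normal((rank, d_in)) * 0.01   # stand-in for a trained adapter
    B = rng.standard_normal((d_out, rank)) * 0.01
    W = W + B @ A                    # merge this pass's delta

# W now carries all three deltas; its shape never changes.
```

Because each merged delta is low-rank, a pass can reshape behaviour substantially while leaving most of the weight structure from earlier phases intact, which is what makes a staged curriculum like the one above workable.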


Behavioral Profile

The model excels at:

  • Rich atmospheric prose — responds well to detailed character context
  • High-fidelity roleplay — adopts personas naturally from system prompt descriptions
  • Instruction following — neutral default register with creative voice available on demand
  • Long context — 128k token context inherited from Mistral-Nemo base

Known limitations:

  • Advanced mathematics — base Mistral-Nemo ceiling, complex arithmetic may be unreliable
  • Highly specific factual recall — may hallucinate on niche topics not well represented in training
  • Pretrained base state — benefits significantly from additional fine-tuning for specific use cases

Recommended Sampling Settings

The model responds differently across temperature ranges:

| Use case | Temperature | Min-P | Repeat Penalty |
|---|---|---|---|
| Factual / instruction | 0.3 – 0.5 | 0.1 | 1.15 |
| General roleplay | 0.5 – 0.7 | 0.1 | 1.15 – 1.18 |
| Creative / atmospheric | 0.7 – 0.9 | 0.1 | 1.18 |

Min-P sampling is strongly recommended over Top-P for this model. At high temperatures without Min-P the model may drift toward the training data's most dominant creative register.
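Min-P filtering itself is simple: keep only tokens whose probability is at least `min_p` times the probability of the most likely token, then renormalise. A minimal sketch using the recommended `min_p = 0.1`:

```python
import numpy as np

# Min-P filtering: the cutoff scales with the top token's probability,
# so a confident distribution keeps few candidates and a flat one
# keeps many. min_p = 0.1 matches the table above.
def min_p_filter(probs, min_p=0.1):
    probs = np.asarray(probs, dtype=float)
    keep = probs >= min_p * probs.max()      # threshold relative to the top token
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()         # renormalise survivors

p = min_p_filter([0.5, 0.3, 0.15, 0.04, 0.01])
# tokens below 0.1 * 0.5 = 0.05 are zeroed out
```

Unlike Top-P, the cutoff adapts to the model's confidence, which is why it holds up better at the high temperatures recommended for creative use.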


Chat Template

This model uses Mistral [INST] format, not ChatML. Using the wrong template will produce incoherent or looping output.

Correct format:

[INST]You are a helpful assistant.

What is the capital of France?[/INST]

With system prompt (folded into first turn):

[INST]You are a gruff Nord blacksmith. Speak in short direct sentences.

Do you have any swords for sale?[/INST]

In llama.cpp:

llama-cli -m model.gguf --ngl 99 -cnv --chat-template mistral --temp 0.5
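If you are building prompts by hand, the folding convention shown above (system text and first user message joined by a blank line inside one `[INST]` block) can be sketched as follows; the exact whitespace is an assumption, so match your inference stack's Mistral template if it differs:

```python
# Fold the system prompt into the first [INST] turn, as Mistral-format
# templates conventionally do. Whitespace handling is an assumption here.
def build_prompt(system: str, user: str) -> str:
    if system:
        return f"[INST]{system}\n\n{user}[/INST]"
    return f"[INST]{user}[/INST]"

prompt = build_prompt(
    "You are a gruff Nord blacksmith. Speak in short direct sentences.",
    "Do you have any swords for sale?",
)
```

The model's reply then follows the closing `[/INST]`; subsequent user turns each get their own `[INST]…[/INST]` block without repeating the system text.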

Upcycling

Using the Nebula Suite, this model can be upcycled to any number of experts to increase capacity. Untrained upcycled variants behave identically to this model until fine-tuned — the orthogonal router initialisation ensures all experts start with unique gradient signals rather than dead routing.
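The role of orthogonal router initialisation can be illustrated with a small sketch (dimensions are illustrative; the actual Nebula initialisation is not published in detail):

```python
import numpy as np

# Sketch of orthogonal router initialisation: give each expert a router
# row drawn from an orthonormal basis, so no two experts see identical
# routing scores (and hence identical gradients) at step one.
hidden, n_experts = 128, 16
rng = np.random.default_rng(0)

# QR of a random matrix yields orthonormal columns; transpose to rows.
Q, _ = np.linalg.qr(rng.standard_normal((hidden, n_experts)))
router = Q.T                       # (n_experts, hidden), orthonormal rows

# Mutual orthogonality: router @ router.T is the identity.
gram = router @ router.T
```

With identical (or zero) router rows, top-K selection would collapse onto the same few experts from the start; orthogonal rows break that symmetry, which is what lets an untrained upcycle train cleanly later.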

Planned releases:

  • Mistral-NeMoE-12B-32E-Untrained — 32 expert variant
  • Mistral-NeMoE-12B-64E-Untrained — 64 expert variant

Community fine-tuning of untrained upcycles is encouraged.


Nebula Suite

This model was produced entirely using the Nebula Structural Modification Suite, a self-designed framework for extensive structural modification of language models. Tools used in this model's production include:

  • Carver — SVD-based FFN shattering with orthogonal router initialisation
  • Upcycler — zero-memory expert count scaling
  • Relay — router recalibration for pruned models

The suite enables dense-to-MoE conversion, expert count scaling, and router recalibration without gradient-based training.
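The Carver procedure itself is not published in detail; the following is only a minimal sketch of the general idea behind SVD-based shattering, with illustrative dimensions: factor the dense weight and hand each expert a disjoint slice of the singular-value spectrum.

```python
import numpy as np

# Illustrative SVD shattering: split a dense FFN weight into expert
# pieces by partitioning its singular components. This is a sketch of
# the general technique, not the actual Carver implementation.
rng = np.random.default_rng(0)
hidden, dense_ffn, n_experts = 64, 128, 4
W = rng.standard_normal((dense_ffn, hidden))

U, S, Vt = np.linalg.svd(W, full_matrices=False)   # rank = hidden here
per_expert = len(S) // n_experts                   # components per expert

experts = []
for e in range(n_experts):
    sl = slice(e * per_expert, (e + 1) * per_expert)
    experts.append((U[:, sl] * S[sl]) @ Vt[sl])    # this expert's slice of W

# The expert slices are a partition: they sum back to the dense weight.
```

Each expert starts as a genuinely different low-rank piece of the original transformation, which is why the shattered model needs the recovery curriculum described above: no single expert carries the full dense behaviour.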


Intended Use

This model is suitable for general creative writing, atmospheric roleplay, and as a fine-tuning base for persona-specific applications. It is designed to receive rich character context at inference time and responds well to detailed system prompts.

It is particularly well suited as a starting point for researchers or hobbyists interested in MoE fine-tuning, given the recovered architecture and coherent pretrained weights.


Limitations and Responsible Use

  • This model is uncensored and will follow character instructions including dark or mature themes when given appropriate system prompts
  • Users are responsible for appropriate system prompt design
  • The model may produce factually incorrect information, especially on mathematical or highly technical topics
  • Not recommended for factual research or professional advice without verification