# Mistral-NeMoE-12B-16E

- **Developed by:** blascotobasco
- **Base model:** mistralai/Mistral-Nemo-Instruct-2407
- **Architecture:** Mixtral-style sparse MoE (16 experts, top-4 routing)
- **Chat template:** Mistral [INST] format
- **License:** inherited from Mistral-Nemo-Instruct-2407
## Overview
Mistral-NeMoE-12B-16E is the first publicly released model produced via the shattering method, a dense-to-MoE structural transformation developed as part of the Nebula Structural Modification Suite. The dense FFN of Mistral-Nemo-Instruct-2407 was shattered into 16 experts using SVD-based weight carving with orthogonal router initialisation, then restored to full coherence through six sequential curriculum LoRA passes without any full fine-tuning.
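The Carver implementation is not published, so the sketch below is an assumption for illustration only: `carve_experts` is a hypothetical helper showing one way SVD-based carving could work, with each expert receiving one band of singular components of the dense FFN weight so that the experts jointly cover its dominant directions.

```python
import numpy as np

def carve_experts(w_dense, n_experts):
    """Carve a dense FFN weight into per-expert slices via SVD.

    Each expert receives one band of singular components; the
    singular values are folded into the right singular vectors.
    Illustrative sketch only; the actual Carver tool may differ.
    """
    u, s, vt = np.linalg.svd(w_dense, full_matrices=False)
    expert_dim = len(s) // n_experts
    experts = []
    for e in range(n_experts):
        lo, hi = e * expert_dim, (e + 1) * expert_dim
        # (expert_dim, hidden) slice from this band of singular directions
        experts.append(s[lo:hi, None] * vt[lo:hi, :])
    return experts

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 64))   # toy dense FFN weight
slices = carve_experts(w, n_experts=16)
print(len(slices), slices[0].shape)  # 16 experts, each (4, 64)
```

The toy dimensions here are far smaller than the real model's (where 14336 is carved into 16 slices of 896); the banding idea is the same.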
Untrained, the model output random patterns of dots and commas with broken EOS tokens. By Phase 3, coherence was restored. The final phases focused on knowledge distillation, logical reasoning, and instruction alignment.
This model is best understood as a repaired pretrained base rather than a finished instruction-tuned model. The shattering process destroyed the original model's capabilities entirely, and the training curriculum rebuilt them from scratch under significant budget constraints; this is a student project, not a well-funded lab release. The result is a coherent, capable MoE base that responds well to instruction and character context, but has not received the depth of training a production model would. It is released primarily for the community to explore, fine-tune, and build upon. Further training will improve it substantially: the architecture is sound, the weights are coherent, and the hard work of recovery has been done.
## Architecture
| Property | Value |
|---|---|
| Base model | Mistral-Nemo-Instruct-2407 |
| Parameters | ~12B (same shell as source) |
| Experts | 16 |
| Top-K routing | 4 |
| Expert FFN dim | 896 (carved from 14336) |
| Context length | 131,072 tokens |
| Vocab size | 131,072 (Tekken tokenizer) |
The model occupies the same parameter count and memory footprint as the original dense Mistral-Nemo-12B. The MoE structure increases representational capacity through specialisation rather than size.
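Top-K routing as described above can be sketched for a single token. This is a toy NumPy version of Mixtral-style gating, not the model's actual code; `moe_forward` and the toy linear experts are assumptions for illustration.

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=4):
    """Sparse MoE FFN for one token (illustrative sketch).

    x: (hidden,) token activation; router_w: (n_experts, hidden);
    experts: list of callables mapping (hidden,) -> (hidden,).
    Only the top_k highest-scoring experts run, weighted by a
    softmax over their router logits.
    """
    logits = router_w @ x                      # (n_experts,)
    top = np.argsort(logits)[-top_k:]          # indices of the chosen experts
    gate = np.exp(logits[top] - logits[top].max())
    gate = gate / gate.sum()                   # softmax over the selected logits
    return sum(g * experts[i](x) for g, i in zip(gate, top))

rng = np.random.default_rng(1)
hidden, n_experts = 8, 16
router_w = rng.standard_normal((n_experts, hidden))
# toy experts: one random linear map per expert
weights = rng.standard_normal((n_experts, hidden, hidden))
experts = [(lambda v, w=w: w @ v) for w in weights]
y = moe_forward(rng.standard_normal(hidden), router_w, experts)
print(y.shape)  # (8,)
```

Because only top_k of the 16 experts run per token, compute per token stays well below a dense FFN of the same total parameter count.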
## Training Curriculum
Six sequential LoRA passes were applied to restore and specialise the shattered model:
| Phase | Dataset | Purpose |
|---|---|---|
| 1 | General knowledge | Language repair, basic coherence |
| 2 | OpenHermes-2.5 | Structure restoration, instruction format |
| 3 | FineTome-100k | Factual grounding |
| 4 | MetaMathQA + Stheno + OpenHermes | Reasoning and roleplay register |
| 5 | (discarded due to overfitting) | – |
| 6 | OpenHermes + MetaMathQA + Platypus + Magpie-Pro | Full recalibration, neutral default register |
All passes used LoRA with ranks between r=16 and r=512. No full fine-tuning was performed at any stage.
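Each pass trains only a low-rank adapter on top of frozen weights. The arithmetic behind merging such an adapter is the standard LoRA update; the sketch below illustrates the math only, not this model's training pipeline, and `lora_merge` is a hypothetical helper.

```python
import numpy as np

def lora_merge(w, a, b, alpha):
    """Merge a LoRA adapter into a frozen weight: W' = W + (alpha / r) * B @ A.

    w: (out, in) frozen base weight; a: (r, in); b: (out, r).
    r is the adapter rank (r=16 to r=512 across this model's passes).
    """
    r = a.shape[0]
    return w + (alpha / r) * (b @ a)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))
a = rng.standard_normal((16, 32))   # rank-16 adapter, the lowest rank used
b = np.zeros((64, 16))              # B is conventionally zero-initialised
merged = lora_merge(w, a, b, alpha=32)
print(np.allclose(merged, w))       # True: zero B means the delta starts at zero
```

Zero-initialising B is the standard trick that lets training start exactly from the base model's behaviour.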
## Behavioral Profile
The model excels at:
- Rich atmospheric prose: responds well to detailed character context
- High-fidelity roleplay: adopts personas naturally from system prompt descriptions
- Instruction following: neutral default register with creative voice available on demand
- Long context: 128k token context inherited from the Mistral-Nemo base
Known limitations:
- Advanced mathematics: inherits the base Mistral-Nemo ceiling; complex arithmetic may be unreliable
- Highly specific factual recall: may hallucinate on niche topics not well represented in training
- Pretrained base state: benefits significantly from additional fine-tuning for specific use cases
## Recommended Sampling Settings
The model responds differently across temperature ranges:
| Use case | Temperature | Min-P | Repeat Penalty |
|---|---|---|---|
| Factual / instruction | 0.3–0.5 | 0.1 | 1.15 |
| General roleplay | 0.5–0.7 | 0.1 | 1.15–1.18 |
| Creative / atmospheric | 0.7–0.9 | 0.1 | 1.18 |
Min-P sampling is strongly recommended over Top-P for this model. At high temperatures without Min-P the model may drift toward the training data's most dominant creative register.
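Min-P keeps only tokens whose probability is at least `min_p` times that of the most likely token, then renormalises, so the cutoff adapts to how confident the model is. A minimal sketch of the common formulation (not this model's runtime code; `min_p_filter` is hypothetical):

```python
import numpy as np

def min_p_filter(probs, min_p=0.1):
    """Zero out tokens below min_p * max(probs), then renormalise."""
    mask = probs >= min_p * probs.max()
    filtered = np.where(mask, probs, 0.0)
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
out = min_p_filter(probs, min_p=0.1)
# cutoff is 0.05 (= 0.1 * 0.5): the last two tokens are dropped,
# the surviving three are renormalised to sum to 1
print(out)
```

When the top token is very likely the cutoff is strict; when the distribution is flat the cutoff relaxes, which is why Min-P degrades more gracefully at high temperature than a fixed Top-P threshold.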
## Chat Template
This model uses Mistral [INST] format, not ChatML. Using the wrong template will produce incoherent or looping output.
Correct format:

```
[INST]You are a helpful assistant.
What is the capital of France?[/INST]
```
With system prompt (folded into first turn):

```
[INST]You are a gruff Nord blacksmith. Speak in short direct sentences.
Do you have any swords for sale?[/INST]
```
In llama.cpp:

```shell
llama-cli -m model.gguf --ngl 99 -cnv --chat-template mistral --temp 0.5
```
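For raw-completion APIs that do not apply a chat template for you, the folded format shown above can be built with a small helper. `mistral_prompt` is a hypothetical function matching the examples in this section; chat-aware runtimes (like llama.cpp with `--chat-template mistral`) do this automatically.

```python
def mistral_prompt(system, user):
    """Build a Mistral [INST] prompt with the system prompt folded
    into the first user turn, as shown in the examples above."""
    return f"[INST]{system}\n{user}[/INST]"

p = mistral_prompt("You are a helpful assistant.",
                   "What is the capital of France?")
print(p)
```

Sending a ChatML-formatted prompt instead of this format is the most common cause of the incoherent or looping output mentioned above.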
## Upcycling
Using the Nebula Suite, this model can be upcycled to any number of experts to increase capacity. Untrained upcycled variants behave identically to this model until fine-tuned β the orthogonal router initialisation ensures all experts start with unique gradient signals rather than dead routing.
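The Upcycler's exact initialisation scheme is not published. One standard way to obtain mutually orthogonal router rows, so that no two experts start with identical gradients, is a QR decomposition; the sketch below assumes that approach, and `orthogonal_router_init` is a hypothetical helper.

```python
import numpy as np

def orthogonal_router_init(n_experts, hidden, seed=0):
    """Initialise router weights with orthonormal rows so each expert
    sees a distinct, non-degenerate gradient signal from the start.
    Requires n_experts <= hidden. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    # QR of a random (hidden, n_experts) matrix gives orthonormal columns
    q, _ = np.linalg.qr(rng.standard_normal((hidden, n_experts)))
    return q.T                               # (n_experts, hidden)

w = orthogonal_router_init(16, 64)
# rows are mutually orthogonal unit vectors
print(np.allclose(w @ w.T, np.eye(16)))  # True
```

With identical (or zero) router rows, every expert would receive the same routing signal and the MoE would collapse toward a single effective expert; orthogonal rows avoid that dead-routing failure mode.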
Planned releases:
- Mistral-NeMoE-12B-32E-Untrained (32-expert variant)
- Mistral-NeMoE-12B-64E-Untrained (64-expert variant)
Community fine-tuning of untrained upcycles is encouraged.
## Nebula Suite
This model was produced entirely using the Nebula Structural Modification Suite, a self-designed framework for extensive structural modification of language models. Tools used in this model's production include:
- Carver: SVD-based FFN shattering with orthogonal router initialisation
- Upcycler: zero-memory expert count scaling
- Relay: router recalibration for pruned models
The suite enables dense-to-MoE conversion, expert count scaling, and router recalibration without gradient-based training.
## Intended Use
This model is suitable for general creative writing, atmospheric roleplay, and as a fine-tuning base for persona-specific applications. It is designed to receive rich character context at inference time and responds well to detailed system prompts.
It is particularly well suited as a starting point for researchers or hobbyists interested in MoE fine-tuning, given the recovered architecture and coherent pretrained weights.
## Limitations and Responsible Use
- This model is uncensored and will follow character instructions including dark or mature themes when given appropriate system prompts
- Users are responsible for appropriate system prompt design
- The model may produce factually incorrect information, especially on mathematical or highly technical topics
- Not recommended for factual research or professional advice without verification