Muse-12B-NVFP4-FP8

Quantized weights of the Muse-12B model for use with NVIDIA Blackwell GPUs, in a hybrid format that uses NVFP4 with Four Over Six adaptive block scaling for the MLP layers and FP8_DYNAMIC for the self-attention layers. More information about the hybrid format is available here, but the short version is that keeping attention in FP8 has minimal impact on speed and VRAM usage while markedly improving output quality, especially at longer context lengths.
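As a rough illustration of what "adaptive block scaling" means, here is a simplified sketch of the Four Over Six idea as I understand it: each block is quantized twice, once scaled so its largest magnitude lands on 6 (the top of the FP4 range) and once so it lands on 4, and whichever version has the lower squared error is kept. This is not the code used to produce this checkpoint. It ignores the fact that real NVFP4 block scales are themselves stored in a low-precision format, and the names quantize_block and four_over_six are made up for the example.

```python
import math

# The eight non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max,
    round every value to the nearest FP4 magnitude, then dequantize.
    Returns (dequantized values, summed squared error)."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return list(block), 0.0
    scale = absmax / target_max  # kept in full precision here (simplification)
    out = []
    for x in block:
        scaled = abs(x) / scale
        mag = min(FP4_GRID, key=lambda g: abs(scaled - g))
        out.append(math.copysign(mag * scale, x))
    err = sum((a - b) ** 2 for a, b in zip(block, out))
    return out, err


def four_over_six(block):
    """Try both scaling targets and keep the one with the lower error.
    Returns (chosen target, dequantized values, summed squared error)."""
    q6, e6 = quantize_block(block, 6.0)
    q4, e4 = quantize_block(block, 4.0)
    if e4 < e6:
        return 4.0, q4, e4
    return 6.0, q6, e6
```

For example, a block made of one 1.0 and fifteen 0.75s is reproduced exactly when scaled to 4, because 0.75 then lands on the grid value 3, while scaling to 6 puts 0.75 halfway between 4 and 6 and forces a rounding error.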

Inference

Tested on an RTX 5060 Ti 16GB with Aphrodite Engine and vLLM. The checkpoint requires compressed-tensors 0.14.0 or later, so you'll have to update the version in your venv if you use Aphrodite Engine or an older version of vLLM. On my system, Aphrodite Engine was able to run the checkpoint with a 32k context window using the --single-user-mode flag, while vLLM didn't have quite enough VRAM to do the same. vLLM works fine at shorter context lengths or with the KV cache quantized, however.
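If you're not sure which compressed-tensors version your venv has, a small check like the one below can tell you before you try to load the checkpoint. It is only a sketch: the helper names are invented for this example, and the version parsing is deliberately crude (it looks at the first three numeric components only, so a pre-release such as 0.14.0rc1 is treated the same as 0.14.0).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum version stated above for loading this checkpoint.
MIN_COMPRESSED_TENSORS = (0, 14, 0)


def parse_version(text):
    """Turn a version string such as '0.14.1.dev3' into (0, 14, 1).
    Only the leading digits of the first three components are used."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def compressed_tensors_ok(installed=None):
    """Return True if the given (or currently installed) compressed-tensors
    version is at least 0.14.0. Returns False if it is not installed."""
    if installed is None:
        try:
            installed = version("compressed-tensors")
        except PackageNotFoundError:
            return False
    return parse_version(installed) >= MIN_COMPRESSED_TENSORS
```

Calling compressed_tensors_ok() with no arguments inspects the active environment. If it returns False, upgrading inside the venv with pip install -U "compressed-tensors>=0.14.0" should be enough.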

Recommended generation settings (a mix of what it says on the Muse-12B model card and the AI Dungeon Model Guide):

Temperature: 1.0
Top K: 250
Top P: 1
Min P: 0.025
Repetition Penalty: 1.05
Presence Penalty: 0.25
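
For anyone calling the model through an OpenAI-compatible completions endpoint, the settings above can be bundled into a request body along these lines. This is a sketch under a few assumptions: that your server accepts top_k, min_p and repetition_penalty as extra request fields (vLLM's server does, but check the docs for whatever you run), that the model is served under the name shown, and that 256 is a sensible max_tokens for your use. The helper name is made up for the example, and actually sending the request is left out.

```python
# Recommended sampler settings from the list above.
RECOMMENDED_SAMPLING = {
    "temperature": 1.0,
    "top_k": 250,
    "top_p": 1.0,
    "min_p": 0.025,
    "repetition_penalty": 1.05,
    "presence_penalty": 0.25,
}


def build_completion_request(prompt, max_tokens=256,
                             model="DataSnake/Muse-12B-NVFP4-FP8"):
    """Build a JSON-serializable body for a /v1/completions style request.
    The model name and max_tokens default are assumptions; adjust to your setup."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    payload.update(RECOMMENDED_SAMPLING)
    return payload
```

The resulting dict can be serialized with json.dumps and posted with whichever HTTP client you prefer.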

If your inference engine supports DRY and XTC (at the time of writing, Aphrodite Engine supports both and vLLM supports neither), you can also try using them to cut down on repetition if necessary.

Prompt Format

The calibration data was formatted with the same ChatML tags that were used to finetune Latitude's 12B models:

<|im_start|>system
You're a masterful storyteller and gamemaster. Write in second person present tense (You are), crafting vivid, engaging narratives with authority and confidence.<|im_end|>
<|im_start|>user
> You peer into the darkness.<|im_end|>
<|im_start|>assistant
You have been eaten by a grue.<|im_end|>

As such, I would recommend using that format for inference.
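
If you're assembling prompts by hand rather than relying on a chat template, something like the following produces the layout shown above. It's a minimal sketch: build_chatml_prompt is a name made up for this example, and ending the prompt with an open assistant tag so the model writes the next turn is my assumption about how you'd want to use it.

```python
def build_chatml_prompt(system, turns):
    """Assemble a ChatML prompt string.

    system: the system message text.
    turns:  a list of (role, text) pairs, with role "user" or "assistant".
    The result ends with an open assistant tag so the model continues from there.
    """
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for role, text in turns:
        parts.append(f"<|im_start|>{role}\n{text}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)
```

Passing the system message from the example along with [("user", "> You peer into the darkness.")] gives a prompt that stops right where the model's reply should begin.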

Credits

Muse-12B was made by Latitude Games with help from Gryphe Padar.

Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han.

Model tree for DataSnake/Muse-12B-NVFP4-FP8

Base model: mistralai/Mistral-Nemo-Base-2407
Finetuned: LatitudeGames/Muse-12B
Quantized: this model

Paper for DataSnake/Muse-12B-NVFP4-FP8

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling (paper 2512.02010, published Dec 1, 2025)