Mistral-Nemo-Instruct-2407-NVFP4-FP8

A version of Mistral-Nemo-Instruct-2407 created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The goal was to improve accuracy compared to regular NVFP4 with minimal impact on speed and VRAM usage, and specifically to remain small enough to support a 32k-token context window in Aphrodite Engine on an RTX 5060 Ti 16GB.

Quantization Format

The quant uses a mixed-precision approach, with different types of layers undergoing different amounts of quantization.

  • Self-attention layers: FP8_DYNAMIC, a format that uses FP8 weights and activations, with per-channel BF16 scales for the weights and dynamic per-token scales for the activations. Self-attention tensors make up slightly less than 20 percent of the weights in the linear layers, so upgrading them from NVFP4 to FP8 has a very limited effect on model size and VRAM use. Moreover, the nature of attention means that as context length increases, the benefits of doing attention calculations in w8a8 rather than w4a4 become more pronounced.
  • MLP layers: NVFP4 with Four Over Six adaptive block scaling for the weights. As this is purely a matter of how weights are selected rather than a change in format, it increases accuracy compared to regular NVFP4 without impacting performance or requiring any changes at runtime.
  • lm_head, embed_tokens, layernorms: left in BF16. These are also left alone by regular NVFP4, since they're relatively small and very sensitive to quantization error.
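To illustrate the difference between the two kinds of FP8 scales mentioned above, here is a minimal pure-Python sketch (my own illustrative code, not llm-compressor's) that computes symmetric scales against the E4M3 maximum of 448; real implementations work on tensors and actually cast the values to FP8:

```python
FP8_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_channel_scales(weight_rows):
    """One scale per output channel (row), fixed at quantization time."""
    return [max(abs(w) for w in row) / FP8_MAX for row in weight_rows]

def per_token_scales(activation_rows):
    """One scale per token (row), recomputed dynamically each forward pass."""
    return [max(abs(a) for a in row) / FP8_MAX for row in activation_rows]

weights = [[0.5, -2.0, 1.0], [4.48, 0.1, -0.2]]
activations = [[448.0, -100.0], [10.0, -896.0]]
print(per_channel_scales(weights))    # ≈ [0.00446, 0.01]
print(per_token_scales(activations))  # [1.0, 2.0]
```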

More about Four Over Six

One of the main downsides of using FP4 is the extreme sparsity of large values. At a base level, NVFP4 works by dividing the model into sixteen-element blocks, then assigning FP8 scale factors to each block (as well as a single FP32 scale factor for the tensor as a whole) such that the largest absolute value in the block maps to ±6. For example, if a block has the values {10,-20,40,-60}, the scale factor would be set to 10 and the FP4 values would be {1,-2,4,-6}. The problem is that the FP4 format only allows for a very limited set of values. In particular, it can't represent any number between 4 and 6, so anything in the block that maps to ±5 will be severely affected by rounding error. In the following two graphs, the x-axis represents the value a given weight would have with the corresponding scaling, and the y-axis represents how much proportional error would be introduced by rounding it to the nearest valid FP4 value:

[Graphs: proportional FP4 rounding error under ±6 and ±4 scaling]
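The block-scaling arithmetic described above can be sketched in pure Python. This is my own illustrative code, not NVIDIA's implementation; it uses the four-value block from the example rather than a full sixteen-element block, and it ignores the FP8 encoding of the block scales:

```python
# Representable FP4 (E2M1) magnitudes: note the gap between 4 and 6.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, target=6.0):
    """Scale the block so its absmax maps to `target`, then round each
    element to the nearest representable FP4 magnitude."""
    scale = max(abs(x) for x in block) / target
    quantized = []
    for x in block:
        mag = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        quantized.append(mag if x >= 0 else -mag)
    return scale, quantized

scale, q = quantize_block([10, -20, 40, -60])
print(scale, q)  # 10.0 [1.0, -2.0, 4.0, -6.0]
```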

However, while scaling to ±4 reduces worst-case rounding error for large values, it increases rounding error for smaller values, so simply scaling every block to ±4 would be a bad idea. The solution is to try scaling each block both ways, then keep whichever gives the lowest quantization MSE for that block. The memoryless_mse observer in llm-compressor is designed to work on a similar principle, calculating scale factors as though the weights were multiplied by different values of p and choosing the scale that minimizes quantization error for each block. While this is primarily intended for use with p ≤ 1, allowing extra precision for small values at the cost of clipping large values, when used with NVFP4 it's mathematically equivalent to mapping the most extreme values in each block to ±6/p. Obviously, this can be used to implement Four Over Six by setting p ∈ {1, 1.5}. The key to doing this is the following code from mse.py:

for i in range(int(maxshrink * grid)):
    p = 1 - i / grid

With maxshrink set to -1 and grid set to -2, this loop will run twice: once with p = 1 and a second time with p = 1.5. The comparisons are done by taking the absolute value of the difference between original and quantized values, raising it to the power of norm, and summing that for each block, so setting norm to 2 is equivalent to using quantization MSE.
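Putting those pieces together, the selection logic amounts to something like the following sketch (my own illustrative code with norm = 2, not the actual mse.py implementation):

```python
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant(x, scale):
    """Round x to the nearest representable FP4 value at the given scale."""
    mag = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
    return (mag if x >= 0 else -mag) * scale

def four_over_six_scale(block, norm=2.0):
    """Try mapping the block's absmax to 6 (p = 1) and to 4 (p = 1.5),
    keeping whichever scale gives the lower total quantization error."""
    absmax = max(abs(x) for x in block)
    best_err, best = float("inf"), None
    for p in (1.0, 1.5):
        scale = absmax * p / 6.0  # the absmax now maps to 6 / p
        err = sum(abs(x - fake_quant(x, scale)) ** norm for x in block)
        if err < best_err:
            best_err, best = err, scale
    return best

print(four_over_six_scale([6.0, 3.0, 1.5, 0.5]))  # 1.0 (every value exact at ±6)
print(four_over_six_scale([6.0, 5.2, 5.2, 5.2]))  # 1.5 (5.2 falls in the 4-6 gap)
```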

Benchmarks

To test the quantization format, I compared this model to a version quantized with regular NVFP4 (available here) in a variety of benchmarks using lm_eval. For benchmarks that provided standard error in addition to the score, I also computed a z score by dividing the difference in scores by √(s₁² + s₂²) and used that to test for statistical significance. The results of the benchmarks are in the table below, with the cases where the difference was statistically significant (|z| ≥ 1.96) in bold. Even for the tests where the results weren't individually significant, the fact that the hybrid got the better score on all of them except winogrande seems rather telling.

| Task | Metric | NVFP4 Baseline | Hybrid Quant |
| --- | --- | --- | --- |
| coqa | em | 53.92% | 57.33% |
| | f1 | 71.82% | 73.47% |
| hellaswag | acc | 61.86% | 62.40% |
| | acc_norm | 80.84% | 81.25% |
| ifeval | inst_level_loose_acc | 54.56% | 57.67% |
| | inst_level_strict_acc | 47.12% | 51.08% |
| | prompt_level_loose_acc | 46.03% | 49.17% |
| | prompt_level_strict_acc | 38.08% | 41.96% |
| lambada_openai | acc | 75.84% | 77.26% |
| | perplexity | 3.0229 | 2.9233 |
| lambada_openai_cloze | acc | 31.22% | 33.17% |
| | perplexity | 29.8427 | 26.6948 |
| lambada_standard | acc | 68.85% | 69.26% |
| | perplexity | 3.6401 | 3.5514 |
| lambada_standard_cloze | acc | 22.59% | 28.37% |
| | perplexity | 44.8440 | 35.5615 |
| commonsense_qa | acc | 57.74% | 62.08% |
| mmlu | acc | 63.25% | 64.54% |
| openbookqa | acc | 36.80% | 40.40% |
| | acc_norm | 47.00% | 48.80% |
| winogrande | acc | 76.72% | 75.45% |
| triviaqa | exact_match | 59.53% | 61.84% |
| truthfulqa_mc1 | acc | 37.82% | 39.17% |
| truthfulqa_mc2 | acc | 52.84% | 54.75% |
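The significance test described above boils down to a one-line computation. As a sketch (the scores and standard errors here are made-up placeholders, not values from the table):

```python
import math

def z_score(score_a, se_a, score_b, se_b):
    """Two-sample z statistic: score difference over the combined standard error."""
    return (score_b - score_a) / math.sqrt(se_a ** 2 + se_b ** 2)

z = z_score(0.50, 0.01, 0.53, 0.01)
print(z, abs(z) >= 1.96)  # the difference is significant at the 5% level if True
```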

I also tested versions with just Four Over Six block scaling (here) and just FP8 self-attention (here) for the sake of thoroughness. Both were generally better than the NVFP4 baseline, and the hybrid quant was generally better than either on its own. Detailed results are available in JSON format here, CSV format here, and Markdown format here.

VRAM Usage

In Aphrodite Engine with the --single-user-mode flag, running the model with a KV cache size of 32,768 tokens used 15,157 MiB of VRAM.

In vLLM 0.19.0, which is somewhat less efficient with its VRAM use, using --gpu-memory-utilization 0.9 and --max-model-len auto on an RTX 5060 Ti 16GB allowed a KV cache size of up to 30,304 tokens and used 15,415 MiB of VRAM.

Prefill speed

I benchmarked model speed by splitting sample texts into n-token chunks and running requests for the next token (thus forcing the model to prefill the same number of tokens repeatedly), then recording Aphrodite Engine's average prefill speed at different values of n for both the NVFP4 and hybrid quants (as Four Over Six has no impact on performance, I only needed to compare the two models this time).
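As a sketch of that setup (with hypothetical helper names of my own; the real benchmark sends each chunk to the engine as a one-token completion request):

```python
def make_chunks(tokens, n):
    """Split a token list into consecutive n-token chunks,
    dropping any partial chunk at the end."""
    return [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, n)]

def prefill_speed(n_tokens, elapsed_seconds):
    """Average prefill throughput in tokens per second."""
    return n_tokens / elapsed_seconds

print(make_chunks(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(prefill_speed(4096, 0.574))       # ≈ 7135.9 tokens/s
```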

| Tokens | NVFP4 tokens/s | Hybrid tokens/s |
| --- | --- | --- |
| 4096 | 7137.7 | 5801.8 |
| 8192 | 5931.8 | 4933.0 |
| 12288 | 4940.1 | 4253.8 |
| 16384 | 4295.8 | 3721.8 |
| 20480 | 3727.0 | 3308.6 |
| 24576 | 3344.8 | 2978.8 |
| 28672 | 2968.3 | 2693.1 |
| 32768 | 2688.1 | 2506.3 |

As shown in the following graph, the difference in speed shrinks as context length grows and memory bandwidth overtakes compute as the limiting factor.

[Graph: prefill speed vs. context length for the NVFP4 and hybrid quants]

Long-context Perplexity

For this test, I split sample texts into n-token chunks and computed perplexity scores for each chunk in all four quantized models: the NVFP4 baseline, one with Four Over Six weight selection, one with the attention tensors in FP8, and the hybrid format. I then recorded the average perplexity score for each model at each chunk size. Sample texts for this step were UTF-8 text files taken from Project Gutenberg, listed below.
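Chunk perplexity is just the exponential of the mean per-token negative log-likelihood; as a sketch (my own helper, not lm_eval's implementation):

```python
import math

def chunk_perplexity(token_logprobs):
    """Perplexity of one chunk from its per-token log-probabilities (natural log)."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# If the model assigned every token probability 1/8, perplexity would be 8:
print(chunk_perplexity([math.log(1 / 8)] * 100))  # ≈ 8.0
```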

Sample texts used
| Tokens | NVFP4 | Four Over Six | FP8 Attention | Hybrid |
| --- | --- | --- | --- | --- |
| 4096 | 4.2980 | 4.1049 | 3.7679 | 3.6271 |
| 8192 | 4.1490 | 3.9653 | 3.6378 | 3.4951 |
| 12288 | 4.3685 | 4.1949 | 3.7713 | 3.6243 |
| 16384 | 4.6961 | 4.5181 | 3.9535 | 3.8087 |
| 20480 | 4.9098 | 4.7429 | 4.0625 | 3.9234 |
| 24576 | 5.0934 | 4.9134 | 4.1677 | 4.0173 |
| 28672 | 5.2833 | 5.1015 | 4.2761 | 4.1295 |
| 32768 | 5.4543 | 5.2666 | 4.3560 | 4.2114 |

While perplexity for all quants increases with context length past 8192 tokens, the chart looks very different from the performance one, and is rather informative. Switching from default NVFP4 weight selection to Four Over Six shifted the curve down by a roughly constant amount, both in the pure NVFP4 model and in the one with FP8 attention. The two models with FP8 attention, however, diverge from the two without as context length grows, indicating that as the number of tokens attending to each other increases, the benefit of doing attention calculations in higher precision becomes more pronounced.

[Graph: perplexity vs. context length for all four quants]

Further Perplexity Comparison

Out of curiosity, I also tried quantizing the model with a different mixed-precision recipe that quantized all down_proj tensors to FP8_DYNAMIC and the rest to NVFP4, testing versions with and without Four Over Six. Interestingly, while these performed better than any of the other quants at shorter context lengths, their perplexity curves remained parallel to that of pure NVFP4, and both were overtaken by the versions with FP8 attention at longer contexts. Between this and the fact that the versions with FP8 down_proj were larger and thus required more VRAM, I feel confident in my assessment that FP8 attention is the better option overall.

Results
| Tokens | FP8 down_proj | FP8 down_proj (4/6) |
| --- | --- | --- |
| 4096 | 3.5965 | 3.4747 |
| 8192 | 3.4717 | 3.3517 |
| 12288 | 3.7064 | 3.5865 |
| 16384 | 4.0343 | 3.9131 |
| 20480 | 4.2567 | 4.1288 |
| 24576 | 4.4232 | 4.2880 |
| 28672 | 4.6076 | 4.4737 |
| 32768 | 4.7801 | 4.6277 |

[Graph: perplexity vs. context length for the FP8 down_proj variants]

Inference

This model requires compressed-tensors 0.14.0 or later and has been tested on both vLLM and Aphrodite Engine. If you're using Aphrodite Engine or an older version of vLLM, you'll need to manually update compressed-tensors to 0.14.0 or later. Additionally, if you're using Aphrodite Engine 0.10.0 rather than installing it from the latest GitHub commit, you may need to open the file aphrodite/platforms/interface.py in your library or venv (if you've followed the official installation instructions, it will be under ~/venv/aphrodite/lib/python3.12/site-packages) and comment out or delete lines 487-491.

Credits

Mistral-Nemo-Instruct-2407 was made by Mistral AI and NVIDIA.

Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han.
