Mistral-Nemo-Instruct-2407-NVFP4-FP8

A version of Mistral-Nemo-Instruct-2407 created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The goal was to improve accuracy compared to regular NVFP4 with minimal impact on speed and VRAM usage, and specifically to remain small enough to support a 32k-token context window in Aphrodite Engine on an RTX 5060 Ti 16GB.

Quantization Format

The quant uses a mixed-precision approach, with different types of layers undergoing different amounts of quantization.

  • Self-attention layers: FP8_DYNAMIC, a format that uses FP8 weights and activations, with per-channel BF16 scales for the weights and dynamic per-token scales for the activations. Self-attention tensors make up slightly less than 20 percent of the weights in the linear layers, so upgrading them from NVFP4 to FP8 has a very limited effect on model size and VRAM use. Moreover, the nature of attention means that as context length increases, the benefits of doing attention calculations in w8a8 rather than w4a4 become more pronounced.
  • MLP layers: NVFP4 with Four Over Six adaptive block scaling for the weights. As this is purely a matter of how weights are selected rather than a change in format, it increases accuracy compared to regular NVFP4 without impacting performance or requiring any changes at runtime.
  • lm_head, embed_tokens, layernorms: left in BF16. These are also left alone by regular NVFP4, since they're relatively small and very sensitive to quantization error.
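To illustrate the difference between the two kinds of FP8 scales mentioned above, here is a minimal pure-Python sketch (my own illustrative code, not llm-compressor's) that computes symmetric scales against the E4M3 maximum of 448; real implementations work on tensors and actually cast the values to FP8:

```python
FP8_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_channel_scales(weight_rows):
    """One scale per output channel (row), fixed at quantization time."""
    return [max(abs(w) for w in row) / FP8_MAX for row in weight_rows]

def per_token_scales(activation_rows):
    """One scale per token (row), recomputed dynamically each forward pass."""
    return [max(abs(a) for a in row) / FP8_MAX for row in activation_rows]

weights = [[0.5, -2.0, 1.0], [4.48, 0.1, -0.2]]
activations = [[448.0, -100.0], [10.0, -896.0]]
print(per_channel_scales(weights))    # ≈ [0.00446, 0.01]
print(per_token_scales(activations))  # [1.0, 2.0]
```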

More about Four Over Six

One of the main downsides of using FP4 is the extreme sparsity of large values. At a base level, NVFP4 works by dividing the model into sixteen-element blocks, then assigning FP8 scale factors to each block (as well as a single FP32 scale factor for the tensor as a whole) such that the largest absolute value in the block maps to ±6. For example, if a block has the values {10,-20,40,-60}, the scale factor would be set to 10 and the FP4 values would be {1,-2,4,-6}. The problem is that the FP4 format only allows for a very limited set of values. In particular, it can't represent any number between 4 and 6, so anything in the block that maps to ±5 will be severely affected by rounding error. In the following two graphs, the x-axis represents the value a given weight would have with the corresponding scaling, and the y-axis represents how much proportional error would be introduced by rounding it to the nearest valid FP4 value:

[Graphs: proportional FP4 rounding error under ±6 and ±4 scaling]
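The block-scaling arithmetic described above can be sketched in pure Python. This is my own illustrative code, not NVIDIA's implementation; it uses the four-value block from the example rather than a full sixteen-element block, and it ignores the FP8 encoding of the block scales:

```python
# Representable FP4 (E2M1) magnitudes: note the gap between 4 and 6.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, target=6.0):
    """Scale the block so its absmax maps to `target`, then round each
    element to the nearest representable FP4 magnitude."""
    scale = max(abs(x) for x in block) / target
    quantized = []
    for x in block:
        mag = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        quantized.append(mag if x >= 0 else -mag)
    return scale, quantized

scale, q = quantize_block([10, -20, 40, -60])
print(scale, q)  # 10.0 [1.0, -2.0, 4.0, -6.0]
```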

However, while scaling to ±4 reduces worst-case rounding error for large values, it increases rounding error for smaller values, so simply scaling every block to ±4 would be a bad idea. The solution is to try scaling each block both ways, then keep whichever gives the lowest quantization MSE for that block. The memoryless_mse observer in llm-compressor is designed to work on a similar principle, calculating scale factors as though the weights were multiplied by different values of p and choosing the scale that minimizes quantization error for each block. While this is primarily intended for use with p ≤ 1, allowing extra precision for small values at the cost of clipping large values, when used with NVFP4 it's mathematically equivalent to mapping the most extreme values in each block to ±6/p. Obviously, this can be used to implement Four Over Six by setting p ∈ {1, 1.5}. The key to doing this is the following code from mse.py:

for i in range(int(maxshrink * grid)):
    p = 1 - i / grid

With maxshrink set to -1 and grid set to -2, this loop will run twice: once with p = 1 and a second time with p = 1.5. The comparisons are done by taking the absolute value of the difference between original and quantized values, raising it to the power of norm, and summing that for each block, so setting norm to 2 is equivalent to using quantization MSE.
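Putting those pieces together, the selection logic amounts to something like the following sketch (my own illustrative code with norm = 2, not the actual mse.py implementation):

```python
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant(x, scale):
    """Round x to the nearest representable FP4 value at the given scale."""
    mag = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
    return (mag if x >= 0 else -mag) * scale

def four_over_six_scale(block, norm=2.0):
    """Try mapping the block's absmax to 6 (p = 1) and to 4 (p = 1.5),
    keeping whichever scale gives the lower total quantization error."""
    absmax = max(abs(x) for x in block)
    best_err, best = float("inf"), None
    for p in (1.0, 1.5):
        scale = absmax * p / 6.0  # the absmax now maps to 6 / p
        err = sum(abs(x - fake_quant(x, scale)) ** norm for x in block)
        if err < best_err:
            best_err, best = err, scale
    return best

print(four_over_six_scale([6.0, 3.0, 1.5, 0.5]))  # 1.0 (every value exact at ±6)
print(four_over_six_scale([6.0, 5.2, 5.2, 5.2]))  # 1.5 (5.2 falls in the 4-6 gap)
```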

Benchmarks

To test the quantization format, I compared this model to a version quantized with regular NVFP4 (available here) in a variety of benchmarks using lm_eval. For benchmarks that provided standard error in addition to the score, I also computed a z score by dividing the difference in scores by √(s₁² + s₂²) and used that to test for statistical significance. The results of the benchmarks are in the table below, with the cases where the difference was statistically significant (|z| ≥ 1.96) in bold. Even for the tests where the results weren't individually significant, the fact that the hybrid got the better score on all of them except winogrande seems rather telling.

| Task | Metric | NVFP4 Baseline | Hybrid Quant |
| --- | --- | --- | --- |
| coqa | em | 53.92% | 57.33% |
| | f1 | 71.82% | 73.47% |
| hellaswag | acc | 61.86% | 62.40% |
| | acc_norm | 80.84% | 81.25% |
| ifeval | inst_level_loose_acc | 54.56% | 57.67% |
| | inst_level_strict_acc | 47.12% | 51.08% |
| | prompt_level_loose_acc | 46.03% | 49.17% |
| | prompt_level_strict_acc | 38.08% | 41.96% |
| lambada_openai | acc | 75.84% | 77.26% |
| | perplexity | 3.0229 | 2.9233 |
| lambada_openai_cloze | acc | 31.22% | 33.17% |
| | perplexity | 29.8427 | 26.6948 |
| lambada_standard | acc | 68.85% | 69.26% |
| | perplexity | 3.6401 | 3.5514 |
| lambada_standard_cloze | acc | 22.59% | 28.37% |
| | perplexity | 44.8440 | 35.5615 |
| commonsense_qa | acc | 57.74% | 62.08% |
| mmlu | acc | 63.25% | 64.54% |
| openbookqa | acc | 36.80% | 40.40% |
| | acc_norm | 47.00% | 48.80% |
| winogrande | acc | 76.72% | 75.45% |
| triviaqa | exact_match | 59.53% | 61.84% |
| truthfulqa_mc1 | acc | 37.82% | 39.17% |
| truthfulqa_mc2 | acc | 52.84% | 54.75% |
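The significance test described above boils down to a one-line computation. As a sketch (the scores and standard errors here are made-up placeholders, not values from the table):

```python
import math

def z_score(score_a, se_a, score_b, se_b):
    """Two-sample z statistic: score difference over the combined standard error."""
    return (score_b - score_a) / math.sqrt(se_a ** 2 + se_b ** 2)

z = z_score(0.50, 0.01, 0.53, 0.01)
print(z, abs(z) >= 1.96)  # the difference is significant at the 5% level if True
```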

I also tested versions with just Four Over Six block scaling (here) and just FP8 self-attention (here) for the sake of thoroughness. Both were generally better than the NVFP4 baseline, and the hybrid quant was generally better than either on its own. Detailed results are available in JSON format here, CSV format here, and Markdown format here.

VRAM Usage

In Aphrodite Engine with the --single-user-mode flag, running the model with a KV cache size of 32,768 tokens used 15,157 MiB of VRAM.

In vLLM 0.19.0, which is somewhat less efficient with its VRAM use, using --gpu-memory-utilization 0.9 and --max-model-len auto on an RTX 5060 Ti 16GB allowed a KV cache size of up to 30,304 tokens and used 15,415 MiB of VRAM.

Prefill speed

I benchmarked model speed by splitting sample texts into n-token chunks and running requests for the next token (thus forcing the model to prefill the same number of tokens repeatedly), then recording Aphrodite Engine's average prefill speed at different values of n for both the NVFP4 and hybrid quants (as Four Over Six has no impact on performance, I only needed to compare the two models this time).
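As a sketch of that setup (with hypothetical helper names of my own; the real benchmark sends each chunk to the engine as a one-token completion request):

```python
def make_chunks(tokens, n):
    """Split a token list into consecutive n-token chunks,
    dropping any partial chunk at the end."""
    return [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, n)]

def prefill_speed(n_tokens, elapsed_seconds):
    """Average prefill throughput in tokens per second."""
    return n_tokens / elapsed_seconds

print(make_chunks(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(prefill_speed(4096, 0.574))       # ≈ 7135.9 tokens/s
```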

| Tokens | NVFP4 tokens/s | Hybrid tokens/s |
| --- | --- | --- |
| 4096 | 7137.7 | 5801.8 |
| 8192 | 5931.8 | 4933.0 |
| 12288 | 4940.1 | 4253.8 |
| 16384 | 4295.8 | 3721.8 |
| 20480 | 3727.0 | 3308.6 |
| 24576 | 3344.8 | 2978.8 |
| 28672 | 2968.3 | 2693.1 |
| 32768 | 2688.1 | 2506.3 |

As shown in the following graph, the difference in speed shrinks as context length grows and memory bandwidth overtakes compute as the limiting factor.

[Graph: prefill speed vs. context length for the NVFP4 and hybrid quants]

Long-context Perplexity

For this test, I split sample texts into n-token chunks and computed perplexity scores for each chunk in all four quantized models: the NVFP4 baseline, one with Four Over Six weight selection, one with the attention tensors in FP8, and the hybrid format. I then recorded the average perplexity score for each model at each chunk size. Sample texts for this step were UTF-8 text files taken from Project Gutenberg, listed below.
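Chunk perplexity is just the exponential of the mean per-token negative log-likelihood; as a sketch (my own helper, not lm_eval's implementation):

```python
import math

def chunk_perplexity(token_logprobs):
    """Perplexity of one chunk from its per-token log-probabilities (natural log)."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# If the model assigned every token probability 1/8, perplexity would be 8:
print(chunk_perplexity([math.log(1 / 8)] * 100))  # ≈ 8.0
```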

Sample texts used
| Tokens | NVFP4 | Four Over Six | FP8 Attention | Hybrid |
| --- | --- | --- | --- | --- |
| 4096 | 4.2980 | 4.1049 | 3.7679 | 3.6271 |
| 8192 | 4.1490 | 3.9653 | 3.6378 | 3.4951 |
| 12288 | 4.3685 | 4.1949 | 3.7713 | 3.6243 |
| 16384 | 4.6961 | 4.5181 | 3.9535 | 3.8087 |
| 20480 | 4.9098 | 4.7429 | 4.0625 | 3.9234 |
| 24576 | 5.0934 | 4.9134 | 4.1677 | 4.0173 |
| 28672 | 5.2833 | 5.1015 | 4.2761 | 4.1295 |
| 32768 | 5.4543 | 5.2666 | 4.3560 | 4.2114 |

While perplexity for all quants increases with context length past 8192 tokens, the chart looks very different from the performance one, and is rather informative. Switching from default NVFP4 weight selection to Four Over Six shifted the curve down by a roughly constant amount, both in the pure NVFP4 model and in the one with FP8 attention. The two models with FP8 attention, however, diverge from the two without as context length grows, indicating that as the number of tokens attending to each other increases, the benefit of doing attention calculations in higher precision becomes more pronounced.

[Graph: perplexity vs. context length for all four quants]

Further Perplexity Comparison

Out of curiosity, I also tried quantizing the model with a different mixed-precision recipe that quantized all down_proj tensors to FP8_DYNAMIC and the rest to NVFP4, testing versions with and without Four Over Six. Interestingly, while these performed better than any of the other quants at shorter context lengths, their perplexity curves remained parallel to that of pure NVFP4, and both were overtaken by the versions with FP8 attention at longer contexts. Between this and the fact that the versions with FP8 down_proj were larger and thus required more VRAM, I feel confident in my assessment that FP8 attention is the better option overall.

Results
| Tokens | FP8 down_proj | FP8 down_proj (4/6) |
| --- | --- | --- |
| 4096 | 3.5965 | 3.4747 |
| 8192 | 3.4717 | 3.3517 |
| 12288 | 3.7064 | 3.5865 |
| 16384 | 4.0343 | 3.9131 |
| 20480 | 4.2567 | 4.1288 |
| 24576 | 4.4232 | 4.2880 |
| 28672 | 4.6076 | 4.4737 |
| 32768 | 4.7801 | 4.6277 |

[Graph: perplexity vs. context length for the FP8 down_proj variants]

Inference

This model requires compressed-tensors 0.14.0 or later and has been tested on both vLLM and Aphrodite Engine. If you're using Aphrodite Engine or an older version of vLLM, you'll need to manually update compressed-tensors to 0.14.0 or later. Additionally, if you're using Aphrodite Engine 0.10.0 rather than installing it from the latest GitHub commit, you may need to open the file aphrodite/platforms/interface.py in your library or venv (if you've followed the official installation instructions, it will be under ~/venv/aphrodite/lib/python3.12/site-packages) and comment out or delete lines 487-491.

Credits

Mistral-Nemo-Instruct-2407 was made by Mistral AI and NVIDIA.

Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han.
