Mixtral-8x7B-Instruct-v0.1-NVFP4

NVFP4 (E2M1) quantization of mistralai/Mixtral-8x7B-Instruct-v0.1 for NVIDIA Blackwell GPUs (SM100+).

This quantization reduces model weights from 87 GB to 25 GB (a 72% reduction) while improving MMLU accuracy by 2.5 points over the BF16 baseline. Inference throughput is 2.55x that of BF16.

Key Results

| Metric | NVFP4 | BF16 | Delta |
|---|---|---|---|
| MMLU Overall | 60.0% | 57.5% | +2.5% |
| Wikitext Perplexity | 7.36 | 6.20 | +1.17 |
| Inference Speed | 20.7 tok/s | 8.1 tok/s | 2.55x |
| Model Size | 25 GB | 87 GB | -72% |

NVFP4 preserves and slightly improves task-level reasoning (MMLU) while trading off raw language modeling fidelity (perplexity). The mechanism behind the MMLU improvement is not fully understood.

MMLU Breakdown (Full 14K+ Questions, 5-shot)

| Category | NVFP4 | BF16 | Delta |
|---|---|---|---|
| Overall | 60.0% | 57.5% | +2.5% |
| STEM | 51.7% | 49.4% | +2.3% |
| Humanities | 54.5% | 50.6% | +3.9% |
| Social Sciences | 68.5% | 67.5% | +1.0% |
| Other | 68.4% | 66.3% | +2.1% |

Both models were benchmarked under identical conditions: same hardware, same vLLM version, same context window (4096 tokens), same evaluation harness.

Wikitext Perplexity

| Metric | NVFP4 | BF16 |
|---|---|---|
| Word Perplexity | 7.364 | 6.196 |
| Bits per Byte | 0.539 | 0.492 |
| Byte Perplexity | 1.453 | 1.406 |

BF16 retains an advantage on raw next-token prediction, as expected for an unquantized model.
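
The two byte-level metrics are different views of the same loss: bits per byte is the base-2 logarithm of byte perplexity, so the table is internally consistent. A quick sanity check:

```python
import math

# bits_per_byte = log2(byte_perplexity); derive one column from the other.
for name, byte_ppl, reported_bpb in [("NVFP4", 1.453, 0.539), ("BF16", 1.406, 0.492)]:
    derived = math.log2(byte_ppl)
    print(f"{name}: log2({byte_ppl}) = {derived:.3f} (reported {reported_bpb})")
```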

Quantization Details

| Parameter | Value |
|---|---|
| Algorithm | NVFP4 (E2M1), max calibration |
| Tool | NVIDIA ModelOpt 0.41.0 (NVFP4_DEFAULT_CFG) |
| Group Size | 16 |
| Calibration Data | 512 samples (256 from codeparrot/codeparrot-clean, 256 from Open-Orca/SlimOrca) |
| Calibration Batch Size | 8, pre-tokenized, max_length=512 |
| Quantization Time | ~1 hour |
| Excluded Layers | lm_head, all self_attn layers, all block_sparse_moe.gate (router) layers |
| Export Format | HuggingFace safetensors via modelopt.torch.export.export_hf_checkpoint |
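
To illustrate what group-size-16 E2M1 quantization does, here is a simplified fake-quantization sketch. This is not ModelOpt's kernel: in the real NVFP4 format the per-group scales are themselves stored in FP8 (E4M3) with a second-level FP32 scale, whereas this sketch keeps the scale in full precision for clarity.

```python
import numpy as np

# E2M1 (1 sign, 2 exponent, 1 mantissa bit) has 8 non-negative magnitudes.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(weights: np.ndarray):
    """Fake-quantize one group of 16 weights to the E2M1 grid
    with a per-group max-abs scale (illustrative only)."""
    scale = np.abs(weights).max() / E2M1_GRID[-1]  # map the group max to 6.0
    if scale == 0:
        return weights.copy(), 1.0
    mags = np.abs(weights) / scale
    # Snap each magnitude to the nearest representable E2M1 value.
    idx = np.abs(mags[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(weights) * E2M1_GRID[idx] * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
wq, s = quantize_group(w)
print("max abs error:", np.abs(w - wq).max())
```

The coarse grid (only 8 magnitudes) is why per-group scaling matters: a group's largest weight sets the scale, and everything else is rounded relative to it.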

Why Exclude Self-Attention?

An initial quantization (v1) applied NVFP4 to all linear layers except embeddings and routers, yielding 57.0% MMLU — a 2.9% drop from BF16 baseline (measured with --limit 200). Excluding self-attention layers from quantization (v2) recovered that loss and pushed accuracy above baseline to 61.1% on the same subset and 60.0% on the full 14K evaluation. The MoE expert layers quantize well; attention layers do not.
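
The exclusion mechanism can be pictured as wildcard patterns that disable quantization for matching module names. The sketch below mimics ModelOpt's pattern-style config using plain fnmatch; the config keys, values, and layer names here are illustrative, not the exact script used for this model.

```python
import copy
import fnmatch

# Hypothetical base config: quantize every weight quantizer by default.
base_cfg = {"quant_cfg": {"*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}}}}

# v2: disable quantization for lm_head, attention, and MoE router layers.
cfg = copy.deepcopy(base_cfg)
for pattern in ["*lm_head*", "*self_attn*", "*block_sparse_moe.gate*"]:
    cfg["quant_cfg"][pattern] = {"enable": False}

def is_quantized(layer_name: str, cfg: dict) -> bool:
    # A layer is quantized unless some disable pattern matches its name.
    return not any(
        fnmatch.fnmatch(layer_name, pat)
        for pat, rule in cfg["quant_cfg"].items()
        if rule.get("enable", True) is False
    )

print(is_quantized("model.layers.0.block_sparse_moe.experts.3.w1", cfg))  # True
print(is_quantized("model.layers.0.self_attn.q_proj", cfg))               # False
```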

Benchmark Methodology

All benchmarks used lm-evaluation-harness v0.4.11.

| Parameter | Value |
|---|---|
| MMLU Task | mmlu_generative (57 subjects, generative format) |
| Few-shot | 5-shot |
| Wikitext Task | wikitext (perplexity via local-completions) |
| Context Window | 4096 tokens (both models) |
| Concurrency | 5 concurrent requests |
| Hardware | NVIDIA DGX Spark (GB10), 128 GB unified memory |
| Inference Stack | vLLM 0.14.0rc1 on gogamza/unsloth-vllm-gb10:latest |

Preliminary vs Full Benchmark

An initial benchmark with --limit 200 (200 questions per subject) showed NVFP4 at 61.1% vs BF16 at 59.9% (+1.2%). The full benchmark (14K+ questions, no limit) confirmed and widened this result to 60.0% vs 57.5% (+2.5%), providing stronger statistical confidence.
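
A rough way to see why the full run is more trustworthy: treating each question as an independent Bernoulli trial (which ignores per-subject correlation, so this understates the true uncertainty), the standard error at n = 14,042 is small relative to the observed gap.

```python
import math

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
def accuracy_se(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

n_full = 14042  # MMLU test-set size ("14K+ questions")
se = accuracy_se(0.60, n_full)
print(f"SE at 60% accuracy, n={n_full}: +/-{se:.2%}")        # ~0.41%
print(f"Observed gap: 2.5 pts = {0.025 / se:.1f} standard errors")
```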

Inference Speed

Measured separately: 5 runs of 500-token generation, single concurrent request, technical prompt.

| Model | Tokens/sec | Relative |
|---|---|---|
| NVFP4 | 20.7 tok/s | 2.55x |
| BF16 | 8.1 tok/s | 1.0x |
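
Single-request decode is typically memory-bandwidth bound, so the ideal speedup is roughly the ratio of weight bytes streamed per generated token; unquantized KV cache, activation traffic, and dequantization overhead keep the observed number below that ceiling. A back-of-envelope check using the sizes above:

```python
# Roofline intuition: each decoded token streams the active weights,
# so the ideal speedup is the ratio of model bytes read per token.
bf16_gb, nvfp4_gb = 87, 25
ideal = bf16_gb / nvfp4_gb
observed = 20.7 / 8.1
print(f"ideal (weight-bytes) speedup: {ideal:.2f}x")   # 3.48x
print(f"observed speedup:             {observed:.2f}x")  # 2.56x
```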

Hardware Requirements

  • GPU: NVIDIA Blackwell (SM100+) — GB10, GB100, GB200
  • Memory: ~25 GB for model weights + KV cache overhead
  • Tested on: DGX Spark (GB10), 128 GB unified LPDDR5x

This model will not run on pre-Blackwell GPUs (A100, H100, etc.). NVFP4 is a Blackwell-native format.

Usage

vLLM

```bash
docker run -d --name mixtral-nvfp4 \
    --runtime nvidia --gpus all \
    -p 8001:8001 \
    -v /path/to/Mixtral-8x7B-Instruct-v0.1-NVFP4:/model \
    -v /path/to/cache:/root/.cache/vllm \
    gogamza/unsloth-vllm-gb10:latest \
    vllm serve /model \
    --host 0.0.0.0 --port 8001 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.3 \
    --served-model-name mixtral \
    --chat-template /model/chat_template.jinja
```

vLLM auto-detects NVFP4 quantization from hf_quant_config.json. No --quantization flag needed.

OpenAI-Compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="none")
response = client.chat.completions.create(
    model="mixtral",
    messages=[{"role": "user", "content": "Explain the Mixture of Experts architecture."}],
)
print(response.choices[0].message.content)
```

Files

| File | Description |
|---|---|
| model-*.safetensors (6 shards) | NVFP4 quantized weights (~25 GB total) |
| config.json | Model config with quantization parameters |
| hf_quant_config.json | ModelOpt quantization metadata (auto-detected by vLLM) |
| tokenizer.json, tokenizer_config.json | Tokenizer files |
| chat_template.jinja | Chat template for the instruct format |
| special_tokens_map.json | Special token definitions |

Known Limitations

  • Blackwell-only: Requires SM100+ hardware. Will not load on Ampere/Hopper.
  • Perplexity trade-off: ~19% higher word perplexity vs BF16. Raw language modeling quality is reduced, though task performance is preserved.
  • No KV cache quantization: Only weights and activations are quantized. KV cache remains in default precision.
  • HumanEval: Not benchmarked — lm-eval's generative evaluation mode conflicts with chat template wrapping, producing 0.0 pass@1 for both NVFP4 and BF16.

License

Apache 2.0

Citation

```bibtex
@misc{brewhaha2026mixtral-nvfp4,
    title={Mixtral-8x7B-Instruct-v0.1-NVFP4: NVFP4 Quantization with Selective Layer Exclusion},
    author={SyntacticallySugary},
    year={2026},
    url={https://huggingface.co/SyntacticallySugary/Mixtral-8x7B-Instruct-v0.1-NVFP4}
}
```

Acknowledgments

  • Mistral AI for the base Mixtral-8x7B-Instruct model
  • NVIDIA for ModelOpt, NVFP4 format, and the DGX Spark platform
  • vLLM for inference serving with Blackwell NVFP4 support
  • EleutherAI for the evaluation harness