# Mixtral-8x7B-Instruct-v0.1-NVFP4

NVFP4 (E2M1) quantization of mistralai/Mixtral-8x7B-Instruct-v0.1 for NVIDIA Blackwell GPUs (SM100+).

This quantization reduces model size from 87 GB to 25 GB (a 72% reduction) while improving MMLU accuracy by 2.5 points over the BF16 baseline and delivering 2.55x faster inference throughput.
## Key Results
| Metric | NVFP4 | BF16 | Delta |
|---|---|---|---|
| MMLU Overall | 60.0% | 57.5% | +2.5 pts |
| Wikitext Perplexity | 7.36 | 6.20 | +1.17 |
| Inference Speed | 20.7 tok/s | 8.1 tok/s | 2.55x |
| Model Size | 25 GB | 87 GB | -72% |
NVFP4 preserves and slightly improves task-level reasoning (MMLU) while trading off raw language modeling fidelity (perplexity). The mechanism behind the MMLU improvement is not fully understood.
## MMLU Breakdown (Full 14K+ Questions, 5-shot)
| Category | NVFP4 | BF16 | Delta |
|---|---|---|---|
| Overall | 60.0% | 57.5% | +2.5 pts |
| STEM | 51.7% | 49.4% | +2.3 pts |
| Humanities | 54.5% | 50.6% | +3.9 pts |
| Social Sciences | 68.5% | 67.5% | +1.0 pts |
| Other | 68.4% | 66.3% | +2.1 pts |
Both models were benchmarked under identical conditions: same hardware, same vLLM version, same context window (4096 tokens), same evaluation harness.
## Wikitext Perplexity
| Metric | NVFP4 | BF16 |
|---|---|---|
| Word Perplexity | 7.364 | 6.196 |
| Bits per Byte | 0.539 | 0.492 |
| Byte Perplexity | 1.453 | 1.406 |
BF16 retains an advantage on raw next-token prediction, as expected for an unquantized model.
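As a sanity check on the table, the bits-per-byte and byte-perplexity rows are two views of the same quantity, related by byte perplexity = 2^(bits per byte):

```python
# Bits per byte and byte perplexity are interchangeable:
# byte_perplexity = 2 ** bits_per_byte.
nvfp4_bpb, bf16_bpb = 0.539, 0.492

print(round(2 ** nvfp4_bpb, 3))  # 1.453, matching the NVFP4 row
print(round(2 ** bf16_bpb, 3))   # 1.406, matching the BF16 row
```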
## Quantization Details
| Parameter | Value |
|---|---|
| Algorithm | NVFP4 E2M1, max calibration |
| Tool | NVIDIA ModelOpt 0.41.0 (NVFP4_DEFAULT_CFG) |
| Group Size | 16 |
| Calibration Data | 512 samples (256 from codeparrot/codeparrot-clean, 256 from Open-Orca/SlimOrca) |
| Calibration Batch Size | 8, pre-tokenized, max_length=512 |
| Quantization Time | ~1 hour |
| Excluded Layers | lm_head, all self_attn layers, all block_sparse_moe.gate (router) layers |
| Export Format | HuggingFace safetensors via modelopt.torch.export.export_hf_checkpoint |
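The E2M1 format with group size 16 and max calibration can be illustrated with a minimal fake-quantization sketch in plain Python. This is an illustration of the numerics only, not ModelOpt's implementation; real NVFP4 also encodes the per-group scales in FP8 (E4M3), which is omitted here:

```python
# E2M1 (1 sign, 2 exponent, 1 mantissa bits) can represent these magnitudes:
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(values, grid=E2M1_GRID):
    """Fake-quantize one group with max calibration: scale so the largest
    magnitude maps to 6.0, then round each element to the nearest
    representable E2M1 value and scale back."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / grid[-1]  # max calibration: group max -> 6.0
    out = []
    for v in values:
        mag = min(grid, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out

def nvfp4_fake_quant(weights, group_size=16):
    """Apply per-group fake quantization over a flat weight list."""
    out = []
    for i in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[i:i + group_size]))
    return out

# With scale 1.0, values snap to the E2M1 grid:
print(quantize_group([6.0, 3.1, 0.2]))  # [6.0, 3.0, 0.0]
```

Only 16 weights share each scale, which is why NVFP4 tolerates outliers better than coarser block formats.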
### Why Exclude Self-Attention?
An initial quantization (v1) applied NVFP4 to all linear layers except embeddings and routers, yielding 57.0% MMLU, a 2.9-point drop from the BF16 baseline (measured with `--limit 200`). Excluding self-attention layers from quantization (v2) recovered that loss and pushed accuracy above baseline: 61.1% on the same subset and 60.0% on the full 14K evaluation. The MoE expert layers quantize well; the attention layers do not.
## Benchmark Methodology
All benchmarks used lm-evaluation-harness v0.4.11.
| Parameter | Value |
|---|---|
| MMLU Task | mmlu_generative (57 subjects, generative format) |
| Few-shot | 5-shot |
| Wikitext Task | wikitext (perplexity via local-completions) |
| Context Window | 4096 tokens (both models) |
| Concurrency | 5 concurrent requests |
| Hardware | NVIDIA DGX Spark (GB10), 128 GB unified memory |
| Inference Stack | vLLM 0.14.0rc1 on gogamza/unsloth-vllm-gb10:latest |
### Preliminary vs Full Benchmark
An initial benchmark with `--limit 200` (200 questions per subject) showed NVFP4 at 61.1% vs BF16 at 59.9% (+1.2 pts). The full benchmark (14K+ questions, no limit) confirmed and widened this result to 60.0% vs 57.5% (+2.5 pts), with correspondingly stronger statistical confidence.
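A back-of-envelope check (not from the source) of what the full run buys: treating each question as an independent Bernoulli trial (a simplification that ignores question-level correlation between the two models), the standard error of the accuracy difference at full-MMLU scale (~14,042 questions, an assumed count) is well under the observed gap:

```python
import math

def accuracy_se(p, n):
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(p * (1 - p) / n)

n = 14042  # approximate full MMLU test-set size (assumption)
se_nvfp4 = accuracy_se(0.600, n)
se_bf16 = accuracy_se(0.575, n)

# SE of the difference, treating the two runs as independent:
se_delta = math.sqrt(se_nvfp4**2 + se_bf16**2)
print(round(se_delta * 100, 2))  # ≈ 0.59 points, so +2.5 is roughly 4 SEs
```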
## Inference Speed
Measured separately: 5 runs of 500-token generation, single concurrent request, technical prompt.
| Model | Tokens/sec | Relative |
|---|---|---|
| NVFP4 | 20.7 tok/s | 2.55x |
| BF16 | 8.1 tok/s | 1.0x |
## Hardware Requirements
- GPU: NVIDIA Blackwell (SM100+) — GB10, GB100, GB200
- Memory: ~25 GB for model weights + KV cache overhead
- Tested on: DGX Spark (GB10), 128 GB unified LPDDR5x
This model will not run on pre-Blackwell GPUs (A100, H100, etc.). NVFP4 is a Blackwell-native format.
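Since the format is Blackwell-native, a load-time guard can fail fast on older hardware. A hedged helper (not part of the model card's tooling) that checks the reported CUDA compute capability:

```python
# NVFP4 kernels need Blackwell-class hardware, i.e. compute capability >= 10.0.
def supports_nvfp4(major, minor):
    """Return True if (major, minor) compute capability is SM100 or newer."""
    return (major, minor) >= (10, 0)

# With PyTorch and a GPU present, the capability would come from:
#   major, minor = torch.cuda.get_device_capability()
assert supports_nvfp4(10, 0)      # GB10 / GB200 (SM100+): OK
assert not supports_nvfp4(9, 0)   # H100 (SM90): refuse to load
assert not supports_nvfp4(8, 0)   # A100 (SM80): refuse to load
```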
## Usage

### vLLM
```bash
docker run -d --name mixtral-nvfp4 \
  --runtime nvidia --gpus all \
  -p 8001:8001 \
  -v /path/to/Mixtral-8x7B-Instruct-v0.1-NVFP4:/model \
  -v /path/to/cache:/root/.cache/vllm \
  gogamza/unsloth-vllm-gb10:latest \
  vllm serve /model \
    --host 0.0.0.0 --port 8001 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.3 \
    --served-model-name mixtral \
    --chat-template /model/chat_template.jinja
```
vLLM auto-detects NVFP4 quantization from `hf_quant_config.json`; no `--quantization` flag is needed.
### OpenAI-Compatible API
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="none")

response = client.chat.completions.create(
    model="mixtral",
    messages=[{"role": "user", "content": "Explain the Mixture of Experts architecture."}],
)
print(response.choices[0].message.content)
```
## Files
| File | Description |
|---|---|
| `model-*.safetensors` (6 shards) | NVFP4 quantized weights (~25 GB total) |
| `config.json` | Model config with quantization parameters |
| `hf_quant_config.json` | ModelOpt quantization metadata (auto-detected by vLLM) |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer files |
| `chat_template.jinja` | Chat template for instruct format |
| `special_tokens_map.json` | Special token definitions |
## Known Limitations
- Blackwell-only: Requires SM100+ hardware. Will not load on Ampere/Hopper.
- Perplexity trade-off: ~19% higher word perplexity vs BF16. Raw language modeling quality is reduced, though task performance is preserved.
- No KV cache quantization: Only weights and activations are quantized. KV cache remains in default precision.
- HumanEval: Not benchmarked — lm-eval's generative evaluation mode conflicts with chat template wrapping, producing 0.0 pass@1 for both NVFP4 and BF16.
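Because the KV cache stays in default precision, it can dominate memory at long contexts. A rough sizing sketch using Mixtral's published config (32 layers, 8 KV heads via grouped-query attention, head dim 128) at a 16-bit cache dtype:

```python
# Rough KV-cache sizing for Mixtral-8x7B with an unquantized bf16 cache.
layers, kv_heads, head_dim = 32, 8, 128  # from the Mixtral config
bytes_per_elem = 2                       # bf16: KV cache is not quantized

# One K and one V vector per layer, per token:
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(per_token)                  # 131072 bytes = 128 KiB per token
print(per_token * 4096 // 2**20)  # 512 MiB for a full 4096-token context
```

This is why the ~25 GB weight figure above should be padded with cache headroom when sizing deployments.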
## License
Apache 2.0
## Citation
```bibtex
@misc{brewhaha2026mixtral-nvfp4,
  title={Mixtral-8x7B-Instruct-v0.1-NVFP4: NVFP4 Quantization with Selective Layer Exclusion},
  author={SyntacticallySugary},
  year={2026},
  url={https://huggingface.co/SyntacticallySugary/Mixtral-8x7B-Instruct-v0.1-NVFP4}
}
```
## Acknowledgments
- Mistral AI for the base Mixtral-8x7B-Instruct model
- NVIDIA for ModelOpt, NVFP4 format, and the DGX Spark platform
- vLLM for inference serving with Blackwell NVFP4 support
- EleutherAI for the evaluation harness