Mixtral-8x7B-Instruct-v0.1-NVFP4

NVFP4 (E2M1) quantization of mistralai/Mixtral-8x7B-Instruct-v0.1 for NVIDIA Blackwell GPUs (SM100+).

This quantization reduces model weights from 87 GB to 25 GB (a 72% reduction) while improving MMLU accuracy by 2.5 points over the BF16 baseline. Inference throughput is 2.55x that of BF16.

Key Results

| Metric | NVFP4 | BF16 | Delta |
|---|---|---|---|
| MMLU Overall | 60.0% | 57.5% | +2.5% |
| Wikitext Perplexity | 7.36 | 6.20 | +1.17 |
| Inference Speed | 20.7 tok/s | 8.1 tok/s | 2.55x |
| Model Size | 25 GB | 87 GB | -72% |

NVFP4 preserves and slightly improves task-level reasoning (MMLU) while trading off raw language modeling fidelity (perplexity). The mechanism behind the MMLU improvement is not fully understood.

MMLU Breakdown (Full 14K+ Questions, 5-shot)

| Category | NVFP4 | BF16 | Delta |
|---|---|---|---|
| Overall | 60.0% | 57.5% | +2.5% |
| STEM | 51.7% | 49.4% | +2.3% |
| Humanities | 54.5% | 50.6% | +3.9% |
| Social Sciences | 68.5% | 67.5% | +1.0% |
| Other | 68.4% | 66.3% | +2.1% |

Both models were benchmarked under identical conditions: same hardware, same vLLM version, same context window (4096 tokens), same evaluation harness.

Wikitext Perplexity

| Metric | NVFP4 | BF16 |
|---|---|---|
| Word Perplexity | 7.364 | 6.196 |
| Bits per Byte | 0.539 | 0.492 |
| Byte Perplexity | 1.453 | 1.406 |

BF16 retains an advantage on raw next-token prediction, as expected for an unquantized model.
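
The two byte-level metrics are different views of the same loss: bits per byte is the base-2 logarithm of byte perplexity, so the table is internally consistent. A quick sanity check:

```python
import math

# bits_per_byte = log2(byte_perplexity); derive one column from the other.
for name, byte_ppl, reported_bpb in [("NVFP4", 1.453, 0.539), ("BF16", 1.406, 0.492)]:
    derived = math.log2(byte_ppl)
    print(f"{name}: log2({byte_ppl}) = {derived:.3f} (reported {reported_bpb})")
```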

Quantization Details

| Parameter | Value |
|---|---|
| Algorithm | NVFP4 (E2M1), max calibration |
| Tool | NVIDIA ModelOpt 0.41.0 (NVFP4_DEFAULT_CFG) |
| Group Size | 16 |
| Calibration Data | 512 samples (256 from codeparrot/codeparrot-clean, 256 from Open-Orca/SlimOrca) |
| Calibration Batch Size | 8, pre-tokenized, max_length=512 |
| Quantization Time | ~1 hour |
| Excluded Layers | lm_head, all self_attn layers, all block_sparse_moe.gate (router) layers |
| Export Format | HuggingFace safetensors via modelopt.torch.export.export_hf_checkpoint |
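
To illustrate what group-size-16 E2M1 quantization does, here is a simplified fake-quantization sketch. This is not ModelOpt's kernel: in the real NVFP4 format the per-group scales are themselves stored in FP8 (E4M3) with a second-level FP32 scale, whereas this sketch keeps the scale in full precision for clarity.

```python
import numpy as np

# E2M1 (1 sign, 2 exponent, 1 mantissa bit) has 8 non-negative magnitudes.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(weights: np.ndarray):
    """Fake-quantize one group of 16 weights to the E2M1 grid
    with a per-group max-abs scale (illustrative only)."""
    scale = np.abs(weights).max() / E2M1_GRID[-1]  # map the group max to 6.0
    if scale == 0:
        return weights.copy(), 1.0
    mags = np.abs(weights) / scale
    # Snap each magnitude to the nearest representable E2M1 value.
    idx = np.abs(mags[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(weights) * E2M1_GRID[idx] * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
wq, s = quantize_group(w)
print("max abs error:", np.abs(w - wq).max())
```

The coarse grid (only 8 magnitudes) is why per-group scaling matters: a group's largest weight sets the scale, and everything else is rounded relative to it.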

Why Exclude Self-Attention?

An initial quantization (v1) applied NVFP4 to all linear layers except embeddings and routers, yielding 57.0% MMLU — a 2.9% drop from BF16 baseline (measured with --limit 200). Excluding self-attention layers from quantization (v2) recovered that loss and pushed accuracy above baseline to 61.1% on the same subset and 60.0% on the full 14K evaluation. The MoE expert layers quantize well; attention layers do not.
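
The exclusion mechanism can be pictured as wildcard patterns that disable quantization for matching module names. The sketch below mimics ModelOpt's pattern-style config using plain fnmatch; the config keys, values, and layer names here are illustrative, not the exact script used for this model.

```python
import copy
import fnmatch

# Hypothetical base config: quantize every weight quantizer by default.
base_cfg = {"quant_cfg": {"*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}}}}

# v2: disable quantization for lm_head, attention, and MoE router layers.
cfg = copy.deepcopy(base_cfg)
for pattern in ["*lm_head*", "*self_attn*", "*block_sparse_moe.gate*"]:
    cfg["quant_cfg"][pattern] = {"enable": False}

def is_quantized(layer_name: str, cfg: dict) -> bool:
    # A layer is quantized unless some disable pattern matches its name.
    return not any(
        fnmatch.fnmatch(layer_name, pat)
        for pat, rule in cfg["quant_cfg"].items()
        if rule.get("enable", True) is False
    )

print(is_quantized("model.layers.0.block_sparse_moe.experts.3.w1", cfg))  # True
print(is_quantized("model.layers.0.self_attn.q_proj", cfg))               # False
```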

Benchmark Methodology

All benchmarks used lm-evaluation-harness v0.4.11.

| Parameter | Value |
|---|---|
| MMLU Task | mmlu_generative (57 subjects, generative format) |
| Few-shot | 5-shot |
| Wikitext Task | wikitext (perplexity via local-completions) |
| Context Window | 4096 tokens (both models) |
| Concurrency | 5 concurrent requests |
| Hardware | NVIDIA DGX Spark (GB10), 128 GB unified memory |
| Inference Stack | vLLM 0.14.0rc1 on gogamza/unsloth-vllm-gb10:latest |

Preliminary vs Full Benchmark

An initial benchmark with --limit 200 (200 questions per subject) showed NVFP4 at 61.1% vs BF16 at 59.9% (+1.2%). The full benchmark (14K+ questions, no limit) confirmed and widened this result to 60.0% vs 57.5% (+2.5%), providing stronger statistical confidence.
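
A rough way to see why the full run is more trustworthy: treating each question as an independent Bernoulli trial (which ignores per-subject correlation, so this understates the true uncertainty), the standard error at n = 14,042 is small relative to the observed gap.

```python
import math

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
def accuracy_se(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

n_full = 14042  # MMLU test-set size ("14K+ questions")
se = accuracy_se(0.60, n_full)
print(f"SE at 60% accuracy, n={n_full}: +/-{se:.2%}")        # ~0.41%
print(f"Observed gap: 2.5 pts = {0.025 / se:.1f} standard errors")
```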

Inference Speed

Measured separately: 5 runs of 500-token generation, single concurrent request, technical prompt.

| Model | Tokens/sec | Relative |
|---|---|---|
| NVFP4 | 20.7 tok/s | 2.55x |
| BF16 | 8.1 tok/s | 1.0x |
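
Single-request decode is typically memory-bandwidth bound, so the ideal speedup is roughly the ratio of weight bytes streamed per generated token; unquantized KV cache, activation traffic, and dequantization overhead keep the observed number below that ceiling. A back-of-envelope check using the sizes above:

```python
# Roofline intuition: each decoded token streams the active weights,
# so the ideal speedup is the ratio of model bytes read per token.
bf16_gb, nvfp4_gb = 87, 25
ideal = bf16_gb / nvfp4_gb
observed = 20.7 / 8.1
print(f"ideal (weight-bytes) speedup: {ideal:.2f}x")   # 3.48x
print(f"observed speedup:             {observed:.2f}x")  # 2.56x
```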

Hardware Requirements

  • GPU: NVIDIA Blackwell (SM100+) — GB10, GB100, GB200
  • Memory: ~25 GB for model weights + KV cache overhead
  • Tested on: DGX Spark (GB10), 128 GB unified LPDDR5x

This model will not run on pre-Blackwell GPUs (A100, H100, etc.). NVFP4 is a Blackwell-native format.

Usage

vLLM

```bash
docker run -d --name mixtral-nvfp4 \
    --runtime nvidia --gpus all \
    -p 8001:8001 \
    -v /path/to/Mixtral-8x7B-Instruct-v0.1-NVFP4:/model \
    -v /path/to/cache:/root/.cache/vllm \
    gogamza/unsloth-vllm-gb10:latest \
    vllm serve /model \
    --host 0.0.0.0 --port 8001 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.3 \
    --served-model-name mixtral \
    --chat-template /model/chat_template.jinja
```

vLLM auto-detects NVFP4 quantization from hf_quant_config.json. No --quantization flag needed.

OpenAI-Compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="none")
response = client.chat.completions.create(
    model="mixtral",
    messages=[{"role": "user", "content": "Explain the Mixture of Experts architecture."}],
)
print(response.choices[0].message.content)
```

Files

| File | Description |
|---|---|
| model-*.safetensors (6 shards) | NVFP4 quantized weights (~25 GB total) |
| config.json | Model config with quantization parameters |
| hf_quant_config.json | ModelOpt quantization metadata (auto-detected by vLLM) |
| tokenizer.json, tokenizer_config.json | Tokenizer files |
| chat_template.jinja | Chat template for the instruct format |
| special_tokens_map.json | Special token definitions |

Known Limitations

  • Blackwell-only: Requires SM100+ hardware. Will not load on Ampere/Hopper.
  • Perplexity trade-off: ~19% higher word perplexity vs BF16. Raw language modeling quality is reduced, though task performance is preserved.
  • No KV cache quantization: Only weights and activations are quantized. KV cache remains in default precision.
  • HumanEval: Not benchmarked — lm-eval's generative evaluation mode conflicts with chat template wrapping, producing 0.0 pass@1 for both NVFP4 and BF16.

License

Apache 2.0

Citation

```bibtex
@misc{brewhaha2026mixtral-nvfp4,
    title={Mixtral-8x7B-Instruct-v0.1-NVFP4: NVFP4 Quantization with Selective Layer Exclusion},
    author={SyntacticallySugary},
    year={2026},
    url={https://huggingface.co/SyntacticallySugary/Mixtral-8x7B-Instruct-v0.1-NVFP4}
}
```

Acknowledgments

  • Mistral AI for the base Mixtral-8x7B-Instruct model
  • NVIDIA for ModelOpt, NVFP4 format, and the DGX Spark platform
  • vLLM for inference serving with Blackwell NVFP4 support
  • EleutherAI for the evaluation harness