Mistral-7B-Instruct-v0.2-MXFP4-W4A4

Model Description

This is an MXFP4 (Microscaling FP4) quantized version of mistralai/Mistral-7B-Instruct-v0.2 using the compressed-tensors quantization method.

  • Base Model: mistralai/Mistral-7B-Instruct-v0.2
  • Quantization Method: compressed-tensors
  • Quantization Type: MXFP4 W4A4 (4-bit Weight and Activation)
  • Format: mxfp4-pack-quantized (MX Microscaling FP4)
  • Model Size: ~4.0GB (compared to ~15GB for BF16)
  • Compression Ratio: ~3.8x
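
The size figure can be sanity-checked with a bit of arithmetic from the model's dimensions. The sketch below is a rough estimate: it ignores the tiny RMSNorm weights and assumes the embedding and lm_head stay in BF16 (consistent with the ignored-layers list further down).

```python
# Rough MXFP4 checkpoint-size estimate for Mistral-7B.
hidden, inter, layers, kv_heads, head_dim, vocab = 4096, 14336, 32, 8, 128, 32000

# Linear-layer parameters per transformer block (q/k/v/o + gate/up/down).
attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim
mlp = 3 * hidden * inter
linear_params = layers * (attn + mlp)

fp4_bytes = linear_params // 2          # 4 bits per quantized weight
scale_bytes = linear_params // 32       # one E8M0 byte per group of 32
bf16_bytes = 2 * (2 * vocab * hidden)   # embeddings + lm_head kept in BF16
total_gib = (fp4_bytes + scale_bytes + bf16_bytes) / 2**30
print(f"{total_gib:.2f} GiB")           # ~3.94 GiB, close to the ~4.0GB above
```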

Quantization Configuration

This model applies MXFP4 (Microscaling FP4) block-scaled quantization (group size 32) to both weights and activations. Values are stored in FP4 E2M1, and each group of 32 elements shares an E8M0 (8-bit, exponent-only) scale, following the OCP MX specification.
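
To illustrate the scheme (not the actual kernel implementation), here is a minimal NumPy sketch of quantizing one group of 32 values, assuming round-to-nearest onto the E2M1 grid and the OCP MX rule for choosing the shared power-of-two scale:

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign is handled separately).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Quantize one group of 32 values: a shared E8M0 power-of-two scale
    plus per-element E2M1 values (the decoded 4-bit codes)."""
    amax = np.max(np.abs(block))
    if amax == 0:
        return 1.0, np.zeros_like(block)
    # Per the OCP MX spec, the shared scale exponent is floor(log2(amax))
    # minus E2M1's maximum exponent (2, since E2M1's max magnitude is 6 = 1.5 * 2**2).
    scale = 2.0 ** (int(np.floor(np.log2(amax))) - 2)
    # Round each scaled element to the nearest representable E2M1 magnitude.
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_VALUES[None, :]).argmin(axis=1)
    codes = np.sign(scaled) * E2M1_VALUES[idx]
    return scale, codes

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
scale, codes = mxfp4_quantize_block(x)
x_hat = scale * codes  # dequantize
```

Weights go through this once at quantization time (with calibration); activations are quantized the same way on the fly at inference, per the "Dynamic: Yes" setting below.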

Weights

  • Precision: FP4 E2M1 (4-bit floating point)
  • Scale Format: E8M0 (uint8 exponent)
  • Strategy: Group (block-scaled)
  • Group Size: 32
  • Symmetric: Yes
  • Dynamic: No (static quantization with calibration)

Activations

  • Precision: FP4 E2M1 (4-bit floating point)
  • Scale Format: E8M0 (uint8 exponent)
  • Strategy: Group (block-scaled)
  • Group Size: 32
  • Symmetric: Yes
  • Dynamic: Yes (dynamic quantization at inference time)

Other Details

  • KV Cache: Not quantized (remains in BF16)
  • Ignored Layers: lm_head
  • Target Layers: All Linear layers
  • Calibration: 512 samples from CNN/DailyMail, max_seq_length=2048
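
For reference, the settings above correspond to a compressed-tensors `quantization_config` along these lines. This is an abbreviated, illustrative sketch; the authoritative version is the `config.json` shipped with this model.

```json
{
  "quant_method": "compressed-tensors",
  "format": "mxfp4-pack-quantized",
  "ignore": ["lm_head"],
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 4, "type": "float", "strategy": "group",
        "group_size": 32, "symmetric": true, "dynamic": false
      },
      "input_activations": {
        "num_bits": 4, "type": "float", "strategy": "group",
        "group_size": 32, "symmetric": true, "dynamic": true
      }
    }
  }
}
```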

Hardware Requirements

MXFP4 inference requires NVIDIA Blackwell (SM120+) GPUs with CUDA 12.8+ for native CUTLASS MXFP4 GEMM support.
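
A quick way to check whether a GPU meets this bar is to compare its CUDA compute capability against (12, 0), which corresponds to SM120. The helper name below is hypothetical; at runtime you would pass it the pair returned by `torch.cuda.get_device_capability()`.

```python
def supports_native_mxfp4(major: int, minor: int) -> bool:
    """True if a GPU's compute capability meets the SM120 (Blackwell) floor
    required for native CUTLASS MXFP4 GEMM kernels."""
    return (major, minor) >= (12, 0)

print(supports_native_mxfp4(12, 0))  # Blackwell SM120 -> True
print(supports_native_mxfp4(9, 0))   # Hopper H100 (SM90) -> False
```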

Usage with vLLM

from vllm import LLM, SamplingParams

model_id = "JongYeop/Mistral-7B-Instruct-v0.2-MXFP4-W4A4"

llm = LLM(model=model_id, max_model_len=4096, enforce_eager=True)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=64, temperature=0)
)

for output in outputs:
    print(output.outputs[0].text)

Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Mistral-7B-Instruct-v0.2-MXFP4-W4A4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

messages = [
    {"role": "user", "content": "What is machine learning?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Model Architecture

  • Architecture: MistralForCausalLM
  • Hidden Size: 4096
  • Intermediate Size: 14336
  • Number of Layers: 32
  • Number of Attention Heads: 32
  • Number of KV Heads: 8 (GQA)
  • Vocabulary Size: 32000
  • Max Position Embeddings: 32768
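
The listed dimensions tie back to the ~7B parameter count. A rough tally (ignoring the small RMSNorm weights):

```python
hidden, inter, layers = 4096, 14336, 32
heads, kv_heads, vocab = 32, 8, 32000
head_dim = hidden // heads  # 128

embed = vocab * hidden                        # token embeddings
attn = 2 * hidden * hidden                    # q_proj + o_proj
attn += 2 * hidden * kv_heads * head_dim      # k_proj + v_proj (GQA: 8 KV heads)
mlp = 3 * hidden * inter                      # gate_proj + up_proj + down_proj
total = embed + layers * (attn + mlp) + vocab * hidden  # last term: untied lm_head
print(f"{total / 1e9:.2f}B parameters")       # prints "7.24B parameters"
```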

Differences from NVFP4

  Feature        MXFP4                   NVFP4
  Scale Format   E8M0 (uint8 exponent)   E4M3 + FP32 global scale
  Group Size     32                      16
  Standard       OCP MX Specification    NVIDIA proprietary
  Hardware       SM120+ (Blackwell)      SM89+ (Ada/Hopper/Blackwell)

Intended Use

This quantized model is intended for efficient inference with significantly reduced memory footprint. It is suitable for:

  • Deployment on NVIDIA Blackwell GPUs
  • Memory-constrained serving environments
  • High-throughput inference scenarios

Limitations

  • Requires NVIDIA Blackwell (SM120+) GPUs for native MXFP4 GEMM support
  • FP4 quantization may result in some accuracy degradation compared to FP8 or BF16
  • KV cache remains in BF16 (not quantized)

License

Same as the base model: Apache 2.0
