Mistral-7B-Instruct-v0.2-MXFP4-W4A4

Model Description

This is an MXFP4 (Microscaling FP4) quantized version of mistralai/Mistral-7B-Instruct-v0.2 using the compressed-tensors quantization method.

Base Model: mistralai/Mistral-7B-Instruct-v0.2
Quantization Method: compressed-tensors
Quantization Type: MXFP4 W4A4 (4-bit Weight and Activation)
Format: mxfp4-pack-quantized (Microscaling FP4)
Model Size: ~4.0GB (compared to ~15GB for BF16)
Compression Ratio: ~3.8x

Quantization Configuration

This model applies MXFP4 (Microscaling FP4) block-scaled quantization to both weights and activations. Each group of 32 elements shares a single E8M0 (8-bit, exponent-only) scale, following the OCP MX specification.
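
To make the block-scaling scheme concrete, the following is a minimal pure-Python sketch of quantizing and dequantizing one group of 32 values. It is not the compressed-tensors or CUTLASS implementation: the function names are hypothetical, the power-of-two scale is chosen as floor(log2(max |x|)) minus the largest E2M1 exponent, and rounding is simple nearest-value with ties going to the smaller magnitude, which is an assumption rather than the rounding mode used by real kernels.

```python
import math

# Magnitudes representable in FP4 E2M1; the sign bit is handled separately.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # largest E2M1 magnitude is 6.0 = 1.5 * 2**2
E8M0_BIAS = 127    # E8M0 stores a biased power-of-two exponent in one byte
GROUP_SIZE = 32


def quantize_group(values):
    """Quantize one group to (E8M0 scale byte, list of E2M1 values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return E8M0_BIAS, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    shared_exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    shared_exp = max(-E8M0_BIAS, min(E8M0_BIAS, shared_exp))
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp + E8M0_BIAS, quantized


def dequantize_group(scale_byte, quantized):
    """Recover approximate real values from the shared scale and E2M1 values."""
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    return [scale * q for q in quantized]


# Example group (hypothetical data): three non-zero values padded to 32 elements.
group = [1.0, -3.0, 0.25] + [0.0] * (GROUP_SIZE - 3)
scale_byte, q = quantize_group(group)
restored = dequantize_group(scale_byte, q)
print(scale_byte, q[:3], restored[:3])
```

Because the scale is restricted to a power of two, scaled values between 6 and 8 saturate to 6.0; this is one source of the accuracy loss that FP4 formats trade for their small footprint.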

Weights

Precision: FP4 E2M1 (4-bit floating point)
Scale Format: E8M0 (uint8 exponent)
Strategy: Group (block-scaled)
Group Size: 32
Symmetric: Yes
Dynamic: No (static quantization with calibration)

Activations

Precision: FP4 E2M1 (4-bit floating point)
Scale Format: E8M0 (uint8 exponent)
Strategy: Group (block-scaled)
Group Size: 32
Symmetric: Yes
Dynamic: Yes (dynamic quantization at inference time)

Other Details

KV Cache: Not quantized (remains in BF16)
Ignored Layers: lm_head
Target Layers: All Linear layers
Calibration: 512 samples from CNN/DailyMail, max_seq_length=2048

Hardware Requirements

MXFP4 inference requires NVIDIA Blackwell (SM120+) GPUs with CUDA 12.8+ for native CUTLASS MXFP4 GEMM support.
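
A quick way to check a machine against this requirement is to read the GPU's compute capability. The sketch below takes the SM120+ threshold stated above at face value; the helper names are hypothetical, and PyTorch is treated as an optional dependency so the script still runs without it.

```python
def meets_sm120(capability):
    """True if a (major, minor) compute capability is at least SM120."""
    major, minor = capability
    return (major, minor) >= (12, 0)


def local_gpu_capability():
    """Return the (major, minor) capability of GPU 0, or None if unavailable."""
    try:
        import torch  # optional dependency, used only for this check
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = local_gpu_capability()
if capability is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
elif meets_sm120(capability):
    print(f"SM{capability[0]}{capability[1]}: meets the SM120+ requirement.")
else:
    print(f"SM{capability[0]}{capability[1]}: below SM120, native MXFP4 GEMM is not expected.")
```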

Usage with vLLM

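from vllm import LLM, SamplingParams

model_id = "JongYeop/Mistral-7B-Instruct-v0.2-MXFP4-W4A4"

# enforce_eager=True disables CUDA graph capture and runs the model in eager mode.
llm = LLM(model=model_id, max_model_len=4096, enforce_eager=True)

# temperature=0 selects greedy decoding.
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=64, temperature=0),
)

# Each result holds one or more completions; print the first completion's text.
for output in outputs:
    print(output.outputs[0].text)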

Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Mistral-7B-Instruct-v0.2-MXFP4-W4A4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

messages = [
    {"role": "user", "content": "What is machine learning?"}
]

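# Format the conversation with the model's chat template and tokenize it.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Sample a response with nucleus sampling.
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)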

Model Architecture

Architecture: MistralForCausalLM
Hidden Size: 4096
Intermediate Size: 14336
Number of Layers: 32
Number of Attention Heads: 32
Number of KV Heads: 8 (GQA)
Vocabulary Size: 32000
Max Position Embeddings: 32768
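
The architecture values above are enough for a back-of-the-envelope size estimate. The sketch below assumes the standard Mistral layout (bias-free q/k/v/o and gate/up/down projections, two RMSNorm weights per layer plus a final norm, untied input embeddings and lm_head), that only the Linear layers inside the decoder blocks are quantized, and that each quantized weight costs 4 bits plus one 8-bit scale per 32 elements. It ignores file metadata, so it is an approximation rather than the exact on-disk size.

```python
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
HEADS = 32
KV_HEADS = 8
VOCAB = 32000

head_dim = HIDDEN // HEADS        # 128
kv_dim = KV_HEADS * head_dim      # 1024, smaller than HIDDEN because of GQA

attn_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim   # q_proj, o_proj, k_proj, v_proj
mlp_params = 3 * HIDDEN * INTERMEDIATE                    # gate_proj, up_proj, down_proj
linear_params = LAYERS * (attn_params + mlp_params)       # quantized to MXFP4

norm_params = LAYERS * 2 * HIDDEN + HIDDEN                # kept in BF16
unquantized_params = 2 * VOCAB * HIDDEN + norm_params     # embeddings, lm_head, norms
total_params = linear_params + unquantized_params

BITS_PER_QUANT_WEIGHT = 4 + 8 / 32   # FP4 element plus its share of the E8M0 scale
quantized_bytes = linear_params * BITS_PER_QUANT_WEIGHT / 8 + unquantized_params * 2
bf16_bytes = total_params * 2

print(f"total parameters: {total_params:,}")
print(f"estimated MXFP4 checkpoint: {quantized_bytes / 1e9:.2f} GB")
print(f"estimated BF16 checkpoint:  {bf16_bytes / 1e9:.2f} GB")
```

The estimate lands in the same ballpark as the ~4.0GB and ~15GB figures listed in the model description; small differences come from rounding and from the choice between GB and GiB.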

Differences from NVFP4

Feature	MXFP4	NVFP4
Scale Format	E8M0 (uint8 exponent)	E4M3 + FP32 global scale
Group Size	32	16
Standard	OCP MX Specification	NVIDIA proprietary
Hardware	SM120+ (Blackwell)	SM89+ (Ada/Hopper/Blackwell)
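
The scale format and group size rows translate directly into storage overhead. As a rough illustration (the helper name is hypothetical, and NVFP4's per-tensor FP32 global scale is ignored because it is negligible for large tensors), the effective bits per element can be computed as:

```python
def bits_per_element(element_bits, scale_bits, group_size):
    """Element width plus the per-element share of the group scale."""
    return element_bits + scale_bits / group_size


mxfp4_bits = bits_per_element(4, 8, 32)   # 8-bit E8M0 scale per 32 elements
nvfp4_bits = bits_per_element(4, 8, 16)   # 8-bit E4M3 scale per 16 elements

print(f"MXFP4: {mxfp4_bits} bits/element")   # 4.25
print(f"NVFP4: {nvfp4_bits} bits/element")   # 4.5
```

MXFP4 is therefore slightly more compact, while NVFP4 spends the extra scale bits on finer-grained groups and a scale with mantissa bits.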

Intended Use

This quantized model is intended for efficient inference with a reduced memory footprint (~4.0GB versus ~15GB for BF16). It is suitable for:

Deployment on NVIDIA Blackwell GPUs
Memory-constrained serving environments
High-throughput inference scenarios

Limitations

Requires NVIDIA Blackwell (SM120+) GPUs for native MXFP4 GEMM support
FP4 quantization may result in some accuracy degradation compared to FP8 or BF16
KV cache remains in BF16 (not quantized)

License

Same as the base model: Apache 2.0
