Mistral-7B-Instruct-v0.2-FP4-W4A4

Model Description

This is an NVFP4 (NVIDIA FP4) quantized version of mistralai/Mistral-7B-Instruct-v0.2 using the compressed-tensors quantization method.

  • Base Model: mistralai/Mistral-7B-Instruct-v0.2
  • Quantization Method: compressed-tensors
  • Quantization Type: NVFP4 W4A4 (4-bit Weight and Activation)
  • Model Size: ~4.2GB (compared to ~14GB for BF16)
  • Compression Ratio: ~3.3x

Quantization Configuration

This model uses NVFP4 (NVIDIA FP4) quantization with grouped quantization for both weights and activations:

Weights

  • Precision: NVFP4 (4-bit floating point)
  • Strategy: Tensor-group (grouped quantization)
  • Group Size: 16
  • Symmetric: Yes
  • Dynamic: No (static quantization)
  • Observer: MinMax

Activations

  • Precision: NVFP4 (4-bit floating point)
  • Strategy: Tensor-group (grouped quantization)
  • Group Size: 16
  • Symmetric: Yes
  • Dynamic: Local (per-group scales are computed at runtime; the global per-tensor scale is calibrated statically)
  • Observer: MinMax

Other Details

  • Format: nvfp4-pack-quantized (packed 4-bit format)
  • KV Cache: Not quantized
  • Ignored Layers: lm_head
  • Target Layers: Linear layers
  • Quantization Version: 0.11.0
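The scheme above amounts to symmetric, group-wise 4-bit quantization: each group of 16 values shares one scale, and values snap to the FP4 (E2M1) representable set ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. The NumPy sketch below illustrates the round-trip only; the real packed format stores two values per byte with per-group FP8 (E4M3) scales plus a global FP32 scale, which this sketch skips.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format used by NVFP4.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_group(x, group_size=16):
    """Illustrative symmetric group-wise FP4 quantization.
    Each group of 16 values shares one scale, chosen so the group
    max-abs maps onto the largest FP4 magnitude (6.0)."""
    x = x.reshape(-1, group_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_LEVELS[-1]
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero groups
    scaled = x / scale
    # Snap each scaled value to the nearest representable FP4 level.
    signs = np.sign(scaled)
    idx = np.abs(np.abs(scaled)[..., None] - FP4_LEVELS).argmin(axis=-1)
    q = signs * FP4_LEVELS[idx]
    return q, scale

def dequantize(q, scale):
    return q * scale

x = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_nvfp4_group(x)
x_hat = dequantize(q, s).reshape(4, 16)
```

The small group size (16) limits the blast radius of outliers: one large value inflates the scale of only 15 neighbors rather than an entire channel or tensor, which is what makes 4-bit activations viable.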

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Mistral-7B-Instruct-v0.2-FP4-W4A4"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

# Generate text
messages = [
    {"role": "user", "content": "What is machine learning?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Model Architecture

  • Architecture: MistralForCausalLM
  • Hidden Size: 4096
  • Intermediate Size: 14336
  • Number of Layers: 32
  • Number of Attention Heads: 32
  • Number of KV Heads: 8
  • Vocabulary Size: 32000
  • Max Position Embeddings: 32768
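As a sanity check, the dimensions above roughly reproduce the 7B parameter count (norm weights are ignored as negligible; the head dimension of 128 is derived from hidden size / attention heads):

```python
# Rough parameter count from the architecture numbers above.
hidden, inter, layers = 4096, 14336, 32
heads, kv_heads, vocab = 32, 8, 32000
head_dim = hidden // heads                  # 128

attn = hidden * hidden * 2                  # q_proj, o_proj
attn += hidden * (kv_heads * head_dim) * 2  # k_proj, v_proj (GQA: 8 KV heads)
mlp = hidden * inter * 3                    # gate_proj, up_proj, down_proj
embeddings = vocab * hidden * 2             # embed_tokens + lm_head

total = layers * (attn + mlp) + embeddings
print(f"{total / 1e9:.2f}B parameters")     # 7.24B
```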

Intended Use

This quantized model is intended for efficient inference with a significantly reduced memory footprint while maintaining reasonable output quality. It is suitable for:

  • Resource-constrained environments
  • Edge deployment
  • Applications requiring minimal memory usage
  • High throughput scenarios
  • GPU inference with FP4 support

Limitations

  • FP4 quantization typically incurs greater accuracy loss than FP8 or INT8 quantization
  • Best performance requires hardware with native FP4 tensor cores (NVIDIA Blackwell GPUs); earlier architectures such as Hopper and Ada Lovelace lack FP4 tensor cores and fall back to slower emulation
  • Dynamic activation quantization may introduce additional runtime overhead
  • Grouped quantization requires compatible inference engines

Performance Notes

  • Memory Usage: ~3.3x reduction compared to BF16
  • Speed: Requires hardware with FP4 tensor core support for optimal performance
  • Accuracy: Expect some degradation relative to higher-precision formats; benchmark on your downstream tasks before deployment
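The memory figure can be reproduced with back-of-envelope arithmetic, assuming ~7.24B total parameters, 4 bits per quantized weight plus one 1-byte FP8 scale per group of 16, and the embedding table and lm_head kept in BF16 (the lm_head is listed as ignored above; treating the input embeddings the same way is an assumption). Global scales and config overhead are ignored.

```python
params = 7.24e9
embed_params = 2 * 32000 * 4096     # embed_tokens + lm_head, assumed BF16
quantized = params - embed_params

bytes_fp4 = quantized * 0.5         # 4 bits per value
bytes_scales = quantized / 16       # one 1-byte FP8 scale per group of 16
bytes_embed = embed_params * 2      # BF16 = 2 bytes per value
total_gb = (bytes_fp4 + bytes_scales + bytes_embed) / 1e9

bf16_gb = params * 2 / 1e9
print(f"{total_gb:.1f} GB vs {bf16_gb:.1f} GB, "
      f"ratio {bf16_gb / total_gb:.1f}x")   # 4.4 GB vs 14.5 GB, ratio 3.3x
```

The per-group scales add roughly 0.5 bits per parameter on top of the 4-bit weights, which is why the measured ratio (~3.3x) falls short of a naive 16/4 = 4x.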

Citation

If you use this model, please cite the original Mistral paper and the compressed-tensors library.

License

Same as the base model: Apache 2.0
