Mistral-7B-Instruct-v0.2-FP4-W4A4

Model Description

This is an NVFP4 (NVIDIA FP4) quantized version of mistralai/Mistral-7B-Instruct-v0.2 using the compressed-tensors quantization method.

  • Base Model: mistralai/Mistral-7B-Instruct-v0.2
  • Quantization Method: compressed-tensors
  • Quantization Type: NVFP4 W4A4 (4-bit Weight and Activation)
  • Model Size: ~4.2GB (compared to ~14GB for BF16)
  • Compression Ratio: ~3.3x

Quantization Configuration

This model uses NVFP4 (NVIDIA FP4) quantization with grouped quantization for both weights and activations:

Weights

  • Precision: NVFP4 (4-bit floating point)
  • Strategy: Tensor-group (grouped quantization)
  • Group Size: 16
  • Symmetric: Yes
  • Dynamic: No (static quantization)
  • Observer: MinMax

Activations

  • Precision: NVFP4 (4-bit floating point)
  • Strategy: Tensor-group (grouped quantization)
  • Group Size: 16
  • Symmetric: Yes
  • Dynamic: Local (per-group scales are computed at runtime; the global per-tensor scale is calibrated statically)
  • Observer: MinMax

Other Details

  • Format: nvfp4-pack-quantized (packed 4-bit format)
  • KV Cache: Not quantized
  • Ignored Layers: lm_head
  • Target Layers: Linear layers
  • Quantization Version: 0.11.0
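The scheme above amounts to symmetric, group-wise 4-bit quantization: each group of 16 values shares one scale, and values snap to the FP4 (E2M1) representable set ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. The NumPy sketch below illustrates the round-trip only; the real packed format stores two values per byte with per-group FP8 (E4M3) scales plus a global FP32 scale, which this sketch skips.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format used by NVFP4.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_group(x, group_size=16):
    """Illustrative symmetric group-wise FP4 quantization.
    Each group of 16 values shares one scale, chosen so the group
    max-abs maps onto the largest FP4 magnitude (6.0)."""
    x = x.reshape(-1, group_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_LEVELS[-1]
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero groups
    scaled = x / scale
    # Snap each scaled value to the nearest representable FP4 level.
    signs = np.sign(scaled)
    idx = np.abs(np.abs(scaled)[..., None] - FP4_LEVELS).argmin(axis=-1)
    q = signs * FP4_LEVELS[idx]
    return q, scale

def dequantize(q, scale):
    return q * scale

x = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_nvfp4_group(x)
x_hat = dequantize(q, s).reshape(4, 16)
```

The small group size (16) limits the blast radius of outliers: one large value inflates the scale of only 15 neighbors rather than an entire channel or tensor, which is what makes 4-bit activations viable.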

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Mistral-7B-Instruct-v0.2-FP4-W4A4"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

# Generate text
messages = [
    {"role": "user", "content": "What is machine learning?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Model Architecture

  • Architecture: MistralForCausalLM
  • Hidden Size: 4096
  • Intermediate Size: 14336
  • Number of Layers: 32
  • Number of Attention Heads: 32
  • Number of KV Heads: 8
  • Vocabulary Size: 32000
  • Max Position Embeddings: 32768
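As a sanity check, the dimensions above roughly reproduce the 7B parameter count (norm weights are ignored as negligible; the head dimension of 128 is derived from hidden size / attention heads):

```python
# Rough parameter count from the architecture numbers above.
hidden, inter, layers = 4096, 14336, 32
heads, kv_heads, vocab = 32, 8, 32000
head_dim = hidden // heads                  # 128

attn = hidden * hidden * 2                  # q_proj, o_proj
attn += hidden * (kv_heads * head_dim) * 2  # k_proj, v_proj (GQA: 8 KV heads)
mlp = hidden * inter * 3                    # gate_proj, up_proj, down_proj
embeddings = vocab * hidden * 2             # embed_tokens + lm_head

total = layers * (attn + mlp) + embeddings
print(f"{total / 1e9:.2f}B parameters")     # 7.24B
```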

Intended Use

This quantized model is intended for efficient inference with a significantly reduced memory footprint while maintaining reasonable output quality. It is suitable for:

  • Resource-constrained environments
  • Edge deployment
  • Applications requiring minimal memory usage
  • High throughput scenarios
  • GPU inference with FP4 support

Limitations

  • FP4 quantization typically incurs greater accuracy loss than FP8 or INT8 quantization
  • Best performance requires hardware with native FP4 tensor cores (NVIDIA Blackwell GPUs); earlier architectures such as Hopper and Ada Lovelace lack FP4 tensor cores and fall back to slower emulation
  • Dynamic activation quantization may introduce additional runtime overhead
  • Grouped quantization requires compatible inference engines

Performance Notes

  • Memory Usage: ~3.3x reduction compared to BF16
  • Speed: Requires hardware with FP4 tensor core support for optimal performance
  • Accuracy: Expect some degradation relative to higher-precision formats; benchmark on your downstream tasks before deployment
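The memory figure can be reproduced with back-of-envelope arithmetic, assuming ~7.24B total parameters, 4 bits per quantized weight plus one 1-byte FP8 scale per group of 16, and the embedding table and lm_head kept in BF16 (the lm_head is listed as ignored above; treating the input embeddings the same way is an assumption). Global scales and config overhead are ignored.

```python
params = 7.24e9
embed_params = 2 * 32000 * 4096     # embed_tokens + lm_head, assumed BF16
quantized = params - embed_params

bytes_fp4 = quantized * 0.5         # 4 bits per value
bytes_scales = quantized / 16       # one 1-byte FP8 scale per group of 16
bytes_embed = embed_params * 2      # BF16 = 2 bytes per value
total_gb = (bytes_fp4 + bytes_scales + bytes_embed) / 1e9

bf16_gb = params * 2 / 1e9
print(f"{total_gb:.1f} GB vs {bf16_gb:.1f} GB, "
      f"ratio {bf16_gb / total_gb:.1f}x")   # 4.4 GB vs 14.5 GB, ratio 3.3x
```

The per-group scales add roughly 0.5 bits per parameter on top of the 4-bit weights, which is why the measured ratio (~3.3x) falls short of a naive 16/4 = 4x.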

Citation

If you use this model, please cite the original Mistral paper and the compressed-tensors library.

License

Same as the base model: Apache 2.0
