Mistral-7B-Instruct-v0.2-FP4-W4A4
Model Description
This is an NVFP4 (NVIDIA FP4) quantized version of mistralai/Mistral-7B-Instruct-v0.2 using the compressed-tensors quantization method.
- Base Model: mistralai/Mistral-7B-Instruct-v0.2
- Quantization Method: compressed-tensors
- Quantization Type: NVFP4 W4A4 (4-bit weights and 4-bit activations)
- Model Size: ~4.2GB (compared to ~14GB for BF16)
- Compression Ratio: ~3.3x
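The headline size and compression numbers can be sanity-checked with rough arithmetic (a sketch, not an exact accounting: the ~7.24B parameter count, one FP8 scale byte per 16-weight group, and a BF16 lm_head are assumptions, and global scales and metadata are ignored):

```python
# Back-of-the-envelope storage estimate for NVFP4 W4 (illustrative).
total_params = 7.24e9                 # approximate Mistral-7B parameter count
lm_head_params = 32000 * 4096         # lm_head is left unquantized (see below)
quantized = total_params - lm_head_params

bytes_weights = quantized * 0.5       # 4 bits per quantized weight
bytes_scales = quantized / 16         # ~1 scale byte per group of 16
bytes_lm_head = lm_head_params * 2    # BF16 = 2 bytes per parameter

total_gb = (bytes_weights + bytes_scales + bytes_lm_head) / 1e9
bf16_gb = total_params * 2 / 1e9
print(f"estimated size: ~{total_gb:.1f} GB")
print(f"estimated compression: ~{bf16_gb / total_gb:.1f}x vs BF16")
```

The estimate lands in the same ballpark as the reported ~4.2GB and ~3.3x figures above.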
Quantization Configuration
This model uses NVFP4 (NVIDIA FP4) quantization with grouped quantization for both weights and activations:
Weights
- Precision: NVFP4 (4-bit floating point)
- Strategy: Tensor-group (grouped quantization)
- Group Size: 16
- Symmetric: Yes
- Dynamic: No (static quantization)
- Observer: MinMax
Activations
- Precision: NVFP4 (4-bit floating point)
- Strategy: Tensor-group (grouped quantization)
- Group Size: 16
- Symmetric: Yes
- Dynamic: Local (per-group scales computed at runtime; the global scale is calibrated statically)
- Observer: MinMax
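Conceptually, the group-wise symmetric scheme described above maps each group of 16 values onto the FP4 (E2M1) grid with a shared scale. The sketch below is a simplified illustration, not the compressed-tensors implementation; the grid values and nearest-value rounding are assumptions:

```python
# Illustrative group-wise symmetric quantization to the FP4 E2M1 grid.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_group(values, grid=FP4_GRID):
    """Quantize one group of floats to signed FP4 with a shared scale."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / grid[-1]  # map the group's max magnitude onto the largest FP4 value
    quantized = []
    for v in values:
        mag = min(grid, key=lambda g: abs(abs(v) / scale - g))  # nearest grid value
        quantized.append(mag if v >= 0 else -mag)
    return quantized, scale

def dequantize_group(quantized, scale):
    return [q * scale for q in quantized]

# Quantize a row in groups of 16 (the group size used by this model).
row = [0.01 * i for i in range(32)]
groups = [row[i:i + 16] for i in range(0, len(row), 16)]
recon = []
for g in groups:
    q, s = quantize_group(g)
    recon.extend(dequantize_group(q, s))
```

Because each group of 16 carries its own scale, outliers in one group do not degrade the resolution of the others, which is the main motivation for grouped over per-tensor quantization.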
Other Details
- Format: nvfp4-pack-quantized (packed 4-bit format)
- KV Cache: Not quantized
- Ignored Layers: lm_head
- Target Layers: Linear layers
- Quantization Version: 0.11.0
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Mistral-7B-Instruct-v0.2-FP4-W4A4"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Generate text
messages = [
    {"role": "user", "content": "What is machine learning?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
Model Architecture
- Architecture: MistralForCausalLM
- Hidden Size: 4096
- Intermediate Size: 14336
- Number of Layers: 32
- Number of Attention Heads: 32
- Number of KV Heads: 8
- Vocabulary Size: 32000
- Max Position Embeddings: 32768
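The architecture numbers above can be cross-checked by tallying parameters per component (illustrative arithmetic; untied input/output embeddings and head_dim = hidden_size / num_attention_heads are assumptions):

```python
# Rough parameter count from the listed architecture configuration.
hidden, inter, layers = 4096, 14336, 32
heads, kv_heads, vocab = 32, 8, 32000
head_dim = hidden // heads            # 128
kv_dim = kv_heads * head_dim          # 1024 (grouped-query attention)

embed = vocab * hidden                            # token embeddings
attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q/o plus smaller k/v projections
mlp = 3 * hidden * inter                          # gate, up, down projections
norms = 2 * hidden                                # two RMSNorms per layer
lm_head = vocab * hidden                          # output head

total = embed + layers * (attn + mlp + norms) + hidden + lm_head
print(f"~{total / 1e9:.2f}B parameters")
```

Note how the 8 KV heads shrink the k/v projections to a quarter of the q/o projections, which is what makes the KV cache comparatively cheap.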
Intended Use
This quantized model is intended for efficient inference with a significantly reduced memory footprint while maintaining reasonable output quality. It is suitable for:
- Resource-constrained environments
- Edge deployment
- Applications requiring minimal memory usage
- High throughput scenarios
- GPU inference with FP4 support
Limitations
- FP4 quantization typically incurs more accuracy loss than FP8 or INT8 quantization
- Best performance requires hardware with native FP4 tensor core support (NVIDIA Blackwell GPUs, e.g., B200/GB200); earlier architectures such as Hopper and Ada Lovelace lack FP4 tensor cores and must fall back to dequantization, reducing the speed benefit
- Dynamic activation quantization may introduce additional runtime overhead
- Grouped quantization requires compatible inference engines
Performance Notes
- Memory Usage: ~3.3x reduction compared to BF16
- Speed: Requires hardware with FP4 tensor core support (NVIDIA Blackwell) for optimal performance
- Accuracy: May experience some degradation compared to higher precision formats
Citation
If you use this model, please cite the original Mistral paper and the compressed-tensors library.
License
Same as the base model: Apache 2.0