# Mistral-7B-Instruct-v0.2-FP8-W8A8

## Model Description

This is an FP8-quantized version of mistralai/Mistral-7B-Instruct-v0.2, produced with the compressed-tensors quantization format.
- Base Model: mistralai/Mistral-7B-Instruct-v0.2
- Quantization Method: compressed-tensors
- Quantization Type: FP8 W8A8 (8-bit weights and 8-bit activations)
- Model Size: ~7.0GB (compared to ~14GB for BF16)
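The size figures follow directly from the storage cost per parameter: FP8 uses one byte where BF16 uses two. A quick back-of-the-envelope check (ignoring non-quantized pieces such as `lm_head` and any metadata overhead, so the on-disk size differs slightly):

```python
# Rough weight-memory estimate for Mistral-7B at different precisions.
params = 7.24e9              # approximate parameter count of Mistral-7B

fp8_gb = params * 1 / 1e9    # 1 byte per parameter
bf16_gb = params * 2 / 1e9   # 2 bytes per parameter

print(f"FP8:  ~{fp8_gb:.1f} GB")   # ~7.2 GB
print(f"BF16: ~{bf16_gb:.1f} GB")  # ~14.5 GB
```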
## Quantization Configuration
This model uses FP8 per-tensor symmetric quantization for both weights and activations:
### Weights
- Precision: FP8 (8-bit floating point)
- Strategy: Per-tensor
- Symmetric: Yes
- Dynamic: No (static quantization)
- Observer: MinMax
### Activations
- Precision: FP8 (8-bit floating point)
- Strategy: Per-tensor
- Symmetric: Yes
- Dynamic: No (static quantization)
- Observer: MinMax
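Per-tensor symmetric static quantization with a MinMax observer boils down to one scale per tensor, derived from the largest absolute value seen during calibration. A minimal sketch of the idea (448 is the largest finite value in the FP8 E4M3 format; real kernels additionally round to the FP8 grid and cast to an FP8 dtype, which is omitted here):

```python
# Sketch of per-tensor symmetric FP8 scale calibration (MinMax observer).
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def minmax_scale(tensor):
    """One symmetric scale for the whole tensor: amax / FP8 max."""
    amax = max(abs(x) for x in tensor)
    return amax / FP8_E4M3_MAX

def fake_quantize(tensor):
    """Scale values into the representable FP8 range and clamp.
    (Rounding to the FP8 value grid is omitted in this sketch.)"""
    scale = minmax_scale(tensor)
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale)) for x in tensor]

vals = [0.5, -2.0, 1.25]
print(minmax_scale(vals))   # amax 2.0 divided by 448
print(fake_quantize(vals))  # ≈ [112.0, -448.0, 280.0]
```

Because the quantization is static, these scales are computed once on calibration data and baked into the checkpoint, rather than recomputed per input at inference time.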
### Other Details
- KV Cache: Not quantized
- Ignored Layers: lm_head
- Target Layers: Linear layers
- Quantization Version: 0.11.0
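In the compressed-tensors format, these settings are recorded in a `quantization_config` block in the model's `config.json`. An illustrative sketch of what that block looks like for this configuration (field names follow the compressed-tensors schema; the checkpoint's actual `config.json` is authoritative):

```json
"quantization_config": {
  "quant_method": "compressed-tensors",
  "version": "0.11.0",
  "ignore": ["lm_head"],
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8,
        "type": "float",
        "strategy": "tensor",
        "symmetric": true,
        "dynamic": false,
        "observer": "minmax"
      },
      "input_activations": {
        "num_bits": 8,
        "type": "float",
        "strategy": "tensor",
        "symmetric": true,
        "dynamic": false,
        "observer": "minmax"
      }
    }
  }
}
```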
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Mistral-7B-Instruct-v0.2-FP8-W8A8"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Generate text
messages = [
    {"role": "user", "content": "What is machine learning?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
## Model Architecture
- Architecture: MistralForCausalLM
- Hidden Size: 4096
- Intermediate Size: 14336
- Number of Layers: 32
- Number of Attention Heads: 32
- Number of KV Heads: 8
- Vocabulary Size: 32000
- Max Position Embeddings: 32768
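The dimensions above reproduce Mistral-7B's roughly 7.24B parameter count. A quick sanity check, using the fact that the 8 KV heads (grouped-query attention) shrink the k/v projections and that Mistral does not tie its input and output embeddings (layer norms, which contribute comparatively few parameters, are omitted):

```python
# Sanity-check the parameter count implied by the architecture above.
hidden, inter, layers = 4096, 14336, 32
heads, kv_heads, vocab = 32, 8, 32000

head_dim = hidden // heads      # 128
kv_dim = kv_heads * head_dim    # 1024 (grouped-query attention)

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o + k, v projections
mlp = 3 * hidden * inter                          # gate, up, down projections
embed = vocab * hidden                            # input embeddings
lm_head = vocab * hidden                          # untied output head

total = layers * (attn + mlp) + embed + lm_head
print(f"~{total / 1e9:.2f}B parameters")  # ~7.24B
```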
## Intended Use

This quantized model targets efficient inference with a reduced memory footprint while keeping output quality close to the base model. It is suitable for:
- Resource-constrained environments
- Applications requiring lower latency
- Batch inference scenarios
- GPU inference with FP8 support
## Limitations
- FP8 quantization may result in some accuracy loss compared to the full precision model
- Static quantization may not adapt to all input distributions optimally
- Best performance is achieved on hardware with native FP8 support (e.g., NVIDIA H100, Ada Lovelace GPUs)
## Citation
If you use this model, please cite the original Mistral paper and the compressed-tensors library.
## License
Same as the base model: Apache 2.0