Mistral-7B-Instruct-v0.2-FP8-W8A8

Model Description

This is an FP8-quantized version of mistralai/Mistral-7B-Instruct-v0.2, produced with the compressed-tensors quantization method.

  • Base Model: mistralai/Mistral-7B-Instruct-v0.2
  • Quantization Method: compressed-tensors
  • Quantization Type: FP8 W8A8 (Weight and Activation)
  • Model Size: ~7.0 GB (compared to ~14 GB for BF16)
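The size figures follow directly from bytes per parameter. A back-of-envelope check (the ~7.24e9 parameter count is approximate, and per-tensor scales plus the unquantized lm_head add a little on top of the 1-byte-per-weight figure):

```python
# Rough size estimate: bytes per parameter times parameter count.
params = 7.24e9              # approximate parameter count for Mistral-7B
bf16_gb = params * 2 / 1e9   # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # FP8: 1 byte per parameter
print(f"BF16: ~{bf16_gb:.1f} GB, FP8: ~{fp8_gb:.1f} GB")
# → BF16: ~14.5 GB, FP8: ~7.2 GB
```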

Quantization Configuration

This model uses FP8 per-tensor symmetric quantization for both weights and activations:

Weights

  • Precision: FP8 (8-bit floating point)
  • Strategy: Per-tensor
  • Symmetric: Yes
  • Dynamic: No (static quantization)
  • Observer: MinMax

Activations

  • Precision: FP8 (8-bit floating point)
  • Strategy: Per-tensor
  • Symmetric: Yes
  • Dynamic: No (static quantization)
  • Observer: MinMax

Other Details

  • KV Cache: Not quantized
  • Ignored Layers: lm_head
  • Target Layers: Linear layers
  • Quantization Version: 0.11.0
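The per-tensor symmetric MinMax scheme above can be sketched in a few lines of plain Python. This is an illustration of the scale logic only, not the library's implementation: 448.0 is the largest finite FP8 E4M3 value, and the rounding to the nearest representable E4M3 value that real FP8 casting performs is omitted for clarity.

```python
# Illustrative per-tensor symmetric FP8 (E4M3) fake-quantization.
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def minmax_scale(tensor):
    """MinMax observer: map the observed absolute max onto FP8_E4M3_MAX."""
    amax = max(abs(v) for v in tensor)
    return amax / FP8_E4M3_MAX

def fake_quantize(tensor, scale):
    """Quantize-dequantize: scale into FP8 range, clamp, scale back.
    (Real FP8 also rounds to the nearest representable E4M3 value.)"""
    out = []
    for v in tensor:
        q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale))
        out.append(q * scale)
    return out

weights = [0.02, -1.5, 0.7, 3.0]
s = minmax_scale(weights)        # 3.0 / 448 ≈ 0.0067
print(fake_quantize(weights, s))
```

Because the quantization is static, the scale is fixed at calibration time; values that exceed the observed range at inference time are clamped, which is the limitation noted under "Limitations" below.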

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Mistral-7B-Instruct-v0.2-FP8-W8A8"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

# Generate text
messages = [
    {"role": "user", "content": "What is machine learning?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Model Architecture

  • Architecture: MistralForCausalLM
  • Hidden Size: 4096
  • Intermediate Size: 14336
  • Number of Layers: 32
  • Number of Attention Heads: 32
  • Number of KV Heads: 8
  • Vocabulary Size: 32000
  • Max Position Embeddings: 32768
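With 8 KV heads against 32 attention heads, the architecture uses grouped-query attention, which shrinks the KV cache fourfold relative to full multi-head attention. A quick calculation from the numbers above (BF16 cache, since the KV cache is not quantized):

```python
# KV cache size per token for this architecture.
layers, kv_heads, hidden, heads = 32, 8, 4096, 32
head_dim = hidden // heads        # 4096 / 32 = 128
bytes_per_elem = 2                # BF16 cache (KV cache is not quantized)
# 2 tensors (K and V) per layer
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token / 1024)  # → 128.0 (KiB per token)
```

At full multi-head attention (32 KV heads) the same calculation gives 512 KiB per token.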

Intended Use

This quantized model is intended for efficient inference with a reduced memory footprint while retaining most of the base model's accuracy. It is suitable for:

  • Resource-constrained environments
  • Applications requiring lower latency
  • Batch inference scenarios
  • GPU inference with FP8 support

Limitations

  • FP8 quantization may result in some accuracy loss compared to the full precision model
  • Static quantization may not adapt to all input distributions optimally
  • Best performance is achieved on hardware with native FP8 support (e.g., NVIDIA H100, Ada Lovelace GPUs)

Citation

If you use this model, please cite the original Mistral paper and the compressed-tensors library.

License

Same as the base model: Apache 2.0
