MN-12B-Vespa-x1-heretic - FP8 Dynamic Quantization

This is an FP8 quantized version of MN-12B-Vespa-x1-heretic using llmcompressor with the FP8_DYNAMIC scheme.

Model Details

  • Base Model: MN-12B-Vespa-x1-heretic
  • Quantization: FP8_DYNAMIC (W8A8)
  • Format: compressed-tensors (SafeTensors)
  • Memory: ~50% of original BF16 size
  • Quality: <1-2% degradation on benchmarks (typical)
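
The quantization scheme is recorded in the checkpoint's config.json under quantization_config. A quick way to inspect it (a small sketch, assuming the standard compressed-tensors layout; field names can vary with the compressed-tensors version):

from transformers import AutoConfig

# quantization_config is read from config.json; for this checkpoint it should
# describe the FP8 weight/activation scheme produced by llmcompressor
cfg = AutoConfig.from_pretrained("REPO_ID")
print(cfg.quantization_config)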

Quick Start

vLLM (Recommended)

pip install vllm

# Serve the model
vllm serve REPO_ID \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95

# Python API
from vllm import LLM
llm = LLM(model="REPO_ID")
outputs = llm.generate("Hello, how are you?")
print(outputs[0].outputs[0].text)
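
Once vllm serve is running, it exposes an OpenAI-compatible API (on port 8000 by default). A minimal chat request against it, reusing the same REPO_ID placeholder:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "REPO_ID",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 128
  }'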

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "REPO_ID",
    device_map="auto",
    torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("REPO_ID")

messages = [{'role': 'user', 'content': 'Hello!'}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantization Details

This model was quantized with the following configuration; a minimal recipe sketch follows the list:

  • Tool: llmcompressor
  • Method: FP8_DYNAMIC (Round-to-Nearest)
  • Targets: All Linear layers except lm_head
  • Scheme: W8A8 (8-bit weights and activations)
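
The exact quantization script isn't published with this card, but a minimal llmcompressor recipe matching the settings above would look roughly like this sketch (the paths are placeholders, and the oneshot import has moved between llmcompressor versions; older releases expose it as llmcompressor.transformers.oneshot):

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# "BASE_MODEL_ID" and "OUTPUT_DIR" are placeholders, not the actual paths used
model = AutoModelForCausalLM.from_pretrained("BASE_MODEL_ID", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("BASE_MODEL_ID")

# FP8_DYNAMIC: static FP8 weights, dynamic per-token FP8 activations;
# no calibration data is needed, and lm_head is left unquantized
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format
model.save_pretrained("OUTPUT_DIR", save_compressed=True)
tokenizer.save_pretrained("OUTPUT_DIR")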

Performance

Memory Usage

  • Original BF16: ~2× the size of the FP8 checkpoint
  • FP8 Quantized: ~50% of original
  • Savings: ~50% VRAM reduction
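
As a rough sanity check for this 12B-parameter model: 12B parameters × 2 bytes ≈ 24 GB of weights in BF16, versus 12B parameters × 1 byte ≈ 12 GB in FP8, before KV cache and activation overhead.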

Inference Speed

  • Expect roughly 1.3-1.8× faster inference vs BF16 on GPUs with native FP8 support
  • Up to ~2× higher throughput (more VRAM left for KV cache)

Use Cases

Perfect for:

  • ✅ Production inference on limited VRAM
  • ✅ Running larger models on a single GPU
  • ✅ Cost-effective API serving
  • ✅ High-throughput applications
  • ✅ Extended context lengths (more KV cache)

Hardware Requirements

Minimum VRAM (approximate, for this 12B model in FP8):

  • ~12 GB for the weights alone (12B parameters × 1 byte)
  • 16 GB GPUs work for short contexts; ~24 GB is more comfortable for long contexts and larger batches

Recommended:

  • H100/H200 for best performance
  • vLLM for optimized serving
  • Enable FP8 KV cache for extended context

Important Notes

⚠️ Quantization Trade-offs:

  • Slight quality degradation (typically <1-2%)
  • Not suitable for fine-tuning (inference only)
  • Best with vLLM (has FP8 kernel optimizations)

✅ Best Practices (see the combined example after this list):

  • Use --kv-cache-dtype fp8 for longer contexts
  • Set --gpu-memory-utilization 0.90-0.95
  • Add --enforce-eager if you encounter compilation issues
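
Putting those flags together, a serving command that follows these recommendations might look like this (the context length and utilization values are only examples):

vllm serve REPO_ID \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8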

Citation

If you use this model, please cite:

@misc{model_name-fp8,
  author = {author},
  title = {model_name FP8 Dynamic Quantization},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/repo_id}
}

License

Inherits license from base model: MN-12B-Vespa-x1-heretic

Acknowledgments

Thanks to the authors of the base model, MN-12B-Vespa-x1-heretic, and to the llmcompressor and vLLM projects used for quantization and serving.

Want more FP8 models? Check out my other quantizations!
