Qwen2.5-14B-Instruct-FP8-W8A8

Model Description

This is an FP8 quantized version of Qwen/Qwen2.5-14B-Instruct using the compressed-tensors quantization method.

  • Base Model: Qwen/Qwen2.5-14B-Instruct
  • Quantization Method: compressed-tensors
  • Quantization Type: FP8 W8A8 (8-bit Weight and Activation)
  • Model Size: ~16.3GB (compared to ~28GB for BF16)
  • Compression Ratio: ~1.7x
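The ~16.3GB figure can be sanity-checked with simple arithmetic. The sketch below assumes the config values listed under Model Architecture, FP8 linear weights at 1 byte per parameter, and BF16 embeddings and lm_head at 2 bytes per parameter; quantization scales and norm weights are small enough to neglect:

```python
# Back-of-envelope size estimate for the FP8 checkpoint.
# Dimensions are taken from the Model Architecture section of this card.
hidden, inter, layers, vocab = 5120, 13824, 48, 152064
kv_heads, head_dim = 8, 128

# Per-layer Linear parameters (quantized to FP8, 1 byte each):
attn = hidden * hidden + 2 * hidden * (kv_heads * head_dim) + hidden * hidden  # q, k, v, o projections
mlp = 3 * hidden * inter                                                        # gate, up, down projections
linear_params = layers * (attn + mlp)

# Embeddings and lm_head stay in BF16 (2 bytes each; lm_head is excluded from quantization):
bf16_params = 2 * vocab * hidden

size_gb = (linear_params * 1 + bf16_params * 2) / 1e9
print(round(size_gb, 1))  # ≈ 16.3
```

The estimate lands almost exactly on the stated checkpoint size, which also confirms that keeping lm_head in BF16 accounts for why the compression ratio is ~1.7x rather than a full 2x.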

Quantization Configuration

This model uses FP8 quantization with per-tensor quantization for both weights and activations:

Weights

  • Precision: FP8 (8-bit floating point)
  • Strategy: Per-tensor
  • Group Size: None (per-tensor)
  • Symmetric: Yes
  • Dynamic: No (static quantization)
  • Observer: MinMax

Activations

  • Precision: FP8 (8-bit floating point)
  • Strategy: Per-tensor
  • Group Size: None (per-tensor)
  • Symmetric: Yes
  • Dynamic: No (static quantization)
  • Observer: MinMax

Other Details

  • Format: float-quantized
  • KV Cache: Not quantized
  • Ignored Layers: lm_head
  • Target Layers: Linear layers
  • Quantization Version: 0.11.0
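The details above map onto the quantization_config embedded in the checkpoint's config.json. The dict below is a rough sketch of that structure; field names follow the compressed-tensors schema as commonly seen in FP8 W8A8 checkpoints, and the actual config.json is authoritative:

```python
# Approximate shape of the checkpoint's quantization_config (a sketch, not a
# verbatim copy; consult config.json in the repository for exact fields).
quantization_config = {
    "quant_method": "compressed-tensors",
    "format": "float-quantized",
    "version": "0.11.0",
    "ignore": ["lm_head"],
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 8, "type": "float", "strategy": "tensor",
                "symmetric": True, "dynamic": False, "observer": "minmax",
            },
            "input_activations": {
                "num_bits": 8, "type": "float", "strategy": "tensor",
                "symmetric": True, "dynamic": False, "observer": "minmax",
            },
        }
    },
}
```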

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Qwen2.5-14B-Instruct-FP8-W8A8"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

# Generate text
messages = [
    {"role": "user", "content": "What is machine learning?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Model Architecture

  • Architecture: Qwen2ForCausalLM
  • Hidden Size: 5120
  • Intermediate Size: 13824
  • Number of Layers: 48
  • Number of Attention Heads: 40
  • Number of KV Heads: 8
  • Vocabulary Size: 152064
  • Max Position Embeddings: 32768

Intended Use

This quantized model is intended for efficient inference with reduced memory footprint while maintaining high accuracy. It is suitable for:

  • Production deployment with reduced memory requirements
  • High-throughput inference scenarios
  • GPU inference with FP8 support
  • Applications where accuracy is important but memory savings are desired

Limitations

  • Best performance is achieved on hardware with native FP8 support (e.g., NVIDIA H100, Ada Lovelace, Blackwell GPUs)
  • Requires compatible inference engines that support FP8 computation
  • Static quantization may not adapt to varying input distributions

Performance Notes

  • Memory Usage: ~1.7x reduction compared to BF16
  • Speed: Requires hardware with FP8 tensor core support for optimal performance
  • Accuracy: Generally retains most of the original model's accuracy, with far less degradation than lower-precision formats such as FP4

Citation

If you use this model, please cite the original Qwen2.5 paper and the compressed-tensors library.

License

Same as the base model: Apache 2.0
