# Qwen2.5-14B-Instruct-FP8-W8A8

## Model Description
This is an FP8 quantized version of Qwen/Qwen2.5-14B-Instruct using the compressed-tensors quantization method.
- Base Model: Qwen/Qwen2.5-14B-Instruct
- Quantization Method: compressed-tensors
- Quantization Type: FP8 W8A8 (8-bit Weight and Activation)
- Model Size: ~16.3GB (compared to ~28GB for BF16)
- Compression Ratio: ~1.7x
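The size figures above can be roughly reproduced from the parameter counts. A back-of-envelope sketch, assuming ~14.8B total parameters and that `embed_tokens` and `lm_head` (each `vocab_size * hidden_size` parameters) stay in BF16, since only Linear layers are quantized and `lm_head` is explicitly ignored; the exact on-disk figures differ somewhat depending on metadata and the precise parameter count:

```python
# Back-of-envelope checkpoint sizes (assumption: ~14.8B total parameters;
# embed_tokens and lm_head each hold vocab_size * hidden_size params and
# stay in BF16, since only Linear layers are targeted and lm_head is ignored)
total_params = 14.8e9
unquantized = 2 * 152064 * 5120             # embed_tokens + lm_head, kept in BF16
fp8_gb = ((total_params - unquantized) * 1 + unquantized * 2) / 1e9
bf16_gb = total_params * 2 / 1e9
print(round(fp8_gb, 1), round(bf16_gb, 1))  # ~16.4 vs ~29.6
```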
## Quantization Configuration
This model uses FP8 quantization with per-tensor quantization for both weights and activations:
### Weights
- Precision: FP8 (8-bit floating point)
- Strategy: Per-tensor
- Group Size: None (per-tensor)
- Symmetric: Yes
- Dynamic: No (static quantization)
- Observer: MinMax
### Activations
- Precision: FP8 (8-bit floating point)
- Strategy: Per-tensor
- Group Size: None (per-tensor)
- Symmetric: Yes
- Dynamic: No (static quantization)
- Observer: MinMax
### Other Details
- Format: float-quantized
- KV Cache: Not quantized
- Ignored Layers: lm_head
- Target Layers: Linear layers
- Quantization Version: 0.11.0
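The per-tensor symmetric scheme above can be sketched numerically. A minimal sketch, assuming FP8 E4M3 (largest finite value 448) and modelling only the dynamic-range clipping that the MinMax scale controls, not FP8's reduced mantissa precision:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def minmax_per_tensor_scale(w):
    # MinMax observer, symmetric: one scale maps the largest |value| to FP8 max
    return np.abs(w).max() / FP8_E4M3_MAX

def fake_quantize_fp8(w):
    # Round-trip w through the FP8 dynamic range (clipping only; the
    # mantissa rounding of real FP8 is not modelled in this sketch)
    scale = minmax_per_tensor_scale(w)
    q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q * scale, scale

w = np.array([-2.0, 0.5, 1.0])
deq, scale = fake_quantize_fp8(w)
```

Values inside the calibrated range survive the round-trip; a single per-tensor scale means one outlier weight stretches the range for the whole tensor.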
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Qwen2.5-14B-Instruct-FP8-W8A8"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Generate text
messages = [
    {"role": "user", "content": "What is machine learning?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
## Model Architecture
- Architecture: Qwen2ForCausalLM
- Hidden Size: 5120
- Intermediate Size: 13824
- Number of Layers: 48
- Number of Attention Heads: 40
- Number of KV Heads: 8
- Vocabulary Size: 152064
- Max Position Embeddings: 32768
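The GQA layout above (8 KV heads against 40 attention heads) also fixes the KV-cache footprint, which is not quantized in this model. A back-of-envelope calculation, assuming BF16 cache entries:

```python
# Per-token KV-cache size implied by the architecture numbers above
num_layers = 48
num_kv_heads = 8
head_dim = 5120 // 40      # hidden_size / num_attention_heads = 128
bytes_per_elem = 2         # BF16; the KV cache is not quantized
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
print(kv_bytes_per_token)  # 196608 bytes, i.e. 192 KiB per token
```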
## Intended Use
This quantized model is intended for efficient inference with reduced memory footprint while maintaining high accuracy. It is suitable for:
- Production deployment with reduced memory requirements
- High throughput inference scenarios
- GPU inference with FP8 support
- Applications where accuracy is important but memory savings are desired
## Limitations
- Best performance is achieved on hardware with native FP8 support (e.g., NVIDIA H100, Ada Lovelace, Blackwell GPUs)
- Requires compatible inference engines that support FP8 computation
- Static quantization may not adapt to varying input distributions
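The last point can be illustrated: with a static per-tensor scale frozen at calibration time, any activation outlier beyond the calibrated range is saturated. A toy sketch (clipping only, FP8 E4M3 maximum of 448 assumed):

```python
import numpy as np

FP8_E4M3_MAX = 448.0

# Static scale frozen from a calibration set of "typical" activations
calib = np.array([-3.0, 0.5, 2.0])
scale = np.abs(calib).max() / FP8_E4M3_MAX

# An outlier activation well outside the calibrated range gets clipped
x = 10.0
q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX) * scale
print(q)  # ≈ 3.0: the outlier is saturated to the calibrated maximum
```

Dynamic quantization would recompute the scale per input and avoid this clipping, at extra runtime cost.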
## Performance Notes
- Memory Usage: ~1.7x reduction compared to BF16
- Speed: Requires hardware with FP8 tensor core support for optimal performance
- Accuracy: Generally retains most of the original model's accuracy, with noticeably less degradation than lower-bit formats such as FP4
## Citation
If you use this model, please cite the original Qwen2.5 paper and the compressed-tensors library.
## License
Same as the base model: Apache 2.0