# Qwen2.5-14B-Instruct-FP8-W8A8

## Model Description
This is an FP8 quantized version of Qwen/Qwen2.5-14B-Instruct using the compressed-tensors quantization method.
- Base Model: Qwen/Qwen2.5-14B-Instruct
- Quantization Method: compressed-tensors
- Quantization Type: FP8 W8A8 (8-bit Weight and Activation)
- Model Size: ~16.3GB (compared to ~28GB for BF16)
- Compression Ratio: ~1.7x
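The size figures above can be roughly reproduced from the parameter counts. A back-of-envelope sketch, assuming ~14.8B total parameters and that `embed_tokens` and `lm_head` (each `vocab_size * hidden_size` parameters) stay in BF16, since only Linear layers are quantized and `lm_head` is explicitly ignored; the exact on-disk figures differ somewhat depending on metadata and the precise parameter count:

```python
# Back-of-envelope checkpoint sizes (assumption: ~14.8B total parameters;
# embed_tokens and lm_head each hold vocab_size * hidden_size params and
# stay in BF16, since only Linear layers are targeted and lm_head is ignored)
total_params = 14.8e9
unquantized = 2 * 152064 * 5120             # embed_tokens + lm_head, kept in BF16
fp8_gb = ((total_params - unquantized) * 1 + unquantized * 2) / 1e9
bf16_gb = total_params * 2 / 1e9
print(round(fp8_gb, 1), round(bf16_gb, 1))  # ~16.4 vs ~29.6
```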
## Quantization Configuration
This model uses FP8 quantization with per-tensor quantization for both weights and activations:
### Weights
- Precision: FP8 (8-bit floating point)
- Strategy: Per-tensor
- Group Size: None (per-tensor)
- Symmetric: Yes
- Dynamic: No (static quantization)
- Observer: MinMax
### Activations
- Precision: FP8 (8-bit floating point)
- Strategy: Per-tensor
- Group Size: None (per-tensor)
- Symmetric: Yes
- Dynamic: No (static quantization)
- Observer: MinMax
### Other Details
- Format: float-quantized
- KV Cache: Not quantized
- Ignored Layers: lm_head
- Target Layers: Linear layers
- Quantization Version: 0.11.0
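The per-tensor symmetric scheme above can be sketched numerically. A minimal sketch, assuming FP8 E4M3 (largest finite value 448) and modelling only the dynamic-range clipping that the MinMax scale controls, not FP8's reduced mantissa precision:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def minmax_per_tensor_scale(w):
    # MinMax observer, symmetric: one scale maps the largest |value| to FP8 max
    return np.abs(w).max() / FP8_E4M3_MAX

def fake_quantize_fp8(w):
    # Round-trip w through the FP8 dynamic range (clipping only; the
    # mantissa rounding of real FP8 is not modelled in this sketch)
    scale = minmax_per_tensor_scale(w)
    q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q * scale, scale

w = np.array([-2.0, 0.5, 1.0])
deq, scale = fake_quantize_fp8(w)
```

Values inside the calibrated range survive the round-trip; a single per-tensor scale means one outlier weight stretches the range for the whole tensor.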
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JongYeop/Qwen2.5-14B-Instruct-FP8-W8A8"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Generate text
messages = [
    {"role": "user", "content": "What is machine learning?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
## Model Architecture
- Architecture: Qwen2ForCausalLM
- Hidden Size: 5120
- Intermediate Size: 13824
- Number of Layers: 48
- Number of Attention Heads: 40
- Number of KV Heads: 8
- Vocabulary Size: 152064
- Max Position Embeddings: 32768
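The GQA layout above (8 KV heads against 40 attention heads) also fixes the KV-cache footprint, which is not quantized in this model. A back-of-envelope calculation, assuming BF16 cache entries:

```python
# Per-token KV-cache size implied by the architecture numbers above
num_layers = 48
num_kv_heads = 8
head_dim = 5120 // 40      # hidden_size / num_attention_heads = 128
bytes_per_elem = 2         # BF16; the KV cache is not quantized
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
print(kv_bytes_per_token)  # 196608 bytes, i.e. 192 KiB per token
```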
## Intended Use
This quantized model is intended for efficient inference with reduced memory footprint while maintaining high accuracy. It is suitable for:
- Production deployment with reduced memory requirements
- High throughput inference scenarios
- GPU inference with FP8 support
- Applications where accuracy is important but memory savings are desired
## Limitations
- Best performance is achieved on hardware with native FP8 support (e.g., NVIDIA H100, Ada Lovelace, Blackwell GPUs)
- Requires compatible inference engines that support FP8 computation
- Static quantization may not adapt to varying input distributions
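The last point can be illustrated: with a static per-tensor scale frozen at calibration time, any activation outlier beyond the calibrated range is saturated. A toy sketch (clipping only, FP8 E4M3 maximum of 448 assumed):

```python
import numpy as np

FP8_E4M3_MAX = 448.0

# Static scale frozen from a calibration set of "typical" activations
calib = np.array([-3.0, 0.5, 2.0])
scale = np.abs(calib).max() / FP8_E4M3_MAX

# An outlier activation well outside the calibrated range gets clipped
x = 10.0
q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX) * scale
print(q)  # ≈ 3.0: the outlier is saturated to the calibrated maximum
```

Dynamic quantization would recompute the scale per input and avoid this clipping, at extra runtime cost.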
## Performance Notes
- Memory Usage: ~1.7x reduction compared to BF16
- Speed: Requires hardware with FP8 tensor core support for optimal performance
- Accuracy: Generally retains most of the original model's accuracy, with noticeably less degradation than lower-bit formats such as FP4
## Citation
If you use this model, please cite the original Qwen2.5 paper and the compressed-tensors library.
## License
Same as the base model: Apache 2.0