Qwen3-14B-FP8-MinMax

This is an FP8-quantized version of Qwen3-14B optimized for efficient inference.

Quantization Details

  • Method: FP8 W8A8 (weights + activations)
  • Observer: MinMax
  • Weights: Symmetric static per-channel
  • Activations: Symmetric dynamic per-token
  • Library: llm-compressor
  • Targets: Linear layers only (lm_head excluded)
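The MinMax observer with symmetric per-channel scales (as listed above) can be illustrated with a small self-contained sketch. The helper below is hypothetical, written for illustration only and not part of the llm-compressor API: each output channel's scale maps that channel's absolute maximum onto the edge of the FP8 E4M3 dynamic range (largest finite value: 448).

```python
# Sketch of symmetric per-channel MinMax scale computation, as used for
# static FP8 (E4M3) weight quantization. Hypothetical helper, not the
# llm-compressor API.

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def minmax_channel_scales(weight_rows):
    """Return one scale per output channel (row of the weight matrix).

    Symmetric MinMax: scale = amax(|w|) / E4M3_MAX, so each channel's
    extreme value maps exactly onto the FP8 range boundary.
    """
    return [max(abs(w) for w in row) / E4M3_MAX for row in weight_rows]

weights = [[0.5, -2.24, 1.0],
           [0.1, 0.05, -0.2]]
scales = minmax_channel_scales(weights)
# channel 0: amax = 2.24 -> scale = 2.24 / 448 = 0.005
```

Activations use the same symmetric scheme, but because they are quantized dynamically, the per-token amax is observed at inference time rather than calibrated in advance.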

Performance

Achieves roughly 97% of the BF16 baseline's accuracy on GSM8K, while halving weight memory and improving throughput by up to 2x.

Usage

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "selimaktas/Qwen3-14B-FP8-MinMax"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialize vLLM
llm = LLM(model=model_id, tensor_parallel_size=1)

# Generate
messages = [{"role": "user", "content": "Explain quantum computing"}]
prompts = [tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)

Benchmarking

lm_eval --model vllm \
  --model_args pretrained=selimaktas/Qwen3-14B-FP8-MinMax,dtype=auto,add_bos_token=True \
  --tasks gsm8k --num_fewshot 5 --batch_size auto

Requirements

  • vLLM >= 0.5.0
  • Transformers >= 4.51.0
  • NVIDIA GPU with FP8 support (compute capability 8.9+, e.g. H100, L40S)
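Hardware FP8 support can be checked from the GPU's CUDA compute capability: FP8 tensor cores arrived with Ada Lovelace (8.9) and Hopper (9.0). The helper name below is illustrative; at runtime you would obtain the `(major, minor)` pair from `torch.cuda.get_device_capability()`.

```python
# Sketch: decide FP8 support from CUDA compute capability.
# Hypothetical helper; in practice get (major, minor) from
# torch.cuda.get_device_capability().

def supports_fp8(major, minor):
    """FP8 (E4M3/E5M2) tensor-core support starts at capability 8.9."""
    return (major, minor) >= (8, 9)

print(supports_fp8(9, 0))  # H100 (Hopper) -> True
print(supports_fp8(8, 9))  # L40S (Ada Lovelace) -> True
print(supports_fp8(8, 0))  # A100 (Ampere, no FP8 tensor cores) -> False
```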