Qwen3-14B-FP8-MinMax

This is an FP8-quantized version of Qwen3-14B optimized for efficient inference.

Quantization Details

  • Method: FP8 W8A8 (weights + activations)
  • Observer: MinMax
  • Weights: Symmetric static per-channel
  • Activations: Symmetric dynamic per-token
  • Library: llm-compressor
  • Targets: Linear layers only (lm_head excluded)
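The MinMax observer with symmetric per-channel scales (as listed above) can be illustrated with a small self-contained sketch. The helper below is hypothetical, written for illustration only and not part of the llm-compressor API: each output channel's scale maps that channel's absolute maximum onto the edge of the FP8 E4M3 dynamic range (largest finite value: 448).

```python
# Sketch of symmetric per-channel MinMax scale computation, as used for
# static FP8 (E4M3) weight quantization. Hypothetical helper, not the
# llm-compressor API.

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def minmax_channel_scales(weight_rows):
    """Return one scale per output channel (row of the weight matrix).

    Symmetric MinMax: scale = amax(|w|) / E4M3_MAX, so each channel's
    extreme value maps exactly onto the FP8 range boundary.
    """
    return [max(abs(w) for w in row) / E4M3_MAX for row in weight_rows]

weights = [[0.5, -2.24, 1.0],
           [0.1, 0.05, -0.2]]
scales = minmax_channel_scales(weights)
# channel 0: amax = 2.24 -> scale = 2.24 / 448 = 0.005
```

Activations use the same symmetric scheme, but because they are quantized dynamically, the per-token amax is observed at inference time rather than calibrated in advance.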

Performance

Achieves roughly 97% of the BF16 baseline's accuracy on GSM8K, while halving weight memory and improving throughput by up to 2x.

Usage

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "selimaktas/Qwen3-14B-FP8-MinMax"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialize vLLM
llm = LLM(model=model_id, tensor_parallel_size=1)

# Generate
messages = [{"role": "user", "content": "Explain quantum computing"}]
prompts = [tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)

Benchmarking

lm_eval --model vllm \
  --model_args pretrained=selimaktas/Qwen3-14B-FP8-MinMax,dtype=auto,add_bos_token=True \
  --tasks gsm8k --num_fewshot 5 --batch_size auto

Requirements

  • vLLM >= 0.5.0
  • Transformers >= 4.51.0
  • NVIDIA GPU with FP8 support (compute capability 8.9+, e.g. H100, L40S)
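Hardware FP8 support can be checked from the GPU's CUDA compute capability: FP8 tensor cores arrived with Ada Lovelace (8.9) and Hopper (9.0). The helper name below is illustrative; at runtime you would obtain the `(major, minor)` pair from `torch.cuda.get_device_capability()`.

```python
# Sketch: decide FP8 support from CUDA compute capability.
# Hypothetical helper; in practice get (major, minor) from
# torch.cuda.get_device_capability().

def supports_fp8(major, minor):
    """FP8 (E4M3/E5M2) tensor-core support starts at capability 8.9."""
    return (major, minor) >= (8, 9)

print(supports_fp8(9, 0))  # H100 (Hopper) -> True
print(supports_fp8(8, 9))  # L40S (Ada Lovelace) -> True
print(supports_fp8(8, 0))  # A100 (Ampere, no FP8 tensor cores) -> False
```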