# Qwen3-14B-FP8-MinMax

This is an FP8-quantized version of Qwen3-14B, optimized for efficient inference.
## Quantization Details
- Method: FP8 W8A8 (weights + activations)
- Observer: MinMax
- Weights: Symmetric static per-channel
- Activations: Symmetric dynamic per-token
- Library: llm-compressor
- Targets: Linear layers only (lm_head excluded)
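The weight-side scheme listed above (symmetric, static, per-channel scales from a MinMax observer) can be sketched in plain Python. Note this is an illustrative simulation, not llm-compressor's implementation: `fp8_round` is a crude stand-in for a real FP8 E4M3 cast (it models range clipping and 3-bit mantissa rounding only).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp8_round(x: float) -> float:
    """Crude FP8 E4M3 simulation: clamp to range, keep 3 stored mantissa bits."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    m, e = math.frexp(x)       # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16     # 1 implicit + 3 stored mantissa bits
    return math.ldexp(m, e)

def quantize_row(row):
    # MinMax observer: the scale comes from the channel's absolute max.
    amax = max(abs(v) for v in row) or 1.0
    scale = amax / FP8_E4M3_MAX  # symmetric: zero-point is fixed at 0
    q = [fp8_round(v / scale) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.3, -1.7, 0.02, 2.4]
q, scale = quantize_row(row)
recon = dequantize_row(q, scale)
```

"Static" here means these weight scales are computed once at quantization time, whereas the per-token activation scales are recomputed dynamically at inference.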
## Performance

Achieves ~97% accuracy retention on the GSM8K benchmark, with ~50% memory reduction and up to 2x throughput improvement over the BF16 baseline.
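As a sanity check on the memory figure: weights drop from 2 bytes per parameter (BF16) to 1 byte (FP8), so for roughly 14B parameters:

```python
params = 14e9  # approximate parameter count of Qwen3-14B

bf16_gb = params * 2 / 1e9  # BF16: 2 bytes per weight
fp8_gb = params * 1 / 1e9   # FP8: 1 byte per weight

print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB ({1 - fp8_gb / bf16_gb:.0%} smaller)")
# → BF16 ~28 GB, FP8 ~14 GB (50% smaller)
```

KV cache and activations add to this, so real deployments need headroom beyond the weight footprint alone.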
## Usage

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "selimaktas/Qwen3-14B-FP8-MinMax"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialize vLLM
llm = LLM(model=model_id, tensor_parallel_size=1)

# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "Explain quantum computing"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
## Benchmarking

```shell
lm_eval --model vllm \
  --model_args pretrained=selimaktas/Qwen3-14B-FP8-MinMax,dtype=auto,add_bos_token=True \
  --tasks gsm8k --num_fewshot 5 --batch_size auto
```
## Requirements
- vLLM >= 0.5.0
- Transformers >= 4.51.0
- CUDA-capable GPU with FP8 support (H100, L40S, etc.)
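A quick way to check whether the local GPU meets the FP8 requirement, assuming PyTorch is available (vLLM depends on it). The `(8, 9)` threshold is an assumption based on NVIDIA compute capabilities: Ada GPUs such as the L40S report 8.9 and Hopper GPUs such as the H100 report 9.0, both of which have FP8 tensor cores.

```python
import torch

def supports_fp8() -> bool:
    """True if a CUDA GPU with FP8-capable tensor cores (compute capability >= 8.9) is present."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

print("FP8 supported:", supports_fp8())
```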