Qwen3-4B-Instruct-2507-W4A16-AutoRound
This model is a 4-bit quantized version of Qwen/Qwen3-4B-Instruct-2507, optimized using Intel's AutoRound algorithm.
It achieves strong accuracy retention by tuning the weight rounding for 1,000 iterations on 512 calibration samples, which preserves model quality markedly better than standard RTN (round-to-nearest) quantization.
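As a rough illustration of what the two approaches differ on (this is plain RTN, not the AutoRound implementation): RTN quantizes each weight group independently with a hard round(), while AutoRound learns a small per-weight rounding offset via signed gradient descent on calibration data. A minimal RTN sketch for one group of weights:

```python
# Round-to-nearest (RTN) symmetric 4-bit quantization of one weight group.
# AutoRound replaces the plain round() below with a learned rounding
# perturbation, tuned on calibration data; everything else is similar.

def rtn_quantize_group(weights, bits=4):
    """Quantize a group of float weights to signed int codes and back."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax    # per-group scale
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    dequant = [code * scale for code in q]         # what the kernel "sees"
    return q, dequant, scale

weights = [0.12, -0.53, 0.31, 0.07, -0.22, 0.48, -0.05, 0.19]
q, dq, scale = rtn_quantize_group(weights)
err = max(abs(w, ) if False else abs(w - d) for w, d in zip(weights, dq))
print("int4 codes:", q)
print("max reconstruction error:", round(err, 4))
```

With hard rounding, each weight's reconstruction error is bounded by half the group scale; AutoRound's tuning chooses rounding directions that minimize the *layer output* error instead of the per-weight error.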
Quantization scheme: W4A16 (4-bit weights, 16-bit activations)
Symmetric: True

You need the auto-round library to run this model in its native format:
pip install auto-round transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig  # importing this registers the AutoRound format with transformers
model_id = "Vishva007/Qwen3-4B-Instruct-2507-W4A16-AutoRound"
# Load the quantized model; the auto_round import above enables its native format
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Explain quantum computing in one sentence."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Qwen3-4B-Instruct-2507 is the latest non-thinking instruct model from the Qwen team, featuring significant improvements in reasoning, coding, and instruction following.
This quantized version retains nearly 99% of the FP16 model's performance while cutting weight memory roughly 4x, enabling deployment on consumer GPUs (e.g., RTX 3060/4060).
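A back-of-envelope check of the memory claim (weights only; the KV cache, activations, and the group-wise scales kept alongside 4-bit weights add some overhead on top of these figures):

```python
# Approximate weight memory for a ~4B-parameter model at two precisions.
# Rough estimate: ignores embeddings kept at higher precision and the
# per-group quantization scales, so real usage is somewhat higher.

params = 4e9  # ~4 billion parameters

fp16_gb = params * 2 / 1024**3   # 16-bit weights = 2 bytes each
w4_gb = params * 0.5 / 1024**3   # 4-bit weights = 0.5 bytes each

print(f"FP16 weights: ~{fp16_gb:.1f} GiB")
print(f"W4 weights:   ~{w4_gb:.1f} GiB")
```

At roughly 7.5 GiB, the FP16 weights alone already crowd a 8-12 GB consumer card, while the 4-bit weights (under 2 GiB) leave headroom for the KV cache and activations.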
@article{cheng2023optimize,
title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao},
journal={arXiv preprint arXiv:2309.05516},
year={2023}
}
Base model: Qwen/Qwen3-4B-Instruct-2507