EdgeRazor for Lightweight LLMs

Qwen3-0.6B-EdgeRazor-4bit

Contents
Model Overview
Model Bit-Widths
Model Performance
Quickstart
Citation

Model Overview

Base Model: Qwen/Qwen3-0.6B
Training: zhangsq-nju/EdgeRazor
Quantization: 4-bit for all embedding, decoder, and lm_head layers

Model Bit-Widths

Mixed-Precision Recipe	Bit-Width	This Repo
100% 4-bit + 0% 1.58-bit	4	✔️
50% 4-bit + 50% 1.58-bit	2.79
12.5% 4-bit + 87.5% 1.58-bit	1.88
0% 4-bit + 100% 1.58-bit	1.58
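
The Bit-Width column appears to be the fraction-weighted mean of the two precisions in each recipe. The sketch below is only a sanity check under that assumption; `average_bit_width` and `RECIPES` are illustrative names and not part of EdgeRazor.

```python
# Assumption: "Bit-Width" = (fraction at 4-bit) * 4 + (fraction at 1.58-bit) * 1.58.
RECIPES = [
    (1.0, 0.0),
    (0.5, 0.5),
    (0.125, 0.875),
    (0.0, 1.0),
]


def average_bit_width(frac_4bit, frac_low, high_bits=4.0, low_bits=1.58):
    """Return the weighted mean bit-width of a two-precision recipe."""
    return frac_4bit * high_bits + frac_low * low_bits


for frac_4bit, frac_low in RECIPES:
    bits = average_bit_width(frac_4bit, frac_low)
    print(f"{frac_4bit:.1%} 4-bit + {frac_low:.1%} 1.58-bit -> {bits:.2f}")
```

Rounded to two decimals, these values match the table: 4, 2.79, 1.88, and 1.58.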

Model Performance

W-A-KV gives the bit-widths of the weights, activations, and KV cache.
Models	W-A-KV	ARC-e	ARC-c	HellaSwag	BoolQ	PIQA	WinoGrande	SIQA	OBQA	Tr.QA2	Ethics	MMLU	IFEval	GSM8K	HumanEval	Average (↑)
Qwen3-0.6B	16-16-16	56.02	34.04	47.23	64.04	67.36	56.04	39.20	31.20	42.84	47.70	40.12	58.41	41.54	37.20	47.35
EdgeRazor	4-16-16	58.54	33.45	45.04	68.01	68.34	55.72	40.07	33.40	43.69	54.36	39.37	53.42	42.00	34.15	47.83
EdgeRazor	2.79-16-16	51.77	28.33	37.47	70.70	63.71	54.06	40.33	28.20	42.72	55.08	36.85	51.39	26.69	31.10	44.17
EdgeRazor	1.88-16-16	51.22	27.73	34.21	66.91	63.66	53.35	38.43	27.60	43.80	55.92	28.78	42.51	25.09	23.17	41.60
EdgeRazor	1.58-16-16	45.75	25.77	33.89	66.64	60.72	52.33	38.23	29.80	44.40	51.70	32.85	37.34	14.25	23.17	39.77
EdgeRazor	4-8-8	57.79	33.70	45.00	67.49	67.85	55.88	40.17	33.80	43.53	54.09	39.73	53.42	42.00	34.76	47.80
EdgeRazor	2.79-8-8	52.10	28.50	37.36	70.58	63.92	53.12	40.12	28.60	42.82	54.97	36.44	49.54	26.99	32.32	44.10
EdgeRazor	1.88-8-8	51.47	27.99	34.22	66.85	63.49	53.04	38.02	27.40	43.88	55.92	29.56	44.55	25.09	23.17	41.76
EdgeRazor	1.58-8-8	44.87	26.11	33.88	66.73	60.55	51.30	38.28	31.00	44.72	50.76	33.09	38.45	15.01	22.56	39.81
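
The Average column is consistent with an unweighted mean over the 14 benchmark scores in each row. A minimal check for the first two rows, with the scores copied from the table (the dictionary keys are just labels chosen for this sketch):

```python
from statistics import mean

# Per-benchmark scores copied from the table (ARC-e through HumanEval, 14 per row).
SCORES = {
    "Qwen3-0.6B (16-16-16)": [56.02, 34.04, 47.23, 64.04, 67.36, 56.04, 39.20,
                              31.20, 42.84, 47.70, 40.12, 58.41, 41.54, 37.20],
    "EdgeRazor (4-16-16)": [58.54, 33.45, 45.04, 68.01, 68.34, 55.72, 40.07,
                            33.40, 43.69, 54.36, 39.37, 53.42, 42.00, 34.15],
}

# Unweighted mean per row, rounded to two decimals like the table.
averages = {name: round(mean(values), 2) for name, values in SCORES.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

This gives 47.35 for the BF16 baseline and 47.83 for the 4-bit EdgeRazor model, matching the reported averages.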

Quickstart

Install EdgeRazor beforehand if you want weight-activation quantization. The provided weights are already quantized and stored as quantized_weights*scaling_bf16, i.e. the quantized values multiplied by their BF16 scaling factors. To enable activation and KV cache quantization, pass trust_remote_code=True when loading the model.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zhangsq-nju/Qwen3-0.6B-EdgeRazor-4bit"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # EdgeRazor-Nbit models are trained in instruct (non-thinking) mode only.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# split off any thinking content (normally empty here, since enable_thinking=False)
try:
    # reverse search for token id 151668 (</think>) to find where the answer starts
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
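
The Quickstart note says the published weights are stored as quantized_weights*scaling_bf16. As a rough illustration of what such a product looks like, the sketch below applies a generic symmetric per-tensor 4-bit quantizer in plain Python. This is an assumption made for illustration only: EdgeRazor's actual quantizer, scaling granularity, and rounding may differ, and `fake_quantize` is a hypothetical helper, not an EdgeRazor API.

```python
def fake_quantize(weights, bits=4):
    """Generic symmetric per-tensor quantization (illustrative, NOT EdgeRazor's method)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Integer codes: round to the nearest level, then clamp to the signed range.
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    # What would be stored in the checkpoint: integer code times scaling factor.
    dequantized = [c * scale for c in codes]
    return codes, scale, dequantized


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale, dequantized = fake_quantize(weights)
print("integer codes:", codes)
print("scale:", scale)
print("stored values:", dequantized)
```

Under this scheme each stored value is at most half a quantization step away from the original weight, and the tensor holds at most 16 distinct values even though it is saved in a 16-bit format.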

Citation

If you find this project useful in your research, please consider citing our paper:

@article{zhangsh-edgerazor,
  title={{EdgeRazor}: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation},
  author={Shu-Hao Zhang and Le-Tong Huang and Xiang-Sheng Deng and Xin-Yi Zou and Chen Wu and Nan Li and Shao-Qun Zhang},
  year={2026},
}
