EdgeRazor for Lightweight LLMs

Qwen3-1.7B-EdgeRazor-2.79bit

Contents
Model Overview
Model Bit-Widths
Model Performance
Quickstart
Citation

Model Overview

Base Model: Qwen/Qwen3-1.7B
Training: zhangsq-nju/EdgeRazor
Quantization: 2.79-bit for all decoder layers; 4-bit for embedding and lm_head

Model Bit-Widths

Mixed-Precision Recipe	Bit-Width	This Repo
100% 4-bit + 0% 1.58-bit	4
50% 4-bit + 50% 1.58-bit	2.79	✔️
12.5% 4-bit + 87.5% 1.58-bit	1.88
0% 4-bit + 100% 1.58-bit	1.58
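
The bit-widths listed above are consistent with a weighted average of the two precisions in each recipe. The sketch below reproduces them under that assumption. The function name `average_bit_width` is illustrative and not part of EdgeRazor, and reading 1.58 bits as log2(3), i.e. ternary weights, is likewise an assumption.

```python
import math

def average_bit_width(frac_4bit: float, high_bits: float = 4.0, low_bits: float = 1.58) -> float:
    """Weighted mean bit-width of a 4-bit / 1.58-bit mix (assumed formula)."""
    if not 0.0 <= frac_4bit <= 1.0:
        raise ValueError("frac_4bit must be in [0, 1]")
    return frac_4bit * high_bits + (1.0 - frac_4bit) * low_bits

# Fractions of 4-bit weights, taken from the recipe table.
recipes = [1.0, 0.5, 0.125, 0.0]
widths = [average_bit_width(f) for f in recipes]
for frac, bits in zip(recipes, widths):
    print(f"{frac:.1%} 4-bit -> {bits:.2f} bits")

# 1.58 is close to log2(3), the information content of one ternary value (assumption).
ternary_bits = math.log2(3)
```

Rounded to two decimals, the four results match the Bit-Width column: 4, 2.79, 1.88, and 1.58.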

Model Performance

Models	W-A-KV (bits)	ARC-e	ARC-c	HellaS.	BoolQ	PIQA	WinoG.	SIQA	OBQA	Tr.QA2	Ethics	MMLU	IFEval	GSM8K	HumanE.	Average (↑)
Qwen3-1.7B	16-16-16	69.87	42.83	60.40	77.77	72.58	60.85	45.19	37.40	45.97	49.63	55.49	67.10	68.76	67.07	58.64
EdgeRazor	4-16-16	70.66	44.80	57.51	80.09	72.31	60.14	44.06	38.40	48.41	64.02	54.70	58.96	68.39	57.32	58.56
EdgeRazor	2.79-16-16	63.47	38.57	49.48	78.78	68.23	55.64	43.91	33.40	45.42	60.81	46.25	54.71	54.28	53.66	53.33
EdgeRazor	1.88-16-16	59.60	34.04	40.94	72.11	65.23	54.38	41.76	29.80	46.09	57.30	38.93	43.81	36.39	39.63	47.14
EdgeRazor	1.58-16-16	55.60	31.06	39.53	70.95	63.60	53.28	41.97	31.60	40.16	55.89	35.00	32.72	29.49	33.54	43.89
EdgeRazor	4-8-8	70.16	44.45	57.52	79.82	72.58	59.67	43.45	38.20	48.37	63.56	54.29	60.26	68.54	59.15	58.57
EdgeRazor	2.79-8-8	62.79	38.31	49.53	78.38	68.72	56.04	43.65	33.40	45.57	60.72	46.27	54.34	53.68	50.61	53.00
EdgeRazor	1.88-8-8	59.09	33.53	40.85	72.14	65.18	53.99	41.76	29.00	46.18	57.33	39.03	41.96	37.53	40.85	47.03
EdgeRazor	1.58-8-8	55.64	31.48	39.68	70.70	64.25	53.91	41.76	31.60	40.15	56.26	35.07	32.35	28.96	32.93	43.91
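
The Average column appears to be the unweighted mean of the 14 benchmark scores in each row. A quick check for the 2.79-16-16 row (the configuration in this repo), together with the share of the 16-bit baseline average it retains; the variable names are illustrative.

```python
# Scores copied from the table for the 2.79-16-16 row, in column order.
edgerazor_2_79 = [63.47, 38.57, 49.48, 78.78, 68.23, 55.64, 43.91,
                  33.40, 45.42, 60.81, 46.25, 54.71, 54.28, 53.66]
baseline_average = 58.64  # Qwen3-1.7B at 16-16-16, from the table

# Unweighted mean over the 14 benchmarks (assumed definition of "Average").
average = sum(edgerazor_2_79) / len(edgerazor_2_79)
# Fraction of the baseline average kept after quantization.
retention = average / baseline_average

print(f"average: {average:.2f}")
print(f"retained vs. baseline: {retention:.1%}")
```

The computed mean rounds to the 53.33 reported in the table, which supports the unweighted-mean reading.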

Quickstart

Install EdgeRazor beforehand if you need weight-activation quantization. The provided weights are already quantized and stored as quantized_weights*scaling_bf16; to enable activation and KV cache quantization, set trust_remote_code=True when loading the model.
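
One way to read `quantized_weights*scaling_bf16` is that the checkpoint stores each weight as a low-bit code multiplied by a scale, so the tensors load as ordinary BF16 values. A toy sketch of that reading follows. The ternary codes, the single per-tensor scale, and all names are assumptions for illustration, not EdgeRazor's actual layout or scaling granularity.

```python
# Hypothetical illustration only; not EdgeRazor's code.
codes = [[-1, 0, 1], [1, 1, -1]]  # ternary codes (the 1.58-bit case, assumed)
scale = 0.25                      # one scale for the whole tensor (assumed)

# What the checkpoint would hold under this reading: code * scale.
stored = [[c * scale for c in row] for row in codes]

# Dividing by the scale recovers the integer codes.
recovered = [[round(w / scale) for w in row] for row in stored]

# The dense tensor contains only as many distinct values as there are codes.
distinct = sorted({w for row in stored for w in row})
print(distinct)
```

Under these assumptions the stored tensor looks like a regular floating-point weight matrix, but it holds only three distinct values.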

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zhangsq-nju/Qwen3-1.7B-EdgeRazor-2.79bit"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # EdgeRazor-nbit models are trained in instruct (non-thinking) mode only
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
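output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# split off any thinking content; with enable_thinking=False it is normally empty
try:
    # reverse search for the last occurrence of token id 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # no </think> token found: treat the whole output as the answer
    index = 0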

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

Citation

If you find our project useful in your research, please consider citing our paper ✏️:

@article{zhangsh-edgerazor,
  title={{EdgeRazor}: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation},
  author={Shu-Hao Zhang and Le-Tong Huang and Xiang-Sheng Deng and Xin-Yi Zou and Chen Wu and Nan Li and Shao-Qun Zhang},
  year={2026},
}
