
EdgeRazor for Lightweight LLMs


Qwen3-0.6B-EdgeRazor-4bit

Contents

Model Overview

Model Bit-Widths

| Mixed-Precision Recipe | Bit-Width | This Repo |
|---|---|---|
| 100% 4-bit + 0% 1.58-bit | 4 | ✔️ |
| 50% 4-bit + 50% 1.58-bit | 2.79 | |
| 12.5% 4-bit + 87.5% 1.58-bit | 1.88 | |
| 0% 4-bit + 100% 1.58-bit | 1.58 | |
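
The bit-widths in the table are consistent with a simple weighted mean of the two precisions. The sketch below is an illustration under that assumption, not a description of how EdgeRazor itself computes the figure; the names `HIGH_BITS`, `LOW_BITS`, and `average_bit_width` are hypothetical.

```python
# Assumption: the reported bit-width is the mean bits per weight, weighted by
# the fraction of weights at each precision, with the low-precision part
# counted as exactly 1.58 bits.
HIGH_BITS = 4.0
LOW_BITS = 1.58


def average_bit_width(high_fraction: float) -> float:
    """Mean bits per weight when `high_fraction` of the weights are 4-bit."""
    return high_fraction * HIGH_BITS + (1.0 - high_fraction) * LOW_BITS


# Fractions of 4-bit weights for the four recipes in the table.
recipes = [1.0, 0.5, 0.125, 0.0]
averages = {f: average_bit_width(f) for f in recipes}

for fraction, bits in averages.items():
    print(f"{fraction:.1%} 4-bit -> {bits:.4f} bits per weight")
```

Under this assumption the 12.5% recipe gives 0.125 × 4 + 0.875 × 1.58 = 1.8825, which matches the table's 1.88 at two decimals.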

Model Performance

| Models | W-A-KV | ARC-e | ARC-c | HellaS. | BoolQ | PIQA | WinoG. | SIQA | OBQA | Tr.QA2 | Ethics | MMLU | IFEval | GSM8K | HumanE. | Average (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-0.6B | 16-16-16 | 56.02 | 34.04 | 47.23 | 64.04 | 67.36 | 56.04 | 39.20 | 31.20 | 42.84 | 47.70 | 40.12 | 58.41 | 41.54 | 37.20 | 47.35 |
| EdgeRazor | 4-16-16 | 58.54 | 33.45 | 45.04 | 68.01 | 68.34 | 55.72 | 40.07 | 33.40 | 43.69 | 54.36 | 39.37 | 53.42 | 42.00 | 34.15 | 47.83 |
| EdgeRazor | 2.79-16-16 | 51.77 | 28.33 | 37.47 | 70.70 | 63.71 | 54.06 | 40.33 | 28.20 | 42.72 | 55.08 | 36.85 | 51.39 | 26.69 | 31.10 | 44.17 |
| EdgeRazor | 1.88-16-16 | 51.22 | 27.73 | 34.21 | 66.91 | 63.66 | 53.35 | 38.43 | 27.60 | 43.80 | 55.92 | 28.78 | 42.51 | 25.09 | 23.17 | 41.60 |
| EdgeRazor | 1.58-16-16 | 45.75 | 25.77 | 33.89 | 66.64 | 60.72 | 52.33 | 38.23 | 29.80 | 44.40 | 51.70 | 32.85 | 37.34 | 14.25 | 23.17 | 39.77 |
| EdgeRazor | 4-8-8 | 57.79 | 33.70 | 45.00 | 67.49 | 67.85 | 55.88 | 40.17 | 33.80 | 43.53 | 54.09 | 39.73 | 53.42 | 42.00 | 34.76 | 47.80 |
| EdgeRazor | 2.79-8-8 | 52.10 | 28.50 | 37.36 | 70.58 | 63.92 | 53.12 | 40.12 | 28.60 | 42.82 | 54.97 | 36.44 | 49.54 | 26.99 | 32.32 | 44.10 |
| EdgeRazor | 1.88-8-8 | 51.47 | 27.99 | 34.22 | 66.85 | 63.49 | 53.04 | 38.02 | 27.40 | 43.88 | 55.92 | 29.56 | 44.55 | 25.09 | 23.17 | 41.76 |
| EdgeRazor | 1.58-8-8 | 44.87 | 26.11 | 33.88 | 66.73 | 60.55 | 51.30 | 38.28 | 31.00 | 44.72 | 50.76 | 33.09 | 38.45 | 15.01 | 22.56 | 39.81 |
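
The Average column appears to be the unweighted mean of the 14 benchmark scores in each row. The sketch below checks this for the first two rows, using scores copied from the table; the variable names are illustrative, and equal weighting of the benchmarks is an assumption.

```python
from statistics import mean

# Per-benchmark scores copied from the first two table rows, in column order
# (ARC-e ... HumanE.). Assumption: each benchmark is weighted equally.
scores = {
    "Qwen3-0.6B 16-16-16": [
        56.02, 34.04, 47.23, 64.04, 67.36, 56.04, 39.20,
        31.20, 42.84, 47.70, 40.12, 58.41, 41.54, 37.20,
    ],
    "EdgeRazor 4-16-16": [
        58.54, 33.45, 45.04, 68.01, 68.34, 55.72, 40.07,
        33.40, 43.69, 54.36, 39.37, 53.42, 42.00, 34.15,
    ],
}

averages = {name: mean(values) for name, values in scores.items()}

# Difference between the 4-bit model and the full-precision baseline.
gap = averages["EdgeRazor 4-16-16"] - averages["Qwen3-0.6B 16-16-16"]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print(f"4-bit minus baseline: {gap:+.2f}")
```

The same check can be repeated for the remaining rows by adding them to `scores`.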

Quickstart

Install EdgeRazor beforehand if you want weight-activation quantization. The provided weights are already quantized (quantized_weights*scaling_bf16); to enable activation and KV cache quantization, pass trust_remote_code=True when loading the model, as in the example below.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zhangsq-nju/Qwen3-0.6B-EdgeRazor-4bit"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # EdgeRazor-nbit models are trained only in instruct (non-thinking) mode.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

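# split off any thinking content (normally empty, since enable_thinking=False)
try:
    # find the last occurrence of token 151668 (</think>) by searching the reversed list
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # no </think> token: treat the whole output as the answer
    index = 0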

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
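
The note above describes the stored weights as quantized_weights*scaling_bf16, that is, low-bit integer values multiplied by a scale. The following is a minimal sketch of that idea using a generic symmetric quantizer with one shared scale. It is not EdgeRazor's actual quantizer, whose details are not given here: the function names are hypothetical, and plain Python floats stand in for BF16.

```python
# Generic symmetric quantization sketch (an assumption, not EdgeRazor's method).
# Plain Python floats are used in place of BF16 tensors.

def quantize_symmetric(weights, bits=4):
    """Map floats to signed integers in [-2**(bits-1), 2**(bits-1)-1] with one scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate weights as quantized_weights * scale."""
    return [qi * scale for qi in q]


row = [0.42, -0.07, 0.0, -0.35, 0.21]             # hypothetical weight values
q, scale = quantize_symmetric(row, bits=4)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))

print("integer levels:", q)
print("scale:", scale)
print("max abs error:", max_error)
```

With this scheme every restored weight lies within half a quantization step (scale / 2) of the original, which is the rounding error of a uniform grid.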

Citation

If you find our project useful in your research, please consider citing our paper ✏️:

@article{zhangsh-edgerazor,
  title={{EdgeRazor}: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation},
  author={Shu-Hao Zhang and Le-Tong Huang and Xiang-Sheng Deng and Xin-Yi Zou and Chen Wu and Nan Li and Shao-Qun Zhang},
  year={2026},
}