PreSINQ GGUF Quantized Qwen3-8B Model
This repository contains the official PreSINQ GGUF-quantized versions of the Qwen3-8B model. For a detailed explanation of the PreSINQ strategy, please refer to the official SINQ repository.
SINQ is a fast and high-quality quantization technique designed to significantly reduce the size of large language models while preserving accuracy.
If you find this project useful, please consider giving a ⭐ to the official SINQ repository.
Model Details
- Model Name: Qwen3-8B-PreSINQ-GGUF
- Base Model: Qwen/Qwen3-8B
- Task: Text Generation
- Framework: PyTorch / Transformers
- License: Apache-2.0
- Quantized By: Huawei Computing Systems Lab
How to Obtain the PreSINQ Model
The PreSINQ Qwen3-8B models are produced using the PreSINQ GGUF script available in the official SINQ repository.
The models provided here correspond to the best-performing configurations for each quantization type.
Best PreSINQ Quantization Results (Qwen3-8B)
Results below are measured on the WikiText-2 test set.
| Method | Bits | Size (GB) | Perplexity ↓ |
|---|---|---|---|
| Baseline (FP16) | FP16 | 15.26 | 10.1019 |
| Baseline + Q3_K_S | 3-bit | 3.77 | 11.3619 |
| PreSINQ + Q3_K_S | 3-bit | 3.77 | 10.6786 |
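To put the table in perspective, a small sketch computing the relative perplexity degradation each 3-bit variant incurs over the FP16 baseline (the numbers are taken directly from the table above):

```python
# Relative perplexity degradation vs. the FP16 baseline,
# using the WikiText-2 numbers from the table above.
fp16 = 10.1019
plain_q3ks = 11.3619    # Baseline + Q3_K_S
presinq_q3ks = 10.6786  # PreSINQ + Q3_K_S

def rel_increase(ppl: float, base: float = fp16) -> float:
    """Percent perplexity increase over the FP16 baseline."""
    return 100.0 * (ppl - base) / base

print(f"Q3_K_S alone:   +{rel_increase(plain_q3ks):.2f}%")   # ~ +12.47%
print(f"PreSINQ Q3_K_S: +{rel_increase(presinq_q3ks):.2f}%") # ~ +5.71%
```

At the same 3.77 GB file size, PreSINQ cuts the perplexity gap to FP16 by more than half relative to plain Q3_K_S quantization.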
You can also generate good (though not best-performing) PreSINQ models more quickly by reducing the number of configurations explored during the PreSINQ script execution.
Usage
You can load and run the PreSINQ GGUF models using:
- 🤗 Transformers
- llama.cpp
- Any GGUF-compatible inference framework
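As a minimal sketch of the llama.cpp route: the repository id and GGUF filename below are placeholders (substitute the actual file listed in this repo's Files tab), and the commands assume `huggingface_hub` and a built llama.cpp are installed.

```shell
# Download the quantized weights from the Hub
# (<this-repo-id> and the filename are placeholders -- use the actual GGUF file in this repo).
huggingface-cli download <this-repo-id> qwen3-8b-presinq-q3_k_s.gguf --local-dir .

# Run a short generation with llama.cpp's CLI.
llama-cli -m qwen3-8b-presinq-q3_k_s.gguf \
  -p "Explain quantization in one sentence." \
  -n 128
```

Any other GGUF-compatible runtime (e.g. a llama.cpp server or bindings) can load the same file without further conversion.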
How to Cite This Work
If you find SINQ useful in your research or applications, please cite:
```bibtex
@misc{muller2025sinq,
  title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
  author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
  year={2025},
  eprint={2509.22944},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={http://arxiv.org/abs/2509.22944}
}
```