Experimental global target bits‑per‑weight quantization of Qwen/Qwen3.5-9B

Quantized using a non-standard (forked) llama.cpp release, b8927.

From the original model creators:

Qwen3.5-9B

This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.

These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.

Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.

Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.

Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.

Next-Generation Training Infrastructure: Near-100% multimodal training efficiency compared to text-only training and asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration.

For more details, please refer to our blog post Qwen3.5.

⚠️ PLEASE READ THIS BEFORE USING THESE EXPERIMENTAL VERSIONS! ⚠️

An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but my focus has been primarily on quantization and pruning.

These experimental versions are produced with a custom version of llama-imatrix, which generates an imatrix that includes tensor statistics, and a custom version of llama-quantize, which computes a per-tensor quantization error in order to automatically select the lowest-error quantization recipe that achieves a global target bits‑per‑weight (bpw). More details on the implementation and test results here.

There are two pull requests (#14891 & #15550) to merge these changes back into the core llama.cpp project. This may or may not ever happen so, until then, the modified versions will be available on GitHub.

For testing and comparison, I use models produced by Bartowski (see credits below) and Unsloth (Daniel and Michael Han do some really interesting stuff!) but when they don't provide versions of the required model, tests and comparisons are against standard quantization obtained by simply running llama-quantize with no further optimizations.

All experimental versions were generated using an appropriate imatrix created from datasets available at eaddario/imatrix-calibration. In llama.cpp, an imatrix is a calibration file derived from running representative text through the model and collecting activation statistics. It is used to weight quantization error so that error in more “important” directions (as estimated from activations) is penalized more heavily.
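As a rough illustration of the weighting idea, the sketch below compares a plain and an importance-weighted squared error for one toy row of weights. The round-to-nearest quantizer, the function names and all sample values are assumptions made for illustration only; llama.cpp's actual quantization kernels and imatrix handling are considerably more involved.

```python
def quantize_round_to_nearest(row, levels=16):
    """Toy symmetric quantizer (an illustrative assumption, not a llama.cpp kernel)."""
    max_abs = max(abs(w) for w in row) or 1.0
    half = levels // 2 - 1              # 16 levels give an integer grid in [-7, 7]
    scale = max_abs / half
    # Quantize to the integer grid, then dequantize back to floats
    return [round(w / scale) * scale for w in row]


def weighted_sq_error(row, deq, importance):
    """Sum of squared errors, each weighted by a per-column importance value."""
    return sum(i * (w - q) ** 2 for w, q, i in zip(row, deq, importance))


row = [0.12, -0.53, 0.91, -0.07]
deq = quantize_round_to_nearest(row)
uniform = [1.0] * len(row)
# Hypothetical importance values standing in for activation statistics
importance = [0.1, 4.0, 0.5, 0.2]

print("unweighted error:", weighted_sq_error(row, deq, uniform))
print("weighted error:  ", weighted_sq_error(row, deq, importance))
```

With the weighted form, an error in a column that sees large activations costs more than the same error in a rarely used column, which is the behaviour the paragraph above describes.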

The process to generate these models is roughly as follows:

1. Convert the original model's safetensors to GGUF F16*
2. Estimate the Perplexity score for the F16 model (baseline) using the wikitext-2-raw-v1 dataset, and save the logits
3. Generate an imatrix from the most appropriate calibration dataset
4. Quantize the baseline model targeting a bpw average (e.g. llama-quantize --target-bpw 4.5678 --state-file --imatrix imatrix.gguf baseline-model-F16.gguf 12)
5. Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), HellaSwag, MMLU, Truthful QA and WinoGrande scores for each quantized model
6. Keep the version with the best 𝜌PPL and μKLD scores
7. Repeat until all desired quants are created

*BF16 would be preferred, but F16 performs better on Apple's GPUs

Advantages and disadvantages of the global target bits‑per‑weight quantization process

Advantages

Target arbitrary size models
- When specifying --target-bpw 4.5678, for instance, the algorithm will produce a model of (nearly) exactly that size, which is very useful for maximizing VRAM usage. In a system with 24GB VRAM and a 70B model, standard quants might produce a 16.8GB file (too small, quality left on the table) or a 24.1GB file (won't fit). This approach can generate a 23.85GB file to utilize the hardware fully.
Data-driven mixed precision often can improve quality at fixed size
- Instead of using hardcoded heuristics (e.g. make attn_v Q5_K for a 70B model), which may be sub‑optimal for a given architecture or size, the quantization mix is determined by the actual error sensitivity of the specific model's weights. In practice, this often yields a better quality/size trade-off, especially in aggressive quantization scenarios (1.5 to 3.5 bpw) or for unusual architectures.
- Please note: llama.cpp’s heuristics have been tuned across many models and are highly optimized; although the target bpw method often produces better quality (>75% of the time, based on tests with 130 models from 11 different families), it can also lose in surprising cases.
Allows better like-for-like comparisons between models and families
- Standard llama.cpp quantization uses hardcoded rules like: "use Q4_K_M, except bump some tensors up/down, except fall back if incompatible, except keep some tensors unquantized..." and for that reason, two different models quantized with the same Q4_K_M type can end up with very different bpw (e.g. 4.75 and 4.30).
- All things being equal, a model's performance usually tracks its overall bpw; models with a higher bpw tend to perform better than those with a lower bpw. If model A has simply been given more bits than model B, it will typically perform better (lower perplexity, better eval scores, etc.) even if the underlying quantization method is identical. Comparing their performance is then not a controlled experiment, because the models have different effective compression ratios.
- --target-bpw tries to address that by making the experiment more controlled: each model is quantized to land on (approximately) the same global bpw budget, so that performance differences are more attributable to architecture/training differences, quantization error behaviour at the same compression ratio, the optimizer’s allocation decisions, etc.
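The relationship between a bpw target and the resulting file size is simple arithmetic, sketched below. The function names are illustrative, and the calculation covers tensor data only: it ignores GGUF metadata, and any real VRAM budget also has to leave room for the KV cache and compute buffers.

```python
def file_size_bytes(n_params: float, bpw: float) -> float:
    """Approximate size of the tensor data; ignores GGUF metadata overhead."""
    return n_params * bpw / 8


def bpw_for_budget(n_params: float, budget_bytes: float) -> float:
    """Average bpw at which the tensor data exactly fills the byte budget."""
    return budget_bytes * 8 / n_params


# A 70B-parameter model targeting a 23.85 GB (decimal) file, as in the example above
print(round(bpw_for_budget(70e9, 23.85e9), 4))

# 8.95B parameters at 5.0197 bpw, expressed in GiB
print(round(file_size_bytes(8.95e9, 5.0197) / 2**30, 2))
```

The second figure lines up with the 5.23 GiB that llama-bench reports for the Q4_K model further down.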

Disadvantages

Quantization process is significantly slower than standard
- This approach can take 5x-10x longer as it quantizes a sample of most tensors into 15 different formats, dequantizes them back to floats, computes error diffs, and selects the best size/error option that fits the global bpw budget.
- However, the --state-file option saves and reuses the above-mentioned computations so that future quantizations of the same model can be generated at normal speed. It also allows the computation to be interrupted and resumed at a later time.
The optimization target is only a proxy for the model's performance quality
- The process minimizes a per-tensor estimated error computed from sampled rows, not actual perplexity or divergence of output distributions (a future version may address this). Since errors interact nonlinearly across layers, there are no guarantees it will select the best possible quantization recipe subject to the bpw size constraint.
An imatrix with activation data is required for best results
- Activation data is required to compute the bias factor (i.e. the systematic error projected onto activation directions). If the imatrix file does not contain activation data, the --target-bpw option will refuse to run.
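To make the idea of a budget-constrained recipe search concrete, here is a minimal sketch of one generic way to do it: a greedy heuristic that starts every tensor at its cheapest option and repeatedly takes the upgrade with the best error reduction per extra bit. This is not the algorithm implemented in the modified llama-quantize; the tensor names, candidate (bpw, error) pairs and weight counts are all hypothetical.

```python
def allocate_bits(candidates, n_weights, target_bpw):
    """Greedy sketch: choose one (bpw, error) option per tensor under a global bit budget.

    candidates: {tensor: [(bpw, estimated_error), ...]}  (hypothetical measurements)
    n_weights:  {tensor: number of weights in that tensor}
    """
    total_weights = sum(n_weights.values())
    budget = target_bpw * total_weights
    # Sort each tensor's options by bpw and start at the cheapest one
    options = {t: sorted(c) for t, c in candidates.items()}
    choice = {t: 0 for t in options}
    used = sum(options[t][0][0] * n_weights[t] for t in options)
    if used > budget:
        raise ValueError("target bpw is below the smallest available recipe")

    while True:
        best = None
        for t, opts in options.items():
            cur_bpw, cur_err = opts[choice[t]]
            for j in range(choice[t] + 1, len(opts)):
                extra = (opts[j][0] - cur_bpw) * n_weights[t]
                gain = cur_err - opts[j][1]
                # Skip upgrades that do not help or do not fit the budget
                if extra <= 0 or gain <= 0 or used + extra > budget:
                    continue
                ratio = gain / extra
                if best is None or ratio > best[0]:
                    best = (ratio, t, j, extra)
        if best is None:
            recipe = {t: options[t][choice[t]] for t in options}
            return recipe, used / total_weights
        _, t, j, extra = best
        choice[t] = j
        used += extra


candidates = {
    "attn_v":   [(2.0, 9.0), (4.0, 2.0), (6.0, 0.5)],
    "ffn_down": [(2.0, 4.0), (4.0, 1.5), (6.0, 0.4)],
    "ffn_up":   [(2.0, 1.0), (4.0, 0.6), (6.0, 0.3)],
}
n_weights = {"attn_v": 1000, "ffn_down": 1000, "ffn_up": 1000}
recipe, achieved_bpw = allocate_bits(candidates, n_weights, target_bpw=4.0)
print(recipe, achieved_bpw)
```

In this toy case the mixed recipe spends the same 4.0 bpw as a uniform 4-bit assignment but ends with a lower summed error (3.0 versus 4.1). Like any heuristic that works on per-tensor error estimates, it carries no guarantee of finding the best recipe.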

Models

To ensure a fair "apples-to-apples" comparison, the IQ1_M, IQ2_M, Q3_K, Q4_K, Q5_K, Q6_K and Q8_0 models were quantized to align with the bits-per-weight (bpw) of the naive models (which use standard quantization obtained by simply running llama-quantize without further optimization). In contrast, the Q4_K-B and Q4_K-U models were matched to the ones produced by Bartowski and Unsloth, respectively.

Bits per weight, size, perplexity and KL Divergence scores

Model	BPW	Size (GB)	μPPL	𝜌PPL	μKLD	Same Top-P
Qwen3.5-9B-F16	16.0019	17.0	7.740724 ±0.051440	100%	N/A	N/A
Qwen3.5-9B-IQ1_M	2.5610	2.7	11.807114 ±0.081282	90.82%	0.459065 ±0.001904	69.889 ±0.119
Qwen3.5-9B-IQ2_M	3.2134	3.4	8.764317 ±0.058793	96.63%	0.170553 ±0.001180	80.764 ±0.103
Qwen3.5-9B-Q3_K	4.1212	4.3	7.908381 ±0.052553	98.85%	0.053698 ±0.000791	89.608 ±0.079
Qwen3.5-9B-Q4_K	5.0197	5.2	7.818710 ±0.052151	99.61%	0.017196 ±0.000474	94.341 ±0.060
Qwen3.5-9B-Q4_K-B	5.2526	5.5	7.802908 ±0.052073	99.71%	0.012362 ±0.000379	94.974 ±0.057
Qwen3.5-9B-Q4_K-U	5.0656	5.3	7.811403 ±0.052081	99.64%	0.015860 ±0.000458	94.528 ±0.059
Qwen3.5-9B-Q4_K_M-naive	5.0197	5.2	7.802555 ±0.051866	99.51%	0.020758 ±0.000594	94.215 ±0.061
Qwen3.5-9B-Q4_K_M-bartowski	5.2526	5.5	7.837441 ±0.052308	99.58%	0.016737 ±0.000485	94.704 ±0.058
Qwen3.5-9B-Q4_K_M-unsloth	5.0656	5.3	7.855939 ±0.052445	99.55%	0.018559 ±0.000551	94.581 ±0.059
Qwen3.5-9B-Q5_K	5.7692	6.0	7.789108 ±0.051966	99.82%	0.007173 ±0.000347	96.652 ±0.047
Qwen3.5-9B-Q6_K	6.5655	6.9	7.757402 ±0.051640	99.89%	0.004164 ±0.000375	97.755 ±0.039
Qwen3.5-9B-Q8_0	8.5028	8.9	7.756922 ±0.051665	99.96%	0.000828 ±0.000180	99.146 ±0.024

ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores

Scores generated using llama-perplexity with 750 tasks per test, and a context size of 768 tokens.

For the test data used in the generation of these scores, follow the appropriate links: HellaSwag, ARC, MMLU, Truthful QA and WinoGrande

Model	ARC	HellaSwag	MMLU	Truthful QA	WinoGrande	Avg Score
Qwen3.5-9B-IQ1_M	56.4000 ±1.8119	68.00	35.7333 ±1.7510	28.5333 ±1.6500	68.9333 ±1.6909	51.52
Qwen3.5-9B-IQ2_M	64.2667 ±1.7510	73.07	35.0667 ±1.7436	30.8000 ±1.6869	69.2000 ±1.6869	54.48
Qwen3.5-9B-Q3_K	65.3333 ±1.7389	76.40	38.5333 ±1.7783	33.7333 ±1.7276	72.1333 ±1.6382	57.23
Qwen3.5-9B-Q4_K	66.6667 ±1.7225	75.47	38.9333 ±1.7816	32.9333 ±1.7172	72.6667 ±1.6284	57.33
Qwen3.5-9B-Q4_K-B	66.0000 ±1.7309	76.00	38.2667 ±1.7759	32.8000 ±1.7155	73.3333 ±1.6158	57.28
Qwen3.5-9B-Q4_K-U	66.5333 ±1.7242	75.87	38.1333 ±1.7748	32.4000 ±1.7100	72.4000 ±1.6334	57.07
Qwen3.5-9B-Q4_K_M-naive	65.2000 ±1.7405	71.99	38.6667 ±1.7794	33.6000 ±1.7259	72.1333 ±1.6382	56.32
Qwen3.5-9B-Q4_K_M-bartowski	64.4000 ±1.7496	75.33	38.0000 ±1.7736	33.4667 ±1.7242	72.6667 ±1.6284	56.77
Qwen3.5-9B-Q4_K_M-unsloth	65.6000 ±1.7358	75.47	38.4000 ±1.7771	33.8667 ±1.7292	71.7333 ±1.6453	57.01
Qwen3.5-9B-Q5_K	65.7333 ±1.7342	75.33	38.1333 ±1.7748	33.3333 ±1.7225	72.9333 ±1.6235	57.09
Qwen3.5-9B-Q6_K	66.1333 ±1.7292	76.00	38.2667 ±1.7759	33.4667 ±1.7242	72.4000 ±1.6334	57.25
Qwen3.5-9B-Q8_0	66.6667 ±1.7225	75.60	38.1333 ±1.7748	34.0000 ±1.7309	73.0667 ±1.6209	57.49
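The figures in this table can be sanity-checked with a few lines of arithmetic. The sketch below assumes that Avg Score is the plain mean of the five benchmarks and that the ± values are binomial standard errors over the 750 tasks with n - 1 in the denominator; both are inferences from the numbers themselves rather than documented behaviour, and the function names are illustrative.

```python
import math


def standard_error_pct(score_pct: float, n_tasks: int) -> float:
    """Binomial standard error in percentage points, using n - 1 in the denominator."""
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / (n_tasks - 1))


def average_score(scores):
    """Unweighted mean of the benchmark scores."""
    return sum(scores) / len(scores)


# Qwen3.5-9B-IQ1_M row: ARC, HellaSwag, MMLU, Truthful QA, WinoGrande
iq1_m = [56.4000, 68.00, 35.7333, 28.5333, 68.9333]
print(round(average_score(iq1_m), 2))
print(round(standard_error_pct(56.4000, 750), 4))
```

With error bars of roughly ±1.7 points per benchmark, most differences between neighbouring rows are within noise, so the averages are best read as a coarse ranking.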

Tokens per second benchmarks

Scores generated using llama-bench. Standard (llama-quantize with no optimization) Q4_K_M quantization included for comparison.

model	size	params	backend	threads	test	t/s
Qwen3.5-9B-Q4_K	5.23 GiB	8.95 B	BLAS,MTL	12	pp512	695.67 ±1.13
Qwen3.5-9B-Q4_K	5.23 GiB	8.95 B	BLAS,MTL	12	tg128	61.14 ±0.06
Qwen3.5-9B-Q4_K	5.23 GiB	8.95 B	BLAS,MTL	12	pp1024+tg1024	99.64 ±5.19
Qwen3.5-9B-Q4_K_M-naive	5.23 GiB	8.95 B	BLAS,MTL	12	pp512	673.71 ±3.18
Qwen3.5-9B-Q4_K_M-naive	5.23 GiB	8.95 B	BLAS,MTL	12	tg128	62.92 ±0.39
Qwen3.5-9B-Q4_K_M-naive	5.23 GiB	8.95 B	BLAS,MTL	12	pp1024+tg1024	96.55 ±6.67
Qwen3.5-9B-Q4_K-B	5.48 GiB	8.95 B	BLAS,MTL	12	pp512	733.50 ±71.65
Qwen3.5-9B-Q4_K-B	5.48 GiB	8.95 B	BLAS,MTL	12	tg128	46.55 ±2.91
Qwen3.5-9B-Q4_K-B	5.48 GiB	8.95 B	BLAS,MTL	12	pp1024+tg1024	92.12 ±1.86
Qwen3.5-9B-Q4_K_M-bartowski	5.48 GiB	8.95 B	BLAS,MTL	12	pp512	742.12 ±69.85
Qwen3.5-9B-Q4_K_M-bartowski	5.48 GiB	8.95 B	BLAS,MTL	12	tg128	60.52 ±0.86
Qwen3.5-9B-Q4_K_M-bartowski	5.48 GiB	8.95 B	BLAS,MTL	12	pp1024+tg1024	108.24 ±0.16
Qwen3.5-9B-Q4_K-U	5.28 GiB	8.95 B	BLAS,MTL	12	pp512	825.73 ±1.08
Qwen3.5-9B-Q4_K-U	5.28 GiB	8.95 B	BLAS,MTL	12	tg128	58.86 ±2.26
Qwen3.5-9B-Q4_K-U	5.28 GiB	8.95 B	BLAS,MTL	12	pp1024+tg1024	97.26 ±0.23
Qwen3.5-9B-Q4_K_M-unsloth	5.28 GiB	8.95 B	BLAS,MTL	12	pp512	803.16 ±21.92
Qwen3.5-9B-Q4_K_M-unsloth	5.28 GiB	8.95 B	BLAS,MTL	12	tg128	61.54 ±0.69
Qwen3.5-9B-Q4_K_M-unsloth	5.28 GiB	8.95 B	BLAS,MTL	12	pp1024+tg1024	108.41 ±2.65

Metrics used

Perplexity: one of the key metrics used in NLP evaluation. It measures the quality of a language model by evaluating how well it predicts the next token given a particular sequence of words. A PPL of 1 indicates an exact match between predicted and actual tokens, whereas values greater than 1 indicate a degree of "surprise", i.e. how much the observed token differs from what the model expected.
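As a minimal sketch of the definition, perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually occurred. The probabilities below are made up for illustration.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that always assigns probability 1 to the correct token
print(perplexity([1.0, 1.0, 1.0]))
# A model that is as uncertain as a uniform choice among 4 tokens
print(perplexity([0.25, 0.25, 0.25]))
```

One way to read the number: a perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.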

Kullback–Leibler (KL) Divergence: a statistical measure of how much one probability distribution differs from another. When quantizing models (or altering the original tensors in any way, for that matter), the closer the resulting model's output probability distribution stays to the original model's, the better; thus the closer to 0, the better.
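For a single token position, the divergence can be sketched as follows, with the baseline model's distribution as the reference. The logits are hypothetical, and reading μKLD as the mean of this quantity over all evaluated positions is an assumption.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(P || Q) in nats; P is the reference (e.g. F16) distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


base = softmax([2.0, 1.0, 0.1])    # hypothetical logits from the baseline model
quant = softmax([1.9, 1.1, 0.1])   # hypothetical logits from a quantized model

print(kl_divergence(base, base))   # identical distributions
print(kl_divergence(base, quant))  # small positive value
```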

AI2 Reasoning Challenge (ARC): a benchmark to evaluate the ability of AI models to answer complex science questions that require logical reasoning beyond pattern matching.

HellaSwag: the Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations (bit of a mouthful!) is a benchmark designed to test commonsense natural language inference. It requires the model to predict the most likely ending of a sentence.

MMLU: the Massive Multitask Language Understanding evaluates LLMs’ general knowledge and problem-solving abilities across 57 subjects, including elementary mathematics, US history, computer science, and law.

Truthful QA: evaluates how well LLMs generate truthful responses to questions. It identifies whether AI models can avoid generating false or misleading information, particularly in areas where human knowledge is prone to misconceptions.

WinoGrande: based on the Winograd Schema Challenge, this is a natural language understanding task requiring models to resolve ambiguities in sentences involving pronoun references.

Credits

llama.cpp has a large and vibrant community of contributors (~1,600 last time I checked) who actively maintain and extend its functionality, adding new models and architectures almost as fast as they appear. Considering the breakneck speed at which the AI/ML field is advancing, this alone is a remarkable feat!

While I'm grateful to all contributors, I want to recognise three in particular:

Colin Kealty (Bartowski), for the many contributions and for being one of the best sources of high quality quantized models available on Hugging Face.
Georgi Gerganov, for his amazing work with llama.cpp and the ggml/gguf libraries.
Iwan Kawrakow, for being one of the key authors behind the many quantization algorithms and the imatrix functionality.
