# Qwen3-4B-Thinking-2507-Heretic-GGUF

llama.cpp imatrix quantizations of Qwen3-4B-Thinking-2507-Heretic by becnic (based on the original Qwen3-4B-Thinking-2507), using llama.cpp release b7120 for quantization.

Original model: https://huggingface.co/becnic/Qwen3-4B-Thinking-2507-Heretic

Run them in LM Studio, or directly with llama.cpp or any other llama.cpp-based project.

Download a file (not the whole branch) from below:
| Filename | Quant type | File Size | Split | Description |
|---|---|---|---|---|
| Qwen3-4B-Thinking-2507-Q8_0.gguf | Q8_0 | 4.28GB | false | Extremely high quality |
## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```shell
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```shell
huggingface-cli download becnic/Qwen3-4B-Thinking-2507-Heretic-GGUF --include "Qwen3-4B-Thinking-2507-Q8_0.gguf" --local-dir ./
```
If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:

```shell
huggingface-cli download becnic/Qwen3-4B-Thinking-2507-Heretic-GGUF --include "Qwen3-4B-Thinking-2507-Q8_0.gguf/*" --local-dir ./
```
## Abliteration parameters
| Parameter | Value |
|---|---|
| direction_index | 19.42 |
| attn.o_proj.max_weight | 1.23 |
| attn.o_proj.max_weight_position | 22.34 |
| attn.o_proj.min_weight | 0.69 |
| attn.o_proj.min_weight_distance | 10.42 |
| mlp.down_proj.max_weight | 1.12 |
| mlp.down_proj.max_weight_position | 29.64 |
| mlp.down_proj.min_weight | 1.08 |
| mlp.down_proj.min_weight_distance | 20.24 |
## Performance
| Metric | This model | Original model (Qwen/Qwen3-4B-Thinking-2507) |
|---|---|---|
| KL divergence | 0.06 | 0 (by definition) |
| Refusals | 6/100 | 96/100 |
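The KL-divergence figure above measures how far this model's next-token distributions have drifted from the original model's (0 would mean identical behavior). A minimal sketch of the computation on toy distributions — the numbers below are illustrative, not taken from the actual evaluation:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats.
    Measures how much Q diverges from the reference distribution P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 4-token vocabulary:
original = [0.70, 0.20, 0.05, 0.05]  # stands in for the base model
modified = [0.60, 0.25, 0.10, 0.05]  # stands in for the abliterated model

print(round(kl_divergence(original, modified), 4))  # -> 0.0286
```

In practice this is averaged over many positions on a held-out prompt set; a small value like 0.06 indicates the abliteration left the model's overall behavior largely intact.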
## Model Overview
Qwen3-4B-Thinking-2507 has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 262,144 natively.
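As a back-of-the-envelope check on memory use, the layer and KV-head counts above give a rough full-context KV-cache estimate. This sketch assumes a head dimension of 128 and FP16 cache values, neither of which is stated in this card:

```python
# Figures from the overview above:
layers, kv_heads, context = 36, 8, 262_144
# Assumptions (not stated in this card):
head_dim = 128        # assumed per-head dimension
bytes_per_value = 2   # assumed FP16 cache
kv_tensors = 2        # one K and one V cache per layer

cache_bytes = kv_tensors * layers * kv_heads * head_dim * context * bytes_per_value
print(f"{cache_bytes / 2**30:.1f} GiB")  # -> 36.0 GiB
```

This is why GQA (8 KV heads instead of 32) matters at long context: the cache would be 4x larger with full multi-head attention.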
NOTE: This model supports only thinking mode; specifying `enable_thinking=True` is no longer required.
Additionally, to enforce thinking, the default chat template automatically includes `<think>`. It is therefore normal for the model's output to contain only a closing `</think>` without an explicit opening `<think>` tag.
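Because the opening `<think>` is already part of the prompt, code that parses the completion should split only on the closing tag. A minimal sketch (the function name and sample text are illustrative):

```python
def split_thinking(output: str):
    """Split a completion into (thinking, answer). Because the chat
    template already emits the opening <think>, the raw output usually
    contains only the closing </think> tag."""
    if "</think>" in output:
        thinking, answer = output.split("</think>", 1)
        return thinking.strip(), answer.strip()
    return "", output.strip()  # no closing tag: treat it all as answer

raw = "First I consider the question...\n</think>\nThe answer is 42."
thinking, answer = split_thinking(raw)
print(answer)  # -> The answer is 42.
```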
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
## Model tree for becnic/Qwen3-4B-Thinking-2507-Heretic-GGUF

Base model: Qwen/Qwen3-4B-Thinking-2507