Qwen3.5-397B-A17B Heretic GGUF Quants
This repo contains my GGUF quants of Sabomako's BF16 GGUF release of Qwen3.5-397B-A17B Heretic.
I made these quants to cover a few practical tradeoffs for a model this large:
- Q/K quants for stronger quality preservation
- IQ quants for better quality-to-size efficiency
- a small set of variants that are actually useful on my hardware, rather than every possible quant for completeness
These files are intended for llama.cpp-compatible runtimes and for anyone who wants ready-to-use quants of this specific Heretic variant without building them themselves.
Source model
These quants were made from:
- BF16 GGUF source: Sabomako/Qwen3.5-397B-A17B-heretic-bf16-GGUF
- Underlying Heretic model: trohrbaugh/Qwen3.5-397B-A17B-heretic
- Original upstream base: Qwen/Qwen3.5-397B-A17B
Quantizations in this repo
This repo includes or will include these GGUF variants:
- IQ2_M
- IQ3_XXS
- Q3_K_L
- IQ4_XS
- Q4_K_L
Additional variants may be added over time if they fill a real gap.
What the variants are for
I picked these quants to cover different use cases.
- IQ2_M is the smallest serious option here and is mainly for cases where fitting the model is the top priority.
- IQ3_XXS is a step up from IQ2_M for people who can afford a bit more space and want noticeably better quality while still staying in a relatively compact range for a model this large.
- Q3_K_L is the stronger 3-bit option when you want to preserve more quality.
- IQ4_XS is the more efficient 4-bit-ish option with a strong quality-to-size balance.
- Q4_K_L is the highest-quality quant in this repo and the one to start with if your priority is output quality.
Quantization approach
- Quantization was performed from the BF16 GGUF source
- I used an imatrix-based workflow
- All quants in this repo use a quality-focused mixed setup
- In every quant here, the main model tensors are quantized to the target quant type, while the token embeddings and output tensor are kept at Q8_0
- This applies to the IQ quants as well as the Q/K quants
- The goal was to preserve as much quality as possible where it seemed worthwhile, rather than chase the absolute smallest file size
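As a rough sketch, this kind of mixed setup corresponds to a llama.cpp `llama-quantize` invocation along these lines (the file names are placeholders, and exact flag support depends on your llama.cpp build):

```shell
# Hypothetical file names; adjust paths to your local setup.
# --imatrix supplies the importance matrix used to weight quantization error.
# --token-embedding-type and --output-tensor-type override just those two
# tensors to Q8_0, while the rest of the model gets the target type (here IQ4_XS).
./llama-quantize \
    --imatrix imatrix.dat \
    --token-embedding-type Q8_0 \
    --output-tensor-type Q8_0 \
    Qwen3.5-397B-A17B-heretic-bf16.gguf \
    Qwen3.5-397B-A17B-heretic-IQ4_XS.gguf \
    IQ4_XS
```

This is a sketch of the general technique, not the exact command used for these files.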
Important note on naming
The quant names in this repo follow the familiar shorthand such as IQ2_M, IQ3_XXS, IQ4_XS, Q3_K_L, and Q4_K_L, but they are not pure stock versions of those formats.
In practice, all of them are mixed variants where:
- the main tensors use the named quant type
- token_embd.weight is kept at Q8_0
- output.weight is kept at Q8_0
I manually verified that these Q8_0 layers are present in the generated files.
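If you want to repeat that check yourself, the `gguf` Python package that ships with llama.cpp can read tensor metadata from a GGUF file. A minimal sketch (the helper names and the sample tensor map are mine, not part of any API):

```python
# Sketch of verifying that the embedding/output tensors are stored as Q8_0.
OVERRIDDEN = {"token_embd.weight", "output.weight"}

def q8_overrides_present(tensor_types: dict) -> bool:
    """True if every overridden tensor is stored as Q8_0."""
    return all(tensor_types.get(name) == "Q8_0" for name in OVERRIDDEN)

def read_tensor_types(path: str) -> dict:
    """Map tensor name -> quant type name for a GGUF file on disk."""
    # Requires the `gguf` package that ships with llama.cpp (pip install gguf).
    from gguf import GGUFReader
    return {t.name: t.tensor_type.name for t in GGUFReader(path).tensors}

# Example with a toy tensor map; a real file would go through read_tensor_types():
sample = {
    "token_embd.weight": "Q8_0",
    "output.weight": "Q8_0",
    "blk.0.attn_q.weight": "IQ4_XS",
}
print(q8_overrides_present(sample))  # → True
```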
Which quant should you use?
A rough guide:
- IQ2_M if your main goal is making the model fit on the smallest possible hardware
- IQ3_XXS if you want a noticeably better low-end option than IQ2_M without jumping too far in size
- Q3_K_L if you want a stronger 3-bit option with more emphasis on quality
- IQ4_XS if you want an efficient 4-bit-ish quant with a strong quality-to-size balance
- Q4_K_L if you want the best quality in this repo
Notes on the tradeoffs
At this model size, the differences between quant families are not just about raw size.
- IQ2_M is the fit-first floor
- IQ3_XXS is a better low-end quality/size compromise if you can spare the extra space
- The IQ variants are the more space-efficient options overall
- The Q/K variants are the more quality-focused options in their tier
- Because all variants here keep embeddings and output at Q8_0, they should be thought of as slightly more quality-focused than plain stock versions with the same short name
Intended use
These files are for local inference, experimentation, and benchmarking in llama.cpp-compatible runtimes.
Possible use cases include:
- chat and assistant use
- coding
- long-context experimentation
- quant comparison on very large MoE models
- running this specific Heretic variant locally without having to quantize it yourself
Notes
- This is a quantized derivative repo, not the original model release
- Behaviour will differ from the BF16 source to some extent because quantization always changes the model
- Different runtimes and frontends may differ in speed, memory use, and feature support
- These are still very large files, even at lower quant levels
Compatibility
These files are intended for:
- llama.cpp
- llama.cpp-based frontends and servers
- other runtimes that support these GGUF quant formats
Actual support depends on the version of the backend you are using.
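As a quick smoke test with plain llama.cpp, an invocation like the following should work (the model file name is a placeholder; tune the GPU offload and context to your hardware):

```shell
# Hypothetical invocation; the model file name is a placeholder.
# -ngl offloads layers to the GPU, -c sets the context size.
./llama-cli \
    -m Qwen3.5-397B-A17B-heretic-IQ2_M.gguf \
    -ngl 99 \
    -c 8192 \
    -p "Write a haiku about quantization."
```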
Credits
Credit goes to:
- Sabomako for publishing the BF16 GGUF source used for these quants
- trohrbaugh for the underlying Qwen3.5-397B-A17B-heretic model
- Qwen for the original Qwen3.5-397B-A17B base model
- Bartowski for popularising the higher-precision embed/output quant style that inspired the setup used here
Final note
I made this repo because I wanted ready-to-use, quality-focused GGUFs of this specific model variant.
If you want the best quality in this set, start with Q4_K_L.
If you want the most practical fit-first option, start with IQ2_M.
If you want a noticeably better low-end option and can spare the space, try IQ3_XXS.
If you want a middle ground, try Q3_K_L or IQ4_XS.