Qwen3.5-397B-A17B Heretic GGUF Quants
This repo contains my GGUF quants of Sabomako's BF16 GGUF release of Qwen3.5-397B-A17B Heretic.
I made these quants to cover a few practical tradeoffs for a model this large:
- Q/K quants for stronger quality preservation
- IQ quants for better quality-to-size efficiency
- a small set of variants that are actually useful on my hardware, rather than every possible quant for completeness
These files are intended for llama.cpp-compatible runtimes and for anyone who wants ready-to-use quants of this specific Heretic variant without building them themselves.
Source model
These quants were made from:
- BF16 GGUF source: Sabomako/Qwen3.5-397B-A17B-heretic-bf16-GGUF
- Underlying Heretic model: trohrbaugh/Qwen3.5-397B-A17B-heretic
- Original upstream base: Qwen/Qwen3.5-397B-A17B
Quantizations in this repo
This repo includes or will include these GGUF variants:
- IQ2_M
- IQ3_XXS
- Q3_K_L
- IQ4_XS
- Q4_K_L
Additional variants may be added over time if they fill a real gap.
What the variants are for
I picked these quants to cover different use cases.
- IQ2_M is the smallest serious option here and is mainly for cases where fitting the model is the top priority.
- IQ3_XXS is a step up from IQ2_M for people who can afford a bit more space and want noticeably better quality while still staying in a relatively compact range for a model this large.
- Q3_K_L is the stronger 3-bit option when you want to preserve more quality.
- IQ4_XS is the more efficient 4-bit-ish option with a strong quality-to-size balance.
- Q4_K_L is the highest-quality quant in this repo and the one to start with if your priority is output quality.
Quantization approach
- Quantization was performed from the BF16 GGUF source
- I used an imatrix-based workflow
- All quants in this repo use a quality-focused mixed setup
- In every quant here, the main model tensors are quantized to the target quant type, while the token embeddings and output tensor are kept at Q8_0
- This applies to the IQ quants as well as the Q/K quants
- The goal was to preserve as much quality as possible where it seemed worthwhile, rather than chase the absolute smallest file size
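As a rough sketch, this kind of mixed setup corresponds to a llama.cpp `llama-quantize` invocation along these lines (the file names are placeholders, and exact flag support depends on your llama.cpp build):

```shell
# Hypothetical file names; adjust paths to your local setup.
# --imatrix supplies the importance matrix used to weight quantization error.
# --token-embedding-type and --output-tensor-type override just those two
# tensors to Q8_0, while the rest of the model gets the target type (here IQ4_XS).
./llama-quantize \
    --imatrix imatrix.dat \
    --token-embedding-type Q8_0 \
    --output-tensor-type Q8_0 \
    Qwen3.5-397B-A17B-heretic-bf16.gguf \
    Qwen3.5-397B-A17B-heretic-IQ4_XS.gguf \
    IQ4_XS
```

This is a sketch of the general technique, not the exact command used for these files.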
Important note on naming
The quant names in this repo follow the familiar shorthand such as IQ2_M, IQ3_XXS, IQ4_XS, Q3_K_L, and Q4_K_L, but they are not pure stock versions of those formats.
In practice, all of them are mixed variants where:
- the main tensors use the named quant type
- token_embd.weight is kept at Q8_0
- output.weight is kept at Q8_0
I manually verified that these Q8_0 layers are present in the generated files.
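If you want to repeat that check yourself, the `gguf` Python package that ships with llama.cpp can read tensor metadata from a GGUF file. A minimal sketch (the helper names and the sample tensor map are mine, not part of any API):

```python
# Sketch of verifying that the embedding/output tensors are stored as Q8_0.
OVERRIDDEN = {"token_embd.weight", "output.weight"}

def q8_overrides_present(tensor_types: dict) -> bool:
    """True if every overridden tensor is stored as Q8_0."""
    return all(tensor_types.get(name) == "Q8_0" for name in OVERRIDDEN)

def read_tensor_types(path: str) -> dict:
    """Map tensor name -> quant type name for a GGUF file on disk."""
    # Requires the `gguf` package that ships with llama.cpp (pip install gguf).
    from gguf import GGUFReader
    return {t.name: t.tensor_type.name for t in GGUFReader(path).tensors}

# Example with a toy tensor map; a real file would go through read_tensor_types():
sample = {
    "token_embd.weight": "Q8_0",
    "output.weight": "Q8_0",
    "blk.0.attn_q.weight": "IQ4_XS",
}
print(q8_overrides_present(sample))  # → True
```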
Which quant should you use?
A rough guide:
- IQ2_M if your main goal is making the model fit on the smallest possible hardware
- IQ3_XXS if you want a noticeably better low-end option than IQ2_M without jumping too far in size
- Q3_K_L if you want a stronger 3-bit option with more emphasis on quality
- IQ4_XS if you want an efficient 4-bit-ish quant with a strong quality-to-size balance
- Q4_K_L if you want the best quality in this repo
Notes on the tradeoffs
At this model size, the differences between quant families are not just about raw size.
- IQ2_M is the fit-first floor
- IQ3_XXS is a better low-end quality/size compromise if you can spare the extra space
- The IQ variants are the more space-efficient options overall
- The Q/K variants are the more quality-focused options in their tier
- Because all variants here keep embeddings and output at Q8_0, they should be thought of as slightly more quality-focused than plain stock versions with the same short name
Intended use
These files are for local inference, experimentation, and benchmarking in llama.cpp-compatible runtimes.
Possible use cases include:
- chat and assistant use
- coding
- long-context experimentation
- quant comparison on very large MoE models
- running this specific Heretic variant locally without having to quantize it yourself
Notes
- This is a quantized derivative repo, not the original model release
- Behaviour will differ from the BF16 source to some extent because quantization always changes the model
- Different runtimes and frontends may differ in speed, memory use, and feature support
- These are still very large files, even at lower quant levels
Compatibility
These files are intended for:
- llama.cpp
- llama.cpp-based frontends and servers
- other runtimes that support these GGUF quant formats
Actual support depends on the version of the backend you are using.
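As a quick smoke test with plain llama.cpp, an invocation like the following should work (the model file name is a placeholder; tune the GPU offload and context to your hardware):

```shell
# Hypothetical invocation; the model file name is a placeholder.
# -ngl offloads layers to the GPU, -c sets the context size.
./llama-cli \
    -m Qwen3.5-397B-A17B-heretic-IQ2_M.gguf \
    -ngl 99 \
    -c 8192 \
    -p "Write a haiku about quantization."
```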
Credits
Credit goes to:
- Sabomako for publishing the BF16 GGUF source used for these quants
- trohrbaugh for the underlying Qwen3.5-397B-A17B-heretic model
- Qwen for the original Qwen3.5-397B-A17B base model
- Bartowski for popularising the higher-precision embed/output quant style that inspired the setup used here
Final note
I made this repo because I wanted ready-to-use, quality-focused GGUFs of this specific model variant.
If you want the best quality in this set, start with Q4_K_L.
If you want the most practical fit-first option, start with IQ2_M.
If you want a noticeably better low-end option and can spare the space, try IQ3_XXS.
If you want a middle ground, try Q3_K_L or IQ4_XS.