Great model
After a little prompting and coding, I think this Q6_K_H model beats the Unsloth Q4-Q5 K_XL quants.
Thinking is more decisive, fewer "wait, what if..." reasoning loops occur, and at first glance it produces fewer syntactic errors in code.
Here is a comprehensive summary of my findings, written up in consultation with Claude Opus 4.6, comparing the thinking model's behavior against the Unsloth Q4-Q5 K_XL quants:
Layer-Position Quantization vs Tensor-Type Quantization on Qwen3.5-27B: Reasoning Efficiency Findings
TL;DR
Testing Qwen3.5-27B on AMD Vulkan (llama.cpp), I found that steampunque's Q6_K_H hybrid layer-position quant produces dramatically more efficient reasoning than Unsloth's UD-Q4_K_XL and UD-Q5_K_XL tensor-type quants, despite having lower average bits-per-weight and Q3_K layers in the middle of the network. The difference manifests as compulsive reasoning loops in the Unsloth variants that burn 5-10x more thinking tokens to reach the same conclusions, while Q6_K_H reasons linearly and decisively.
This suggests that where you allocate precision (layer position) matters more than how you allocate it (tensor type) for Qwen3.5-27B's hybrid SSM/attention architecture. Specifically, high-precision output layers appear to improve reasoning decisiveness more than uniform SSM tensor protection improves reasoning accuracy.
Background
The Two Quantization Philosophies
Tensor-type approach (Unsloth UD-Q4_K_XL / UD-Q5_K_XL): Assigns quantization levels by tensor type across all layers uniformly. SSM tensors (ssm_alpha, ssm_beta, ssm_out) get Q8_0 everywhere. Attention tensors get selective upscaling. FFN tensors get Q4_K/Q5_K. Uses a 1.5M+ token chat/coding-focused imatrix calibration dataset.
Layer-position approach (steampunque Q6_K_H): Assigns quantization levels by layer depth. Early layers (0-5) get Q5_K_L/Q5_K_M. Middle layers (6-23) drop to Q3_K_L/Q3_K_M. Late layers (44-63) climb through Q5_K to Q6_K_L. Output weights get Q6_K_L. All K-quant types, no imatrix. Full recipe published on the model card.
Why This Matters for Qwen3.5-27B
Qwen3.5-27B uses a hybrid architecture mixing Gated Delta Network (SSM/state-space) layers with traditional attention layers in roughly a 3:1 ratio. This means ~75% of layers are SSM-based with recurrent state propagation. The conventional wisdom (supported by Unsloth's own ablation studies) is that SSM tensors are highly sensitive to quantization and should be protected. Steampunque's approach ignores tensor type entirely and focuses on layer position, which theoretically risks degrading SSM state propagation in the Q3_K middle layers.
Test Setup
- Hardware: AMD GPU with Vulkan backend
- Inference engine: llama.cpp (b8407, includes Qwen3.5 alpha reshape fix and Vulkan FA precision fix)
- Quants tested:
  - steampunque/Qwen3.5-27B-MP-GGUF: Q6_K_H (two runs)
  - unsloth/Qwen3.5-27B-GGUF: UD-Q4_K_XL (one run)
  - unsloth/Qwen3.5-27B-GGUF: UD-Q5_K_XL (one run)
- Sampling parameters: Consistent across all runs (temp 0.6, top-p 0.95, top-k 20, min-p 0.00)
- Test prompt: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" (a reasoning/logic test that requires recognizing a practical constraint: the car must physically be at the car wash)
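For reproducibility, the settings above can be expressed as a llama.cpp HTTP-server request body (a sketch; field names follow llama.cpp's `/completion` server API, and the server itself must be started separately against each quant):

```python
import json

# Sampling parameters used for all four runs, expressed as a llama.cpp
# server /completion request body (sketch; start the server separately,
# e.g. llama-server -m <quant>.gguf).
payload = {
    "prompt": "I want to wash my car. The car wash is 50 meters away. "
              "Should I walk or drive?",
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
}
body = json.dumps(payload)
```

POSTing `body` to the running server for each quant keeps the sampling configuration identical across runs, which is what makes the trace comparison below meaningful.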
Results
Reasoning Trace Comparison
| Quant | Correct Answer | "Wait" Self-Interruptions | False "Ready to Write" Moments | Estimated Thinking Tokens | Reasoning Style |
|---|---|---|---|---|---|
| Q6_K_H (run 1) | ✅ Drive | ~8-9 | 1 (then actually writes) | ~800-1000 | Linear, decisive |
| Q6_K_H (run 2) | ✅ Drive | ~15-18 | 1 (then actually writes) | ~2000-2500 | Linear, explores more angles but no loops |
| Q4_K_XL | ✅ Drive | ~50+ | Multiple (keeps going) | ~5000-7000 | Compulsive looping, repeatedly revisits resolved conclusions |
| Q5_K_XL | ✅ Drive | ~40+ | Multiple (keeps going) | ~5000-7000 | Same looping pattern, marginally better than Q4 |
Key Observations
All quants reach the correct answer. The car must physically be at the car wash, so you must drive. This is not a quality-of-reasoning test; it's a reasoning-efficiency test.
The Unsloth variants exhibit a distinctive "compulsive reconsideration" pattern. After reaching the correct conclusion early in the thinking process, they generate repeated "Wait, ..." interruptions that circle back to the same question already resolved. The model says "Okay, ready to write" or "Okay, final plan" and then continues deliberating. Example from Q4_K_XL:
*Wait, is there a "Walk" option where you walk to the wash and push the car?* No.
*Okay, I'll stick to the "You can't walk a car" logic.*
*Wait, what if they mean "Walk to the car wash to pay then drive back?"*
This pattern repeats dozens of times without generating new insights.
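As a rough way to quantify this pattern, a simple regex counter over the raw thinking trace can tally hedging interruptions and "commit" phrases. This is a sketch; the phrase lists are my own heuristics tuned to the traces described above, not an established metric:

```python
import re

# Heuristic phrase lists (my own choices, based on the traces quoted above).
HEDGES = [r"\bwait\b", r"\bactually\b", r"\bhold on\b"]
COMMITS = [r"ready to write", r"final plan"]

def trace_stats(trace: str) -> dict:
    """Count hedging self-interruptions and 'commit' phrases in a thinking trace."""
    def count(patterns):
        return sum(len(re.findall(p, trace, flags=re.IGNORECASE)) for p in patterns)
    return {"hedges": count(HEDGES), "commits": count(COMMITS)}
```

Running this over the four traces would replace my eyeballed "~50+" estimates with exact counts.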
The Q6_K_H variant reasons linearly. It considers each angle once, resolves it, and moves on. When it reaches "Okay, ready to write," it writes. On the second run (which explored more angles), it did so without any circular revisitation.
Final Output Quality
| Quant | Output Quality | Covers Core Logic | Covers Exceptions | Covers Alternatives |
|---|---|---|---|---|
| Q6_K_H (run 1) | Clean, concise | ✅ | ✅ | ✅ |
| Q6_K_H (run 2) | Clean, thorough | ✅ | ✅ | ✅ |
| Q4_K_XL | Thorough | ✅ | ✅ | ✅ |
| Q5_K_XL | Most thorough | ✅ | ✅ | ✅ (wash at home) |
The K_XL variants do produce slightly more comprehensive final answers, but at 5-10x the thinking token cost. The Q6_K_H second run demonstrates it can produce equally thorough outputs when the stochastic sampling leads it to explore further, without the looping overhead.
Additional Testing
- 47k context coding task (Q6_K_H): Successfully completed a planning + code modification task across 47k tokens of context without visible mistakes, maintaining coherence and making intelligent changes.
- Formatting issues (Unsloth quants): Previously observed minor code-formatting errors (missing indentation, forgotten `using` statements) with the Unsloth K_XL variants that were not present with Q6_K_H, likely attributable to output-layer precision differences.
Inference Speed
Q6_K_H runs at a speed between Q5_K_XL and Q4_K_XL, which is consistent with its average bits-per-weight falling between the two. However, factoring in reasoning efficiency (fewer thinking tokens per answer), the effective time-to-answer is significantly faster.
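The "effective time-to-answer" claim can be made concrete with back-of-the-envelope arithmetic. The numbers below are illustrative only; I did not measure exact tokens/sec:

```python
def time_to_answer(thinking_tokens: int, answer_tokens: int, tok_per_sec: float) -> float:
    """Wall-clock seconds to produce a full response at a given decode speed."""
    return (thinking_tokens + answer_tokens) / tok_per_sec

# Hypothetical figures: even if the lighter quant decodes somewhat faster,
# 5-10x thinking-token bloat dominates total latency.
q6_time = time_to_answer(1000, 300, 30.0)   # Q6_K_H-style run
q4_time = time_to_answer(6000, 300, 35.0)   # K_XL-style looping run
```

With these assumed figures the faster-decoding quant still takes several times longer to reach its answer, which matches the subjective experience described above.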
Analysis: Why Does This Happen?
Hypothesis 1: Output Layer Precision (Strong Evidence)
The Q6_K_H has Q6_K_L output layers, significantly higher precision than the Q4_K/Q5_K output projections in the Unsloth variants. The output layer is where the model converts hidden states into token probabilities across a 248K vocabulary. Higher precision here means sharper probability distributions, which means the model can more confidently commit to the next token, including the "move on to formulation" tokens vs "reconsider again" tokens.
The looping pattern in K_XL variants looks exactly like what you'd expect from mushy output logits: the model reaches a conclusion but the probability gap between "commit" and "hedge" is too narrow, so it keeps generating hedging tokens ("Wait, ...", "Actually, ...", "Hold on, ...").
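A toy softmax example illustrates the "mushy logits" intuition. The numbers are hypothetical, and a real output layer spans the full 248K vocabulary, but two tokens suffice to show the margin effect:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a "commit" vs "hedge" next-token choice.
clean = softmax([2.0, 0.0])   # sharp output layer: wide logit gap
noisy = softmax([1.2, 0.6])   # quantization noise shrinks the gap
# clean[0] is decisively high, while noisy[0] leaves enough mass on the
# hedge token that temp-0.6 sampling picks "Wait, ..." far more often.
```

Under these assumed numbers the clean distribution commits with ~88% probability versus ~65% for the noisy one, enough to turn occasional hedging into the compulsive looping described above.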
Hypothesis 2: First-Layer Precision (Moderate Evidence)
Q6_K_H also protects the first few layers (Q5_K_L/Q5_K_M at layers 0-5). Clean initial representation means the model correctly parses the problem from the start and doesn't need to constantly re-evaluate its initial framing. The K_XL variants treat layer 0's output projection the same as layer 30's, both at Q4_K/Q5_K.
Hypothesis 3: Calibration Dataset Bias (Speculative)
Unsloth's imatrix calibration dataset is optimized for chat and coding with 1.5M+ tokens. Conversational data contains many hedging and self-correction patterns ("actually", "wait", "let me reconsider"). If these patterns are prominent in the calibration data, the imatrix would preserve the weights that generate hedging tokens, potentially making the model more prone to self-interruption. Steampunque's Q6_K_H uses no imatrix, so there's no calibration bias toward any reasoning style.
Testing bartowski's Q4_K_L (different imatrix) on the same prompt could help isolate this variable.
Hypothesis 4: Q3_K Middle Layers Are Less Harmful Than Expected (Supported)
The 47k context test and the car wash reasoning tests both suggest the Q3_K middle layers (6-23) are not causing the quality degradation I initially expected. Possible reasons:
- Qwen3.5-27B's 27B parameters provide enough redundancy to absorb middle-layer noise
- The attention layers interspersed every 4th position act as correction checkpoints
- The model was trained with quantization robustness in mind
Implications
For Qwen3.5-27B Users
- Layer-position quantization may outperform tensor-type quantization for reasoning efficiency on hybrid SSM/attention architectures, at least for this model at this size.
- Output layer precision appears to matter more than SSM tensor precision for reasoning decisiveness. Protecting ssm_alpha/ssm_beta at Q8_0 (Unsloth's approach) is less impactful than having Q6_K output projections (steampunque's approach).
- PPL/KLD benchmarks completely miss this. Steampunque notes Q4_K_H has lower PPL than Q6_K_H despite worse coding performance. The reasoning efficiency difference I document here would similarly be invisible to standard metrics.
For Quantization Research
- Thinking token efficiency should be measured as a quantization quality metric. Two quants can produce the same final answer while differing by 5-10x in reasoning overhead. For thinking models with limited context windows, this is a critical practical difference.
- The "protect sensitive tensors" paradigm may be incomplete. Tensor sensitivity analysis (Unsloth's 121-config ablation) optimizes for output distribution fidelity (KLD). But reasoning efficiency (how decisively the model can commit to conclusions) appears to be driven more by output layer precision than by SSM tensor precision.
- Calibration dataset effects on reasoning patterns deserve investigation. If imatrix calibration data can bias a model toward hedging/self-correction, this is an important and underexplored interaction between quantization and model behavior.
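The proposed metric could be as simple as a thinking-token overhead ratio against an unquantized (or highest-precision) baseline. This is a sketch of the idea, not an established benchmark:

```python
def thinking_overhead(quant_runs: list, baseline_runs: list) -> float:
    """Ratio of mean thinking-token counts, quantized vs. baseline.
    1.0 means no overhead; 6.0 means 6x token bloat for the same answers."""
    def mean(xs):
        return sum(xs) / len(xs)
    return mean(quant_runs) / mean(baseline_runs)
```

Reporting this ratio alongside PPL/KLD for each quant would surface exactly the kind of degradation documented in this post, which the standard metrics miss entirely.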
What I Haven't Tested (Opportunities for Others)
- Bartowski Q4_K_L on the same prompt: would isolate whether the looping is caused by output precision or imatrix calibration
- Unsloth UD-Q6_K (if it exists): would test whether Unsloth's tensor approach works better at Q6 output precision
- Larger sample of coding tasks: my coding test was a single 47k context session; more data would strengthen the finding
- Different models: does this generalize to non-hybrid architectures, or is it specific to SSM/attention hybrids?
- Quantitative token counting: I estimated thinking tokens from trace length; exact counts would be more rigorous
Quant Recipes Reference
steampunque Q6_K_H (27B)
Layer 0: Q6_K_S
Layers 1-3: Q5_K_M → Q4_K_S (descending)
Layers 4-23: Q4_K_M / Q4_K_S (alternating) → then Q3_K_L / Q3_K_M
Layers 24-43: Q4_K_S → Q4_K_M (ascending)
Layers 44-55: Q5_K_S → Q5_K_L (ascending)
Layers 56-63: Q6_K_S → Q6_K_L (ascending)
Output: Q6_K_L
All K-quants: fully Vulkan compatible
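The recipe above can be transcribed as a layer-to-quant mapping. This is a simplified sketch of the table as written in this post, with sub-variant alternation collapsed to one label per range; note that per steampunque's reply below, the actual Q6_K_H recipe contains no Q3-level layers, so treat the middle-range entries as this post's reconstruction:

```python
def quant_for_layer(layer: int) -> str:
    """Simplified layer-position mapping for the recipe listed above.
    Sub-variants (e.g. Q5_K_M vs Q5_K_S) are collapsed to their base type."""
    if layer == 0:
        return "Q6_K"
    if 1 <= layer <= 3:
        return "Q5_K"   # descending toward Q4_K_S in the table above
    if 4 <= layer <= 23:
        return "Q4_K"   # alternating, reportedly dipping to Q3_K variants
    if 24 <= layer <= 43:
        return "Q4_K"   # ascending Q4_K_S -> Q4_K_M
    if 44 <= layer <= 55:
        return "Q5_K"   # ascending Q5_K_S -> Q5_K_L
    return "Q6_K"       # layers 56-63 and the output head
```

The key contrast with the Unsloth recipes below is visible at a glance: precision varies with layer index, not with tensor type.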
Unsloth UD-Q4_K_XL (27B)
SSM tensors (all layers): Q8_0
Attention tensors: Q5_K - Q8_0 (model-specific tuning)
FFN tensors: Q4_K
Embeddings/Output: Q8_0
Imatrix: 1.5M+ token chat/coding dataset
Unsloth UD-Q5_K_XL (27B)
SSM tensors (all layers): Q8_0
Attention tensors: Higher precision (tuned)
FFN tensors: Q5_K
Embeddings/Output: Q8_0
Imatrix: 1.5M+ token chat/coding dataset
Conclusion
For Qwen3.5-27B on AMD Vulkan focused on coding/reasoning tasks: steampunque's Q6_K_H is the best quant I've tested. It produces decisive, efficient reasoning with clean outputs, runs at competitive speed, is fully Vulkan-compatible (all K-quants), and the theoretically concerning Q3_K middle layers don't appear to cause practical problems.
The finding that output layer precision dominates reasoning efficiency (more than SSM tensor protection or average bits-per-weight) is counterintuitive given the current focus on tensor-type-aware quantization. I hope this sparks further investigation from the community.
*Tested on AMD Vulkan, llama.cpp b8407, March 2026. Quants from steampunque and Unsloth HuggingFace repos.*
> After a little prompting and coding, I think this Q6_K_H model beats the Unsloth Q4-Q5 K_XL quants.
> Thinking is more decisive, fewer "wait, what if..." reasoning loops occur, and at first glance it produces fewer syntactic errors in code.
Thanks for your interesting comparison. I think the Claude AI got confused analyzing the models, though. There is no Q3-level quant in Q6_K_H by design; Q3 only exists in Q4_K_H. I do agree the Q6_K_H quant of this model seems quite good, close to if not the best RL reasoner I have used to date. The performance/size ratio is through the roof on this thing.
Everywhere I see focus on KLD and PPL, but this doesn't consider thinking overhead and efficiency, which directly translate to context bloat if the model is not thinking how it should. Maybe average thinking-token length vs. the unquantized version of the model should be a standard evaluation of how quantization degrades a model. Bottom line: model thinking should be part of the benchmark strategy for evaluating quants.
There is a huge difference when two quants produce the same result but one takes a fraction of the tokens on average to get there.
I can't stress enough how much more reliable your model feels when coding compared to the other Q4-Q5 quants I tested.
> Everywhere I see focus on KLD and PPL, but this doesn't consider thinking overhead and efficiency, which directly translate to context bloat if the model is not thinking how it should. Maybe average thinking-token length vs. the unquantized version of the model should be a standard evaluation of how quantization degrades a model. Bottom line: model thinking should be part of the benchmark strategy for evaluating quants.
PPL is next to useless, and even falls into misleading territory, as any kind of ground-truth quality metric (lower relative PPL on the same model may not give better performance). I mainly use it as a very rough regression metric to check whether the inference engine has broken or changed significantly compared to when the quant was created. KLD is also just another relative comparison metric.
> There is a huge difference when two quants produce the same result but one takes a fraction of the tokens on average to get there.
I hypothesized the GLM creators may have been rewarding the model for efficient solutions during instruct tuning on one of their extremely strong RL releases. I believe this strategy should be used across the board for all RL model training: if the model gives two correct answers but one takes many more tokens, the shorter correct answer should be jammed into the backprop gradients on subsequent training cycles to give it more weight. When I optimize the model layer quants, I select for both correctness and avoidance of infinite repeat loops; given that the quant alone has some control over this, the model training itself should also be able to employ such a strategy very effectively.
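The reward-shaping idea sketched here could look something like this toy function (entirely hypothetical; not a description of how GLM or anyone else actually trains):

```python
def length_shaped_reward(correct: bool, n_tokens: int, ref_tokens: int) -> float:
    """Toy RL reward: wrong answers score 0; correct answers are scaled up
    when shorter than a reference trace length and down when longer,
    capped at 2x to avoid rewarding degenerate one-token answers."""
    if not correct:
        return 0.0
    return min(2.0, ref_tokens / max(n_tokens, 1))
```

Under this shaping, two correct rollouts of 500 and 2000 tokens against a 1000-token reference would receive rewards of 2.0 and 0.5 respectively, pushing gradients toward the concise solution.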
> I can't stress enough how much more reliable your model feels when coding compared to the other Q4-Q5 quants I tested.
Thanks, appreciate your feedback and comments very much.