I spun this fellow up, showed it its config, and asked for suggestions; here is the reply:

#1
by HankFromResearch - opened

That is fantastic news! 🎉 Hearing that the model is passing with flying colors is the best feedback I could ask for. It sounds like that quantization strategy is hitting the sweet spot between precision and efficiency for your specific use case.

Since you're open to optimization, here are a few targeted suggestions that might squeeze out even more performance, depending on your hardware constraints and specific goals:

1. Quantize rms_norm to Q8_0

Qwen models rely heavily on RMSNorm (Root Mean Square Layer Normalization) for stability. In many high-performance quantization recipes these normalization weights are left at Q4, but moving them to Q8_0 can reduce numerical drift over long reasoning chains.

  • Why: Small errors in normalization can compound over hundreds of layers, leading to slight degradation in logic consistency.
  • Change:
    --tensor-type rms_norm=q8_0
    
  • Cost: Negligible VRAM increase (these tensors are small).
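As a sketch of how that flag would be applied at quantization time (the model paths and the Q4_K_M base type here are placeholders; `--tensor-type` requires a reasonably recent llama.cpp `llama-quantize`):

```shell
# Sketch: re-quantize with the normalization weights pinned to Q8_0.
# Input/output paths and the Q4_K_M base type are illustrative placeholders.
./llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type rms_norm=q8_0 \
    model-bf16.gguf model-q4_k_m.gguf Q4_K_M
```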

2. Refine the Imatrix for Reasoning

You are currently using the Unsloth imatrix, which is excellent for general chat and coding. However, since you are specifically targeting logical reasoning (like the lineage-bench):

  • Why: A general imatrix optimizes for general perplexity. A custom imatrix generated from a dataset rich in logic puzzles, math, or chain-of-thought data might prioritize the weights responsible for reasoning over conversational fluency.
  • Suggestion: If you have the compute, consider generating a custom imatrix using the lineage-bench dataset itself or a similar reasoning corpus before quantizing.
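If you do go down the custom-imatrix road, the generation step is a single `llama-imatrix` run over your calibration text (the corpus filename below is hypothetical; any plain-text file of logic/chain-of-thought samples works):

```shell
# Sketch: build a custom importance matrix from a reasoning-heavy corpus.
# calibration-reasoning.txt is a hypothetical calibration file.
./llama-imatrix -m model-bf16.gguf \
    -f calibration-reasoning.txt \
    -o imatrix-reasoning.dat
```

The resulting file is then passed to `llama-quantize` via `--imatrix imatrix-reasoning.dat`.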

3. K/V Cache Trade-off (Q8_0 vs BF16)

You currently have attn_k and attn_v in BF16. This is great for accuracy but eats up VRAM quickly, especially with long contexts.

  • Why: If you find you are running out of memory on very long contexts (e.g., 128k+), switching K/V to Q8_0 instead of BF16 usually retains >95% of the performance with half the memory footprint.
  • Suggestion: Test Q8_0 for K/V if you hit memory walls. If you have plenty of VRAM, keeping them at BF16 is fine for maximum precision.
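The K/V trade-off is an inference-time switch rather than a re-quantization, set with llama.cpp's cache-type flags (shown later in this thread's settings); note the quantized cache generally wants flash attention enabled:

```shell
# Sketch: run with a Q8_0-quantized KV cache (model path illustrative).
# -ctk/-ctv set the K and V cache types; -fa enables flash attention.
./llama-server -m model.gguf -fa on -ctk q8_0 -ctv q8_0 -c 65536
```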

4. Inference Parameters

Sometimes the "performance" isn't just about the weights, but how you sample the output. For reasoning tasks:

  • Temperature: Keep it low (0.1 - 0.3) for deterministic logic.
  • Top-P / Min-P: min_p (e.g., 0.05) is often better than top_p for reasoning models as it prevents the model from "hallucinating" rare tokens that might break a logical chain.
  • Repeat Penalty: Slightly higher (1.1 - 1.2) can help prevent loops in long reasoning traces.
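Pulled together as one illustrative `llama-cli` invocation (the exact values are mid-range picks from the suggestions above, not tuned settings):

```shell
# Sketch: conservative sampling for reasoning runs; values are illustrative.
./llama-cli -m model.gguf \
    --temp 0.2 --min-p 0.05 --top-k 40 \
    --repeat-penalty 1.15 \
    -p "Solve the following step by step:"
```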

5. Verification on SSM Layers

One small technical note: Standard Qwen3.5-27B is a Transformer architecture. The script includes ssm_ba, ssm_beta, and ssm_alpha (State Space Model layers).

  • Observation: If this is a standard Qwen3.5-27B, those specific tensors might not exist in the model, or they might be part of a specific hybrid variant you are using.
  • Suggestion: If those layers aren't actually present in the base weights, the flags will just be ignored harmlessly. If they are present (e.g., in a specific fine-tune), keeping them in BF16 is the right call for stability. Just ensure the quantization script matches the actual model architecture to avoid confusion.
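One quick way to settle the question is to list the tensors actually present in the GGUF before quantizing; the `gguf-dump` tool from the `gguf` Python package (`pip install gguf`) prints metadata and tensor names:

```shell
# Sketch: check whether any SSM tensors exist in the model file.
# If this prints nothing, the ssm_* --tensor-type flags are no-ops.
gguf-dump model-bf16.gguf | grep -i ssm
```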

Summary

If I had to pick the single highest-impact change for reasoning without hurting VRAM much, I'd go with Option 1 (rms_norm to Q8_0). It's a "free" upgrade for stability.

How much VRAM headroom do you have left during these tests? That usually dictates whether we can push for higher precision (Q5_K_M) or if we need to stay lean (IQ4_NL). 🛠️🚀

the llama.cpp settings:

--temp 0.6
--top-p 0.95
--min-p 0.01
--top-k 40
-c 65536
--fit on
-fa on
-t 20
--no-mmap
--jinja
-ctk q8_0
-ctv q8_0

I don’t see an RMS Norm tensor. Not a bad call on the imatrix; maybe I’ll get around to making a custom calibration dataset some day. Regarding the KV cache: I do run the V-cache in Q8, but I’m still testing the K-cache in Q8 for long-horizon tasks. Without something like ik_llama.cpp’s Hadamard transform, I’m hesitant to quantize the K-cache at all right now.

I think your mods may have busted out some of the Qwen team’s secret sauce; either that, or it’s hallucinating hard.

Nevertheless, this paper does exist:
https://arxiv.org/abs/1910.07467

and the code
https://github.com/bzhangGo/rmsnorm

as well as some critique from NeurIPS
https://proceedings.neurips.cc/paper_files/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Reviews.html

What I do know is that I ran it with your mods, low temp, and min-p as it suggested, with a repeat penalty of 1.2, and it nailed an advanced physics rubric. Then again, a broken clock is right twice a day.

dinerburger changed discussion status to closed