How is this different from the other quants?

#1
by TomLucidor - opened

I really want to get into testing the newest models with vLLM-MLX or MLX-OpenAI-Server and play with agents, but what is qx64 vs "regular" Q4? https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-4bit https://huggingface.co/McG-221/Qwen3.5-35B-A3B-mxfp8

In the qx quants I enhance some extra layers, in addition to the differential quantization already available

Reduced to its essentials, this is how Deckard (qx) operates. This is the formula I apply to Qwens; other architectures need different tuning. This one is ported directly from the Qwen3 line, so after a bit more research I might refine it for the 3.5, since a lot of expert layers could be enhanced and are generally small

            if "v_proj" in path or "v_a_proj" in path or "v_b_proj" in path:
                layer_bits = high_bits
                group_size = high_group_size
                if use_all_bits:
                    layer_bits = head_bits

            # Only enhance down_proj for use_more_bits layers
            if "down_proj" in path and use_more_bits:
                layer_bits = high_bits

            if "model.embed_tokens" in path:
                layer_bits = high_bits
                if uses_brainstorming:
                    group_size = 32

            if "lm_head" in path:
                layer_bits = head_bits
                if uses_brainstorming:
                    group_size = 32
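To make the branches above concrete, here is a toy sketch (not the full recipe) that applies just these four rules to some hypothetical weight paths. The bit widths `low_bits=4`, `high_bits=6`, `head_bits=8` and the group size of 32 for the value projections are illustrative assumptions, not the exact values of any published quant:

```python
# Toy sketch of the per-layer bit selection shown above; the real recipe
# has more branches and context. All defaults here are assumptions.
def pick_bits(path, low_bits=4, high_bits=6, head_bits=8,
              use_more_bits=False, use_all_bits=False,
              uses_brainstorming=False):
    layer_bits, group_size = low_bits, 64
    # Value projections get high bits (and head bits on "all bits" layers)
    if "v_proj" in path or "v_a_proj" in path or "v_b_proj" in path:
        layer_bits = high_bits
        group_size = 32
        if use_all_bits:
            layer_bits = head_bits
    # Only enhance down_proj for use_more_bits layers
    if "down_proj" in path and use_more_bits:
        layer_bits = high_bits
    # Embeddings get high bits; brainstorming variants also shrink the group
    if "model.embed_tokens" in path:
        layer_bits = high_bits
        if uses_brainstorming:
            group_size = 32
    # The output head always gets head bits
    if "lm_head" in path:
        layer_bits = head_bits
        if uses_brainstorming:
            group_size = 32
    return layer_bits, group_size

print(pick_bits("model.layers.0.self_attn.v_proj"))              # (6, 32)
print(pick_bits("model.layers.0.mlp.down_proj", use_more_bits=True))
print(pick_bits("lm_head"))
```

So a qx64 quant keeps most layers at 4 bits while the attention value paths, selected down_proj layers, embeddings, and the head run at 6 to 8 bits.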

How it is different: some metrics

Usually the highest arc score, and generally the highest performance for its size, outperforming q8 (mxfp8 is a fair approximation)

https://huggingface.co/nightmedia/Qwen3-30B-A3B-Element16-qx86-hi-mlx

Qwen3-30B-A3B-Element16
mxfp8    0.561,0.705,0.885,0.739,0.452,0.794,0.702
qx86-hi  0.562,0.751,0.882,0.752,0.468,0.807,0.695
qx64-hi  0.574,0.753,0.878,0.748,0.464,0.805,0.688
mxfp4    0.544,0.699,0.875,0.741,0.438,0.800,0.671

https://huggingface.co/nightmedia/Qwen3-30B-A3B-Element16b-qx86-hi-mlx

Qwen3-30B-A3B-Element16b
mxfp8    0.561,0.710,0.880,0.739,0.452,0.797,0.696
qx86-hi  0.569,0.742,0.882,0.751,0.466,0.803,0.699
qx64-hi  0.574,0.759,0.872,0.750,0.468,0.799,0.708
mxfp4    0.538,0.701,0.879,0.741,0.442,0.803,0.671
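The seven comma-separated values per row appear to be individual benchmark scores (the benchmark names aren't listed here). Taking a simple unweighted mean per row is a quick, rough way to eyeball the size-for-performance claim, e.g. for the Element16b rows:

```python
# Rough sanity check: unweighted mean score per quant for the
# Element16b rows above. An equal-weight average is an assumption;
# the individual benchmarks may deserve different weights.
rows = {
    "mxfp8":   [0.561, 0.710, 0.880, 0.739, 0.452, 0.797, 0.696],
    "qx86-hi": [0.569, 0.742, 0.882, 0.751, 0.466, 0.803, 0.699],
    "qx64-hi": [0.574, 0.759, 0.872, 0.750, 0.468, 0.799, 0.708],
    "mxfp4":   [0.538, 0.701, 0.879, 0.741, 0.442, 0.803, 0.671],
}
means = {name: sum(v) / len(v) for name, v in rows.items()}
for name, m in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {m:.4f}")
```

On this crude average the qx quants land above mxfp8 despite the smaller footprint, which is the point being made.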

I will have an updated version of this quant soon, which should improve performance a bit

The quants were updated in place, and the updated formula is more stable now

It brings out some extra intelligence, so Nightmedia's models are better. People who ask AI for relationship advice probably won't notice the difference.

But for science, analysis, and new problems, it does help.

@bobig do people really RP with this model? What about Kimi-Linear or Nemotron? (like REALLY)

Not with this model, no :)

Definitely no new model can do RP in its standard configuration, since the bullet-point output on everything sort of kills the buzz.

We will see the old Qwens around for a while now. Even Coder-Next has better "vibes", in fact a bit too much. I was waiting for someone to do a training on it so I can do a merge; for that, mergekit would need to support the Next architecture, which is a different thing altogether.

I noticed a vast improvement in social skills with pretty much all merged models. It's not just the 100+ points in arc that we usually get; it's the availability of alternate chains of thought from the models in the merge. This increases the perceived entropy from stock to a level where some models claimed they "broke the fourth wall" with it. To each his own, whatever that wall means

Here is the current Deckard (qx) predicate I use for Qwen and Gemma models. The formulas for Next, GLM, Ring/Ling, etc. are different, enhancing the platform-specific layers that need it. As you can see, it's a work in progress, and there's nothing proprietary about it :)

from typing import Callable, Union

import mlx.nn as nn


def mixed_quant_predicate_builder(
    recipe: str, model: nn.Module, group_size: int = 64
) -> Callable[[str, nn.Module, dict], Union[bool, dict]]:
    mode = "affine"
    high_bits = 6

    very_high_bits = 8
    low_group_size = group_size
    high_group_size = group_size
    is_highres = 0
    is_test = 0
    if "hi" in recipe:
        low_group_size = 32
        high_group_size = 32
        is_highres = 1
    # these are already there
    if recipe == "mixed_2_6":
        low_bits = 2
    elif recipe == "mixed_3_4":
        low_bits = 3
        high_bits = 4
    elif recipe == "mixed_3_6":
        low_bits = 3
    elif recipe == "mixed_4_6":
        low_bits = 4
    # these are new and mapped to the quant
    elif "ring_moe_4" in recipe:
        low_bits = 4
        high_bits = 6
    elif "deckard_3" in recipe:
        low_bits = 3
        high_bits = 6
    elif "deckard_4" in recipe:
        low_bits = 4
        high_bits = 6
    elif "deckard_5" in recipe:
        low_bits = 5
        high_bits = 8
    elif "deckard_6" in recipe:
        low_bits = 6
        high_bits = 8
    else:
        raise ValueError(f"Invalid quant recipe {recipe}")
...

    def mixed_quant_predicate(
        path: str,
        module: nn.Module,
    ) -> Union[bool, dict]:
        """Implements mixed quantization predicates with similar choices to, for example, llama.cpp's Q4_K_M.
        Ref: https://github.com/ggerganov/llama.cpp/blob/917786f43d0f29b7c77a0c56767c0fa4df68b1c5/src/llama.cpp#L5265
        By Alex Barron: https://gist.github.com/barronalex/84addb8078be21969f1690c1454855f3
        """
        index = (
            int(path.split(".")[layer_location])
            if len(path.split(".")) > layer_location
            else 0
        )
        use_more_bits = (
            index < num_layers // 8
            or index >= 7 * num_layers // 8
            or (index - num_layers // 8) % 3 == 2
        )
        is_highres = 0
        low_group_size = 64
        high_group_size = 32
        group_size = low_group_size
        if "hi" in recipe:
            is_highres = 1
            group_size = high_group_size

        if "deckard" in recipe:
            low_group_size = 64
            high_group_size = 64
            head_bits = high_bits
            layer_bits = low_bits
            is_highres = "hi" in recipe
            uses_brainstorming = "bx" in recipe or "by" in recipe or "bz" in recipe or "bh" in recipe
            is_brainstorming = uses_brainstorming and (
                ("bx" in recipe and num_layers - index < 22)
                or ("bh" in recipe and num_layers - index < 21)
                or ("bz" in recipe and num_layers - index < 37)
                or ("by" in recipe and num_layers - index < 42)
            )
            use_all_bits = is_highres and index < num_layers // 8
            #if low_bits > 4:
            #    head_bits = 8
            if is_highres:
                low_group_size = 32
                high_group_size = 32

            group_size = low_group_size

            if is_brainstorming:
                layer_bits = high_bits
                group_size = 32
                use_all_bits = 0

            # Support Qwen3.5 MoE
            is_attention = ( "linear_attn.conv1d" in path
                or "linear_attn.in_proj_a" in path
                or "linear_attn.in_proj_b" in path )

            is_mlp = ( "mlp.gate" in path
                or "mlp.shared_expert_gate" in path )
            enhance_attention = 1
            enhance_mlp = 1

            if ( use_more_bits and is_mlp
                or ( not use_more_bits
                    and ( is_attention and enhance_attention
                        or is_mlp and enhance_mlp
                        or "mtp.fc" in path ))):
                layer_bits = high_bits
                if is_highres:
                    group_size = high_group_size

            # Previous models
            if "v_proj" in path or "v_a_proj" in path or "v_b_proj" in path:
                layer_bits = high_bits
                group_size = high_group_size
                if use_all_bits:
                    layer_bits = head_bits

            # Only enhance down_proj for use_more_bits layers
            if "down_proj" in path and use_more_bits:
                layer_bits = high_bits

            if "model.embed_tokens" in path:
                layer_bits = high_bits
                if uses_brainstorming:
                    group_size = 32

            if "lm_head" in path:
                layer_bits = head_bits
                if uses_brainstorming:
                    group_size = 32

            if "layers.0" in path and "x" in recipe:
                layer_bits = high_bits
                group_size = high_group_size
                if uses_brainstorming:
                    group_size = 32

            return {"group_size": group_size, "bits": layer_bits, "mode": mode}


...

QUANT_RECIPES = [
    "mixed_2_6", "mixed_3_4", "mixed_3_6", "mixed_4_6",
    "deckard_3", "deckard_4", "deckard_5", "deckard_6",
    "deckard_3xs", "deckard_3s",
    "deckard_3_hi", "deckard_4_hi", "deckard_5_hi", "deckard_6_hi",
]
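The `use_more_bits` selection inside the predicate follows the Q4_K_M-style pattern from llama.cpp: the first eighth of layers, the last eighth, and every third layer in between get the extra bits. A minimal sketch of just that selection rule (`num_layers=48` is a hypothetical layer count, not tied to any specific model):

```python
# Sketch of the Q4_K_M-style layer selection used by use_more_bits:
# first 1/8 of layers, last 1/8, and every third layer in between
# are enhanced with higher bit widths.
def enhanced_layers(num_layers):
    return [
        i for i in range(num_layers)
        if i < num_layers // 8
        or i >= 7 * num_layers // 8
        or (i - num_layers // 8) % 3 == 2
    ]

print(enhanced_layers(48))
```

For a hypothetical 48-layer model this enhances layers 0-5, 42-47, and every third layer from 8 to 41, i.e. half the layers, which is why the qx quants stay close to the low-bit size while recovering most of the high-bit quality.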
