How is this different from the other quants?
I really want to get into testing the newest models with vLLM-MLX or MLX-OpenAI-Server and play with agents, but what is qx64 vs "regular" Q4?

https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-4bit
https://huggingface.co/McG-221/Qwen3.5-35B-A3B-mxfp8
In the qx quants I enhance some extra layers, in addition to the differential quantization already available.

Reduced to its essentials, this is how Deckard(qx) operates. This is the formula I apply to Qwens; other architectures need different tuning. This one is ported directly from the Qwen3 line, so after a bit more research I might refine it for the 3.5, since a lot of expert layers could be enhanced and are generally small:
```python
if "v_proj" in path or "v_a_proj" in path or "v_b_proj" in path:
    layer_bits = high_bits
    group_size = high_group_size
    if use_all_bits:
        layer_bits = head_bits
# Only enhance down_proj for use_more_bits layers
if "down_proj" in path and use_more_bits:
    layer_bits = high_bits
if "model.embed_tokens" in path:
    layer_bits = high_bits
    if uses_brainstorming:
        group_size = 32
if "lm_head" in path:
    layer_bits = head_bits
    if uses_brainstorming:
        group_size = 32
```
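To make the layer-selection rules concrete, here is a minimal, self-contained sketch of how those path checks map layer names to bit widths. The default values (4/6/8 bits) and the example paths are illustrative assumptions, not the exact shipped configuration; the real predicate closes over recipe state:

```python
# Simplified sketch of the qx layer-selection rules above.
# Bit values are illustrative; the real predicate derives them from the recipe.
def pick_bits(path, low_bits=4, high_bits=6, head_bits=8,
              use_more_bits=False, use_all_bits=False):
    bits = low_bits
    # Attention value projections are always promoted
    if any(k in path for k in ("v_proj", "v_a_proj", "v_b_proj")):
        bits = head_bits if use_all_bits else high_bits
    # down_proj is promoted only on the "important" layers
    if "down_proj" in path and use_more_bits:
        bits = high_bits
    # Embeddings get high_bits, the output head gets head_bits
    if "model.embed_tokens" in path:
        bits = high_bits
    if "lm_head" in path:
        bits = head_bits
    return bits

print(pick_bits("model.layers.3.self_attn.v_proj"))               # 6
print(pick_bits("model.layers.3.mlp.down_proj"))                  # 4
print(pick_bits("model.layers.0.mlp.down_proj", use_more_bits=True))  # 6
print(pick_bits("lm_head"))                                       # 8
```

Everything else in the model stays at `low_bits`, which is where the size savings come from.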
How it is different: some metrics
Usually the highest arc scores, and generally the highest performance for its size, outperforming q8 (mxfp8 is a fair approximation of q8).
https://huggingface.co/nightmedia/Qwen3-30B-A3B-Element16-qx86-hi-mlx

Qwen3-30B-A3B-Element16

| quant | benchmark scores |
|---|---|
| mxfp8 | 0.561, 0.705, 0.885, 0.739, 0.452, 0.794, 0.702 |
| qx86-hi | 0.562, 0.751, 0.882, 0.752, 0.468, 0.807, 0.695 |
| qx64-hi | 0.574, 0.753, 0.878, 0.748, 0.464, 0.805, 0.688 |
| mxfp4 | 0.544, 0.699, 0.875, 0.741, 0.438, 0.800, 0.671 |
https://huggingface.co/nightmedia/Qwen3-30B-A3B-Element16b-qx86-hi-mlx

Qwen3-30B-A3B-Element16b

| quant | benchmark scores |
|---|---|
| mxfp8 | 0.561, 0.710, 0.880, 0.739, 0.452, 0.797, 0.696 |
| qx86-hi | 0.569, 0.742, 0.882, 0.751, 0.466, 0.803, 0.699 |
| qx64-hi | 0.574, 0.759, 0.872, 0.750, 0.468, 0.799, 0.708 |
| mxfp4 | 0.538, 0.701, 0.879, 0.741, 0.442, 0.803, 0.671 |
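As a quick sanity check, averaging the seven scores in each row (copied verbatim from the tables; the individual metric names are not given in the original) supports the claim that the qx quants edge out mxfp8:

```python
# Mean of the seven benchmark scores per quant, for both model variants.
rows = {
    "Element16": {
        "mxfp8":   [0.561, 0.705, 0.885, 0.739, 0.452, 0.794, 0.702],
        "qx86-hi": [0.562, 0.751, 0.882, 0.752, 0.468, 0.807, 0.695],
        "qx64-hi": [0.574, 0.753, 0.878, 0.748, 0.464, 0.805, 0.688],
        "mxfp4":   [0.544, 0.699, 0.875, 0.741, 0.438, 0.800, 0.671],
    },
    "Element16b": {
        "mxfp8":   [0.561, 0.710, 0.880, 0.739, 0.452, 0.797, 0.696],
        "qx86-hi": [0.569, 0.742, 0.882, 0.751, 0.466, 0.803, 0.699],
        "qx64-hi": [0.574, 0.759, 0.872, 0.750, 0.468, 0.799, 0.708],
        "mxfp4":   [0.538, 0.701, 0.879, 0.741, 0.442, 0.803, 0.671],
    },
}

for model, quants in rows.items():
    for name, scores in quants.items():
        print(f"{model} {name}: {sum(scores) / len(scores):.4f}")
```

In both tables the qx86-hi and qx64-hi averages land above mxfp8, and mxfp4 trails all three.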
I will have an updated version of this quant soon that should improve performance a bit.

The quants were updated in place, and the updated formula is more stable now.
It brings out some extra intelligence, so Nightmedia's models are better. People who ask AI for relationship advice probably won't notice the difference.

But for science, analysis, new problems... it does help.
Not with this model, no :)
Definitely no new model can do RP in its standard configuration, since the bullet-point output on everything sort of kills the buzz.
We will see the old Qwens around for a while now. Even Coder-Next has better "vibes", in fact a bit too much, and I was waiting for someone to do a training on it so I can do a merge; for that, mergekit would need to support the Next architecture, which is a different thing altogether.
I noticed a vast improvement in social skills with pretty much all merged models. It's not just the 100+ points in arc that we usually get; it's the availability of alternate chains of thought from the models in the merge. This increases the perceived entropy from stock to a level where some models claimed they "broke the fourth wall" with it. To each his own, whatever that wall means.
Here is the current Deckard(qx) that I use for Qwens and Gemmas. The formulas for Next, GLM, Ring/Ling, etc. are different, enhancing the platform-specific layers that need it. As you see, it's a work in progress, and nothing proprietary about it :)
```python
from typing import Callable, Union

import mlx.nn as nn


def mixed_quant_predicate_builder(
    recipe: str, model: nn.Module, group_size: int = 64
) -> Callable[[str, nn.Module, dict], Union[bool, dict]]:
    mode = "affine"
    high_bits = 6
    very_high_bits = 8
    low_group_size = group_size
    high_group_size = group_size
    is_highres = 0
    is_test = 0
    if "hi" in recipe:
        low_group_size = 32
        high_group_size = 32
        is_highres = 1
    # these are already there
    if recipe == "mixed_2_6":
        low_bits = 2
    elif recipe == "mixed_3_4":
        low_bits = 3
        high_bits = 4
    elif recipe == "mixed_3_6":
        low_bits = 3
    elif recipe == "mixed_4_6":
        low_bits = 4
    # these are new and mapped to the quant
    elif "ring_moe_4" in recipe:
        low_bits = 4
        high_bits = 6
    elif "deckard_3" in recipe:
        low_bits = 3
        high_bits = 6
    elif "deckard_4" in recipe:
        low_bits = 4
        high_bits = 6
    elif "deckard_5" in recipe:
        low_bits = 5
        high_bits = 8
    elif "deckard_6" in recipe:
        low_bits = 6
        high_bits = 8
    else:
        raise ValueError(f"Invalid quant recipe {recipe}")

    ...

    def mixed_quant_predicate(
        path: str,
        module: nn.Module,
    ) -> Union[bool, dict]:
        """Implements mixed quantization predicates with similar choices to, for example, llama.cpp's Q4_K_M.
        Ref: https://github.com/ggerganov/llama.cpp/blob/917786f43d0f29b7c77a0c56767c0fa4df68b1c5/src/llama.cpp#L5265
        By Alex Barron: https://gist.github.com/barronalex/84addb8078be21969f1690c1454855f3
        """
        # layer_location and num_layers come from the elided part of the builder
        index = (
            int(path.split(".")[layer_location])
            if len(path.split(".")) > layer_location
            else 0
        )
        # Q4_K_M-style selection: first/last eighth plus every third layer between
        use_more_bits = (
            index < num_layers // 8
            or index >= 7 * num_layers // 8
            or (index - num_layers // 8) % 3 == 2
        )
        is_highres = 0
        low_group_size = 64
        high_group_size = 32
        group_size = low_group_size
        if "hi" in recipe:
            is_highres = 1
            group_size = high_group_size
        if "deckard" in recipe:
            low_group_size = 64
            high_group_size = 64
        head_bits = high_bits
        layer_bits = low_bits
        is_highres = "hi" in recipe
        uses_brainstorming = (
            "bx" in recipe or "by" in recipe or "bz" in recipe or "bh" in recipe
        )
        is_brainstorming = uses_brainstorming and (
            "bx" in recipe and num_layers - index < 22
            or "bh" in recipe and num_layers - index < 21
            or "bz" in recipe and num_layers - index < 37
            or "by" in recipe and num_layers - index < 42
        )
        use_all_bits = is_highres and index < num_layers // 8
        # if low_bits > 4:
        #     head_bits = 8
        if is_highres:
            low_group_size = 32
            high_group_size = 32
            group_size = low_group_size
        if is_brainstorming:
            layer_bits = high_bits
            group_size = 32
            use_all_bits = 0
        # Support Qwen3.5 MoE
        is_attention = (
            "linear_attn.conv1d" in path
            or "linear_attn.in_proj_a" in path
            or "linear_attn.in_proj_b" in path
        )
        is_mlp = "mlp.gate" in path or "mlp.shared_expert_gate" in path
        enhance_attention = 1
        enhance_mlp = 1
        if (
            use_more_bits and is_mlp
            or (
                not use_more_bits
                and (
                    is_attention and enhance_attention
                    or is_mlp and enhance_mlp
                    or "mtp.fc" in path
                )
            )
        ):
            layer_bits = high_bits
            if is_highres:
                group_size = high_group_size
        # Previous models
        if "v_proj" in path or "v_a_proj" in path or "v_b_proj" in path:
            layer_bits = high_bits
            group_size = high_group_size
            if use_all_bits:
                layer_bits = head_bits
        # Only enhance down_proj for use_more_bits layers
        if "down_proj" in path and use_more_bits:
            layer_bits = high_bits
        if "model.embed_tokens" in path:
            layer_bits = high_bits
            if uses_brainstorming:
                group_size = 32
        if "lm_head" in path:
            layer_bits = head_bits
            if uses_brainstorming:
                group_size = 32
        if "layers.0" in path and "x" in recipe:
            layer_bits = high_bits
            group_size = high_group_size
            if uses_brainstorming:
                group_size = 32
        return {"group_size": group_size, "bits": layer_bits, "mode": mode}

    ...


QUANT_RECIPES = [
    "mixed_2_6", "mixed_3_4", "mixed_3_6", "mixed_4_6",
    "deckard_3", "deckard_4", "deckard_5", "deckard_6",
    "deckard_3xs", "deckard_3s",
    "deckard_3_hi", "deckard_4_hi", "deckard_5_hi", "deckard_6_hi",
]
```
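The `use_more_bits` expression in the predicate follows the llama.cpp Q4_K_M pattern: the first eighth of the layers, the last eighth, and every third layer in between get the higher bit width. A standalone sketch for a hypothetical 48-layer model (the layer count is an arbitrary example, not tied to any specific model):

```python
def use_more_bits(index, num_layers):
    # First eighth, last eighth, and every third layer in the middle band
    return (
        index < num_layers // 8
        or index >= 7 * num_layers // 8
        or (index - num_layers // 8) % 3 == 2
    )

num_layers = 48  # hypothetical example
boosted = [i for i in range(num_layers) if use_more_bits(i, num_layers)]
print(boosted)
```

For 48 layers this boosts layers 0-5 and 42-47 wholesale, plus 8, 11, 14, ... 41 in the middle, so roughly half the layers carry the higher bit width.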