Qwen3-Next-REAP-48B

#3
by KnutJaegersberg - opened

Can you do this for the regular Qwen-Next model as well? And are there benchmarks showing which performs better, say a 6-bit Next-REAP-48B versus a 4-bit Next-80B?

@KnutJaegersberg
I haven't run any benchmarks apart from some basic checks. The mainstream 4-bit quantized models use q4 in the attention layers, which might cause issues when the context length gets very long.

Here is a piece of the quantization log (Q4_K_XL) for the 60B-A3B model:
```
[ 810/ 843] blk.46.attn_gate.weight - [ 2048, 4096, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 16.000 MiB
[ 811/ 843] blk.46.attn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 812/ 843] blk.46.attn_qkv.weight - [ 2048, 8192, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 32.000 MiB
[ 813/ 843] blk.46.ffn_down_exps.weight - [ 512, 2048, 384, 1], type = bf16, (manual override: q4_K -> bf16) size = 768.000 MiB
[ 814/ 843] blk.46.ffn_down_shexp.weight - [ 512, 2048, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 2.000 MiB
[ 815/ 843] blk.46.ffn_gate_exps.weight - [ 2048, 512, 384, 1], type = bf16, converting to q4_K .. size = 768.00 MiB -> 216.00 MiB
[ 816/ 843] blk.46.ffn_gate_inp.weight - [ 2048, 384, 1, 1], type = f32, size = 3.000 MiB
[ 817/ 843] blk.46.ffn_gate_inp_shexp.weight - [ 2048, 1, 1, 1], type = bf16, size = 0.004 MiB
[ 818/ 843] blk.46.ffn_gate_shexp.weight - [ 2048, 512, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 2.000 MiB
[ 819/ 843] blk.46.ffn_up_exps.weight - [ 2048, 512, 384, 1], type = bf16, converting to q4_K .. size = 768.00 MiB -> 216.00 MiB
[ 820/ 843] blk.46.ffn_up_shexp.weight - [ 2048, 512, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 2.000 MiB
[ 821/ 843] blk.46.post_attention_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 822/ 843] blk.46.ssm_a - [ 32, 1, 1, 1], type = f32, size = 0.000 MiB
[ 823/ 843] blk.46.ssm_ba.weight - [ 2048, 64, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 0.250 MiB
[ 824/ 843] blk.46.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MiB
[ 825/ 843] blk.46.ssm_dt.bias - [ 32, 1, 1, 1], type = f32, size = 0.000 MiB
[ 826/ 843] blk.46.ssm_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MiB
[ 827/ 843] blk.46.ssm_out.weight - [ 4096, 2048, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 16.000 MiB
[ 828/ 843] blk.47.attn_k.weight - [ 2048, 512, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 2.000 MiB
[ 829/ 843] blk.47.attn_k_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MiB
[ 830/ 843] blk.47.attn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 831/ 843] blk.47.attn_output.weight - [ 4096, 2048, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 16.000 MiB
[ 832/ 843] blk.47.attn_q.weight - [ 2048, 8192, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 32.000 MiB
[ 833/ 843] blk.47.attn_q_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MiB
[ 834/ 843] blk.47.attn_v.weight - [ 2048, 512, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 2.000 MiB
[ 835/ 843] blk.47.ffn_down_exps.weight - [ 512, 2048, 384, 1], type = bf16, (manual override: q4_K -> bf16) size = 768.000 MiB
[ 836/ 843] blk.47.ffn_down_shexp.weight - [ 512, 2048, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 2.000 MiB
[ 837/ 843] blk.47.ffn_gate_exps.weight - [ 2048, 512, 384, 1], type = bf16, converting to q4_K .. size = 768.00 MiB -> 216.00 MiB
[ 838/ 843] blk.47.ffn_gate_inp.weight - [ 2048, 384, 1, 1], type = f32, size = 3.000 MiB
[ 839/ 843] blk.47.ffn_gate_inp_shexp.weight - [ 2048, 1, 1, 1], type = bf16, size = 0.004 MiB
[ 840/ 843] blk.47.ffn_gate_shexp.weight - [ 2048, 512, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 2.000 MiB
[ 841/ 843] blk.47.ffn_up_exps.weight - [ 2048, 512, 384, 1], type = bf16, converting to q4_K .. size = 768.00 MiB -> 216.00 MiB
[ 842/ 843] blk.47.ffn_up_shexp.weight - [ 2048, 512, 1, 1], type = bf16, (manual override: q4_K -> bf16) size = 2.000 MiB
[ 843/ 843] blk.47.post_attention_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
```

The lines with "manual override: q4_K -> bf16" keep their original size. As you can see, quantizing the hybrid attention to q4 wouldn't save much memory: if I remember correctly, around 2-4 GB of VRAM in total. And attention really matters once the context length gets long enough, so personally I prefer to keep it in the original precision. All the XL variants in this repo keep the embedding, attention, shared-expert, and output layers in the original precision to "buy" some safeguard.
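To put rough numbers on this, here is a back-of-envelope sketch using only the sizes from the log above. The 768 -> 216 MiB expert conversion implies q4_K packs about 4.5 effective bits per weight versus 16 for bf16; applying that same ratio to the attention tensors of a full-attention block (blk.47) shows how little they contribute (the per-block tensor selection is taken from the log, not a general rule):

```python
# Back-of-envelope memory estimate based on the quantization log above.
# bf16 is 16 bits/weight; q4_K is ~4.5 effective bits/weight, inferred from
# the log's 768.00 MiB -> 216.00 MiB expert-tensor conversion (216/768 * 16 = 4.5).
BF16_BITS = 16.0
Q4K_BITS = 4.5

def q4k_size(bf16_mib):
    """Approximate size in MiB after q4_K quantization, given the bf16 size."""
    return bf16_mib * Q4K_BITS / BF16_BITS

# bf16 attention tensors in one full-attention block (blk.47):
# attn_k (2) + attn_output (16) + attn_q (32) + attn_v (2)
attn_bf16 = 2.0 + 16.0 + 32.0 + 2.0
# bf16 routed-expert tensors in the same block:
# ffn_down_exps + ffn_gate_exps + ffn_up_exps, 768 MiB each
experts_bf16 = 3 * 768.0

saved_attn = attn_bf16 - q4k_size(attn_bf16)
saved_experts = experts_bf16 - q4k_size(experts_bf16)
print(f"per-block saving from quantizing attention: {saved_attn:.1f} MiB")
print(f"per-block saving from quantizing experts:   {saved_experts:.1f} MiB")
```

Quantizing attention saves only tens of MiB per block versus well over a GiB for the experts, so across all blocks the attention override costs a few GB at most, which matches the 2-4 GB figure above.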
