llama.cpp version: b7802

Quantization command:

./build/bin/llama-quantize \
  --token-embedding-type f16 \
  --output-tensor-type f16 \
  --tensor-type ".*attn.*=F16" \
  --tensor-type ".*norm.*=F32" \
  --tensor-type ".*bias=F16" \
  --tensor-type ".*shexp.*=F16" \
  GLM-4.7-Flash-F16.gguf \
  GLM-4.7-Flash-MXFP4.gguf \
  MXFP4_MOE

I changed the MoE experts (other than the shared experts) to MXFP4 and the dense layers to Q8_0, and kept everything else the same as the original for best quality: the MXFP4_MOE ftype covers those two, while the --tensor-type overrides pin the attention, norm, bias, and shared-expert tensors to F16/F32.
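
To sanity-check where the overrides landed, you can dump the per-tensor types from the output file with the gguf-dump tool from llama.cpp's gguf-py package (a rough sketch; the grep patterns assume the usual deepseek2-style tensor names, so adjust to what the dump actually prints):

pip install gguf

# routed experts should come out MXFP4, shared experts F16
gguf-dump GLM-4.7-Flash-MXFP4.gguf | grep -E "ffn_(gate|up|down)_(exps|shexp)"

# attention and norm tensors should come out F16 / F32
gguf-dump GLM-4.7-Flash-MXFP4.gguf | grep -E "attn_|_norm"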

I can only load around 45K context on a 3090 with this config, though.
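
For reference, the load looks something like this (a sketch, not my exact command; the -ctk/-ctv q8_0 flags are an optional extra that quantizes the KV cache and can stretch the context a bit further at a small quality cost):

./build/bin/llama-server \
  -m GLM-4.7-Flash-MXFP4.gguf \
  -ngl 99 \
  -c 45056 \
  -fa on -ctk q8_0 -ctv q8_0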

Format: GGUF
Model size: 30B params
Architecture: deepseek2