llama.cpp version: b7802

Quantization command:

./build/bin/llama-quantize \
  --token-embedding-type f16 \
  --output-tensor-type f16 \
  --tensor-type ".*attn.*=F16" \
  --tensor-type ".*norm.*=F32" \
  --tensor-type ".*bias=F16" \
  --tensor-type ".*shexp.*=F16" \
  GLM-4.7-Flash-F16.gguf \
  GLM-4.7-Flash-MXFP4.gguf \
  MXFP4_MOE

I changed the MoE experts (other than the shared experts) to MXFP4 and the dense layers to Q8_0, and kept everything else the same as the original for best quality: the MXFP4_MOE ftype covers those two, while the --tensor-type overrides pin the attention, norm, bias, and shared-expert tensors to F16/F32.
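
To sanity-check where the overrides landed, you can dump the per-tensor types from the output file with the gguf-dump tool from llama.cpp's gguf-py package (a rough sketch; the grep patterns assume the usual deepseek2-style tensor names, so adjust to what the dump actually prints):

pip install gguf

# routed experts should come out MXFP4, shared experts F16
gguf-dump GLM-4.7-Flash-MXFP4.gguf | grep -E "ffn_(gate|up|down)_(exps|shexp)"

# attention and norm tensors should come out F16 / F32
gguf-dump GLM-4.7-Flash-MXFP4.gguf | grep -E "attn_|_norm"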

I can only load around 45K context on a 3090 with this config, though.
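
For reference, the load looks something like this (a sketch, not my exact command; the -ctk/-ctv q8_0 flags are an optional extra that quantizes the KV cache and can stretch the context a bit further at a small quality cost):

./build/bin/llama-server \
  -m GLM-4.7-Flash-MXFP4.gguf \
  -ngl 99 \
  -c 45056 \
  -fa on -ctk q8_0 -ctv q8_0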

Format: GGUF
Model size: 30B params
Architecture: deepseek2