llama.cpp version: b7802
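The quantization step below assumes an F16 GGUF already exists. A minimal sketch of how it could be produced from the original Hugging Face checkpoint with llama.cpp's converter (the local checkpoint directory path is illustrative, and support for this architecture in the converter is assumed):

```sh
# Convert the original safetensors checkpoint to a single F16 GGUF
python convert_hf_to_gguf.py ./GLM-4.7-Flash \
  --outtype f16 \
  --outfile GLM-4.7-Flash-F16.gguf
```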
Quantization command:

```sh
./build/bin/llama-quantize \
  --token-embedding-type f16 \
  --output-tensor-type f16 \
  --tensor-type ".*attn.*=F16" \
  --tensor-type ".*norm.*=F32" \
  --tensor-type ".*bias=F16" \
  --tensor-type ".*shexp.*=F16" \
  GLM-4.7-Flash-F16.gguf \
  GLM-4.7-Flash-MXFP4.gguf \
  MXFP4_MOE
```
I changed the MoE experts (other than the shared expert) to MXFP4 and the dense FFN layers to Q8, and kept everything else the same as the original for best quality.
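To double-check which tensors ended up at which type, the per-tensor quantization in the output file can be listed with the gguf-dump script from llama.cpp's gguf-py package (a sketch, assuming `pip install gguf`; the grep pattern is only illustrative):

```sh
# Print every tensor with its shape and quantization type,
# then spot-check the routed-expert vs. shared-expert FFN tensors
gguf-dump GLM-4.7-Flash-MXFP4.gguf | grep -E "ffn_.*(exps|shexp)"
```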
With this config I can only fit around 45K tokens of context on a 3090, though.
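A sketch of how the quant could be served within the 24 GB of a 3090; the context size and offload values are illustrative, not the exact command used:

```sh
# ~45K context is roughly what fits alongside the weights on a 24 GB card
./build/bin/llama-server \
  -m GLM-4.7-Flash-MXFP4.gguf \
  -c 45056 \
  -ngl 99
```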
Base model: zai-org/GLM-4.7-Flash