# GLM-4.7-Flash Q8_0 GGUF
Q8_0 quantization of zai-org/GLM-4.7-Flash for use with llama.cpp.
## Model Details
| Property | Value |
|---|---|
| Base model | zai-org/GLM-4.7-Flash |
| Architecture | 30B-A3B MoE (DeepSeek v2) |
| Quantization | Q8_0 |
| Size | ~30 GB |
| Context length | 128K tokens |
## Hardware Requirements
- Minimum VRAM: 32 GB (single GPU)
- Recommended: 56 GB (dual GPU, e.g., RTX 5090 + RTX 4090)
## Usage

### Basic usage with llama.cpp

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf -ngl 99 -c 65536
```
### Full 128K context on dual GPU

```bash
llama-server -m GLM-4.7-Flash-Q8_0.gguf \
  -ngl 99 \
  -c 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --split-mode layer \
  --tensor-split 32,24 \
  --host 0.0.0.0 \
  --port 8080
```
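The `--cache-type-k`/`--cache-type-v q8_0` flags roughly halve KV-cache memory, which is what lets the full 131072-token context fit alongside the weights. A minimal sizing sketch below illustrates the effect; the layer count and KV-head dimensions are illustrative placeholders, not GLM-4.7-Flash's actual architecture values.

```python
# Rough KV-cache sizing: f16 vs q8_0 cache types at full 128K context.
# n_layers / n_kv_heads / head_dim below are placeholder values for
# illustration only, NOT the real GLM-4.7-Flash architecture.

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold n_ctx * n_kv_heads * head_dim elements per layer
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem

n_ctx, n_layers, n_kv_heads, head_dim = 131072, 48, 4, 128  # placeholders

f16_gib = kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, 2.0) / 2**30
q8_gib = kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, 1.0625) / 2**30  # q8_0 ~8.5 bits/elem

print(f"f16 KV cache: {f16_gib:.1f} GiB, q8_0 KV cache: {q8_gib:.1f} GiB")
```

With these placeholder dimensions the q8_0 cache saves several GiB of VRAM, headroom that goes to model layers instead.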
### OpenAI-compatible API

After starting the server, the API is available at:

- http://localhost:8080/v1/chat/completions
- http://localhost:8080/v1/completions
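A minimal stdlib-only client sketch for the chat endpoint. It assumes the server from the commands above is running on `localhost:8080`; the `model` field is part of the OpenAI request schema, but llama-server serves whichever model it was launched with.

```python
# Sketch of a chat-completion call to llama-server's OpenAI-compatible API.
import json
import urllib.request

def build_payload(messages, temperature=0.7):
    # Standard OpenAI-style chat completion request body
    return {
        "model": "GLM-4.7-Flash-Q8_0",  # informational; llama-server ignores it
        "messages": messages,
        "temperature": temperature,
    }

def chat(messages, base_url="http://localhost:8080"):
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example usage (requires a running server):
# print(chat([{"role": "user", "content": "Hello!"}]))
```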
## Quantization Notes

This model was quantized from the F16 version of ngxson/GLM-4.7-Flash-GGUF. The missing `deepseek2.rope.scaling.yarn_log_multiplier` metadata key was added to enable quantization with llama.cpp.
## Quality Comparison
| Quantization | Size | Perplexity Impact |
|---|---|---|
| F16 | 56 GB | Baseline |
| Q8_0 | 30 GB | ~0.1% |
| Q4_K_M | 18 GB | ~2-4% |
Q8_0 provides near-lossless quality while reducing model size by ~47%.
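The ~47% figure follows from Q8_0's block layout in llama.cpp: weights are stored in blocks of 32 int8 values sharing one f16 scale, which works out to 8.5 bits per weight versus 16 for F16. A quick check:

```python
# Q8_0 block layout: 32 int8 quants + one f16 scale per block.
block_weights = 32
block_bits = 32 * 8 + 16          # 256 bits of quants + 16-bit scale
bpw = block_bits / block_weights  # bits per weight
reduction = 1 - bpw / 16          # size reduction vs F16

print(bpw, round(reduction * 100, 1))  # 8.5 bits/weight, ~46.9% smaller
```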