GLM-4.7-Flash Opus Reasoning (GGUF Q4_K_M)
Production-ready quantized model: 16.9 GB, a 69.7% size reduction from the 55.8 GB original
Model Description
This is a GGUF Q4_K_M quantized version of the fine-tuned GLM-4.7-Flash model, optimized for fast inference with llama.cpp.
Quantization Details
- Format: GGUF (GPT-Generated Unified Format)
- Quantization: Q4_K_M (4-bit k-quant, medium variant)
- Model Size: 16.9 GB (from 55.8 GB)
- Compression: 69.7% size reduction
- Precision: 4.84 BPW (bits per weight)
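As a sanity check, the size and compression figures above are mutually consistent, and the bits-per-weight figure implies an approximate parameter count. A quick calculation (treating GB as decimal gigabytes, and ignoring the small non-weight overhead in a GGUF file, so the result is approximate):

```python
# Sanity-check the quantization numbers above (assumes decimal GB).
quant_gb = 16.9   # quantized model size
orig_gb = 55.8    # original model size
bpw = 4.84        # bits per weight

reduction = 1 - quant_gb / orig_gb
print(f"size reduction: {reduction:.1%}")   # ~69.7%

# Implied weight count: total bits divided by bits per weight.
params_b = quant_gb * 1e9 * 8 / bpw / 1e9
print(f"implied parameters: ~{params_b:.0f}B")
```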
Usage with llama.cpp
Command Line
# Download llama.cpp and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# Run inference
./build/bin/llama-cli -m glm-flash-2500-Q4_KM.gguf -p "Write a Python function to merge two sorted lists" -n 256 -t 8
Python Binding
from llama_cpp import Llama

model = Llama(
    model_path="glm-flash-2500-Q4_KM.gguf",
    n_ctx=8192,    # context window in tokens
    n_threads=8,   # CPU threads for inference
)

output = model(
    "Write a function to merge two sorted lists:",
    max_tokens=256,
    stop=["\n"],   # stop generation at a newline
    echo=True,     # include the prompt in the output
)
print(output["choices"][0]["text"])
Interactive Mode
./build/bin/llama-cli -m glm-flash-2500-Q4_KM.gguf -cnv -i -t 8
Performance Metrics
- Prompt Speed: ~65 tokens/second
- Generation Speed: ~22 tokens/second
- Memory Usage: approximately the model size (16.9 GB) plus KV-cache and runtime buffers
- Note: throughput varies with CPU, thread count, and context length
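Given the throughput figures above, end-to-end latency for a request can be estimated as prompt_tokens / prompt_speed + output_tokens / generation_speed. A minimal sketch (the token counts in the example are hypothetical):

```python
# Rough request-latency model built from the measured throughput above.
PROMPT_TPS = 65.0   # prompt processing, tokens/second
GEN_TPS = 22.0      # generation, tokens/second

def estimate_latency(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds to process the prompt and generate the output."""
    return prompt_tokens / PROMPT_TPS + output_tokens / GEN_TPS

# Hypothetical request: 500-token prompt, 256 generated tokens.
print(f"{estimate_latency(500, 256):.1f} s")  # ~19.3 s
```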
Model Capabilities
This model excels at:
- Tool-use: Agent workflows and function calling
- Reasoning: Mathematical and logical problems
- Coding: Python, debugging, code explanation
- Problem Solving: Multi-step reasoning
Hardware Requirements
Minimum:
- RAM: 20 GB
- CPU: Modern multi-core processor
- Storage: 20 GB free space
Recommended:
- RAM: 32 GB
- CPU: 8+ cores
- GPU: Not required (CPU inference)
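The 20 GB RAM floor follows from holding the 16.9 GB of weights in memory plus context buffers. A small helper to check whether a machine clears that bar (the 3 GB overhead allowance is an assumption, not a measured figure):

```python
# Check whether available RAM covers the model plus runtime overhead.
MODEL_GB = 16.9
OVERHEAD_GB = 3.0  # assumed allowance for KV cache and buffers

def fits_in_ram(available_gb: float) -> bool:
    """True if the model plus assumed overhead fits in available RAM."""
    return available_gb >= MODEL_GB + OVERHEAD_GB

print(fits_in_ram(32.0))  # True  (recommended tier)
print(fits_in_ram(16.0))  # False (below the stated minimum)
```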
Base Model
This GGUF model was quantized from https://huggingface.co/austindixson/glm-4.7-flash-Opus-Reasoning, which was fine-tuned from https://huggingface.co/unsloth/GLM-4.7-Flash.
License
Apache 2.0
Quantized and optimized for production use