GLM-4.7-Flash Opus Reasoning (GGUF Q4_K_M)

Production-ready quantized model - 16.9 GB (69.7% smaller than the 55.8 GB original)

Model Description

This is a GGUF Q4_K_M quantized version of the fine-tuned GLM-4.7-Flash model, optimized for fast inference with llama.cpp.

Quantization Details

  • Format: GGUF (GPT-Generated Unified Format)
  • Quantization: Q4_K_M (4-bit k-quant, medium variant)
  • Model Size: 16.9 GB (from 55.8 GB)
  • Compression: 69.7% size reduction
  • Precision: 4.84 BPW (bits per weight)
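As a sanity check on the figures above (reading the sizes as binary GiB and using the ~30B parameter count reported for this model), the compression ratio and bits-per-weight can be recomputed:

```python
# Recompute the quantization stats from the sizes listed above.
# Assumptions: "16.9 GB"/"55.8 GB" are GiB-style binary sizes, and the
# parameter count is the ~30B reported on the model page.

GIB = 2**30

quantized_gb = 16.9
original_gb = 55.8
params = 30e9

compression = 1 - quantized_gb / original_gb   # fraction of size saved
bpw = quantized_gb * GIB * 8 / params          # bits per weight

print(f"compression: {compression:.1%}")   # 69.7%
print(f"bits/weight: {bpw:.2f}")           # 4.84
```

Both results match the stated 69.7% reduction and 4.84 BPW.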

Usage with llama.cpp

Command Line

# Download llama.cpp and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run inference
./build/bin/llama-cli \
  -m glm-flash-2500-Q4_KM.gguf \
  -p "Write a Python function to merge two sorted lists" \
  -n 256 \
  -t 8
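llama.cpp also ships llama-server, which exposes an OpenAI-compatible HTTP API; a sketch of serving the same file (the port and context size below are arbitrary choices, not recommendations from this card):

```shell
# Serve the model over an OpenAI-compatible HTTP API.
# Port and context size are arbitrary example values.
./build/bin/llama-server \
  -m glm-flash-2500-Q4_KM.gguf \
  -c 8192 \
  --port 8080 &

# Query it once the server is up.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Merge two sorted lists in Python."}]}'
```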

Python Binding

from llama_cpp import Llama

model = Llama(
    model_path="glm-flash-2500-Q4_KM.gguf",
    n_ctx=8192,
    n_threads=8,
)

output = model(
    "Write a function to merge two sorted lists:",
    max_tokens=256,
    stop=["\n\n"],
    echo=True,
)
print(output["choices"][0]["text"])

Interactive Mode

./build/bin/llama-cli \
  -m glm-flash-2500-Q4_KM.gguf \
  -cnv \
  -i \
  -t 8

Performance Metrics

  • Prompt Speed: ~65 tokens/second
  • Generation Speed: ~22 tokens/second
  • Memory Usage: roughly the 16.9 GB of weights plus KV-cache overhead that grows with context length
  • Latency: low enough for interactive CPU inference
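The two throughput numbers translate directly into end-to-end latency; a small helper (the function name and example token counts are illustrative, not part of the card):

```python
def estimate_latency(prompt_tokens: int, gen_tokens: int,
                     pp_speed: float = 65.0, tg_speed: float = 22.0) -> float:
    """Rough end-to-end time in seconds: prompt processing + generation,
    using the ~65 tok/s prompt and ~22 tok/s generation figures above."""
    return prompt_tokens / pp_speed + gen_tokens / tg_speed

# e.g. a 200-token prompt generating 256 tokens:
t = estimate_latency(200, 256)
print(f"{t:.1f} s")  # 14.7 s
```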

Model Capabilities

This model excels at:

  • Tool-use: Agent workflows and function calling
  • Reasoning: Mathematical and logical problems
  • Coding: Python, debugging, code explanation
  • Problem Solving: Multi-step reasoning

Hardware Requirements

Minimum:

  • RAM: 20 GB
  • CPU: Modern multi-core processor
  • Storage: 20 GB free space

Recommended:

  • RAM: 32 GB
  • CPU: 8+ cores
  • GPU: Not required (CPU inference)
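The RAM figures above come from the 16.9 GB of weights plus the KV cache, which grows linearly with context length. A sketch of the standard estimate (the layer count, KV-head count, and head dimension below are placeholders for illustration, not this model's actual configuration):

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: int = 2) -> int:
    """K and V tensors per layer, per position, FP16 (2 bytes) by default."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt

# Placeholder dimensions for illustration only:
cache = kv_cache_bytes(n_ctx=8192, n_layers=40, n_kv_heads=8, head_dim=128)
print(f"KV cache at 8K context: {cache / 2**30:.2f} GiB")  # 1.25 GiB
```

Whatever the true dimensions, total RAM is approximately weights + KV cache + runtime overhead, which is why 20 GB is the floor and 32 GB is comfortable.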

Base Model

This GGUF model was created from: https://huggingface.co/austindixson/glm-4.7-flash-Opus-Reasoning

Which was fine-tuned from: https://huggingface.co/unsloth/GLM-4.7-Flash

License

Apache 2.0


Quantized and optimized for production use

Model Details

  • Parameters: 30B
  • Architecture: deepseek2