GLM-4.7-Flash Opus Reasoning (GGUF Q4_K_M)

Production-ready quantized model - 16.9 GB (69.7% smaller than the 55.8 GB original)

Model Description

This is a GGUF Q4_K_M quantized version of the fine-tuned GLM-4.7-Flash model, optimized for fast inference with llama.cpp.

Quantization Details

  • Format: GGUF (GPT-Generated Unified Format)
  • Quantization: Q4_K_M (4-bit k-quant, medium variant)
  • Model Size: 16.9 GB (from 55.8 GB)
  • Compression: 69.7% size reduction
  • Precision: 4.84 BPW (bits per weight)
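As a sanity check on the figures above (reading the sizes as binary GiB and using the ~30B parameter count reported for this model), the compression ratio and bits-per-weight can be recomputed:

```python
# Recompute the quantization stats from the sizes listed above.
# Assumptions: "16.9 GB"/"55.8 GB" are GiB-style binary sizes, and the
# parameter count is the ~30B reported on the model page.

GIB = 2**30

quantized_gb = 16.9
original_gb = 55.8
params = 30e9

compression = 1 - quantized_gb / original_gb   # fraction of size saved
bpw = quantized_gb * GIB * 8 / params          # bits per weight

print(f"compression: {compression:.1%}")   # 69.7%
print(f"bits/weight: {bpw:.2f}")           # 4.84
```

Both results match the stated 69.7% reduction and 4.84 BPW.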

Usage with llama.cpp

Command Line

# Download llama.cpp and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run inference
./build/bin/llama-cli \
  -m glm-flash-2500-Q4_KM.gguf \
  -p "Write a Python function to merge two sorted lists" \
  -n 256 \
  -t 8
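llama.cpp also ships llama-server, which exposes an OpenAI-compatible HTTP API; a sketch of serving the same file (the port and context size below are arbitrary choices, not recommendations from this card):

```shell
# Serve the model over an OpenAI-compatible HTTP API.
# Port and context size are arbitrary example values.
./build/bin/llama-server \
  -m glm-flash-2500-Q4_KM.gguf \
  -c 8192 \
  --port 8080 &

# Query it once the server is up.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Merge two sorted lists in Python."}]}'
```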

Python Binding

from llama_cpp import Llama

model = Llama(
    model_path="glm-flash-2500-Q4_KM.gguf",
    n_ctx=8192,
    n_threads=8,
)

output = model(
    "Write a function to merge two sorted lists:",
    max_tokens=256,
    stop=["\n\n"],
    echo=True,
)
print(output["choices"][0]["text"])

Interactive Mode

./build/bin/llama-cli \
  -m glm-flash-2500-Q4_KM.gguf \
  -cnv \
  -i \
  -t 8

Performance Metrics

  • Prompt Speed: ~65 tokens/second
  • Generation Speed: ~22 tokens/second
  • Memory Usage: roughly the 16.9 GB of weights plus KV-cache overhead that grows with context length
  • Latency: low enough for interactive CPU inference
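The two throughput numbers translate directly into end-to-end latency; a small helper (the function name and example token counts are illustrative, not part of the card):

```python
def estimate_latency(prompt_tokens: int, gen_tokens: int,
                     pp_speed: float = 65.0, tg_speed: float = 22.0) -> float:
    """Rough end-to-end time in seconds: prompt processing + generation,
    using the ~65 tok/s prompt and ~22 tok/s generation figures above."""
    return prompt_tokens / pp_speed + gen_tokens / tg_speed

# e.g. a 200-token prompt generating 256 tokens:
t = estimate_latency(200, 256)
print(f"{t:.1f} s")  # 14.7 s
```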

Model Capabilities

This model excels at:

  • Tool-use: Agent workflows and function calling
  • Reasoning: Mathematical and logical problems
  • Coding: Python, debugging, code explanation
  • Problem Solving: Multi-step reasoning

Hardware Requirements

Minimum:

  • RAM: 20 GB
  • CPU: Modern multi-core processor
  • Storage: 20 GB free space

Recommended:

  • RAM: 32 GB
  • CPU: 8+ cores
  • GPU: Not required (CPU inference)
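The RAM figures above come from the 16.9 GB of weights plus the KV cache, which grows linearly with context length. A sketch of the standard estimate (the layer count, KV-head count, and head dimension below are placeholders for illustration, not this model's actual configuration):

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: int = 2) -> int:
    """K and V tensors per layer, per position, FP16 (2 bytes) by default."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt

# Placeholder dimensions for illustration only:
cache = kv_cache_bytes(n_ctx=8192, n_layers=40, n_kv_heads=8, head_dim=128)
print(f"KV cache at 8K context: {cache / 2**30:.2f} GiB")  # 1.25 GiB
```

Whatever the true dimensions, total RAM is approximately weights + KV cache + runtime overhead, which is why 20 GB is the floor and 32 GB is comfortable.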

Base Model

This GGUF model was created from: https://huggingface.co/austindixson/glm-4.7-flash-Opus-Reasoning

Which was fine-tuned from: https://huggingface.co/unsloth/GLM-4.7-Flash

License

Apache 2.0


Quantized and optimized for production use

Model Details

  • Parameters: 30B
  • Architecture: deepseek2