How to use from llama.cpp
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf VECTORVV1/Qwen3-30B-A3B:Q8_0
# Run inference directly in the terminal:
llama-cli -hf VECTORVV1/Qwen3-30B-A3B:Q8_0
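Whichever of the install routes in this section you use, llama-server exposes an OpenAI-compatible HTTP API on port 8080 by default, in addition to the web UI. A quick smoke test against an already-running server, assuming the default host and port:
# Assumes llama-server is listening on its default port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'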
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf VECTORVV1/Qwen3-30B-A3B:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf VECTORVV1/Qwen3-30B-A3B:Q8_0
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf VECTORVV1/Qwen3-30B-A3B:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf VECTORVV1/Qwen3-30B-A3B:Q8_0
Use Docker
docker model run hf.co/VECTORVV1/Qwen3-30B-A3B:Q8_0
Quick Links
GLM-4.7-Flash-Uncensored-HauhauCS-Aggressive
Join the Discord for updates, roadmaps, projects, or just to chat.
GLM-4.7 Flash uncensored by HauhauCS.
About
No changes to datasets or capabilities. Fully functional, 100% of what the original authors intended - just without the refusals.
These are meant to be the best lossless uncensored models out there.
Aggressive vs Balanced
The Aggressive variant removes more refusal behavior. Use this if the Balanced variant still refuses too much.
For agentic coding or tasks requiring higher reliability, use the Balanced variant instead.
Downloads
| File | Quant | Size |
|---|---|---|
| GLM-4.7-Flash-Uncensored-HauhauCS-Aggressive-FP16.gguf | FP16 | 56 GB |
| GLM-4.7-Flash-Uncensored-HauhauCS-Aggressive-Q8_0.gguf | Q8_0 | 30 GB |
| GLM-4.7-Flash-Uncensored-HauhauCS-Aggressive-Q6_K.gguf | Q6_K | 23 GB |
| GLM-4.7-Flash-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf | Q4_K_M | 17 GB |
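If you prefer to fetch a single quant from the table above rather than letting llama.cpp pull it via -hf, the Hugging Face CLI can download just one file. The repository id below is a placeholder (this card does not state the exact repo path), so substitute the repository that actually hosts these GGUFs:
# Placeholder repo id; replace with the repository hosting these GGUF files
huggingface-cli download <org>/GLM-4.7-Flash-Uncensored-HauhauCS-Aggressive \
  --include "*Q4_K_M*" --local-dir ./models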
Specs
- 30B-A3B MoE (31B total, ~3B active per forward pass)
- 202K context
- Based on zai-org/GLM-4.7-Flash
Recommended Settings
From the official Z.ai authors:
- General use: `--temp 1.0 --top-p 0.95`
- Tool-calling / agentic: `--temp 0.7 --top-p 1.0`
Important:
- Disable repeat penalty (or set `--repeat-penalty 1.0`)
- For llama.cpp: use `--min-p 0.01` (the default 0.05 is too high)
- Use the `--jinja` flag for llama.cpp
Note: Not recommended for Ollama due to chat template issues. Works well with llama.cpp, LM Studio, Jan.
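Putting the recommended general-use settings together, a llama.cpp server launch might look like the sketch below; the GGUF filename comes from the Downloads table above and the local path is illustrative:
# Sketch: general-use sampling settings recommended above
./llama-server -m ./GLM-4.7-Flash-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 --jinja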
Usage
Works with llama.cpp, LM Studio, Jan, koboldcpp, etc.