GLM-5.1 TQ3 (3-bit weight compression)

Native TQ3 checkpoint of zai-org/GLM-5 (769B MoE, 40B active).

Compression

                     BF16        TQ3
  Checkpoint size    ~1,510 GB   309 GB
  Compression ratio  1x          4.9x

Created using turboquant-plus-vllm streaming checkpoint creation on a $0.11/hr CPU instance. Total cost: $0.84.
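As a sanity check on those numbers, a back-of-envelope estimate (assuming FP16 per-group norms and the 128-element group size described below; the exact on-disk layout may differ) lands close to the reported 309 GB:

```python
# Back-of-envelope TQ3 size estimate (illustrative; exact layout may differ).
# 3-bit indices per weight, plus one scale (norm) per 128-element group.
total_params = 769e9          # total parameters (from the model card)
group_size = 128              # elements per quantization group
norm_bytes = 2                # assumption: FP16 per-group norms

index_gb = total_params * 3 / 8 / 1e9                   # packed 3-bit indices
norm_gb = total_params / group_size * norm_bytes / 1e9  # per-group norms

print(f"indices: {index_gb:.0f} GB, norms: {norm_gb:.0f} GB, "
      f"total: {index_gb + norm_gb:.0f} GB")  # ~300 GB, close to the 309 GB checkpoint
```

The small remainder is plausibly embeddings, router weights, and other tensors kept at higher precision.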

Status

Not yet tested on GPU. This checkpoint was created and uploaded automatically. Quality validation on a multi-GPU setup is pending.

The same code path was validated on GLM-4.7-Flash (355B, same MoE architecture with 64 experts), where it loaded successfully and produced correct outputs on all test prompts using 13.3 GB of GPU memory.

Architecture

GLM-5.1 uses the Glm4MoeLiteNaiveMoe architecture:

  • 769B total parameters, 40B active per token
  • 256 routed experts, 8 active per token, 1 shared expert
  • 78 layers, hidden_size=6144
  • Multi-head Latent Attention (MLA)
  • First 3 layers are dense (not MoE)
  • 200K context window
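To illustrate the routing numbers above (8 of 256 routed experts active per token), here is a minimal top-k softmax router sketch in NumPy. The real Glm4MoeLiteNaiveMoe router likely differs in details (bias terms, normalization order), so treat this as illustrative only:

```python
import numpy as np

# Expert counts from the architecture list above; random weights for illustration
num_experts, top_k, hidden = 256, 8, 6144
rng = np.random.default_rng(0)

W_router = rng.standard_normal((hidden, num_experts)).astype(np.float32)
x = rng.standard_normal(hidden).astype(np.float32)   # one token's hidden state

logits = x @ W_router
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                 # softmax over 256 experts
expert_ids = np.argsort(probs)[-top_k:]              # top-8 experts for this token
weights = probs[expert_ids] / probs[expert_ids].sum()  # renormalized gate weights

print(len(expert_ids), weights.sum())  # 8 experts active; gates sum to 1
```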

How it works

TQ3 applies the WHT rotation and Gaussian Lloyd-Max codebook from TurboQuant (ICLR 2026). After a random Walsh-Hadamard rotation, weight distributions become near-Gaussian, so each 128-element group can be quantized efficiently with 8 centroids (3 bits per weight). No calibration data is needed.
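A minimal sketch of the rotate-then-quantize idea, using a deterministic Sylvester Hadamard matrix and approximate 3-bit Lloyd-Max levels for a standard Gaussian (TurboQuant's actual randomized rotation and codebook construction may differ):

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of the n x n Walsh-Hadamard matrix (n a power of 2)
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal

# Approximate 8-level (3-bit) Lloyd-Max output levels for a standard Gaussian
CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def quantize_group(w):
    # Rotate, scale by the group norm, snap each element to the nearest centroid
    H = hadamard(len(w))
    r = H @ w                                   # near-Gaussian after rotation
    norm = np.sqrt((r ** 2).mean())             # per-group scale (the stored "norm")
    idx = np.abs(r[:, None] / norm - CODEBOOK[None, :]).argmin(axis=1)  # 3-bit indices
    return idx, norm

def dequantize_group(idx, norm, n=128):
    H = hadamard(n)
    return H.T @ (CODEBOOK[idx] * norm)         # un-rotate to recover the weights

rng = np.random.default_rng(0)
w = rng.standard_normal(128)
idx, norm = quantize_group(w)
w_hat = dequantize_group(idx, norm)
print(idx.max() <= 7, np.abs(w - w_hat).mean() < 0.2)  # True True (small error)
```

Because the rotation is orthonormal, the quantization error introduced in the rotated domain carries over unchanged in L2 norm to the original weights.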

The checkpoint stores packed 3-bit indices + per-group norms. The loader handles:

  • Per-expert 2D → fused 3D regrouping (gate_proj + up_proj → gate_up_proj fusion)
  • Router/gate weight decompression in-place
  • Meta-device model creation for low-memory loading
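The packed 3-bit index format can be illustrated with a small round-trip sketch. The checkpoint's actual bit layout is an assumption here; this just shows how 8 indices fit in 3 bytes:

```python
import numpy as np

def pack3(idx):
    # Pack 3-bit indices (values 0..7) into a uint8 byte stream, MSB-first
    bits = ((idx[:, None] >> np.arange(2, -1, -1)) & 1).astype(np.uint8)
    return np.packbits(bits.ravel())

def unpack3(packed, n):
    # Recover n 3-bit indices from the byte stream
    bits = np.unpackbits(packed)[: n * 3].reshape(n, 3)
    return (bits * np.array([4, 2, 1], dtype=np.uint8)).sum(axis=1)

idx = np.array([0, 7, 3, 5, 1, 6, 2, 4], dtype=np.uint8)
packed = pack3(idx)
print(packed.nbytes, (unpack3(packed, len(idx)) == idx).all())  # 3 True
```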

Usage

pip install "turboquant-plus-vllm @ git+https://github.com/varjoranta/turboquant-vllm.git"

from turboquant_vllm import load_tq3_model

# Requires a multi-GPU setup; see GPU requirements below
model, tokenizer = load_tq3_model("varjosoft/GLM-5.1-Open-TQ3", device="cuda")

GPU requirements for inference

  Setup          Total VRAM  Per-GPU  Cost/hr (Verda)
  8× A100 80GB   640 GB      45 GB    $10.32
  4× H200 141GB  564 GB      90 GB    $13.56
  2× B300 262GB  524 GB      180 GB   $13.98

Without TQ3, the BF16 model requires 1,510 GB VRAM (minimum 8× B300 at $55.92/hr).
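The per-GPU figures above are roughly the checkpoint's weight share per GPU plus runtime overhead (KV cache, activations, framework buffers); the weight share alone works out as:

```python
# Weight-only share of the 309 GB checkpoint per GPU, for each setup in the table
checkpoint_gb = 309
for gpus in (8, 4, 2):
    print(gpus, "GPUs ->", round(checkpoint_gb / gpus, 1), "GB of weights each")
```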

Software requirements

  • transformers >= 5.5.0
  • turboquant-plus-vllm (GitHub)
  • PyTorch with CUDA

Comparison with other quantizations

  Method                 Size           Calibration   Format       Target
  This (TQ3)             309 GB (4.9x)  None          Safetensors  GPU serving (vLLM/PyTorch)
  Unsloth Dynamic 2-bit  236 GB (6.4x)  300K+ tokens  GGUF         Local/CPU (llama.cpp)
  BF16 original          1,510 GB       N/A           Safetensors  8× B300+

License

MIT (same as base model). Created by Varjosoft Oy.
