LLaDA-8B-Quantized
World's first INT8 and INT4 weight-only quantization for LLaDA, a masked diffusion large language model trained from scratch at the 8B scale.
Code & full documentation: github.com/qubitronlabsdev/llada-quantization
Model Description
LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens in parallel via iterative masked denoising, unlike autoregressive models (GPT, LLaMA) that generate one token at a time.
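As a toy illustration of this parallel denoising loop (plain Python with a hypothetical `predict` callback; the real model runs a transformer over token IDs and uses the remasking schedule described in the LLaDA paper):

```python
MASK = "<mask>"

def toy_denoise(predict, length, steps):
    """Illustrative masked-denoising loop: start fully masked, predict
    all masked positions in parallel, and commit only the most confident
    predictions each step (the rest stay masked for the next pass)."""
    seq = [MASK] * length
    per_step = max(1, length // steps)  # tokens to commit per step
    while MASK in seq:
        # predict(seq, i) -> (token, confidence) for a masked position i
        proposals = {i: predict(seq, i)
                     for i, t in enumerate(seq) if t == MASK}
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            seq[i] = proposals[i][0]
    return seq

# Dummy predictor: "sure" about even positions, less sure about odd ones.
def dummy_predict(seq, i):
    return f"tok{i}", 1.0 if i % 2 == 0 else 0.5

print(toy_denoise(dummy_predict, length=6, steps=3))
```

With the dummy predictor, even positions are filled first and odd positions on later steps, mirroring how confident tokens are unmasked before uncertain ones.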
This repository provides two post-training quantized variants of GSAI-ML/LLaDA-8B-Instruct:
| File | Quantization | Size | Memory Saved | Speed (A100) |
|---|---|---|---|---|
| `llada_int8_quantized.pt` | INT8 per-row | 8.54 GB | 47% | 9.64 tok/s |
| `llada_int4_quantized.pt` | INT4 packed | 4.79 GB | 70% | 3.39 tok/s |

Original model (bfloat16): 16.13 GB
How It Works
All `nn.Linear` layers are replaced with custom quantized layers:
- INT8: weights scaled per-row to `[-127, 127]` integers, with scale factors stored in float32 (~1 byte per weight).
- INT4: weights scaled per-row to `[-8, 7]` integers, two values packed per byte (uint8) (~0.5 bytes per weight).
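A minimal sketch of these two schemes (plain Python lists for clarity; the repository's actual implementation operates on PyTorch tensors and also clamps to the integer range):

```python
def quantize_int8_row(row):
    """Per-row symmetric INT8: scale so the largest |w| maps to 127."""
    scale = max(abs(w) for w in row) / 127.0 or 1.0  # avoid scale == 0
    return [round(w / scale) for w in row], scale

def dequantize_int8_row(q_row, scale):
    """Recover approximate float weights from INT8 values + row scale."""
    return [q * scale for q in q_row]

def pack_int4(values):
    """Pack two signed 4-bit values in [-8, 7] into each uint8."""
    packed = []
    for i in range(0, len(values), 2):
        lo = values[i] & 0xF
        hi = (values[i + 1] & 0xF) if i + 1 < len(values) else 0
        packed.append(lo | (hi << 4))
    return packed

def unpack_int4(packed, n):
    """Split each byte back into two nibbles and sign-extend them."""
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            out.append(nib - 16 if nib >= 8 else nib)
    return out[:n]
```

Dequantizing an INT8 row reproduces the original weights to within half a scale step, and the INT4 pack/unpack round trip is exact on the integer values.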
Both variants dequantize weights on-the-fly during the forward pass. No changes to model architecture or generation logic.
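On-the-fly dequantization can be pictured as follows (a simplified pure-Python matvec under the per-row INT8 scheme above; the actual layers do this with batched tensor ops on GPU):

```python
def quantize_rows(weight):
    """Per-row symmetric INT8 quantization of a 2-D weight matrix."""
    q_rows, scales = [], []
    for row in weight:
        s = max(abs(w) for w in row) / 127.0 or 1.0
        q_rows.append([round(w / s) for w in row])
        scales.append(s)
    return q_rows, scales

def quantized_linear(x, q_rows, scales):
    """Compute y = W @ x, dequantizing each INT8 row on the fly:
    no full-precision copy of W is ever materialized."""
    return [s * sum(q * xi for q, xi in zip(row, x))
            for row, s in zip(q_rows, scales)]
```

Because the per-row scale is a scalar, it can be factored out of the dot product, so each output element needs only one float multiply beyond the integer accumulation.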
Usage
Installation
```bash
git clone https://github.com/qubitronlabsdev/llada-quantization
cd llada-quantization
pip install -r requirements.txt
```
Load and Generate
```python
from inference import load_quantized, generate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True,
)

# Download weights from this repo first, then:

# INT8
model = load_quantized(
    "llada_int8_quantized.pt",
    mode="int8",
    device="cuda",
)

# INT4
model = load_quantized(
    "llada_int4_quantized.pt",
    mode="int4",
    device="cuda",
)

output = generate(model, tokenizer, "What is machine learning?")
print(output)
```
Quantize from Scratch
```python
from quantize import run_and_save

run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
run_and_save(mode="int4", save_path="llada_int4_quantized.pt")
```
Hardware Requirements
| Variant | Min VRAM | Recommended |
|---|---|---|
| INT8 | 12 GB | A100 / H100 |
| INT4 | 8 GB | RTX 3090 / A100 |
Tested on: NVIDIA A100 80GB, NVIDIA H100
Limitations
- INT4 introduces slightly more quantization error than INT8
- Generation speed depends on sequence length and number of diffusion steps
- English only (inherited from base model)
Citation
If you use this work, please cite:
```bibtex
@misc{llada-quantization-2026,
  title  = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
  author = {Dhiraj Choudhary},
  year   = {2026},
  url    = {https://github.com/qubitronlabsdev/llada-quantization}
}
```
Original LLaDA paper:
```bibtex
@article{nie2025large,
  title  = {Large Language Diffusion Models},
  author = {Nie, Shen and others},
  year   = {2025},
  url    = {https://arxiv.org/abs/2502.09992}
}
```
License
Apache 2.0, the same license as the original LLaDA model.