LLaDA-8B-Quantized

World's first INT8 and INT4 weight-only quantization for LLaDA, a masked diffusion large language model trained from scratch at 8B scale.

Code & full documentation: github.com/qubitronlabsdev/llada-quantization


Model Description

LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens in parallel via iterative masked denoising, unlike autoregressive models (GPT, LLaMA) that generate one token at a time.
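To make the generation process concrete, here is a toy sketch of iterative masked denoising: start from a fully masked sequence, predict all positions in parallel each step, and re-mask the least-confident predictions for the next round. This is an illustrative sketch only, not LLaDA's actual sampler; `model_logits_fn`, the linear remasking schedule, and the confidence criterion are all assumptions for demonstration.

```python
import torch

def masked_denoise(model_logits_fn, seq_len, vocab_size, mask_id, steps=4):
    """Toy sketch of parallel masked denoising (illustrative, not LLaDA's code).

    model_logits_fn: callable mapping a (seq_len,) token tensor to
                     (seq_len, vocab_size) logits.
    mask_id:         a token id outside the vocabulary, used as the mask.
    """
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model_logits_fn(tokens)           # predict every position at once
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # confidence + argmax token
        masked = tokens == mask_id
        tokens[masked] = pred[masked]              # commit predictions to masked slots
        # linear schedule: re-mask the least-confident positions for the next step
        n_remask = int(seq_len * (1 - (step + 1) / steps))
        if n_remask > 0:
            tokens[conf.argsort()[:n_remask]] = mask_id
    return tokens
```

With a real model, `model_logits_fn` would be a forward pass; the number of steps trades quality against speed, which is why the speed numbers below depend on the diffusion step count.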

This repository provides two post-training quantized variants of GSAI-ML/LLaDA-8B-Instruct:

File                      Quantization    Size      Memory Saved   Speed (A100)
llada_int8_quantized.pt   INT8 per-row    8.54 GB   47%            9.64 tok/s
llada_int4_quantized.pt   INT4 packed     4.79 GB   70%            3.39 tok/s

Original model (bfloat16): 16.13 GB
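The Memory Saved column follows directly from the file sizes relative to the bfloat16 original; a quick back-of-envelope check:

```python
# Verify the table's "Memory Saved" figures from the reported file sizes.
bf16_gb = 16.13
for name, size_gb in [("INT8", 8.54), ("INT4", 4.79)]:
    saved = 1 - size_gb / bf16_gb
    print(f"{name}: {saved:.0%} saved")   # INT8: 47%, INT4: 70%
```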


How It Works

All nn.Linear layers are replaced with custom quantized layers:

  • INT8: weights scaled per-row to [-127, 127] integers. Scale factors stored in float32. ~1 byte per weight.
  • INT4: weights scaled per-row to [-8, 7] integers. Two values packed per byte (uint8). ~0.5 bytes per weight.

Both variants dequantize weights on-the-fly during the forward pass. No changes to model architecture or generation logic.
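A minimal sketch of the per-row scheme described above, assuming symmetric (absmax) scaling; function names here are illustrative and not the repo's actual API:

```python
import torch

def quantize_int8_per_row(w: torch.Tensor):
    """Symmetric per-row INT8 quantization: one float32 scale per output row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)                       # guard all-zero rows
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """On-the-fly dequantization done inside the quantized layer's forward."""
    return q.float() * scale

def pack_int4(q4: torch.Tensor) -> torch.Tensor:
    """Pack signed 4-bit values in [-8, 7] two-per-byte into uint8."""
    u = (q4 + 8).to(torch.uint8)                        # shift to unsigned [0, 15]
    return (u[:, 0::2] << 4) | u[:, 1::2]               # high nibble | low nibble

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    hi = (packed >> 4).to(torch.int16) - 8
    lo = (packed & 0xF).to(torch.int16) - 8
    return torch.stack([hi, lo], dim=-1).reshape(packed.shape[0], -1)
```

The INT4 path follows the same pattern with a [-8, 7] range, plus the pack/unpack step, which is why it halves storage again but adds unpacking work at inference time.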


Usage

Installation

git clone https://github.com/qubitronlabsdev/llada-quantization
cd llada-quantization
pip install -r requirements.txt

Load and Generate

from inference import load_quantized, generate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True
)

# Download weights from this repo first, then:

# INT8
model = load_quantized(
    "llada_int8_quantized.pt",
    mode="int8",
    device="cuda"
)

# INT4
model = load_quantized(
    "llada_int4_quantized.pt",
    mode="int4",
    device="cuda"
)

output = generate(model, tokenizer, "What is machine learning?")
print(output)

Quantize from Scratch

from quantize import run_and_save

run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
run_and_save(mode="int4", save_path="llada_int4_quantized.pt")

Hardware Requirements

Variant   Min VRAM   Recommended
INT8      12 GB      A100 / H100
INT4      8 GB       RTX 3090 / A100

Tested on: NVIDIA A100 80GB, NVIDIA H100


Limitations

  • INT4 introduces slightly more quantization error than INT8
  • Generation speed depends on sequence length and number of diffusion steps
  • English only (inherited from the base model)

Citation

If you use this work, please cite:

@misc{llada-quantization-2026,
  title  = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
  author = {Dhiraj Choudhary},
  year   = {2026},
  url    = {https://github.com/qubitronlabsdev/llada-quantization}
}

Original LLaDA paper:

@article{nie2025large,
  title  = {Large Language Diffusion Models},
  author = {Nie, Shen and others},
  year   = {2025},
  url    = {https://arxiv.org/abs/2502.09992}
}

License

Apache 2.0, the same license as the original LLaDA model.
