---
license: apache-2.0
language:
- en
base_model:
- GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
tags:
- diffusion-language-model
- quantization
library_name: transformers
---

# LLaDA-8B-Quantized

**World's first INT8 and INT4 weight-only quantization for [LLaDA](https://github.com/ML-GSAI/LLaDA), a masked diffusion large language model trained from scratch at 8B scale.**

> Code & full documentation: [github.com/qubitronlabsdev/llada-quantization](https://github.com/qubitronlabsdev/llada-quantization)

---

## Model Description

LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens **in parallel** via iterative masked denoising, unlike autoregressive models (GPT, LLaMA) that generate one token at a time.

This repository provides two post-training quantized variants of `GSAI-ML/LLaDA-8B-Instruct`:

| File | Quantization | Size | Memory Saved | Speed (A100) |
|---|---|---|---|---|
| `llada_int8_quantized.pt` | INT8 per-row | 8.54 GB | **47%** | **9.64 tok/s** |
| `llada_int4_quantized.pt` | INT4 packed | 4.79 GB | **70%** | 3.39 tok/s |

Original model (bfloat16): 16.13 GB

---

## How It Works

All `nn.Linear` layers are replaced with custom quantized layers:

- **INT8**: weights scaled per-row to `[-127, 127]` integers. Scale factors stored in float32. ~1 byte per weight.
- **INT4**: weights scaled per-row to `[-8, 7]` integers. Two values packed per byte (uint8). ~0.5 bytes per weight.

Both variants dequantize weights on-the-fly during the forward pass. No changes to model architecture or generation logic.
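
The per-row scaling and INT4 packing described above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, symmetric scaling by the row's maximum absolute value is an assumption, and so is the low-nibble-first packing order. The actual layers operate on tensors rather than Python lists.

```python
def quantize_row_int8(row):
    """Symmetric per-row INT8 (assumed scheme): integers in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def quantize_row_int4(row):
    """Symmetric per-row INT4 (assumed scheme): integers clamped to [-8, 7]."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale


def pack_int4(values):
    """Pack signed 4-bit values two per byte; low nibble first is an assumption."""
    if len(values) % 2:
        values = values + [0]  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Masking with 0x0F keeps the 4-bit two's-complement pattern.
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Reverse pack_int4 and sign-extend each nibble back to [-8, 7]."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


def dequantize(q, scale):
    """On-the-fly dequantization: integer value times the row's scale."""
    return [v * scale for v in q]


row = [0.50, -1.27, 0.03, 0.90]

q8, s8 = quantize_row_int8(row)   # one integer per weight
q4, s4 = quantize_row_int4(row)   # coarser grid, larger rounding error
packed = pack_int4(q4)            # 2 bytes hold the 4 INT4 weights

print(q8, dequantize(q8, s8))
print(q4, dequantize(unpack_int4(packed, len(q4)), s4))
```

With symmetric scaling, each dequantized weight differs from the original by at most half a scale step, which is why INT4 (15 usable levels per row in this sketch) loses more precision than INT8 (255 levels).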

---

## Usage

### Installation

```bash
git clone https://github.com/qubitronlabsdev/llada-quantization
cd llada-quantization
pip install -r requirements.txt
```

### Load and Generate

```python
from inference import load_quantized, generate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True
)

# Download weights from this repo first, then load ONE variant.

# INT8
model = load_quantized(
    "llada_int8_quantized.pt",
    mode="int8",
    device="cuda"
)

# INT4 (use instead of the INT8 block above)
# model = load_quantized(
#     "llada_int4_quantized.pt",
#     mode="int4",
#     device="cuda"
# )

output = generate(model, tokenizer, "What is machine learning?")
print(output)
```

### Quantize from Scratch

```python
from quantize import run_and_save

run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
run_and_save(mode="int4", save_path="llada_int4_quantized.pt")
```

---

## Hardware Requirements

| Variant | Min VRAM | Recommended |
|---|---|---|
| INT8 | 12 GB | A100 / H100 |
| INT4 | 8 GB | RTX 3090 / A100 |

Tested on: NVIDIA A100 80GB, NVIDIA H100

---

## Limitations

- INT4 introduces slightly more quantization error than INT8
- Generation speed depends on sequence length and number of diffusion steps
- English only (inherited from base model)

---

## Citation

If you use this work, please cite:

```bibtex
@misc{llada-quantization-2026,
  title  = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
  author = {Dhiraj Choudhary},
  year   = {2026},
  url    = {https://github.com/qubitronlabsdev/llada-quantization}
}
```

Original LLaDA paper:

```bibtex
@article{nie2025large,
  title  = {Large Language Diffusion Models},
  author = {Nie, Shen and others},
  year   = {2025},
  url    = {https://arxiv.org/abs/2502.09992}
}
```

---

## License

Apache 2.0, same as the original LLaDA model.