---
license: other
license_name: deepseek
license_link: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/LICENSE-MODEL
base_model: deepseek-ai/DeepSeek-V4-Flash
tags:
- quantized
- gptq
- int2
- moe
- deepseek
- deepseek-v4-flash
pipeline_tag: text-generation
---
# DeepSeek-V4-Flash INT2-G64
INT2 group-64 quantization of DeepSeek-V4-Flash's 256 routed experts. The full 284B-parameter MoE fits in 96 GB of VRAM and runs on a single GPU.
Inference code, kernels, and the full quantization pipeline live at [github.com/Infatoshi/dsv4-int2](https://github.com/Infatoshi/dsv4-int2). This repository contains weights only — they will not load with vanilla `transformers` or vLLM.
## Numbers
| Metric | Value |
| --- | --- |
| Checkpoint size | 75 GB (vs 132 GB MXFP4, 543 GB BF16) |
| Routed-expert format | INT2 g64, FP16 scale + INT4 zero |
| Layers | 43 MoE expert layers (one per `layer_NN.safetensors`) |
| MMLU (0-shot, 14,042 questions, V4 chat template) | 72.46% |
| Decode throughput (RTX PRO 6000 Blackwell) | 17 tok/s eager (reference path; not perf-tuned) |
The official BF16 V4-Flash-Base 5-shot MMLU is 88.7%; the gap is partly setup (0-shot vs 5-shot) and partly real quantization cost.
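As a back-of-envelope check on those sizes (a sketch, assuming the 75 GB counts only the packed expert tensors and that GB means 10^9 bytes):

```python
G = 64                    # group size: input channels sharing one scale/zero
code_bits  = 2            # INT2 weight code
scale_bits = 16 / G       # one FP16 scale per group
zero_bits  = 4 / G        # one INT4 zero-point per group (packed two per byte)

bits_per_weight = code_bits + scale_bits + zero_bits
print(bits_per_weight)                    # 2.3125 bits/weight
print(75e9 * 8 / bits_per_weight / 1e9)   # ~259B weights, i.e. the routed
                                          # experts carry most of the 284B total
```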
## Format
Each `layer_NN.safetensors` holds the routed experts for one MoE layer. For each of the three projections (`w1` gate, `w3` up, `w2` down):
- `w_packed`: `[E=256, K_out, K_in/16]` `uint32` — 16 INT2 values per `uint32`
- `w_scale`: `[E, K_out, K_in/G]` `float16` — one FP16 scale per group of `G=64` input channels
- `w_zero_packed`: `[E, K_out, K_in/(2G)]` `int8` — INT4 zero-points, packed two per byte
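For orientation, a minimal NumPy dequantization sketch. The bit ordering (least-significant 2-bit code first within each `uint32`, low nibble first in each zero-point byte) is an assumption here; the Triton kernels in the GitHub repo are the authoritative implementation:

```python
import numpy as np

def dequant_int2_g64(w_packed, w_scale, w_zero_packed, G=64):
    """Dequantize one projection to [E, K_out, K_in] float32.
    Bit/nibble ordering is assumed, not taken from the repo."""
    E, K_out, _ = w_packed.shape
    # Unpack 16 INT2 codes per uint32 word (assumed LSB-first).
    shifts = np.arange(16, dtype=np.uint32) * 2
    codes = (w_packed[..., None] >> shifts) & 0x3            # [E, K_out, K_in/16, 16]
    codes = codes.reshape(E, K_out, -1).astype(np.float32)   # [E, K_out, K_in]
    # Unpack two INT4 zero-points per int8 byte (assumed low nibble first).
    zp = w_zero_packed.view(np.uint8)
    zeros = np.stack([zp & 0xF, zp >> 4], axis=-1).reshape(E, K_out, -1)
    # Broadcast per-group scale/zero across the G input channels of each group.
    scale = np.repeat(w_scale.astype(np.float32), G, axis=-1)
    zero = np.repeat(zeros.astype(np.float32), G, axis=-1)
    return (codes - zero) * scale
```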
Non-expert weights (MLA, embeddings, norms, shared expert, indexer, compressor, head) are NOT in this checkpoint — pull them from the upstream DeepSeek-V4-Flash MXFP4 release. The hybrid loader in the GitHub repo does this automatically.
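For intuition, a conceptual sketch of what such hybrid loading involves. This is not the repo's actual loader; the `.experts.` name filter and the glob patterns are assumptions:

```python
from pathlib import Path
from safetensors import safe_open

def iter_hybrid_tensors(int2_dir: str, ref_dir: str):
    """Yield (name, tensor) pairs: packed experts from this checkpoint,
    everything else from the upstream MXFP4 release. Illustrative only."""
    # Routed-expert tensors come from this repo's per-layer shards.
    for shard in sorted(Path(int2_dir).glob("layer_*.safetensors")):
        with safe_open(shard, framework="pt") as f:
            for name in f.keys():
                yield name, f.get_tensor(name)   # w_packed / w_scale / w_zero_packed
    # Non-expert tensors (MLA, norms, head, ...) come from the upstream release.
    for shard in sorted(Path(ref_dir).glob("*.safetensors")):
        with safe_open(shard, framework="pt") as f:
            for name in f.keys():
                if ".experts." in name:          # assumed naming; skip replaced experts
                    continue
                yield name, f.get_tensor(name)
```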
`quant_stats.json` records per-layer GPTQ reconstruction error and routing-coverage stats (RTN-fallback count, visit min/max/median per expert).
## Method
Standard GPTQ with INT2 g64, run per-expert. Calibration uses Mistral-7B-v0.1 layer-16 hidden states as the proxy distribution — chosen for portability rather than parity with V4. Two implications worth knowing before quoting these numbers:
- Across 41 layers, 211 of 256 routed experts received zero calibration tokens (V4's HC-sinkhorn routing is highly domain-specific, and Mistral natural-text activations don't reach all experts). Under-covered experts fall back to per-channel RTN, sketched after this list.
- V4 self-calibration would close this gap; it is not run here. See `quant/v4_self_calib.py` in the GitHub repo for a starting point.
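A minimal sketch of that RTN fallback, assuming asymmetric min/max quantization over the same g64 groups the stored format uses (the repo may differ in detail):

```python
import numpy as np

def rtn_int2_g64(w, G=64):
    """Data-free round-to-nearest for an uncalibrated expert projection.
    w: [K_out, K_in] float weights. Returns INT2 codes plus FP16 scale
    and INT4 zero per group, matching the checkpoint format. Min/max
    asymmetric quantization is an assumption, not the repo's exact recipe."""
    K_out, K_in = w.shape
    wg = w.reshape(K_out, K_in // G, G)
    lo = wg.min(axis=-1)
    hi = wg.max(axis=-1)
    scale = np.maximum((hi - lo) / 3.0, 1e-8)      # 4 levels -> codes 0..3
    zero = np.clip(np.round(-lo / scale), 0, 15)   # stored as INT4
    codes = np.round(wg / scale[..., None]) + zero[..., None]
    codes = np.clip(codes, 0, 3).astype(np.uint8)
    return codes, scale.astype(np.float16), zero.astype(np.int8)
```

Dequantization is then `(codes - zero) * scale`, consistent with the format sketch above.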
## Loading
This is research code; there is no `from_pretrained` path. To run inference:
```bash
git clone https://github.com/Infatoshi/dsv4-int2
cd dsv4-int2
uv venv && uv sync

# point the loader at this checkpoint + the upstream V4-Flash release
export DSV4_REF=/path/to/DeepSeek-V4-Flash   # MXFP4 release (tokenizer + non-expert weights)
export DSV4_INT2=/path/to/this/checkpoint    # this directory (download from HF)

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  uv run python eval/v4_int2/repl.py
```
## Limitations
- **Quantization-only.** This is a quant + reference inference path, not a perf-tuned serving stack. Decode hits ~26% of HBM peak.
- **Custom kernel required.** Cannot be loaded with stock `transformers` or vLLM. Triton kernels in the GitHub repo handle dequantization on the fly.
- **Calibration coverage gap.** 211/256 experts per layer get zero calibration visits under our setup. Rare-domain quality may be worse than the headline MMLU suggests.
- **Single-GPU only.** Loader assumes `world_size=1`. No tensor parallelism.
- **Hardware tested:** RTX PRO 6000 Blackwell, SM_120, 96 GB. Other architectures should work via Triton autotune but have not been measured.
## License
Source code on GitHub is MIT. These weights are derivatives of DeepSeek-V4-Flash and inherit the DeepSeek Model License.
## Citation
```bibtex
@misc{dsv4int2,
  title  = {dsv4-int2: INT2 quantization of DeepSeek-V4-Flash for single-GPU inference},
  author = {Arledge, Elliot},
  year   = {2026},
  url    = {https://github.com/Infatoshi/dsv4-int2}
}
```