---
license: other
license_name: deepseek
license_link: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/LICENSE-MODEL
base_model: deepseek-ai/DeepSeek-V4-Flash
tags:
- quantized
- gptq
- int2
- moe
- deepseek
- deepseek-v4-flash
pipeline_tag: text-generation
---

# DeepSeek-V4-Flash INT2-G64
|
|
INT2 group-64 quantization of [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)'s 256 routed experts. The full 284B-parameter MoE fits in 96 GB of VRAM and runs on a single GPU.

**Inference code, kernels, and the full quantization pipeline live at [github.com/Infatoshi/dsv4-int2](https://github.com/Infatoshi/dsv4-int2).** This repository contains weights only; they will not load with vanilla `transformers` or `vllm`.
|
|
## Numbers

| | |
|---|---|
| Checkpoint size | **75 GB** (vs 132 GB MXFP4, 543 GB BF16) |
| Routed-expert format | INT2 g64, FP16 scale + INT4 zero |
| Layers | 43 expert MoE layers (one per `layer_NN.safetensors`) |
| MMLU 0-shot, 14,042 questions, V4 chat template | **72.46%** |
| Decode throughput, RTX PRO 6000 Blackwell | 17 tok/s eager (reference path; not perf-tuned) |
|
|
The official BF16 V4-Flash-Base 5-shot MMLU is 88.7%; the gap is partly evaluation setup (0-shot vs 5-shot) and partly real quantization cost.
|
|
## Format

Each `layer_NN.safetensors` holds the routed experts for one MoE layer. Each of the three projections (`w1` gate, `w3` up, `w2` down) is stored as three tensors; a dequantization sketch follows the list:

- `w_packed`: `[E=256, K_out, K_in/16]` `uint32` (16 INT2 values per `uint32`)
- `w_scale`: `[E, K_out, K_in/G]` `float16` (one scale per group of `G=64` input channels)
- `w_zero_packed`: `[E, K_out, K_in/(2G)]` `int8` (INT4 zero-points, two per byte)
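
To make the layout concrete, here is a minimal PyTorch sketch of how one expert's projection dequantizes back to `float16`. The bit order within each `uint32` word and `int8` byte (low bits first) and the asymmetric `(q - zero) * scale` convention are assumptions on my part; the Triton kernels in the GitHub repo are the authoritative reference.

```python
import torch

def dequant_int2_g64(w_packed, w_scale, w_zero_packed, G=64):
    """Dequantize one expert's projection (shapes as in the list above,
    with the leading E=256 expert dimension already indexed away)."""
    K_out, n_words = w_packed.shape
    K_in = n_words * 16

    # Unpack 16 INT2 codes per 32-bit word. Reinterpreting uint32 as int32
    # keeps bitwise ops portable; masking after the shift discards any
    # sign-extension bits. Low-bits-first order is an assumption.
    words = w_packed.view(torch.int32).unsqueeze(-1)           # [K_out, n_words, 1]
    shifts = torch.arange(16, device=w_packed.device) * 2      # 0, 2, ..., 30
    q = ((words >> shifts) & 0x3).reshape(K_out, K_in).half()  # codes in {0..3}

    # Unpack two INT4 zero-points per byte (low nibble first, assumed).
    z = w_zero_packed
    zeros = torch.stack([z & 0xF, (z >> 4) & 0xF], dim=-1)     # [K_out, K_in/(2G), 2]
    zeros = zeros.reshape(K_out, K_in // G).half()

    # Broadcast each group's scale/zero over its G=64 input channels and
    # apply the usual asymmetric dequant: w = (q - zero) * scale.
    scale = w_scale.repeat_interleave(G, dim=1)                # [K_out, K_in]
    zero = zeros.repeat_interleave(G, dim=1)
    return (q - zero) * scale
```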
|
|
Non-expert weights (MLA, embeddings, norms, shared expert, indexer, compressor, head) are NOT in this checkpoint; pull them from the upstream [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) MXFP4 release. The hybrid loader in the GitHub repo does this automatically.
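
A rough sketch of what that two-source load looks like. The function name, glob patterns, and the `"experts"` substring filter are illustrative assumptions; the real loader lives in the GitHub repo.

```python
from pathlib import Path
from safetensors.torch import load_file

def load_hybrid(int2_dir: str, ref_dir: str) -> dict:
    """Merge non-expert tensors from the upstream release with the INT2
    routed experts from this checkpoint."""
    weights = {}
    # Non-expert weights (MLA, embeddings, norms, shared expert, ...)
    # come from the upstream MXFP4 release.
    for shard in sorted(Path(ref_dir).glob("*.safetensors")):
        for name, tensor in load_file(shard).items():
            if "experts" not in name:  # crude routed-expert filter (assumed)
                weights[name] = tensor
    # INT2 routed experts: one layer_NN.safetensors file per MoE layer.
    for shard in sorted(Path(int2_dir).glob("layer_*.safetensors")):
        weights.update(load_file(shard))
    return weights
```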
|
|
`quant_stats.json` records per-layer GPTQ reconstruction error and routing-coverage stats (RTN-fallback count, visit min/max/median per expert).
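
For example, to rank layers by reconstruction error (the key names below are hypothetical; inspect the file for the actual schema):

```python
import json

with open("quant_stats.json") as f:
    stats = json.load(f)

# "gptq_err" and "rtn_fallbacks" are guessed field names, not the
# checkpoint's documented schema.
for layer, s in sorted(stats.items(), key=lambda kv: -kv[1].get("gptq_err", 0)):
    print(layer, s.get("gptq_err"), s.get("rtn_fallbacks"))
```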
|
|
## Method

Standard GPTQ with INT2 g64, run per-expert. Calibration uses Mistral-7B-v0.1 layer-16 hidden states as the proxy distribution, chosen for portability rather than parity with V4. Two implications worth knowing before quoting these numbers:

- Across 41 layers, 211 of 256 routed experts received zero calibration tokens (V4's HC-sinkhorn routing is highly domain-specific, and Mistral natural-text activations don't reach all experts). Under-covered experts fall back to per-channel RTN, sketched after this list.
- V4 self-calibration would close this gap; it was not run here. See `quant/v4_self_calib.py` in the GitHub repo for a starting point.
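
For reference, the RTN fallback amounts to plain round-to-nearest with an asymmetric zero-point. A minimal sketch of my own, not the repo's code; it is written group-wise to match the stored g64 layout, and `G=K_in` recovers the per-channel case:

```python
import torch

def rtn_quant(w: torch.Tensor, G: int = 64, bits: int = 2):
    """Round-to-nearest quantization of one expert weight [K_out, K_in].
    Returns integer codes plus per-group scale/zero (before bit-packing)."""
    K_out, K_in = w.shape
    qmax = 2**bits - 1                              # 3 for INT2
    g = w.reshape(K_out, K_in // G, G)
    lo = g.amin(dim=-1, keepdim=True)               # per-group min
    hi = g.amax(dim=-1, keepdim=True)               # per-group max
    scale = (hi - lo).clamp(min=1e-6) / qmax        # map the range onto 4 levels
    zero = (-lo / scale).round().clamp(0, qmax)     # asymmetric zero-point
    q = (g / scale + zero).round().clamp(0, qmax)   # nearest code
    return (q.reshape(K_out, K_in).to(torch.uint8),
            scale.squeeze(-1).half(),
            zero.squeeze(-1).to(torch.uint8))
```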
|
|
## Loading

This is research code; there is no `from_pretrained` path. To run inference:
|
|
```bash
git clone https://github.com/Infatoshi/dsv4-int2
cd dsv4-int2
uv venv && uv sync

# point the loader at this checkpoint + the upstream V4-Flash release
export DSV4_REF=/path/to/DeepSeek-V4-Flash   # MXFP4 release (tokenizer + non-expert weights)
export DSV4_INT2=/path/to/this/checkpoint    # this directory (download from HF)

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
uv run python eval/v4_int2/repl.py
```
|
|
## Limitations

- **Quantization-only.** This is a quant + reference inference path, not a perf-tuned serving stack. Decode hits ~26% of HBM peak.
- **Custom kernel required.** The weights cannot be loaded with stock `transformers` or `vllm`; Triton kernels in the GitHub repo handle dequantization on the fly.
- **Calibration coverage gap.** 211/256 experts per layer get zero calibration visits under our setup. Rare-domain quality may be worse than the headline MMLU suggests.
- **Single-GPU only.** The loader assumes `world_size=1`; there is no tensor parallelism.
- **Hardware tested:** RTX PRO 6000 Blackwell SM_120 (96 GB). Other architectures should work via Triton autotune but have not been measured.

## License

Source code on GitHub is MIT. These weights are derivatives of DeepSeek-V4-Flash and inherit the [DeepSeek Model License](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/LICENSE-MODEL).

## Citation

```bibtex
@misc{dsv4int2,
  title = {dsv4-int2: INT2 quantization of DeepSeek-V4-Flash for single-GPU inference},
  author = {Arledge, Elliot},
  year = {2026},
  url = {https://github.com/Infatoshi/dsv4-int2}
}
```