DeepSeek-V4-Flash — INT4/INT8 for SGLang on Ampere
AOT-requantized version of deepseek-ai/DeepSeek-V4-Flash that runs natively on Ampere GPUs (sm_86, RTX 3090 / A5000) under a forked SGLang (deepseek_v4_ampere branch). The native FP4+FP8 release requires Hopper FP4/FP8 tensor cores; this checkpoint replaces those formats with Marlin INT4 W4A16 and INT8 W8A16, both of which have native sm_86 paths.
Format
| Tensor pattern | Source format | Output format |
|---|---|---|
| `*.ffn.experts.E.{w1,w2,w3}` | MXFP4 (e2m1 + e8m0/32) | INT4 byte-packed + BF16 group=32 scales |
| `*.attn.{wq_a,wq_b,wkv,wo_a,wo_b}` | FP8 e4m3 + e8m0 128×128 | INT8 + BF16 128×128 block scales |
| `*.ffn.shared_experts.{w1,w2,w3}` | FP8 e4m3 + e8m0 128×128 | INT8 + BF16 128×128 block scales |
| Everything else (norms, embed, head, `hc_*`, `attn_sink`, gate) | BF16/FP32 | passthrough |
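As a quick sanity check, a converted shard can be inspected to see which tensors came out as packed integer payloads (plus their BF16 scales) and which were passed through. A minimal sketch, assuming PyTorch and safetensors are installed; the shard path is a placeholder, not the actual filename:

```python
# Sketch: list tensor names, dtypes, and shapes in one converted shard to see
# which entries became packed integer payloads vs. BF16 passthrough.
# shard_path is a placeholder; point it at any *.safetensors file in the output dir.
from safetensors import safe_open

shard_path = "path/to/converted/model-shard.safetensors"

with safe_open(shard_path, framework="pt") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(f"{name:80s} {str(t.dtype):12s} {tuple(t.shape)}")
```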
INT4 packing: low nibble = even-index element, high nibble = odd-index. Marlin
expects unsigned nibbles with implicit zero-point of 8, so values are
stored as unsigned = signed + 8.
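For illustration, here is a minimal pack/unpack round-trip in PyTorch following the convention above. This only shows the nibble layout and zero-point shift; the kernel may apply further layout transformations of its own at load time.

```python
# Sketch of the byte packing described above: signed INT4 values are shifted to
# unsigned nibbles with zero-point 8; the even-indexed element goes into the
# low nibble and the odd-indexed element into the high nibble.
import torch

def pack_int4(signed: torch.Tensor) -> torch.Tensor:
    """signed: int8 tensor with values in [-8, 7], last dim of even length."""
    u = (signed + 8).to(torch.uint8)   # unsigned = signed + 8
    lo = u[..., 0::2]                  # even-index element -> low nibble
    hi = u[..., 1::2]                  # odd-index element  -> high nibble
    return lo | (hi << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    lo = (packed & 0x0F).to(torch.int8) - 8
    hi = (packed >> 4).to(torch.int8) - 8
    return torch.stack((lo, hi), dim=-1).flatten(start_dim=-2)

w = torch.randint(-8, 8, (2, 8), dtype=torch.int8)
assert torch.equal(unpack_int4(pack_int4(w)), w)
```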
config.json carries quantization_config.quant_method = "dsv4_int", which
the SGLang fork registers as a compressed-tensors-style config covering
both groups.
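A quick way to confirm a local checkout advertises the custom method; only quant_method is documented above, so any other keys under quantization_config are intentionally not shown here:

```python
# Sketch: check that config.json carries the quant method the fork registers.
import json

with open("config.json") as f:
    cfg = json.load(f)

assert cfg["quantization_config"]["quant_method"] == "dsv4_int"
```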
Why BF16 scales over FP16: BF16's 8-bit exponent matches FP32/e8m0 exactly. For DeepSeek's typical weight-scale range (e8m0 in [2⁻¹²..2⁴]), BF16 vs FP16 SNR is within 0.1 dB; BF16 only matters for scales below ~2⁻¹⁴, where FP16 drops to subnormals and eventually underflows.
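A small PyTorch illustration of the exponent-range argument; the exponents chosen here are arbitrary examples, not values from the checkpoint:

```python
# e8m0 scales are pure powers of two. BF16 shares FP32's 8-bit exponent, so
# every such scale is exactly representable; FP16 (5-bit exponent) goes
# subnormal below 2^-14 and flushes to zero below ~2^-24.
import torch

for e in (4, -12, -20, -30):
    s = 2.0 ** e
    bf = torch.tensor(s, dtype=torch.bfloat16).item()
    fp = torch.tensor(s, dtype=torch.float16).item()
    print(f"2^{e:<4d}  bf16={bf!r:>14}  fp16={fp!r:>14}")
```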
Usage (SGLang fork on Ampere)
The fork's deepseek_v4_ampere branch is required — upstream SGLang doesn't
recognise quant_method=dsv4_int and doesn't have the sm_86 plumbing patches.
```bash
git clone --branch deepseek_v4_ampere https://github.com/AppMana/forks-sglang.git
cd forks-sglang && pip install -e .
```

```bash
python -m sglang.launch_server \
  --model-path appmana/deepseek-v4-int8-int4-sglang \
  --pp-size 12 \
  --tp-size 1 \
  --mem-fraction-static 0.85 \
  --context-length 8192 \
  --max-running-requests 4 \
  --swa-full-tokens-ratio 0.5 \
  --nsa-prefill-backend tilelang \
  --nsa-decode-backend tilelang \
  --trust-remote-code
```
Required environment overrides for sm_86:
```bash
export SGLANG_HACK_FLASHMLA_BACKEND=torch    # Hopper-only kernel -> torch fallback
export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1   # deep_gemm metadata -> torch
export SGLANG_OPT_FP8_WO_A_GEMM=0            # wo_a is INT8, not FP8
export SGLANG_OPT_FUSE_WQA_WKV=0             # individual loads, no FP8 fusion
export SGLANG_APPLY_CONFIG_BACKUP=none       # use real config.json
```
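Once the server is up, a quick smoke test can be run against the OpenAI-compatible endpoint. This assumes SGLang's default port 30000 and is only a sketch, not part of the fork's documentation:

```python
# Minimal smoke test against the launched server (default port assumed).
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "appmana/deepseek-v4-int8-int4-sglang",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```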
Reproduction
forks-sglang/tools/ampere/dsv4_requant_shard.py is the converter:
```bash
hf download deepseek-ai/DeepSeek-V4-Flash --local-dir <SRC>

python tools/ampere/dsv4_requant_shard.py \
  --src <SRC> --dst <DST> --device cuda:0
```
~157 GB output, ~50s per shard on an RTX A5000 — full conversion in ~35 minutes.
Limitations
- Phase A torch sparse-MLA fallback is currently used for the V4 DSA indexer on sm_86 (FP32 accumulation, slow but correct). Phase B (TileLang sparse-MLA kernel for sm_86) is in progress — when it lands, single-token latency will improve substantially.
- The conversion is lossy: ~21 dB SNR for FP4→INT4 (limited by INT4's 16 levels, not the BF16 scales) and ~25 dB SNR for FP8→INT8 with realistic weight-outlier patterns; see the measurement sketch after this list. This matches a 1:1 lossy translation from the highest-precision public artifact (FP4+FP8 native; DeepSeek did not release a BF16/FP32 V4-Flash).
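This is not the author's evaluation script, but a sketch of how an SNR figure of roughly this magnitude falls out of group-wise symmetric INT4 quantization; synthetic Gaussian weights stand in for the real FP4-decoded tensors:

```python
# Estimate SNR of group-wise symmetric INT4 quantization on synthetic weights.
import torch

def int4_group_snr(w: torch.Tensor, group: int = 32) -> float:
    g = w.reshape(-1, group).float()
    scale = g.abs().amax(dim=1, keepdim=True) / 7.0   # map group absmax onto the INT4 range
    q = torch.clamp(torch.round(g / scale), -8, 7)    # quantize
    err = g - q * scale                               # dequantize and take the residual
    return (10.0 * torch.log10(g.pow(2).sum() / err.pow(2).sum())).item()

w = torch.randn(4096, 4096)                           # stand-in for decoded FP4 weights
print(f"INT4, group=32: {int4_group_snr(w):.1f} dB")  # lands near ~21 dB for Gaussian data
```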
License
MIT, matching the base model.