DeepSeek-V4-Flash — INT4/INT8 for SGLang on Ampere

AOT-requantized version of deepseek-ai/DeepSeek-V4-Flash that runs natively on Ampere GPUs (sm_86, RTX 3090 / A5000) under a forked SGLang (deepseek_v4_ampere branch). The native FP4+FP8 release requires Hopper FP4/FP8 tensor cores; this checkpoint replaces those with Marlin INT4 W4A16 + INT8 W8A16, both of which have native sm_86 paths.

Format

| Tensor pattern | Source format | Output format |
|---|---|---|
| `*.ffn.experts.E.{w1,w2,w3}` | MXFP4 (e2m1 + e8m0/32) | INT4 byte-packed + BF16 group=32 scales |
| `*.attn.{wq_a,wq_b,wkv,wo_a,wo_b}` | FP8 e4m3 + e8m0 128×128 | INT8 + BF16 128×128 block scales |
| `*.ffn.shared_experts.{w1,w2,w3}` | FP8 e4m3 + e8m0 128×128 | INT8 + BF16 128×128 block scales |
| Everything else (norms, embed, head, `hc_*`, `attn_sink`, gate) | BF16/FP32 | passthrough |

INT4 packing: low nibble = even-index element, high nibble = odd-index. Marlin expects unsigned nibbles with implicit zero-point of 8, so values are stored as unsigned = signed + 8.
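A minimal sketch of that packing convention in PyTorch (illustrative only, not the converter's actual code; `pack_int4`/`unpack_int4` are hypothetical helpers):

```python
import torch

def pack_int4(signed: torch.Tensor) -> torch.Tensor:
    # Signed nibbles in [-8, 7] -> unsigned with zero-point 8;
    # even-index element in the low nibble, odd-index in the high nibble.
    u = (signed + 8).to(torch.uint8)
    return u[..., 0::2] | (u[..., 1::2] << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    even = (packed & 0x0F).to(torch.int8) - 8
    odd = (packed >> 4).to(torch.int8) - 8
    return torch.stack((even, odd), dim=-1).flatten(-2)

w = torch.randint(-8, 8, (2, 64), dtype=torch.int8)   # toy INT4 weights
assert torch.equal(unpack_int4(pack_int4(w)), w)

# Dequant with group=32 BF16 scales: one scale per 32 consecutive elements.
scales = torch.rand(2, 2, dtype=torch.bfloat16)
deq = w.float() * scales.float().repeat_interleave(32, dim=-1)
```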

config.json carries quantization_config.quant_method = "dsv4_int", which the SGLang fork registers as a compressed-tensors-style config covering both quantization groups (the INT4 expert weights and the INT8 attention/shared-expert weights).
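The relevant fragment of config.json looks like this (only quant_method is fixed by the fork; any per-group metadata it attaches is omitted here):

```json
{
  "quantization_config": {
    "quant_method": "dsv4_int"
  }
}
```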

Why BF16 scales over FP16: BF16's 8-bit exponent matches FP32 / e8m0 exactly. For DeepSeek's typical weight-scale range (e8m0 in [2⁻¹²..2⁴]) BF16 vs FP16 SNR is within 0.1 dB; BF16 only matters for scales below ~2⁻¹⁴ where FP16 underflows.
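A standalone check of that dtype behavior (e8m0 scales are exact powers of two, so only the exponent range matters):

```python
import torch

# e8m0 scales are powers of two; round-trip each through FP16 and BF16.
for exp in (4, -12, -14, -24, -30):
    s = 2.0 ** exp
    fp16 = torch.tensor(s, dtype=torch.float16).item()
    bf16 = torch.tensor(s, dtype=torch.bfloat16).item()
    print(f"2^{exp:>4}: fp16={fp16:g}  bf16={bf16:g}")

# FP16 is exact down to its minimum normal 2^-14, stays exact as a
# subnormal power of two down to 2^-24, then flushes to zero;
# BF16 represents any e8m0 exponent down to 2^-126 exactly.
```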

Usage (SGLang fork on Ampere)

The fork's deepseek_v4_ampere branch is required — upstream SGLang doesn't recognise quant_method=dsv4_int and doesn't have the sm_86 plumbing patches.

git clone --branch deepseek_v4_ampere https://github.com/AppMana/forks-sglang.git
cd forks-sglang && pip install -e .

python -m sglang.launch_server \
  --model-path appmana/deepseek-v4-int8-int4-sglang \
  --pp-size 12 \
  --tp-size 1 \
  --mem-fraction-static 0.85 \
  --context-length 8192 \
  --max-running-requests 4 \
  --swa-full-tokens-ratio 0.5 \
  --nsa-prefill-backend tilelang \
  --nsa-decode-backend tilelang \
  --trust-remote-code

Required environment overrides for sm_86:

export SGLANG_HACK_FLASHMLA_BACKEND=torch        # Hopper-only kernel -> torch fallback
export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1       # deep_gemm metadata -> torch
export SGLANG_OPT_FP8_WO_A_GEMM=0                # wo_a is INT8, not FP8
export SGLANG_OPT_FUSE_WQA_WKV=0                 # individual loads, no FP8 fusion
export SGLANG_APPLY_CONFIG_BACKUP=none           # use real config.json
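Once the server is up it speaks the OpenAI-compatible API; a minimal smoke test (assuming the default port 30000; pass --port to change it):

```python
import requests

# Chat-completions request against the local SGLang server.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "appmana/deepseek-v4-int8-int4-sglang",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```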

Reproduction

forks-sglang/tools/ampere/dsv4_requant_shard.py is the converter:

hf download deepseek-ai/DeepSeek-V4-Flash --local-dir <SRC>
python tools/ampere/dsv4_requant_shard.py \
  --src <SRC> --dst <DST> --device cuda:0

~157 GB output, ~50s per shard on an RTX A5000 — full conversion in ~35 minutes.

Limitations

  • Phase A torch sparse-MLA fallback is currently used for the V4 DSA indexer on sm_86 (FP32 accumulation, slow but correct). Phase B (TileLang sparse-MLA kernel for sm_86) is in progress — when it lands, single-token latency will improve substantially.
  • The conversion is lossy: ~21 dB SNR for FP4→INT4 (limited by INT4's 16 levels, not by the BF16 scales) and ~25 dB SNR for FP8→INT8 under realistic weight-outlier patterns; see the sketch below. This is a 1:1 lossy translation from the highest-precision public artifact (the native FP4+FP8 release; DeepSeek did not release a BF16/FP32 V4-Flash).
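The INT4 figure can be reproduced to first order with a synthetic check (Gaussian weights as a stand-in for the real experts; group size 32 as in the table above):

```python
import torch

def snr_db(ref: torch.Tensor, approx: torch.Tensor) -> float:
    """Signal-to-noise ratio of an approximation vs its reference, in dB."""
    noise = ref - approx
    return 10.0 * torch.log10(ref.pow(2).sum() / noise.pow(2).sum()).item()

# Symmetric per-group INT4 quantization, group size 32.
w = torch.randn(4096, 4096)
g = w.view(-1, 32)
scale = g.abs().amax(dim=1, keepdim=True) / 7      # map each group's max to +/-7
q = (g / scale).round().clamp(-8, 7)
deq = (q * scale).view_as(w)
print(f"INT4 group=32 SNR: {snr_db(w, deq):.1f} dB")  # ~21 dB on Gaussian weights
```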

License

MIT, matching the base model.
