Qwen3-4B-nvfp4-Compressible
This is a pure NVFP4 checkpoint derived from Qwen/Qwen3-4B with a scale-inflation re-quantization procedure designed to make the packed NVFP4 weights more compressible while keeping the model format unchanged.
The checkpoint stays in standard compressed-tensors NVFP4 format:
- `weight_packed`
- `weight_scale`
- `weight_global_scale`
No mixed precision was introduced. The output is still a normal NVFP4 checkpoint.
Summary
- This model was produced from the original BF16 weights of Qwen/Qwen3-4B, not by simply editing existing NVFP4 weights in place.
- During generation, `llmat/Qwen3-4B-NVFP4` served as the NVFP4 template and initial scale reference; the scales were then inflated by `alpha=2.0` and the weights re-quantized back to standard NVFP4.
- The goal is to make the byte distribution of `weight_packed` less uniform, and therefore easier to compress downstream, while preserving accuracy as much as possible.
How This Checkpoint Was Obtained
This model was produced as follows:
- Use the public BF16 source model `Qwen/Qwen3-4B` as the weight source.
- Use `llmat/Qwen3-4B-NVFP4` only as a structural/template baseline for:
  - the compressed-tensors NVFP4 layout
  - initial per-block scales
  - tokenizer and config files
- Inflate the effective per-block scales by a constant `alpha = 2.0`.
- Re-quantize the original BF16 weights back into standard NVFP4 packed codes.
In other words, this is not merely a post-hoc edit of already-dequantized NVFP4 weights. The final checkpoint was regenerated from the original higher-precision weights while preserving the NVFP4 deployment format.
Method Summary
For each quantized linear weight tensor:
- Decode the baseline effective block scales from `weight_scale` / `weight_global_scale`.
- Multiply all block scales by `alpha = 2.0`.
- Quantize the BF16 source weights to NVFP4 using those inflated scales.
- Write the new `weight_packed`, `weight_scale`, and `weight_global_scale` tensors back into the checkpoint.
The intuition is that larger scales push more values toward low-magnitude FP4 codes such as 0, ±0.5, and ±1, which makes the packed byte stream less uniform and therefore easier to compress downstream.
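This effect can be illustrated with a minimal NumPy sketch. It is a simplified model, not the repository's implementation: it uses the E2M1 (FP4) magnitude set directly, treats the whole tensor as one block, and measures the Shannon entropy of the resulting signed code distribution.

```python
import numpy as np

# Signed magnitudes representable by E2M1 (FP4), as used by NVFP4.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_code_ids(block, scale):
    """Map each value to its nearest FP4 level index, keeping the sign."""
    scaled = np.abs(block) / scale
    idx = np.argmin(np.abs(scaled[:, None] - FP4_LEVELS[None, :]), axis=1)
    return np.where(block < 0, -idx, idx)  # signed code ids; level 0 collapses

def code_entropy_bits(codes):
    """Shannon entropy (bits/symbol) of the code distribution."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
w = rng.normal(size=4096)           # stand-in for one weight tensor
base_scale = np.abs(w).max() / 6.0  # baseline: largest weight maps to level 6
alpha = 2.0                         # the inflation factor used for this checkpoint

h_base = code_entropy_bits(fp4_code_ids(w, base_scale))
h_inflated = code_entropy_bits(fp4_code_ids(w, alpha * base_scale))
# Inflating the scale concentrates codes on 0 / +-0.5 / +-1, lowering entropy.
```

With the inflated scale, `h_inflated` comes out lower than `h_base`: fewer distinct codes carry most of the probability mass, which is exactly the skew an entropy coder exploits.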
Reference Inputs
- BF16 source model: `Qwen/Qwen3-4B`
- NVFP4 template baseline: `llmat/Qwen3-4B-NVFP4`
Experimental Summary
The numbers below come from the accompanying experiment scripts and a fixed small-sample MMLU check.
| Variant | Compression Proxy | Weighted MSE | MMLU Sample |
|---|---|---|---|
| Baseline `llmat/Qwen3-4B-NVFP4` | 0.96% | 4.96e-6 vs BF16 | 29 / 40 = 72.5% |
| This checkpoint (alpha=2.0 from BF16) | 12.07% | 7.46e-6 vs BF16 | 30 / 40 = 75.0% |
Notes:
- The compression metric above is an entropy-based proxy on the packed NVFP4 byte stream, not the `BF16 -> NVFP4` compression ratio.
- The MMLU result is from a fixed 8-subject, 40-question sample, not a full MMLU run.
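One plausible form for such an entropy-based proxy (this is an illustrative guess, not the exact metric computed by the repository's scripts) is 1 − H/8, where H is the Shannon entropy of the packed byte histogram in bits per byte:

```python
import numpy as np

def entropy_compression_proxy(packed: bytes) -> float:
    """Fraction of size an ideal entropy coder could shave off a byte stream.

    Hypothetical definition: 1 - H / 8, where H is the Shannon entropy of
    the byte histogram in bits per byte. 0 means incompressible (uniformly
    distributed bytes); 1 means maximally redundant (one repeated byte).
    """
    counts = np.bincount(np.frombuffer(packed, dtype=np.uint8), minlength=256)
    p = counts[counts > 0] / counts.sum()
    return 1.0 - float(-(p * np.log2(p)).sum()) / 8.0
```

A uniform byte stream scores about 0 and a constant stream scores 1; a packed NVFP4 stream dominated by a few low-magnitude nibble pairs lands in between, which is the skew the scale inflation is designed to increase.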
Files
- `model.safetensors`: quantized model weights
- `nvfp4_scale_inflation_from_full_precision_export.json`: per-layer export summary and aggregate metrics
- tokenizer/config files copied from the template checkpoint
Reproduce
The checkpoint was generated with code from:
https://github.com/DrXuQian/Model-Optimizer
Relevant scripts:
- `experimental/nvfp4_scale_inflation/export_from_full_precision.py`
- `experimental/nvfp4_scale_inflation/scale_inflation.py`
- `experimental/nvfp4_scale_inflation/eval_mmlu_batched.py`
Example export command:
```bash
python -m experimental.nvfp4_scale_inflation.export_from_full_precision \
  --full-precision-model-dir Qwen3-4B \
  --template-nvfp4-dir Qwen3-4B-NVFP4 \
  --output-dir Qwen3-4B-NVFP4-frombf16-alpha2 \
  --alpha 2.0 \
  --optimize-max-layers 0 \
  --device cpu
```
Example evaluation command:
```bash
python -m experimental.nvfp4_scale_inflation.eval_mmlu_batched \
  --model-path Qwen3-4B-NVFP4-frombf16-alpha2 \
  --output-json outputs/mmlu_batched_sample8_frombf16_alpha2.json \
  --batch-size 8 \
  --limit-per-subject 5 \
  --subjects abstract_algebra,college_computer_science,clinical_knowledge,miscellaneous,econometrics,sociology,philosophy,high_school_world_history
```
Loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DrQianXu/Qwen3-4B-nvfp4-Compressible"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    dtype="auto",
)
```
Caveats
- This is an experimental checkpoint.
- The main goal was to improve compressibility of the NVFP4 packed weights while keeping the checkpoint format unchanged.
- Full benchmark coverage has not yet been completed.