Qwen3-4B-nvfp4-Compressible

This is a pure NVFP4 checkpoint derived from Qwen/Qwen3-4B with a scale-inflation re-quantization procedure designed to make the packed NVFP4 weights more compressible while keeping the model format unchanged.

The checkpoint stays in standard compressed-tensors NVFP4 format:

  • weight_packed
  • weight_scale
  • weight_global_scale

No mixed precision was introduced. The output is still a normal NVFP4 checkpoint.

Additional Notes

  • This model is derived from the original BF16 weights of Qwen/Qwen3-4B; it is not a simple in-place edit of existing NVFP4 weights.
  • During generation, llmat/Qwen3-4B-NVFP4 served as the NVFP4 template and initial scale reference; the scales were then inflated by alpha = 2.0 and the weights re-quantized back to standard NVFP4.
  • The goal is to make the byte distribution of weight_packed less uniform, and therefore easier for downstream compressors to exploit, while preserving accuracy as much as possible.

How This Checkpoint Was Obtained

This model was produced in four steps:

  1. Use the public BF16 source model Qwen/Qwen3-4B as the weight source.
  2. Use llmat/Qwen3-4B-NVFP4 only as a structural/template baseline for:
    • the compressed-tensors NVFP4 layout
    • initial per-block scales
    • tokenizer and config files
  3. Inflate the effective per-block scales by a constant alpha = 2.0.
  4. Re-quantize the original BF16 weights back into standard NVFP4 packed codes.

In other words, this is not merely a post-hoc edit of dequantized NVFP4 weights. The final checkpoint was regenerated from the original higher-precision weights while preserving the NVFP4 deployment format.

Method Summary

For each quantized linear weight tensor:

  1. Decode the baseline effective block scales from weight_scale / weight_global_scale.
  2. Multiply all block scales by alpha = 2.0.
  3. Quantize the BF16 source weights to NVFP4 using those inflated scales.
  4. Write the new weight_packed, weight_scale, and weight_global_scale tensors back into the checkpoint.

The intuition is that larger scales push more values toward low-magnitude FP4 codes such as 0, ±0.5, and ±1, making the packed byte stream less uniform and therefore easier to compress downstream.
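The inflate-and-requantize steps can be sketched in NumPy. This is an illustrative toy, not the repository's actual implementation: the real code works on packed tensors and compressed-tensors metadata, and the FP4 rounding details may differ. It quantizes one 16-element block against the signed FP4 (E2M1) code set, first with the baseline scale and then with alpha = 2.0:

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes used by NVFP4 elements.
FP4_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_CODES = np.concatenate([FP4_MAGS, -FP4_MAGS])

def quantize_block(w, alpha=1.0):
    """Quantize one 16-element block to the nearest FP4 values,
    after inflating the per-block scale by `alpha`."""
    scale = np.abs(w).max() / 6.0        # baseline scale: block max maps to FP4 max (6.0)
    scale = max(scale * alpha, 1e-12)    # inflate; guard against all-zero blocks
    normalized = w / scale
    # Nearest-value rounding over the signed FP4 code set.
    idx = np.abs(normalized[:, None] - FP4_CODES[None, :]).argmin(axis=1)
    return FP4_CODES[idx], float(scale)

rng = np.random.default_rng(0)
block = rng.normal(size=16).astype(np.float32)

q_base, s_base = quantize_block(block, alpha=1.0)
q_infl, s_infl = quantize_block(block, alpha=2.0)

# With alpha = 2.0, normalized values reach at most 6/2 = 3, so only the
# low-magnitude half of the code set is used and codes repeat more often.
print(np.abs(q_infl).max() <= 3.0)  # True
```

The dequantized weight for each element is simply code × scale, which is why inflating the scale trades a small amount of reconstruction error (the higher weighted MSE in the table below) for a more skewed code histogram.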

Reference Inputs

  • BF16 source model: Qwen/Qwen3-4B
  • NVFP4 template baseline: llmat/Qwen3-4B-NVFP4

Experimental Summary

The numbers below come from the accompanying experiment scripts and a fixed small-sample MMLU check.

  Variant                                   Compression Proxy   Weighted MSE       MMLU Sample
  Baseline: llmat/Qwen3-4B-NVFP4            0.96%               4.96e-6 vs BF16    29 / 40 = 72.5%
  This checkpoint (alpha=2.0, from BF16)    12.07%              7.46e-6 vs BF16    30 / 40 = 75.0%

Notes:

  • The compression metric above is an entropy-based proxy on the packed NVFP4 byte stream, not the BF16 -> NVFP4 compression ratio.
  • The MMLU result is from a fixed 8-subject, 40-question sample, not a full MMLU run.
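An entropy-based proxy of the kind described above can be computed directly on the uint8 byte stream of weight_packed. The sketch below is one plausible definition (the scripts' exact formula may differ): the fraction of the 8 bits per byte that an ideal entropy coder could remove.

```python
import numpy as np

def entropy_proxy(packed: np.ndarray) -> float:
    """Compressibility proxy on a packed uint8 stream: 1 - H/8, where H is
    the empirical byte entropy in bits. 0.0 means incompressible bytes."""
    counts = np.bincount(packed.ravel(), minlength=256)
    p = counts[counts > 0] / counts.sum()
    entropy_bits = float(-(p * np.log2(p)).sum())
    return 1.0 - entropy_bits / 8.0

# A uniform byte stream scores ~0%; a highly skewed one scores much higher.
uniform = np.tile(np.arange(256, dtype=np.uint8), 64)
skewed = np.zeros(16384, dtype=np.uint8)
skewed[:256] = np.arange(256, dtype=np.uint8)
print(round(entropy_proxy(uniform) * 100, 2))  # 0.0
print(entropy_proxy(skewed) > entropy_proxy(uniform))  # True
```

Under a proxy like this, scale inflation helps because repeated low-magnitude FP4 codes concentrate the byte histogram on fewer values.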

Files

  • model.safetensors: quantized model weights
  • nvfp4_scale_inflation_from_full_precision_export.json: per-layer export summary and aggregate metrics
  • tokenizer/config files copied from the template checkpoint

Reproduce

The checkpoint was generated with code from:

  • https://github.com/DrXuQian/Model-Optimizer

Relevant scripts:

  • experimental/nvfp4_scale_inflation/export_from_full_precision.py
  • experimental/nvfp4_scale_inflation/scale_inflation.py
  • experimental/nvfp4_scale_inflation/eval_mmlu_batched.py

Example export command:

python -m experimental.nvfp4_scale_inflation.export_from_full_precision \
  --full-precision-model-dir Qwen3-4B \
  --template-nvfp4-dir Qwen3-4B-NVFP4 \
  --output-dir Qwen3-4B-NVFP4-frombf16-alpha2 \
  --alpha 2.0 \
  --optimize-max-layers 0 \
  --device cpu

Example evaluation command:

python -m experimental.nvfp4_scale_inflation.eval_mmlu_batched \
  --model-path Qwen3-4B-NVFP4-frombf16-alpha2 \
  --output-json outputs/mmlu_batched_sample8_frombf16_alpha2.json \
  --batch-size 8 \
  --limit-per-subject 5 \
  --subjects abstract_algebra,college_computer_science,clinical_knowledge,miscellaneous,econometrics,sociology,philosophy,high_school_world_history

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DrQianXu/Qwen3-4B-nvfp4-Compressible"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    dtype="auto",
)

Caveats

  • This is an experimental checkpoint.
  • The main goal was to improve compressibility of the NVFP4 packed weights while keeping the checkpoint format unchanged.
  • Full benchmark coverage has not yet been completed.