Qwen3-4B-nvfp4-Compressible

This is a pure NVFP4 checkpoint derived from Qwen/Qwen3-4B with a scale-inflation re-quantization procedure designed to make the packed NVFP4 weights more compressible while keeping the model format unchanged.

The checkpoint stays in standard compressed-tensors NVFP4 format:

  • weight_packed
  • weight_scale
  • weight_global_scale

No mixed precision was introduced. The output is still a normal NVFP4 checkpoint.

Additional Notes

  • This model is derived from the original BF16 weights of Qwen/Qwen3-4B; it is not a simple in-place edit of existing NVFP4 weights.
  • During generation, llmat/Qwen3-4B-NVFP4 served as the NVFP4 template and initial scale reference; the scales were then inflated by alpha = 2.0 and the weights re-quantized back to standard NVFP4.
  • The goal is to make the byte distribution of weight_packed less uniform, and therefore easier for downstream compressors to exploit, while preserving accuracy as much as possible.

How This Checkpoint Was Obtained

This model was produced in four steps:

  1. Use the public BF16 source model Qwen/Qwen3-4B as the weight source.
  2. Use llmat/Qwen3-4B-NVFP4 only as a structural/template baseline for:
    • the compressed-tensors NVFP4 layout
    • initial per-block scales
    • tokenizer and config files
  3. Inflate the effective per-block scales by a constant alpha = 2.0.
  4. Re-quantize the original BF16 weights back into standard NVFP4 packed codes.

In other words, this is not merely a post-hoc edit of dequantized NVFP4 weights. The final checkpoint was regenerated from the original higher-precision weights while preserving the NVFP4 deployment format.

Method Summary

For each quantized linear weight tensor:

  1. Decode the baseline effective block scales from weight_scale / weight_global_scale.
  2. Multiply all block scales by alpha = 2.0.
  3. Quantize the BF16 source weights to NVFP4 using those inflated scales.
  4. Write the new weight_packed, weight_scale, and weight_global_scale tensors back into the checkpoint.

The intuition is that larger scales push more values toward low-magnitude FP4 codes such as 0, ±0.5, and ±1, making the packed byte stream less uniform and therefore easier to compress downstream.
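The inflate-and-requantize steps can be sketched in NumPy. This is an illustrative toy, not the repository's actual implementation: the real code works on packed tensors and compressed-tensors metadata, and the FP4 rounding details may differ. It quantizes one 16-element block against the signed FP4 (E2M1) code set, first with the baseline scale and then with alpha = 2.0:

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes used by NVFP4 elements.
FP4_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_CODES = np.concatenate([FP4_MAGS, -FP4_MAGS])

def quantize_block(w, alpha=1.0):
    """Quantize one 16-element block to the nearest FP4 values,
    after inflating the per-block scale by `alpha`."""
    scale = np.abs(w).max() / 6.0        # baseline scale: block max maps to FP4 max (6.0)
    scale = max(scale * alpha, 1e-12)    # inflate; guard against all-zero blocks
    normalized = w / scale
    # Nearest-value rounding over the signed FP4 code set.
    idx = np.abs(normalized[:, None] - FP4_CODES[None, :]).argmin(axis=1)
    return FP4_CODES[idx], float(scale)

rng = np.random.default_rng(0)
block = rng.normal(size=16).astype(np.float32)

q_base, s_base = quantize_block(block, alpha=1.0)
q_infl, s_infl = quantize_block(block, alpha=2.0)

# With alpha = 2.0, normalized values reach at most 6/2 = 3, so only the
# low-magnitude half of the code set is used and codes repeat more often.
print(np.abs(q_infl).max() <= 3.0)  # True
```

The dequantized weight for each element is simply code × scale, which is why inflating the scale trades a small amount of reconstruction error (the higher weighted MSE in the table below) for a more skewed code histogram.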

Reference Inputs

  • BF16 source model: Qwen/Qwen3-4B
  • NVFP4 template baseline: llmat/Qwen3-4B-NVFP4

Experimental Summary

The numbers below come from the accompanying experiment scripts and a fixed small-sample MMLU check.

  Variant                                   Compression Proxy   Weighted MSE       MMLU Sample
  Baseline: llmat/Qwen3-4B-NVFP4            0.96%               4.96e-6 vs BF16    29 / 40 = 72.5%
  This checkpoint (alpha=2.0, from BF16)    12.07%              7.46e-6 vs BF16    30 / 40 = 75.0%

Notes:

  • The compression metric above is an entropy-based proxy on the packed NVFP4 byte stream, not the BF16 -> NVFP4 compression ratio.
  • The MMLU result is from a fixed 8-subject, 40-question sample, not a full MMLU run.
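An entropy-based proxy of the kind described above can be computed directly on the uint8 byte stream of weight_packed. The sketch below is one plausible definition (the scripts' exact formula may differ): the fraction of the 8 bits per byte that an ideal entropy coder could remove.

```python
import numpy as np

def entropy_proxy(packed: np.ndarray) -> float:
    """Compressibility proxy on a packed uint8 stream: 1 - H/8, where H is
    the empirical byte entropy in bits. 0.0 means incompressible bytes."""
    counts = np.bincount(packed.ravel(), minlength=256)
    p = counts[counts > 0] / counts.sum()
    entropy_bits = float(-(p * np.log2(p)).sum())
    return 1.0 - entropy_bits / 8.0

# A uniform byte stream scores ~0%; a highly skewed one scores much higher.
uniform = np.tile(np.arange(256, dtype=np.uint8), 64)
skewed = np.zeros(16384, dtype=np.uint8)
skewed[:256] = np.arange(256, dtype=np.uint8)
print(round(entropy_proxy(uniform) * 100, 2))  # 0.0
print(entropy_proxy(skewed) > entropy_proxy(uniform))  # True
```

Under a proxy like this, scale inflation helps because repeated low-magnitude FP4 codes concentrate the byte histogram on fewer values.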

Files

  • model.safetensors: quantized model weights
  • nvfp4_scale_inflation_from_full_precision_export.json: per-layer export summary and aggregate metrics
  • tokenizer/config files copied from the template checkpoint

Reproduce

The checkpoint was generated with code from:

  • https://github.com/DrXuQian/Model-Optimizer

Relevant scripts:

  • experimental/nvfp4_scale_inflation/export_from_full_precision.py
  • experimental/nvfp4_scale_inflation/scale_inflation.py
  • experimental/nvfp4_scale_inflation/eval_mmlu_batched.py

Example export command:

python -m experimental.nvfp4_scale_inflation.export_from_full_precision \
  --full-precision-model-dir Qwen3-4B \
  --template-nvfp4-dir Qwen3-4B-NVFP4 \
  --output-dir Qwen3-4B-NVFP4-frombf16-alpha2 \
  --alpha 2.0 \
  --optimize-max-layers 0 \
  --device cpu

Example evaluation command:

python -m experimental.nvfp4_scale_inflation.eval_mmlu_batched \
  --model-path Qwen3-4B-NVFP4-frombf16-alpha2 \
  --output-json outputs/mmlu_batched_sample8_frombf16_alpha2.json \
  --batch-size 8 \
  --limit-per-subject 5 \
  --subjects abstract_algebra,college_computer_science,clinical_knowledge,miscellaneous,econometrics,sociology,philosophy,high_school_world_history

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DrQianXu/Qwen3-4B-nvfp4-Compressible"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    dtype="auto",
)

Caveats

  • This is an experimental checkpoint.
  • The main goal was to improve compressibility of the NVFP4 packed weights while keeping the checkpoint format unchanged.
  • Full benchmark coverage has not yet been completed.