Qwen3-VL-2B SFT+GRPO v9 NVFP4

NVFP4 post-training quantized version of the #1 ranked Denali-AI model (qwen3-vl-2b-sft-grpo-v9, 89.5% weighted), quantized with NVIDIA ModelOpt using 512 calibration samples. It retains a 99.9% JSON parse rate and achieves a 74.6% weighted score, ranking #6 of 21 on eval_hard_3500.

Model Details

Property	Value
Architecture	Qwen3-VL
Parameters	2B
Base Model	Qwen/Qwen3-VL-2B
Fine-tuned From	Denali-AI/qwen3-vl-2b-sft-grpo-v9
Training	SFT (LoRA) + GRPO (full precision)
Quantization	NVFP4 PTQ (ModelOpt, 512 calibration samples)
Task	Garment Attribute Extraction (9-field JSON)
Output Format	Structured JSON
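
To illustrate the 9-field output contract, the sketch below checks whether a raw model response parses as JSON and contains all nine fields scored later in this card, then computes a parse rate over a batch. The field names are taken from the Per-Field Scores table; the sample response and the rule that every field must be present are assumptions for illustration, not the project's actual parser.

```python
import json
from typing import List, Optional

# Field names as listed in the Per-Field Scores table.
EXPECTED_FIELDS = (
    "type", "color", "pattern", "closure", "sleeve",
    "neckline", "defect", "brand", "size",
)


def parse_garment_json(raw: str) -> Optional[dict]:
    """Return the parsed object if it is a JSON object with all nine fields, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    if any(field not in obj for field in EXPECTED_FIELDS):
        return None
    return obj


def json_parse_rate(responses: List[str]) -> float:
    """Fraction of responses accepted by parse_garment_json."""
    if not responses:
        return 0.0
    accepted = sum(parse_garment_json(r) is not None for r in responses)
    return accepted / len(responses)


# Hypothetical response, for illustration only.
sample = json.dumps({field: "unknown" for field in EXPECTED_FIELDS})
print(json_parse_rate([sample, "not json"]))  # one of two responses is accepted
```

A stricter check could also reject unexpected extra keys or non-string values; whether the evaluation does so is not stated in this card.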

Key Highlights

Post-training quantization of the best-performing model — trained at full precision first, then quantized
99.9% JSON parse rate retained after quantization (vs 100% full precision)
74.6% weighted score — only 14.9pp drop from the 89.5% full-precision version
17.2 samples/s throughput, faster than the full-precision version (15.9/s) because the smaller weights reduce memory-bandwidth pressure
Vision encoder excluded from quantization to preserve visual understanding
~4x smaller memory footprint vs BF16 for the quantized layers (lm_head and the vision encoder remain unquantized)
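
The memory figure can be sanity-checked with a little arithmetic. The sketch below assumes the layout described in the Quantization Pipeline section (4-bit E2M1 values with one 8-bit FP8 scale per group of 16 weights) and ignores per-tensor scales and the layers left unquantized, so it is an upper bound on the savings for the quantized layers rather than a measured number.

```python
def nvfp4_bits_per_weight(group_size: int = 16, scale_bits: int = 8) -> float:
    """Effective storage per weight: 4-bit value plus its share of the block scale."""
    return 4 + scale_bits / group_size


bits = nvfp4_bits_per_weight()
ratio = 16 / bits  # BF16 stores 16 bits per weight
print(f"{bits} bits/weight, {ratio:.2f}x smaller than BF16")
# prints "4.5 bits/weight, 3.56x smaller than BF16"
```

Under these assumptions the quantized layers shrink by about 3.6x, which is in line with the rounded ~4x quoted above.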

Benchmark Results

Rank #6/21 on eval_hard_3500

Metric	NVFP4 PTQ	Full Precision	Delta
Weighted Score	74.6%	89.5%	-14.9pp
SBERT+NLI Combined	74.1%	78.5%	-4.4pp
JSON Parse Rate	99.9%	100%	-0.1pp
Throughput	17.2/s	15.9/s	+1.3/s
Inference Time	203s	220s	-17s
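
The throughput and inference-time rows are two views of the same measurement. As a rough check, the sketch below assumes that eval_hard_3500 contains 3,500 samples (inferred from the benchmark name, not stated in this card) and derives samples per second from the total inference time.

```python
NUM_SAMPLES = 3500  # assumption: inferred from the benchmark name "eval_hard_3500"


def throughput(inference_seconds: float, num_samples: int = NUM_SAMPLES) -> float:
    """Samples processed per second over the whole benchmark run."""
    return num_samples / inference_seconds


nvfp4 = throughput(203)
full_precision = throughput(220)
print(f"NVFP4: {nvfp4:.1f}/s, full precision: {full_precision:.1f}/s")
# prints "NVFP4: 17.2/s, full precision: 15.9/s"
```

Both values agree with the table, which supports the 3,500-sample assumption.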

Per-Field Scores

Field	SBERT	NLI	Levenshtein	Token F1	SBERT+NLI	Weight
type	78.5%	65.4%	70.6%	61.2%	68.3%	2.5x
color	84.8%	61.8%	71.0%	54.3%	76.1%	1.0x
pattern	70.2%	65.2%	67.8%	55.3%	61.9%	1.0x
closure	62.6%	56.9%	60.9%	49.1%	57.0%	1.0x
sleeve	85.3%	86.8%	84.4%	71.4%	86.1%	1.0x
neckline	77.1%	74.8%	75.2%	68.8%	71.1%	1.0x
defect	78.1%	78.3%	77.8%	77.4%	78.2%	2.0x
brand	79.6%	79.5%	80.2%	78.0%	79.1%	1.5x
size	89.6%	89.5%	89.5%	89.4%	89.4%	1.5x
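
As a worked example of the field weighting, the sketch below takes the SBERT+NLI column and the Weight column from the table above and computes their weighted mean. That this is the exact aggregation behind the reported weighted score is an assumption, although the result does match the reported 74.6%.

```python
# (SBERT+NLI score in percent, weight) per field, copied from the table above.
FIELD_SCORES = {
    "type": (68.3, 2.5),
    "color": (76.1, 1.0),
    "pattern": (61.9, 1.0),
    "closure": (57.0, 1.0),
    "sleeve": (86.1, 1.0),
    "neckline": (71.1, 1.0),
    "defect": (78.2, 2.0),
    "brand": (79.1, 1.5),
    "size": (89.4, 1.5),
}


def weighted_score(field_scores: dict) -> float:
    """Weighted mean of per-field scores: sum(score * weight) / sum(weight)."""
    total_weight = sum(weight for _, weight in field_scores.values())
    weighted_sum = sum(score * weight for score, weight in field_scores.values())
    return weighted_sum / total_weight


print(f"{weighted_score(FIELD_SCORES):.1f}%")  # prints "74.6%"
```

Because type and defect carry the largest weights, a change in either of those fields moves the aggregate more than an equal change in color or pattern.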

Visualizations

Figures: Radar Chart, Leaderboard, Multi-Metric Breakdown, Quality vs Throughput
Full Leaderboard

Rank	Model	Weighted	SBERT+NLI	JSON Parse	Throughput	Inference
1	qwen3-vl-8b-sft+grpo	80.9%	78.7%	100%	7.5/s	464s
2	qwen3-vl-2b-sft-grpo-v9	79.9%	78.5%	100%	15.9/s	220s
3	qwen3-vl-8b-instruct-base	78.1%	75.6%	100%	5.5/s	640s
4	qwen3-vl-8b-instruct-nvfp4	77.8%	75.0%	100%	8.2/s	424s
5	qwen35-2b-base	76.2%	73.0%	100%	6.6/s	534s
6	qwen3-vl-2b-sft-grpo-v9-nvfp4 >>>	74.6%	74.1%	100%	17.2/s	203s
7	qwen3-vl-2b-instruct-base	68.0%	66.7%	100%	15.1/s	231s
8	internvl3-2b-grpo-gtpo-full	67.5%	64.3%	100%	11.8/s	297s
9	internvl3-2b-grpo-gtpo-fp8	67.1%	63.8%	100%	14.3/s	244s
10	internvl3-2b-base	66.8%	63.7%	100%	11.8/s	297s
11	moondream2-base	63.8%	61.8%	100%	1.4/s	2416s
12	qwen35-2b-sft-grpo-gtpo-v8	60.7%	60.1%	100%	11.3/s	309s
13	qwen35-2b-sft-v7	58.6%	58.9%	100%	11.6/s	302s
14	qwen35-35b-a3b-gptq-int4	51.5%	48.7%	14%	1.6/s	2124s
15	qwen35-9b-nvfp4-v10	48.9%	46.0%	8%	1.7/s	2075s
16	qwen35-9b-sft-nvfp4-v11	48.3%	45.5%	8%	1.7/s	2023s
17	qwen35-2b-base-nvfp4-v10	45.9%	42.9%	0%	4.0/s	878s
18	qwen3.5-122b-a10b-nvfp4	45.9%	42.9%	0%	1.2/s	2893s
19	qwen35-2b-sft-nvfp4-v11	45.9%	42.9%	0%	4.0/s	876s
20	qwen35-2b-sft-grpo-gtpo-nvfp4	45.9%	42.9%	0%	3.9/s	907s
21	qwen3-vl-8b-sft-grpo	0.0%	0.0%	100%	0.0/s	462s

Comparative Analysis

vs qwen3-vl-2b-sft-grpo-v9 (full precision, #1): -14.9pp weighted, but 1.3/s faster throughput and ~4x smaller memory footprint. The accuracy drop is primarily in type (-10.2pp), color (-8.7pp), and pattern (-16.6pp) fields where fine-grained discrimination is most affected by quantization.
vs internvl3-2b-grpo-gtpo-full (#2 at 72.7%): +1.9pp weighted — PTQ NVFP4 of the best model still outperforms the #2 full-precision model.
vs LoRA-on-NVFP4 approach (qwen35-2b-sft-nvfp4, 42.9%): +31.7pp — PTQ (quantize after training) vastly outperforms training on already-quantized models.

Improvement Recommendations

FP8 quantization: Try FP8 (E4M3) instead of FP4 for less accuracy loss — may retain 85%+ weighted score
GPTQ/AWQ: Compare with weight-only quantization methods that preserve activation precision
Calibration tuning: More calibration samples (1024+) or task-specific calibration data may improve accuracy
Mixed precision: Keep critical layers (first/last transformer blocks) at FP8 while quantizing middle layers to FP4
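
For the mixed-precision idea, one possible starting point is to generate wildcard patterns for the first and last transformer blocks, in the same style as the `*lm_head*` exclusion used in the quantization config. The sketch below only builds and tests such patterns; it does not call ModelOpt. The `layers.{idx}` naming scheme, the layer count of 28, and the example module names are all hypothetical and should be checked against the actual model's module names.

```python
from fnmatch import fnmatchcase
from typing import List


def edge_layer_patterns(
    num_layers: int, n_edge: int = 1, template: str = "*layers.{idx}.*"
) -> List[str]:
    """Wildcard patterns for the first and last n_edge transformer blocks."""
    if n_edge < 0 or 2 * n_edge > num_layers:
        raise ValueError("n_edge must satisfy 0 <= 2 * n_edge <= num_layers")
    first = range(n_edge)
    last = range(num_layers - n_edge, num_layers)
    return [template.format(idx=i) for i in (*first, *last)]


def keeps_higher_precision(module_name: str, patterns: List[str]) -> bool:
    """True if the module falls in one of the edge blocks."""
    return any(fnmatchcase(module_name, p) for p in patterns)


# Illustrative values only: 28 layers, two blocks kept at each end.
patterns = edge_layer_patterns(num_layers=28, n_edge=2)
print(patterns)
# prints ['*layers.0.*', '*layers.1.*', '*layers.26.*', '*layers.27.*']
```

The trailing dot in each pattern matters: `*layers.1.*` matches block 1 but not blocks 10 through 19.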

Alternative Models

SmolVLM-2B: Compact VLM that may quantize better due to simpler architecture
Qwen3.5-VL-2B: Newer architecture with Gated DeltaNet — though our tests show NVFP4 PTQ breaks its JSON output
PaliGemma-2: Google's VLM with strong structured output capabilities

Quantization Pipeline

This model was created using NVIDIA ModelOpt post-training quantization:

Load full-precision fine-tuned model (SFT + GRPO trained)
Configure NVFP4 quantization with exclusions for lm_head and vision encoder
Calibrate on 512 training samples with max_seq_len=2048
Quantize weights to NVFP4 (E2M1 FP4 with FP8 block scaling, group_size=16)
Export HF-compatible checkpoint via export_hf_checkpoint()

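import copy

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

# `model`, `calibration_loop`, and `output_dir` are defined elsewhere (see the steps above).
# Deep-copy the preset so the edits below do not mutate ModelOpt's shared default config.
nvfp4_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
# Leave the output head and the vision encoder unquantized.
nvfp4_cfg["quant_cfg"]["*lm_head*"] = {"enable": False}
nvfp4_cfg["quant_cfg"]["*model.visual*"] = {"enable": False}

# Calibrate and quantize: the forward loop runs the calibration samples through the model.
model = mtq.quantize(model, nvfp4_cfg, forward_loop=calibration_loop)
# Write a Hugging Face-compatible checkpoint.
export_hf_checkpoint(model, dtype=torch.bfloat16, export_dir=output_dir)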
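
The snippet above passes a `calibration_loop` that is not shown in this card. A minimal sketch of what such a loop might look like follows: a callable that receives the model and runs forward passes over a capped number of batches. The helper name `make_calibration_loop`, the one-sample-per-batch layout, and the keyword-argument batch format are assumptions; a real loop would typically also move tensors to the model's device and run under `torch.no_grad()`.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Mapping


def make_calibration_loop(
    batches: Iterable[Mapping[str, Any]], num_samples: int = 512
) -> Callable[[Callable[..., Any]], None]:
    """Build a forward-loop callable that feeds at most num_samples batches to the model."""

    def calibration_loop(model: Callable[..., Any]) -> None:
        for batch in islice(batches, num_samples):
            # Outputs are discarded; the forward pass exists only so the
            # quantizer can observe activation statistics.
            model(**batch)

    return calibration_loop


# Usage with a stand-in "model" that just records what it was called with.
seen = []
loop = make_calibration_loop([{"x": i} for i in range(10)], num_samples=4)
loop(lambda **kwargs: seen.append(kwargs["x"]))
print(seen)  # prints [0, 1, 2, 3]
```

Capping the loop with `islice` keeps the calibration budget explicit (512 samples in this card) regardless of how large the underlying dataset is.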

Evaluation Methodology

Models are evaluated on the eval_hard_3500 benchmark using:

Metric	Description
SBERT Cosine	Semantic similarity via sentence-transformers (all-MiniLM-L6-v2)
NLI Score	Natural language inference entailment scoring
Levenshtein Ratio	Fuzzy string matching
Token F1	Token-level precision/recall
Weighted Score	Field-weighted aggregate (type=2.5x, defect=2.0x, brand/size=1.5x, all other fields=1.0x)
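
The two string-based metrics can be sketched in plain Python. The exact definitions used by the benchmark are not given in this card, so the versions below are assumptions: the Levenshtein ratio is taken as one minus the edit distance divided by the longer string's length, and Token F1 uses lowercase whitespace tokens with multiset overlap.

```python
from collections import Counter


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(levenshtein_distance("kitten", "sitting"))  # prints 3
print(round(token_f1("navy blue", "dark navy blue"), 2))  # prints 0.8
```

Other normalizations of the Levenshtein ratio exist (for example, dividing by the sum of both lengths), so absolute values may differ from the table even for identical predictions.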

Citation

@misc{denali-ai-qwen3-vl-2b-sft-grpo-v9-nvfp4,
  title={Qwen3-VL-2B SFT+GRPO v9 NVFP4},
  author={Denali AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9-nvfp4}
}

License

This model is released under the Apache 2.0 License.
