Qwen3-VL-2B SFT+GRPO v9 NVFP4

NVFP4 post-training quantized (PTQ) version of the #1 ranked Denali-AI model (qwen3-vl-2b-sft-grpo-v9, 89.5% weighted). Quantized with NVIDIA ModelOpt using 512 calibration samples. It retains a 99.9% JSON parse rate and achieves a 74.6% weighted score, ranked #2/13 overall.

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3-VL |
| Parameters | 2B |
| Base Model | Qwen/Qwen3-VL-2B |
| Fine-tuned From | Denali-AI/qwen3-vl-2b-sft-grpo-v9 |
| Training | SFT (LoRA) + GRPO (full precision) |
| Quantization | NVFP4 PTQ (ModelOpt, 512 calibration samples) |
| Task | Garment Attribute Extraction (9-field JSON) |
| Output Format | Structured JSON |
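
To make the "9-field JSON" output and the JSON parse rate metric concrete, here is a minimal sketch of how a single model response might be validated. The field names are taken from the Per-Field Scores table; the requirement that all nine keys be present, and the helper names `parse_garment_output` and `json_parse_rate`, are assumptions for illustration and not the actual evaluation harness.

```python
import json

# Field names as listed in the Per-Field Scores table.
EXPECTED_FIELDS = (
    "type", "color", "pattern", "closure", "sleeve",
    "neckline", "defect", "brand", "size",
)

def parse_garment_output(text):
    """Return the parsed dict if `text` is a JSON object containing every
    expected field; otherwise return None. (Hypothetical helper.)"""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in EXPECTED_FIELDS):
        return None
    return data

def json_parse_rate(outputs):
    """Fraction of responses that parse into a complete 9-field object."""
    if not outputs:
        return 0.0
    parsed = sum(parse_garment_output(o) is not None for o in outputs)
    return parsed / len(outputs)
```

Under this definition, a response that is valid JSON but omits a field would count as a parse failure; a looser harness might only require syntactic validity.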

Key Highlights

  • Post-training quantization of the best-performing model: trained at full precision first, then quantized
  • 99.9% JSON parse rate retained after quantization (vs 100% at full precision)
  • 74.6% weighted score, a 14.9pp drop from the 89.5% full-precision version
  • 17.2 samples/s throughput, faster than the full-precision version (15.9/s), likely because the 4-bit weights reduce memory traffic
  • Vision encoder excluded from quantization to preserve visual understanding
  • ~4x smaller memory footprint vs BF16 for the quantized layers (the vision encoder and lm_head stay unquantized)
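
As a rough check on the memory claim, the sketch below estimates the effective storage cost per quantized weight. It assumes the layout described in the quantization pipeline (4-bit E2M1 values with one FP8 scale per group of 16) and ignores any per-tensor scale and the unquantized layers, so it is an approximation and not a measured footprint.

```python
ELEMENT_BITS = 4   # one E2M1 FP4 value
SCALE_BITS = 8     # one FP8 scale shared by each block
GROUP_SIZE = 16    # weights per block (group_size=16)
BF16_BITS = 16     # baseline storage per weight

def effective_bits_per_weight(element_bits=ELEMENT_BITS,
                              scale_bits=SCALE_BITS,
                              group_size=GROUP_SIZE):
    """Bits per weight once the shared block scale is amortized."""
    return element_bits + scale_bits / group_size

bits = effective_bits_per_weight()
ratio = BF16_BITS / bits
print(f"{bits} bits/weight, {ratio:.2f}x smaller than BF16")
# prints: 4.5 bits/weight, 3.56x smaller than BF16
```

The block scales bring the reduction for quantized layers to about 3.6x, so "~4x" is best read as an upper bound; the whole-checkpoint ratio is lower still because some modules remain in BF16.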

Benchmark Results

Rank #6/21 on eval_hard_3500

| Metric | NVFP4 PTQ | Full Precision | Delta |
|---|---|---|---|
| Weighted Score | 74.6% | 89.5% | -14.9pp |
| SBERT+NLI Combined | 74.1% | 78.5% | -4.4pp |
| JSON Parse Rate | 99.9% | 100% | -0.1pp |
| Throughput | 17.2/s | 15.9/s | +1.3/s |
| Inference Time | 203s | 220s | -17s |
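
The throughput and inference-time rows can be cross-checked against each other. The sketch below assumes the benchmark contains 3,500 samples, which is inferred from the name eval_hard_3500 and not stated explicitly in this card.

```python
NUM_SAMPLES = 3500  # assumption: inferred from the benchmark name

def throughput(num_samples, seconds):
    """Average samples processed per second over the whole run."""
    return num_samples / seconds

for name, seconds in (("NVFP4 PTQ", 203), ("Full precision", 220)):
    print(f"{name}: {throughput(NUM_SAMPLES, seconds):.1f} samples/s")
# prints:
# NVFP4 PTQ: 17.2 samples/s
# Full precision: 15.9 samples/s
```

Both results match the reported throughput figures, which suggests the two rows come from the same end-to-end runs.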

Per-Field Scores

| Field | SBERT | NLI | Levenshtein | Token F1 | SBERT+NLI | Weight |
|---|---|---|---|---|---|---|
| type | 78.5% | 65.4% | 70.6% | 61.2% | 68.3% | 2.5x |
| color | 84.8% | 61.8% | 71.0% | 54.3% | 76.1% | 1.0x |
| pattern | 70.2% | 65.2% | 67.8% | 55.3% | 61.9% | 1.0x |
| closure | 62.6% | 56.9% | 60.9% | 49.1% | 57.0% | 1.0x |
| sleeve | 85.3% | 86.8% | 84.4% | 71.4% | 86.1% | 1.0x |
| neckline | 77.1% | 74.8% | 75.2% | 68.8% | 71.1% | 1.0x |
| defect | 78.1% | 78.3% | 77.8% | 77.4% | 78.2% | 2.0x |
| brand | 79.6% | 79.5% | 80.2% | 78.0% | 79.1% | 1.5x |
| size | 89.6% | 89.5% | 89.5% | 89.4% | 89.4% | 1.5x |

Visualizations

[Figures omitted: Radar Chart, Leaderboard, Multi-Metric Breakdown, Quality vs Throughput]

Full Leaderboard

| Rank | Model | Weighted | SBERT+NLI | JSON Parse | Throughput | Inference |
|---|---|---|---|---|---|---|
| 1 | qwen3-vl-8b-sft+grpo | 80.9% | 78.7% | 100% | 7.5/s | 464s |
| 2 | qwen3-vl-2b-sft-grpo-v9 | 79.9% | 78.5% | 100% | 15.9/s | 220s |
| 3 | qwen3-vl-8b-instruct-base | 78.1% | 75.6% | 100% | 5.5/s | 640s |
| 4 | qwen3-vl-8b-instruct-nvfp4 | 77.8% | 75.0% | 100% | 8.2/s | 424s |
| 5 | qwen35-2b-base | 76.2% | 73.0% | 100% | 6.6/s | 534s |
| 6 | qwen3-vl-2b-sft-grpo-v9-nvfp4 (this model) | 74.6% | 74.1% | 100% | 17.2/s | 203s |
| 7 | qwen3-vl-2b-instruct-base | 68.0% | 66.7% | 100% | 15.1/s | 231s |
| 8 | internvl3-2b-grpo-gtpo-full | 67.5% | 64.3% | 100% | 11.8/s | 297s |
| 9 | internvl3-2b-grpo-gtpo-fp8 | 67.1% | 63.8% | 100% | 14.3/s | 244s |
| 10 | internvl3-2b-base | 66.8% | 63.7% | 100% | 11.8/s | 297s |
| 11 | moondream2-base | 63.8% | 61.8% | 100% | 1.4/s | 2416s |
| 12 | qwen35-2b-sft-grpo-gtpo-v8 | 60.7% | 60.1% | 100% | 11.3/s | 309s |
| 13 | qwen35-2b-sft-v7 | 58.6% | 58.9% | 100% | 11.6/s | 302s |
| 14 | qwen35-35b-a3b-gptq-int4 | 51.5% | 48.7% | 14% | 1.6/s | 2124s |
| 15 | qwen35-9b-nvfp4-v10 | 48.9% | 46.0% | 8% | 1.7/s | 2075s |
| 16 | qwen35-9b-sft-nvfp4-v11 | 48.3% | 45.5% | 8% | 1.7/s | 2023s |
| 17 | qwen35-2b-base-nvfp4-v10 | 45.9% | 42.9% | 0% | 4.0/s | 878s |
| 18 | qwen3.5-122b-a10b-nvfp4 | 45.9% | 42.9% | 0% | 1.2/s | 2893s |
| 19 | qwen35-2b-sft-nvfp4-v11 | 45.9% | 42.9% | 0% | 4.0/s | 876s |
| 20 | qwen35-2b-sft-grpo-gtpo-nvfp4 | 45.9% | 42.9% | 0% | 3.9/s | 907s |
| 21 | qwen3-vl-8b-sft-grpo | 0.0% | 0.0% | 100% | 0.0/s | 462s |

Comparative Analysis

  • vs qwen3-vl-2b-sft-grpo-v9 (full precision, #1): -14.9pp weighted, but 1.3/s higher throughput and a ~4x smaller memory footprint. The accuracy drop is concentrated in the type (-10.2pp), color (-8.7pp), and pattern (-16.6pp) fields, where fine-grained discrimination is most affected by quantization.
  • vs internvl3-2b-grpo-gtpo-full (#2 at 72.7%): +1.9pp weighted. The NVFP4 PTQ of the best model still outperforms the #2 full-precision model.
  • vs LoRA-on-NVFP4 approach (qwen35-2b-sft-nvfp4, 42.9%): +31.7pp. PTQ (quantizing after training) vastly outperforms training on an already-quantized model.

Improvement Recommendations

  • FP8 quantization: Try FP8 (E4M3) instead of FP4 for less accuracy loss; this may retain an 85%+ weighted score
  • GPTQ/AWQ: Compare with weight-only quantization methods that preserve activation precision
  • Calibration tuning: More calibration samples (1024+) or task-specific calibration data may improve accuracy
  • Mixed precision: Keep critical layers (first/last transformer blocks) at FP8 while quantizing middle layers to FP4

Alternative Models

  • SmolVLM-2B: Compact VLM that may quantize better due to simpler architecture
  • Qwen3.5-VL-2B: Newer architecture with Gated DeltaNet, though our tests show NVFP4 PTQ breaks its JSON output
  • PaliGemma-2: Google's VLM with strong structured output capabilities

Quantization Pipeline

This model was created using NVIDIA ModelOpt post-training quantization:

  1. Load full-precision fine-tuned model (SFT + GRPO trained)
  2. Configure NVFP4 quantization with exclusions for lm_head and vision encoder
  3. Calibrate on 512 training samples with max_seq_len=2048
  4. Quantize weights to NVFP4 (E2M1 FP4 with FP8 block scaling, group_size=16)
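  5. Export HF-compatible checkpoint via export_hf_checkpoint()

```python
import copy

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

# `model`, `calibration_loop`, and `output_dir` are assumed to be defined earlier.
# Deep-copy so the nested "quant_cfg" edits do not mutate ModelOpt's shared default.
nvfp4_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
nvfp4_cfg["quant_cfg"]["*lm_head*"] = {"enable": False}       # keep the output head unquantized
nvfp4_cfg["quant_cfg"]["*model.visual*"] = {"enable": False}  # keep the vision encoder unquantized

# Calibrate and quantize, then export an HF-compatible checkpoint.
model = mtq.quantize(model, nvfp4_cfg, forward_loop=calibration_loop)
export_hf_checkpoint(model, dtype=torch.bfloat16, export_dir=output_dir)
```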
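
The snippet above passes a `calibration_loop` that this card does not define. A minimal sketch of what such a callable could look like is shown below, assuming the calibration data is an iterable of keyword-argument batches (for example, processor outputs already moved to the model's device). The factory name `make_calibration_loop` and the `max_batches` parameter are hypothetical; with a batch size of 1, 512 batches would correspond to the 512 calibration samples mentioned above.

```python
from itertools import islice

def make_calibration_loop(batches, max_batches=512):
    """Build a forward-loop callable for calibration.

    `batches` is any iterable of dicts that can be unpacked into the
    model's forward call. Only the first `max_batches` are used.
    """
    def calibration_loop(model):
        # Forward passes only: the quantizer observes activation ranges
        # as data flows through; outputs are discarded.
        for batch in islice(batches, max_batches):
            model(**batch)
    return calibration_loop
```

In real use the forward passes would typically run with gradients disabled, and sequences would be truncated to the max_seq_len=2048 noted in step 3 before being passed in.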

Evaluation Methodology

Models are evaluated on the eval_hard_3500 benchmark using:

| Metric | Description |
|---|---|
| SBERT Cosine | Semantic similarity via sentence-transformers (all-MiniLM-L6-v2) |
| NLI Score | Natural language inference entailment scoring |
| Levenshtein Ratio | Fuzzy string matching |
| Token F1 | Token-level precision/recall |
| Weighted Score | Field-weighted aggregate (type=2.5x, defect=2.0x, brand/size=1.5x) |
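
The card does not spell out the aggregation formula. The sketch below assumes the weighted score is the weight-normalized mean of the per-field SBERT+NLI values and that "SBERT+NLI Combined" is their unweighted mean. With the per-field numbers reported above, these assumptions reproduce the headline 74.6% and 74.1%, but the actual evaluation code may differ in detail.

```python
# Per-field SBERT+NLI scores (%) and weights from the Per-Field Scores table.
FIELD_SCORES = {
    "type": 68.3, "color": 76.1, "pattern": 61.9, "closure": 57.0,
    "sleeve": 86.1, "neckline": 71.1, "defect": 78.2, "brand": 79.1,
    "size": 89.4,
}
FIELD_WEIGHTS = {
    "type": 2.5, "color": 1.0, "pattern": 1.0, "closure": 1.0,
    "sleeve": 1.0, "neckline": 1.0, "defect": 2.0, "brand": 1.5,
    "size": 1.5,
}

def weighted_score(scores, weights):
    """Weight-normalized mean of per-field scores (assumed formula)."""
    total_weight = sum(weights[field] for field in scores)
    weighted_sum = sum(scores[field] * weights[field] for field in scores)
    return weighted_sum / total_weight

def unweighted_mean(scores):
    """Plain average over fields."""
    return sum(scores.values()) / len(scores)

print(round(weighted_score(FIELD_SCORES, FIELD_WEIGHTS), 1))  # 74.6
print(round(unweighted_mean(FIELD_SCORES), 1))                # 74.1
```

Because type and defect carry the largest weights, errors in those two fields move the headline score more than errors in any other field.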

Citation

@misc{denali-ai-qwen3-vl-2b-sft-grpo-v9-nvfp4,
  title={Qwen3-VL-2B SFT+GRPO v9 NVFP4},
  author={Denali AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9-nvfp4}
}

License

This model is released under the Apache 2.0 License.
