Commit 3f41552 · committed by Mohaaxa · verified · 1 parent: a30c55f

NOVA pipeline: W4A16 | generic | run=qwen25vl3b_w4a16_generic

README.md CHANGED
@@ -1,68 +1,67 @@
- ---
- base_model: Qwen/Qwen2.5-VL-3B-Instruct
- tags:
- - quantized
- - w4a16
- - robotics
- - nova-robot
- pipeline_tag: image-text-to-text
- language:
- - en
- ---
+ ---
+ base_model: Qwen/Qwen2.5-VL-3B-Instruct
+ tags:
+ - quantized
+ - w4a16
+ - robotics
+ - nova-robot
+ pipeline_tag: image-text-to-text
+ language:
+ - en
+ ---
 
- # Qwen2.5-VL-3B-Instruct-W4A16-generic
+ # Qwen2.5-VL-3B-Instruct-W4A16-generic
 
- Quantized with the NOVA quantization pipeline on 2026-04-22.
- Base model: [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
+ Quantized with the NOVA quantization pipeline on 2026-04-22.
+ Base model: [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
 
- ## Quantization details
+ ## Quantization details
 
- | Parameter | Value |
- |---|---|
- | Method | `W4A16` |
- | Group size | 128 |
- | Calibration | `generic` |
- | Ignored modules | `re:.*lm_head, re:.*visual.*` |
- | Tool | `llm-compressor >= 0.5.1` |
+ | Parameter | Value |
+ |---|---|
+ | Method | `W4A16` |
+ | Group size | 128 |
+ | Calibration | `generic` |
+ | Ignored modules | `re:.*lm_head, re:.*visual.*` |
+ | Tool | `llm-compressor >= 0.4.2` |
 
- ## Benchmark results
+ ## Benchmark results
 
- | Metric | Value |
- |---|---|
- | Perplexity (wikitext-2, 20 samples) | 20.643 |
- | OCR sanity check | ✅ PASS |
- | Tokens / second | 4.7 |
- | TTFT (exact, prefill only) | 1.2 ms |
- | TPOT (exact, per output token) | 215.0 ms |
- | Inference VRAM | 9.57 GB |
- | Disk size | 4.03 GB |
+ | Metric | Value |
+ |---|---|
+ | Perplexity (wikitext-2, 20 samples) | 20.864 |
+ | OCR sanity check | ✅ PASS |
+ | Tokens / second | 1.8 |
+ | TTFT (exact, prefill only) | 801.9 ms |
+ | TPOT (exact, per output token) | 566.3 ms |
+ | Inference VRAM | 16.63 GB |
+ | Disk size | 8.21 GB |
 
- > TTFT and TPOT measured with `BaseStreamer` injection for exact
- > prefill/decode separation. Values are not conflated throughput averages.
+ > TTFT and TPOT measured with `BaseStreamer` injection (prompt-skip corrected).
 
- ## Registry notes
+ ## Registry notes
 
- - Pin transformers==4.49.0 if using AutoAWQ (archived May 2025).
+ - Pin transformers==4.49.0 if using AutoAWQ (archived May 2025).
  - Use llm-compressor>=0.5.1 for new quantization runs.
  - Projector (model.visual.merger) kept at FP32 — matched by visual.* regex.
  - OCR and bbox grounding regress 5x faster than MMMU under aggressive quant.
  - Keep merger at FP32, not BF16, for best bbox coordinate precision.
 
- ## Usage
+ ## Usage
 
- ```python
- from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
- import torch
+ ```python
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ import torch
 
- model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
- "Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic",
- torch_dtype=torch.bfloat16,
- device_map="auto",
- )
- processor = AutoProcessor.from_pretrained("Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic")
- ```
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+ "Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic",
+ torch_dtype=torch.bfloat16,
+ device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic")
+ ```
 
- ## Citation
+ ## Citation
 
- If you use this model in research, please cite the NOVA project.
- Pipeline source: `Mohaaxa/nova-quant-pipeline`
+ If you use this model in research, please cite the NOVA project.
+ Pipeline source: `Mohaaxa/nova-quant-pipeline`
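
Reviewer note: the benchmark rows changed in README.md can be cross-checked against each other. The sketch below recomputes decode throughput as 1000 / TPOT and the perplexity increase relative to the `baseline_ppl` value recorded in this commit's `quant_meta.json`. The helper names `tokens_per_second` and `relative_ppl_increase` are hypothetical and not part of the NOVA pipeline; the numbers are copied from the diffs in this commit.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Decode throughput implied by a time-per-output-token given in milliseconds."""
    return 1000.0 / tpot_ms


def relative_ppl_increase(ppl: float, baseline_ppl: float) -> float:
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (ppl - baseline_ppl) / baseline_ppl


# Values copied from the README.md and quant_meta.json diffs in this commit.
BASELINE_PPL = 19.096206092078926
runs = {
    "before": {"tpot_ms": 215.0, "ppl": 20.64317496976273},
    "after": {"tpot_ms": 566.3, "ppl": 20.86361947953847},
}

for name, run in runs.items():
    tps = tokens_per_second(run["tpot_ms"])
    rel = relative_ppl_increase(run["ppl"], BASELINE_PPL)
    print(f"{name}: {tps:.1f} tok/s, perplexity +{rel:.1%} vs baseline")
```

In both versions of the table, the reported tokens-per-second figure (4.7 and 1.8) matches 1000 / TPOT after rounding, so the two rows are at least internally consistent.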
config.json CHANGED
@@ -39,7 +39,7 @@
  }
  }
  },
- "format": "pack-quantized",
+ "format": "dense",
  "global_compression_ratio": null,
  "ignore": [
  "visual.blocks.0.attn.qkv",
@@ -208,7 +208,7 @@
  ],
  "kv_cache_scheme": null,
  "quant_method": "compressed-tensors",
- "quantization_status": "compressed"
+ "quantization_status": "frozen"
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bacd32c4c662c186301b4f911d7622cc930f67642bff0503a0eb159fe1b7a1ab
+ size 4987765320
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7c7670baa573e15dce9ed4c493506f9937a1707d564e5575f5fcfcc571f0b26a
+ size 3208992064
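
Reviewer note: the two Git LFS pointers above give the shard sizes in bytes, which can be summed and compared with the 8.21 GB disk size now reported in README.md. The sketch below parses the pointer text and adds up the `size` fields; `parse_lfs_pointer` is a hypothetical helper written for this note, and the comparison assumes decimal gigabytes (1 GB = 10^9 bytes), which the README does not state.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file ("key value" per line) into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is the object size in bytes
    return fields


# Pointer contents copied from the two safetensors files added in this commit.
POINTERS = [
    """version https://git-lfs.github.com/spec/v1
oid sha256:bacd32c4c662c186301b4f911d7622cc930f67642bff0503a0eb159fe1b7a1ab
size 4987765320""",
    """version https://git-lfs.github.com/spec/v1
oid sha256:7c7670baa573e15dce9ed4c493506f9937a1707d564e5575f5fcfcc571f0b26a
size 3208992064""",
]

total_bytes = sum(parse_lfs_pointer(p)["size"] for p in POINTERS)
print(f"{total_bytes} bytes = {total_bytes / 1e9:.2f} GB")
```

The shards total about 8.20 GB; the small gap to the README figure is presumably the other files in the repository. The roughly twofold increase over the previous 4.03 GB appears consistent with config.json switching `format` from `pack-quantized` to `dense`, though the diff alone does not confirm that this is the cause.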
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
quant_meta.json CHANGED
@@ -8,7 +8,7 @@
  ],
  "calibration": "generic",
  "group_size": 128,
- "ppl": 20.64317496976273,
+ "ppl": 20.86361947953847,
  "baseline_ppl": 19.096206092078926,
  "sanity_passed": true,
  "sanity_output": "QA_TEST_8472"