Reza2kn committed on
Commit f6f9235 · verified · 1 Parent(s): 1566056

Update quality table with mixed8 (INT8 enc + INT4 dec = 91.9%) as recommended config

Files changed (1)
  1. README.md +18 -13
README.md CHANGED
@@ -73,19 +73,24 @@ Conv2d front-end at fp16.
  ## Quality

  Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- example clips (real-world noisy conditions), word-level agreement (1 − WER):
-
- | Variant | Average agreement | 100% samples |
- | --- | ---: | ---: |
- | PT bf16 (original) | **95.1%** | 6 / 8 |
- | ONNX fp16 (this export) | 87.8% | 6 / 8 |
- | **ONNX INT4 (this repo)** | **87.8%** | 6 / 8 |
-
- The INT4 quantization is **lossless within the export envelope** the same
- score as the fp16 ONNX export. The gap to bf16 PT is the export itself
- (numerical drift through dynamo + Cache lowering on a few hard samples
- `echo` / `recording`), not the quantization. Most clean speech transcribes at
- 100%.
+ example clips (real-world noisy conditions, all English), word-level
+ agreement (1 − WER), prompt forced to `language English`:
+
+ | Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
+ | --- | --- | --- | ---: | ---: | ---: |
+ | PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
+ | ONNX fp16 (ref) | fp32 | fp16 | **96.7%** | 7 / 8 | 8.2 GB |
+ | **ONNX recommended** | **INT8** | **INT4** | **91.9%** | **6 / 8** | **2.3 GB** |
+ | ONNX small | INT4 | INT4 | 87.8% | 6 / 8 | 2.0 GB |
+
+ The recommended config (INT8 audio encoder + INT4 LLM decoder) is the
+ size/quality sweet spot for browser deployment. Forcing the language
+ (rather than auto-detecting) recovers most of the quantization drift.
+
+ **Note**: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing
+ language skips the model's audio-quality-router language detection,
+ which is where the PT model loses points on `echo` (heard "by the walk"
+ instead of "while walking") and `recording` (truncated).

  ## Inference (Python)
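
For readers unfamiliar with the metric in the table, word-level agreement (1 − WER) could be computed roughly as sketched below. This is only an illustration: the function names `word_error_rate` and `agreement` are hypothetical, and the text normalization (lowercasing and whitespace splitting) is an assumption. The commit does not say how the benchmark transcripts were normalized or scored.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """Word-level agreement, defined here as 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    # Made-up strings, not taken from the benchmark clips.
    print(agreement("a b c d", "a b c d"))
    print(agreement("a b c d", "a x c d"))
```

Under this definition a clip scores 100% only when the hypothesis matches the reference word for word, and a hypothesis with many insertions can push the score below zero because WER is not capped at 1.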