Reza2kn / mega-asr-onnx

Reza2kn committed about 19 hours ago

Commit f6f9235 (verified) · 1 parent: 1566056

Update quality table with mixed8 (INT8 enc + INT4 dec = 91.9%) as recommended config

Files changed (1)

README.md

@@ -73,19 +73,24 @@ Conv2d front-end at fp16.
 ## Quality
 Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
-example clips (real-world noisy conditions), word-level agreement (1 − WER):
-| Variant | Average agreement | 100% samples |
-| --- | ---: | ---: |
-| PT bf16 (original) | **95.1%** | 6 / 8 |
-| ONNX fp16 (this export) | 87.8% | 6 / 8 |
-| **ONNX INT4 (this repo)** | **87.8%** | 6 / 8 |
-The INT4 quantization is **lossless within the export envelope** — the same
-score as the fp16 ONNX export. The gap to bf16 PT is the export itself
-(numerical drift through dynamo + Cache lowering on a few hard samples
-`echo` / `recording`), not the quantization. Most clean speech transcribes at
-100%.
+example clips (real-world noisy conditions, all English), word-level
+agreement (1 − WER), prompt forced to `language English`:
+| Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
+| --- | --- | --- | ---: | ---: | ---: |
+| PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
+| ONNX fp16 (ref) | fp32 | fp16 | **96.7%** | 7 / 8 | 8.2 GB |
+| **ONNX recommended** | **INT8** | **INT4** | **91.9%** | **6 / 8** | **2.3 GB** |
+| ONNX small | INT4 | INT4 | 87.8% | 6 / 8 | 2.0 GB |
+The recommended config (INT8 audio encoder + INT4 LLM decoder) is the
+size/quality sweet spot for browser deployment. Forcing the language
+(rather than auto-detecting) recovers most of the quantization drift.
+**Note**: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing
+language skips the model's audio-quality-router language detection,
+which is where the PT model loses points on `echo` (heard "by the walk"
+instead of "while walking") and `recording` (truncated).
 ## Inference (Python)
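
For readers checking the numbers in the Quality table, the agreement metric (1 − WER) can be sketched as below. This is a minimal illustration, not the scoring script used for the benchmark: it assumes lowercase normalization and whitespace tokenization, and the names `word_errors` and `agreement` are hypothetical. The real evaluation may also strip punctuation or normalize text differently, which would change the scores.

```python
def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def agreement(reference, hypothesis):
    """Return 1 - WER, clamped at 0 (WER can exceed 1 with many insertions)."""
    # Assumption: lowercase + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 1.0 if not hyp else 0.0
    wer = word_errors(ref, hyp) / len(ref)
    return max(0.0, 1.0 - wer)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(agreement("a b c d", "a x c d"))
```

The "Avg agreement" column would then be the mean of `agreement` over the 8 clips, and a clip counts toward "100% samples" when its agreement is exactly 1.0.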