---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- 4bit
- 8bit
- mixed-precision
- lut
- palettize
- on-device
- apple-silicon
- ios
- macos
- qwen3
- qwen3-asr
- mega-asr
- anemll
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---

# Mega-ASR: CoreML mixed 8/4 (end-to-end ASR)

CoreML deployment of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR)
(Qwen3-ASR-1.7B base) with an **`input_embeds`-aware decoder**, so audio
embeddings can be scattered at `<|audio_pad|>` positions to do real ASR,
not just text generation.

Converted via [ANEMLL](https://github.com/Anemll/Anemll) with a custom
`convert_embeds_mixed.py` that:
1. Monkey-patches `QwenModel.forward` + `QwenForCausalLM.forward` to accept
   pre-embedded `hidden_states` (skipping the internal `embed_tokens`
   lookup) so audio scatter works at inference.
2. Enumerates the MIL program's const-weight ops by name pattern and applies
   **LUT-8 palettization to attention projections** (q/k/v/o_proj) and
   **LUT-4 to MLP projections** (gate/up/down_proj), mirroring the MLX
   `mixed8_4` recipe that closed the gap to GPTQ on the LLM portion.
3. Sets `compute_precision=FLOAT32`, because fp16 compute precision produces
   all-NaN logits on Qwen3-ASR's RMSNorm/attention (this matches the aoiandroid
   community finding for the same base model).

## What's in this repo

| File | Size | Role |
| --- | ---: | --- |
| `coreml/mega-asr-llm-embeds_mixed8_4.mlpackage/` | **1.87 GB** | **Recommended.** Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, 8-bit attn + 4-bit MLP, ~5.0 bpw avg. |
| `coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/` | 826 MB | Smaller variant. Uniform LUT-4 weights. Agreement is 3.7 points lower than mixed 8/4. |
| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Standalone Qwen3 text LLM with `input_ids` input (no audio scatter). |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) |
| `tokenizer/*` | - | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `inference_asr.py` | - | End-to-end ASR pipeline (ONNX encoder + CoreML LLM) |
| `convert_embeds.py` / `convert_embeds_mixed.py` | - | The custom converters |

## Quality (bench)

8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
agreement (1 − WER), with the prompt forced to `language English`, the ONNX fp32
audio encoder plus the CoreML LLM, run with `compute_units=ALL` (Metal GPU,
since ANE compilation fails on this model size + stateful KV cache):

| Per-sample | Mixed 8/4 (recommended) | Uniform LUT-4 |
| --- | ---: | ---: |
| distortion | 100% | 100% |
| dropout | 100% | 100% |
| echo (hard, heavy reverb) | **64.7%** | 47.1% |
| far_field | 100% | 100% |
| mixed | 100% | 100% |
| noise | 100% | 100% |
| obstructed | **100%** | 88.2% |
| recording (hard, truncated audio) | 60.0% | 60.0% |
| **AVERAGE** | **90.6%** | 86.9% |
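
For reference, agreement in this table is 1 − WER. A minimal sketch of that metric, assuming lowercasing and whitespace tokenization (the benchmark's actual text normalization is not stated here, and `word_error_rate` / `agreement` are illustrative names, not functions from this repo):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as used in the table: 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

A perfect transcript scores 1.0, and each substituted, dropped, or inserted word lowers the score by one over the reference word count.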

Mixed 8/4 lifts CoreML agreement from 86.9% to 90.6% (+3.7 points) by allocating the 4
attention projections per layer to LUT-8 (256 unique values per group of 8
channels) while keeping the 3 MLP projections at LUT-4 (16 unique values
per group of 8 channels). Attention layers in Qwen3 are quality-critical, the same
result we found in the MLX port.
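
To make the LUT sizes concrete: an n-bit palette stores 2^n shared values and replaces each weight with an n-bit index into that table, so LUT-8 has 256 entries and LUT-4 has 16. The sketch below runs 1-D k-means on a flat list of weights in pure Python. It illustrates the idea only and is not how coremltools implements palettization; `palettize` and `dequantize` are hypothetical helper names.

```python
def palettize(weights, nbits, iters=20):
    """1-D k-means: return (lut, indices) with len(lut) == 2 ** nbits."""
    if nbits < 1:
        raise ValueError("nbits must be at least 1")
    k = 2 ** nbits
    lo, hi = min(weights), max(weights)
    # Deterministic init: centroids spread evenly over the weight range.
    lut = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(w):
        # Index of the LUT entry closest to w (first one wins on ties).
        return min(range(k), key=lambda i: abs(w - lut[i]))

    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for w in weights:
            buckets[nearest(w)].append(w)
        # Move each centroid to the mean of its bucket; keep empty ones in place.
        lut = [sum(b) / len(b) if b else c for b, c in zip(buckets, lut)]
    return lut, [nearest(w) for w in weights]


def dequantize(lut, indices):
    """Rebuild approximate weights from the shared table and per-weight indices."""
    return [lut[i] for i in indices]


# Toy data: 41 evenly spaced weights compressed to a 2-bit (4-entry) palette.
weights = [i / 10 for i in range(-20, 21)]
lut, indices = palettize(weights, nbits=2)
approx = dequantize(lut, indices)
```

Storage per weight drops to `nbits` plus the amortized cost of the table, which is why giving attention 8 bits and the MLP 4 bits lands between the uniform variants in size.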

Cross-backend leaderboard (same 8 samples, same audio encoder):

| Backend | Agreement |
| --- | ---: |
| ONNX recommended (GPTQ INT4) | 92.7% |
| MLX recommended (mixed 8/4) | 92.2% |
| **CoreML recommended (mixed 8/4)** | **90.6%** |
| CoreML LUT-4 baseline | 86.9% |
| ONNX RTN INT4 baseline | 87.8% |

The remaining ~2-point gap to ONNX/MLX comes from the LUT-vs-GPTQ scheme difference
(k-means clustering vs activation-aware Hessian redistribution). The two
hard samples (`echo`, `recording`) are audio-quality-limited and stay
around 60-65% across all 4-bit backends.

## Inference

```bash
pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-coreml
cd mega-asr-coreml
python inference_asr.py \
    --mlpackage coreml/mega-asr-llm-embeds_mixed8_4.mlpackage \
    --encoder-path onnx/audio_encoder_fp32.onnx \
    --examples-dir examples \
    --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir> \
    --compute-unit ALL
```

The pipeline:
1. **Mel features** via Qwen3-ASR's `WhisperFeatureExtractor`.
2. **Audio encoder** (ONNX fp32) → audio embeddings `(F, 2048)`.
3. **Prompt + scatter**: build the Qwen3-ASR chat template with English
   forcing, expand the single `<|audio_pad|>` placeholder to F slots,
   look up text embeds via the HF model's `embed_tokens` weight, and scatter
   audio embeds at the placeholder positions.
4. **CoreML prefill**: feed each token's embedding one at a time to
   populate the in-model KV cache state.
5. **CoreML decode**: greedy step-by-step until `<|im_end|>`.
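
The prompt-and-scatter step (step 3) can be sketched with plain lists. This is a toy illustration under stated assumptions, not the code in `inference_asr.py`: the token ids, the 2-wide embeddings, and the helper names `expand_audio_pad` / `scatter_audio` are all made up for the example.

```python
def expand_audio_pad(token_ids, audio_pad_id, num_frames):
    """Replace the single placeholder id with `num_frames` copies of itself."""
    if token_ids.count(audio_pad_id) != 1:
        raise ValueError("expected exactly one audio placeholder in the prompt")
    expanded = []
    for tok in token_ids:
        expanded.extend([audio_pad_id] * num_frames if tok == audio_pad_id else [tok])
    return expanded


def scatter_audio(token_ids, embed_table, audio_embeds, audio_pad_id):
    """One embedding row per position: audio rows at pads, text rows elsewhere."""
    if token_ids.count(audio_pad_id) != len(audio_embeds):
        raise ValueError("placeholder count must equal the number of audio frames")
    frames = iter(audio_embeds)
    return [next(frames) if tok == audio_pad_id else embed_table[tok] for tok in token_ids]


# Toy example: hidden size 2 instead of 2048, F = 3 audio frames.
AUDIO_PAD = 9                                  # hypothetical token id
embed_table = {1: [1.0, 1.0], 2: [2.0, 2.0]}   # stand-in for the embed_tokens weight
audio = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]   # stand-in for the encoder output

ids = expand_audio_pad([1, AUDIO_PAD, 2], AUDIO_PAD, num_frames=len(audio))
inputs_embeds = scatter_audio(ids, embed_table, audio, AUDIO_PAD)
```

The result has one row per prompt position, which is the sequence that prefill then feeds to the LLM.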

The KV cache lives inside the CoreML model as `state`. Call
`model.make_state()` once per request, then thread the same state object
through every `predict()` call.
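
The call pattern (one state per request, reused for prefill and decode) looks roughly like the sketch below. `ToyStatefulModel` is a hypothetical stand-in so the loop runs without CoreML; only `make_state()` and passing the state to `predict()` mirror the real API, and the `hidden_states` / `logits` names are assumptions rather than this model's confirmed input and output names.

```python
class ToyStatefulModel:
    """Stand-in for the CoreML LLM: same call shape, trivial behaviour."""

    VOCAB = 8

    def make_state(self):
        return {"kv": []}  # the real CoreML state object is opaque

    def predict(self, inputs, state):
        state["kv"].append(inputs["hidden_states"])
        # Toy rule: the "predicted" token id equals the number of cached positions.
        logits = [0.0] * self.VOCAB
        logits[min(len(state["kv"]), self.VOCAB - 1)] = 1.0
        return {"logits": logits}


def greedy_decode(model, prompt_embeds, embed, eos_id, max_new_tokens=16):
    if not prompt_embeds:
        raise ValueError("prompt must contain at least one position")
    state = model.make_state()  # once per request
    # Prefill: one position at a time, always with the same state object.
    for row in prompt_embeds:
        out = model.predict({"hidden_states": row}, state)
    tokens = []
    for _ in range(max_new_tokens):
        logits = out["logits"]
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break
        tokens.append(next_id)
        out = model.predict({"hidden_states": embed(next_id)}, state)
    return tokens


prompt = [[0.0], [0.0], [0.0]]  # three already-embedded prompt positions
decoded = greedy_decode(ToyStatefulModel(), prompt, embed=lambda t: [float(t)], eos_id=7)
```

Creating a fresh state inside the loop, or a second state for decode, would discard the cached keys and values built during prefill.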

## Conversion details

```python
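from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

# `mlmodel` is the converted MLModel produced earlier in the script.
# Apply per-op-name palettization: attention at LUT-8, MLP at LUT-4.
attn_ops, mlp_ops = [], []
prog = mlmodel._mil_program
for op in prog.functions["main"].operations:
    # Only constant (weight) ops are candidates for palettization.
    if op.op_type != "const":
        continue
    n = op.name.lower()
    if "self_attn" in n and any(p in n for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        attn_ops.append(op.name)
    elif "mlp" in n and any(p in n for p in ("gate_proj", "up_proj", "down_proj")):
        mlp_ops.append(op.name)

# Map each collected op name to its own palettizer config.
config = OptimizationConfig(op_name_configs={
    **{n: OpPalettizerConfig(nbits=8, group_size=8) for n in attn_ops},
    **{n: OpPalettizerConfig(nbits=4, group_size=8) for n in mlp_ops},
})
mlmodel = palettize_weights(mlmodel, config)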
```

The model exposes 84 attention weight ops (28 layers × 3 attention
projections after the GQA-shared k/v gets clustered into k+v ops) and
84 MLP weight ops (28 layers × 3 MLP projections).

`compute_precision=FLOAT32` is mandatory: fp16 compute on Qwen3-ASR
produces all-NaN logits (RMSNorm + attention score overflow).
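
One plausible way an attention-score overflow turns into NaN is sketched below. It simulates half precision with the standard-library `struct` module and uses made-up activation values, so it shows the arithmetic mechanism only and does not reproduce the actual CoreML kernels or this model's activations.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; out-of-range magnitudes become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def dot(q, k, cast=float):
    """Dot product with every product and partial sum passed through `cast`."""
    acc = cast(0.0)
    for a, b in zip(q, k):
        acc = cast(acc + cast(a * b))
    return acc


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # inf - inf evaluates to nan
    total = sum(exps)
    return [e / total for e in exps]


head_dim = 128
q = [30.0] * head_dim        # made-up large activations
k_big = [30.0] * head_dim    # q.k = 115200, above the fp16 maximum of 65504
k_small = [0.1] * head_dim

wide = softmax([dot(q, k_big), dot(q, k_small)])                    # wider float
half = softmax([dot(q, k_big, to_fp16), dot(q, k_small, to_fp16)])  # simulated fp16
```

In the wide case the softmax is well defined; in the simulated fp16 case the largest score saturates to infinity, the max-subtraction produces `inf - inf`, and every probability becomes NaN.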

A local `coremltools` patch was needed in the `_cast` function of
`coremltools/converters/mil/frontend/torch/ops.py`: numpy arrays
of size 1 need to be coerced to a scalar via `.flatten()[0].item()` before
the dtype call. See the `convert_embeds_mixed.py` setup notes.

## Known limitations

1. **ANE rejected**. CoreML's ANE compiler fails (`MILCompilerForANE
   error: failed to compile ANE model using ANEF`), likely due to model
   size + stateful KV cache. `CPU_AND_NE` fails to load. `ALL` runs on
   **Metal GPU** (correct + ~3-4× faster than `CPU_ONLY`), which is the
   recommended setting.
2. **Audio encoder is ONNX**. The 24-layer Whisper-style encoder isn't
   ported to CoreML yet (ANEMLL is LLM-only). End-to-end runs the
   encoder via `onnxruntime` and the LLM via `coremltools`.
3. **Quality below ONNX/MLX** by ~2% at 4-bit, due to LUT k-means being
   weaker than GPTQ on this architecture. The uniform LUT-4 variant is
   smaller (826 MB) if size is critical; the mixed 8/4 (1.87 GB) is
   recommended for best quality.

## Companion repos

- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx): full ONNX pipeline (GPTQ-INT4, 92.7%)
- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx): MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench): browser demo (WebGPU)

## Credits

- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) with custom input_embeds + mixed-precision patches
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)