---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- 4bit
- 8bit
- mixed-precision
- lut
- palettize
- on-device
- apple-silicon
- ios
- macos
- qwen3
- qwen3-asr
- mega-asr
- anemll
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---
# Mega-ASR: CoreML mixed 8/4 (end-to-end ASR)
CoreML deployment of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR)
(Qwen3-ASR-1.7B base) with an **`input_embeds`-aware decoder** so audio
embeddings can be scattered at `<|audio_pad|>` positions to do real ASR,
not just text generation.
Converted via [ANEMLL](https://github.com/Anemll/Anemll) with a custom
`convert_embeds_mixed.py` that:
1. Monkey-patches `QwenModel.forward` + `QwenForCausalLM.forward` to accept
pre-embedded `hidden_states` (skipping the internal `embed_tokens`
lookup) so audio scatter works at inference.
2. Enumerates the MIL program's const-weight ops by name pattern and applies
**LUT-8 palettization to attention projections** (q/k/v/o_proj) and
**LUT-4 to MLP projections** (gate/up/down_proj), mirroring the MLX
`mixed8_4` recipe that closed the gap to GPTQ on the LLM portion.
3. Runs `compute_precision=FLOAT32`, because fp16 compute precision produces
all-NaN logits on Qwen3-ASR's RMSNorm/attention (matches the aoiandroid
community finding for the same base model).
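
The first step can be illustrated without ANEMLL. The following is a toy sketch in plain Python, not the code in `convert_embeds_mixed.py`: `ToyCausalLM`, its `layers` method, and `forward_with_embeds` are hypothetical stand-ins that only show the shape of the patch, namely an optional `hidden_states` argument that bypasses the `embed_tokens` lookup.

```python
class ToyCausalLM:
    """Hypothetical stand-in for a decoder that normally takes token ids."""

    def __init__(self, embed_table):
        self.embed_tokens = embed_table  # token id -> embedding vector

    def layers(self, hidden_states):
        # Placeholder for the transformer stack: one number per position.
        return [sum(vec) for vec in hidden_states]

    def forward(self, input_ids):
        hidden_states = [self.embed_tokens[i] for i in input_ids]
        return self.layers(hidden_states)


_original_forward = ToyCausalLM.forward


def forward_with_embeds(self, input_ids=None, hidden_states=None):
    """Patched forward: use caller-supplied embeddings when they are given."""
    if hidden_states is None:
        return _original_forward(self, input_ids)
    return self.layers(hidden_states)  # skips the embed_tokens lookup


# Patch at class level so existing instances pick up the new forward too.
ToyCausalLM.forward = forward_with_embeds

model = ToyCausalLM({0: [1.0, 2.0], 1: [3.0, 4.0]})
from_ids = model.forward(input_ids=[0, 1])
from_embeds = model.forward(hidden_states=[[1.0, 2.0], [0.5, 0.5]])
```

The token-id path keeps working after the patch, which is what lets text positions and audio positions share one decoder.
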
## What's in this repo
| File | Size | Role |
| --- | ---: | --- |
| `coreml/mega-asr-llm-embeds_mixed8_4.mlpackage/` | **1.87 GB** | **Recommended.** Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, 8-bit attn + 4-bit MLP, ~5.0 bpw avg. |
| `coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/` | 826 MB | Smaller variant. Uniform LUT-4 weights. Agreement is 3.7 points lower than mixed 8/4. |
| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Standalone Qwen3 text LLM with `input_ids` input (no audio scatter). |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) |
| `tokenizer/*` | - | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `inference_asr.py` | - | End-to-end ASR pipeline (ONNX encoder + CoreML LLM) |
| `convert_embeds.py` / `convert_embeds_mixed.py` | - | The custom converters |
## Quality (bench)
8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
agreement (1 - WER), prompt forced to `language English`, ONNX fp32
audio encoder + the CoreML LLM, run with `compute_units=ALL` (Metal GPU,
since ANE compilation fails on this model size + stateful KV cache):
| Per-sample | Mixed 8/4 (recommended) | Uniform LUT-4 |
| --- | ---: | ---: |
| distortion | 100% | 100% |
| dropout | 100% | 100% |
| echo (hard, heavy reverb) | **64.7%** | 47.1% |
| far_field | 100% | 100% |
| mixed | 100% | 100% |
| noise | 100% | 100% |
| obstructed | **100%** | 88.2% |
| recording (hard, truncated audio) | 60.0% | 60.0% |
| **AVERAGE** | **90.6%** | 86.9% |
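
For readers who want to reproduce the metric, here is a minimal sketch of agreement as 1 - WER using a word-level edit distance. It assumes lowercase, whitespace-split words; the benchmark's actual text normalization is not described in this card, so treat `word_error_rate` and `agreement` as illustrative helpers rather than the scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as used in the tables: 1 - WER (negative if WER > 1)."""
    return 1.0 - word_error_rate(reference, hypothesis)


score = agreement("the quick brown fox jumps", "the quick brown box jumps")

# Per-sample mixed 8/4 values copied from the table above.
mixed_per_sample = [100.0, 100.0, 64.7, 100.0, 100.0, 100.0, 100.0, 60.0]
mixed_average = round(sum(mixed_per_sample) / len(mixed_per_sample), 1)
```

The last two lines recompute the AVERAGE row from the per-sample column as a consistency check.
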
Mixed 8/4 lifts CoreML agreement from 86.9% to 90.6% (+3.7 points) by
assigning the 4 attention projections per layer to LUT-8 (256 unique values
per group of 8 channels) while keeping the 3 MLP projections at LUT-4
(16 unique values per group of 8 channels). Attention layers in Qwen3 are
quality-critical, the same result we found in the MLX port.
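
The "~5.0 bpw avg" figure for the mixed variant can be sanity-checked with simple arithmetic. The sketch below assumes the projection shapes of the public Qwen3-1.7B config (hidden size 2048, 16 query heads and 8 KV heads of dimension 128, MLP intermediate size 6144); these shapes are not stated in this card. It counts only the palettized projection weights and ignores lookup-table storage, embeddings, and norms.

```python
# Assumed Qwen3-1.7B projection shapes (not taken from this card).
HIDDEN = 2048
Q_OUT = 16 * 128        # query heads * head_dim
KV_OUT = 8 * 128        # key/value heads * head_dim (GQA)
INTERMEDIATE = 6144

# Per-layer weight counts for the palettized projections.
attn_params = HIDDEN * Q_OUT + 2 * HIDDEN * KV_OUT + Q_OUT * HIDDEN  # q, k, v, o
mlp_params = 3 * HIDDEN * INTERMEDIATE                               # gate, up, down


def average_bits(attn_bits: int, mlp_bits: int) -> float:
    """Weighted average index width over attention and MLP weights."""
    total_bits = attn_params * attn_bits + mlp_params * mlp_bits
    return total_bits / (attn_params + mlp_params)


mixed_bpw = average_bits(attn_bits=8, mlp_bits=4)
uniform_bpw = average_bits(attn_bits=4, mlp_bits=4)

# An n-bit lookup table holds 2**n distinct values.
palette_entries = {bits: 2 ** bits for bits in (4, 8)}
```

Under these assumptions the MLP holds three quarters of the projection weights, which is why moving only attention to 8 bits adds about one bit per weight on average.
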
Cross-backend leaderboard (same 8 samples, same audio encoder):
| Backend | Agreement |
| --- | ---: |
| ONNX recommended (GPTQ INT4) | 92.7% |
| MLX recommended (mixed 8/4) | 92.2% |
| **CoreML recommended (mixed 8/4)** | **90.6%** |
| CoreML LUT-4 baseline | 86.9% |
| ONNX RTN INT4 baseline | 87.8% |
The remaining ~2-point gap to ONNX/MLX is the LUT-vs-GPTQ scheme difference
(k-means clustering vs activation-aware Hessian redistribution). The two
hard samples (`echo`, `recording`) are audio-quality-limited and stay
around 60-65% across all 4-bit backends.
## Inference
```bash
pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-coreml
cd mega-asr-coreml
python inference_asr.py \
  --mlpackage coreml/mega-asr-llm-embeds_mixed8_4.mlpackage \
  --encoder-path onnx/audio_encoder_fp32.onnx \
  --examples-dir examples \
  --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir> \
  --compute-unit ALL
```
The pipeline:
1. **Mel features** via Qwen3-ASR's `WhisperFeatureExtractor`.
2. **Audio encoder** (ONNX fp32) → audio embeddings `(F, 2048)`.
3. **Prompt + scatter**: build the Qwen3-ASR chat template with English
forcing, expand the single `<|audio_pad|>` placeholder to F slots,
lookup text embeds via the HF model's `embed_tokens` weight, scatter
audio embeds at the placeholder positions.
4. **CoreML prefill**: feed each token's embedding one-at-a-time to
populate the in-model KV cache state.
5. **CoreML decode**: greedy step-by-step until `<|im_end|>`.
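
Step 3 of the pipeline above can be sketched with plain lists. This is an illustration under assumptions, not the code in `inference_asr.py`: the helper names, the placeholder id `9`, and the 2-dimensional embeddings are made up for the example (the real embeddings are 2048-wide).

```python
def expand_audio_pad(token_ids, audio_pad_id, num_frames):
    """Replace each audio placeholder token with num_frames copies of itself."""
    expanded = []
    for tok in token_ids:
        expanded.extend([audio_pad_id] * num_frames if tok == audio_pad_id else [tok])
    return expanded


def scatter_audio_embeds(token_ids, text_table, audio_embeds, audio_pad_id):
    """Look up text embeddings, then overwrite placeholder slots with audio frames."""
    slots = sum(tok == audio_pad_id for tok in token_ids)
    if slots != len(audio_embeds):
        raise ValueError(f"{slots} placeholder slots but {len(audio_embeds)} audio frames")
    frames = iter(audio_embeds)
    return [next(frames) if tok == audio_pad_id else text_table[tok] for tok in token_ids]


AUDIO_PAD = 9  # hypothetical id for <|audio_pad|>
text_table = {1: [0.1, 0.1], 2: [0.2, 0.2], AUDIO_PAD: [0.0, 0.0]}
audio_embeds = [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0]]  # F = 3 encoder frames

prompt_ids = expand_audio_pad([1, AUDIO_PAD, 2], AUDIO_PAD, num_frames=len(audio_embeds))
prompt_embeds = scatter_audio_embeds(prompt_ids, text_table, audio_embeds, AUDIO_PAD)
```

The length check matters in practice: a mismatch between F and the number of expanded slots would otherwise shift audio frames silently.
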
The KV cache lives inside the CoreML model as `state`. Call
`model.make_state()` once per request, then thread the same state object
through every `predict()` call.
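
The prefill and decode steps, with the state threaded through every call, might look roughly like the sketch below. It relies only on `make_state()` and `predict(..., state=...)`; the input and output names (`inputs_embeds`, `position`, `logits`) are hypothetical, and `ToyStatefulModel` is a stand-in so the loop can run without a CoreML model.

```python
def greedy_decode(model, prompt_embeds, embed_token, eos_id, max_new_tokens=64):
    """Prefill one embedding at a time, then decode greedily until eos_id."""
    if not prompt_embeds:
        raise ValueError("prompt_embeds must contain at least one position")
    state = model.make_state()  # one KV-cache state per request
    logits = None
    for pos, emb in enumerate(prompt_embeds):  # prefill
        out = model.predict({"inputs_embeds": emb, "position": pos}, state=state)
        logits = out["logits"]
    generated = []
    pos = len(prompt_embeds)
    for _ in range(max_new_tokens):  # decode
        token = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        if token == eos_id:
            break
        generated.append(token)
        out = model.predict(
            {"inputs_embeds": embed_token(token), "position": pos}, state=state
        )
        logits = out["logits"]
        pos += 1
    return generated


class ToyStatefulModel:
    """Stand-in model: its state counts calls and it emits token (calls % 4)."""

    def make_state(self):
        return {"steps": 0}

    def predict(self, inputs, state):
        state["steps"] += 1
        logits = [0.0] * 4
        logits[state["steps"] % 4] = 1.0
        return {"logits": logits}


tokens = greedy_decode(
    ToyStatefulModel(),
    prompt_embeds=[[0.0], [0.0]],
    embed_token=lambda tok: [float(tok)],
    eos_id=0,
)
```

Creating the state inside the function keeps requests isolated; reusing a state across requests would leak the previous request's cache.
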
## Conversion details
```python
# Apply per-op-name palettization: attention at LUT-8, MLP at LUT-4.
# `mlmodel` is the converted ML program produced earlier in the script.
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

attn_ops, mlp_ops = [], []
prog = mlmodel._mil_program
for op in prog.functions["main"].operations:
    if op.op_type != "const":
        continue  # only const ops hold weights
    n = op.name.lower()
    if "self_attn" in n and any(p in n for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        attn_ops.append(op.name)
    elif "mlp" in n and any(p in n for p in ("gate_proj", "up_proj", "down_proj")):
        mlp_ops.append(op.name)

# Give each collected weight op its own palettization config.
config = OptimizationConfig(op_name_configs={
    **{n: OpPalettizerConfig(nbits=8, group_size=8) for n in attn_ops},
    **{n: OpPalettizerConfig(nbits=4, group_size=8) for n in mlp_ops},
})
mlmodel = palettize_weights(mlmodel, config)
```
The model exposes 84 attention weight ops (28 layers × 3 attention
projections after the GQA-shared k/v gets clustered into k+v ops) and
84 MLP weight ops (28 layers × 3 MLP projections).

`compute_precision=FLOAT32` is mandatory: fp16 compute on Qwen3-ASR
produces all-NaN logits (RMSNorm + attention score overflow).
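
To make the overflow mechanism concrete, here is a generic toy illustration in plain Python. It is not a trace of Qwen3-ASR: the vectors are made up, `to_fp16` emulates half-precision rounding with the standard `struct` module, and `identity` stands in for wider-precision compute (Python floats are 64-bit). Once a dot product exceeds the fp16 maximum of 65504 it becomes `inf`, and the `inf - inf` inside a max-subtracted softmax yields NaN for every position.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def identity(x: float) -> float:
    """No rounding: stands in for wider-precision compute."""
    return x


def attention_probs(query, keys, rnd):
    """Softmax over query-key dot products, rounding every intermediate with rnd."""
    scores = []
    for key in keys:
        acc = 0.0
        for q, k in zip(query, key):
            acc = rnd(acc + rnd(q * k))
        scores.append(acc)
    top = max(scores)
    exps = [rnd(math.exp(rnd(s - top))) for s in scores]
    total = rnd(sum(exps))
    return [rnd(e / total) for e in exps]


query = [200.0] * 4
keys = [[200.0] * 4, [1.0] * 4]  # first score is 160000, beyond the fp16 range

wide_probs = attention_probs(query, keys, identity)
half_probs = attention_probs(query, keys, to_fp16)
half_is_all_nan = all(math.isnan(p) for p in half_probs)
```

A single overflowing score is enough to poison the whole row, which is consistent with logits turning into NaN everywhere rather than degrading gradually.
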
A `coremltools` local patch was needed in
`coremltools/converters/mil/frontend/torch/ops.py` `_cast`: numpy arrays
of size 1 need to be coerced to scalar via `.flatten()[0].item()` before
the dtype call; see `convert_embeds_mixed.py` setup notes.
## Known limitations
1. **ANE rejected**. CoreML's ANE compiler fails (`MILCompilerForANE
error: failed to compile ANE model using ANEF`), likely due to model
size + stateful KV cache. `CPU_AND_NE` fails to load. `ALL` runs on
**Metal GPU** (correct + ~3-4× faster than `CPU_ONLY`), which is the
recommended setting.
2. **Audio encoder is ONNX**. The 24-layer Whisper-style encoder isn't
ported to CoreML yet (ANEMLL is LLM-only). End-to-end runs the
encoder via `onnxruntime` and the LLM via `coremltools`.
3. **Quality below ONNX/MLX** by ~2 points at 4-bit, due to LUT k-means being
weaker than GPTQ on this architecture. The uniform LUT-4 variant is
smaller (826 MB) if size is critical; the mixed 8/4 (1.87 GB) is
recommended for best quality.
## Companion repos
- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx): full ONNX pipeline (GPTQ-INT4, 92.7%)
- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx): MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench): browser demo (WebGPU)
## Credits
- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) with custom input_embeds + mixed-precision patches
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)