Add README
README.md
ADDED
|
@@ -0,0 +1,146 @@
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
- zh
|
| 6 |
+
- ja
|
| 7 |
+
- ko
|
| 8 |
+
- multilingual
|
| 9 |
+
library_name: coremltools
|
| 10 |
+
tags:
|
| 11 |
+
- coreml
|
| 12 |
+
- ane
|
| 13 |
+
- apple-neural-engine
|
| 14 |
+
- automatic-speech-recognition
|
| 15 |
+
- asr
|
| 16 |
+
- speech-recognition
|
| 17 |
+
- robust-asr
|
| 18 |
+
- quantized
|
| 19 |
+
- int4
|
| 20 |
+
- 4bit
|
| 21 |
+
- lut
|
| 22 |
+
- palettize
|
| 23 |
+
- on-device
|
| 24 |
+
- apple-silicon
|
| 25 |
+
- ios
|
| 26 |
+
- macos
|
| 27 |
+
- qwen3
|
| 28 |
+
- qwen3-asr
|
| 29 |
+
- mega-asr
|
| 30 |
+
- anemll
|
| 31 |
+
pipeline_tag: automatic-speech-recognition
|
| 32 |
+
base_model: zhifeixie/Mega-ASR
|
| 33 |
+
base_model_relation: quantized
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
# Mega-ASR: CoreML LUT-4 (Apple Neural Engine)
|
| 37 |
+
|
| 38 |
+
CoreML LUT-4 (4-bit lookup-table palettized) export of the LLM portion of
|
| 39 |
+
[zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (Qwen3-ASR-1.7B
|
| 40 |
+
base), produced via [ANEMLL](https://github.com/Anemll/Anemll), the Apple
|
| 41 |
+
Neural Engine reference converter, with `--chunk 4 --lut 4 --context-length 512`.
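
To make "4-bit lookup-table palettized" concrete, here is a toy sketch of the idea: every weight is replaced by a 4-bit index into a 16-entry table of shared values. This is an illustration only, assuming a plain 1-D k-means over one tensor; it is not the algorithm ANEMLL or coremltools actually run, and `palettize_lut4` is a hypothetical name.

```python
import numpy as np

def palettize_lut4(weights: np.ndarray, iters: int = 10):
    """Toy 4-bit palettization: a 16-entry lookup table plus per-weight indices."""
    flat = weights.astype(np.float32).ravel()
    # Start the 16 centroids at evenly spaced quantiles of the weights.
    lut = np.quantile(flat, np.linspace(0.0, 1.0, 16)).astype(np.float32)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(16):
            members = flat[idx == k]
            if members.size:
                lut[k] = members.mean()
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1).astype(np.uint8)
    return lut, idx.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
lut, idx = palettize_lut4(w)
w_hat = lut[idx]  # dequantize: look each 4-bit index up in the table
```

Storage drops from 16 bits per weight to 4 bits plus the small table, which is where most of the size reduction in the `.mlpackage` comes from.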
|
| 42 |
+
|
| 43 |
+
The resulting `.mlpackage` is a stateful CoreML model with native ANE
|
| 44 |
+
attention layouts, in-model KV cache state, and a 16-way split LM head for
|
| 45 |
+
efficient ANE residency.
|
| 46 |
+
|
| 47 |
+
## What's in this repo
|
| 48 |
+
|
| 49 |
+
| File | Size | Role |
|
| 50 |
+
| --- | ---: | --- |
|
| 51 |
+
| `coreml/mega-asr-llm_lut4.mlpackage/` | **974 MB** | Qwen3 1.7B LLM, ANE-targeted, LUT-4 palettized weights, stateful KV cache |
|
| 52 |
+
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime; CoreML port pending) |
|
| 53 |
+
| `tokenizer/*` | - | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
|
| 54 |
+
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
|
| 55 |
+
|
| 56 |
+
## Model I/O
|
| 57 |
+
|
| 58 |
+
The `mega-asr-llm_lut4.mlpackage` follows ANEMLL's stateful step-decoder layout:
|
| 59 |
+
|
| 60 |
+
**Inputs** (single-token step):
|
| 61 |
+
| name | shape | dtype |
|
| 62 |
+
| --- | --- | --- |
|
| 63 |
+
| `input_ids` | `(1, 1)` | int32 |
|
| 64 |
+
| `position_ids` | `(1,)` | int32 |
|
| 65 |
+
| `causal_mask` | `(1, 1, 1, 512)` | float16 |
|
| 66 |
+
| `current_pos` | `(1,)` | int32 |
|
| 67 |
+
| `update_mask` | `(1, 1, 512, 1)` | float16 |
|
| 68 |
+
|
| 69 |
+
**Outputs**: `logits1` … `logits16`, each `(1, 1, 9496)` float16; concatenate
|
| 70 |
+
along the last axis to get the 151936-dim vocabulary logits (16 × 9496 = 151936).
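
A minimal sketch of turning the 16 output slices into one token id. It assumes the slices are laid out in order, so slice `i` covers ids `(i - 1) * 9496` up to `i * 9496 - 1`; `greedy_token` is a hypothetical helper, and the `fake` dict stands in for a real `predict()` result.

```python
import numpy as np

NUM_SLICES = 16    # logits1 ... logits16
SLICE_WIDTH = 9496  # vocabulary entries per slice

def greedy_token(out: dict) -> int:
    """Return the global token id with the highest logit across all slices."""
    slices = [np.asarray(out[f"logits{i}"])[0, 0] for i in range(1, NUM_SLICES + 1)]
    full = np.concatenate(slices).astype(np.float32)
    return int(full.argmax())

# Synthetic stand-in for a predict() result: one peak inside slice 3.
fake = {
    f"logits{i}": np.zeros((1, 1, SLICE_WIDTH), dtype=np.float16)
    for i in range(1, NUM_SLICES + 1)
}
fake["logits3"][0, 0, 5] = 1.0
token = greedy_token(fake)  # slice 3 starts at offset 2 * 9496
```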
|
| 71 |
+
|
| 72 |
+
**State**: `model_model_kv_cache_0`, shape `(56, 8, 512, 128)` float16 (28
|
| 73 |
+
layers × 2 (K/V) × 8 KV heads × 512 max context × 128 head dim). Create with
|
| 74 |
+
`model.make_state()` and pass to every `predict()`.
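
As a quick sanity check on the state shape, the arithmetic below derives the tensor shape and its memory footprint from the numbers quoted above. It only assumes 2 bytes per float16 element.

```python
import numpy as np

layers, kv_heads, context, head_dim = 28, 8, 512, 128

# Keys and values are stacked along the first axis: 28 layers * 2 = 56.
state_shape = (layers * 2, kv_heads, context, head_dim)

num_bytes = int(np.prod(state_shape)) * np.dtype(np.float16).itemsize
size_mib = num_bytes / 2**20
print(state_shape, f"{size_mib:.0f} MiB")
```

So the full 512-token KV cache costs about 56 MiB on top of the weights.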
|
| 75 |
+
|
| 76 |
+
## Quick run (Python)
|
| 77 |
+
|
| 78 |
+
```python
|
| 79 |
+
import coremltools as ct
|
| 80 |
+
import numpy as np
|
| 81 |
+
|
| 82 |
+
m = ct.models.MLModel("coreml/mega-asr-llm_lut4.mlpackage",
|
| 83 |
+
compute_units=ct.ComputeUnit.CPU_AND_NE)
|
| 84 |
+
state = m.make_state()
|
| 85 |
+
out = m.predict({
|
| 86 |
+
"input_ids": np.array([[40]], dtype=np.int32), # token 'I'
|
| 87 |
+
"position_ids": np.array([0], dtype=np.int32),
|
| 88 |
+
"causal_mask": np.zeros((1, 1, 1, 512), dtype=np.float16),
|
| 89 |
+
"current_pos": np.array([0], dtype=np.int32),
|
| 90 |
+
"update_mask": np.zeros((1, 1, 512, 1), dtype=np.float16),
|
| 91 |
+
}, state=state)
|
| 92 |
+
all_logits = np.concatenate([out[f"logits{i}"][0, 0] for i in range(1, 17)])
|
| 93 |
+
```
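
The single step above can be extended into a greedy decode loop. The sketch below is a rough outline under several assumptions that are not confirmed by this repo: `causal_mask` is additive (0 for visible positions, `-inf` for hidden ones), `update_mask` can stay all zeros as in the quick-run snippet, and the prompt is fed one token at a time. `step_inputs`, `greedy_decode`, and `fake_predict` are hypothetical names; `fake_predict` is a stand-in so the loop runs without the model.

```python
import numpy as np

CONTEXT = 512                   # --context-length used for this export
NUM_SLICES, SLICE = 16, 9496    # logits1 ... logits16

def step_inputs(token_id: int, pos: int) -> dict:
    """Build the five inputs for one decode step at position `pos`."""
    # Assumption: additive mask, 0.0 = visible, -inf = hidden.
    mask = np.full((1, 1, 1, CONTEXT), -np.inf, dtype=np.float16)
    mask[..., : pos + 1] = 0.0
    return {
        "input_ids": np.array([[token_id]], dtype=np.int32),
        "position_ids": np.array([pos], dtype=np.int32),
        "causal_mask": mask,
        "current_pos": np.array([pos], dtype=np.int32),
        # All zeros, copied from the quick-run snippet above.
        "update_mask": np.zeros((1, 1, CONTEXT, 1), dtype=np.float16),
    }

def greedy_decode(predict, prompt_ids, max_new_tokens=16, eos_id=None):
    """`predict` maps an input dict to the logits dict (hypothetical wrapper)."""
    if not 0 < len(prompt_ids) <= CONTEXT:
        raise ValueError("prompt must contain between 1 and CONTEXT tokens")
    pos = 0
    for tok in prompt_ids:  # token-by-token prefill
        out = predict(step_inputs(int(tok), pos))
        pos += 1
    generated = []
    while len(generated) < max_new_tokens:
        logits = np.concatenate(
            [out[f"logits{i}"][0, 0] for i in range(1, NUM_SLICES + 1)]
        )
        next_id = int(logits.argmax())
        generated.append(next_id)
        if next_id == eos_id or len(generated) == max_new_tokens or pos >= CONTEXT:
            break
        out = predict(step_inputs(next_id, pos))
        pos += 1
    return generated

def fake_predict(inputs: dict) -> dict:
    """Stand-in model: always predicts (input token + 1)."""
    full = np.zeros(NUM_SLICES * SLICE, dtype=np.float16)
    full[int(inputs["input_ids"][0, 0]) + 1] = 1.0
    return {
        f"logits{i + 1}": full[i * SLICE:(i + 1) * SLICE].reshape(1, 1, SLICE)
        for i in range(NUM_SLICES)
    }

tokens = greedy_decode(fake_predict, [40], max_new_tokens=3)
```

With the real model, `predict` would wrap the call from the quick-run snippet, for example `lambda x: m.predict(x, state=state)`.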
|
| 94 |
+
|
| 95 |
+
## ASR limitation (current)
|
| 96 |
+
|
| 97 |
+
This conversion exports the **standard text-LLM interface** (`input_ids` →
|
| 98 |
+
internal `embed_tokens` → forward). End-to-end ASR requires scattering
|
| 99 |
+
**audio embeddings** at `<|audio_pad|>` placeholder positions, which means
|
| 100 |
+
the model needs to accept `input_embeddings` *instead of* `input_ids`.
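
The scattering step itself is simple once embeddings are exposed. The sketch below shows it in NumPy on toy data, assuming one audio frame per placeholder token; `scatter_audio_embeddings` is a hypothetical helper and `AUDIO_PAD_ID = 9` is a made-up id, not the tokenizer's real `<|audio_pad|>` id.

```python
import numpy as np

def scatter_audio_embeddings(input_ids, token_embeds, audio_embeds, audio_pad_id):
    """Replace the embedding rows at audio placeholder positions with audio frames."""
    input_ids = np.asarray(input_ids)
    positions = np.flatnonzero(input_ids == audio_pad_id)
    if len(positions) != len(audio_embeds):
        raise ValueError(
            f"{len(positions)} placeholders but {len(audio_embeds)} audio frames"
        )
    merged = np.array(token_embeds, copy=True)  # leave the caller's array untouched
    merged[positions] = audio_embeds
    return merged

AUDIO_PAD_ID = 9  # hypothetical id, for illustration only
ids = np.array([1, 9, 9, 2])
tok = np.zeros((4, 3), dtype=np.float32)  # (sequence, hidden) text embeddings
aud = np.ones((2, 3), dtype=np.float32)   # (frames, hidden) audio encoder output
merged = scatter_audio_embeddings(ids, tok, aud, AUDIO_PAD_ID)
```

The merged `(sequence, hidden)` array is what an `input_embeddings` entry point would consume in place of `input_ids`.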
|
| 101 |
+
|
| 102 |
+
That requires forking ANEMLL's `qwen_model.py` to expose pre-embedded
|
| 103 |
+
hidden_states as the entry point, then re-running the conversion. (See
|
| 104 |
+
[`aoiandroid/Qwen3-ASR-1.7B-CoreML`](https://huggingface.co/aoiandroid/Qwen3-ASR-1.7B-CoreML)
|
| 105 |
+
for a prior community attempt of the same pattern; their decoder is named
|
| 106 |
+
`qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage` and pairs with a
|
| 107 |
+
separately stored `qwen3_asr_embeddings.bin`.)
|
| 108 |
+
|
| 109 |
+
Until the input_embeddings variant lands, this artifact is usable as:
|
| 110 |
+
- A standalone Qwen3 1.7B CoreML LLM (e.g. text-only chat with the same
|
| 111 |
+
prompt format the base model expects).
|
| 112 |
+
- A starting point for building an ANE-targeted Mega-ASR pipeline by
|
| 113 |
+
re-converting with the embedding bypass.
|
| 114 |
+
|
| 115 |
+
## Conversion details
|
| 116 |
+
|
| 117 |
+
```bash
|
| 118 |
+
# After cloning ANEMLL (https://github.com/Anemll/Anemll):
|
| 119 |
+
python -m anemll.ane_converter.qwen_converter \
|
| 120 |
+
--model /path/to/Qwen3-ASR-1.7B-llm-only \
|
| 121 |
+
--prefix mega-asr-llm --lut 4 \
|
| 122 |
+
--context-length 512 --batch-size 64 --chunk 4 \
|
| 123 |
+
--output /path/to/out
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
The Qwen3-ASR-1.7B LLM weights were first extracted from `zhifeixie/Mega-ASR`
|
| 127 |
+
by stripping the `thinker.model.` prefix and dropping the tied lm_head
|
| 128 |
+
(see [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx) for
|
| 129 |
+
the extraction script).
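
For orientation, the key renaming described above can be sketched as follows. This is not the script from the linked repo: the example key names are hypothetical, and the sketch assumes every LLM weight lives under `thinker.model.` while anything else (audio encoder, tied `lm_head`) is dropped.

```python
PREFIX = "thinker.model."

def extract_llm_weights(state_dict: dict) -> dict:
    """Keep only the LLM weights and strip the `thinker.model.` prefix."""
    llm = {}
    for name, tensor in state_dict.items():
        if "lm_head" in name:
            continue  # tied to the embedding matrix, so it is not stored separately
        if name.startswith(PREFIX):
            llm[name[len(PREFIX):]] = tensor
    return llm

# Hypothetical checkpoint keys; strings stand in for the tensors.
checkpoint = {
    "thinker.model.embed_tokens.weight": "E",
    "thinker.model.layers.0.self_attn.q_proj.weight": "Q0",
    "thinker.lm_head.weight": "E",
    "thinker.audio_tower.conv1.weight": "A",
}
llm_only = extract_llm_weights(checkpoint)
```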
|
| 130 |
+
|
| 131 |
+
Coremltools 9.0 needed one local patch: the `_cast` op handler in
|
| 132 |
+
`coremltools/converters/mil/frontend/torch/ops.py` does not handle numpy
|
| 133 |
+
arrays of size 1; this was fixed by extracting the scalar via `.flatten()[0].item()`
|
| 134 |
+
before the dtype coercion.
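
The pattern behind that patch looks roughly like this. It is a standalone illustration, not the actual coremltools source; `as_python_scalar` is a hypothetical helper name.

```python
import numpy as np

def as_python_scalar(x):
    """Unwrap a size-1 NumPy array into a plain Python scalar; pass others through."""
    if isinstance(x, np.ndarray):
        if x.size != 1:
            raise ValueError(f"expected a single element, got shape {x.shape}")
        # flatten() handles 0-d, (1,), and (1, 1) arrays alike.
        return x.flatten()[0].item()
    return x

a = as_python_scalar(np.array([3], dtype=np.int32))  # 1-D, size 1
b = as_python_scalar(np.array([[2.5]]))              # 2-D, size 1
c = as_python_scalar(np.array(5))                    # 0-d array
d = as_python_scalar(7)                              # already a Python int
```

After unwrapping, the usual `int(...)` or `float(...)` dtype coercion applies to a plain Python value.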
|
| 135 |
+
|
| 136 |
+
## Companion repos
|
| 137 |
+
|
| 138 |
+
- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx): full ONNX pipeline (GPTQ-INT4 decoder, 92.7% on VITW)
|
| 139 |
+
- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx): MLX 4-bit (mixed8/4 attention/MLP, 92.2% on VITW)
|
| 140 |
+
- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench): live browser demo (WebGPU)
|
| 141 |
+
|
| 142 |
+
## Credits
|
| 143 |
+
|
| 144 |
+
- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
|
| 145 |
+
- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) (Apple Neural Engine LLM port toolkit)
|
| 146 |
+
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
|