license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- 4bit
- lut
- palettize
- on-device
- apple-silicon
- ios
- macos
- qwen3
- qwen3-asr
- mega-asr
- anemll
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
Mega-ASR: CoreML LUT-4 (Apple Neural Engine)
CoreML LUT-4 (4-bit lookup-table palettized) export of the LLM portion of
zhifeixie/Mega-ASR (Qwen3-ASR-1.7B base), produced via ANEMLL, the Apple
Neural Engine reference converter, with --chunk 4 --lut 4 --context-length 512.
The resulting .mlpackage is a stateful CoreML model with native ANE
attention layouts, in-model KV cache state, and a 16-way split LM head for
efficient ANE residency.
What's in this repo
| File | Size | Role |
|---|---|---|
| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Qwen3 1.7B LLM, ANE-targeted, LUT-4 palettized weights, stateful KV cache |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime; CoreML port pending) |
| `tokenizer/*` | - | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
Model I/O
The mega-asr-llm_lut4.mlpackage follows ANEMLL's stateful step-decoder layout:
Inputs (single-token step):
| name | shape | dtype |
|---|---|---|
| `input_ids` | (1, 1) | int32 |
| `position_ids` | (1,) | int32 |
| `causal_mask` | (1, 1, 1, 512) | float16 |
| `current_pos` | (1,) | int32 |
| `update_mask` | (1, 1, 512, 1) | float16 |
Outputs: logits1 … logits16, each (1, 1, 9496) float16; concatenate them
along the last axis to get the 151936-dim vocabulary (16 × 9496 = 151936).
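The shard layout implies simple index arithmetic: index `j` in shard `logitsK` maps to vocabulary id `(K - 1) * 9496 + j`. Below is a minimal pure-Python sketch of a greedy pick over the shards, assuming the shards are contiguous vocabulary slices in order (which the concatenation rule above suggests). `global_token_id` and `greedy_token` are illustrative names, not part of ANEMLL or coremltools.

```python
NUM_SHARDS = 16
SHARD_SIZE = 9496  # per-shard vocabulary slice, from the output shapes above


def global_token_id(shard_index, local_index, shard_size=SHARD_SIZE):
    """Map a (0-based shard, local index) pair to a vocabulary id."""
    return shard_index * shard_size + local_index


def greedy_token(shards):
    """Return the vocabulary id with the highest logit across all shards.

    `shards` is a sequence of equally sized 1-D sequences, ordered
    logits1, logits2, ... (assumed to be contiguous vocabulary slices).
    """
    best_value, best_shard, best_local = None, 0, 0
    for shard_index, shard in enumerate(shards):
        for local_index, value in enumerate(shard):
            if best_value is None or value > best_value:
                best_value, best_shard, best_local = value, shard_index, local_index
    return global_token_id(best_shard, best_local, len(shards[0]))
```

In practice `np.argmax` over the concatenated vector should give the same id; the loop form only makes the offset arithmetic explicit.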
State: model_model_kv_cache_0, shape (56, 8, 512, 128) float16 (28
layers × 2 (K/V) × 8 KV heads × 512 max context × 128 head dim). Create it with
model.make_state() and pass it to every predict() call.
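As a quick sanity check on the state shape, the arithmetic below derives the first dimension and the in-memory size of the KV cache from the numbers quoted above. It is a back-of-the-envelope sketch: the constant names are illustrative, and it assumes the state is stored as plain float16 with no padding.

```python
LAYERS = 28        # transformer layers
KV_HEADS = 8       # key/value heads per layer
CONTEXT = 512      # maximum context length
HEAD_DIM = 128     # per-head dimension
BYTES_FP16 = 2     # bytes per float16 element

# Keys and values for every layer are stacked along the first axis.
state_shape = (LAYERS * 2, KV_HEADS, CONTEXT, HEAD_DIM)

elements = 1
for dim in state_shape:
    elements *= dim

size_bytes = elements * BYTES_FP16
size_mib = size_bytes / 2**20

print(state_shape, size_bytes, size_mib)
```

Under these assumptions the cache occupies 56 MiB, independent of how many positions have been filled.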
Quick run (Python)
import coremltools as ct
import numpy as np

# Load the package and request CPU + Neural Engine execution.
m = ct.models.MLModel("coreml/mega-asr-llm_lut4.mlpackage",
                      compute_units=ct.ComputeUnit.CPU_AND_NE)
state = m.make_state()  # fresh KV cache; reuse it for every predict() call

# One decode step at position 0.
out = m.predict({
    "input_ids": np.array([[40]], dtype=np.int32),  # token 'I'
    "position_ids": np.array([0], dtype=np.int32),
    "causal_mask": np.zeros((1, 1, 1, 512), dtype=np.float16),
    "current_pos": np.array([0], dtype=np.int32),
    "update_mask": np.zeros((1, 1, 512, 1), dtype=np.float16),
}, state=state)

# Concatenate the 16 LM-head shards into one (151936,) logits vector.
all_logits = np.concatenate([out[f"logits{i}"][0, 0] for i in range(1, 17)])
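The call above is a single-step smoke test with all-zero masks. For multi-step decoding the masks would normally depend on the position. The sketch below assumes (this card does not confirm it) that `causal_mask` is additive, with 0 for visible cache slots and -inf for hidden ones, and that `update_mask` is one-hot at the slot being written. Check ANEMLL's reference inference code before relying on either convention. `make_step_inputs` is a hypothetical helper name.

```python
import numpy as np

CONTEXT_LENGTH = 512  # matches --context-length 512


def make_step_inputs(token_id, pos, context_length=CONTEXT_LENGTH):
    """Build the input dict for one decode step at 0-based position `pos`.

    Assumed conventions (not confirmed by the model card):
      * causal_mask is additive: 0 keeps a slot, -inf hides it
      * update_mask is one-hot at the cache slot written this step
    """
    if not 0 <= pos < context_length:
        raise ValueError(f"pos must be in [0, {context_length}), got {pos}")

    # Hide every slot, then expose slots 0..pos.
    causal_mask = np.full((1, 1, 1, context_length), -np.inf, dtype=np.float16)
    causal_mask[..., : pos + 1] = 0.0

    update_mask = np.zeros((1, 1, context_length, 1), dtype=np.float16)
    update_mask[0, 0, pos, 0] = 1.0

    return {
        "input_ids": np.array([[token_id]], dtype=np.int32),
        "position_ids": np.array([pos], dtype=np.int32),
        "causal_mask": causal_mask,
        "current_pos": np.array([pos], dtype=np.int32),
        "update_mask": update_mask,
    }
```

Each step would then be `m.predict(make_step_inputs(token_id, pos), state=state)`, with `pos` advanced by one per generated token.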
ASR limitation (current)
This conversion exports the standard text-LLM interface (input_ids →
internal embed_tokens → forward). End-to-end ASR requires scattering
audio embeddings at <|audio_pad|> placeholder positions, which means
the model needs to accept input_embeddings instead of input_ids.
That requires forking ANEMLL's qwen_model.py to expose pre-embedded
hidden_states as the entry point, then re-running the conversion. (See
aoiandroid/Qwen3-ASR-1.7B-CoreML
for a prior community attempt at the same pattern; their decoder is named
qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage and pairs with a
separately stored qwen3_asr_embeddings.bin.)
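The scattering step described above can be sketched in NumPy. This is an illustration of the general pattern, not code from ANEMLL or the base model: `scatter_audio_embeddings` is a hypothetical helper, and the id of `<|audio_pad|>` must be looked up in the tokenizer rather than hard-coded.

```python
import numpy as np


def scatter_audio_embeddings(input_ids, token_embeds, audio_embeds, audio_pad_id):
    """Replace the rows at <|audio_pad|> positions with audio encoder outputs.

    input_ids:    (T,)   token ids for the prompt
    token_embeds: (T, D) output of embed_tokens for those ids
    audio_embeds: (N, D) audio encoder output, one row per placeholder
    audio_pad_id: tokenizer id of <|audio_pad|> (look it up, do not guess)
    """
    input_ids = np.asarray(input_ids)
    token_embeds = np.asarray(token_embeds)
    audio_embeds = np.asarray(audio_embeds)

    positions = np.flatnonzero(input_ids == audio_pad_id)
    if len(positions) != len(audio_embeds):
        raise ValueError(
            f"{len(positions)} placeholders but {len(audio_embeds)} audio frames"
        )

    merged = token_embeds.copy()  # leave the caller's array untouched
    merged[positions] = audio_embeds
    return merged
```

The merged `(T, D)` array is what an input_embeddings variant of the model would consume in place of input_ids.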
Until the input_embeddings variant lands, this artifact is usable as:
- A standalone Qwen3 1.7B CoreML LLM (e.g. text-only chat with the same prompt format the base model expects).
- A starting point for building an ANE-targeted Mega-ASR pipeline by re-converting with the embedding bypass.
Conversion details
# After cloning ANEMLL (https://github.com/Anemll/Anemll):
python -m anemll.ane_converter.qwen_converter \
--model /path/to/Qwen3-ASR-1.7B-llm-only \
--prefix mega-asr-llm --lut 4 \
--context-length 512 --batch-size 64 --chunk 4 \
--output /path/to/out
The Qwen3-ASR-1.7B LLM weights were first extracted from zhifeixie/Mega-ASR
by stripping the thinker.model. prefix and dropping the tied lm_head
(see Reza2kn/mega-asr-mlx for
the extraction script).
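The real extraction script lives in the repo referenced above; the following is only a minimal sketch of the renaming rule as described, operating on a plain name-to-tensor mapping. The example key names are hypothetical, and skipping keys outside the prefix (for instance audio encoder weights) is an assumption.

```python
def extract_llm_weights(weights, prefix="thinker.model."):
    """Keep only LLM weights, with `prefix` stripped from their names.

    * keys mentioning lm_head are dropped (the head is tied to the embeddings)
    * keys without the prefix are skipped (assumed to be non-LLM weights)
    """
    extracted = {}
    for name, tensor in weights.items():
        if "lm_head" in name:
            continue
        if name.startswith(prefix):
            extracted[name[len(prefix):]] = tensor
    return extracted


# Hypothetical checkpoint keys, for illustration only.
example = {
    "thinker.model.embed_tokens.weight": "E",
    "thinker.model.layers.0.self_attn.q_proj.weight": "Q",
    "thinker.lm_head.weight": "H",
    "thinker.audio_tower.conv1.weight": "A",
}
print(sorted(extract_llm_weights(example)))
```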
coremltools 9.0 needed one local patch: the _cast op handler in
coremltools/converters/mil/frontend/torch/ops.py does not handle numpy
arrays of size 1. The fix extracts the scalar via .flatten()[0].item()
before the dtype coercion.
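The idea behind the patch can be shown in isolation. This is not the actual coremltools diff, only a sketch of the scalar extraction it relies on; `scalar_from_size1_array` is a hypothetical name.

```python
import numpy as np


def scalar_from_size1_array(value):
    """Unwrap a size-1 numpy array to a Python scalar; pass other values through."""
    if isinstance(value, np.ndarray) and value.size == 1:
        # flatten() handles any rank, e.g. shapes (1,) and (1, 1).
        return value.flatten()[0].item()
    return value
```

Applying this before the dtype coercion means the cast always sees a plain Python number, whatever the rank of the size-1 array.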
Companion repos
- Reza2kn/mega-asr-onnx: full ONNX pipeline (GPTQ-INT4 decoder, 92.7% on VITW)
- Reza2kn/mega-asr-mlx: MLX 4-bit (mixed8/4 attention/MLP, 92.2% on VITW)
- Reza2kn/mega-asr-bench: live browser demo (WebGPU)
Credits
- Original model: zhifeixie/Mega-ASR (1.7B params, Apache-2.0)
- CoreML conversion via ANEMLL (Apple Neural Engine LLM port toolkit)
- Benchmark: Voices-in-the-Wild-Bench