metadata
license: apache-2.0
language:
  - en
  - zh
  - ja
  - ko
  - multilingual
library_name: coremltools
tags:
  - coreml
  - ane
  - apple-neural-engine
  - automatic-speech-recognition
  - asr
  - speech-recognition
  - robust-asr
  - quantized
  - int4
  - 4bit
  - lut
  - palettize
  - on-device
  - apple-silicon
  - ios
  - macos
  - qwen3
  - qwen3-asr
  - mega-asr
  - anemll
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized

Mega-ASR - CoreML LUT-4 (Apple Neural Engine)

CoreML LUT-4 (4-bit lookup-table palettized) export of the LLM portion of zhifeixie/Mega-ASR (Qwen3-ASR-1.7B base), produced via ANEMLL (the Apple Neural Engine reference converter) with --chunk 4 --lut 4 --context-length 512.

The resulting .mlpackage is a stateful CoreML model with native ANE attention layouts, in-model KV cache state, and a 16-way split LM head for efficient ANE residency.
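To make "LUT-4" concrete: with palettization, each weight is stored as a 4-bit index into a 16-entry lookup table, and the weight is reconstructed by a table lookup. The sketch below shows only that idea with NumPy. The table values and index matrix are made up for illustration, and it does not describe how ANEMLL or coremltools group weights or pick the table entries.

```python
import numpy as np

# Hypothetical 16-entry lookup table (2**4 entries for 4-bit indices).
lut = np.linspace(-1.0, 1.0, 16, dtype=np.float16)

# Toy "compressed" weight matrix: every entry is an index in [0, 15].
indices = np.array([[0, 15, 7],
                    [8, 3, 12]], dtype=np.uint8)

# Dequantization is a plain gather from the table.
weights = lut[indices]

print(weights.shape)  # (2, 3): same shape as the index matrix
```

Only the indices and the small table need to be stored, which is where the size reduction relative to fp16 weights comes from.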

What's in this repo

| File | Size | Role |
| --- | --- | --- |
| coreml/mega-asr-llm_lut4.mlpackage/ | 974 MB | Qwen3 1.7B LLM, ANE-targeted, LUT-4 palettized weights, stateful KV cache |
| onnx/audio_encoder_fp32.onnx | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime; CoreML port pending) |
| tokenizer/* | - | Original Qwen3-ASR tokenizer (<\|audio_pad\|>, <asr_text>, etc.) |
| examples/*.wav | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |

Model I/O

The mega-asr-llm_lut4.mlpackage follows ANEMLL's stateful step-decoder layout:

Inputs (single-token step):

| name | shape | dtype |
| --- | --- | --- |
| input_ids | (1, 1) | int32 |
| position_ids | (1,) | int32 |
| causal_mask | (1, 1, 1, 512) | float16 |
| current_pos | (1,) | int32 |
| update_mask | (1, 1, 512, 1) | float16 |
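The input shapes above can be assembled for an arbitrary decode position with a small helper. This is a sketch under stated assumptions, not ANEMLL's reference code: it assumes the causal mask is additive (0 to attend, -inf to block) and that the update mask is one-hot at the cache slot being written. The Quick run example in this README passes all-zero masks, so check ANEMLL's own inference scripts before relying on either convention. `make_step_inputs` is a hypothetical name.

```python
import numpy as np

CONTEXT_LENGTH = 512  # matches --context-length 512 used for this export


def make_step_inputs(token_id: int, pos: int,
                     context_length: int = CONTEXT_LENGTH) -> dict:
    """Build the five input arrays for one decode step at position `pos`.

    ASSUMPTION: additive causal mask (0 = attend, -inf = blocked) and a
    one-hot update mask at `pos`. Verify against ANEMLL before use.
    """
    if not 0 <= pos < context_length:
        raise ValueError(f"pos must be in [0, {context_length}), got {pos}")

    causal_mask = np.full((1, 1, 1, context_length), -np.inf, dtype=np.float16)
    causal_mask[..., : pos + 1] = 0.0  # allow attention to positions 0..pos

    update_mask = np.zeros((1, 1, context_length, 1), dtype=np.float16)
    update_mask[0, 0, pos, 0] = 1.0  # mark the cache slot for this token

    return {
        "input_ids": np.array([[token_id]], dtype=np.int32),
        "position_ids": np.array([pos], dtype=np.int32),
        "causal_mask": causal_mask,
        "current_pos": np.array([pos], dtype=np.int32),
        "update_mask": update_mask,
    }


inputs = make_step_inputs(token_id=40, pos=3)
for name, arr in inputs.items():
    print(name, arr.shape, arr.dtype)
```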

Outputs: logits1 … logits16, each (1, 1, 9496) float16. Concatenate them along the last axis to recover the full 151936-entry vocabulary (16 × 9496 = 151936).

State: model_model_kv_cache_0, with shape (56, 8, 512, 128) float16 (28 layers × 2 (K/V) × 8 KV heads × 512 max context × 128 head dim). Create it once with model.make_state() and pass it to every predict() call.
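As a quick sanity check on the state shape, the arithmetic below derives the first axis from the layer count and estimates the cache's memory footprint. It uses only the numbers stated above, and the variable names are illustrative.

```python
# KV-cache size implied by the documented state shape.
num_layers = 28
kv_per_layer = 2        # one K and one V tensor per layer
kv_heads = 8
context_length = 512
head_dim = 128
bytes_per_fp16 = 2

state_shape = (num_layers * kv_per_layer, kv_heads, context_length, head_dim)

num_elements = 1
for dim in state_shape:
    num_elements *= dim

size_bytes = num_elements * bytes_per_fp16
print(state_shape)                # (56, 8, 512, 128)
print(size_bytes / 2**20, "MiB")  # 56.0 MiB
```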

Quick run (Python)

import coremltools as ct
import numpy as np

# Load the LLM and request CPU + Neural Engine execution.
m = ct.models.MLModel("coreml/mega-asr-llm_lut4.mlpackage",
                      compute_units=ct.ComputeUnit.CPU_AND_NE)

# The KV cache lives in the state object: create it once, reuse it for every step.
state = m.make_state()

out = m.predict({
    "input_ids":    np.array([[40]], dtype=np.int32),          # token 'I'
    "position_ids": np.array([0], dtype=np.int32),
    "causal_mask":  np.zeros((1, 1, 1, 512), dtype=np.float16),
    "current_pos":  np.array([0], dtype=np.int32),
    "update_mask":  np.zeros((1, 1, 512, 1), dtype=np.float16),
}, state=state)

# Stitch the 16 LM-head chunks back into one (151936,) logits vector.
all_logits = np.concatenate([out[f"logits{i}"][0, 0] for i in range(1, 17)])
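To turn the split outputs into a next-token choice, one option is a greedy argmax over the concatenated logits. The sketch below runs on a synthetic stand-in for the predict() result, so it does not need the model. `greedy_token` and `fake_out` are hypothetical names.

```python
import numpy as np

NUM_CHUNKS = 16     # logits1 ... logits16
CHUNK_SIZE = 9496   # vocabulary entries per output


def greedy_token(out: dict) -> int:
    """Concatenate the split LM-head outputs and return the argmax token id."""
    chunks = [np.asarray(out[f"logits{i}"])[0, 0]
              for i in range(1, NUM_CHUNKS + 1)]
    logits = np.concatenate(chunks).astype(np.float32)
    return int(np.argmax(logits))


# Synthetic stand-in for a predict() result.
fake_out = {f"logits{i}": np.zeros((1, 1, CHUNK_SIZE), dtype=np.float16)
            for i in range(1, NUM_CHUNKS + 1)}
fake_out["logits9"][0, 0, 5] = 1.0  # peak in the 9th chunk, local index 5

token_id = greedy_token(fake_out)
print(token_id)  # 8 * 9496 + 5 = 75973
```

The global token id is the chunk offset plus the local index, which is why the concatenation order must follow logits1 through logits16.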

ASR limitation (current)

This conversion exports the standard text-LLM interface (input_ids → internal embed_tokens → forward). End-to-end ASR requires scattering audio embeddings at the <|audio_pad|> placeholder positions, which means the model needs to accept input_embeddings instead of input_ids.

That requires forking ANEMLL's qwen_model.py to expose pre-embedded hidden_states as the entry point, then re-running the conversion. (See aoiandroid/Qwen3-ASR-1.7B-CoreML for a prior community attempt at the same pattern; their decoder is named qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage and pairs with a separately stored qwen3_asr_embeddings.bin.)
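The scattering step described above can be sketched in NumPy. This only illustrates the idea of replacing embedding rows at placeholder positions with audio-encoder frames, and it is not code from Mega-ASR or ANEMLL. The placeholder id, function name, and toy sizes are hypothetical, and the real id must come from the tokenizer.

```python
import numpy as np

AUDIO_PAD_ID = 9999  # HYPOTHETICAL id; read the real one from the tokenizer


def scatter_audio_embeddings(input_ids, text_embeds, audio_embeds,
                             pad_id=AUDIO_PAD_ID):
    """Replace the embedding rows at placeholder positions with audio frames.

    input_ids:    (seq_len,) int array
    text_embeds:  (seq_len, hidden) array from the token embedding table
    audio_embeds: (num_audio_frames, hidden) array from the audio encoder
    """
    positions = np.flatnonzero(input_ids == pad_id)
    if len(positions) != len(audio_embeds):
        raise ValueError(
            f"{len(positions)} placeholders but {len(audio_embeds)} audio frames")
    merged = text_embeds.copy()
    merged[positions] = audio_embeds  # frames fill placeholders in order
    return merged


hidden = 4  # toy width for the example
ids = np.array([1, AUDIO_PAD_ID, AUDIO_PAD_ID, 2])
text = np.zeros((4, hidden), dtype=np.float32)
audio = np.ones((2, hidden), dtype=np.float32)

merged = scatter_audio_embeddings(ids, text, audio)
print(merged[:, 0])  # [0. 1. 1. 0.]
```

A model that accepts input_embeddings would consume `merged` directly, which is why the input_ids entry point is not enough for end-to-end ASR.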

Until the input_embeddings variant lands, this artifact is usable as:

  • A standalone Qwen3 1.7B CoreML LLM (e.g. text-only chat with the same prompt format the base model expects).
  • A starting point for building an ANE-targeted Mega-ASR pipeline by re-converting with the embedding bypass.

Conversion details

# After cloning ANEMLL (https://github.com/Anemll/Anemll):
python -m anemll.ane_converter.qwen_converter \
  --model /path/to/Qwen3-ASR-1.7B-llm-only \
  --prefix mega-asr-llm --lut 4 \
  --context-length 512 --batch-size 64 --chunk 4 \
  --output /path/to/out

The Qwen3-ASR-1.7B LLM weights were first extracted from zhifeixie/Mega-ASR by stripping the thinker.model. prefix and dropping the tied lm_head (see Reza2kn/mega-asr-mlx for the extraction script).
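The key remapping described above might look like the following. This is a sketch based on a literal reading of that sentence, not the extraction script from Reza2kn/mega-asr-mlx. The key names in the toy checkpoint are hypothetical, and plain strings stand in for tensors.

```python
LLM_PREFIX = "thinker.model."  # prefix named in this README


def extract_llm_keys(state_dict: dict) -> dict:
    """Keep only LLM weights, strip the prefix, and skip any lm_head entries."""
    extracted = {}
    for key, value in state_dict.items():
        if "lm_head" in key:
            continue  # tied to the embeddings, so it is dropped
        if key.startswith(LLM_PREFIX):
            extracted[key[len(LLM_PREFIX):]] = value
    return extracted


# Toy checkpoint with HYPOTHETICAL key names; values stand in for tensors.
toy = {
    "thinker.model.embed_tokens.weight": "E",
    "thinker.model.layers.0.self_attn.q_proj.weight": "Q",
    "thinker.lm_head.weight": "E",
    "thinker.audio_tower.conv1.weight": "A",
}

llm_only = extract_llm_keys(toy)
print(sorted(llm_only))
```

Keys outside the prefix, such as the audio encoder weights in this toy example, are left out because only the LLM portion is converted.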

coremltools 9.0 needed one local patch: the _cast op handler in coremltools/converters/mil/frontend/torch/ops.py does not handle NumPy arrays of size 1. The fix extracts the scalar via .flatten()[0].item() before the dtype coercion.
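The scalar extraction at the heart of that patch can be shown in isolation. The helper below is a hypothetical illustration of the `.flatten()[0].item()` idiom only. It is not the patched coremltools code and does not touch the `_cast` handler.

```python
import numpy as np


def to_python_scalar(value):
    """Return a plain Python scalar for size-1 NumPy arrays; pass others through."""
    if isinstance(value, np.ndarray) and value.size == 1:
        return value.flatten()[0].item()
    return value


print(to_python_scalar(np.array([3], dtype=np.int32)))  # 3
print(to_python_scalar(np.array(2.5)))                  # 2.5
print(to_python_scalar(7))                              # 7
```

Flattening first makes the idiom work for any size-1 shape, such as (1,), (1, 1), or a 0-d array.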

Companion repos

Credits