license: apache-2.0
language:
  - en
  - zh
  - ja
  - ko
  - multilingual
library_name: coremltools
tags:
  - coreml
  - ane
  - apple-neural-engine
  - automatic-speech-recognition
  - asr
  - speech-recognition
  - robust-asr
  - quantized
  - int4
  - 4bit
  - lut
  - palettize
  - on-device
  - apple-silicon
  - ios
  - macos
  - qwen3
  - qwen3-asr
  - mega-asr
  - anemll
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized

Mega-ASR — CoreML LUT-4 (Apple Neural Engine)

CoreML LUT-4 (4-bit lookup-table palettized) export of the LLM portion of zhifeixie/Mega-ASR (Qwen3-ASR-1.7B base). It was produced with ANEMLL, an open-source toolkit for converting LLMs to run on the Apple Neural Engine, using --chunk 4 --lut 4 --context-length 512.

The resulting .mlpackage is a stateful CoreML model with native ANE attention layouts, in-model KV cache state, and a 16-way split LM head for efficient ANE residency.
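To make "4-bit lookup-table palettized" concrete, here is a minimal NumPy sketch of the idea: every weight is replaced by a 4-bit index into a table of 2^4 = 16 shared values. It uses a plain 1-D k-means with quantile initialisation, which is an assumption for illustration only. It is not the algorithm coremltools or ANEMLL actually run, and `palettize_lut4` and `depalettize` are hypothetical names.

```python
import numpy as np

def palettize_lut4(weights, n_iter=10):
    """Toy LUT-4 palettization: returns (lut, indices) with a 16-entry table."""
    flat = weights.astype(np.float32).ravel()
    # Initialise the 16 centroids at evenly spaced quantiles of the weights.
    lut = np.quantile(flat, np.linspace(0.0, 1.0, 16)).astype(np.float32)
    for _ in range(n_iter):
        # Assign every weight to its nearest centroid.
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(16):
            members = flat[idx == k]
            if members.size:
                lut[k] = members.mean()
    # Final assignment against the updated table.
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.astype(np.uint8).reshape(weights.shape)

def depalettize(lut, indices):
    """Rebuild approximate weights by looking each index up in the table."""
    return lut[indices]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)).astype(np.float32)  # stand-in weight matrix
lut, idx = palettize_lut4(w)
w_hat = depalettize(lut, idx)
```

Only the indices (4 bits each once packed) and the small table need to be stored; the lookup reconstructs an approximation of the original weights at load or run time.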

What's in this repo

| File | Size | Role |
| --- | --- | --- |
| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Qwen3 1.7B LLM, ANE-targeted, LUT-4 palettized weights, stateful KV cache |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime; CoreML port pending) |
| `tokenizer/*` | n/a | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |

Model I/O

The mega-asr-llm_lut4.mlpackage follows ANEMLL's stateful step-decoder layout:

Inputs (single-token step):

| Name | Shape | Dtype |
| --- | --- | --- |
| `input_ids` | `(1, 1)` | int32 |
| `position_ids` | `(1,)` | int32 |
| `causal_mask` | `(1, 1, 1, 512)` | float16 |
| `current_pos` | `(1,)` | int32 |
| `update_mask` | `(1, 1, 512, 1)` | float16 |

Outputs: logits1 … logits16, each (1, 1, 9496) float16. Concatenate them along the last axis to get logits over the full 151936-token vocabulary (16 × 9496 = 151936).
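The concatenation can be sketched as follows. The dictionary here is random stand-in data shaped like the documented outputs, not a real prediction, and `merge_logits` is a hypothetical helper name.

```python
import numpy as np

NUM_HEADS = 16   # number of split LM-head outputs
SLICE = 9496     # vocabulary entries per output

def merge_logits(out):
    """Join logits1..logits16 into one (151936,) vector for the last position."""
    parts = [np.asarray(out[f"logits{i}"])[0, -1] for i in range(1, NUM_HEADS + 1)]
    return np.concatenate(parts, axis=-1)

# Stand-in for a predict() result: random float16 slices, not real model output.
rng = np.random.default_rng(0)
fake_out = {f"logits{i}": rng.normal(size=(1, 1, SLICE)).astype(np.float16)
            for i in range(1, NUM_HEADS + 1)}
fake_out["logits3"][0, 0, 7] = 100.0  # plant a known maximum in the third slice

logits = merge_logits(fake_out)
next_token = int(logits.argmax())     # greedy pick: 2 * 9496 + 7
```

The order of the slices matters: token id t lives in output number t // 9496 + 1 at offset t % 9496.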

State: model_model_kv_cache_0, shape (56, 8, 512, 128), float16. The dimensions are 56 = 28 layers × 2 (K and V), 8 KV heads, 512 max context, and 128 head dim. Create the state once with make_state() on the loaded model and pass it to every predict() call.
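As a quick check of the state layout, the arithmetic below derives the shape and the memory footprint from the figures quoted above. It assumes 2 bytes per float16 element and ignores any padding the runtime may add.

```python
LAYERS, KV_HEADS, CONTEXT, HEAD_DIM = 28, 8, 512, 128
BYTES_PER_FP16 = 2

# Keys and values for all layers are stacked along the first axis.
state_shape = (LAYERS * 2, KV_HEADS, CONTEXT, HEAD_DIM)

n_elements = 1
for dim in state_shape:
    n_elements *= dim

state_bytes = n_elements * BYTES_PER_FP16
print(state_shape, state_bytes / 2**20, "MiB")
```

Under these assumptions the cache occupies 56 MiB for the full 512-token context.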

Quick run (Python)

import coremltools as ct
import numpy as np

# Load the stateful decoder, restricted to CPU + Neural Engine.
m = ct.models.MLModel("coreml/mega-asr-llm_lut4.mlpackage",
                     compute_units=ct.ComputeUnit.CPU_AND_NE)
state = m.make_state()  # fresh KV cache; reuse it for every later predict()
# Run a single decoding step at position 0.
out = m.predict({
    "input_ids":   np.array([[40]], dtype=np.int32),          # token 'I'
    "position_ids": np.array([0], dtype=np.int32),
    "causal_mask":  np.zeros((1, 1, 1, 512), dtype=np.float16),
    "current_pos":  np.array([0], dtype=np.int32),
    "update_mask":  np.zeros((1, 1, 512, 1), dtype=np.float16),
}, state=state)
# Join the 16 LM-head slices into one 151936-dim logits vector.
all_logits = np.concatenate([out[f"logits{i}"][0, 0] for i in range(1, 17)])
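The quick run passes all-zero masks, which is enough to smoke-test the model. For decoding past the first token, the masks probably need to reflect the current position. The sketch below builds per-step inputs under two assumptions that should be checked against ANEMLL's own inference code: the causal mask is additive (0 for visible slots, -inf for hidden ones) and the update mask is one-hot at the slot being written. `step_inputs` is a hypothetical helper.

```python
import numpy as np

CONTEXT = 512  # must match the --context-length used at conversion time

def step_inputs(token_id, pos, context=CONTEXT):
    """Build the five input arrays for decoding one token at position `pos`."""
    if not 0 <= pos < context:
        raise ValueError(f"pos {pos} is outside the {context}-token context")
    # Assumed convention: additive mask, 0 = visible, -inf = hidden.
    causal_mask = np.full((1, 1, 1, context), -np.inf, dtype=np.float16)
    causal_mask[..., : pos + 1] = 0.0
    # Assumed convention: one-hot marker for the cache slot being written.
    update_mask = np.zeros((1, 1, context, 1), dtype=np.float16)
    update_mask[0, 0, pos, 0] = 1.0
    return {
        "input_ids": np.array([[token_id]], dtype=np.int32),
        "position_ids": np.array([pos], dtype=np.int32),
        "causal_mask": causal_mask,
        "current_pos": np.array([pos], dtype=np.int32),
        "update_mask": update_mask,
    }

inputs = step_inputs(token_id=40, pos=3)
```

With a loaded model and state, one decoding step would then be `m.predict(step_inputs(tok, pos), state=state)`, with `pos` incremented after each generated token.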

ASR limitation (current)

This conversion exports the standard text-LLM interface (input_ids → internal embed_tokens → forward). End-to-end ASR requires scattering audio embeddings at <|audio_pad|> placeholder positions, which means the model needs to accept input_embeddings instead of input_ids.

Supporting that requires forking ANEMLL's qwen_model.py so that pre-embedded hidden_states become the entry point, then re-running the conversion. (See aoiandroid/Qwen3-ASR-1.7B-CoreML for a prior community attempt at the same pattern; its decoder is named qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage and pairs with a separately stored qwen3_asr_embeddings.bin.)
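The scatter step described above can be sketched in NumPy. This is only an illustration of the pattern: the placeholder id and the hidden size are made-up values, and `scatter_audio` is a hypothetical helper. A real pipeline would take the id of `<|audio_pad|>` from the tokenizer and the frames from the audio encoder.

```python
import numpy as np

AUDIO_PAD_ID = 9  # hypothetical id; look up <|audio_pad|> in the real tokenizer

def scatter_audio(input_ids, text_embeds, audio_embeds, pad_id=AUDIO_PAD_ID):
    """Replace the embedding at every audio placeholder with an encoder frame."""
    positions = np.flatnonzero(input_ids == pad_id)
    if len(positions) != len(audio_embeds):
        raise ValueError("placeholder count must equal the number of audio frames")
    merged = text_embeds.copy()
    merged[positions] = audio_embeds  # frames fill the placeholders in order
    return merged

hidden = 4  # tiny stand-in for the real hidden size
input_ids = np.array([1, 9, 9, 9, 2])
text_embeds = np.zeros((5, hidden), dtype=np.float32)   # stand-in token embeddings
audio_embeds = np.ones((3, hidden), dtype=np.float32)   # stand-in encoder output
merged = scatter_audio(input_ids, text_embeds, audio_embeds)
```

The merged sequence is what a decoder with an embeddings entry point would consume in place of input_ids.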

Until the input_embeddings variant lands, this artifact is usable as:

A standalone Qwen3 1.7B CoreML LLM (e.g. text-only chat with the same prompt format the base model expects).
A starting point for building an ANE-targeted Mega-ASR pipeline by re-converting with the embedding bypass.

Conversion details

# After cloning ANEMLL (https://github.com/Anemll/Anemll):
python -m anemll.ane_converter.qwen_converter \
  --model /path/to/Qwen3-ASR-1.7B-llm-only \
  --prefix mega-asr-llm --lut 4 \
  --context-length 512 --batch-size 64 --chunk 4 \
  --output /path/to/out

The Qwen3-ASR-1.7B LLM weights were first extracted from zhifeixie/Mega-ASR by stripping the thinker.model. prefix and dropping the tied lm_head (see Reza2kn/mega-asr-mlx for the extraction script).
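The renaming pattern looks roughly like this. It is a sketch on a toy dictionary, not the extraction script itself (that lives in Reza2kn/mega-asr-mlx); the key names other than the thinker.model. prefix are illustrative, and `extract_llm_weights` is a hypothetical name.

```python
PREFIX = "thinker.model."

def extract_llm_weights(state_dict):
    """Keep only LLM tensors, with the wrapper prefix removed."""
    llm = {}
    for name, tensor in state_dict.items():
        if "lm_head" in name:
            continue  # tied to embed_tokens, so it does not need to be stored
        if name.startswith(PREFIX):
            llm[name[len(PREFIX):]] = tensor
    return llm

# Toy checkpoint: key names are illustrative, values stand in for tensors.
checkpoint = {
    "thinker.model.embed_tokens.weight": "E",
    "thinker.model.layers.0.self_attn.q_proj.weight": "Q",
    "thinker.lm_head.weight": "E",
    "thinker.audio_tower.conv1.weight": "A",
}
llm = extract_llm_weights(checkpoint)
```

Anything outside the prefix, such as the audio encoder weights, is left out of the LLM-only checkpoint.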

coremltools 9.0 needed one local patch: the _cast op handler in coremltools/converters/mil/frontend/torch/ops.py does not handle NumPy arrays of size 1. The fix extracts the scalar via .flatten()[0].item() before the dtype coercion.
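The scalar-extraction idiom used in that patch can be shown in isolation. This is not the patched handler, only the pattern it relies on, and `to_python_scalar` is a hypothetical helper name.

```python
import numpy as np

def to_python_scalar(value):
    """Return a plain Python scalar from a scalar or a size-1 NumPy array."""
    if isinstance(value, np.ndarray):
        if value.size != 1:
            raise ValueError(f"expected a single element, got shape {value.shape}")
        # flatten() copes with any number of dimensions; item() unwraps the NumPy scalar.
        return value.flatten()[0].item()
    return value

as_int = int(to_python_scalar(np.array([[7]], dtype=np.int64)))
as_float = float(to_python_scalar(np.array([2.5], dtype=np.float16)))
```

After this step the usual int() or float() coercion works regardless of whether the value arrived as a Python number or as a one-element array.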

Companion repos

Reza2kn/mega-asr-onnx — full ONNX pipeline (GPTQ-INT4 decoder, 92.7% on VITW)
Reza2kn/mega-asr-mlx — MLX 4-bit (mixed8/4 attention/MLP, 92.2% on VITW)
Reza2kn/mega-asr-bench — live browser demo (WebGPU)

Credits

Original model: zhifeixie/Mega-ASR (1.7B params, Apache-2.0)
CoreML conversion via ANEMLL (Apple Neural Engine LLM port toolkit)
Benchmark: Voices-in-the-Wild-Bench