End-to-end CoreML ASR works (86.9% on VITW); document input_embeds fork + fp32 compute fix


README.md +112 -85


@@ -33,114 +33,141 @@ base_model: zhifeixie/Mega-ASR
 base_model_relation: quantized
 ---
-# Mega-ASR — CoreML LUT-4 (Apple Neural Engine)
-CoreML LUT-4 (4-bit lookup-table palettized) export of the LLM portion of
-[zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (Qwen3-ASR-1.7B
-base), produced via [ANEMLL](https://github.com/Anemll/Anemll) — the Apple
-Neural Engine reference converter — with `--chunk 4 --lut 4 --context-length 512`.
-The resulting `.mlpackage` is a stateful CoreML model with native ANE
-attention layouts, in-model KV cache state, and 16-way split LM head for
-efficient ANE residency.
 ## What's in this repo
 | File | Size | Role |
 | --- | ---: | --- |
-| `coreml/mega-asr-llm_lut4.mlpackage/` | **974 MB** | Qwen3 1.7B LLM, ANE-targeted, LUT-4 palettized weights, stateful KV cache |
-| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime — CoreML port pending) |
 | `tokenizer/*` | — | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
 | `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
-## Model I/O
-The `mega-asr-llm_lut4.mlpackage` follows ANEMLL's stateful step-decoder layout:
-**Inputs** (single-token step):
-| name | shape | dtype |
-| --- | --- | --- |
-| `input_ids` | `(1, 1)` | int32 |
-| `position_ids` | `(1,)` | int32 |
-| `causal_mask` | `(1, 1, 1, 512)` | float16 |
-| `current_pos` | `(1,)` | int32 |
-| `update_mask` | `(1, 1, 512, 1)` | float16 |
-**Outputs**: `logits1` … `logits16`, each `(1, 1, 9496)` float16 — concat
-along last axis to get the 151936-dim vocabulary.
-**State**: `model_model_kv_cache_0` — shape `(56, 8, 512, 128)` float16 (28
-layers × 2 (K/V) × 8 KV heads × 512 max context × 128 head dim). Create with
-`model.make_state()` and pass to every `predict()`.
-## Quick run (Python)
-```python
-import coremltools as ct
-import numpy as np
-m = ct.models.MLModel("coreml/mega-asr-llm_lut4.mlpackage",
-                     compute_units=ct.ComputeUnit.CPU_AND_NE)
-state = m.make_state()
-out = m.predict({
-    "input_ids":   np.array([[40]], dtype=np.int32),          # token 'I'
-    "position_ids": np.array([0], dtype=np.int32),
-    "causal_mask":  np.zeros((1, 1, 1, 512), dtype=np.float16),
-    "current_pos":  np.array([0], dtype=np.int32),
-    "update_mask":  np.zeros((1, 1, 512, 1), dtype=np.float16),
-}, state=state)
-all_logits = np.concatenate([out[f"logits{i}"][0, 0] for i in range(1, 17)])
 ```
-## ASR limitation (current)
-This conversion exports the **standard text-LLM interface** (`input_ids` →
-internal `embed_tokens` → forward). End-to-end ASR requires scattering
-**audio embeddings** at `<|audio_pad|>` placeholder positions, which means
-the model needs to accept `input_embeddings` *instead of* `input_ids`.
-That requires forking ANEMLL's `qwen_model.py` to expose pre-embedded
-hidden_states as the entry point, then re-running the conversion. (See
-[`aoiandroid/Qwen3-ASR-1.7B-CoreML`](https://huggingface.co/aoiandroid/Qwen3-ASR-1.7B-CoreML)
-for a prior community attempt of the same pattern; their decoder is named
-`qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage` and pairs with a
-separately stored `qwen3_asr_embeddings.bin`.)
-Until the input_embeddings variant lands, this artifact is usable as:
-- A standalone Qwen3 1.7B CoreML LLM (e.g. text-only chat with the same
-  prompt format the base model expects).
-- A starting point for building an ANE-targeted Mega-ASR ASR pipeline by
-  re-converting with the embedding bypass.
-## Conversion details
-```bash
-# After cloning ANEMLL (https://github.com/Anemll/Anemll):
-python -m anemll.ane_converter.qwen_converter \
-  --model /path/to/Qwen3-ASR-1.7B-llm-only \
-  --prefix mega-asr-llm --lut 4 \
-  --context-length 512 --batch-size 64 --chunk 4 \
-  --output /path/to/out
 ```
-The Qwen3-ASR-1.7B LLM weights were first extracted from `zhifeixie/Mega-ASR`
-by stripping the `thinker.model.` prefix and dropping the tied lm_head
-(see [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx) for
-the extraction script).
-Coremltools 9.0 needed one local patch: the `_cast` op handler in
-`coremltools/converters/mil/frontend/torch/ops.py` does not handle numpy
-arrays of size 1 — fixed by extracting the scalar via `.flatten()[0].item()`
-before the dtype coercion.
 ## Companion repos
-- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx) — full ONNX pipeline (GPTQ-INT4 decoder, 92.7% on VITW)
-- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx) — MLX 4-bit (mixed8/4 attention/MLP, 92.2% on VITW)
-- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench) — live browser demo (WebGPU)
 ## Credits
-- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
-- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) (Apple Neural Engine LLM port toolkit)
 - Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)

 base_model_relation: quantized
 ---
+# Mega-ASR — CoreML LUT-4 (end-to-end ASR)
+CoreML LUT-4 (4-bit lookup-table palettized) deployment of
+[zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR), with an
+`inputs_embeds`-aware decoder so audio embeddings can be scattered at
+`<|audio_pad|>` positions for real ASR, not just text generation.
+Converted via [ANEMLL](https://github.com/Anemll/Anemll) with a custom
+`convert_embeds.py` that monkey-patches `QwenModel.forward` and
+`QwenForCausalLM.forward` to accept pre-embedded `hidden_states` (skipping the
+internal `embed_tokens` lookup). The model is a single-token step decoder with a
+stateful KV cache (28 layers × 2 × 8 KV heads × 512 ctx × 128 head_dim, fp16), LUT-4
+weights at `--per_channel 8`, and **fp32 compute precision**: `compute_precision=FLOAT16`
+overflows in Qwen3-ASR's RMSNorm/attention layers and produces NaN logits.
 ## What's in this repo
 | File | Size | Role |
 | --- | ---: | --- |
+| `coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/` | **826 MB** | **Recommended.** Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, LUT-4 weights. Pair with the ONNX audio encoder for end-to-end ASR. |
+| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Original `input_ids` variant — standalone Qwen3 1.7B text LLM (no audio scatter). |
+| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) |
 | `tokenizer/*` | — | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
 | `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
+| `inference_asr.py` | — | End-to-end ASR pipeline: ONNX encoder + CoreML LLM |
+| `convert_embeds.py` | — | The custom converter (use to reproduce / re-quantize) |
+## Quality (bench)
+Agreement (1 − WER) on 8 clips from
+[Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench),
+with the prompt forced to `language English`, run on an M-series Mac CPU
+(`CPU_AND_NE` failed to compile for the ANE, likely due to model size and the stateful KV cache):
+| Sample | Hypothesis vs. reference | Agreement |
+| --- | --- | ---: |
+| distortion | exact match | 100% |
+| dropout | exact match | 100% |
+| far_field | exact match | 100% |
+| mixed | exact match | 100% |
+| noise | exact match | 100% |
+| obstructed | "i have forgotten" vs "i forgot" | 88.2% |
+| echo (hard, heavy reverb) | "size 25 stand not and the 125 walk" | 47.1% |
+| recording (hard, truncated audio) | "train stopped at the station" | 60.0% |
+| **AVERAGE** | | **86.9%** |
+For reference (same 8 samples, same audio encoder, same prompt):
+| Backend | Agreement |
+| --- | ---: |
+| ONNX recommended (GPTQ) | 92.7% |
+| MLX recommended (mixed 8/4) | 92.2% |
+| **CoreML LUT-4 (this repo)** | **86.9%** |
+| ONNX RTN INT4 baseline | 87.8% |
+LUT-4 k-means is a more aggressive quantization than ONNX GPTQ (which uses
+activation-aware error redistribution) or MLX mixed 8/4 (which keeps the
+4 attention projections at 8-bit). The gap of roughly **5 to 6 points** vs the leaders is
+concentrated on the 2 hard samples (`echo`, `recording`) and one near-miss
+on `obstructed`. Five of eight samples produce exact-match transcriptions.
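The summary figures can be recomputed from the two tables above. The snippet below is only a bookkeeping sketch: the numbers are copied from the tables, and the dictionary names are illustrative rather than taken from `inference_asr.py`.

```python
# Per-sample agreement (1 - WER, in percent), copied from the per-sample table.
per_sample = {
    "distortion": 100.0, "dropout": 100.0, "far_field": 100.0, "mixed": 100.0,
    "noise": 100.0, "obstructed": 88.2, "echo": 47.1, "recording": 60.0,
}
# Other backends, copied from the comparison table.
reference = {"ONNX GPTQ": 92.7, "MLX mixed 8/4": 92.2, "ONNX RTN INT4": 87.8}

average = sum(per_sample.values()) / len(per_sample)
exact = [name for name, score in per_sample.items() if score == 100.0]
# Gap in percentage points between each backend and this repo's average.
gaps = {name: round(score - average, 1) for name, score in reference.items()}

print(f"average agreement: {average:.1f}%")
print(f"exact matches: {len(exact)} of {len(per_sample)}")
print(f"gap to each backend (points): {gaps}")
```

Because the average is taken over only eight clips, a single hard sample such as `echo` moves it by several points, so small differences between backends should be read with that in mind.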
+## Inference
+```bash
+pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
+git clone https://huggingface.co/Reza2kn/mega-asr-coreml
+cd mega-asr-coreml
+python inference_asr.py \
+    --mlpackage coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage \
+    --encoder-path onnx/audio_encoder_fp32.onnx \
+    --examples-dir examples \
+    --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir>
 ```
+The pipeline runs:
+1. **Mel features** via Qwen3-ASR's `WhisperFeatureExtractor`.
+2. **Audio encoder** (ONNX fp32) → audio embeddings `(F, 2048)`.
+3. **Prompt + scatter**: build the Qwen3-ASR chat template, expand the single
+   `<|audio_pad|>` placeholder to `F` slots, look up text embeddings in the
+   original HF model's `embed_tokens` weight, then scatter the audio embeddings into the `F` slots.
+4. **CoreML prefill**: feed each token's embedding one-at-a-time to populate the
+   KV cache state.
+5. **CoreML decode**: greedy step-by-step until `<|im_end|>`.
+The KV cache lives inside the CoreML model as `state`. Call `model.make_state()`
+once per request, then pass the same state object to every `predict()` call.
+## Conversion details
+Two-step monkey-patch in `convert_embeds.py` lets ANEMLL's Qwen3 conversion
+accept pre-embedded inputs:
+```python
+# 1. QwenModel.forward — detect float input_ids and skip embed_tokens
+qm.QwenModel.forward = model_forward_or_embeds
+# 2. QwenForCausalLM.forward — relax the 2D assert; replicate lm_head logic
+qm.QwenForCausalLM.forward = causal_forward_or_embeds
 ```
+ANEMLL's CoreML conversion then traces with a `WrapperEmbeds` module whose
+inputs are `(inputs_embeds, position_ids, causal_mask, current_pos, update_mask)`.
+`coremltools.optimize.coreml.palettize_weights` applies LUT-4 with
+`per_grouped_channel` / `group_size=8`.
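To make the LUT-4 idea concrete, here is a toy, pure-Python sketch of 4-bit palettization for a single weight group: k-means picks 16 centroids (the lookup table) and every weight is stored as a 4-bit index into it. This is not the coremltools implementation; `kmeans_1d`, `palettize`, and the random weights are illustrative assumptions, and real grouping across channels is not modeled.

```python
import random

def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's k-means on scalars; returns the k centroids (the LUT)."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly over the value range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            buckets[idx].append(v)
        # Keep the old centroid when a bucket received no values.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids

def palettize(values, nbits=4):
    """Return (lut, indices): each weight becomes an nbits-wide index into lut."""
    lut = kmeans_1d(values, 2 ** nbits)
    indices = [min(range(len(lut)), key=lambda i: abs(v - lut[i])) for v in values]
    return lut, indices

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(512)]  # toy weight group
lut, indices = palettize(weights, nbits=4)
restored = [lut[i] for i in indices]  # what the runtime reconstructs
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"LUT entries: {len(lut)}, max index: {max(indices)}, max abs error: {max_err:.5f}")
```

Smaller groups mean more lookup tables and therefore lower reconstruction error, at the cost of extra storage for the tables themselves.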
+**Key compute-precision tweak**: `compute_precision=ct.precision.FLOAT32`
+in `ct.convert`. fp16 compute produces all-NaN logits on Qwen3-ASR's
+RMSNorm + attention layers — same finding as the aoiandroid community
+CoreML port. Weights stay LUT-4 (4-bit storage); only activations run fp32.
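As a rough illustration of how fp16 compute can end in NaN, the sketch below emulates half-precision rounding with Python's `struct` module. It shows one way an overflow can arise (squaring a large activation, as an RMSNorm does) and how an `inf` then becomes NaN; it is not a trace of the actual failing op, and the activation value is a made-up example.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; overflow becomes +/-inf, as in fp16 math."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range.
        return math.copysign(math.inf, x)

FP16_MAX = 65504.0  # largest finite half-precision value

activation = 300.0                              # hypothetical large activation
square_fp16 = to_fp16(activation * activation)  # 90000 exceeds FP16_MAX
square_wide = activation * activation           # representable in wider precision

# Once an inf appears, ordinary follow-up arithmetic can turn it into NaN,
# for example the max-subtraction step of a softmax computing inf - inf.
poisoned = square_fp16 - square_fp16

print(square_fp16, square_wide, poisoned)
```

Running activations in fp32 avoids this failure mode because 90000 is far below the fp32 range limit, while the stored weights can stay at 4 bits.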
+Also patched: the `_cast` op handler in
+`coremltools/converters/mil/frontend/torch/ops.py`, which does not handle numpy
+arrays of size 1 (fixed by extracting the scalar via `.flatten()[0].item()`).
+The diff is in the `convert_embeds.py` setup notes.
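For orientation, the scatter, prefill, and greedy-decode steps of the inference pipeline can be sketched in plain Python. Everything here is a simplified assumption: tokens are strings instead of ids, the prompt is not the real Qwen3-ASR chat template, and `FakeStepModel` is a hypothetical stand-in whose `predict` signature differs from the real CoreML model (which takes `inputs_embeds`, `position_ids`, `causal_mask`, `current_pos`, and `update_mask`).

```python
AUDIO_PAD = "<|audio_pad|>"
IM_END = "<|im_end|>"

def expand_audio_pad(tokens, num_frames):
    """Replace the single <|audio_pad|> placeholder with num_frames copies."""
    out = []
    for tok in tokens:
        out.extend([AUDIO_PAD] * num_frames if tok == AUDIO_PAD else [tok])
    return out

def scatter(tokens, text_embed, audio_embeds):
    """Text tokens use the embedding table; pad slots take audio rows in order."""
    audio_iter = iter(audio_embeds)
    return [next(audio_iter) if tok == AUDIO_PAD else text_embed[tok] for tok in tokens]

class FakeStepModel:
    """Hypothetical stand-in for the step decoder; replays a fixed token script."""
    def __init__(self, script):
        self.script = script
    def make_state(self):
        return {"seen": 0}        # plays the role of the in-model KV cache
    def predict(self, embed, pos, state):
        state["seen"] = pos + 1   # one cache slot is filled per call
        return self.script[min(pos, len(self.script) - 1)]

def transcribe(model, embeds, text_embed, max_new_tokens=16):
    state = model.make_state()    # created once per request
    next_tok = None
    for pos, emb in enumerate(embeds):   # prefill: one embedding per call
        next_tok = model.predict(emb, pos, state)
    out, pos = [], len(embeds)
    while next_tok != IM_END and len(out) < max_new_tokens:  # greedy decode
        out.append(next_tok)
        next_tok = model.predict(text_embed[next_tok], pos, state)
        pos += 1
    return out, state

# Toy data: 2-dim embeddings, 3 audio frames, made-up prompt tokens.
text_embed = {"<|im_start|>": [0.1, 0.1], "<asr_text>": [0.2, 0.2],
              "hello": [0.3, 0.3], "world": [0.4, 0.4]}
audio = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
tokens = expand_audio_pad(["<|im_start|>", AUDIO_PAD, "<asr_text>"], len(audio))
embeds = scatter(tokens, text_embed, audio)
script = ["_", "_", "_", "_", "hello", "world", IM_END]
out, state = transcribe(FakeStepModel(script), embeds, text_embed)
print(out, state)
```

The point of the sketch is the control flow: the same state object is threaded through every call, prefill and decode share one step function, and only the output of the last prefill step is used to start generation.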
+## Known limitations
+1. **CPU compute only** in practice. CoreML's ANE compiler rejects this model
+   (`MILCompilerForANE error: failed to compile ANE model using ANEF`) — likely
+   due to model size + stateful KV cache. CPU_AND_NE / ALL fail to load;
+   CPU_ONLY works and is correct. Per-token latency is ~1.5 s on CPU.
+2. **Audio encoder is ONNX**. The 24-layer Whisper-style encoder hasn't been
+   ported to CoreML (ANEMLL is LLM-only). End-to-end inference runs the
+   encoder via `onnxruntime` and the LLM via `coremltools`.
+3. **Quality below ONNX/MLX** at 4-bit due to LUT-4 k-means being weaker than
+   GPTQ on this architecture. Mitigations: use LUT-6 (`--lut 6` in the
+   converter) to recover ~3% at +50% size, or use the fp16 variant
+   (`mega-asr-llm-embeds_fp16.mlpackage`, ~3.2 GB) for full quality.
 ## Companion repos
+- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx) — full ONNX pipeline (GPTQ-INT4, 92.7%)
+- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx) — MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
+- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench) — browser demo (WebGPU)
 ## Credits
+- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
+- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) with a custom `inputs_embeds` patch
 - Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)