Add mixed 8/4 CoreML (90.6% on VITW) — new recommended variant

Files changed (1)

README.md +102 -78

@@ -18,6 +18,8 @@ tags:
 - quantized
 - int4
 - 4bit
 - lut
 - palettize
 - on-device
@@ -33,65 +35,78 @@ base_model: zhifeixie/Mega-ASR
 base_model_relation: quantized
 ---
-# Mega-ASR — CoreML LUT-4 (end-to-end ASR)
-CoreML LUT-4 (4-bit lookup-table palettized) deployment of
-[zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR), with an
-`input_embeds`-aware decoder so audio embeddings can be scattered at
-`<|audio_pad|>` positions to do real ASR — not just text generation.
 Converted via [ANEMLL](https://github.com/Anemll/Anemll) with a custom
-`coreml_convert_embeds.py` that monkey-patches `QwenModel.forward` +
-`QwenForCausalLM.forward` to accept pre-embedded `hidden_states` (skipping the
-internal `embed_tokens` lookup). The model is single-token-step, stateful KV
-cache (28 layers × 2 × 8 KV heads × 512 ctx × 128 head_dim, fp16), LUT-4
-weights at `--per_channel 8`, and **fp32 compute precision** — `compute_precision=FLOAT16`
-overflows in Qwen3-ASR's RMSNorm/attention layers and produces NaN logits.
 ## What's in this repo
 | File | Size | Role |
 | --- | ---: | --- |
-| `coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/` | **826 MB** | **Recommended.** Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, LUT-4 weights. Pair with the ONNX audio encoder for end-to-end ASR. |
-| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Original `input_ids` variant — standalone Qwen3 1.7B text LLM (no audio scatter). |
 | `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) |
 | `tokenizer/*` | — | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
 | `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
-| `inference_asr.py` | — | End-to-end ASR pipeline: ONNX encoder + CoreML LLM |
-| `convert_embeds.py` | — | The custom converter (use to reproduce / re-quantize) |
 ## Quality (bench)
 8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
-agreement (1 − WER), prompt forced to `language English`, on M-series Mac
-CPU (CPU_AND_NE failed to compile for ANE due to model size + state):
-| Per-sample | Hyp ≈ Ref? | Agreement |
-| --- | --- | ---: |
-| distortion | exact match | 100% |
-| dropout | exact match | 100% |
-| far_field | exact match | 100% |
-| mixed | exact match | 100% |
-| noise | exact match | 100% |
-| obstructed | "i have forgotten" vs "i forgot" | 88.2% |
-| echo (hard, heavy reverb) | "size 25 stand not and the 125 walk" | 47.1% |
-| recording (hard, truncated audio) | "train stopped at the station" | 60.0% |
-| **AVERAGE** | | **86.9%** |
-For reference (same 8 samples, same audio encoder, same prompt):
 | Backend | Agreement |
 | --- | ---: |
-| ONNX recommended (GPTQ) | 92.7% |
 | MLX recommended (mixed 8/4) | 92.2% |
-| **CoreML LUT-4 (this repo)** | **86.9%** |
 | ONNX RTN INT4 baseline | 87.8% |
-LUT-4 k-means is a more aggressive quantization than ONNX GPTQ (which uses
-activation-aware error redistribution) or MLX mixed 8/4 (which keeps the
-4 attention projections at 8-bit). The roughly **6% gap** vs the leaders is
-concentrated on the 2 hard samples (`echo`, `recording`) and one near-miss
-on `obstructed`. Six of eight samples produce exact-match transcriptions.
 ## Inference
@@ -100,65 +115,74 @@ pip install coremltools onnxruntime soundfile transformers safetensors librosa n
 git clone https://huggingface.co/Reza2kn/mega-asr-coreml
 cd mega-asr-coreml
 python inference_asr.py \
-    --mlpackage coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage \
     --encoder-path onnx/audio_encoder_fp32.onnx \
     --examples-dir examples \
-    --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir>
 ```
-The pipeline runs:
 1. **Mel features** via Qwen3-ASR's `WhisperFeatureExtractor`.
 2. **Audio encoder** (ONNX fp32) → audio embeddings `(F, 2048)`.
-3. **Prompt + scatter**: build the Qwen3-ASR chat template, expand the single
-   `<|audio_pad|>` placeholder to `F` slots, lookup text embeds via the
-   original HF model's `embed_tokens` weight, scatter audio embeds in.
-4. **CoreML prefill**: feed each token's embedding one-at-a-time to populate the
-   KV cache state.
 5. **CoreML decode**: greedy step-by-step until `<|im_end|>`.
-The KV cache lives inside the CoreML model as `state`. Call `model.make_state()`
-once per request, then pass the same state object to every `predict()` call.
 ## Conversion details
-Two-step monkey-patch in `convert_embeds.py` lets ANEMLL's Qwen3 conversion
-accept pre-embedded inputs:
 ```python
-# 1. QwenModel.forward — detect float input_ids and skip embed_tokens
-qm.QwenModel.forward = model_forward_or_embeds
-# 2. QwenForCausalLM.forward — relax the 2D assert; replicate lm_head logic
-qm.QwenForCausalLM.forward = causal_forward_or_embeds
 ```
-ANEMLL's CoreML conversion then traces with a `WrapperEmbeds` module whose
-inputs are `(inputs_embeds, position_ids, causal_mask, current_pos, update_mask)`.
-`coremltools.optimize.coreml.palettize_weights` applies LUT-4 with
-`per_grouped_channel` / `group_size=8`.
-**Key compute-precision tweak**: `compute_precision=ct.precision.FLOAT32`
-in `ct.convert`. fp16 compute produces all-NaN logits on Qwen3-ASR's
-RMSNorm + attention layers — same finding as the aoiandroid community
-CoreML port. Weights stay LUT-4 (4-bit storage); only activations run fp32.
-Also patched: `coremltools/converters/mil/frontend/torch/ops.py` `_cast` op
-handler (numpy array of size 1 → extract scalar via `.flatten()[0].item()`).
-Diff lives in `convert_embeds.py` setup notes.
 ## Known limitations
-1. **CPU compute only** in practice. CoreML's ANE compiler rejects this model
-   (`MILCompilerForANE error: failed to compile ANE model using ANEF`) — likely
-   due to model size + stateful KV cache. CPU_AND_NE / ALL fail to load;
-   CPU_ONLY works and is correct. Per-token latency is ~1.5 s on CPU.
-2. **Audio encoder is ONNX**. The 24-layer Whisper-style encoder hasn't been
-   ported to CoreML (ANEMLL is LLM-only). End-to-end inference runs the
    encoder via `onnxruntime` and the LLM via `coremltools`.
-3. **Quality below ONNX/MLX** at 4-bit due to LUT-4 k-means being weaker than
-   GPTQ on this architecture. Mitigations: use LUT-6 (`--lut 6` in the
-   converter) to recover ~3% at +50% size, or use the fp16 variant
-   (`mega-asr-llm-embeds_fp16.mlpackage`, ~3.2 GB) for full quality.
 ## Companion repos
@@ -169,5 +193,5 @@ Diff lives in `convert_embeds.py` setup notes.
 ## Credits
 - Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
-- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) with a custom input_embeds patch
 - Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)

 - quantized
 - int4
 - 4bit
+- 8bit
+- mixed-precision
 - lut
 - palettize
 - on-device
 base_model_relation: quantized
 ---
+# Mega-ASR — CoreML mixed 8/4 (end-to-end ASR)
+CoreML deployment of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR)
+(Qwen3-ASR-1.7B base) with an **`input_embeds`-aware decoder** so audio
+embeddings can be scattered at `<|audio_pad|>` positions to do real ASR —
+not just text generation.
 Converted via [ANEMLL](https://github.com/Anemll/Anemll) with a custom
+`convert_embeds_mixed.py` that:
+1. Monkey-patches `QwenModel.forward` + `QwenForCausalLM.forward` to accept
+   pre-embedded `hidden_states` (skipping the internal `embed_tokens`
+   lookup) so audio scatter works at inference.
+2. Enumerates the MIL program's const-weight ops by name pattern and applies
+   **LUT-8 palettization to attention projections** (q/k/v/o_proj) and
+   **LUT-4 to MLP projections** (gate/up/down_proj) — mirroring the MLX
+   `mixed8_4` recipe that closed the gap to GPTQ on the LLM portion.
+3. Runs `compute_precision=FLOAT32` — fp16 compute precision produces
+   all-NaN logits on Qwen3-ASR's RMSNorm/attention (matches the aoiandroid
+   community finding for the same base model).
 ## What's in this repo
 | File | Size | Role |
 | --- | ---: | --- |
+| `coreml/mega-asr-llm-embeds_mixed8_4.mlpackage/` | **1.87 GB** | **Recommended.** Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, 8-bit attn + 4-bit MLP, ~5.0 bpw avg. |
+| `coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/` | 826 MB | Smaller variant. Uniform LUT-4 weights; agreement is 3.7 points lower than mixed 8/4 (86.9% vs 90.6%). |
+| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Standalone Qwen3 text LLM with `input_ids` input (no audio scatter). |
 | `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) |
 | `tokenizer/*` | — | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
 | `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
+| `inference_asr.py` | — | End-to-end ASR pipeline (ONNX encoder + CoreML LLM) |
+| `convert_embeds.py` / `convert_embeds_mixed.py` | — | The custom converters |
 ## Quality (bench)
 8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
+agreement (1 − WER), prompt forced to `language English`, ONNX fp32
+audio encoder + the CoreML LLM, run with `compute_units=ALL` (which
+executes on the Metal GPU, since ANE compilation fails on this model
+size + stateful KV cache):
+| Per-sample | Mixed 8/4 (recommended) | Uniform LUT-4 |
+| --- | ---: | ---: |
+| distortion | 100% | 100% |
+| dropout | 100% | 100% |
+| echo (hard, heavy reverb) | **64.7%** | 47.1% |
+| far_field | 100% | 100% |
+| mixed | 100% | 100% |
+| noise | 100% | 100% |
+| obstructed | **100%** | 88.2% |
+| recording (hard, truncated audio) | 60.0% | 60.0% |
+| **AVERAGE** | **90.6%** | 86.9% |
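For readers who want to reproduce the "agreement (1 − WER)" numbers, here is a minimal sketch of the metric and of the column averages. It assumes plain whitespace tokenization with no text normalization, which the README does not specify, and the toy sentence pair is made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference and whitespace tokenization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # drop a reference word
                            curr[j - 1] + 1,          # extra hypothesis word
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    return 1.0 - word_error_rate(reference, hypothesis)


# Per-sample agreement (%) copied from the table above, in row order.
mixed_8_4 = [100, 100, 64.7, 100, 100, 100, 100, 60.0]
uniform_lut4 = [100, 100, 47.1, 100, 100, 100, 88.2, 60.0]

mean_mixed = sum(mixed_8_4) / len(mixed_8_4)
mean_lut4 = sum(uniform_lut4) / len(uniform_lut4)

# Made-up example: one missing word out of four reference words.
toy = agreement("turn the lights off", "turn lights off")

print(f"mixed 8/4: {mean_mixed:.1f}%  uniform LUT-4: {mean_lut4:.1f}%  toy: {toy:.2f}")
```

Rounded to one decimal, the two means match the AVERAGE row of the table.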
+Mixed 8/4 lifts CoreML from 86.9% → 90.6% (+3.7 points) by allocating the 4
+attention projections per layer to LUT-8 (256 unique values per group of 8
+channels) while keeping the 3 MLP projections at LUT-4 (16 unique values
+per group of 8 channels). Attention layers in Qwen3 are quality-critical,
+the same result we found in the MLX port.
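The palette sizes and the "~5.0 bpw avg" figure can be checked with a little arithmetic. This is a rough sketch under stated assumptions: hidden size 2048, 8 KV heads and head_dim 128 appear elsewhere in this README, while 16 query heads and an MLP width of 6144 are assumed values for the Qwen3 1.7B configuration. It counts only the palettized projection weights and ignores lookup-table storage, embeddings and `lm_head`.

```python
HIDDEN = 2048        # embedding width, from the (F, 2048) audio embeddings
HEAD_DIM = 128       # from the KV cache shape
N_KV_HEADS = 8       # from the KV cache shape
N_Q_HEADS = 16       # assumption, not stated in the README
INTERMEDIATE = 6144  # assumption, not stated in the README


def palette_size(nbits: int) -> int:
    """Number of distinct values an n-bit lookup table can hold."""
    return 2 ** nbits


# Weights per transformer layer in each bucket.
attn_params = (
    HIDDEN * N_Q_HEADS * HEAD_DIM          # q_proj
    + 2 * HIDDEN * N_KV_HEADS * HEAD_DIM   # k_proj + v_proj
    + N_Q_HEADS * HEAD_DIM * HIDDEN        # o_proj
)
mlp_params = 3 * HIDDEN * INTERMEDIATE     # gate_proj, up_proj, down_proj

# Weighted average of 8-bit attention indices and 4-bit MLP indices.
avg_bits = (8 * attn_params + 4 * mlp_params) / (attn_params + mlp_params)

print(palette_size(8), palette_size(4), avg_bits)
```

Under these assumptions the attention projections hold a quarter of the per-layer weights, which is why doubling their bit width moves the average from 4 to about 5 bits per weight.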
+Cross-backend leaderboard (same 8 samples, same audio encoder):
 | Backend | Agreement |
 | --- | ---: |
+| ONNX recommended (GPTQ INT4) | 92.7% |
 | MLX recommended (mixed 8/4) | 92.2% |
+| **CoreML recommended (mixed 8/4)** | **90.6%** |
+| CoreML LUT-4 baseline | 86.9% |
 | ONNX RTN INT4 baseline | 87.8% |
+The remaining ~2-point gap to ONNX/MLX comes from the LUT-vs-GPTQ scheme
+difference (k-means clustering vs activation-aware Hessian redistribution).
+The two hard samples (`echo`, `recording`) are audio-quality-limited and
+stay around 60-65% across all 4-bit backends.
 ## Inference
 git clone https://huggingface.co/Reza2kn/mega-asr-coreml
 cd mega-asr-coreml
 python inference_asr.py \
+    --mlpackage coreml/mega-asr-llm-embeds_mixed8_4.mlpackage \
     --encoder-path onnx/audio_encoder_fp32.onnx \
     --examples-dir examples \
+    --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir> \
+    --compute-unit ALL
 ```
+The pipeline:
 1. **Mel features** via Qwen3-ASR's `WhisperFeatureExtractor`.
 2. **Audio encoder** (ONNX fp32) → audio embeddings `(F, 2048)`.
+3. **Prompt + scatter**: build the Qwen3-ASR chat template with English
+   forcing, expand the single `<|audio_pad|>` placeholder to F slots,
+   lookup text embeds via the HF model's `embed_tokens` weight, scatter
+   audio embeds at the placeholder positions.
+4. **CoreML prefill**: feed each token's embedding one-at-a-time to
+   populate the in-model KV cache state.
 5. **CoreML decode**: greedy step-by-step until `<|im_end|>`.
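Step 3 (placeholder expansion and scatter) can be sketched in a few lines of plain Python. This is an illustration only, not the code in `inference_asr.py`: the placeholder id, the tiny embedding table and the function names are all hypothetical, and real embeddings would be 2048-wide arrays rather than short lists.

```python
AUDIO_PAD_ID = -1  # hypothetical id; the real one comes from the tokenizer


def expand_audio_pad(token_ids, num_frames, pad_id=AUDIO_PAD_ID):
    """Replace the single placeholder token with `num_frames` copies."""
    if token_ids.count(pad_id) != 1:
        raise ValueError("expected exactly one audio placeholder")
    i = token_ids.index(pad_id)
    return token_ids[:i] + [pad_id] * num_frames + token_ids[i + 1:]


def build_input_embeds(token_ids, embed_table, audio_embeds, pad_id=AUDIO_PAD_ID):
    """Look up text embeddings and scatter audio frames into placeholder slots."""
    if token_ids.count(pad_id) != len(audio_embeds):
        raise ValueError("placeholder count must equal the number of audio frames")
    frames = iter(audio_embeds)
    return [next(frames) if tok == pad_id else embed_table[tok] for tok in token_ids]


# Toy data: a 3-token vocabulary with 2-dimensional embeddings, 3 audio frames.
embed_table = {0: [0.0, 0.0], 1: [1.0, 1.0], 2: [2.0, 2.0]}
audio_embeds = [[9.0, 9.1], [9.2, 9.3], [9.4, 9.5]]

prompt = [0, AUDIO_PAD_ID, 2]
expanded = expand_audio_pad(prompt, num_frames=len(audio_embeds))
input_embeds = build_input_embeds(expanded, embed_table, audio_embeds)
```

Each row of `input_embeds` is then what the prefill step feeds to the LLM, one position at a time.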
+The KV cache lives inside the CoreML model as `state`. Call
+`model.make_state()` once per request, then thread the same state object
+through every `predict()` call.
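The state-threading pattern looks roughly like the sketch below. It is written against any object that offers `make_state()` and `predict(inputs, state=...)`, which is the shape of the `coremltools` stateful-model API; the input and output key names (`inputs_embeds`, `current_pos`, `logits`) are placeholders, the real model takes additional inputs such as the causal mask, and `ScriptedModel` is a stand-in used only to make the example runnable.

```python
def greedy_transcribe(model, prompt_embeds, embed_token, eos_id, max_new_tokens=64):
    """Prefill one embedding per call, then decode greedily until EOS."""
    state = model.make_state()  # one fresh KV cache per request
    pos, logits = 0, None
    for emb in prompt_embeds:   # prefill: populate the cache token by token
        out = model.predict({"inputs_embeds": emb, "current_pos": pos}, state=state)
        logits, pos = out["logits"], pos + 1
    tokens = []
    for _ in range(max_new_tokens):
        tok = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if tok == eos_id:
            break
        tokens.append(tok)
        out = model.predict({"inputs_embeds": embed_token(tok), "current_pos": pos},
                            state=state)
        logits, pos = out["logits"], pos + 1
    return tokens


class ScriptedModel:
    """Stand-in model: emits a fixed token script once the prompt is consumed."""

    def __init__(self, script, prompt_len, vocab_size):
        self.script, self.prompt_len, self.vocab_size = script, prompt_len, vocab_size
        self.last_state = None

    def make_state(self):
        self.last_state = {"calls": 0}
        return self.last_state

    def predict(self, inputs, state):
        state["calls"] += 1  # the "cache" here is just a call counter
        step = max(state["calls"] - self.prompt_len, 0)
        target = self.script[min(step, len(self.script) - 1)]
        return {"logits": [1.0 if t == target else 0.0 for t in range(self.vocab_size)]}


EOS = 3
fake = ScriptedModel(script=[2, 0, 1, EOS], prompt_len=3, vocab_size=4)
decoded = greedy_transcribe(fake, [[0.0]] * 3, lambda t: [float(t)], eos_id=EOS)
```

Creating the state inside the function keeps one request's cache from leaking into the next.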
 ## Conversion details
 ```python
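+# Apply per-op-name palettize: attention at LUT-8, MLP at LUT-4.
+from coremltools.optimize.coreml import (
+    OpPalettizerConfig, OptimizationConfig, palettize_weights)
+
+attn_ops, mlp_ops = [], []  # names of the const ops holding each weight
+prog = mlmodel._mil_program  # `mlmodel` is the converted ct.models.MLModel
+for op in prog.functions["main"].operations:
+    if op.op_type != "const":
+        continue
+    n = op.name.lower()
+    # Attention projections (q/k/v/o) go to the 8-bit bucket.
+    if "self_attn" in n and any(p in n for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
+        attn_ops.append(op.name)
+    # MLP projections (gate/up/down) go to the 4-bit bucket.
+    elif "mlp" in n and any(p in n for p in ("gate_proj", "up_proj", "down_proj")):
+        mlp_ops.append(op.name)
+config = OptimizationConfig(op_name_configs={
+    **{n: OpPalettizerConfig(nbits=8, group_size=8) for n in attn_ops},
+    **{n: OpPalettizerConfig(nbits=4, group_size=8) for n in mlp_ops},
+})
+mlmodel = palettize_weights(mlmodel, config)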
 ```
+The model exposes 84 attention weight ops (28 layers × 3 attention
+projections after the GQA-shared k/v gets clustered into k+v ops) and
+84 MLP weight ops (28 layers × 3 MLP projections).
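Before palettizing, it can help to count how many op names land in each bucket and compare that with the expected per-layer totals. The sketch below applies the same name-pattern rule to a list of strings. The op names are hypothetical (real MIL const names depend on how the graph was traced), and only two layers are generated to keep the output small.

```python
from collections import Counter

ATTN_PROJS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_PROJS = ("gate_proj", "up_proj", "down_proj")


def bucket_for(op_name):
    """Return the palettization bucket for a const op name, or None to skip it."""
    n = op_name.lower()
    if "self_attn" in n and any(p in n for p in ATTN_PROJS):
        return "attn_lut8"
    if "mlp" in n and any(p in n for p in MLP_PROJS):
        return "mlp_lut4"
    return None


# Hypothetical const-op names for a 2-layer toy model.
names = [
    f"model_layers_{layer}_{block}_{proj}_weight"
    for layer in range(2)
    for block, projs in (("self_attn", ATTN_PROJS), ("mlp", MLP_PROJS))
    for proj in projs
]
names += ["model_norm_weight", "lm_head_weight"]  # should stay unpalettized

counts = Counter(bucket_for(n) for n in names)
print(counts)
```

A bucket count that differs from what the architecture suggests usually means some weights were fused, renamed or split during tracing, and is worth inspecting before trusting the resulting bit allocation.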
+`compute_precision=FLOAT32` is mandatory — fp16 compute on Qwen3-ASR
+produces all-NaN logits (RMSNorm + attention score overflow).
+A `coremltools` local patch was needed in
+`coremltools/converters/mil/frontend/torch/ops.py` `_cast`: numpy arrays
+of size 1 need to be coerced to scalar via `.flatten()[0].item()` before
+the dtype call — see `convert_embeds_mixed.py` setup notes.
 ## Known limitations
+1. **ANE rejected**. CoreML's ANE compiler fails (`MILCompilerForANE
+   error: failed to compile ANE model using ANEF`) — likely due to model
+   size + stateful KV cache. `CPU_AND_NE` fails to load. `ALL` runs on
+   **Metal GPU** (correct + ~3-4× faster than `CPU_ONLY`), which is the
+   recommended setting.
+2. **Audio encoder is ONNX**. The 24-layer Whisper-style encoder isn't
+   ported to CoreML yet (ANEMLL is LLM-only). End-to-end runs the
    encoder via `onnxruntime` and the LLM via `coremltools`.
+3. **Quality below ONNX/MLX** by ~2% at 4-bit, due to LUT k-means being
+   weaker than GPTQ on this architecture. The uniform LUT-4 variant is
+   smaller (826 MB) if size is critical; the mixed 8/4 (1.87 GB) is
+   recommended for best quality.
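Because some compute units fail to load for this model, a caller may want to try them in order of preference. A minimal sketch, assuming a failed load raises an exception; `load_with_fallback` and the fake loader are hypothetical helpers, not part of `inference_asr.py`.

```python
def load_with_fallback(load, preferences):
    """Return (unit_name, model) for the first compute unit that loads."""
    errors = {}
    for unit in preferences:
        try:
            return unit, load(unit)
        except Exception as exc:  # assumption: a failed compile/load raises
            errors[unit] = exc
    raise RuntimeError(f"no compute unit could load the model: {errors}")


def fake_load(unit):
    """Stand-in loader that mimics the behaviour reported above."""
    if unit == "CPU_AND_NE":
        raise RuntimeError("failed to compile ANE model")
    return f"model on {unit}"


chosen_unit, model = load_with_fallback(fake_load, ["CPU_AND_NE", "ALL", "CPU_ONLY"])
```

With `coremltools`, the loader could be something like `lambda unit: ct.models.MLModel(path, compute_units=getattr(ct.ComputeUnit, unit))`, where `path` points at the `.mlpackage`.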
 ## Companion repos
 ## Credits
 - Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
+- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) with custom input_embeds + mixed-precision patches
 - Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)