Add README

README.md +146 -0

	@@ -0,0 +1,146 @@

+---
+license: apache-2.0
+language:
+- en
+- zh
+- ja
+- ko
+- multilingual
+library_name: coremltools
+tags:
+- coreml
+- ane
+- apple-neural-engine
+- automatic-speech-recognition
+- asr
+- speech-recognition
+- robust-asr
+- quantized
+- int4
+- 4bit
+- lut
+- palettize
+- on-device
+- apple-silicon
+- ios
+- macos
+- qwen3
+- qwen3-asr
+- mega-asr
+- anemll
+pipeline_tag: automatic-speech-recognition
+base_model: zhifeixie/Mega-ASR
+base_model_relation: quantized
+---
+# Mega-ASR — CoreML LUT-4 (Apple Neural Engine)
+CoreML LUT-4 (4-bit lookup-table palettized) export of the LLM portion of
+[zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (Qwen3-ASR-1.7B
+base), produced with [ANEMLL](https://github.com/Anemll/Anemll), a toolkit for
+porting LLMs to the Apple Neural Engine, using
+`--chunk 4 --lut 4 --context-length 512`.
+The resulting `.mlpackage` is a stateful CoreML model with ANE-native
+attention layouts, an in-model KV cache state, and a 16-way split LM head for
+efficient ANE residency.
+## What's in this repo
+| File | Size | Role |
+| --- | ---: | --- |
+| `coreml/mega-asr-llm_lut4.mlpackage/` | **974 MB** | Qwen3 1.7B LLM, ANE-targeted, LUT-4 palettized weights, stateful KV cache |
+| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime — CoreML port pending) |
+| `tokenizer/*` | — | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
+| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
+## Model I/O
+The `mega-asr-llm_lut4.mlpackage` follows ANEMLL's stateful step-decoder layout:
+**Inputs** (single-token step):
+| name | shape | dtype |
+| --- | --- | --- |
+| `input_ids` | `(1, 1)` | int32 |
+| `position_ids` | `(1,)` | int32 |
+| `causal_mask` | `(1, 1, 1, 512)` | float16 |
+| `current_pos` | `(1,)` | int32 |
+| `update_mask` | `(1, 1, 512, 1)` | float16 |
+**Outputs**: `logits1` … `logits16`, each `(1, 1, 9496)` float16. Concatenate
+them along the last axis to get logits over the full 151936-token vocabulary
+(16 × 9496 = 151936).
+**State**: `model_model_kv_cache_0`, shape `(56, 8, 512, 128)` float16. The
+first axis is 28 layers × 2 (K and V); the remaining axes are 8 KV heads,
+512 max context positions, and 128 head dim. Create it with
+`model.make_state()` and pass it to every `predict()` call.
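As a quick sanity check on the state shape above, the following sketch works out how much memory one KV-cache state occupies. It uses only the numbers stated in this section (28 layers, 8 KV heads, 512 context positions, 128 head dim, float16); the variable names are illustrative and not taken from ANEMLL.

```python
import numpy as np

# Dimensions as listed for `model_model_kv_cache_0` above.
num_layers, kv_heads, context_len, head_dim = 28, 8, 512, 128

# Keys and values are stacked along the first axis: 28 layers x 2 = 56.
state_shape = (num_layers * 2, kv_heads, context_len, head_dim)

# float16 stores each element in 2 bytes.
num_elements = int(np.prod(state_shape))
num_bytes = num_elements * np.dtype(np.float16).itemsize

print(state_shape)               # (56, 8, 512, 128)
print(num_bytes / 2**20, "MiB")  # 56.0 MiB
```

Each state object returned by `make_state()` therefore holds about 56 MiB of cache on top of the model weights.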
+## Quick run (Python)
+```python
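+import coremltools as ct
+import numpy as np
+
+CONTEXT_LEN = 512  # must match the --context-length used at conversion
+
+m = ct.models.MLModel("coreml/mega-asr-llm_lut4.mlpackage",
+                     compute_units=ct.ComputeUnit.CPU_AND_NE)
+state = m.make_state()  # fresh KV cache state
+
+# Assumption: additive attention mask (0 = attend, -inf = blocked), so at
+# position 0 only slot 0 is visible and the empty cache slots are ignored.
+causal_mask = np.full((1, 1, 1, CONTEXT_LEN), -np.inf, dtype=np.float16)
+causal_mask[..., :1] = 0.0
+
+out = m.predict({
+    "input_ids":    np.array([[40]], dtype=np.int32),          # token 'I'
+    "position_ids": np.array([0], dtype=np.int32),
+    "causal_mask":  causal_mask,
+    "current_pos":  np.array([0], dtype=np.int32),
+    "update_mask":  np.zeros((1, 1, CONTEXT_LEN, 1), dtype=np.float16),
+}, state=state)
+
+# Stitch the 16 LM-head slices back into one (151936,) logit vector.
+all_logits = np.concatenate([out[f"logits{i}"][0, 0] for i in range(1, 17)])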
+```
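The quick-run call feeds a single token at position 0. A multi-token greedy loop could be sketched as below. This is a sketch under assumptions, not ANEMLL's own driver: it assumes an additive causal mask (0 = attend, `-inf` = blocked), that `position_ids` and `current_pos` both carry the index of the token being fed, and it leaves `update_mask` at zeros as the quick-run example does. `build_step_inputs`, `next_token`, and `greedy_decode` are hypothetical helper names, and `predict_fn` stands in for a call such as `lambda x: m.predict(x, state=state)`.

```python
import numpy as np

CONTEXT_LEN = 512   # matches --context-length 512
NUM_HEADS = 16      # outputs logits1 ... logits16


def build_step_inputs(token_id, pos, context_len=CONTEXT_LEN):
    """Build the input dict for feeding one token at position `pos`."""
    # Assumption: additive mask, 0 = attend, -inf = blocked.
    mask = np.full((1, 1, 1, context_len), -np.inf, dtype=np.float16)
    mask[..., : pos + 1] = 0.0
    return {
        "input_ids": np.array([[token_id]], dtype=np.int32),
        "position_ids": np.array([pos], dtype=np.int32),
        "causal_mask": mask,
        "current_pos": np.array([pos], dtype=np.int32),
        # Left at zeros, as in the quick-run example.
        "update_mask": np.zeros((1, 1, context_len, 1), dtype=np.float16),
    }


def next_token(out):
    """Concatenate the 16 logit slices and return the argmax token id."""
    logits = np.concatenate(
        [out[f"logits{i}"][0, 0] for i in range(1, NUM_HEADS + 1)]
    )
    return int(np.argmax(logits))


def greedy_decode(predict_fn, prompt_ids, max_new_tokens, eos_id=None):
    """Feed the prompt one token at a time, then generate greedily."""
    if not prompt_ids:
        raise ValueError("prompt_ids must contain at least one token")
    out = None
    for pos, tok in enumerate(prompt_ids):
        out = predict_fn(build_step_inputs(tok, pos))
    generated = []
    pos = len(prompt_ids)
    while len(generated) < max_new_tokens and pos < CONTEXT_LEN:
        tok = next_token(out)
        if eos_id is not None and tok == eos_id:
            break
        generated.append(tok)
        out = predict_fn(build_step_inputs(tok, pos))
        pos += 1
    return generated
```

With the model loaded as in the quick run, usage would look like `greedy_decode(lambda x: m.predict(x, state=state), prompt_ids, 32)`, where `prompt_ids` comes from the bundled tokenizer. Feeding the prompt token by token is the simplest approach for a single-token step interface, at the cost of one `predict()` call per prompt token.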
+## ASR limitation (current)
+This conversion exports the **standard text-LLM interface** (`input_ids` →
+internal `embed_tokens` → forward). End-to-end ASR requires scattering
+**audio embeddings** at `<|audio_pad|>` placeholder positions, which means
+the model needs to accept `input_embeddings` *instead of* `input_ids`.
+That requires forking ANEMLL's `qwen_model.py` to expose pre-embedded
+hidden states as the entry point, then re-running the conversion. (See
+[`aoiandroid/Qwen3-ASR-1.7B-CoreML`](https://huggingface.co/aoiandroid/Qwen3-ASR-1.7B-CoreML)
+for a prior community attempt at the same pattern; its decoder is named
+`qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage` and pairs with a
+separately stored `qwen3_asr_embeddings.bin`.)
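The scattering step described above can be illustrated in NumPy. This is a minimal sketch with toy sizes, not the Mega-ASR implementation: `scatter_audio_embeddings` is a hypothetical helper, `AUDIO_PAD_ID` is a made-up placeholder id (the real id comes from the tokenizer), the embedding table and audio frames are random stand-ins, and it assumes one audio frame per `<|audio_pad|>` token.

```python
import numpy as np


def scatter_audio_embeddings(input_ids, embed_table, audio_embeds, audio_pad_id):
    """Look up text embeddings, then overwrite placeholder rows with audio frames."""
    input_ids = np.asarray(input_ids)
    # Fancy indexing returns a new array, so the table itself is not modified.
    hidden = embed_table[input_ids]                      # (seq_len, hidden_dim)
    pad_positions = np.flatnonzero(input_ids == audio_pad_id)
    if len(pad_positions) != len(audio_embeds):
        raise ValueError(
            f"{len(pad_positions)} placeholders but {len(audio_embeds)} audio frames"
        )
    hidden[pad_positions] = audio_embeds
    return hidden


# Toy example: vocabulary of 10 tokens, hidden size 4, three audio frames.
rng = np.random.default_rng(0)
AUDIO_PAD_ID = 9  # hypothetical id, for illustration only
embed_table = rng.standard_normal((10, 4)).astype(np.float16)
audio_embeds = rng.standard_normal((3, 4)).astype(np.float16)
ids = [1, AUDIO_PAD_ID, AUDIO_PAD_ID, AUDIO_PAD_ID, 2]

hidden = scatter_audio_embeddings(ids, embed_table, audio_embeds, AUDIO_PAD_ID)
print(hidden.shape)  # (5, 4)
```

The resulting `hidden` array is what an embeddings-input decoder would consume in place of `input_ids`, which is why the token-id interface exported here cannot carry audio.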
+Until the `input_embeddings` variant lands, this artifact is usable as:
+- A standalone Qwen3 1.7B CoreML LLM (e.g. text-only chat with the same
+  prompt format the base model expects).
+- A starting point for building an ANE-targeted Mega-ASR pipeline by
+  re-converting with the embedding bypass.
+## Conversion details
+```bash
+# After cloning ANEMLL (https://github.com/Anemll/Anemll):
+python -m anemll.ane_converter.qwen_converter \
+  --model /path/to/Qwen3-ASR-1.7B-llm-only \
+  --prefix mega-asr-llm --lut 4 \
+  --context-length 512 --batch-size 64 --chunk 4 \
+  --output /path/to/out
+```
+The Qwen3-ASR-1.7B LLM weights were first extracted from `zhifeixie/Mega-ASR`
+by stripping the `thinker.model.` prefix and dropping the tied lm_head
+(see [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx) for
+the extraction script).
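The key renaming described above can be sketched on a plain dictionary. This is an illustration only, not the linked extraction script: `extract_llm_state_dict` is a hypothetical helper, the example key names are made up, and skipping every key outside the `thinker.model.` prefix is an assumption about how the checkpoint is laid out.

```python
def extract_llm_state_dict(state_dict, prefix="thinker.model.", drop_substring="lm_head"):
    """Keep only LLM weights, strip the wrapper prefix, and drop the tied LM head."""
    extracted = {}
    for key, tensor in state_dict.items():
        if drop_substring in key:
            continue  # tied to the embedding matrix, so it is not stored separately
        if key.startswith(prefix):
            extracted[key[len(prefix):]] = tensor
        # Keys outside the prefix (assumed to be non-LLM modules) are skipped.
    return extracted


# Toy checkpoint with hypothetical key names and placeholder values.
checkpoint = {
    "thinker.model.layers.0.self_attn.q_proj.weight": "W_q",
    "thinker.model.embed_tokens.weight": "W_embed",
    "thinker.lm_head.weight": "W_head",
    "thinker.audio_tower.conv1.weight": "W_audio",
}
llm_weights = extract_llm_state_dict(checkpoint)
print(sorted(llm_weights))
# ['embed_tokens.weight', 'layers.0.self_attn.q_proj.weight']
```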
+coremltools 9.0 needed one local patch: the `_cast` op handler in
+`coremltools/converters/mil/frontend/torch/ops.py` does not handle NumPy
+arrays of size 1. The fix extracts the scalar via `.flatten()[0].item()`
+before the dtype coercion.
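The idea behind that patch can be shown in isolation. The snippet below is not the coremltools source and does not reproduce the `_cast` handler; `to_python_scalar` is a hypothetical helper that applies the same `.flatten()[0].item()` extraction to any size-1 NumPy array.

```python
import numpy as np


def to_python_scalar(value):
    """Return a Python scalar for size-1 NumPy arrays; pass other values through."""
    if isinstance(value, np.ndarray) and value.size == 1:
        # flatten() handles any rank, e.g. shapes (), (1,), or (1, 1).
        return value.flatten()[0].item()
    return value


print(to_python_scalar(np.array([3])))       # 3
print(to_python_scalar(np.array([[2.5]])))   # 2.5
print(to_python_scalar(7))                   # 7
```

After this extraction, a plain `int(...)` or `float(...)` coercion works regardless of whether the converter was handed a bare scalar or a one-element array.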
+## Companion repos
+- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx) — full ONNX pipeline (GPTQ-INT4 decoder, 92.7% on VITW)
+- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx) — MLX 4-bit (mixed8/4 attention/MLP, 92.2% on VITW)
+- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench) — live browser demo (WebGPU)
+## Credits
+- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
+- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) (Apple Neural Engine LLM port toolkit)
+- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)