---
license: cc-by-nc-4.0
language:
  - en
  - ja
  - pl
  - mt
  - hu
  - fi
  - el
  - ta
library_name: mlx
pipeline_tag: automatic-speech-recognition
tags:
  - speech-recognition
  - phonetic-transcription
  - ipa
  - whisper
  - whisper-decoder-finetune
  - mlx
  - apple-silicon
  - multilingual
datasets:
  - mozilla-foundation/common_voice_16_1
metrics:
  - per
  - pfer
base_model: mlx-community/whisper-large-v3-mlx
model-index:
  - name: phonetic-whisper-mlx-broad-multi
    results:
      - task:
          type: automatic-speech-recognition
          name: Broad-IPA phonetic transcription (multilingual)
        dataset:
          name: Combined broad-IPA held-out validation
          type: custom
        metrics:
          - type: pfer
            value: 3.19
            name: Phone Feature Error Rate (PanPhon Hamming/24)
      - task:
          type: automatic-speech-recognition
          name: Broad-IPA phonetic transcription (TIMIT broad)
        dataset:
          name: TIMIT core test (broad)
          type: timit
        metrics:
          - type: pfer
            value: 4.70
            name: Phone Feature Error Rate
      - task:
          type: automatic-speech-recognition
          name: Zero-shot IPA transcription
        dataset:
          name: MultIPA zero-shot (Taguchi 2023)
          type: multipa
        metrics:
          - type: pfer
            value: 20.78
            name: Phone Feature Error Rate
      - task:
          type: automatic-speech-recognition
          name: Zero-shot IPA transcription (Tibeto-Burman)
        dataset:
          name: Tusom2021
          type: tusom2021
        metrics:
          - type: pfer
            value: 23.05
            name: Phone Feature Error Rate
---

# phonetic-whisper-mlx-broad-multi

Whisper-large-v3 decoder fine-tuned for **broad** International Phonetic
Alphabet (IPA) transcription across 8 languages, trained on a single
Apple Silicon machine with [MLX](https://github.com/ml-explore/mlx).

> **Companion variant:** [`phonetic-whisper-mlx-narrow-en`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-narrow-en)
> is trained on TIMIT narrow English alone and emits TIMIT-narrow phonetic
> detail. Use this `broad-multi` variant for cross-lingual broad IPA;
> use `narrow-en` for English narrow IPA.
>
> **Code:** [`barathanaslan/phonetic-whisper-mlx`](https://github.com/barathanaslan/phonetic-whisper-mlx)

## Model description

`phonetic-whisper-mlx-broad-multi` is a decoder-only fine-tune of
[`mlx-community/whisper-large-v3-mlx`](https://huggingface.co/mlx-community/whisper-large-v3-mlx).
The encoder is frozen during training; only the decoder weights are
updated. The model takes 16 kHz audio and emits broad-phonemic IPA
strings (no diacritics, merged allophones).

**Output convention.** Broad IPA, NFC-normalized. TIMIT-style closures
(`bcl`, `dcl`, `gcl`, `pcl`, `tcl`, `kcl`) and silences (`pau`, `epi`,
`h#`) are dropped, allophonic glottal stops are suppressed, syllabic
diacritics are stripped (`m̩→m`, `n̩→n`, `l̩→l`), and close allophones
are merged (`ɨ→ɪ`, `ʉ→u`, `ɦ→h`).
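
As a rough illustration of the string-level part of this convention, the sketch below applies the listed mappings in plain Python. It is not the repository's `simplify_timit_ipa.py`: the names `to_broad` and `drop_timit_nonspeech` are hypothetical, and glottal-stop suppression is left out because it depends on context that this card does not describe.

```python
import unicodedata

# TIMIT closure and silence labels that the broad convention drops.
DROPPED_LABELS = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}

# COMBINING VERTICAL LINE BELOW, the IPA syllabicity mark (as in m̩, n̩, l̩).
SYLLABIC_MARK = "\u0329"

# Merges listed in the output convention.
MERGES = {"ɨ": "ɪ", "ʉ": "u", "ɦ": "h"}


def drop_timit_nonspeech(labels):
    """Remove closure and silence labels from a TIMIT phone-label sequence."""
    return [label for label in labels if label not in DROPPED_LABELS]


def to_broad(ipa):
    """Strip the syllabic mark, apply the merges, and return NFC text."""
    # Decompose first so the syllabic mark is a separate code point.
    decomposed = unicodedata.normalize("NFD", ipa)
    merged = "".join(MERGES.get(ch, ch) for ch in decomposed if ch != SYLLABIC_MARK)
    return unicodedata.normalize("NFC", merged)


print(drop_timit_nonspeech(["h#", "sh", "iy", "dcl", "d"]))
print(to_broad("bʌtn\u0329"))
```

Only the syllabic mark is removed here rather than every combining character, so that precomposed IPA letters such as `ç` survive the NFD/NFC round trip.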

## Intended use

- Research on multilingual phonetic recognition under a uniform broad-IPA
  output convention.
- Linguistic-resource construction for the 8 trained languages
  (English, Japanese, Polish, Maltese, Hungarian, Finnish, Greek, Tamil).
- Cross-lingual zero-shot phonetic transcription as a baseline; expect
  degraded quality on languages outside the training set.

**Out of scope:** narrow phonetic transcription (use the companion
`narrow-en` for English narrow); orthographic ASR (this model emits
IPA, not text); commercial deployment without complying with the
upstream LDC TIMIT non-commercial licensing terms.

## How to use

### MLX (Apple Silicon)

```python
from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_whisper.load_models import load_model
from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
from mlx_whisper.decoding import DecodingOptions, decode
from mlx.utils import tree_flatten, tree_unflatten

# Download checkpoint weights from HF.
ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-broad-multi")

# Load Whisper-large-v3 architecture and overlay our decoder weights.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.set_dtype(mx.float32)
trained = mx.load(f"{ckpt}/model.safetensors")
decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
params = dict(tree_flatten(model.parameters()))
for k, v in decoder_weights.items():
    if k in params:
        params[k] = v
model.update(tree_unflatten(list(params.items())))

# Inference. ALWAYS pass language="en" — see Training-time language token.
audio = load_audio("your-audio.wav")
mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
mel = mx.expand_dims(mel, 0).astype(mx.float32)
features = model.encoder(mel)
result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
print(result[0].text.strip())
```

For training reproduction, see the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).

## Training data

| Source | Samples | Convention |
|---|---:|---|
| TIMIT broad (English, derived from `prepare_timit_dataset.py` + `simplify_timit_ipa.py`) | 4,158 | Broad |
| CommonVoice broad — 7 languages (ja, pl, mt, hu, fi, el, ta), Epitran-based G2P | 6,538 | Broad |
| **Total** | **10,696** | Broad |

Approximately 30 hours of audio. Held-out validation: 924 utterances
(stratified 50/50 TIMIT/CommonVoice, seed=42).

TIMIT (LDC93S1) is licensed for non-commercial research only. The
trained weights are distributed under CC BY-NC 4.0 in accordance with
this restriction; see [License](#license).

## Training procedure

Training is a decoder-only fine-tune with the encoder frozen, using AdamW with linear warmup and cosine decay in fp32, on a single Apple M3 Ultra with [MLX](https://github.com/ml-explore/mlx). Full hyperparameters, launchers, and reproduction commands are in the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).
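
For readers unfamiliar with the schedule shape, here is a minimal plain-Python sketch of linear warmup followed by cosine decay. The function name `lr_at` and every number in the example (peak rate, warmup length, total steps) are hypothetical placeholders, not the values used for this model; the real hyperparameters are in the repository.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, for illustration only.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, peak_lr=1e-5, warmup_steps=100, total_steps=1000))
```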

### Training-time language token

All training samples use `<|en|>` as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. **Pass `language="en"` at inference.**

## Evaluation

PFER (Phone Feature Error Rate) is an edit distance over phones in which
a substitution costs the Hamming distance between the two phones'
24 PanPhon articulatory features, divided by 24, and each insertion or
deletion costs 1 (Taguchi 2023 §4.2 / POWSM Table 4 rescoring convention).
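
The metric can be sketched as a weighted edit distance. The example below is an illustration under stated assumptions, not the evaluation script used for the table: the feature table `TOY_FEATURES` is made up (real vectors come from PanPhon), inputs are already segmented into phones, and the score is normalized by the total number of reference phones, which is an assumption about the aggregation.

```python
N_FEATURES = 24  # number of articulatory features per phone


def toy_vector(*flipped):
    """Made-up feature vector: all -1 except the given indices, which are +1."""
    vec = [-1] * N_FEATURES
    for i in flipped:
        vec[i] = 1
    return tuple(vec)


# Hypothetical feature table, for illustration only.
TOY_FEATURES = {
    "p": toy_vector(),
    "b": toy_vector(0),              # differs from "p" in 1 feature
    "m": toy_vector(0, 1, 2),        # differs from "p" in 3 features
    "a": toy_vector(*range(3, 15)),  # differs from "p" in 12 features
}


def substitution_cost(ref_phone, hyp_phone):
    """Hamming distance between feature vectors, divided by 24."""
    a, b = TOY_FEATURES[ref_phone], TOY_FEATURES[hyp_phone]
    return sum(x != y for x, y in zip(a, b)) / N_FEATURES


def feature_edit_distance(ref, hyp):
    """Levenshtein-style distance with feature-weighted substitutions."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j - 1] + substitution_cost(r, h),  # substitute or match
                prev[j] + 1.0,                          # delete a reference phone
                curr[j - 1] + 1.0,                      # insert a hypothesis phone
            ))
        prev = curr
    return prev[-1]


def pfer(refs, hyps):
    """PFER in percent, normalized by total reference phones (an assumption)."""
    total = sum(feature_edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total / sum(len(r) for r in refs)


print(pfer([["p", "a"]], [["b", "a"]]))       # one single-feature substitution
print(pfer([["p"]], [["p", "a", "a"]]))       # two insertions on a one-phone reference
```

Because insertions cost a full unit each, a hypothesis much longer than its reference can score above 100% under this normalization.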

| Benchmark | n | PFER (%) | Convention notes |
|---|---:|---:|---|
| Combined broad held-out validation (in-distribution) | 924 | **3.19** | TIMIT+CV stratified 50/50 |
| TIMIT broad core test (in-distribution) | 1,680 | **4.70** | Broad-on-broad |
| MultIPA zero-shot (Taguchi 2023) | — | **20.78** | Same test set as Taguchi 2023 (21.2 reported) |
| Tusom2021 (Tibeto-Burman, zero-shot) | 447 | **23.05** | Same convention as Wav2Vec2Phoneme rescored by POWSM Table 4 (31.92) |
| L2-ARCTIC PRiSM-cut | 3,599 | 14.22 | Convention-mismatched (broad model on narrow refs) |
| VoxAngeles (95 langs) | 5,446 | 19.42 | Convention-mismatched; cross-lingual stress |
| DoReCo subset (8 langs) | 3,898 | 25.18 | Convention-mismatched; cross-lingual stress |

The convention-mismatched benchmarks (L2-ARCTIC, VoxAngeles, DoReCo) are
not direct quality comparisons: they pair our broad-IPA output against
narrow human references, so the numbers reflect a known convention
penalty in addition to recognition difficulty.

## Limitations

- **Cross-lingual narrow generalization.** This model loses to
  encoder-CTC speech-to-IPA models trained on much larger corpora
  (POWSM, ZIPA, PhoneticXEUS, HuPER). The gap is structural: those
  models train on roughly 1000× more data and use language-specific
  narrow inventories, whereas this model uses a uniform broad output
  convention.
- **AR-decoder repetition.** Whisper's autoregressive decoder
  occasionally produces severe repetition hallucinations on
  out-of-distribution languages with short utterances (e.g., Bengali
  on VoxAngeles, PFER ≈ 151%, n=40, contributing ~1 absolute point to
  the aggregate VoxAngeles PFER).
- **Language coverage.** Trained on 8 languages. Performance on any
  language outside that set is zero-shot; expect convention and
  inventory penalties.
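
The "~1 absolute point" figure in the repetition bullet can be checked with simple arithmetic. The sketch below assumes the aggregate VoxAngeles PFER is an utterance-weighted mean and treats the approximate Bengali value (151%) as exact; both are assumptions, and the variable names are illustrative.

```python
# Figures quoted in this card.
bengali_n = 40
bengali_pfer = 151.0      # approximate value from the Limitations list
voxangeles_n = 5446
voxangeles_pfer = 19.42

# Share of the aggregate attributable to Bengali under utterance weighting.
contribution = bengali_pfer * bengali_n / voxangeles_n

# Aggregate over the remaining utterances if Bengali were excluded.
rest_pfer = (voxangeles_pfer * voxangeles_n - bengali_pfer * bengali_n) / (
    voxangeles_n - bengali_n
)

print(f"Bengali contribution: {contribution:.2f} points")
print(f"Aggregate without Bengali: {rest_pfer:.2f}%")
```

Under these assumptions the Bengali subset accounts for about 1.1 points of the aggregate, and removing it would lower the aggregate by a little under one point.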

## Citation

```bibtex
@software{aslan2026phonetic_whisper_mlx,
  author       = {Aslan, Barathan},
  title        = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
  year         = {2026},
  url          = {https://github.com/barathanaslan/phonetic-whisper-mlx},
  version      = {0.1.0},
  license      = {MIT (code), CC BY-NC 4.0 (weights)}
}
```

For training data:

> Garofolo, J. S., et al. *TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1.* Web download. Philadelphia: Linguistic Data Consortium, 1993.
>
> Ardila, R., Branson, M., Davis, K., et al. *Common Voice: A Massively-Multilingual Speech Corpus.* LREC 2020.

For the per-phone Hamming/24 PFER convention:

> Taguchi, C. *Universal Automatic Phonetic Transcription into the IPA.* arXiv:2308.03917, 2023.
>
> Lu et al. *POWSM: A Phonetic Open Whisper-Style Speech Foundation Model.* arXiv:2510.24992, 2025.

## License

**Trained model weights:** [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
The non-commercial restriction reflects the TIMIT (LDC93S1) data terms
inherited via training data. Commercial deployment of derivative
products may require obtaining a TIMIT For-Profit Membership from LDC;
compliance with upstream training-data licenses is the deployer's
responsibility.

**Source code:** MIT, distributed via the GitHub repository.