phonetic-whisper-mlx-broad-multi

Whisper-large-v3 decoder fine-tuned for broad International Phonetic Alphabet (IPA) transcription across 8 languages, trained on a single Apple Silicon machine with MLX.

Companion variant: phonetic-whisper-mlx-narrow-en is trained on English TIMIT narrow transcriptions alone and emits TIMIT-style narrow phonetic detail. Use this broad-multi variant for cross-lingual broad IPA; use narrow-en for English narrow IPA.

Code: barathanaslan/phonetic-whisper-mlx

Model description

phonetic-whisper-mlx-broad-multi is a decoder-only fine-tune of mlx-community/whisper-large-v3-mlx. The encoder is frozen during training; only the decoder weights are updated. The model takes 16 kHz audio and emits broad-phonemic IPA strings (no diacritics, merged allophones).

Output convention. Broad IPA, NFC-normalized: TIMIT-style closures (bcl, dcl, gcl, pcl, tcl, kcl) and silences (pau, epi, h#) are dropped, allophonic glottal stops are suppressed, combining diacritics are stripped (m̩→m, n̩→n, l̩→l), and close allophones are merged (ɨ→ɪ, ʉ→u, ɦ→h).
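As an illustration, this convention can be expressed as a small post-processing function. The function name, marker set, and symbol map below are a sketch assembled from the rules above, not the project's released pipeline code:

```python
import unicodedata

# Illustrative post-processing matching the card's output convention.
TIMIT_MARKERS = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}
SYMBOL_MAP = {"ɨ": "ɪ", "ʉ": "u", "ɦ": "h", "ʔ": ""}  # glottal stops suppressed

def to_broad_ipa(tokens):
    """Map a phone-token sequence to the broad-IPA output convention."""
    out = []
    for tok in tokens:
        if tok in TIMIT_MARKERS:  # drop closures and silences
            continue
        # NFD-decompose, drop combining diacritics (m̩ -> m), recompose to NFC.
        tok = unicodedata.normalize("NFD", tok)
        tok = "".join(c for c in tok if not unicodedata.combining(c))
        tok = unicodedata.normalize("NFC", tok)
        tok = "".join(SYMBOL_MAP.get(c, c) for c in tok)
        if tok:
            out.append(tok)
    return out
```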

Intended use

  • Research on multilingual phonetic recognition under a uniform broad-IPA output convention.
  • Linguistic-resource construction for the 8 trained languages (English, Japanese, Polish, Maltese, Hungarian, Finnish, Greek, Tamil).
  • Cross-lingual zero-shot phonetic transcription as a baseline; expect degraded quality on languages outside the training set.

Out of scope: narrow phonetic transcription (use the companion narrow-en for English narrow); orthographic ASR (this model emits IPA, not text); commercial deployment without complying with the upstream LDC TIMIT non-commercial licensing terms.

How to use

MLX (Apple Silicon)

from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_whisper.load_models import load_model
from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
from mlx_whisper.decoding import DecodingOptions, decode
from mlx.utils import tree_flatten, tree_unflatten

# Download checkpoint weights from HF.
ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-broad-multi")

# Load Whisper-large-v3 architecture and overlay our decoder weights.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.set_dtype(mx.float32)
trained = mx.load(f"{ckpt}/model.safetensors")
decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
params = dict(tree_flatten(model.parameters()))
for k, v in decoder_weights.items():
    if k in params:
        params[k] = v
model.update(tree_unflatten(list(params.items())))

# Inference. ALWAYS pass language="en" — see Training-time language token.
audio = load_audio("your-audio.wav")
mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
mel = mx.expand_dims(mel, 0).astype(mx.float32)
features = model.encoder(mel)
result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
print(result[0].text.strip())

For training reproduction, see the GitHub repository.

Training data

| Source | Samples | Convention |
| --- | --- | --- |
| TIMIT broad (English; derived via prepare_timit_dataset.py + simplify_timit_ipa.py) | 4,158 | Broad |
| CommonVoice broad, 7 languages (ja, pl, mt, hu, fi, el, ta), Epitran-based G2P | 6,538 | Broad |
| Total | 10,696 | Broad |

Roughly 30 hours of audio in total. Held-out validation: 924 utterances (stratified 50/50 between TIMIT and CommonVoice, seed=42).
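The split described above can be sketched as follows; this is an illustration of the stated 50/50 stratification with seed 42, and the repository's actual split script may differ in details:

```python
import random

def stratified_holdout(timit, cv, n_val=924, seed=42):
    """Hold out n_val utterances, split 50/50 between the two sources.

    Illustrative sketch of the card's split description, not the
    project's released split script."""
    rng = random.Random(seed)
    half = n_val // 2
    val = rng.sample(timit, half) + rng.sample(cv, n_val - half)
    val_set = set(val)
    train = [x for x in timit + cv if x not in val_set]
    return train, val
```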

TIMIT (LDC93S1) is licensed for non-commercial research only. The trained weights are distributed under CC BY-NC 4.0 in accordance with this restriction; see License.

Training procedure

Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with MLX. Full hyperparameters, launchers, and reproduction commands are in the GitHub repository.

Training-time language token

All training samples use <|en|> as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. Pass language="en" at inference.

Evaluation

PFER (Phonetic Feature Error Rate) is the per-phone Hamming distance over PanPhon's 24 articulatory features, divided by 24, with insertion/deletion cost 1 (Taguchi 2023 §4.2 / POWSM Table 4 rescoring convention).
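A minimal sketch of this metric, substituting a toy two-feature inventory for PanPhon's 24 articulatory features (so the numbers are illustrative only):

```python
# Sketch of the PFER convention: edit distance where substituting one
# phone for another costs Hamming(feature vectors)/n_features, and
# insertions/deletions cost 1. The tiny feature table is a stand-in
# for PanPhon's 24 articulatory features.
FEATS = {  # phone -> feature vector in {-1, +1}
    "p": (-1, -1), "b": (-1, +1), "a": (+1, +1),
}

def pfer(hyp, ref, feats=FEATS):
    n = len(next(iter(feats.values())))
    def sub(a, b):
        return sum(x != y for x, y in zip(feats[a], feats[b])) / n
    # Standard edit-distance DP with feature-weighted substitution cost.
    D = [[0.0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(1, len(hyp) + 1):
        D[i][0] = i
    for j in range(1, len(ref) + 1):
        D[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            D[i][j] = min(D[i - 1][j] + 1,  # deletion
                          D[i][j - 1] + 1,  # insertion
                          D[i - 1][j - 1] + sub(hyp[i - 1], ref[j - 1]))
    return D[-1][-1] / max(len(ref), 1)  # normalized by reference length

# pfer(["b", "a"], ["p", "a"]): substituting b for p flips 1 of 2
# features (cost 0.5), so PFER = 0.5 / 2 = 0.25
```

With PanPhon, `feats` would map each phone to its 24-dimensional articulatory vector; the DP itself is unchanged.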

| Benchmark | n | PFER (%) | Convention notes |
| --- | --- | --- | --- |
| Combined broad held-out validation (in-distribution) | 924 | 3.19 | TIMIT+CV stratified 50/50 |
| TIMIT broad core test (in-distribution) | 1,680 | 4.70 | Broad-on-broad |
| MultIPA zero-shot (Taguchi 2023) | — | 20.78 | Same test set as Taguchi 2023 (21.2 reported) |
| Tusom2021 (Tibeto-Burman, zero-shot) | 447 | 23.05 | Same convention as Wav2Vec2Phoneme rescored by POWSM Table 4 (31.92) |
| L2-ARCTIC PRiSM-cut | 3,599 | 14.22 | Convention-mismatched (broad model on narrow refs) |
| VoxAngeles (95 langs) | 5,446 | 19.42 | Convention-mismatched; cross-lingual stress |
| DoReCo subset (8 langs) | 3,898 | 25.18 | Convention-mismatched; cross-lingual stress |

Cross-lingual narrow benchmarks (L2-ARCTIC, VoxAngeles, DoReCo) are not direct quality comparisons — they pair our broad-IPA output against narrow human references, so the numbers reflect a known convention penalty in addition to recognition difficulty.

Limitations

  • Cross-lingual narrow generalization. This model trails encoder-CTC speech-to-IPA models trained on much larger corpora (POWSM, ZIPA, PhoneticXEUS, HuPER). The gap is structural: a roughly 1000× data-scale disadvantage, plus a uniform broad output convention versus their language-specific narrow inventories.
  • AR-decoder repetition. Whisper's autoregressive decoder occasionally produces severe repetition hallucinations on out-of-distribution languages with short utterances (e.g., Bengali on VoxAngeles, PFER ≈ 151%, n=40, contributing ~1 absolute point to the aggregate VoxAngeles PFER).
  • Language coverage. Trained on 8 languages. Performance on any language outside that set is zero-shot; expect convention and inventory penalties.
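For the repetition failure mode, a cheap post-hoc guard (not part of the released pipeline; thresholds below are illustrative, not tuned values) is to flag transcripts whose output is dominated by a single repeated n-gram:

```python
def looks_degenerate(text, n=3, threshold=0.5):
    """Flag transcripts where one repeated n-gram covers most of the
    output -- a rough screen for AR-decoder repetition loops.
    The n-gram size and threshold are illustrative defaults."""
    chars = text.replace(" ", "")
    if len(chars) < 2 * n:
        return False
    grams = [chars[i:i + n] for i in range(len(chars) - n + 1)]
    most_common = max(set(grams), key=grams.count)
    # Overlapping n-grams can push this ratio above 1; it is only a flag.
    return grams.count(most_common) * n / len(chars) > threshold
```

Flagged utterances can then be re-decoded at a different temperature or simply excluded from aggregate scoring.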

Citation

@software{aslan2026phonetic_whisper_mlx,
  author       = {Aslan, Barathan},
  title        = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
  year         = {2026},
  url          = {https://github.com/barathanaslan/phonetic-whisper-mlx},
  version      = {0.1.0},
  license      = {MIT (code), CC BY-NC 4.0 (weights)}
}

For training data:

Garofolo, J. S., et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium, 1993.

Ardila, R., Branson, M., Davis, K., et al. Common Voice: A Massively-Multilingual Speech Corpus. LREC 2020.

For the per-phone Hamming/24 PFER convention:

Taguchi, C. Universal Automatic Phonetic Transcription into the IPA. arXiv:2308.03917, 2023.

Lu et al. POWSM: A Phonetic Open Whisper-Style Speech Foundation Model. arXiv:2510.24992, 2025.

License

Trained model weights: CC BY-NC 4.0. The non-commercial restriction reflects the TIMIT (LDC93S1) data terms inherited via training data. Commercial deployment of derivative products may require obtaining a TIMIT For-Profit Membership from LDC; compliance with upstream training-data licenses is the deployer's responsibility.

Source code: MIT, distributed via the GitHub repository.
