phonetic-whisper-mlx-broad-multi
Whisper-large-v3 decoder fine-tuned for broad International Phonetic Alphabet (IPA) transcription across 8 languages, trained on a single Apple Silicon machine with MLX.
Companion variant:
phonetic-whisper-mlx-narrow-en trains on TIMIT narrow English alone and emits TIMIT-narrow phonetic detail. Use this broad-multi variant for cross-lingual broad IPA; use narrow-en for English narrow IPA.
Model description
phonetic-whisper-mlx-broad-multi is a decoder-only fine-tune of
mlx-community/whisper-large-v3-mlx.
The encoder is frozen during training; only the decoder weights are
updated. The model takes 16 kHz audio and emits broad-phonemic IPA
strings (no diacritics, merged allophones).
Output convention. Broad IPA, NFC-normalized, with the
TIMIT-style closures (bcl, dcl, gcl, pcl, tcl, kcl) and
silences (pau, epi, h#) dropped, allophonic glottal stops
suppressed, and combining diacritics stripped (m̩→m, n̩→n, l̩→l,
ɨ→ɪ, ʉ→u, ɦ→h).
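The convention above can be expressed as a small post-processing step. The sketch below is an illustrative reimplementation, not the released preprocessing code; the dropped tokens and symbol substitutions come from the convention list, and dropping every glottal stop is a simplification of "allophonic glottal stops suppressed".

```python
import unicodedata

# TIMIT-style closure and silence labels dropped by the broad convention.
DROP_TOKENS = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}
# Symbol substitutions from the convention above.
# Simplification: drops *all* glottal stops, not only allophonic ones.
CHAR_MAP = {"ɨ": "ɪ", "ʉ": "u", "ɦ": "h", "ʔ": ""}

def normalize_broad_ipa(transcript: str) -> str:
    """Apply the broad-IPA output convention to a space-separated transcript."""
    kept = [t for t in transcript.split() if t not in DROP_TOKENS]
    text = " ".join(kept)
    # Strip combining diacritics (e.g. syllabic marks: m̩ -> m) via NFD decomposition.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    mapped = "".join(CHAR_MAP.get(c, c) for c in stripped)
    # Final output is NFC-normalized.
    return unicodedata.normalize("NFC", mapped)
```

For example, normalize_broad_ipa("m̩ bcl b ɨ t") yields "m b ɪ t".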
Intended use
- Research on multilingual phonetic recognition under a uniform broad-IPA output convention.
- Linguistic-resource construction for the 8 trained languages (English, Japanese, Polish, Maltese, Hungarian, Finnish, Greek, Tamil).
- Cross-lingual zero-shot phonetic transcription as a baseline; expect degraded quality on languages outside the training set.
Out of scope: narrow phonetic transcription (use the companion
narrow-en for English narrow); orthographic ASR (this model emits
IPA, not text); commercial deployment without complying with the
upstream LDC TIMIT non-commercial licensing terms.
How to use
MLX (Apple Silicon)
```python
from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_whisper.load_models import load_model
from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
from mlx_whisper.decoding import DecodingOptions, decode
from mlx.utils import tree_flatten, tree_unflatten

# Download checkpoint weights from HF.
ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-broad-multi")

# Load the Whisper-large-v3 architecture and overlay our decoder weights.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.set_dtype(mx.float32)
trained = mx.load(f"{ckpt}/model.safetensors")
decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
params = dict(tree_flatten(model.parameters()))
for k, v in decoder_weights.items():
    if k in params:
        params[k] = v
model.update(tree_unflatten(list(params.items())))

# Inference. ALWAYS pass language="en" — see Training-time language token.
audio = load_audio("your-audio.wav")
mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
mel = mx.expand_dims(mel, 0).astype(mx.float32)
features = model.encoder(mel)
result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
print(result[0].text.strip())
```
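Note that pad_or_trim above truncates anything beyond Whisper's fixed 30-second window. For longer recordings, one simple approach (a suggestion here, not something the checkpoint prescribes) is to split the waveform into 30 s segments and run the decode step per segment:

```python
def split_into_windows(samples, sample_rate=16000, window_seconds=30.0):
    """Split a 1-D sequence of audio samples into fixed-length windows.

    The final window may be shorter; pad_or_trim will zero-pad it downstream.
    """
    window = int(sample_rate * window_seconds)
    return [samples[i:i + window] for i in range(0, len(samples), window)]
```

Each returned segment goes through the same pad_or_trim / log_mel_spectrogram / decode steps as above; concatenate the per-segment IPA strings with spaces.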
For training reproduction, see the GitHub repository.
Training data
| Source | Samples | Convention |
|---|---|---|
| TIMIT broad (English, derived from prepare_timit_dataset.py + simplify_timit_ipa.py) | 4,158 | Broad |
| CommonVoice broad — 7 languages (ja, pl, mt, hu, fi, el, ta), Epitran-based G2P | 6,538 | Broad |
| Total | 10,696 | Broad |
Approximately 30 hours of audio. Held-out validation: 924 utterances (stratified 50/50 TIMIT/CommonVoice, seed=42).
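A stratified 50/50 draw with a fixed seed can be sketched as follows; this is an illustration of the split shape only (function name and exact shuffling procedure are assumptions, not the released preprocessing code):

```python
import random

def stratified_val_split(timit_ids, cv_ids, n_val=924, seed=42):
    """Draw a validation set of n_val utterances, half TIMIT and half CommonVoice."""
    rng = random.Random(seed)
    timit_pool, cv_pool = list(timit_ids), list(cv_ids)
    rng.shuffle(timit_pool)
    rng.shuffle(cv_pool)
    half = n_val // 2
    val = timit_pool[:half] + cv_pool[:half]
    train = timit_pool[half:] + cv_pool[half:]
    return train, val
```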
TIMIT (LDC93S1) is licensed for non-commercial research only. The trained weights are distributed under CC BY-NC 4.0 in accordance with this restriction; see License.
Training procedure
Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with MLX. Full hyperparameters, launchers, and reproduction commands are in the GitHub repository.
Training-time language token
All training samples use <|en|> as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. Pass language="en" at inference.
Evaluation
PFER (Phonetic Feature Error Rate) is per-phone Hamming distance over PanPhon's 24 articulatory features ÷ 24, with insertion/deletion cost = 1 (Taguchi 2023 §4.2 / POWSM Table 4 rescoring convention).
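Concretely, PFER is a Levenshtein-style alignment where substituting one phone for another costs the fraction of the 24 articulatory features on which they disagree, while insertions and deletions cost a full 1. A minimal stdlib-only sketch (a real evaluation would take the feature vectors from PanPhon; normalizing by reference length is our assumption):

```python
def pfer(ref, hyp):
    """Phonetic Feature Error Rate (%) between two phone sequences.

    ref, hyp: sequences of 24-dimensional feature vectors (e.g. values in {-1, 0, 1}).
    Substitution cost = Hamming distance / 24; insertion/deletion cost = 1.
    """
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)                        # deletions
    for j in range(1, m + 1):
        d[0][j] = float(j)                        # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = sum(a != b for a, b in zip(ref[i - 1], hyp[j - 1])) / 24
            d[i][j] = min(d[i - 1][j] + 1,        # delete a reference phone
                          d[i][j - 1] + 1,        # insert a hypothesis phone
                          d[i - 1][j - 1] + sub)  # (partial) substitution
    return 100.0 * d[n][m] / max(n, 1)
```

Two identical sequences score 0; a hypothesis phone differing from the reference in k of 24 features contributes k/24 rather than a full error, which is what makes PFER gentler than phone error rate on near-miss confusions.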
| Benchmark | n | PFER (%) | Convention notes |
|---|---|---|---|
| Combined broad held-out validation (in-distribution) | 924 | 3.19 | TIMIT+CV stratified 50/50 |
| TIMIT broad core test (in-distribution) | 1,680 | 4.70 | Broad-on-broad |
| MultIPA zero-shot (Taguchi 2023) | — | 20.78 | Same test set as Taguchi 2023 (21.2 reported) |
| Tusom2021 (Tibeto-Burman, zero-shot) | 447 | 23.05 | Same convention as Wav2Vec2Phoneme rescored by POWSM Table 4 (31.92) |
| L2-ARCTIC PRiSM-cut | 3,599 | 14.22 | Convention-mismatched (broad model on narrow refs) |
| VoxAngeles (95 langs) | 5,446 | 19.42 | Convention-mismatched; cross-lingual stress |
| DoReCo subset (8 langs) | 3,898 | 25.18 | Convention-mismatched; cross-lingual stress |
Cross-lingual narrow benchmarks (L2-ARCTIC, VoxAngeles, DoReCo) are not direct quality comparisons — they pair our broad-IPA output against narrow human references, so the numbers reflect a known convention penalty in addition to recognition difficulty.
Limitations
- Cross-lingual narrow generalization. This model underperforms encoder-CTC speech-to-IPA models trained on much larger corpora (POWSM, ZIPA, PhoneticXEUS, HuPER). The gap is structural: a roughly 1000× data-scale difference, plus a uniform broad output convention versus their language-specific narrow inventories.
- AR-decoder repetition. Whisper's autoregressive decoder occasionally produces severe repetition hallucinations on out-of-distribution languages with short utterances (e.g., Bengali on VoxAngeles, PFER ≈ 151%, n=40, contributing ~1 absolute point to the aggregate VoxAngeles PFER).
- Language coverage. Trained on 8 languages. Performance on any language outside that set is zero-shot; expect convention and inventory penalties.
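One cheap guard against the repetition failure mode above is zlib compressibility: highly repetitive decoder output compresses far better than normal transcripts. The threshold below mirrors Whisper's default compression_ratio_threshold of 2.4; applying it to IPA output is our suggestion, not part of the released model.

```python
import zlib

def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """Flag likely repetition hallucinations via zlib compression ratio."""
    raw = text.encode("utf-8")
    if not raw:
        return False
    # Repetitive text compresses much better than varied text, so a high
    # raw/compressed ratio is a strong repetition signal.
    ratio = len(raw) / len(zlib.compress(raw))
    return ratio > threshold
```

Flagged utterances can be re-decoded with different options or excluded from aggregate scoring.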
Citation
```bibtex
@software{aslan2026phonetic_whisper_mlx,
  author  = {Aslan, Barathan},
  title   = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
  year    = {2026},
  url     = {https://github.com/barathanaslan/phonetic-whisper-mlx},
  version = {0.1.0},
  license = {MIT (code), CC BY-NC 4.0 (weights)}
}
```
For training data:
Garofolo, J. S., et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium, 1993.
Ardila, R., Branson, M., Davis, K., et al. Common Voice: A Massively-Multilingual Speech Corpus. LREC 2020.
For the per-phone Hamming/24 PFER convention:
Taguchi, C. Universal Automatic Phonetic Transcription into the IPA. arXiv:2308.03917, 2023.
Lu et al. POWSM: A Phonetic Open Whisper-Style Speech Foundation Model. arXiv:2510.24992, 2025.
License
Trained model weights: CC BY-NC 4.0. The non-commercial restriction reflects the TIMIT (LDC93S1) data terms inherited via training data. Commercial deployment of derivative products may require obtaining a TIMIT For-Profit Membership from LDC; compliance with upstream training-data licenses is the deployer's responsibility.
Source code: MIT, distributed via the GitHub repository.