phonetic-whisper-mlx-narrow-en
Whisper-large-v3 decoder fine-tuned for narrow International Phonetic Alphabet (IPA) transcription of English, trained on TIMIT alone using MLX on a single Apple Silicon machine.
Companion variant:
phonetic-whisper-mlx-broad-multi trains on TIMIT broad + CommonVoice broad in 7 languages and emits broad-phonemic IPA. Use this narrow-en variant for English narrow phonetic detail; use broad-multi for cross-lingual broad IPA.
Model description
phonetic-whisper-mlx-narrow-en is a decoder-only fine-tune of
mlx-community/whisper-large-v3-mlx.
The encoder is frozen during training; only the decoder weights are
updated. The model takes 16 kHz English audio and emits TIMIT-narrow
IPA strings.
Output convention. TIMIT-narrow IPA, NFC-normalized, with the
TIMIT-style closures (bcl, dcl, gcl, pcl, tcl, kcl) and
silences (pau, epi, h#) dropped. The remaining 52-symbol
inventory preserves narrow distinctions such as the glottal stop ʔ,
the flap ɾ, syllabic consonants (m̩, n̩, l̩, ŋ̍),
r-coloured vowels (ɝ, ɚ), the reduced vowel ɨ, the devoiced
schwa ə̥, the fronted ʉ, the voiced glottal ɦ, and the nasal
flap ɾ̃.
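If you compare this model's output against your own reference transcripts, apply the same normalization on both sides. A minimal sketch; the helper name is mine, not part of the repository:

```python
import unicodedata

def normalize_ipa(s: str) -> str:
    # Match the model's output convention: Unicode NFC plus whitespace trimming,
    # so symbols carrying combining diacritics (m̩, ɾ̃, ə̥) compare consistently.
    return unicodedata.normalize("NFC", s).strip()

print(normalize_ipa(" ə ɾ̃ m̩ "))
```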
Intended use
- Research on Whisper-decoder fine-tuning for narrow phonetic transcription of English.
- Generation of TIMIT-style IPA transcripts for English speech corpora.
- Comparison work against this checkpoint on TIMIT-narrow conventions.
Out of scope: broad-IPA transcription (use the companion
broad-multi variant); non-English input (this model has only seen
TIMIT-style narrow English transcriptions); orthographic ASR; cross-lingual phonetic
recognition; commercial deployment without complying with the upstream
LDC TIMIT non-commercial licensing terms.
How to use
MLX (Apple Silicon)
```python
from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_whisper.load_models import load_model
from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
from mlx_whisper.decoding import DecodingOptions, decode
from mlx.utils import tree_flatten, tree_unflatten

# Download checkpoint weights from HF.
ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-narrow-en")

# Load Whisper-large-v3 architecture and overlay our decoder weights.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.set_dtype(mx.float32)
trained = mx.load(f"{ckpt}/model.safetensors")
decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
params = dict(tree_flatten(model.parameters()))
for k, v in decoder_weights.items():
    if k in params:
        params[k] = v
model.update(tree_unflatten(list(params.items())))

# Inference. ALWAYS pass language="en" — see Training-time language token.
audio = load_audio("your-english-audio.wav")
mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
mel = mx.expand_dims(mel, 0).astype(mx.float32)
features = model.encoder(mel)
result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
print(result[0].text.strip())
```
For training reproduction, see the GitHub repository.
Training data
| Source | Samples | Convention |
|---|---|---|
| TIMIT narrow (English, ARPABET → IPA via prepare_timit_dataset.py) | 4,620 | Narrow |
Approximately 3 hours of English read speech.
TIMIT (LDC93S1) is licensed for non-commercial research only. The trained weights are distributed under CC BY-NC 4.0 in accordance with this restriction; see License.
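The actual ARPABET → IPA conversion is done by prepare_timit_dataset.py in the repository. The fragment below is only an illustrative sketch reconstructed from the narrow distinctions listed above, with a partial mapping table and space-joined output as assumptions of mine:

```python
# Illustrative only: the authoritative mapping lives in prepare_timit_dataset.py.
DROP = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}
ARPABET_TO_IPA = {
    "dx": "ɾ",     # flap
    "nx": "ɾ̃",     # nasal flap
    "q": "ʔ",      # glottal stop
    "em": "m̩",    # syllabic m
    "en": "n̩",    # syllabic n
    "el": "l̩",    # syllabic l
    "eng": "ŋ̍",   # syllabic ng
    "er": "ɝ",     # stressed r-coloured vowel
    "axr": "ɚ",    # unstressed r-coloured vowel
    "ix": "ɨ",     # reduced vowel
    "ax-h": "ə̥",   # devoiced schwa
    "ux": "ʉ",     # fronted u
    "hv": "ɦ",     # voiced glottal fricative
}

def timit_phones_to_ipa(phones: list[str]) -> str:
    # Drop closures and silences, then map the remaining symbols. Symbols not in
    # this partial table pass through unchanged (the real script maps all of them).
    kept = [p for p in phones if p not in DROP]
    # Joining with spaces is an assumption about the output format.
    return " ".join(ARPABET_TO_IPA.get(p, p) for p in kept)

print(timit_phones_to_ipa(["h#", "dcl", "d", "ix", "dx", "axr", "h#"]))
```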
Training procedure
Decoder-only fine-tune with the encoder frozen; AdamW with linear warmup and cosine decay; fp32; trained with MLX on a single Apple M3 Ultra. Training used automatic early stopping; full hyperparameters, launchers, and reproduction commands are in the GitHub repository.
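A minimal sketch of this optimizer setup in MLX; the learning rate and step counts below are placeholders, not the actual hyperparameters (those are in the GitHub repository):

```python
import mlx.optimizers as optim
from mlx_whisper.load_models import load_model

# Placeholder values -- the real hyperparameters are in the training repo.
peak_lr, warmup_steps, total_steps = 1e-5, 500, 10_000

# Linear warmup into cosine decay, switching schedules at warmup_steps.
lr_schedule = optim.join_schedules(
    [
        optim.linear_schedule(0.0, peak_lr, warmup_steps),
        optim.cosine_decay(peak_lr, total_steps - warmup_steps),
    ],
    [warmup_steps],
)
optimizer = optim.AdamW(learning_rate=lr_schedule)

# Freeze the encoder so only decoder parameters receive gradient updates.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.encoder.freeze()
```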
Training-time language token
All training samples use <|en|> as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. Pass language="en" at inference.
Evaluation
PFER (Phonetic Feature Error Rate) is a feature-level edit distance: substitutions cost the per-phone Hamming distance over PanPhon's 24 articulatory features divided by 24, and insertions/deletions cost 1. PER is the segment-level edit distance divided by reference length.
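The scoring code used for the numbers below is in the GitHub repository. The sketch here only illustrates the stated convention with panphon; it assumes whitespace-separable symbols and normalization by reference length, both of which are my assumptions rather than guarantees about the official script:

```python
import unicodedata
import panphon

ft = panphon.FeatureTable()

def feature_vectors(ipa: str):
    # One articulatory feature vector (values in {-1, 0, +1}) per segment
    # that panphon recognizes; whitespace is removed first.
    s = unicodedata.normalize("NFC", ipa).replace(" ", "")
    return ft.word_to_vector_list(s, numeric=True)

def pfer(hyp: str, ref: str) -> float:
    H, R = feature_vectors(hyp), feature_vectors(ref)
    n_feats = len(R[0]) if R else 1  # 24 in current panphon releases
    # Edit-distance DP: substitution cost = Hamming distance / n_feats,
    # insertion and deletion cost = 1.
    d = [[0.0] * (len(H) + 1) for _ in range(len(R) + 1)]
    for i in range(1, len(R) + 1):
        d[i][0] = float(i)
    for j in range(1, len(H) + 1):
        d[0][j] = float(j)
    for i, r in enumerate(R, 1):
        for j, h in enumerate(H, 1):
            sub = sum(a != b for a, b in zip(r, h)) / n_feats
            d[i][j] = min(d[i - 1][j] + 1.0,
                          d[i][j - 1] + 1.0,
                          d[i - 1][j - 1] + sub)
    # Normalizing by reference length (as for PER) is my assumption.
    return 100.0 * d[len(R)][len(H)] / max(len(R), 1)

print(pfer("ɾ ɪ s", "t ɪ s"))
```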
| Benchmark | n | PFER (%) | PER (%) |
|---|---|---|---|
| TIMIT narrow core test (in-distribution) | 1,680 | 5.83 | 14.98 |
No fair peer comparison
There is no published Whisper-decoder fine-tune on TIMIT narrow at the per-phone Hamming/24 PFER convention used here; this is a standalone in-distribution result. The benchmark adapters in the GitHub repository can run this checkpoint on other narrow benchmarks, but the resulting numbers are dominated by inventory mismatch (this model emits TIMIT-narrow detail) and are not published as quality claims.
Limitations
- English-only. This checkpoint has only seen TIMIT-style narrow English transcriptions during training. For multilingual or broad-IPA transcription use the companion broad-multi variant.
- Small training corpus. ~3 hours of audio; the in-training validation curve shows clear overfitting after step 4,000, which is why early stopping triggered at step 9,000.
- AR-decoder repetition. Whisper's autoregressive decoder can produce repetition hallucinations on out-of-distribution short utterances; this is a known structural property of autoregressive decoders compared with CTC models. A simple post-hoc check is sketched below.
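One cheap post-hoc flag for such repetition loops (not part of this model card's pipeline) is the zlib compression-ratio heuristic that Whisper's own transcribe pipeline uses; the threshold here is borrowed from its defaults:

```python
import zlib

def compression_ratio(text: str) -> float:
    # Highly repetitive strings compress extremely well, so a large ratio
    # is a cheap indicator of repetition loops in decoder output.
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

hyp = result[0].text  # decoder output from the "How to use" snippet above
if compression_ratio(hyp) > 2.4:  # Whisper's default compression_ratio_threshold
    print("Warning: output looks repetitive; re-check segmentation or audio quality.")
```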
Citation
```bibtex
@software{aslan2026phonetic_whisper_mlx,
  author  = {Aslan, Barathan},
  title   = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
  year    = {2026},
  url     = {https://github.com/barathanaslan/phonetic-whisper-mlx},
  version = {0.1.0},
  license = {MIT (code), CC BY-NC 4.0 (weights)}
}
```
For training data:
Garofolo, J. S., et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium, 1993.
For the per-phone Hamming/24 PFER convention:
Taguchi, C., et al. Universal Automatic Phonetic Transcription into the International Phonetic Alphabet. arXiv:2308.03917, 2023.
Lu et al. POWSM: A Phonetic Open Whisper-Style Speech Foundation Model. arXiv:2510.24992, 2025.
License
Trained model weights: CC BY-NC 4.0. The non-commercial restriction reflects the TIMIT (LDC93S1) data terms inherited via training data. Commercial deployment of derivative products may require obtaining a TIMIT For-Profit Membership from LDC; compliance with upstream training-data licenses is the deployer's responsibility.
Source code: MIT, distributed via the GitHub repository.