phonetic-whisper-mlx-narrow-en

Whisper-large-v3 decoder fine-tuned for narrow International Phonetic Alphabet (IPA) transcription of English, trained on TIMIT alone using MLX on a single Apple Silicon machine.

Companion variant: phonetic-whisper-mlx-broad-multi is trained on TIMIT broad plus CommonVoice broad transcriptions in seven languages and emits broad-phonemic IPA. Use this narrow-en variant for English narrow phonetic detail; use broad-multi for cross-lingual broad IPA.

Code: barathanaslan/phonetic-whisper-mlx

Model description

phonetic-whisper-mlx-narrow-en is a decoder-only fine-tune of mlx-community/whisper-large-v3-mlx. The encoder is frozen during training; only the decoder weights are updated. The model takes 16 kHz English audio and emits TIMIT-narrow IPA strings.

Output convention. TIMIT-narrow IPA, NFC-normalized, with the TIMIT-style closures (bcl, dcl, gcl, pcl, tcl, kcl) and silences (pau, epi, h#) dropped. The remaining 52-symbol inventory preserves narrow distinctions such as the glottal stop ʔ, the flap ɾ, syllabic consonants (m̩, n̩, l̩, ŋ̍), r-coloured vowels (ɝ, ɚ), the reduced vowel ɨ, the devoiced schwa ə̥, the fronted ʉ, the voiced glottal ɦ, and the nasal flap ɾ̃.
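As a concrete illustration of this convention, here is a minimal sketch of how a space-separated TIMIT-style label sequence maps onto the output format. The helper name and the token-list input are illustrative, not part of the released pipeline; the model itself already emits the normalized form.

```python
import unicodedata

# Labels dropped under the output convention described above.
DROPPED = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}

def normalize_narrow(tokens):
    """Return an NFC-normalized IPA string with TIMIT closures/silences removed."""
    kept = [t for t in tokens if t not in DROPPED]
    return unicodedata.normalize("NFC", " ".join(kept))

print(normalize_narrow(["bcl", "b", "æ", "tcl", "t", "h#"]))  # "b æ t"
```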

Intended use

  • Research on Whisper-decoder fine-tuning for narrow phonetic transcription of English.
  • Generation of TIMIT-style IPA transcripts for English speech corpora.
  • Comparison work against this checkpoint on TIMIT-narrow conventions.

Out of scope:

  • Broad-IPA transcription (use the companion broad-multi variant).
  • Non-English input — this model has only seen TIMIT-style English narrow transcription.
  • Orthographic ASR and cross-lingual phonetic recognition.
  • Commercial deployment without complying with the upstream LDC TIMIT non-commercial licensing terms.

How to use

MLX (Apple Silicon)

from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_whisper.load_models import load_model
from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
from mlx_whisper.decoding import DecodingOptions, decode
from mlx.utils import tree_flatten, tree_unflatten

# Download checkpoint weights from HF.
ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-narrow-en")

# Load Whisper-large-v3 architecture and overlay our decoder weights.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.set_dtype(mx.float32)
trained = mx.load(f"{ckpt}/model.safetensors")
decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
params = dict(tree_flatten(model.parameters()))
for k, v in decoder_weights.items():
    if k in params:
        params[k] = v
model.update(tree_unflatten(list(params.items())))

# Inference. ALWAYS pass language="en" — see Training-time language token.
audio = load_audio("your-english-audio.wav")
mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
mel = mx.expand_dims(mel, 0).astype(mx.float32)
features = model.encoder(mel)
result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
print(result[0].text.strip())

For training reproduction, see the GitHub repository.

Training data

Source | Samples | Convention
TIMIT narrow (English, ARPABET → IPA via prepare_timit_dataset.py) | 4,620 | Narrow

Approximately 3 hours of English read speech.
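To make the ARPABET → IPA direction concrete, here is an illustrative excerpt of the kind of mapping involved, covering the narrow symbols listed in the output convention. The dictionary and helper are a sketch only; the authoritative mapping lives in prepare_timit_dataset.py in the repository.

```python
# Illustrative excerpt of a TIMIT ARPABET → IPA mapping (not the full table).
ARPABET_TO_IPA = {
    "dx": "ɾ",    # flap
    "q": "ʔ",     # glottal stop
    "er": "ɝ",    # r-coloured vowel
    "axr": "ɚ",   # r-coloured schwa
    "ix": "ɨ",    # reduced vowel
    "ux": "ʉ",    # fronted u
    "hv": "ɦ",    # voiced glottal fricative
    "nx": "ɾ̃",    # nasal flap
}

def to_ipa(arpabet_tokens):
    """Map ARPABET tokens to IPA, passing unknown tokens through unchanged."""
    return [ARPABET_TO_IPA.get(t, t) for t in arpabet_tokens]

print(to_ipa(["b", "dx", "axr"]))
```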

TIMIT (LDC93S1) is licensed for non-commercial research only. The trained weights are distributed under CC BY-NC 4.0 in accordance with this restriction; see License.

Training procedure

Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with MLX. Training was set up with automatic early-stopping; full hyperparameters, launchers, and reproduction commands are in the GitHub repository.
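The warmup-plus-cosine schedule can be sketched in plain Python as below. The function signature and the example hyperparameter values are hypothetical; the actual values are in the repository's launch scripts.

```python
import math

def lr_schedule(step, *, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Linear ramp: reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# Example with made-up values: peak at the end of warmup, zero at the end.
print(lr_schedule(9, base_lr=1e-4, warmup_steps=10, total_steps=100))
```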

Training-time language token

All training samples use <|en|> as the start-of-transcript prefix regardless of source-audio language; the token is repurposed to mean "emit IPA". This is intentional: phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. Pass language="en" at inference.

Evaluation

PFER (Phonetic Feature Error Rate) is a weighted edit distance divided by reference length, where the substitution cost is the per-phone Hamming distance over PanPhon's 24 articulatory features ÷ 24 and insertions/deletions cost 1. PER is segment-level edit distance ÷ reference length.
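The metric can be sketched as a standard dynamic-programming edit distance with a fractional substitution cost. The toy tuples below stand in for PanPhon feature vectors (in practice panphon supplies the 24-dimensional vectors); the helper names are illustrative.

```python
def hamming24(a, b):
    """Substitution cost: Hamming distance over 24-dim feature vectors ÷ 24."""
    return sum(x != y for x, y in zip(a, b)) / 24.0

def feature_edit_distance(ref, hyp):
    """Weighted edit distance: insertion/deletion = 1, substitution = Hamming/24."""
    m, n = len(ref), len(hyp)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,          # deletion
                          d[i][j - 1] + 1.0,          # insertion
                          d[i - 1][j - 1] + hamming24(ref[i - 1], hyp[j - 1]))
    return d[m][n]

def pfer(ref, hyp):
    """Feature edit distance normalized by reference length."""
    return feature_edit_distance(ref, hyp) / len(ref)
```

A substitution that flips a single feature thus costs 1/24, so feature-wise near misses are penalized far less than outright deletions or insertions.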

Benchmark | n | PFER (%) | PER (%)
TIMIT narrow core test (in-distribution) | 1,680 | 5.83 | 14.98

No fair peer comparison

There is no published Whisper-decoder fine-tune on TIMIT narrow under the per-phone Hamming/24 PFER convention used here; this is a standalone in-distribution result. The benchmark adapters in the GitHub repository can run this checkpoint on other narrow benchmarks, but the resulting numbers are dominated by inventory mismatch (this model emits TIMIT-narrow detail that other inventories do not encode) and are not published as quality claims.

Limitations

  • English-only. This checkpoint has only seen TIMIT-style English narrow during training. For multilingual or broad-IPA transcription use the companion broad-multi variant.
  • Small training corpus. ~3 hours of audio; the in-training validation curve shows clear overfitting after step 4,000, which is why early stopping triggered at step 9,000.
  • AR-decoder repetition. Whisper's autoregressive decoder can produce repetition hallucinations on out-of-distribution short utterances; this is a known structural property of AR decoders vs. CTC.
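One practical mitigation for the repetition failure mode is a simple output-side guard that flags transcripts ending in a looping n-gram. This is an illustrative heuristic, not part of the released pipeline, and the thresholds are arbitrary.

```python
def has_tail_repetition(tokens, ngram=3, min_repeats=4):
    """Heuristic flag for AR-decoder repetition loops: True if the output
    ends with the same n-gram repeated min_repeats times in a row."""
    span = ngram * min_repeats
    if len(tokens) < span:
        return False
    tail = tokens[-span:]
    unit = tail[:ngram]
    return all(tail[i:i + ngram] == unit for i in range(0, span, ngram))

print(has_tail_repetition(["ɾ", "ə", "n"] * 5))  # looping tail
```

Flagged utterances can then be re-decoded with different options or truncated, rather than silently accepted.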

Citation

@software{aslan2026phonetic_whisper_mlx,
  author       = {Aslan, Barathan},
  title        = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
  year         = {2026},
  url          = {https://github.com/barathanaslan/phonetic-whisper-mlx},
  version      = {0.1.0},
  license      = {MIT (code), CC BY-NC 4.0 (weights)}
}

For training data:

Garofolo, J. S., et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium, 1993.

For the per-phone Hamming/24 PFER convention:

Taguchi, C. Universal Automatic Phonetic Transcription into the IPA. arXiv:2308.03917, 2023.

Lu et al. POWSM: A Phonetic Open Whisper-Style Speech Foundation Model. arXiv:2510.24992, 2025.

License

Trained model weights: CC BY-NC 4.0. The non-commercial restriction reflects the TIMIT (LDC93S1) data terms inherited via training data. Commercial deployment of derivative products may require obtaining a TIMIT For-Profit Membership from LDC; compliance with upstream training-data licenses is the deployer's responsibility.

Source code: MIT, distributed via the GitHub repository.
