# mHuBERT-147 — IPA Phone Classification Heads
This repository contains trained classification heads on top of a frozen
`utter-project/mHuBERT-147` encoder. The backbone is not included and must be
loaded separately.

`lstm_v1` performs frame-level IPA phone classification: given the hidden states
of mHuBERT-147, each frame is assigned one of 45 IPA symbols. `phone_mask_v1`
adds a second utility head that scores which frames should be kept for decoding.
`ctc_v1` is a BiLSTM CTC head that directly predicts a phone sequence with an
extra blank symbol.
## Available heads
| Directory | Architecture | Buckeye PER ↓ | Notes |
|---|---|---|---|
| `lstm_v1/` | BiLSTM (2 layers, hidden 256) | 0.261 | frame-level baseline |
| `phone_mask_v1/` | BiLSTM phone head + BiLSTM utility head | 0.241 | masked decoding, threshold = 0.50 |
| `ctc_v1/` | BiLSTM CTC head | 0.215 | frozen backbone |
Evaluated on Buckeye validation speakers: s18, s20, s26, s29, s31–s33, s35.
phone_mask_v1 predicts both frame-level phone logits and a frame-utility score. At inference time the utility score is thresholded and only the selected frames are kept for decoding.
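As a minimal sketch of this masked decoding step, with synthetic logits standing in for model output (the values and phone ids below are illustrative, not real predictions; the full pipeline is in the usage section):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Synthetic per-frame utility logits and phone ids (illustrative only)
utility_logits = [2.0, -3.0, 1.5, -1.0, 0.9, -2.5]  # one score per frame
pred_ids = [13, 13, 7, 7, 28, 0]                    # argmax of phone logits per frame
threshold = 0.50                                    # default threshold from the table above

# Keep only frames whose utility probability clears the threshold
kept = [p for p, u in zip(pred_ids, utility_logits) if sigmoid(u) > threshold]
print(kept)  # [13, 7, 28]
```

Frames with low utility (e.g. silence or transitions) are dropped before the phone sequence is read off.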
`ctc_v1` is a CTC head; it reaches TIMIT test PER = 0.0957 and Buckeye validation PER = 0.2150.
## Phone set
45 IPA symbols (44 phones + silence). Full mapping in `ipa_map.json`.

```
sil aɪ aʊ b d dʒ eɪ f g h i j k l l̩ m m̩ n n̩ oʊ p r s t tʃ u v w z æ ð ŋ ɑ ɔ ɔɪ ə ɛ ɪ ɹ̩ ɾ ʃ ʊ ʒ ʔ θ
```
Phone mapping follows the TIMIT→IPA conventions from Wav2IPA.
## Training data
- TIMIT (train split) — remapped to IPA
- Buckeye corpus (speakers s01–s17, s19, s21–s24, s28, s30) — remapped to IPA
Key mapping decisions:
- Stop closure+release pairs (`bcl`+`b`, `dcl`+`d`, …) → merged into a single release phone
- `er` / `axr` → `ɹ̩` (syllabic r); `el` → `l̩`; `em` → `m̩`; `en` → `n̩`
- `ah` / `ax` / `ax-h` → `ə`
- `dx` → `ɾ` (flap); `q` → `ʔ`
- Buckeye nasalised vowels (`aen`, `own`, …) → merged with oral counterparts
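Illustratively, these decisions amount to a small lookup table plus dropping closure labels (a sketch only; the authoritative mapping is `ipa_map.json` and the Wav2IPA conventions):

```python
# Illustrative TIMIT→IPA remap (subset of the decisions listed above)
TIMIT_TO_IPA = {
    "er": "ɹ̩", "axr": "ɹ̩",              # syllabic r
    "el": "l̩", "em": "m̩", "en": "n̩",    # other syllabic consonants
    "ah": "ə", "ax": "ə", "ax-h": "ə",   # reduced vowels → schwa
    "dx": "ɾ",                           # flap
    "q": "ʔ",                            # glottal stop
}
# Closure labels are dropped; the release phone carries the whole segment
CLOSURES = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl"}

def remap(seq):
    out = []
    for ph in seq:
        if ph in CLOSURES:
            continue
        out.append(TIMIT_TO_IPA.get(ph, ph))
    return out

print(remap(["dcl", "d", "ax", "dx", "er"]))  # ['d', 'ə', 'ɾ', 'ɹ̩']
```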
## Comparison
| System | Test set | PER ↓ | Approach |
|---|---|---|---|
| This work (ctc_v1) | Buckeye val | 0.215 | frozen-backbone CTC head |
| Wav2IPA | Buckeye val | 0.2479 | CTC fine-tuning |
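PER here is the standard phone error rate: the Levenshtein distance between the predicted and reference phone sequences, normalised by reference length. A minimal sketch with hypothetical sequences:

```python
def phone_error_rate(ref, hyp):
    """Levenshtein distance between phone sequences / reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deletions
    for j in range(n + 1):
        d[0][j] = j  # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n] / max(m, 1)

# Hypothetical reference vs. hypothesis: one substitution over four phones
ref = ["h", "ə", "l", "oʊ"]
hyp = ["h", "ɛ", "l", "oʊ"]
print(phone_error_rate(ref, hyp))  # 0.25
```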
## Usage
Minimal example for `lstm_v1`:

```python
import json

import librosa
import torch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoFeatureExtractor, AutoModel

repo_id = "istomin9192/mHuBERT-147-ipa-head"
local_dir = snapshot_download(repo_id=repo_id)
head_dir = f"{local_dir}/lstm_v1"
ipa_map_path = f"{local_dir}/ipa_map.json"

config = AutoConfig.from_pretrained(head_dir, trust_remote_code=True)
backbone = AutoModel.from_pretrained(config.base_model)
feature_extractor = AutoFeatureExtractor.from_pretrained(config.base_model)
backbone.eval()

head = AutoModel.from_pretrained(head_dir, trust_remote_code=True)
head.eval()

with open(ipa_map_path, "r", encoding="utf-8") as f:
    id2phone = json.load(f)["id2phone"]

wav_file = "path/to/audio.wav"  # replace with your audio file
wav, sr = librosa.load(wav_file, sr=16000, mono=True)
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    emb = backbone(**inputs).last_hidden_state  # [1, T, 768]
    outputs = head(emb)

logits = outputs.logits  # [1, T, 45]
pred_ids = logits.argmax(-1)[0].tolist()
phones = [id2phone[str(i)] for i in pred_ids]
print(phones)
```
Minimal example for `phone_mask_v1`:

```python
import json

import librosa
import torch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoFeatureExtractor, AutoModel

repo_id = "istomin9192/mHuBERT-147-ipa-head"
local_dir = snapshot_download(repo_id=repo_id)
head_dir = f"{local_dir}/phone_mask_v1"
ipa_map_path = f"{local_dir}/ipa_map.json"

config = AutoConfig.from_pretrained(head_dir, trust_remote_code=True)
backbone = AutoModel.from_pretrained(config.base_model)
feature_extractor = AutoFeatureExtractor.from_pretrained(config.base_model)
backbone.eval()

phone_mask = AutoModel.from_pretrained(head_dir, trust_remote_code=True)
phone_mask.eval()

with open(ipa_map_path, "r", encoding="utf-8") as f:
    id2phone = json.load(f)["id2phone"]

wav_file = "path/to/audio.wav"  # replace with your audio file
wav, sr = librosa.load(wav_file, sr=16000, mono=True)
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    emb = backbone(**inputs).last_hidden_state  # [1, T, 768]
    outputs = phone_mask(emb)

phone_logits = outputs.phone_logits      # [1, T, 45]
utility_logits = outputs.utility_logits  # [1, T]

pred_ids = phone_logits.argmax(-1)[0]
utility = torch.sigmoid(utility_logits[0])
mask = utility > config.default_threshold  # keep only high-utility frames
phones = [id2phone[str(i)] for i in pred_ids[mask].tolist()]
print(phones)
```
Minimal example for `ctc_v1`:

```python
import json

import librosa
import torch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoFeatureExtractor, AutoModel

repo_id = "istomin9192/mHuBERT-147-ipa-head"
local_dir = snapshot_download(repo_id=repo_id)
head_dir = f"{local_dir}/ctc_v1"
ipa_map_path = f"{local_dir}/ipa_map.json"

config = AutoConfig.from_pretrained(head_dir, trust_remote_code=True)
backbone = AutoModel.from_pretrained(config.base_model)
feature_extractor = AutoFeatureExtractor.from_pretrained(config.base_model)
backbone.eval()

ctc_head = AutoModel.from_pretrained(head_dir, trust_remote_code=True)
ctc_head.eval()

with open(ipa_map_path, "r", encoding="utf-8") as f:
    id2phone = {int(k): v for k, v in json.load(f)["id2phone"].items()}

wav_file = "path/to/audio.wav"  # replace with your audio file
wav, sr = librosa.load(wav_file, sr=16000, mono=True)
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    emb = backbone(**inputs).last_hidden_state  # [1, T, 768]
    logits = ctc_head(emb).logits[0]            # [T, num_phones + blank]

pred_ids = logits.argmax(dim=-1).tolist()

# Greedy CTC decoding: collapse repeats, then drop blanks
blank_id = config.architecture["blank_id"]
phones = []
prev = blank_id
for pid in pred_ids:
    if pid != blank_id and pid != prev:
        phones.append(id2phone[pid])
    prev = pid
print(phones)
```
This model uses the mHuBERT-147 backbone: https://huggingface.co/utter-project/mHuBERT-147