# mHuBERT-147 — IPA Phone Classification Heads
This repository contains trained classification heads on top of a frozen
`utter-project/mHuBERT-147` encoder. The backbone is not included and must be
loaded separately.

`lstm_v1` performs frame-level IPA phone classification: given the hidden states
of mHuBERT-147, each frame is assigned one of 45 IPA symbols. `phone_mask_v1`
adds a second utility head that scores which frames should be kept for decoding.
`ctc_v1` is a BiLSTM CTC head that directly predicts a phone sequence with an
extra blank symbol.
## Available heads
| Directory | Architecture | Buckeye PER ↓ | Notes |
|---|---|---|---|
| `lstm_v1/` | BiLSTM (2 layers, hidden 256) | 0.261 | frame-level baseline |
| `phone_mask_v1/` | BiLSTM phone head + BiLSTM utility head | 0.241 | masked decoding, threshold = 0.50 |
| `ctc_v1/` | BiLSTM CTC head | 0.215 | frozen backbone |
Evaluated on Buckeye validation speakers: s18, s20, s26, s29, s31–s33, s35.
phone_mask_v1 predicts both frame-level phone logits and a frame-utility score. At inference time the utility score is thresholded and only the selected frames are kept for decoding.
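As a minimal sketch of this masked decoding step, with synthetic logits standing in for model output (the values and phone ids below are illustrative, not real predictions; the full pipeline is in the usage section):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Synthetic per-frame utility logits and phone ids (illustrative only)
utility_logits = [2.0, -3.0, 1.5, -1.0, 0.9, -2.5]  # one score per frame
pred_ids = [13, 13, 7, 7, 28, 0]                    # argmax of phone logits per frame
threshold = 0.50                                    # default threshold from the table above

# Keep only frames whose utility probability clears the threshold
kept = [p for p, u in zip(pred_ids, utility_logits) if sigmoid(u) > threshold]
print(kept)  # [13, 7, 28]
```

Frames with low utility (e.g. silence or transitions) are dropped before the phone sequence is read off.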
`ctc_v1` is a CTC head; it reaches TIMIT test PER = 0.0957 and Buckeye validation PER = 0.2150.
## Phone set
45 IPA symbols (44 phones + silence). Full mapping in `ipa_map.json`.

```
sil aɪ aʊ b d dʒ eɪ f g h i j k l l̩ m m̩ n n̩ oʊ p r s t tʃ u v w z æ ð ŋ ɑ ɔ ɔɪ ə ɛ ɪ ɹ̩ ɾ ʃ ʊ ʒ ʔ θ
```
Phone mapping follows the TIMIT→IPA conventions from Wav2IPA.
## Training data
- TIMIT (train split) — remapped to IPA
- Buckeye corpus (speakers s01–s17, s19, s21–s24, s28, s30) — remapped to IPA
Key mapping decisions:
- Stop closure+release pairs (`bcl`+`b`, `dcl`+`d`, …) → merged into a single release phone
- `er` / `axr` → `ɹ̩` (syllabic r); `el` → `l̩`; `em` → `m̩`; `en` → `n̩`
- `ah` / `ax` / `ax-h` → `ə`
- `dx` → `ɾ` (flap); `q` → `ʔ`
- Buckeye nasalised vowels (`aen`, `own`, …) → merged with oral counterparts
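Illustratively, these decisions amount to a small lookup table plus dropping closure labels (a sketch only; the authoritative mapping is `ipa_map.json` and the Wav2IPA conventions):

```python
# Illustrative TIMIT→IPA remap (subset of the decisions listed above)
TIMIT_TO_IPA = {
    "er": "ɹ̩", "axr": "ɹ̩",              # syllabic r
    "el": "l̩", "em": "m̩", "en": "n̩",    # other syllabic consonants
    "ah": "ə", "ax": "ə", "ax-h": "ə",   # reduced vowels → schwa
    "dx": "ɾ",                           # flap
    "q": "ʔ",                            # glottal stop
}
# Closure labels are dropped; the release phone carries the whole segment
CLOSURES = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl"}

def remap(seq):
    out = []
    for ph in seq:
        if ph in CLOSURES:
            continue
        out.append(TIMIT_TO_IPA.get(ph, ph))
    return out

print(remap(["dcl", "d", "ax", "dx", "er"]))  # ['d', 'ə', 'ɾ', 'ɹ̩']
```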
## Comparison
| System | Test set | PER ↓ | Approach |
|---|---|---|---|
| This work (ctc_v1) | Buckeye val | 0.215 | frozen-backbone CTC head |
| Wav2IPA | Buckeye val | 0.2479 | CTC fine-tuning |
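PER here is the standard phone error rate: the Levenshtein distance between the predicted and reference phone sequences, normalised by reference length. A minimal sketch with hypothetical sequences:

```python
def phone_error_rate(ref, hyp):
    """Levenshtein distance between phone sequences / reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deletions
    for j in range(n + 1):
        d[0][j] = j  # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n] / max(m, 1)

# Hypothetical reference vs. hypothesis: one substitution over four phones
ref = ["h", "ə", "l", "oʊ"]
hyp = ["h", "ɛ", "l", "oʊ"]
print(phone_error_rate(ref, hyp))  # 0.25
```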
## Usage
Minimal example for `lstm_v1`:

```python
import json

import librosa
import torch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoFeatureExtractor, AutoModel

repo_id = "istomin9192/mHuBERT-147-ipa-head"
local_dir = snapshot_download(repo_id=repo_id)
head_dir = f"{local_dir}/lstm_v1"
ipa_map_path = f"{local_dir}/ipa_map.json"

config = AutoConfig.from_pretrained(head_dir, trust_remote_code=True)
backbone = AutoModel.from_pretrained(config.base_model)
feature_extractor = AutoFeatureExtractor.from_pretrained(config.base_model)
backbone.eval()

head = AutoModel.from_pretrained(head_dir, trust_remote_code=True)
head.eval()

with open(ipa_map_path, "r", encoding="utf-8") as f:
    id2phone = json.load(f)["id2phone"]

wav_file = "path/to/audio.wav"  # replace with your audio file
wav, sr = librosa.load(wav_file, sr=16000, mono=True)
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    emb = backbone(**inputs).last_hidden_state  # [1, T, 768]
    outputs = head(emb)

logits = outputs.logits  # [1, T, 45]
pred_ids = logits.argmax(-1)[0].tolist()
phones = [id2phone[str(i)] for i in pred_ids]
print(phones)
```
Minimal example for `phone_mask_v1`:

```python
import json

import librosa
import torch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoFeatureExtractor, AutoModel

repo_id = "istomin9192/mHuBERT-147-ipa-head"
local_dir = snapshot_download(repo_id=repo_id)
head_dir = f"{local_dir}/phone_mask_v1"
ipa_map_path = f"{local_dir}/ipa_map.json"

config = AutoConfig.from_pretrained(head_dir, trust_remote_code=True)
backbone = AutoModel.from_pretrained(config.base_model)
feature_extractor = AutoFeatureExtractor.from_pretrained(config.base_model)
backbone.eval()

phone_mask = AutoModel.from_pretrained(head_dir, trust_remote_code=True)
phone_mask.eval()

with open(ipa_map_path, "r", encoding="utf-8") as f:
    id2phone = json.load(f)["id2phone"]

wav_file = "path/to/audio.wav"  # replace with your audio file
wav, sr = librosa.load(wav_file, sr=16000, mono=True)
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    emb = backbone(**inputs).last_hidden_state  # [1, T, 768]
    outputs = phone_mask(emb)

phone_logits = outputs.phone_logits      # [1, T, 45]
utility_logits = outputs.utility_logits  # [1, T]

pred_ids = phone_logits.argmax(-1)[0]
utility = torch.sigmoid(utility_logits[0])
mask = utility > config.default_threshold  # keep only high-utility frames
phones = [id2phone[str(i)] for i in pred_ids[mask].tolist()]
print(phones)
```
Minimal example for `ctc_v1`:

```python
import json

import librosa
import torch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoFeatureExtractor, AutoModel

repo_id = "istomin9192/mHuBERT-147-ipa-head"
local_dir = snapshot_download(repo_id=repo_id)
head_dir = f"{local_dir}/ctc_v1"
ipa_map_path = f"{local_dir}/ipa_map.json"

config = AutoConfig.from_pretrained(head_dir, trust_remote_code=True)
backbone = AutoModel.from_pretrained(config.base_model)
feature_extractor = AutoFeatureExtractor.from_pretrained(config.base_model)
backbone.eval()

ctc_head = AutoModel.from_pretrained(head_dir, trust_remote_code=True)
ctc_head.eval()

with open(ipa_map_path, "r", encoding="utf-8") as f:
    id2phone = {int(k): v for k, v in json.load(f)["id2phone"].items()}

wav_file = "path/to/audio.wav"  # replace with your audio file
wav, sr = librosa.load(wav_file, sr=16000, mono=True)
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    emb = backbone(**inputs).last_hidden_state  # [1, T, 768]
    logits = ctc_head(emb).logits[0]            # [T, num_phones + blank]

pred_ids = logits.argmax(dim=-1).tolist()

# Greedy CTC decoding: collapse repeats, then drop blanks
blank_id = config.architecture["blank_id"]
phones = []
prev = blank_id
for pid in pred_ids:
    if pid != blank_id and pid != prev:
        phones.append(id2phone[pid])
    prev = pid
print(phones)
```
This model uses the mHuBERT-147 backbone: https://huggingface.co/utter-project/mHuBERT-147