Wav2Vec2-Large-XLSR for Hindi IPA Phoneme Recognition
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 that recognizes IPA phonemes in Hindi speech using CTC decoding.
Instead of transcribing speech to orthographic text, the model outputs individual IPA (International Phonetic Alphabet) phoneme tokens, which makes it useful for phonetic analysis, pronunciation assessment, forced alignment, and linguistic research on Hindi speech.
Model Details
- Architecture: Wav2Vec2ForCTC (24 transformer layers, 1024 hidden size, 16 attention heads)
- Base model: facebook/wav2vec2-large-xlsr-53
- Fine-tuning dataset: AI4Bharat IndicVoices, Hindi subset
- Vocabulary: 64 IPA phoneme tokens (including special tokens)
- Sampling rate: 16 kHz
- Framework: PyTorch / HuggingFace Transformers
Phoneme Vocabulary
The model recognizes 61 Hindi IPA phonemes plus 3 special tokens:
| Category | Phonemes |
|---|---|
| Vowels | ə, ɑː, i, iː, u, uː, eː, oː, aːi, aːu |
| Plosives | p, pʰ, b, bʰ, t̪, t̪ʰ, d̪, d̪ʰ, ʈ, ʈʰ, ɖ, ɖʰ, k, kʰ, g, gʰ, q |
| Affricates | c, cʰ, ɟ, ɟʰ, ɕc |
| Fricatives | s, z, ɕ, ʂ, h, ɦ, f, x, ɣ |
| Nasals | m, n, ɲ, ɳ, ŋ, ⁿ |
| Liquids & Glides | l, r, ɾ, ɽ, ɽʱ, j, v |
| Clusters | kʃ, t̪ɾ, gj |
| Syllabic | l̩, l̩ː, ɹ̩, ɹ̩ː |
| Special | `<pad>` (CTC blank), `<unk>`, `\|` (word delimiter) |
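As a quick sanity check on the numbers above, the table can be transcribed into a Python dictionary and counted. This is only a sketch: the variable names are illustrative, and the lists are copied by hand from the table rather than read from the model's tokenizer.

```python
# Phoneme inventory copied from the table above, grouped by category.
PHONEMES = {
    "Vowels": ["ə", "ɑː", "i", "iː", "u", "uː", "eː", "oː", "aːi", "aːu"],
    "Plosives": ["p", "pʰ", "b", "bʰ", "t̪", "t̪ʰ", "d̪", "d̪ʰ", "ʈ", "ʈʰ",
                 "ɖ", "ɖʰ", "k", "kʰ", "g", "gʰ", "q"],
    "Affricates": ["c", "cʰ", "ɟ", "ɟʰ", "ɕc"],
    "Fricatives": ["s", "z", "ɕ", "ʂ", "h", "ɦ", "f", "x", "ɣ"],
    "Nasals": ["m", "n", "ɲ", "ɳ", "ŋ", "ⁿ"],
    "Liquids & Glides": ["l", "r", "ɾ", "ɽ", "ɽʱ", "j", "v"],
    "Clusters": ["kʃ", "t̪ɾ", "gj"],
    "Syllabic": ["l̩", "l̩ː", "ɹ̩", "ɹ̩ː"],
}
SPECIAL_TOKENS = ["<pad>", "<unk>", "|"]

# Count phonemes across all categories, then add the special tokens.
n_phonemes = sum(len(group) for group in PHONEMES.values())
vocab_size = n_phonemes + len(SPECIAL_TOKENS)
print(n_phonemes, vocab_size)  # 61 64
```

Note that several phonemes (for example t̪ʰ or aːi) span more than one Unicode code point, so any downstream processing should treat each vocabulary entry as a single token rather than splitting by character.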
Usage
Quick Start
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio
# Load model and processor
model_name = "xnpx/wav2vec2-large-xlsr-ipa-phonemes"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()
# Load audio and convert to 16 kHz mono
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
waveform = waveform.mean(dim=0)  # average channels: (channels, time) -> (time,)
# Run inference
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits
# Greedy CTC decode
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)
# Example output: "n ə m ə s t̪ eː"
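To make the greedy CTC step less opaque, here is a minimal, model-free sketch of the collapse rule that greedy decoding relies on: merge consecutive repeated IDs, then drop blanks. The function name, the toy frame IDs, and the choice of 0 as the blank ID are assumptions for illustration only; with the real model the blank is the ID of `<pad>`.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeats, then drop blank frames."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Toy per-frame argmax IDs (hypothetical values, not real model output).
frames = [0, 0, 7, 7, 0, 7, 3, 3, 3, 0]
print(ctc_greedy_collapse(frames))  # [7, 7, 3]
```

The blank between the two runs of 7 is what lets the same phoneme appear twice in a row in the output; without it the two runs would merge into one.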
With Timestamps (Greedy CTC)
import numpy as np
log_probs = torch.nn.functional.log_softmax(logits, dim=-1).cpu().numpy()[0]
pred_ids = np.argmax(log_probs, axis=-1)
# Build the ID -> phoneme mapping from the tokenizer vocabulary
id_to_phoneme = {v: k for k, v in processor.tokenizer.get_vocab().items()}
# Frame duration: product of conv_stride values / sampling_rate
# For this model: 5*2*2*2*2*2*2 = 320 samples per frame -> 20ms at 16kHz
frame_duration_s = 320 / 16000 # 0.02s per frame
phonemes, timestamps = [], []
blank_id = processor.tokenizer.pad_token_id  # <pad> is the CTC blank
prev_id = None
for frame_idx, token_id in enumerate(pred_ids):
    if token_id == blank_id:  # skip CTC blank
        prev_id = None
        continue
    if token_id == prev_id:  # skip CTC repeats
        continue
    prev_id = token_id
    phoneme = id_to_phoneme.get(int(token_id), "<unk>")
    if phoneme not in ("<pad>", "<unk>", "|"):
        t = frame_idx * frame_duration_s  # time of the first frame of this phoneme
        phonemes.append(phoneme)
        timestamps.append(t)
for p, t in zip(phonemes, timestamps):
    print(f" {t:.3f}s {p}")
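The 20 ms frame duration used above can be checked with a few lines of arithmetic. The sketch below assumes the standard Wav2Vec2 feature-encoder configuration (kernel sizes 10, 3, 3, 3, 3, 2, 2 with the strides quoted in the comment above); if this checkpoint's `config.conv_kernel` or `config.conv_stride` differ, the numbers change accordingly. The helper name `num_frames` is illustrative.

```python
from math import prod

# Assumed feature-encoder settings (Wav2Vec2 defaults); verify against model.config.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000


def num_frames(num_samples):
    """Number of output frames for an unpadded 1-D convolution stack."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


samples_per_frame = prod(CONV_STRIDE)
print(samples_per_frame, samples_per_frame / SAMPLE_RATE)  # 320 0.02
print(num_frames(SAMPLE_RATE))  # 49
```

One second of audio yields 49 frames rather than 50 because the convolution kernels consume samples at the edges, which is one reason the timestamps are best read as approximate.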
Training Details
- Base model: facebook/wav2vec2-large-xlsr-53 (pre-trained on 53 languages)
- Dataset: AI4Bharat IndicVoices Hindi split
- Text-to-phoneme conversion: Devanagari script → IPA via rule-based transliteration
- Loss: CTC (Connectionist Temporal Classification)
- Optimizer: AdamW
- Training framework: HuggingFace Trainer
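To illustrate what rule-based Devanagari-to-IPA conversion can look like, here is a deliberately tiny sketch. It is not the conversion used to train this model: the mapping tables are hypothetical, cover only a handful of characters, and ignore real Hindi phenomena such as schwa deletion. It only shows the basic mechanics of the script: a consonant carries an inherent ə unless a vowel sign (matra) or a virama follows it.

```python
# Hypothetical toy tables; the actual training rules are not reproduced here.
CONSONANTS = {
    "\u0928": "n",   # NA
    "\u092e": "m",   # MA
    "\u0938": "s",   # SA
    "\u0924": "t̪",  # TA (dental)
}
MATRAS = {
    "\u093e": "ɑː",  # vowel sign AA
    "\u0947": "eː",  # vowel sign E
}
VIRAMA = "\u094d"    # suppresses the inherent vowel
INHERENT_VOWEL = "ə"


def to_ipa(word):
    """Map a Devanagari word to a list of IPA phonemes (toy rules only)."""
    phonemes = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            phonemes.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else ""
            # Add the inherent vowel unless a matra or virama follows.
            if nxt != VIRAMA and nxt not in MATRAS:
                phonemes.append(INHERENT_VOWEL)
        elif ch in MATRAS:
            phonemes.append(MATRAS[ch])
        # The virama and any unmapped characters emit nothing.
    return phonemes


print(" ".join(to_ipa("नमस्ते")))  # n ə m ə s t̪ eː
```

A production converter would need the full character inventory, independent vowels, nasalization marks, conjunct handling, and schwa-deletion rules, none of which this sketch attempts.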
Limitations
- Designed specifically for Hindi speech; may not generalize well to other languages
- Decoding is greedy CTC only; no language model or beam search is applied
- Phoneme boundaries from greedy decoding are approximate; use CTC segmentation for more accurate alignment
- Performance may degrade on noisy or far-field audio
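Short of full CTC segmentation, the greedy path can at least be turned into rough start and end times by measuring how many consecutive frames each phoneme occupies. The sketch below is model-free and uses hypothetical toy values: the frame IDs, the two-entry vocabulary, the blank ID of 0, and the 20 ms frame step are all assumptions for illustration. It is a run-length heuristic, not a forced aligner.

```python
from itertools import groupby

FRAME_S = 0.02  # assumed frame step in seconds (320 samples at 16 kHz)


def greedy_spans(frame_ids, id_to_phoneme, blank_id=0):
    """Return (phoneme, start_s, end_s) for each run of non-blank frames."""
    spans, frame = [], 0
    for token_id, run in groupby(frame_ids):
        n = len(list(run))
        if token_id != blank_id:
            spans.append((id_to_phoneme[token_id], frame * FRAME_S, (frame + n) * FRAME_S))
        frame += n
    return spans


# Toy data (hypothetical IDs and vocabulary, not real model output).
toy_vocab = {1: "n", 2: "ə"}
frames = [0, 1, 1, 0, 2, 2, 2, 0]
for phoneme, start, end in greedy_spans(frames, toy_vocab):
    print(f"{start:.2f}-{end:.2f}s  {phoneme}")
```

CTC models tend to emit a phoneme on only a few frames and fill the rest with blanks, so these spans will usually underestimate true durations; treat them as a lower bound rather than an alignment.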
Citation
If you use this model, please cite the underlying wav2vec2-xlsr work:
@inproceedings{conneau2020unsupervised,
  title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
  author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
  booktitle={Interspeech},
  year={2021}
}