Wav2Vec2-Large-XLSR for Hindi IPA Phoneme Recognition

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 that transcribes Hindi speech into IPA phoneme sequences using CTC decoding.

Instead of transcribing to text, this model outputs individual IPA (International Phonetic Alphabet) phoneme tokens — making it useful for phonetic analysis, pronunciation assessment, forced alignment, and linguistic research on Hindi speech.

Model Details

  • Architecture: Wav2Vec2ForCTC (24 transformer layers, 1024 hidden size, 16 attention heads)
  • Base model: facebook/wav2vec2-large-xlsr-53
  • Fine-tuning dataset: AI4Bharat IndicVoices — Hindi subset
  • Vocabulary: 64 IPA phoneme tokens (including special tokens)
  • Sampling rate: 16 kHz
  • Framework: PyTorch / HuggingFace Transformers

Phoneme Vocabulary

The model recognizes 61 Hindi IPA phonemes plus 3 special tokens:

Category Phonemes
Vowels ə, ɑː, i, , u, , , , aːi, aːu
Plosives p, , b, , , t̪ʰ, , d̪ʰ, ʈ, ʈʰ, ɖ, ɖʰ, k, , g, , q
Affricates c, , ɟ, ɟʰ, ɕc
Fricatives s, z, ɕ, ʂ, h, ɦ, f, x, ɣ
Nasals m, n, ɲ, ɳ, ŋ,
Liquids & Glides l, r, ɾ, ɽ, ɽʱ, j, v
Clusters , t̪ɾ, gj
Syllabic , l̩ː, ɹ̩, ɹ̩ː
Special <pad> (CTC blank), <unk>, | (word delimiter)
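
As a toy illustration of how the special tokens behave during decoding (the token IDs below are hypothetical, not the model's actual mapping, which comes from processor.tokenizer.get_vocab()): <pad> acts as the CTC blank and is dropped, consecutive repeats are collapsed, and | survives as a word boundary.

```python
# Hypothetical subset of the 64-token vocabulary, for illustration only.
id_to_token = {0: "<pad>", 1: "n", 2: "ə", 3: "|"}

def greedy_ctc_collapse(frame_ids):
    """Collapse per-frame predictions: drop <pad> (CTC blank) and repeats."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != 0:  # new token and not the blank
            out.append(id_to_token[i])
        prev = i
    return " ".join(out)

print(greedy_ctc_collapse([1, 1, 0, 2, 2, 3, 1]))  # "n ə | n"
```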

Usage

Quick Start

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model_name = "xnpx/wav2vec2-large-xlsr-ipa-phonemes"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

# Load audio and convert to 16 kHz mono
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
waveform = waveform.mean(dim=0)  # collapse channels to mono; shape (time,)

# Run inference
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)
# Example output: "n ə m ə s t̪ eː"

With Timestamps (Greedy CTC)

import numpy as np

log_probs = torch.nn.functional.log_softmax(logits, dim=-1).cpu().numpy()[0]
pred_ids = np.argmax(log_probs, axis=-1)

# Build the ID -> phoneme mapping from the tokenizer vocabulary
id_to_phoneme = {v: k for k, v in processor.tokenizer.get_vocab().items()}

# Frame duration: product of conv_stride values / sampling_rate
# For this model: 5*2*2*2*2*2*2 = 320 samples per frame -> 20ms at 16kHz
frame_duration_s = 320 / 16000  # 0.02s per frame

phonemes, timestamps = [], []
prev_id = None
for frame_idx, token_id in enumerate(pred_ids):
    if token_id == 0:  # skip CTC blank
        prev_id = None
        continue
    if token_id == prev_id:  # skip CTC repeats
        continue
    prev_id = token_id
    phoneme = id_to_phoneme.get(int(token_id), "<unk>")
    if phoneme not in ("<pad>", "<unk>", "|"):
        t = frame_idx * frame_duration_s
        phonemes.append(phoneme)
        timestamps.append(t)

for p, t in zip(phonemes, timestamps):
    print(f"  {t:.3f}s  {p}")
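
The loop above records only onset times. A variant that also tracks where each token's run of frames ends yields rough (start, end) spans; this is a sketch, and the boundaries are still only as precise as greedy CTC decoding allows:

```python
def collapse_to_spans(pred_ids, id_to_phoneme, frame_duration_s=0.02, blank_id=0):
    """Collapse per-frame CTC predictions into (phoneme, start_s, end_s) spans.

    A span covers the contiguous run of frames that emitted the token,
    which is only a rough proxy for the true acoustic boundary.
    """
    spans = []
    prev_id = None
    for frame_idx, token_id in enumerate(pred_ids):
        if token_id == blank_id:  # CTC blank ends any open run
            prev_id = None
            continue
        if token_id == prev_id:  # repeat: extend the current span
            name, start, _ = spans[-1]
            spans[-1] = (name, start, (frame_idx + 1) * frame_duration_s)
            continue
        prev_id = token_id
        name = id_to_phoneme.get(int(token_id), "<unk>")
        spans.append((name, frame_idx * frame_duration_s,
                      (frame_idx + 1) * frame_duration_s))
    return spans

# Toy frame sequence, hypothetical IDs
spans = collapse_to_spans([0, 1, 1, 0, 2], {1: "n", 2: "ə"})
for name, start, end in spans:
    print(f"{start:.2f}-{end:.2f}s  {name}")  # 0.02-0.06s n, 0.08-0.10s ə
```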

Training Details

  • Base model: facebook/wav2vec2-large-xlsr-53 (pre-trained on 53 languages)
  • Dataset: AI4Bharat IndicVoices Hindi split
  • Text-to-phoneme conversion: Devanagari script → IPA via rule-based transliteration
  • Loss: CTC (Connectionist Temporal Classification)
  • Optimizer: AdamW
  • Training framework: HuggingFace Trainer
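
The card does not publish the transliteration rules, but the general shape of a rule-based Devanagari → IPA converter can be sketched as follows. The character mappings and the inherent-schwa handling here are illustrative assumptions, covering only enough of the script for one example (word-final schwa deletion and many characters are omitted):

```python
# Illustrative, partial mappings -- not the model's actual rule set.
CONSONANTS = {"न": "n", "म": "m", "स": "s", "त": "t̪", "क": "k", "र": "r"}
VOWEL_SIGNS = {"े": "eː", "ा": "ɑː", "ि": "i", "ु": "u"}
INDEP_VOWELS = {"अ": "ə", "आ": "ɑː", "इ": "i"}
VIRAMA = "्"   # suppresses the consonant's inherent vowel
SCHWA = "ə"    # inherent vowel that follows a bare consonant

def transliterate(text):
    out = []
    pending = False  # True while the last consonant's inherent schwa is unresolved
    for ch in text:
        if ch in CONSONANTS:
            if pending:
                out.append(SCHWA)  # realize previous consonant's inherent vowel
            out.append(CONSONANTS[ch])
            pending = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # explicit vowel replaces the schwa
            pending = False
        elif ch == VIRAMA:
            pending = False  # inherent vowel suppressed
        elif ch in INDEP_VOWELS:
            out.append(INDEP_VOWELS[ch])
            pending = False
    if pending:
        out.append(SCHWA)
    return " ".join(out)

print(transliterate("नमस्ते"))  # n ə m ə s t̪ eː
```

Note that the output matches the space-separated phoneme format the model is trained to emit.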

Limitations

  • Designed specifically for Hindi speech; may not generalize well to other languages
  • CTC-based — no language model or beam search (greedy decode only)
  • Phoneme boundaries from greedy decoding are approximate; use CTC segmentation for more accurate alignment
  • Performance may degrade on noisy or far-field audio
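
For the more accurate alignment mentioned above, one option is Viterbi-style CTC forced alignment, which finds the best frame-to-token path for a known phoneme sequence instead of trusting the greedy argmax. Below is a minimal pure-Python sketch (log_probs is a list of per-frame log-probability lists; it assumes a non-empty target and enough frames). In practice an optimized implementation such as torchaudio.functional.forced_align would be used instead.

```python
def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi CTC alignment: best per-frame token path for a known target sequence."""
    # Expand targets with blanks: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for t in targets:
        ext += [t, blank]
    T, S = len(log_probs), len(ext)
    NEG = float("-inf")
    dp = [[NEG] * S for _ in range(T)]  # best path score ending at (frame, state)
    bp = [[0] * S for _ in range(T)]    # backpointers
    dp[0][0] = log_probs[0][ext[0]]
    dp[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(dp[t - 1][s], s)]                 # stay in the same state
            if s > 0:
                cands.append((dp[t - 1][s - 1], s - 1))  # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))  # skip a blank
            best, prev = max(cands)
            dp[t][s] = best + log_probs[t][ext[s]]
            bp[t][s] = prev
    # Backtrack from the better of the two valid final states
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = bp[t][s]
    path.reverse()
    return path  # one token ID (possibly blank) per frame

# Toy example: 3-token vocab (0 = blank), 4 frames strongly favoring "a a b b"
lp = [[-5.0, -0.1, -5.0],
      [-5.0, -0.1, -5.0],
      [-5.0, -5.0, -0.1],
      [-5.0, -5.0, -0.1]]
print(ctc_forced_align(lp, [1, 2]))  # [1, 1, 2, 2]
```

The returned per-frame path can then be converted to timestamps with the 20 ms frame duration, exactly as in the greedy example, but with boundaries constrained to the known phoneme sequence.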

Citation

If you use this model, please cite the underlying wav2vec2-xlsr work:

@inproceedings{conneau2020unsupervised,
  title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
  author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
  booktitle={Interspeech},
  year={2021}
}