Wav2Vec2-Large-XLSR for Hindi IPA Phoneme Recognition

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 that transcribes Hindi speech into IPA phoneme sequences using CTC decoding.

Instead of transcribing to text, this model outputs individual IPA (International Phonetic Alphabet) phoneme tokens — making it useful for phonetic analysis, pronunciation assessment, forced alignment, and linguistic research on Hindi speech.

Model Details

  • Architecture: Wav2Vec2ForCTC (24 transformer layers, 1024 hidden size, 16 attention heads)
  • Base model: facebook/wav2vec2-large-xlsr-53
  • Fine-tuning dataset: AI4Bharat IndicVoices — Hindi subset
  • Vocabulary: 64 IPA phoneme tokens (including special tokens)
  • Sampling rate: 16 kHz
  • Framework: PyTorch / HuggingFace Transformers

Phoneme Vocabulary

The model recognizes 61 Hindi IPA phonemes plus 3 special tokens:

Category Phonemes
Vowels ə, ɑː, i, , u, , , , aːi, aːu
Plosives p, , b, , , t̪ʰ, , d̪ʰ, ʈ, ʈʰ, ɖ, ɖʰ, k, , g, , q
Affricates c, , ɟ, ɟʰ, ɕc
Fricatives s, z, ɕ, ʂ, h, ɦ, f, x, ɣ
Nasals m, n, ɲ, ɳ, ŋ,
Liquids & Glides l, r, ɾ, ɽ, ɽʱ, j, v
Clusters , t̪ɾ, gj
Syllabic , l̩ː, ɹ̩, ɹ̩ː
Special <pad> (CTC blank), <unk>, | (word delimiter)
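
As a toy illustration of how the special tokens behave during decoding (the token IDs below are hypothetical, not the model's actual mapping, which comes from processor.tokenizer.get_vocab()): <pad> acts as the CTC blank and is dropped, consecutive repeats are collapsed, and | survives as a word boundary.

```python
# Hypothetical subset of the 64-token vocabulary, for illustration only.
id_to_token = {0: "<pad>", 1: "n", 2: "ə", 3: "|"}

def greedy_ctc_collapse(frame_ids):
    """Collapse per-frame predictions: drop <pad> (CTC blank) and repeats."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != 0:  # new token and not the blank
            out.append(id_to_token[i])
        prev = i
    return " ".join(out)

print(greedy_ctc_collapse([1, 1, 0, 2, 2, 3, 1]))  # "n ə | n"
```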

Usage

Quick Start

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model_name = "xnpx/wav2vec2-large-xlsr-ipa-phonemes"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

# Load audio and convert to 16 kHz mono
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
waveform = waveform.mean(dim=0)  # collapse channels to mono; shape (time,)

# Run inference
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)
# Example output: "n ə m ə s t̪ eː"

With Timestamps (Greedy CTC)

import numpy as np

log_probs = torch.nn.functional.log_softmax(logits, dim=-1).cpu().numpy()[0]
pred_ids = np.argmax(log_probs, axis=-1)

# Build the ID -> phoneme mapping from the tokenizer vocabulary
id_to_phoneme = {v: k for k, v in processor.tokenizer.get_vocab().items()}

# Frame duration: product of conv_stride values / sampling_rate
# For this model: 5*2*2*2*2*2*2 = 320 samples per frame -> 20ms at 16kHz
frame_duration_s = 320 / 16000  # 0.02s per frame

phonemes, timestamps = [], []
prev_id = None
for frame_idx, token_id in enumerate(pred_ids):
    if token_id == 0:  # skip CTC blank
        prev_id = None
        continue
    if token_id == prev_id:  # skip CTC repeats
        continue
    prev_id = token_id
    phoneme = id_to_phoneme.get(int(token_id), "<unk>")
    if phoneme not in ("<pad>", "<unk>", "|"):
        t = frame_idx * frame_duration_s
        phonemes.append(phoneme)
        timestamps.append(t)

for p, t in zip(phonemes, timestamps):
    print(f"  {t:.3f}s  {p}")
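
The loop above records only onset times. A variant that also tracks where each token's run of frames ends yields rough (start, end) spans; this is a sketch, and the boundaries are still only as precise as greedy CTC decoding allows:

```python
def collapse_to_spans(pred_ids, id_to_phoneme, frame_duration_s=0.02, blank_id=0):
    """Collapse per-frame CTC predictions into (phoneme, start_s, end_s) spans.

    A span covers the contiguous run of frames that emitted the token,
    which is only a rough proxy for the true acoustic boundary.
    """
    spans = []
    prev_id = None
    for frame_idx, token_id in enumerate(pred_ids):
        if token_id == blank_id:  # CTC blank ends any open run
            prev_id = None
            continue
        if token_id == prev_id:  # repeat: extend the current span
            name, start, _ = spans[-1]
            spans[-1] = (name, start, (frame_idx + 1) * frame_duration_s)
            continue
        prev_id = token_id
        name = id_to_phoneme.get(int(token_id), "<unk>")
        spans.append((name, frame_idx * frame_duration_s,
                      (frame_idx + 1) * frame_duration_s))
    return spans

# Toy frame sequence, hypothetical IDs
spans = collapse_to_spans([0, 1, 1, 0, 2], {1: "n", 2: "ə"})
for name, start, end in spans:
    print(f"{start:.2f}-{end:.2f}s  {name}")  # 0.02-0.06s n, 0.08-0.10s ə
```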

Training Details

  • Base model: facebook/wav2vec2-large-xlsr-53 (pre-trained on 53 languages)
  • Dataset: AI4Bharat IndicVoices Hindi split
  • Text-to-phoneme conversion: Devanagari script → IPA via rule-based transliteration
  • Loss: CTC (Connectionist Temporal Classification)
  • Optimizer: AdamW
  • Training framework: HuggingFace Trainer
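
The card does not publish the transliteration rules, but the general shape of a rule-based Devanagari → IPA converter can be sketched as follows. The character mappings and the inherent-schwa handling here are illustrative assumptions, covering only enough of the script for one example (word-final schwa deletion and many characters are omitted):

```python
# Illustrative, partial mappings -- not the model's actual rule set.
CONSONANTS = {"न": "n", "म": "m", "स": "s", "त": "t̪", "क": "k", "र": "r"}
VOWEL_SIGNS = {"े": "eː", "ा": "ɑː", "ि": "i", "ु": "u"}
INDEP_VOWELS = {"अ": "ə", "आ": "ɑː", "इ": "i"}
VIRAMA = "्"   # suppresses the consonant's inherent vowel
SCHWA = "ə"    # inherent vowel that follows a bare consonant

def transliterate(text):
    out = []
    pending = False  # True while the last consonant's inherent schwa is unresolved
    for ch in text:
        if ch in CONSONANTS:
            if pending:
                out.append(SCHWA)  # realize previous consonant's inherent vowel
            out.append(CONSONANTS[ch])
            pending = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # explicit vowel replaces the schwa
            pending = False
        elif ch == VIRAMA:
            pending = False  # inherent vowel suppressed
        elif ch in INDEP_VOWELS:
            out.append(INDEP_VOWELS[ch])
            pending = False
    if pending:
        out.append(SCHWA)
    return " ".join(out)

print(transliterate("नमस्ते"))  # n ə m ə s t̪ eː
```

Note that the output matches the space-separated phoneme format the model is trained to emit.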

Limitations

  • Designed specifically for Hindi speech; may not generalize well to other languages
  • CTC-based — no language model or beam search (greedy decode only)
  • Phoneme boundaries from greedy decoding are approximate; use CTC segmentation for more accurate alignment
  • Performance may degrade on noisy or far-field audio
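
For the more accurate alignment mentioned above, one option is Viterbi-style CTC forced alignment, which finds the best frame-to-token path for a known phoneme sequence instead of trusting the greedy argmax. Below is a minimal pure-Python sketch (log_probs is a list of per-frame log-probability lists; it assumes a non-empty target and enough frames). In practice an optimized implementation such as torchaudio.functional.forced_align would be used instead.

```python
def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi CTC alignment: best per-frame token path for a known target sequence."""
    # Expand targets with blanks: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for t in targets:
        ext += [t, blank]
    T, S = len(log_probs), len(ext)
    NEG = float("-inf")
    dp = [[NEG] * S for _ in range(T)]  # best path score ending at (frame, state)
    bp = [[0] * S for _ in range(T)]    # backpointers
    dp[0][0] = log_probs[0][ext[0]]
    dp[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(dp[t - 1][s], s)]                 # stay in the same state
            if s > 0:
                cands.append((dp[t - 1][s - 1], s - 1))  # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))  # skip a blank
            best, prev = max(cands)
            dp[t][s] = best + log_probs[t][ext[s]]
            bp[t][s] = prev
    # Backtrack from the better of the two valid final states
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = bp[t][s]
    path.reverse()
    return path  # one token ID (possibly blank) per frame

# Toy example: 3-token vocab (0 = blank), 4 frames strongly favoring "a a b b"
lp = [[-5.0, -0.1, -5.0],
      [-5.0, -0.1, -5.0],
      [-5.0, -5.0, -0.1],
      [-5.0, -5.0, -0.1]]
print(ctc_forced_align(lp, [1, 2]))  # [1, 1, 2, 2]
```

The returned per-frame path can then be converted to timestamps with the 20 ms frame duration, exactly as in the greedy example, but with boundaries constrained to the known phoneme sequence.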

Citation

If you use this model, please cite the underlying wav2vec2-xlsr work:

@inproceedings{conneau2020unsupervised,
  title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
  author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
  booktitle={Interspeech},
  year={2021}
}