# LexiCore Wav2Vec2 XLS-R 300M CTC — শব্দতরী Bangla Dialect ASR
This model is a fine-tuned version of `arijitx/wav2vec2-xls-r-300m-bengali`
for the “শব্দতরী: Where Dialects Flow into Bangla” competition.
- Task: dialectal Bangla speech → standard Bangla text
- Data: 3,350 audio clips from 20 regions of Bangladesh (competition dataset only)
- Metric: Normalized Levenshtein Similarity (char-level)
- Decoding: CTC + 5-gram KenLM (pyctcdecode) + a small punctuation rule
- Training:
  - 20 epochs
  - LR = 1e-4
  - Batch size ≈ 8 (4 × 2 gradient accumulation)
  - Strong waveform augmentations (speed, gain, noise, time-drop)
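The character-level similarity metric can be sketched as below. This assumes the leaderboard normalizes edit distance by the length of the longer string; the official scorer's exact normalization may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(ref: str, hyp: str) -> float:
    """1.0 for an exact match, approaching 0.0 as edits accumulate."""
    if not ref and not hyp:
        return 1.0
    return 1.0 - levenshtein(ref, hyp) / max(len(ref), len(hyp))
```

For example, `normalized_similarity("আমার সোনার বাংলা", "আমার সোনার বাংলা")` returns `1.0`, while a single character substitution in a 16-character reference costs `1/16`.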
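The waveform augmentations listed above could look roughly like the following NumPy sketch (speed perturbation, random gain, additive noise, time-drop). The ranges and the implementation here are illustrative assumptions, not the exact training pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(wave: np.ndarray) -> np.ndarray:
    # Speed perturbation: resample by a random factor via linear interpolation.
    factor = rng.uniform(0.9, 1.1)
    idx = np.arange(0, len(wave), factor)
    wave = np.interp(idx, np.arange(len(wave)), wave)

    # Random gain in dB.
    gain_db = rng.uniform(-6.0, 6.0)
    wave = wave * (10.0 ** (gain_db / 20.0))

    # Additive Gaussian noise at a low amplitude.
    wave = wave + rng.normal(0.0, 0.005, size=wave.shape)

    # Time-drop: zero out one short random chunk.
    drop = max(1, int(0.05 * len(wave)))
    start = int(rng.integers(0, max(1, len(wave) - drop)))
    wave[start:start + drop] = 0.0
    return wave.astype(np.float32)
```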
## Intended Use
- Research and experimentation on Bangla ASR for low-resource and dialectal settings
- Non-commercial applications, respecting the original competition and dataset license
## Limitations
- Trained only on short, scripted sentences from 20 Bangladeshi regions
- May not generalize to very long utterances, noisy real-world audio, or code-switching
- Output is in standard written Bangla, not dialect spelling
## Usage
```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, AutoModelForCTC

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained("your-username/your-repo")
model = AutoModelForCTC.from_pretrained("your-username/your-repo").to(device).eval()

# Load audio and resample to the 16 kHz rate the model expects
waveform, sr = torchaudio.load("example.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values.to(device)).logits

# Greedy CTC decoding
pred_ids = torch.argmax(logits, dim=-1)
transcript = processor.batch_decode(pred_ids)[0]
print(transcript)
```
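The greedy decoding above relies on the standard CTC collapse rule (merge repeated ids, then drop blanks), which `processor.batch_decode` applies internally; the released decoder additionally rescores with the 5-gram KenLM via pyctcdecode. A minimal sketch of the collapse rule, assuming blank id 0:

```python
def ctc_collapse(ids, blank_id=0):
    """Merge consecutive repeats, then remove the CTC blank token."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

print(ctc_collapse([0, 3, 3, 0, 3, 5, 5]))  # → [3, 3, 5]
```

Note that a blank between two identical ids keeps both, which is how CTC represents doubled characters.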