# whisper-small-nepali
Whisper-Small-Nepali is a fine-tuned automatic speech recognition (ASR) model based on OpenAI's Whisper architecture, optimized for transcribing Nepali speech. It was fine-tuned from openai/whisper-small on the amitpant7/nepali-speech-to-text dataset.
## Evaluation Results
| Metric | Value |
|---|---|
| WER (%) | 31.64 |
| CER (%) | 7.27 |
| Eval Loss | 0.2154 |
| Train Loss | 1.0033 |
| Epochs | 30 |
| Train time | 26.5 min (RTX PRO 6000 Blackwell) |
Evaluated on a 10% held-out validation split of amitpant7/nepali-speech-to-text (seed=42). The best checkpoint was selected by lowest validation WER across all evaluation steps.
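WER counts word-level substitutions, insertions, and deletions relative to the reference transcript; CER is the same computation at character level. A minimal, dependency-free sketch of WER via Levenshtein distance (the reported numbers come from the actual evaluation pipeline, not this helper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat here", "the cat sat"))  # 0.25: one deletion out of four words
```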
## Training Curves
Training curves: train loss, validation loss, WER, CER, and the cosine LR schedule across all 1140 steps.
## How to Use
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

model = WhisperForConditionalGeneration.from_pretrained("devrahulbanjara/whisper-small-nepali")
processor = WhisperProcessor.from_pretrained("devrahulbanjara/whisper-small-nepali")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load your audio (16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Dataset
| Field | Value |
|---|---|
| Dataset | amitpant7/nepali-speech-to-text |
| Language | Nepali (ne) |
| Train split | 90% of dataset |
| Validation split | 10% held-out (seed=42) |
| Credit | amitpant7 |
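The 90/10 split can be reproduced with the `datasets` library's `Dataset.train_test_split(test_size=0.1, seed=42)`. A dependency-free sketch of a seeded shuffle split (the exact index assignment of the library's internal shuffle may differ from this illustration):

```python
import random

def split_indices(n: int, val_frac: float = 0.1, seed: int = 42):
    """Shuffle indices with a fixed seed, then carve off val_frac for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(n * val_frac)
    return idx[n_val:], idx[:n_val]  # train, validation

train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))  # 900 100
```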
## Training Configuration
| Hyperparameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Epochs | 30 |
| Total steps | 1140 |
| Batch size (per device) | 32 |
| Gradient accumulation steps | 2 |
| Effective batch size | 64 |
| Learning rate | 3e-5 |
| LR scheduler | cosine (warmup + decay) |
| Warmup steps | 150 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Dropout | 0.1 |
| Attention dropout | 0.1 |
| Early stopping patience | 5 evaluations |
| Early stopping threshold | 0.05 WER improvement |
| Eval every | 80 steps |
| Best model metric | WER (lower is better) |
| Precision | bf16 (Blackwell native) |
| Seed | 42 |
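The listed numbers are mutually consistent, as a quick sanity check shows. Note that the implied train-set size (roughly 2.4k samples) is an inference from the table, not a documented figure:

```python
per_device_batch = 32
grad_accum_steps = 2
effective_batch = per_device_batch * grad_accum_steps  # 64, as listed above

epochs, total_steps = 30, 1140
steps_per_epoch = total_steps / epochs                 # 38 optimizer steps per epoch
approx_train_samples = steps_per_epoch * effective_batch

print(effective_batch, steps_per_epoch, approx_train_samples)  # 64 38.0 2432.0
```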
## Regularisation & Augmentation
| Technique | Details |
|---|---|
| Weight decay (L2) | λ=0.01 on all non-bias, non-LayerNorm params |
| Dropout | p=0.1 on encoder + decoder residual connections |
| Attention dropout | p=0.1 inside multi-head attention layers |
| SpecAugment (time) | 2× time masks, max 30 consecutive frames each |
| SpecAugment (freq) | 2× frequency masks, max 15 mel-bins each |
| Gaussian noise | Amplitude U[0.002, 0.010], applied with p=0.4 |
| Early stopping | Patience=5 evals, min delta=0.05 WER |
| Gradient clipping | Max L2 norm = 1.0 |
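A minimal sketch of the masking and noise steps above, using the table's values. The function name and the choice to apply noise to the features (rather than the raw waveform, which the table does not specify) are assumptions for illustration:

```python
import numpy as np

def augment(mel: np.ndarray, rng: np.random.Generator,
            n_time=2, max_t=30, n_freq=2, max_f=15,
            noise_p=0.4, noise_lo=0.002, noise_hi=0.010) -> np.ndarray:
    """SpecAugment-style time/frequency masking plus occasional Gaussian noise."""
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    for _ in range(n_time):                       # 2 time masks, up to 30 frames each
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(1, n_frames - t + 1)))
        mel[:, t0:t0 + t] = 0.0
    for _ in range(n_freq):                       # 2 frequency masks, up to 15 mel-bins each
        f = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, max(1, n_mels - f + 1)))
        mel[f0:f0 + f, :] = 0.0
    if rng.random() < noise_p:                    # additive Gaussian noise with p=0.4
        amp = rng.uniform(noise_lo, noise_hi)     # amplitude drawn from U[0.002, 0.010]
        mel = mel + amp * rng.standard_normal(mel.shape)
    return mel

rng = np.random.default_rng(42)
out = augment(np.ones((80, 300)), rng)
print(out.shape)  # (80, 300)
```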
## Training Infrastructure
| Field | Value |
|---|---|
| Hardware | NVIDIA RTX PRO 6000 Blackwell Server Edition GPU; host CPU with 24 physical / 48 logical cores |
| Framework | HuggingFace Transformers Seq2SeqTrainer |
| Experiment tracking | Weights & Biases |
| Train samples/sec | 44.97 |
| Total FLOPs | 2.06 × 10¹⁹ |
| Best model selection | Lowest validation WER, restored via load_best_model_at_end=True |
## License
This model inherits the Apache 2.0 license from openai/whisper-small. The training dataset (amitpant7/nepali-speech-to-text) is credited to its original author; please review its license before any commercial use.
## Acknowledgements
- OpenAI Whisper – base model architecture and weights
- amitpant7 – Nepali speech-to-text dataset
- HuggingFace Transformers – training framework
- Weights & Biases – experiment tracking
