# whisper-small-nepali

Whisper-Small-Nepali is an automatic speech recognition (ASR) model based on OpenAI's Whisper architecture: a fine-tuned version of openai/whisper-small, trained on the amitpant7/nepali-speech-to-text dataset for transcribing Nepali speech.


## 📊 Evaluation Results

| Metric | Value |
|--------|-------|
| WER (%) | 31.64 |
| CER (%) | 7.27 |
| Eval loss | 0.2154 |
| Train loss | 1.0033 |
| Epochs | 30 |
| Train time | 26.5 min (RTX PRO 6000 Blackwell) |

Evaluated on a 10% held-out validation split of amitpant7/nepali-speech-to-text (seed=42). Best checkpoint selected by lowest validation WER across all steps.
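WER and CER are both normalized edit distances, computed at the word and character level respectively. The training run itself presumably used a library such as `evaluate` or `jiwer`; a minimal pure-Python sketch of the computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds dp_old[j-1]; dp[j] is still dp_old[j]; dp[j-1] is new
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is reported alongside the stricter CER.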


## 📈 Training Curves

[Figure: training curves]

Train loss, validation loss, WER, CER, and the cosine LR schedule across all 1140 steps.


## 🚀 How to Use

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

model     = WhisperForConditionalGeneration.from_pretrained("devrahulbanjara/whisper-small-nepali")
processor = WhisperProcessor.from_pretrained("devrahulbanjara/whisper-small-nepali")
device    = "cuda" if torch.cuda.is_available() else "cpu"
model     = model.to(device)

# Load your audio (resampled to 16 kHz mono, as Whisper expects)
audio, sr = librosa.load("your_audio.wav", sr=16000)

# Convert the waveform to log-mel input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
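Whisper's encoder operates on 30-second windows, and the processor pads or truncates a single call to that length, so longer recordings need to be chunked first. A minimal sketch (the chunk and overlap lengths here are illustrative defaults, not values from this model card); transcribe each chunk as above and join the texts:

```python
import numpy as np

def chunk_audio(audio, sr=16000, chunk_s=30.0, overlap_s=1.0):
    """Split a waveform into overlapping windows Whisper can consume.

    A small overlap gives the decoder context at chunk boundaries;
    the last chunk may be shorter than chunk_s.
    """
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)
    return [audio[i:i + chunk] for i in range(0, len(audio), step)]
```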

πŸ—‚οΈ Dataset

Field Value
Dataset amitpant7/nepali-speech-to-text
Language Nepali (ne)
Train split 90% of dataset
Validation split 10% held-out (seed=42)
Credit amitpant7
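The held-out split can be reproduced with datasets' `train_test_split(test_size=0.1, seed=42)`; conceptually it is just a seeded shuffle-and-slice, as in this sketch (not bit-identical to the library's RNG):

```python
import random

def split_indices(n, val_frac=0.1, seed=42):
    """Reproducible train/validation index split.

    Pure-Python illustration of the idea behind
    train_test_split(test_size=0.1, seed=42); the library uses its
    own RNG, so the actual indices will differ.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(n * val_frac)
    return idx[n_val:], idx[:n_val]
```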

βš™οΈ Training Configuration

Hyperparameter Value
Base model openai/whisper-small
Epochs 30
Total steps 1140
Batch size (per device) 32
Gradient accumulation steps 2
Effective batch size 64
Learning rate 3e-5
LR scheduler cosine (warmup + decay)
Warmup steps 150
Weight decay 0.01
Max grad norm 1.0
Dropout 0.1
Attention dropout 0.1
Early stopping patience 5 evaluations
Early stopping threshold 0.05 WER improvement
Eval every 80 steps
Best model metric WER (lower is better)
Precision bf16 (Blackwell native)
Seed 42
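The scheduler ramps the learning rate linearly over the 150 warmup steps to the 3e-5 peak, then decays it along a cosine toward zero at step 1140. Transformers provides this as `get_cosine_schedule_with_warmup`; the shape it traces is, in a sketch:

```python
import math

def lr_at(step, total_steps=1140, warmup=150, peak=3e-5):
    """Linear warmup to the peak LR, then cosine decay to zero.

    Parameter values are taken from the table above.
    """
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```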

πŸ›‘οΈ Regularisation & Augmentation

Technique Details
Weight decay (L2) Ξ»=0.01 on all non-bias, non-LayerNorm params
Dropout p=0.1 on encoder + decoder residual connections
Attention dropout p=0.1 inside multi-head attention layers
SpecAugment (time) 2Γ— time masks, max 30 consecutive frames each
SpecAugment (freq) 2Γ— frequency masks, max 15 mel-bins each
Gaussian noise Amplitude U[0.002, 0.010], applied with p=0.4
Early stopping Patience=5 evals, min delta=0.05 WER
Gradient clipping Max L2 norm = 1.0
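A sketch of how the SpecAugment masks and Gaussian noise could be applied per training example, using the parameter values from the table (the card does not specify the actual implementation, so treat the details below as illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(mel, waveform):
    """Apply SpecAugment masking and random Gaussian noise.

    mel: (n_mels, n_frames) log-mel spectrogram; waveform: raw audio.
    Mask widths and noise parameters match the table above.
    """
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    # SpecAugment: 2 time masks of up to 30 frames each
    for _ in range(2):
        w = int(rng.integers(1, 31))
        t0 = int(rng.integers(0, max(n_frames - w, 1)))
        mel[:, t0:t0 + w] = 0.0
    # SpecAugment: 2 frequency masks of up to 15 mel bins each
    for _ in range(2):
        w = int(rng.integers(1, 16))
        f0 = int(rng.integers(0, max(n_mels - w, 1)))
        mel[f0:f0 + w, :] = 0.0
    # Gaussian noise with probability 0.4, amplitude ~ U[0.002, 0.010]
    if rng.random() < 0.4:
        amp = rng.uniform(0.002, 0.010)
        waveform = waveform + amp * rng.standard_normal(len(waveform))
    return mel, waveform
```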

πŸ—οΈ Training Infrastructure

Field Value
Hardware NVIDIA RTX PRO 6000 Blackwell Server Edition (24 physical / 48 logical cores)
Framework HuggingFace Transformers Seq2SeqTrainer
Experiment tracking Weights & Biases
Train samples/sec 44.97
Total FLOPs 2.06 Γ— 10¹⁹
Best model selection Lowest validation WER, restored via load_best_model_at_end=True

## 📄 License

This model inherits the Apache 2.0 license from openai/whisper-small. The training dataset (amitpant7/nepali-speech-to-text) is credited to its original author; please review its license before any commercial use.


πŸ™ Acknowledgements
