# whisper-small-nepali

Whisper-Small-Nepali is an automatic speech recognition (ASR) model based on OpenAI's Whisper architecture: a fine-tuned version of openai/whisper-small, trained on the amitpant7/nepali-speech-to-text dataset for transcribing Nepali speech.


## 📊 Evaluation Results

| Metric | Value |
|--------|-------|
| WER (%) | 31.64 |
| CER (%) | 7.27 |
| Eval loss | 0.2154 |
| Train loss | 1.0033 |
| Epochs | 30 |
| Train time | 26.5 min (RTX PRO 6000 Blackwell) |

Evaluated on a 10% held-out validation split of amitpant7/nepali-speech-to-text (seed=42). Best checkpoint selected by lowest validation WER across all steps.
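WER and CER are both normalized edit distances, computed at the word and character level respectively. The training run itself presumably used a library such as `evaluate` or `jiwer`; a minimal pure-Python sketch of the computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds dp_old[j-1]; dp[j] is still dp_old[j]; dp[j-1] is new
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is reported alongside the stricter CER.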


## 📈 Training Curves

[Figure: training curves]

Train loss, validation loss, WER, CER, and the cosine LR schedule across all 1140 steps.


## 🚀 How to Use

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

model     = WhisperForConditionalGeneration.from_pretrained("devrahulbanjara/whisper-small-nepali")
processor = WhisperProcessor.from_pretrained("devrahulbanjara/whisper-small-nepali")
device    = "cuda" if torch.cuda.is_available() else "cpu"
model     = model.to(device)

# Load your audio (resampled to 16 kHz mono, as Whisper expects)
audio, sr = librosa.load("your_audio.wav", sr=16000)

# Convert the waveform to log-mel input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
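Whisper's encoder operates on 30-second windows, and the processor pads or truncates a single call to that length, so longer recordings need to be chunked first. A minimal sketch (the chunk and overlap lengths here are illustrative defaults, not values from this model card); transcribe each chunk as above and join the texts:

```python
import numpy as np

def chunk_audio(audio, sr=16000, chunk_s=30.0, overlap_s=1.0):
    """Split a waveform into overlapping windows Whisper can consume.

    A small overlap gives the decoder context at chunk boundaries;
    the last chunk may be shorter than chunk_s.
    """
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)
    return [audio[i:i + chunk] for i in range(0, len(audio), step)]
```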

πŸ—‚οΈ Dataset

Field Value
Dataset amitpant7/nepali-speech-to-text
Language Nepali (ne)
Train split 90% of dataset
Validation split 10% held-out (seed=42)
Credit amitpant7
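The held-out split can be reproduced with datasets' `train_test_split(test_size=0.1, seed=42)`; conceptually it is just a seeded shuffle-and-slice, as in this sketch (not bit-identical to the library's RNG):

```python
import random

def split_indices(n, val_frac=0.1, seed=42):
    """Reproducible train/validation index split.

    Pure-Python illustration of the idea behind
    train_test_split(test_size=0.1, seed=42); the library uses its
    own RNG, so the actual indices will differ.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(n * val_frac)
    return idx[n_val:], idx[:n_val]
```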

βš™οΈ Training Configuration

Hyperparameter Value
Base model openai/whisper-small
Epochs 30
Total steps 1140
Batch size (per device) 32
Gradient accumulation steps 2
Effective batch size 64
Learning rate 3e-5
LR scheduler cosine (warmup + decay)
Warmup steps 150
Weight decay 0.01
Max grad norm 1.0
Dropout 0.1
Attention dropout 0.1
Early stopping patience 5 evaluations
Early stopping threshold 0.05 WER improvement
Eval every 80 steps
Best model metric WER (lower is better)
Precision bf16 (Blackwell native)
Seed 42
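The scheduler ramps the learning rate linearly over the 150 warmup steps to the 3e-5 peak, then decays it along a cosine toward zero at step 1140. Transformers provides this as `get_cosine_schedule_with_warmup`; the shape it traces is, in a sketch:

```python
import math

def lr_at(step, total_steps=1140, warmup=150, peak=3e-5):
    """Linear warmup to the peak LR, then cosine decay to zero.

    Parameter values are taken from the table above.
    """
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```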

πŸ›‘οΈ Regularisation & Augmentation

Technique Details
Weight decay (L2) Ξ»=0.01 on all non-bias, non-LayerNorm params
Dropout p=0.1 on encoder + decoder residual connections
Attention dropout p=0.1 inside multi-head attention layers
SpecAugment (time) 2Γ— time masks, max 30 consecutive frames each
SpecAugment (freq) 2Γ— frequency masks, max 15 mel-bins each
Gaussian noise Amplitude U[0.002, 0.010], applied with p=0.4
Early stopping Patience=5 evals, min delta=0.05 WER
Gradient clipping Max L2 norm = 1.0
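A sketch of how the SpecAugment masks and Gaussian noise could be applied per training example, using the parameter values from the table (the card does not specify the actual implementation, so treat the details below as illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(mel, waveform):
    """Apply SpecAugment masking and random Gaussian noise.

    mel: (n_mels, n_frames) log-mel spectrogram; waveform: raw audio.
    Mask widths and noise parameters match the table above.
    """
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    # SpecAugment: 2 time masks of up to 30 frames each
    for _ in range(2):
        w = int(rng.integers(1, 31))
        t0 = int(rng.integers(0, max(n_frames - w, 1)))
        mel[:, t0:t0 + w] = 0.0
    # SpecAugment: 2 frequency masks of up to 15 mel bins each
    for _ in range(2):
        w = int(rng.integers(1, 16))
        f0 = int(rng.integers(0, max(n_mels - w, 1)))
        mel[f0:f0 + w, :] = 0.0
    # Gaussian noise with probability 0.4, amplitude ~ U[0.002, 0.010]
    if rng.random() < 0.4:
        amp = rng.uniform(0.002, 0.010)
        waveform = waveform + amp * rng.standard_normal(len(waveform))
    return mel, waveform
```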

πŸ—οΈ Training Infrastructure

Field Value
Hardware NVIDIA RTX PRO 6000 Blackwell Server Edition (24 physical / 48 logical cores)
Framework HuggingFace Transformers Seq2SeqTrainer
Experiment tracking Weights & Biases
Train samples/sec 44.97
Total FLOPs 2.06 Γ— 10¹⁹
Best model selection Lowest validation WER, restored via load_best_model_at_end=True

## 📄 License

This model inherits the Apache 2.0 license from openai/whisper-small. The training dataset (amitpant7/nepali-speech-to-text) is credited to its original author; please review its license before any commercial use.


πŸ™ Acknowledgements
