# Whisper Large-v3-Turbo Nepali (Mixed-Dataset Fine-tuned)
This model is a fine-tuned version of openai/whisper-large-v3-turbo on a mixed dataset of Nepali speech to address common failures in base models, specifically Hindi-fication, Audio Truncation, and Hallucination Loops.
## Model Description
- Developed by: tonibirat
- Language(s): Nepali (ne)
- Model Type: Automatic Speech Recognition (ASR)
- Base Model: openai/whisper-large-v3-turbo
- Fine-tuning Strategy: QLoRA (4-bit)
## Key Features
- Anti-Hindi Bias: Fine-tuned to prefer Nepali vocabulary (e.g., using 'र' instead of 'और').
- Long-Form Stability: Trained on mixed-length segments to prevent the typical 5-10s truncation error.
- CTranslate2 Support: Includes a pre-quantized INT8 version in the ct2/ subfolder for high-performance production use.
## Training Details

### Training Data
The model was trained on a 50/50 interleaved mixture of:
- Firoj112/nepali-asr-whisper: High-quality short-form Nepali data.
- Google FLEURS (ne_np): Standardized long-form Nepali speech data.
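The 50/50 mixture can be sketched in pure Python. This is an illustrative sketch only; a real pipeline would more likely use the Hugging Face `datasets` library's `interleave_datasets` with equal probabilities, and the sample dicts below are placeholders:

```python
def interleave_50_50(ds_a, ds_b):
    """Yield samples alternately from two datasets (a 50/50 mix).
    Stops when the shorter dataset is exhausted."""
    for a, b in zip(ds_a, ds_b):
        yield a
        yield b

# Toy stand-ins for the short-form and long-form corpora
short_form = [{"audio": f"short_{i}.wav"} for i in range(3)]
long_form = [{"audio": f"fleurs_{i}.wav"} for i in range(3)]

mixed = list(interleave_50_50(short_form, long_form))
print([s["audio"] for s in mixed])
# Alternates short-form and long-form samples
```

Alternating short and long segments per batch is what exposes the model to both length regimes during every optimizer step.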
### Training Hyperparameters
The following hyperparameters were used during training:
- Learning Rate: 1e-3
- Batch Size: 8 (Train), 8 (Eval)
- Seed: 42
- Gradient Accumulation Steps: 4
- Total Train Batch Size: 32
- Optimizer: AdamW (Fused, betas=(0.9,0.999), epsilon=1e-08)
- LR Scheduler Type: Linear
- Warmup Steps: 50
- Total Training Steps: 1000
- Mixed Precision: Native AMP (FP16/BF16)
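The warmup-then-linear-decay schedule implied by the hyperparameters above can be written out directly. A minimal sketch, assuming the standard linear scheduler that decays to zero at the final step:

```python
PEAK_LR = 1e-3
WARMUP_STEPS = 50
TOTAL_STEPS = 1000

def linear_schedule(step):
    """Linear warmup to PEAK_LR over WARMUP_STEPS, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

print(linear_schedule(25))    # halfway through warmup -> half the peak LR
print(linear_schedule(1000))  # end of training -> 0.0
```

Note also that the effective batch size of 32 follows from the per-device batch size of 8 times 4 gradient-accumulation steps.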
### Framework Versions
- PEFT: 0.18.1
- Transformers: 4.47.0
- PyTorch: 2.5.1+cu121
- Datasets: 3.1.0
- Tokenizers: 0.21.0
## Usage and Pipeline

### Recommended Production Pipeline
For the best results on long-form audio (>30s), we recommend using a Sliding Window Chunking approach with a 15s window and 2s overlap, as implemented in our production stack.
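Computing those overlapping window boundaries (15s window, 2s overlap, hence a 13s stride) can be sketched as follows; the function and variable names are illustrative, not taken from the production stack:

```python
def chunk_spans(duration_s, window_s=15.0, overlap_s=2.0):
    """Return (start, end) spans covering the audio with overlapping windows."""
    stride = window_s - overlap_s
    spans = []
    start = 0.0
    while start < duration_s:
        spans.append((start, min(start + window_s, duration_s)))
        if start + window_s >= duration_s:
            break
        start += stride
    return spans

print(chunk_spans(40.0))
# -> [(0.0, 15.0), (13.0, 28.0), (26.0, 40.0)]
```

With the `transformers` pipeline, a similar effect is available via its `chunk_length_s` and `stride_length_s` arguments; the overlap lets transcripts from adjacent windows be merged without dropping words at the boundaries.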
### Local Usage (Hugging Face Transformers)

```python
from transformers import pipeline

# Load the fine-tuned ASR pipeline from the Hub
transcriber = pipeline(
    "automatic-speech-recognition",
    model="tonibirat/whisper-large-v3-turbo-ne-mixed",
)

# Force Nepali decoding to avoid Hindi-fication on ambiguous audio
result = transcriber(
    "audio.mp3",
    generate_kwargs={"language": "nepali", "task": "transcribe"},
)
print(result["text"])
```
### High-Performance Usage (CTranslate2)
The model includes a CTranslate2 INT8 version. Download the contents of the ct2/ folder to use with the ctranslate2 Python package.
- Speed-up: 3-5x faster than PyTorch.
- VRAM Savings: ~4x reduction via INT8 quantization.
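A sketch of loading the INT8 weights through `faster-whisper`, a common Python wrapper around CTranslate2. This assumes you have downloaded the contents of the ct2/ folder to a local directory; the path and options shown are illustrative:

```python
from faster_whisper import WhisperModel

# Point at the locally downloaded ct2/ directory containing the INT8 weights
model = WhisperModel("ct2/", device="cuda", compute_type="int8")

segments, info = model.transcribe("audio.mp3", language="ne", task="transcribe")
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")
```

On machines without a GPU, `device="cpu"` with `compute_type="int8"` is the usual fallback.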
## Evaluation Results (Normalized)
Evaluated on 2079-11-27_248.wav using advanced normalization (NFC, Numeral Standardization, Spelling Variant Mapping):
| Metric | Score |
|---|---|
| WER | 0.82 |
| CER | 0.39 |
Note: High WER is primarily due to whitespace segmentation differences in Devanagari. CER is a more accurate reflection of the model's phonetic precision.
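The whitespace effect can be demonstrated with a small pure-Python edit-distance sketch (toy Devanagari strings, not the actual evaluation data):

```python
def levenshtein(a, b):
    """Edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def error_rate(ref, hyp, unit):
    """WER when unit=str.split, CER when unit=list (per-character)."""
    r, h = unit(ref), unit(hyp)
    return levenshtein(r, h) / len(r)

# Same characters, different whitespace segmentation
ref = "नेपाल को राजधानी"
hyp = "नेपालको राजधानी"

print(error_rate(ref, hyp, str.split))  # WER: whole words count as errors
print(error_rate(ref, hyp, list))       # CER: only the missing space differs
```

Here a single missing space turns two correct words into errors at the word level, while the character-level distance is just one deletion; this is the mechanism behind the WER/CER gap noted above.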