Whisper Large-v3-Turbo Nepali (Mixed-Dataset Fine-tuned)

This model is a fine-tuned version of openai/whisper-large-v3-turbo on a mixed dataset of Nepali speech. It targets three common failure modes of the base model on Nepali: Hindi-fication (drifting into Hindi vocabulary), premature audio truncation, and hallucination loops.

Model Description

  • Developed by: tonibirat
  • Language(s): Nepali (ne)
  • Model Type: Automatic Speech Recognition (ASR)
  • Base Model: openai/whisper-large-v3-turbo
  • Fine-tuning Strategy: QLoRA (4-bit)

Key Features

  • Anti-Hindi Bias: Fine-tuned to prefer Nepali vocabulary (e.g., the Nepali conjunction 'र' rather than the Hindi 'और').
  • Long-Form Stability: Trained on mixed-length segments to prevent the typical 5-10s truncation error.
  • CTranslate2 Support: Includes a pre-quantized INT8 version in the ct2/ subfolder for high-performance production use.

Training Details

Training Data

The model was trained on a 50/50 interleaved mixture of:

  1. Firoj112/nepali-asr-whisper: High-quality short-form Nepali data.
  2. Google FLEURS (ne_np): Standardized long-form Nepali speech data.
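
The 50/50 interleaving can be sketched as a simple round-robin over the two sources; in practice this is roughly what `datasets.interleave_datasets` with equal probabilities approximates. The function and example values below are illustrative, not taken from the training code.

```python
# Round-robin sketch of a 50/50 mixture of two example streams.
# The longer stream's tail is drained after the shorter one is exhausted.
from itertools import chain, zip_longest

def interleave(short_form, long_form):
    """Alternate examples from the two sources, keeping the leftover tail."""
    pairs = zip_longest(short_form, long_form)
    return [x for x in chain.from_iterable(pairs) if x is not None]

print(interleave(["s1", "s2"], ["l1", "l2", "l3"]))
# → ['s1', 'l1', 's2', 'l2', 'l3']
```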

Training Hyperparameters

The following hyperparameters were used during training:

  • Learning Rate: 1e-3
  • Batch Size: 8 (Train), 8 (Eval)
  • Seed: 42
  • Gradient Accumulation Steps: 4
  • Total Train Batch Size: 32
  • Optimizer: AdamW (Fused, betas=(0.9,0.999), epsilon=1e-08)
  • LR Scheduler Type: Linear
  • Warmup Steps: 50
  • Total Training Steps: 1000
  • Mixed Precision: Native AMP (FP16/BF16)
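
The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration sketch (argument names from transformers 4.47; `output_dir` is a placeholder, and the exact training script is not published in this repo):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mirrors the listed hyperparameters; everything else is left
# at library defaults.
args = Seq2SeqTrainingArguments(
    output_dir="whisper-turbo-ne-mixed",  # placeholder, not from this repo
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,        # 8 x 4 = 32 effective batch size
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    warmup_steps=50,
    max_steps=1000,
    fp16=True,                            # or bf16=True on supported GPUs
)
```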

Framework Versions

  • PEFT: 0.18.1
  • Transformers: 4.47.0
  • PyTorch: 2.5.1+cu121
  • Datasets: 3.1.0
  • Tokenizers: 0.21.0

Usage and Pipeline

Recommended Production Pipeline

For best results on long-form audio (>30 s), we recommend a sliding-window chunking approach with a 15 s window and a 2 s overlap, as implemented in our production stack.
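
The merge logic of the production stack is not included here, but the window arithmetic it relies on can be sketched as follows (function and parameter names are ours, chosen to mirror the 15 s / 2 s recommendation above):

```python
# Derive (start, end) chunk boundaries for sliding-window transcription:
# 15 s windows that advance by 13 s, so consecutive chunks overlap by 2 s.

def chunk_bounds(duration_s, window_s=15.0, overlap_s=2.0):
    """Return (start, end) pairs covering `duration_s` seconds of audio."""
    step = window_s - overlap_s          # 13 s advance per chunk
    bounds, start = [], 0.0
    while start < duration_s:
        bounds.append((start, min(start + window_s, duration_s)))
        if start + window_s >= duration_s:
            break                        # last window already reaches the end
        start += step
    return bounds

print(chunk_bounds(30.0))  # → [(0.0, 15.0), (13.0, 28.0), (26.0, 30.0)]
```

Each chunk is transcribed independently and the overlapping 2 s regions are used to stitch the hypotheses back together.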

Local Usage (Hugging Face Transformers)

from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="tonibirat/whisper-large-v3-turbo-ne-mixed")
result = transcriber("audio.mp3", generate_kwargs={"language": "nepali", "task": "transcribe"})
print(result["text"])

High-Performance Usage (CTranslate2)

The model includes a CTranslate2 INT8 version. Download the contents of the ct2/ folder to use with the ctranslate2 Python package.

  • Speed-up: 3-5x faster than PyTorch.
  • VRAM Savings: ~4x reduction via INT8 quantization.
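
A common way to load CTranslate2 Whisper weights is through faster-whisper. The sketch below assumes the ct2/ folder has been downloaded locally and that a CUDA device is available; it has not been tested against this exact checkpoint.

```python
from faster_whisper import WhisperModel

# Load the pre-quantized INT8 weights from the downloaded ct2/ folder.
model = WhisperModel("ct2/", device="cuda", compute_type="int8")

# Transcribe with the Nepali language token forced, matching the
# transformers example above.
segments, info = model.transcribe("audio.mp3", language="ne", task="transcribe")
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```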

Evaluation Results (Normalized)

Evaluated on the recording 2079-11-27_248.wav using advanced normalization (NFC, numeral standardization, spelling-variant mapping):

  • WER: 0.82
  • CER: 0.39

Note: High WER is primarily due to whitespace segmentation differences in Devanagari. CER is a more accurate reflection of the model's phonetic precision.
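
The normalization and scoring pipeline can be sketched in a few lines. The exact spelling-variant table used for the reported numbers is not published here, so the single `VARIANTS` entry below is an illustrative placeholder:

```python
# Minimal sketch of the scoring pipeline: Unicode NFC, Devanagari→ASCII
# numeral standardization, a (placeholder) spelling-variant map, and CER
# computed as Levenshtein distance over characters.
import unicodedata

DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")
VARIANTS = {"और": "र"}  # hypothetical entry, for illustration only

def normalize(text):
    text = unicodedata.normalize("NFC", text).translate(DIGITS)
    return " ".join(VARIANTS.get(tok, tok) for tok in text.split())

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(cer(normalize("१२३"), normalize("123")))  # → 0.0
```

Because `cer` operates on characters rather than whitespace-delimited tokens, it is insensitive to the Devanagari word-segmentation differences that inflate the WER.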

Model Size

0.8B parameters (F16, safetensors)

Reported Metrics (self-reported, model card metadata)

  • WER (Normalized) on FLEURS (Nepali) + Mixed Nepali ASR: 0.820
  • CER (Normalized) on FLEURS (Nepali) + Mixed Nepali ASR: 0.390