Whisper Large-v3-Turbo Nepali (Mixed-Dataset Fine-tuned)

This model is a fine-tuned version of openai/whisper-large-v3-turbo on a mixed dataset of Nepali speech. It targets three common failure modes of the base model on Nepali: Hindi-fication (drifting into Hindi vocabulary), premature audio truncation, and hallucination loops.

Model Description

  • Developed by: tonibirat
  • Language(s): Nepali (ne)
  • Model Type: Automatic Speech Recognition (ASR)
  • Base Model: openai/whisper-large-v3-turbo
  • Fine-tuning Strategy: QLoRA (4-bit)

Key Features

  • Anti-Hindi Bias: Fine-tuned to prefer Nepali vocabulary (e.g., the Nepali conjunction 'र' rather than the Hindi 'और').
  • Long-Form Stability: Trained on mixed-length segments to prevent the typical 5-10s truncation error.
  • CTranslate2 Support: Includes a pre-quantized INT8 version in the ct2/ subfolder for high-performance production use.

Training Details

Training Data

The model was trained on a 50/50 interleaved mixture of:

  1. Firoj112/nepali-asr-whisper: High-quality short-form Nepali data.
  2. Google FLEURS (ne_np): Standardized long-form Nepali speech data.
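
The 50/50 interleaving can be sketched as a simple round-robin over the two sources; in practice this is roughly what `datasets.interleave_datasets` with equal probabilities approximates. The function and example values below are illustrative, not taken from the training code.

```python
# Round-robin sketch of a 50/50 mixture of two example streams.
# The longer stream's tail is drained after the shorter one is exhausted.
from itertools import chain, zip_longest

def interleave(short_form, long_form):
    """Alternate examples from the two sources, keeping the leftover tail."""
    pairs = zip_longest(short_form, long_form)
    return [x for x in chain.from_iterable(pairs) if x is not None]

print(interleave(["s1", "s2"], ["l1", "l2", "l3"]))
# → ['s1', 'l1', 's2', 'l2', 'l3']
```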

Training Hyperparameters

The following hyperparameters were used during training:

  • Learning Rate: 1e-3
  • Batch Size: 8 (Train), 8 (Eval)
  • Seed: 42
  • Gradient Accumulation Steps: 4
  • Total Train Batch Size: 32
  • Optimizer: AdamW (Fused, betas=(0.9,0.999), epsilon=1e-08)
  • LR Scheduler Type: Linear
  • Warmup Steps: 50
  • Total Training Steps: 1000
  • Mixed Precision: Native AMP (FP16/BF16)
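
The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration sketch (argument names from transformers 4.47; `output_dir` is a placeholder, and the exact training script is not published in this repo):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mirrors the listed hyperparameters; everything else is left
# at library defaults.
args = Seq2SeqTrainingArguments(
    output_dir="whisper-turbo-ne-mixed",  # placeholder, not from this repo
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,        # 8 x 4 = 32 effective batch size
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    warmup_steps=50,
    max_steps=1000,
    fp16=True,                            # or bf16=True on supported GPUs
)
```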

Framework Versions

  • PEFT: 0.18.1
  • Transformers: 4.47.0
  • PyTorch: 2.5.1+cu121
  • Datasets: 3.1.0
  • Tokenizers: 0.21.0

Usage and Pipeline

Recommended Production Pipeline

For best results on long-form audio (>30 s), we recommend a sliding-window chunking approach with a 15 s window and a 2 s overlap, as implemented in our production stack.
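
The merge logic of the production stack is not included here, but the window arithmetic it relies on can be sketched as follows (function and parameter names are ours, chosen to mirror the 15 s / 2 s recommendation above):

```python
# Derive (start, end) chunk boundaries for sliding-window transcription:
# 15 s windows that advance by 13 s, so consecutive chunks overlap by 2 s.

def chunk_bounds(duration_s, window_s=15.0, overlap_s=2.0):
    """Return (start, end) pairs covering `duration_s` seconds of audio."""
    step = window_s - overlap_s          # 13 s advance per chunk
    bounds, start = [], 0.0
    while start < duration_s:
        bounds.append((start, min(start + window_s, duration_s)))
        if start + window_s >= duration_s:
            break                        # last window already reaches the end
        start += step
    return bounds

print(chunk_bounds(30.0))  # → [(0.0, 15.0), (13.0, 28.0), (26.0, 30.0)]
```

Each chunk is transcribed independently and the overlapping 2 s regions are used to stitch the hypotheses back together.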

Local Usage (Hugging Face Transformers)

from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="tonibirat/whisper-large-v3-turbo-ne-mixed")
result = transcriber("audio.mp3", generate_kwargs={"language": "nepali", "task": "transcribe"})
print(result["text"])

High-Performance Usage (CTranslate2)

The model includes a CTranslate2 INT8 version. Download the contents of the ct2/ folder to use with the ctranslate2 Python package.

  • Speed-up: 3-5x faster than PyTorch.
  • VRAM Savings: ~4x reduction via INT8 quantization.
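
A common way to load CTranslate2 Whisper weights is through faster-whisper. The sketch below assumes the ct2/ folder has been downloaded locally and that a CUDA device is available; it has not been tested against this exact checkpoint.

```python
from faster_whisper import WhisperModel

# Load the pre-quantized INT8 weights from the downloaded ct2/ folder.
model = WhisperModel("ct2/", device="cuda", compute_type="int8")

# Transcribe with the Nepali language token forced, matching the
# transformers example above.
segments, info = model.transcribe("audio.mp3", language="ne", task="transcribe")
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```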

Evaluation Results (Normalized)

Evaluated on the recording 2079-11-27_248.wav using advanced normalization (NFC, numeral standardization, spelling-variant mapping):

  • WER: 0.82
  • CER: 0.39

Note: High WER is primarily due to whitespace segmentation differences in Devanagari. CER is a more accurate reflection of the model's phonetic precision.
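
The normalization and scoring pipeline can be sketched in a few lines. The exact spelling-variant table used for the reported numbers is not published here, so the single `VARIANTS` entry below is an illustrative placeholder:

```python
# Minimal sketch of the scoring pipeline: Unicode NFC, Devanagari→ASCII
# numeral standardization, a (placeholder) spelling-variant map, and CER
# computed as Levenshtein distance over characters.
import unicodedata

DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")
VARIANTS = {"और": "र"}  # hypothetical entry, for illustration only

def normalize(text):
    text = unicodedata.normalize("NFC", text).translate(DIGITS)
    return " ".join(VARIANTS.get(tok, tok) for tok in text.split())

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(cer(normalize("१२३"), normalize("123")))  # → 0.0
```

Because `cer` operates on characters rather than whitespace-delimited tokens, it is insensitive to the Devanagari word-segmentation differences that inflate the WER.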

Model Size

0.8B parameters (F16, safetensors)

Reported Metrics (self-reported, model card metadata)

  • WER (Normalized) on FLEURS (Nepali) + Mixed Nepali ASR: 0.820
  • CER (Normalized) on FLEURS (Nepali) + Mixed Nepali ASR: 0.390