Whisper Large V3 - Indic Multilingual (Kannada + Telugu)

This model is a fine-tuned version of OpenAI's Whisper Large V3 for Kannada (kn) and Telugu (te) speech recognition. All 1.5B encoder and decoder parameters were fine-tuned using a 3D-parallel training setup; the final checkpoint reaches an overall word error rate (WER) of 48.18% on the held-out evaluation set.

Model Description

  • Base Model: openai/whisper-large-v3 (1.5B parameters)
  • Languages: Kannada (kn), Telugu (te)
  • Fine-tuning Strategy: Full encoder-decoder fine-tuning (all 1.5B parameters trained)
  • Model Size: ~6.17 GB (safetensors format)
  • License: Apache 2.0

Training Details

Training Configuration

Architecture:
  - Base: Whisper Large V3 (1.5B parameters)
  - Training: Full fine-tuning (encoder + decoder)
  - Parallelism: 3D Parallel Training
    - Data Parallel: 2
    - Tensor Parallel: 2
    - Pipeline Parallel: 1
  - Attention: Parallel Multi-Head Attention with Flash Attention

Hardware:
  - GPUs: 4x NVIDIA L4 (24GB each)
  - Mixed Precision: FP16
  - Gradient Checkpointing: Enabled (50% ratio)

Hyperparameters:
  - Total Training Steps: 4,000
  - Epochs: 0.63
  - Per-device Batch Size: 4
  - Gradient Accumulation Steps: 4
  - Effective Batch Size: 32 (4 × 4 × 2 DP)
  - Learning Rate: 1e-5
  - Warmup Steps: 500
  - Weight Decay: 0.01
  - Max Gradient Norm: 1.0
  - Optimizer: AdamW
  - LR Scheduler: Linear decay with warmup

Data Configuration:
  - Max Audio Length: 30 seconds
  - Max Text Length: 225 tokens
  - Sampling Rate: 16,000 Hz
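
The configuration above maps fairly directly onto the Hugging Face Trainer API. Below is a minimal sketch using Seq2SeqTrainingArguments, assuming a recent transformers release; the actual run used a custom 3D-parallel pipeline, so the output directory and logging cadence are illustrative only.

from transformers import Seq2SeqTrainingArguments

# Approximate mapping of the hyperparameters listed above. The real training
# used a custom 3D-parallel pipeline; treat this as a reference, not the script.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-kn-te",   # illustrative path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    weight_decay=0.01,
    max_grad_norm=1.0,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",                 # "evaluation_strategy" on older releases
    eval_steps=1000,
    save_steps=1000,
    logging_steps=25,
    predict_with_generate=True,
)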

Training Metrics

| Metric | Value |
|---|---|
| Overall WER | 48.18% |
| Kannada (kn) WER | 43.12% |
| Telugu (te) WER | 55.47% |
| Final Training Loss | 4.48e-05 |
| Final Eval Loss | 0.1318 |
| Total Training Time | ~25 seconds (per checkpoint) |
| Training Throughput | 10,165 samples/sec |

Training Progress

The model was evaluated every 1,000 steps:

| Step | Epoch | Eval Loss | Overall WER | KN WER | TE WER |
|---|---|---|---|---|---|
| 1000 | 0.158 | 0.1962 | 58.12% | - | - |
| 2000 | 0.316 | 0.1522 | 50.67% | - | - |
| 3000 | 0.474 | 0.1392 | 48.36% | - | - |
| 4000 | 0.631 | 0.1318 | 48.18% | 43.12% | 55.47% |

Best Model: Checkpoint 4000 (final checkpoint) with WER of 48.18%

Learning Rate Schedule

The model uses linear learning rate decay after warmup:

  • Warmup: Steps 0-500 (0 → 1e-5)
  • Decay: Steps 500-4000 (1e-5 → ~5.7e-9)
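
The same shape can be reproduced with the standard linear-warmup helper from transformers. A small sketch; the optimizer here is a stand-in used only to show the schedule, not the training setup itself:

import torch
from transformers import get_linear_schedule_with_warmup

# Dummy parameter/optimizer purely to drive the scheduler.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-5, weight_decay=0.01)

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,     # LR ramps from 0 to 1e-5 over the first 500 steps
    num_training_steps=4000,  # then decays linearly toward 0 by step 4000
)

lrs = []
for _ in range(4000):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

print(lrs[499], lrs[1999], lrs[3999])  # ~1e-5 at end of warmup, then linear decay toward 0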

Loss Curves

Training Loss: Smooth convergence from 0.8885 (step 25) to 0.1285 (step 4000)

Validation Loss: Decreased from 0.1962 (step 1000) to 0.1318 (step 4000), showing consistent improvement without overfitting.

Gradient Norm: Stable throughout training (0.65-4.47), indicating healthy gradient flow.

Performance Benchmarks

Word Error Rate (WER) by Language

| Language | WER | Notes |
|---|---|---|
| Kannada (kn) | 43.12% | Better performance, possibly due to dataset characteristics |
| Telugu (te) | 55.47% | More challenging language or less training data |
| Overall | 48.18% | Averaged across both languages |
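
To compute a comparable WER on your own transcripts, one option is the jiwer package; a minimal sketch (not necessarily the exact metric implementation used during evaluation):

# pip install jiwer
import jiwer

references = ["ground truth transcript one", "ground truth transcript two"]  # illustrative
hypotheses = ["model output one", "model output two"]                        # illustrative

wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.2%}")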

Inference Speed

  • Evaluation Runtime: ~310 seconds for 200 samples
  • Throughput: 0.644 samples/second
  • Average processing: ~1.55 seconds per sample

TensorBoard Dashboard

Training metrics were logged to TensorBoard in real-time. You can visualize:

  • Training/Validation Loss curves
  • WER progression
  • Learning rate schedule
  • Gradient norms
  • GPU utilization
  • Memory usage

Dashboard Location: checkpoints/whisper_marathi_hindi_production/logs/

To view locally:

tensorboard --logdir checkpoints/whisper_marathi_hindi_production/logs/
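
The same event files can also be read programmatically; a sketch using TensorBoard's EventAccumulator (the scalar tag names are assumptions and may differ from what the pipeline actually logged):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point at the same log directory used above.
acc = EventAccumulator("checkpoints/whisper_marathi_hindi_production/logs/")
acc.Reload()

print(acc.Tags()["scalars"])  # list the scalar tags that were actually logged
# Then read one series, e.g.:
# events = acc.Scalars("eval/wer")  # tag name is a guess; pick one from the list above
# steps, values = [e.step for e in events], [e.value for e in events]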

WandB Integration

Training runs were also logged to Weights & Biases for comprehensive tracking of:

  • 60+ system metrics (GPU, CPU, memory)
  • Learning curves
  • Model checkpoints
  • Hyperparameter tracking

Project: whisper-marathi-hindi-production

Usage

Installation

pip install transformers torch torchaudio

Basic Inference

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te"
)
processor = WhisperProcessor.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te"
)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and preprocess audio (mono, 16 kHz)
audio, sr = torchaudio.load("path/to/audio.wav")
audio = audio.mean(dim=0)  # mix down to mono if the file has multiple channels
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    audio = resampler(audio)

# Prepare inputs
inputs = processor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
).to(device)

# Generate transcription
with torch.no_grad():
    generated_ids = model.generate(inputs["input_features"])

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]

print(f"Transcription: {transcription}")

Language-Specific Decoding

# Force Kannada transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(language="kn", task="transcribe")

generated_ids = model.generate(
    inputs["input_features"],
    forced_decoder_ids=forced_decoder_ids
)

# Force Telugu transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(language="te", task="transcribe")

generated_ids = model.generate(
    inputs["input_features"],
    forced_decoder_ids=forced_decoder_ids
)
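
On recent transformers releases, Whisper's generate also accepts the language and task directly, which avoids building forced_decoder_ids by hand. An equivalent sketch, assuming a version that supports these keyword arguments:

# Equivalent on recent transformers versions
generated_ids = model.generate(
    inputs["input_features"],
    language="kn",   # or "te" for Telugu
    task="transcribe"
)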

Batch Processing

import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te"
)
processor = WhisperProcessor.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te"
)
model = model.to("cuda")

# Process multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
audios = []

for file in audio_files:
    audio, sr = torchaudio.load(file)
    audio = audio.mean(dim=0)  # mix down to mono
    if sr != 16000:
        audio = torchaudio.transforms.Resample(sr, 16000)(audio)
    audios.append(audio.numpy())

# Batch inference
inputs = processor(
    audios,
    sampling_rate=16000,
    return_tensors="pt",
    padding=True
).to("cuda")

with torch.no_grad():
    generated_ids = model.generate(inputs["input_features"])

transcriptions = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, text in enumerate(transcriptions):
    print(f"{audio_files[i]}: {text}")

Advanced: Beam Search Decoding

# Use beam search for potentially better quality.
# Temperature only affects sampling, so it is omitted for beam search.
generated_ids = model.generate(
    inputs["input_features"],
    num_beams=5,
    max_length=225,
    early_stopping=True
)

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]
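
Audio longer than Whisper's 30-second window has to be chunked. One option is the transformers ASR pipeline with chunking; a sketch with reasonable defaults, not tuned settings for this model:

import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="whisper-large-v3-indic-multilingual-kn-te",
    torch_dtype=torch.float16,
    device="cuda:0",
    chunk_length_s=30,  # split long audio into 30-second windows
)

result = asr(
    "path/to/long_audio.wav",
    generate_kwargs={"language": "kn", "task": "transcribe"},
)
print(result["text"])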

Model Architecture

Based on Whisper Large V3 architecture:

Encoder:
  - 32 Transformer layers
  - 1280 hidden dimensions
  - 20 attention heads
  - Input: 128-dimensional log-mel spectrogram

Decoder:
  - 32 Transformer layers
  - 1280 hidden dimensions
  - 20 attention heads
  - Vocabulary size: 51,866 tokens

Total Parameters: ~1.5 billion
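
These figures can be checked directly from the loaded checkpoint; a quick sketch:

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te"
)
cfg = model.config
print(cfg.encoder_layers, cfg.decoder_layers)        # encoder/decoder layers
print(cfg.d_model, cfg.encoder_attention_heads)      # hidden size, attention heads
print(cfg.vocab_size, cfg.num_mel_bins)              # vocabulary size, mel bins
print(sum(p.numel() for p in model.parameters()))    # total parameter count (~1.5B)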

Limitations and Bias

  1. Language Coverage: Optimized only for Kannada and Telugu. Performance on other Indic languages is not guaranteed.

  2. Domain Specificity: Trained on specific domains; may underperform on:
    • Technical/medical jargon
    • Strong accents or dialects
    • Noisy environments
    • Code-mixed speech

  3. Audio Quality: Best performance on:
    • Clean audio (low background noise)
    • 16 kHz sampling rate
    • Single speaker
    • Clear pronunciation

  4. Computational Requirements (see the FP16 loading sketch after this list):
    • Model size: ~6 GB
    • Requires significant GPU memory for inference
    • Recommended: 16 GB+ GPU RAM

  5. Bias Considerations:
    • Training data may contain inherent biases
    • Performance may vary across genders, age groups, and regional accents
    • Use with caution in sensitive applications
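
For the memory constraints in item 4, loading the weights in half precision roughly halves the ~6 GB footprint; a minimal sketch:

import torch
from transformers import WhisperForConditionalGeneration

# FP16 weights cut memory roughly in half versus FP32, at some cost to numerical headroom.
model = WhisperForConditionalGeneration.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te",
    torch_dtype=torch.float16,
)
model = model.to("cuda")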

Training Data

The model was trained on a custom dataset containing:

  • Languages: Kannada and Telugu audio samples
  • Domain: Conversational speech with emotion labels
  • Dataset Path: /app/development/ASR/Dataset_with_emotion/Datasets
  • Preprocessing: Standardized to 16kHz, max 30 seconds per sample

Note: Specific dataset statistics and composition details are proprietary.
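
The preprocessing described above (16 kHz, at most 30 seconds per sample) can be approximated as follows; the actual pipeline used for the proprietary dataset is not published, so this is illustrative:

import torchaudio

TARGET_SR = 16_000
MAX_SECONDS = 30

def preprocess(path):
    audio, sr = torchaudio.load(path)
    audio = audio.mean(dim=0)  # mix down to mono
    if sr != TARGET_SR:
        audio = torchaudio.functional.resample(audio, sr, TARGET_SR)
    return audio[: MAX_SECONDS * TARGET_SR]  # truncate to 30 seconds

waveform = preprocess("path/to/audio.wav")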

Environmental Impact

Training Configuration:

  • Hardware: 4x NVIDIA L4 GPUs
  • Training Duration: ~4,000 steps
  • Power Consumption: Estimated based on L4 TDP
  • Carbon Footprint: Depends on energy source (not measured)

Citation

If you use this model, please cite:

@misc{whisper-large-v3-indic-kn-te,
  author = {Vignesh B Yaadav},
  title = {Whisper Large V3 Fine-tuned for Kannada and Telugu},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/whisper-large-v3-indic-multilingual-kn-te}
}

Also cite the original Whisper paper:

@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

Model Card Authors

  • Training Infrastructure: Custom 3D Parallel Training Pipeline
  • Fine-tuning: Vignesh (vignesh-trustt)
  • Base Model: OpenAI Whisper Team

Model Card Contact

For questions, issues, or collaboration:


Last Updated: 2024-12-09

Model Version: 1.0.0 (Checkpoint 4000)

Framework: PyTorch + HuggingFace Transformers
