Whisper Large V3 - Indic Multilingual (Kannada + Telugu)
This model is a fine-tuned version of OpenAI's Whisper Large V3 for Kannada (kn) and Telugu (te) speech recognition. All 1.5B encoder and decoder parameters were trained (full fine-tuning) using a 3D-parallel training setup for Indic-language ASR.
Model Description
- Base Model: openai/whisper-large-v3 (1.5B parameters)
- Languages: Kannada (kn), Telugu (te)
- Fine-tuning Strategy: Full encoder-decoder fine-tuning (all 1.5B parameters trained)
- Model Size: ~6.17 GB (safetensors format)
- License: Apache 2.0
Training Details
Training Configuration
Architecture:
- Base: Whisper Large V3 (1.5B parameters)
- Training: Full fine-tuning (encoder + decoder)
- Parallelism: 3D Parallel Training
- Data Parallel: 2
- Tensor Parallel: 2
- Pipeline Parallel: 1
- Attention: Parallel Multi-Head Attention with Flash Attention
Hardware:
- GPUs: 4x NVIDIA L4 (24GB each)
- Mixed Precision: FP16
- Gradient Checkpointing: Enabled (50% ratio)
Hyperparameters:
- Total Training Steps: 4,000
- Epochs: 0.63
- Per-device Batch Size: 4
- Gradient Accumulation Steps: 4
- Effective Batch Size: 32 (4 × 4 × 2 DP)
- Learning Rate: 1e-5
- Warmup Steps: 500
- Weight Decay: 0.01
- Max Gradient Norm: 1.0
- Optimizer: AdamW
- LR Scheduler: Linear decay with warmup
Data Configuration:
- Max Audio Length: 30 seconds
- Max Text Length: 225 tokens
- Sampling Rate: 16,000 Hz
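The data constraints above can be sketched as simple truncation helpers. This is an illustrative, pure-Python sketch with assumed helper names (`clip_audio`, `clip_tokens`), not code taken from the training pipeline:

```python
# Data limits from the configuration above (illustrative only)
MAX_AUDIO_SECONDS = 30
SAMPLING_RATE = 16_000
MAX_TEXT_TOKENS = 225

def clip_audio(samples):
    """Truncate a waveform (a sequence of samples at 16 kHz) to 30 seconds."""
    return samples[: MAX_AUDIO_SECONDS * SAMPLING_RATE]

def clip_tokens(token_ids):
    """Truncate a token-id sequence to the 225-token text limit."""
    return token_ids[:MAX_TEXT_TOKENS]

waveform = [0.0] * (45 * SAMPLING_RATE)  # a 45-second dummy clip
print(len(clip_audio(waveform)) // SAMPLING_RATE)  # → 30
```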
Training Metrics
| Metric | Value |
|---|---|
| Overall WER | 48.18% |
| Kannada (kn) WER | 43.12% |
| Telugu (te) WER | 55.47% |
| Final Training Loss | 4.48e-05 |
| Final Eval Loss | 0.1318 |
| Total Training Time | ~25 seconds (per checkpoint) |
| Training Throughput | 10,165 samples/sec |
Training Progress
The model was evaluated every 1,000 steps:
| Step | Epoch | Eval Loss | Overall WER | KN WER | TE WER |
|---|---|---|---|---|---|
| 1000 | 0.158 | 0.1962 | 58.12% | - | - |
| 2000 | 0.316 | 0.1522 | 50.67% | - | - |
| 3000 | 0.474 | 0.1392 | 48.36% | - | - |
| 4000 | 0.631 | 0.1318 | 48.18% | 43.12% | 55.47% |
Best Model: Checkpoint 4000 (final checkpoint) with WER of 48.18%
Learning Rate Schedule
The model uses linear learning rate decay after warmup:
- Warmup: Steps 0-500 (0 → 1e-5)
- Decay: Steps 500-4000 (1e-5 → ~5.7e-9)
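The schedule above can be sketched as a simple piecewise-linear function. This is a simplified model (the exact tail value, ~5.7e-9 rather than exactly zero, depends on the scheduler implementation):

```python
def lr_at(step, peak=1e-5, warmup=500, total=4000):
    """Linear warmup to `peak`, then linear decay toward zero at `total`."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

print(lr_at(250))   # halfway through warmup → 5e-06
print(lr_at(500))   # peak → 1e-05
print(lr_at(4000))  # end of schedule → 0.0
```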
Loss Curves
Training Loss: Smooth convergence from 0.8885 (step 25) to 0.1285 (step 4000)
Validation Loss: Decreased from 0.1962 (step 1000) to 0.1318 (step 4000), showing consistent improvement without overfitting.
Gradient Norm: Stable throughout training (0.65-4.47), indicating healthy gradient flow.
Performance Benchmarks
Word Error Rate (WER) by Language
| Language | WER | Notes |
|---|---|---|
| Kannada (kn) | 43.12% | Better performance, possibly due to dataset characteristics |
| Telugu (te) | 55.47% | More challenging language or less training data |
| Overall | 48.18% | Averaged across both languages |
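WER as reported above is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal self-contained implementation (the actual evaluation code may differ, e.g. in text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j].
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + cost)     # substitution or match
            prev = cur
    return dp[-1] / len(ref)

print(round(wer("the cat sat", "the cat sit"), 3))  # one substitution → 0.333
```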
Inference Speed
- Evaluation Runtime: ~310 seconds for 200 samples
- Throughput: 0.644 samples/second
- Average processing: ~1.55 seconds per sample
TensorBoard Dashboard
Training metrics were logged to TensorBoard in real-time. You can visualize:
- Training/Validation Loss curves
- WER progression
- Learning rate schedule
- Gradient norms
- GPU utilization
- Memory usage
Dashboard Location: checkpoints/whisper_marathi_hindi_production/logs/
To view locally:
```bash
tensorboard --logdir checkpoints/whisper_marathi_hindi_production/logs/
```
WandB Integration
Training runs were also logged to Weights & Biases for comprehensive tracking of:
- 60+ system metrics (GPU, CPU, memory)
- Learning curves
- Model checkpoints
- Hyperparameter tracking
Project: whisper-marathi-hindi-production
Usage
Installation
```bash
pip install transformers torch torchaudio
```
Basic Inference
```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te"
)
processor = WhisperProcessor.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te"
)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load audio and resample to 16 kHz if needed
audio, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    audio = resampler(audio)

# Prepare inputs
inputs = processor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt",
).to(device)

# Generate transcription
with torch.no_grad():
    generated_ids = model.generate(inputs["input_features"])

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
print(f"Transcription: {transcription}")
```
Language-Specific Decoding
```python
# Force Kannada transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(language="kn", task="transcribe")
generated_ids = model.generate(
    inputs["input_features"],
    forced_decoder_ids=forced_decoder_ids,
)

# Force Telugu transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(language="te", task="transcribe")
generated_ids = model.generate(
    inputs["input_features"],
    forced_decoder_ids=forced_decoder_ids,
)
```
Batch Processing
```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te"
)
processor = WhisperProcessor.from_pretrained(
    "whisper-large-v3-indic-multilingual-kn-te"
)
model = model.to("cuda")

# Load and resample multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
audios = []
for file in audio_files:
    audio, sr = torchaudio.load(file)
    if sr != 16000:
        audio = torchaudio.transforms.Resample(sr, 16000)(audio)
    audios.append(audio.squeeze().numpy())

# Batch inference with padding
inputs = processor(
    audios,
    sampling_rate=16000,
    return_tensors="pt",
    padding=True,
).to("cuda")

with torch.no_grad():
    generated_ids = model.generate(inputs["input_features"])

transcriptions = processor.batch_decode(generated_ids, skip_special_tokens=True)
for file, text in zip(audio_files, transcriptions):
    print(f"{file}: {text}")
```
Advanced: Beam Search Decoding
```python
# Use beam search for potentially better quality.
# Note: temperature only affects sampling (do_sample=True); beam search
# is deterministic, so no temperature is passed here.
generated_ids = model.generate(
    inputs["input_features"],
    num_beams=5,
    max_length=225,
    early_stopping=True,
)
transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
```
Model Architecture
Based on Whisper Large V3 architecture:
Encoder:
- 32 Transformer layers
- 1280 hidden dimensions
- 20 attention heads
- Input: 128-dimensional log-mel spectrogram
Decoder:
- 32 Transformer layers
- 1280 hidden dimensions
- 20 attention heads
- Vocabulary size: 51,866 tokens
Total Parameters: ~1.5 billion
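The ~1.5B figure can be sanity-checked with a back-of-envelope count from the dimensions above. This sketch assumes the standard Transformer layout (4x FFN expansion, tied output projection, vocabulary taken here as 51,866) and ignores biases, layer norms, and the convolutional audio frontend:

```python
# Rough parameter count for Whisper Large V3 (approximation only)
d = 1280          # hidden size
layers = 32       # layers per stack
vocab = 51_866    # decoder vocabulary (output projection tied to embeddings)

attn = 4 * d * d              # Q, K, V, and output projections
ffn = 2 * d * (4 * d)         # two linear layers with 4x expansion
enc_layer = attn + ffn        # self-attention + FFN
dec_layer = 2 * attn + ffn    # self-attention + cross-attention + FFN

total = layers * enc_layer + layers * dec_layer + vocab * d
print(f"~{total / 1e9:.2f}B parameters")  # → ~1.53B
```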
Limitations and Bias
Language Coverage: Optimized only for Kannada and Telugu. Performance on other Indic languages not guaranteed.
Domain Specificity: Trained on specific domains - may underperform on:
- Technical/medical jargon
- Strong accents or dialects
- Noisy environments
- Code-mixed speech
Audio Quality: Best performance on:
- Clean audio (low background noise)
- 16kHz sampling rate
- Single speaker
- Clear pronunciation
Computational Requirements:
- Model size: ~6GB
- Requires significant GPU memory for inference
- Recommended: 16GB+ GPU RAM
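The memory figures above follow from the parameter count. A rough estimate of weight memory alone (activations, KV cache, and framework overhead add to this, which is why 16GB+ is recommended):

```python
# Weight-memory estimate for a ~1.5B-parameter model
params = 1.5e9
fp32_gb = params * 4 / 1e9  # ~6.0 GB, matching the ~6.17 GB checkpoint
fp16_gb = params * 2 / 1e9  # ~3.0 GB if loaded in half precision
print(f"fp32: ~{fp32_gb:.1f} GB, fp16: ~{fp16_gb:.1f} GB")
```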
Bias Considerations:
- Training data may contain inherent biases
- Performance may vary across genders, age groups, and regional accents
- Use with caution in sensitive applications
Training Data
The model was trained on a custom dataset containing:
- Languages: Kannada and Telugu audio samples
- Domain: Conversational speech with emotion labels
- Dataset Path: /app/development/ASR/Dataset_with_emotion/Datasets
- Preprocessing: Standardized to 16 kHz, max 30 seconds per sample
Note: Specific dataset statistics and composition details are proprietary.
Environmental Impact
Training Configuration:
- Hardware: 4x NVIDIA L4 GPUs
- Training Duration: ~4,000 steps
- Power Consumption: Estimated based on L4 TDP
- Carbon Footprint: Depends on energy source (not measured)
Citation
If you use this model, please cite:
```bibtex
@misc{whisper-large-v3-indic-kn-te,
  author = {Vignesh B Yaadav},
  title = {Whisper Large V3 Fine-tuned for Kannada and Telugu},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/whisper-large-v3-indic-multilingual-kn-te}
}
```
Also cite the original Whisper paper:
```bibtex
@misc{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year = {2022},
  eprint = {2212.04356},
  archivePrefix = {arXiv}
}
```
Model Card Authors
- Training Infrastructure: Custom 3D Parallel Training Pipeline
- Fine-tuning: Vignesh (vignesh-trustt)
- Base Model: OpenAI Whisper Team
Model Card Contact
For questions, issues, or collaboration:
- GitHub: Report Issues
- HuggingFace: @vignesh-trustt
Last Updated: 2024-12-09
Model Version: 1.0.0 (Checkpoint 4000)
Framework: PyTorch + HuggingFace Transformers