# Whisper Small - Tibetan Speech Translation

This model is a fine-tuned version of openai/whisper-small for Tibetan speech-to-text translation.

## Model Description
- Model: Whisper Small (244M parameters)
- Language: Tibetan (བོད་སྐད།)
- Task: Speech Translation (Tibetan audio → Text transcription)
- Base Model: openai/whisper-small
- Dataset: lilgoose777/merged-tibetan-titung-goose
- Checkpoint: checkpoint-3000
## Intended Uses
This model is designed to transcribe Tibetan speech into written text. It can be used for:
- 📝 Transcribing Tibetan audio recordings
- 🎙️ Building Tibetan speech recognition applications
- 📚 Creating subtitles for Tibetan audio/video content
- 🔬 Research in low-resource language ASR
- 📖 Preserving and digitizing Tibetan oral traditions
## Training Details

### Dataset
The model was trained on the Merged Tibetan Titung Goose dataset, which combines multiple Tibetan audio sources for improved coverage.
- Dataset: lilgoose777/merged-tibetan-titung-goose
- Train/Test Split: 90/10
- Sampling Rate: 16 kHz
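The model expects 16 kHz mono input. Real resampling should be done with a DSP library (librosa, torchaudio); the idea behind it can be sketched with a naive linear-interpolation resampler (illustrative only, not suitable for production audio):

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustration only --
    proper resampling uses a band-limited filter, e.g. librosa/soxr)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# One second of 44.1 kHz audio becomes one second at 16 kHz
one_second = [0.0] * 44100
print(len(resample_linear(one_second, 44100, 16000)))  # 16000
```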
### Training Hyperparameters
The model was fine-tuned with the following configuration:
| Hyperparameter | Value |
|---|---|
| Base Model | openai/whisper-small |
| Training Steps | 3,000 |
| Batch Size (Train) | 16 |
| Batch Size (Eval) | 8 |
| Learning Rate | 1.25e-05 |
| Warmup Steps | 300 |
| Gradient Accumulation | 1 |
| Gradient Checkpointing | True |
| Mixed Precision (FP16) | False |
| Max Generation Length | 225 |
| Evaluation Strategy | Every 100 steps |
| Save Strategy | Every 100 steps |
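The table lists 300 warmup steps against 3,000 total steps. Assuming the HuggingFace Trainer's default linear scheduler (the card does not state the scheduler explicitly), the learning-rate curve can be sketched as:

```python
PEAK_LR = 1.25e-5
WARMUP_STEPS = 300
TOTAL_STEPS = 3000

def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero
    (the Trainer's default 'linear' schedule -- an assumption here)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

print(learning_rate(0))     # 0.0
print(learning_rate(300))   # 1.25e-05 (peak, end of warmup)
print(learning_rate(3000))  # 0.0
```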
### Training Infrastructure
- Framework: HuggingFace Transformers 4.56.2
- Training Framework: Seq2SeqTrainer
- Optimizer: AdamW
- Metric: Word Error Rate (WER)
## Results
| Metric | Value |
|---|---|
| Best WER | XX.XX% |
| Final WER | XX.XX% |
Lower is better: WER (Word Error Rate) measures the fraction of reference words that are transcribed incorrectly (substituted, deleted, or inserted).
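As a rough sketch of what the metric computes (the card's numbers come from the `evaluate` library's WER implementation), word error rate is a word-level Levenshtein distance normalized by the reference length:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the dog sat"))  # 0.3333333333333333 (1 substitution / 3 words)
```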
## Usage

### Quick Start with Pipeline

```python
from transformers import pipeline

# Create a transcription pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="milanakdj/whisper-small-full-tibetan",
    generate_kwargs={"language": "tibetan", "task": "translate"},
)

# Transcribe an audio file
result = pipe("path/to/tibetan_audio.wav")
print(result["text"])
```
### Using Processor and Model

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Load your audio as a 1-D float array sampled at 16 kHz
# audio_array = ... your audio array ...

# Extract input features
input_features = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).input_features

# Force the decoder to the Tibetan language / translate task tokens
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="tibetan",
    task="translate",
)

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        forced_decoder_ids=forced_decoder_ids,
    )

# Decode token IDs back to text
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True,
)[0]
print(transcription)
```
### Using with Librosa

```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Load the audio file (librosa resamples to 16 kHz on load)
audio, sr = librosa.load("tibetan_audio.mp3", sr=16000)

# Process and transcribe
input_features = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
).input_features

forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="tibetan",
    task="translate",
)

predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_decoder_ids,
)

transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True,
)[0]
print(transcription)
```
### Batch Processing Multiple Files

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# The decoder prompt is identical for every file, so build it once
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="tibetan",
    task="translate",
)

# Process multiple files
audio_files = ["file1.wav", "file2.wav", "file3.wav"]
for audio_file in audio_files:
    # Load and resample the audio to 16 kHz
    audio, _ = librosa.load(audio_file, sr=16000)

    # Extract input features and move them to the model's device
    input_features = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features.to(device)

    # Generate and decode
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            forced_decoder_ids=forced_decoder_ids,
        )
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True,
    )[0]
    print(f"{audio_file}: {transcription}")
```
## Evaluation

To evaluate the model on your own Tibetan audio dataset:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import evaluate

# Load model and processor
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Load your dataset
dataset = load_dataset("your_tibetan_dataset")

# Initialize the WER metric
wer_metric = evaluate.load("wer")

# Process and evaluate
# ... (see full evaluation code in documentation)
```
## Limitations and Considerations

### Known Limitations
- Domain Specificity: The model is trained on specific Tibetan dialects and domains present in the training data
- Audio Quality: Performance degrades with:
  - Background noise
  - Poor recording quality
  - Multiple speakers
  - Non-standard dialects
- Low-Resource Language: As Tibetan is a low-resource language, the model may have limited generalization compared to high-resource language models
- Code-Switching: May struggle with Tibetan-English or Tibetan-Chinese code-switching
### Best Practices
- Audio Format: Use 16kHz mono audio for best results
- Clean Audio: Minimize background noise and ensure clear speech
- Standard Dialect: Model performs best on dialects similar to training data
- Audio Length: Optimal performance on audio clips under 30 seconds
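Whisper operates on 30-second windows, which is why shorter clips work best. Longer recordings should be split before transcription; a minimal fixed-length chunker (a hypothetical helper, not part of this model's code -- production pipelines usually split on silence instead) might look like:

```python
CHUNK_SECONDS = 30
SAMPLE_RATE = 16000

def chunk_audio(samples, chunk_seconds=CHUNK_SECONDS, sample_rate=SAMPLE_RATE):
    """Split a 1-D list of samples into fixed-length chunks
    matching Whisper's 30-second context window."""
    size = chunk_seconds * sample_rate
    return [samples[i:i + size] for i in range(0, len(samples), size)]

# 70 seconds of audio -> chunks of 30 s, 30 s, and 10 s
chunks = chunk_audio([0.0] * (70 * SAMPLE_RATE))
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```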
## Ethical Considerations

### Intended Use
This model is intended for:
- ✅ Transcription of Tibetan speech
- ✅ Educational purposes
- ✅ Research in speech recognition
- ✅ Preservation of Tibetan language
### Out-of-Scope Use
- ❌ Surveillance or monitoring without consent
- ❌ Generating misleading transcriptions
- ❌ Any use that violates privacy or human rights
### Bias and Fairness
- The model's performance may vary across different Tibetan dialects
- Training data may not represent all Tibetan-speaking communities equally
- Users should evaluate the model on their specific use case before deployment
## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{whisper-small-tibetan-2024,
  author       = {Milan Akdj},
  title        = {Whisper Small - Tibetan Speech Translation},
  year         = {2024},
  publisher    = {HuggingFace},
  journal      = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/milanakdj/whisper-small-full-tibetan}}
}
```

Also cite the original Whisper paper:

```bibtex
@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}
```
## Acknowledgements
- Base Model: OpenAI Whisper Small
- Dataset: Merged Tibetan Titung Goose
- Framework: HuggingFace Transformers
- Training: Fine-tuned using HuggingFace Seq2SeqTrainer
## Model Card Authors
Milan Akdj
## Contact
For questions or issues with this model, please open an issue on the model repository.
## Additional Information

### Model Architecture
Whisper Small uses a Transformer encoder-decoder architecture:
- Encoder: Processes audio features
- Decoder: Generates text transcription
- Parameters: ~244M in total
### Training Environment
- Platform: RunPod GPU Instance
- Monitoring: TensorBoard logging enabled
- Evaluation: WER metric on held-out test set
### Version History
- v1.0: Initial release with checkpoint-3000
- Language: Tibetan (བོད་སྐད།)
- License: Apache 2.0
- Model Size: ~244M parameters