Whisper Small - Tibetan Speech Translation

This model is a fine-tuned version of openai/whisper-small for Tibetan speech-to-text translation.

Model Description

  • Model: Whisper Small (244M parameters)
  • Language: Tibetan (བོད་སྐད།)
  • Task: Speech translation (Tibetan audio → written text)
  • Base Model: openai/whisper-small
  • Dataset: lilgoose777/merged-tibetan-titung-goose
  • Checkpoint: checkpoint-3000

Intended Uses

This model is designed to transcribe Tibetan speech into written text. It can be used for:

  • 📝 Transcribing Tibetan audio recordings
  • 🎙️ Building Tibetan speech recognition applications
  • 📚 Creating subtitles for Tibetan audio/video content
  • 🔬 Research in low-resource language ASR
  • 📖 Preserving and digitizing Tibetan oral traditions

Training Details

Dataset

The model was trained on the Merged Tibetan Titung Goose dataset, which combines multiple Tibetan audio sources for improved coverage.

  • Dataset: lilgoose777/merged-tibetan-titung-goose
  • Train/Test Split: 90/10
  • Sampling Rate: 16000 Hz

Training Hyperparameters

The model was fine-tuned with the following configuration:

| Hyperparameter | Value |
|---|---|
| Base Model | openai/whisper-small |
| Training Steps | 3,000 |
| Batch Size (Train) | 16 |
| Batch Size (Eval) | 8 |
| Learning Rate | 1.25e-05 |
| Warmup Steps | 300 |
| Gradient Accumulation | 1 |
| Gradient Checkpointing | True |
| Mixed Precision (FP16) | False |
| Max Generation Length | 225 |
| Evaluation Strategy | Every 100 steps |
| Save Strategy | Every 100 steps |
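The table above maps directly onto a `Seq2SeqTrainingArguments` configuration. The sketch below is a reconstruction from the listed values, not the author's exact training script; argument names follow Transformers 4.56, and the `output_dir` is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Reconstructed from the hyperparameter table above (assumed, not the
# original script). output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-full-tibetan",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    learning_rate=1.25e-5,
    warmup_steps=300,
    max_steps=3000,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    fp16=False,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    predict_with_generate=True,
    generation_max_length=225,
    metric_for_best_model="wer",
    greater_is_better=False,  # lower WER is better
    report_to=["tensorboard"],
)
```

These arguments would then be passed to `Seq2SeqTrainer` together with the model, processor, datasets, and a WER-computing `compute_metrics` callback.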

Training Infrastructure

  • Framework: HuggingFace Transformers 4.56.2
  • Training Framework: Seq2SeqTrainer
  • Optimizer: AdamW
  • Metric: Word Error Rate (WER)

Results

| Metric | Value |
|---|---|
| Best WER | XX.XX% |
| Final WER | XX.XX% |

Lower WER is better. WER (Word Error Rate) counts the word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words, so it can exceed 100% on very poor output.
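The definition above can be made concrete with a short pure-Python sketch. The training run used the `evaluate` library's "wer" metric; this word-level edit-distance version just illustrates what the number means.

```python
# Minimal WER illustration: (substitutions + deletions + insertions)
# divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a 4-word reference -> WER = 1/4
print(wer("the cat sat here", "the cat sat down"))  # 0.25
```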

Usage

Quick Start with Pipeline

from transformers import pipeline

# Create transcription pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="milanakdj/whisper-small-full-tibetan",
    generate_kwargs={"language": "tibetan", "task": "translate"}
)

# Transcribe audio file
result = pipe("path/to/tibetan_audio.wav")
print(result["text"])

Using Processor and Model

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

# Load model and processor
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Load your audio (16kHz sampling rate)
# audio_array = ... your audio array ...

# Process audio
input_features = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt"
).input_features

# Generate transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="tibetan",
    task="translate"
)

with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        forced_decoder_ids=forced_decoder_ids
    )

# Decode
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print(transcription)

Using with Librosa

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Load audio file (automatically resamples to 16kHz)
audio, sr = librosa.load("tibetan_audio.mp3", sr=16000)

# Process and transcribe
input_features = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt"
).input_features

forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="tibetan",
    task="translate"
)

predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_decoder_ids
)

transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print(transcription)

Batch Processing Multiple Files

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Decoder prompt tokens are identical for every file, so build them once
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="tibetan",
    task="translate"
)

# Transcribe multiple files, one at a time
audio_files = ["file1.wav", "file2.wav", "file3.wav"]

for audio_file in audio_files:
    # Load audio at 16 kHz
    audio, _ = librosa.load(audio_file, sr=16000)

    # Extract log-mel input features
    input_features = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features.to(device)

    # Generate token IDs
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            forced_decoder_ids=forced_decoder_ids
        )

    # Decode token IDs to text
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]

    print(f"{audio_file}: {transcription}")

Evaluation

To evaluate the model on your own Tibetan audio dataset:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import evaluate

# Load model
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Load your dataset (placeholder name; the loop below assumes an "audio"
# column and a reference-text column named "sentence" -- adjust to match
# your dataset's schema)
dataset = load_dataset("your_tibetan_dataset", split="test")

# Initialize WER metric
wer_metric = evaluate.load("wer")

# Transcribe each example and score against the references
predictions, references = [], []
for example in dataset:
    input_features = processor(
        example["audio"]["array"],
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features
    predicted_ids = model.generate(input_features)
    predictions.append(
        processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    )
    references.append(example["sentence"])

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")

Limitations and Considerations

Known Limitations

  1. Domain Specificity: The model is trained on specific Tibetan dialects and domains present in the training data
  2. Audio Quality: Performance degrades with:
    • Background noise
    • Poor recording quality
    • Multiple speakers
    • Non-standard dialects
  3. Low-Resource Language: As Tibetan is a low-resource language, the model may have limited generalization compared to high-resource language models
  4. Code-Switching: May struggle with Tibetan-English or Tibetan-Chinese code-switching

Best Practices

  • Audio Format: Use 16kHz mono audio for best results
  • Clean Audio: Minimize background noise and ensure clear speech
  • Standard Dialect: Model performs best on dialects similar to training data
  • Audio Length: Optimal performance on audio clips under 30 seconds

Ethical Considerations

Intended Use

This model is intended for:

  • ✅ Transcription of Tibetan speech
  • ✅ Educational purposes
  • ✅ Research in speech recognition
  • ✅ Preservation of Tibetan language

Out-of-Scope Use

  • ❌ Surveillance or monitoring without consent
  • ❌ Generating misleading transcriptions
  • ❌ Any use that violates privacy or human rights

Bias and Fairness

  • The model's performance may vary across different Tibetan dialects
  • Training data may not represent all Tibetan-speaking communities equally
  • Users should evaluate the model on their specific use case before deployment

Citation

If you use this model in your research or application, please cite:

@misc{whisper-small-tibetan-2024,
  author = {Milan Akdj},
  title = {Whisper Small - Tibetan Speech Translation},
  year = {2024},
  publisher = {HuggingFace},
  journal = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/milanakdj/whisper-small-full-tibetan}}
}

Also cite the original Whisper paper:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Acknowledgements

Model Card Authors

Milan Akdj

Contact

For questions or issues with this model, please open an issue on the model repository.


Additional Information

Model Architecture

Whisper Small uses a Transformer encoder-decoder architecture:

  • Encoder: Processes audio features
  • Decoder: Generates text transcription
  • Parameters: ~244M total parameters

Training Environment

  • Platform: RunPod GPU Instance
  • Monitoring: TensorBoard logging enabled
  • Evaluation: WER metric on held-out test set

Version History

  • v1.0: Initial release with checkpoint-3000

Language: Tibetan (བོད་སྐད།)
License: Apache 2.0
Model Size: ~244M parameters
