Whisper Small - Tibetan Speech Translation

This model is a fine-tuned version of openai/whisper-small for Tibetan speech-to-text translation.

Model Description

  • Model: Whisper Small (244M parameters)
  • Language: Tibetan (བོད་སྐད།)
  • Task: Speech translation (Tibetan audio → written text)
  • Base Model: openai/whisper-small
  • Dataset: lilgoose777/merged-tibetan-titung-goose
  • Checkpoint: checkpoint-3000

Intended Uses

This model is designed to transcribe Tibetan speech into written text. It can be used for:

  • 📝 Transcribing Tibetan audio recordings
  • 🎙️ Building Tibetan speech recognition applications
  • 📚 Creating subtitles for Tibetan audio/video content
  • 🔬 Research in low-resource language ASR
  • 📖 Preserving and digitizing Tibetan oral traditions

Training Details

Dataset

The model was trained on the Merged Tibetan Titung Goose dataset, which combines multiple Tibetan audio sources for improved coverage.

  • Dataset: lilgoose777/merged-tibetan-titung-goose
  • Train/Test Split: 90/10
  • Sampling Rate: 16000 Hz

Training Hyperparameters

The model was fine-tuned with the following configuration:

| Hyperparameter | Value |
|---|---|
| Base Model | openai/whisper-small |
| Training Steps | 3,000 |
| Batch Size (Train) | 16 |
| Batch Size (Eval) | 8 |
| Learning Rate | 1.25e-05 |
| Warmup Steps | 300 |
| Gradient Accumulation | 1 |
| Gradient Checkpointing | True |
| Mixed Precision (FP16) | False |
| Max Generation Length | 225 |
| Evaluation Strategy | Every 100 steps |
| Save Strategy | Every 100 steps |
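The table above maps directly onto a `Seq2SeqTrainingArguments` configuration. The sketch below is a reconstruction from the listed values, not the author's exact training script; argument names follow Transformers 4.56, and the `output_dir` is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Reconstructed from the hyperparameter table above (assumed, not the
# original script). output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-full-tibetan",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    learning_rate=1.25e-5,
    warmup_steps=300,
    max_steps=3000,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    fp16=False,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    predict_with_generate=True,
    generation_max_length=225,
    metric_for_best_model="wer",
    greater_is_better=False,  # lower WER is better
    report_to=["tensorboard"],
)
```

These arguments would then be passed to `Seq2SeqTrainer` together with the model, processor, datasets, and a WER-computing `compute_metrics` callback.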

Training Infrastructure

  • Framework: HuggingFace Transformers 4.56.2
  • Training Framework: Seq2SeqTrainer
  • Optimizer: AdamW
  • Metric: Word Error Rate (WER)

Results

| Metric | Value |
|---|---|
| Best WER | XX.XX% |
| Final WER | XX.XX% |

Lower WER is better. WER (Word Error Rate) counts the word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words, so it can exceed 100% on very poor output.
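The definition above can be made concrete with a short pure-Python sketch. The training run used the `evaluate` library's "wer" metric; this word-level edit-distance version just illustrates what the number means.

```python
# Minimal WER illustration: (substitutions + deletions + insertions)
# divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a 4-word reference -> WER = 1/4
print(wer("the cat sat here", "the cat sat down"))  # 0.25
```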

Usage

Quick Start with Pipeline

from transformers import pipeline

# Create transcription pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="milanakdj/whisper-small-full-tibetan",
    generate_kwargs={"language": "tibetan", "task": "translate"}
)

# Transcribe audio file
result = pipe("path/to/tibetan_audio.wav")
print(result["text"])

Using Processor and Model

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

# Load model and processor
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Load your audio (16kHz sampling rate)
# audio_array = ... your audio array ...

# Process audio
input_features = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt"
).input_features

# Generate transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="tibetan",
    task="translate"
)

with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        forced_decoder_ids=forced_decoder_ids
    )

# Decode
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print(transcription)

Using with Librosa

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Load audio file (automatically resamples to 16kHz)
audio, sr = librosa.load("tibetan_audio.mp3", sr=16000)

# Process and transcribe
input_features = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt"
).input_features

forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="tibetan",
    task="translate"
)

predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_decoder_ids
)

transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print(transcription)

Batch Processing Multiple Files

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Decoder prompt tokens are identical for every file, so build them once
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="tibetan",
    task="translate"
)

# Transcribe multiple files, one at a time
audio_files = ["file1.wav", "file2.wav", "file3.wav"]

for audio_file in audio_files:
    # Load audio at 16 kHz
    audio, _ = librosa.load(audio_file, sr=16000)

    # Extract log-mel input features
    input_features = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features.to(device)

    # Generate token IDs
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            forced_decoder_ids=forced_decoder_ids
        )

    # Decode token IDs to text
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]

    print(f"{audio_file}: {transcription}")

Evaluation

To evaluate the model on your own Tibetan audio dataset:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import evaluate

# Load model
processor = WhisperProcessor.from_pretrained("milanakdj/whisper-small-full-tibetan")
model = WhisperForConditionalGeneration.from_pretrained("milanakdj/whisper-small-full-tibetan")

# Load your dataset (placeholder name; the loop below assumes an "audio"
# column and a reference-text column named "sentence" -- adjust to match
# your dataset's schema)
dataset = load_dataset("your_tibetan_dataset", split="test")

# Initialize WER metric
wer_metric = evaluate.load("wer")

# Transcribe each example and score against the references
predictions, references = [], []
for example in dataset:
    input_features = processor(
        example["audio"]["array"],
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features
    predicted_ids = model.generate(input_features)
    predictions.append(
        processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    )
    references.append(example["sentence"])

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")

Limitations and Considerations

Known Limitations

  1. Domain Specificity: The model is trained on specific Tibetan dialects and domains present in the training data
  2. Audio Quality: Performance degrades with:
    • Background noise
    • Poor recording quality
    • Multiple speakers
    • Non-standard dialects
  3. Low-Resource Language: As Tibetan is a low-resource language, the model may have limited generalization compared to high-resource language models
  4. Code-Switching: May struggle with Tibetan-English or Tibetan-Chinese code-switching

Best Practices

  • Audio Format: Use 16kHz mono audio for best results
  • Clean Audio: Minimize background noise and ensure clear speech
  • Standard Dialect: Model performs best on dialects similar to training data
  • Audio Length: Optimal performance on audio clips under 30 seconds

Ethical Considerations

Intended Use

This model is intended for:

  • ✅ Transcription of Tibetan speech
  • ✅ Educational purposes
  • ✅ Research in speech recognition
  • ✅ Preservation of Tibetan language

Out-of-Scope Use

  • ❌ Surveillance or monitoring without consent
  • ❌ Generating misleading transcriptions
  • ❌ Any use that violates privacy or human rights

Bias and Fairness

  • The model's performance may vary across different Tibetan dialects
  • Training data may not represent all Tibetan-speaking communities equally
  • Users should evaluate the model on their specific use case before deployment

Citation

If you use this model in your research or application, please cite:

@misc{whisper-small-tibetan-2024,
  author = {Milan Akdj},
  title = {Whisper Small - Tibetan Speech Translation},
  year = {2024},
  publisher = {HuggingFace},
  journal = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/milanakdj/whisper-small-full-tibetan}}
}

Also cite the original Whisper paper:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Acknowledgements

Model Card Authors

Milan Akdj

Contact

For questions or issues with this model, please open an issue on the model repository.


Additional Information

Model Architecture

Whisper Small uses a Transformer encoder-decoder architecture:

  • Encoder: Processes audio features
  • Decoder: Generates text transcription
  • Parameters: ~244M total parameters

Training Environment

  • Platform: RunPod GPU Instance
  • Monitoring: TensorBoard logging enabled
  • Evaluation: WER metric on held-out test set

Version History

  • v1.0: Initial release with checkpoint-3000

Language: Tibetan (བོད་སྐད།)
License: Apache 2.0
Model Size: ~244M parameters
