Whisper Large-v3 Turbo Khmer ASR

Fine-tuned variant of openai/whisper-large-v3-turbo for Khmer automatic speech recognition. The model was trained with the utilities in whisper and is intended for transcription workloads that prioritize Khmer text normalization, including numerals, currency, and date expressions.

Model Card

| Attribute | Value |
| --- | --- |
| Base model | openai/whisper-large-v3-turbo |
| Language | Khmer (km-KH) |
| Task | Automatic speech recognition (speech-to-text) |
| Sample rate | 16 kHz audio, automatically resampled |
| Input length | Clips up to 30 s (longer clips are truncated during batching) |
| Fine-tuning data | asr_mixed_dataset.txt (internal manifests, normalized through dataset_builder.segment_text) |
| Epochs | 20 |
| Batch size | 4 (gradient accumulation 2) |
| Optimizer | AdamW (managed by Seq2SeqTrainer) |
| Learning rate | 1e-6 with cosine scheduler and 1k warmup steps |
| Normalization | Khmer-specific regex and rule-based normalization (khmerspeech, khmercut) |
| Dataset | Mixed Khmer and English, 197K samples (221.36 hours) |
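To make the training settings above more concrete, the following is a minimal sketch of the arithmetic they imply: effective batch size, optimizer steps, and the shape of a cosine schedule with linear warmup. It assumes a single training device and treats "197K" as exactly 197,000 samples; neither is stated in this card. The `lr_at` helper is hypothetical and only approximates the usual warmup-plus-cosine shape, not the exact scheduler used in training.

```python
import math

# Values taken from the table above
NUM_SAMPLES = 197_000      # "197K samples" (assumed to be exactly 197,000)
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 2
EPOCHS = 20
PEAK_LR = 1e-6
WARMUP_STEPS = 1_000
NUM_DEVICES = 1            # assumption: device count is not stated in the card

# Samples consumed per optimizer step
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Hypothetical helper: linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch)   # 8
print("steps per epoch:", steps_per_epoch)        # 24625
print("total optimizer steps:", total_steps)      # 492500
print("lr halfway through warmup:", lr_at(500))
```

Under these assumptions, warmup covers only about 0.2% of the run, so nearly all updates happen on the cosine decay portion of the schedule.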

Limitations: performance has been validated only on internal validation/test splits. Long-form audio, accents outside the training distribution, or noisy backgrounds may degrade accuracy.
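Because inputs are limited to 30 s clips, longer recordings have to be split before transcription. The sketch below shows one possible way to compute overlapping window boundaries in samples. The `chunk_bounds` function and the 5 s overlap are illustrative assumptions, not part of this model's tooling, and the sketch does not cover merging the per-window transcripts.

```python
SAMPLE_RATE = 16_000   # Whisper's expected sample rate
WINDOW_S = 30.0        # maximum clip length from the model card


def chunk_bounds(num_samples, window_s=WINDOW_S, overlap_s=5.0,
                 sample_rate=SAMPLE_RATE):
    """Return (start, end) sample indices of overlapping windows covering the audio.

    Hypothetical helper: the overlap length is an arbitrary illustrative choice.
    """
    window = int(window_s * sample_rate)
    hop = window - int(overlap_s * sample_rate)
    if hop <= 0:
        raise ValueError("overlap_s must be smaller than window_s")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += hop
    return bounds


# A 70 s recording is covered by three windows that overlap by 5 s
for start, end in chunk_bounds(70 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:.0f}s to {end / SAMPLE_RATE:.0f}s")
```

Each `(start, end)` pair can be used to slice the 16 kHz mono waveform before passing it to the pipeline. The transformers ASR pipeline also accepts a `chunk_length_s` argument that performs chunking internally; how well either approach works with this checkpoint has not been validated here, in line with the long-form caveat above.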

Inference Examples

import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


AUDIO_PATH = "audio_path.wav"  # path to the audio file to transcribe


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "metythorn/whisper-large-v3-turbo-mixed-20eps-clean-text-197k"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

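speech_waveform, sr = torchaudio.load(AUDIO_PATH)  # shape: (channels, samples)

# Whisper expects 16 kHz mono: downmix multi-channel audio, then resample if needed
if speech_waveform.shape[0] > 1:
    speech_waveform = speech_waveform.mean(dim=0, keepdim=True)
if sr != 16000:
    speech_waveform = torchaudio.functional.resample(
        speech_waveform,
        orig_freq=sr,
        new_freq=16000,
    )
speech_waveform = speech_waveform.squeeze(0).numpy()  # 1-D float array for the pipeline
result = pipe(speech_waveform)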

print("Transcription:", result["text"])