Whisper Large-v3 Turbo Khmer ASR

Fine-tuned variant of openai/whisper-large-v3-turbo for Khmer automatic speech recognition. The model was trained with the utilities in whisper and is intended for transcription workloads that prioritize Khmer text normalization, including numerals, currency, and date expressions.
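As a small illustration of what numeral normalization can involve, the sketch below unifies Khmer digits (U+17E0 to U+17E9) with ASCII digits. It is an assumption-laden example only: `unify_digits` is a hypothetical helper, not part of `khmerspeech` or `khmercut`, and the source does not state which direction (digits, or spelled-out words) the training transcripts were normalized toward.

```python
# Illustrative only: NOT the khmerspeech/khmercut API.
# Khmer digits occupy U+17E0 (Khmer zero) through U+17E9 (Khmer nine).
KHMER_DIGITS = "".join(chr(0x17E0 + i) for i in range(10))
KHMER_TO_ASCII = str.maketrans(KHMER_DIGITS, "0123456789")


def unify_digits(text: str) -> str:
    """Replace Khmer digits with ASCII digits; leave all other characters untouched."""
    return text.translate(KHMER_TO_ASCII)


if __name__ == "__main__":
    print(unify_digits("១២៣"))  # 123
```

Applying the same mapping to both reference transcripts and model output before scoring keeps digit-script differences from being counted as recognition errors.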

Model Card

Attribute	Value
Base model	`openai/whisper-large-v3-turbo`
Language	Khmer (`km-KH`)
Task	Automatic Speech Recognition (speech-to-text)
Sample rate	16 kHz audio, automatically resampled
Input length	Up to 30 s clips (truncated during batching)
Fine-tuning data	`asr_mixed_dataset.txt` (internal manifests, normalized through `dataset_builder.segment_text`)
Epochs	20
Batch size	4 (gradient accumulation 2)
Optimizer	AdamW (managed by `Seq2SeqTrainer`)
Learning rate	1e-6, cosine schedule with 1,000 warmup steps
Normalization	Khmer-specific regex and rule-based normalization (`khmerspeech`, `khmercut`)
Dataset	Mixed Khmer and English, 197K samples (221.36 hours)
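The hyperparameters in the table imply a rough step count and learning-rate curve, sketched below with the standard library only. Two things are assumptions, not facts from the source: training on a single device, and all of the roughly 197,000 samples being in the training split. The schedule follows the usual linear-warmup plus half-cosine-decay shape; the exact curve used in training may differ.

```python
import math

# Values taken from the table above.
PEAK_LR = 1e-6
WARMUP_STEPS = 1_000
EPOCHS = 20
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 2

# Assumptions (not stated in the source): one device, and every one of the
# ~197,000 samples is used for training.
NUM_DEVICES = 1
NUM_TRAIN_SAMPLES = 197_000

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES  # 8
steps_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective batch size:", effective_batch)
    print("optimizer steps (approx.):", total_steps)
    for s in (0, 500, WARMUP_STEPS, total_steps // 2, total_steps):
        print(s, lr_at(s))
```

Under these assumptions the warmup covers only a small fraction of the run, so almost all updates happen on the cosine-decay portion of the curve.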

Limitations: performance has been validated only on internal validation/test splits. Long-form audio, accents outside the training distribution, or noisy backgrounds may degrade accuracy.
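Because input is limited to 30 s, longer recordings have to be split before transcription. The sketch below computes window boundaries in samples; `chunk_bounds` and its parameters are hypothetical names introduced here, and the optional overlap is a common heuristic rather than something the source prescribes. The `transformers` ASR pipeline also offers built-in chunking through its `chunk_length_s` argument, which may be the simpler choice.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=0.0):
    """Return (start, end) sample indices of windows at most chunk_s seconds long."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be in [0, chunk_s)")
    size = int(chunk_s * sample_rate)
    # Guard against a zero step when overlap is within one sample of the chunk size.
    step = max(1, size - int(overlap_s * sample_rate))
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + size, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window already reaches the end of the audio
        start += step
    return bounds


if __name__ == "__main__":
    # A 70 s recording at 16 kHz, split into 30 s windows.
    for start, end in chunk_bounds(70 * 16_000):
        print(start, end)
```

Each window can then be sliced from the waveform and transcribed separately. Joining the per-window transcripts naively can split a word at a boundary, which is the main reason to use a small overlap or the pipeline's own chunking.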

Inference Examples

import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


AUDIO_PATH = "audio_path.wav" 


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "metythorn/whisper-large-v3-turbo-mixed-20eps-clean-text-197k"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

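speech_waveform, sr = torchaudio.load(AUDIO_PATH)  # shape: (channels, samples)

# Whisper expects 16 kHz mono: downmix multi-channel audio, then resample if needed
speech_waveform = speech_waveform.mean(dim=0)
if sr != 16000:
    speech_waveform = torchaudio.functional.resample(
        speech_waveform,
        orig_freq=sr,
        new_freq=16000,
    )

# Pass the sampling rate explicitly so the pipeline does not have to assume it
result = pipe({"raw": speech_waveform.numpy(), "sampling_rate": 16000})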

print("Transcription:", result["text"])

Safetensors: model size 0.8B params, tensor type F32