Whisper Large V3 Turbo - Vietnamese Telephony

openai/whisper-large-v3-turbo fine-tuned on Vietnamese call-center telephony data.

Training Details

  • Base model: openai/whisper-large-v3-turbo (809M params)
  • Method: LoRA (r=32, alpha=64, 2.95% trainable params)
  • Dataset: 954 segments, 5.11 hours Vietnamese telephony audio
  • Audio: 8 kHz telephony recordings resampled to 16 kHz, with enhancement preprocessing
  • Epochs: 10
  • Learning rate: 1e-4 with cosine schedule
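
The LoRA hyperparameters above map directly onto a Hugging Face `peft` configuration. A minimal sketch; `target_modules` and `lora_dropout` are assumptions (the card does not list them), only `r` and `lora_alpha` come from the card:

```python
from peft import LoraConfig

# Matches the card: rank 32, alpha 64 (~2.95% trainable params).
# target_modules and lora_dropout are assumed, not stated on the card.
lora = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
```

Wrapping the base model with `get_peft_model(base_model, lora)` and calling `print_trainable_parameters()` reports the trainable-parameter fraction.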

Results

| Metric   | Value  |
|----------|--------|
| Test WER | 27.92% |
| Test CER | 19.46% |
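
WER and CER are Levenshtein edit distances computed over words and characters respectively, normalized by reference length. A self-contained sketch of how these metrics are computed:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance with a rolling 1-D DP row.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (free if equal)
            prev = cur
    return dp[-1]

def wer(ref, hyp):
    # word-level edits / reference word count
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    # character-level edits / reference character count
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

For example, `wer("a b c d", "a x c d")` is 0.25 (one substitution over four reference words).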

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

processor = WhisperProcessor.from_pretrained("jason03/whisper-large-v3-turbo-vietnamese-telephony")
model = WhisperForConditionalGeneration.from_pretrained("jason03/whisper-large-v3-turbo-vietnamese-telephony")

pipe = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer,
                feature_extractor=processor.feature_extractor, device="cuda")

result = pipe("audio.wav", generate_kwargs={"language": "vi", "task": "transcribe"})
print(result["text"])
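
The pipeline decodes and resamples automatically when given a file path, but if you pass a raw array you must supply 16 kHz audio yourself. A minimal upsampling sketch using scipy (an assumption; any resampler works):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio_8k: np.ndarray) -> np.ndarray:
    # Upsample 8 kHz telephony audio to the 16 kHz Whisper expects (2:1 polyphase).
    return resample_poly(audio_8k, up=2, down=1).astype(np.float32)
```

The resampled array can then be fed directly: `pipe({"raw": to_16k(samples), "sampling_rate": 16000})`.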

Or with faster-whisper (CTranslate2). First convert the checkpoint with CTranslate2's converter:

ct2-transformers-converter --model jason03/whisper-large-v3-turbo-vietnamese-telephony --output_dir ./model-ct2 --quantization float16

Then transcribe:

from faster_whisper import WhisperModel
model = WhisperModel("./model-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", language="vi")
for seg in segments:
    print(f"[{seg.start:.1f}s-{seg.end:.1f}s] {seg.text}")