Whisper Large V3 Turbo - Vietnamese Telephony

openai/whisper-large-v3-turbo fine-tuned on Vietnamese call-center telephony data.

Training Details

  • Base model: openai/whisper-large-v3-turbo (809M params)
  • Method: LoRA (r=32, alpha=64, 2.95% trainable params)
  • Dataset: 954 segments, 5.11 hours Vietnamese telephony audio
  • Audio: 8 kHz telephony recordings resampled to 16 kHz, with enhancement preprocessing
  • Epochs: 10
  • Learning rate: 1e-4 with cosine schedule
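
The LoRA hyperparameters above map directly onto a Hugging Face `peft` configuration. A minimal sketch; `target_modules` and `lora_dropout` are assumptions (the card does not list them), only `r` and `lora_alpha` come from the card:

```python
from peft import LoraConfig

# Matches the card: rank 32, alpha 64 (~2.95% trainable params).
# target_modules and lora_dropout are assumed, not stated on the card.
lora = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
```

Wrapping the base model with `get_peft_model(base_model, lora)` and calling `print_trainable_parameters()` reports the trainable-parameter fraction.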

Results

| Metric   | Value  |
|----------|--------|
| Test WER | 27.92% |
| Test CER | 19.46% |
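
WER and CER are Levenshtein edit distances computed over words and characters respectively, normalized by reference length. A self-contained sketch of how these metrics are computed:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance with a rolling 1-D DP row.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (free if equal)
            prev = cur
    return dp[-1]

def wer(ref, hyp):
    # word-level edits / reference word count
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    # character-level edits / reference character count
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

For example, `wer("a b c d", "a x c d")` is 0.25 (one substitution over four reference words).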

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

processor = WhisperProcessor.from_pretrained("jason03/whisper-large-v3-turbo-vietnamese-telephony")
model = WhisperForConditionalGeneration.from_pretrained("jason03/whisper-large-v3-turbo-vietnamese-telephony")

pipe = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer,
                feature_extractor=processor.feature_extractor, device="cuda")

result = pipe("audio.wav", generate_kwargs={"language": "vi", "task": "transcribe"})
print(result["text"])
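
The pipeline decodes and resamples automatically when given a file path, but if you pass a raw array you must supply 16 kHz audio yourself. A minimal upsampling sketch using scipy (an assumption; any resampler works):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio_8k: np.ndarray) -> np.ndarray:
    # Upsample 8 kHz telephony audio to the 16 kHz Whisper expects (2:1 polyphase).
    return resample_poly(audio_8k, up=2, down=1).astype(np.float32)
```

The resampled array can then be fed directly: `pipe({"raw": to_16k(samples), "sampling_rate": 16000})`.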

Or with faster-whisper (CTranslate2). First convert the checkpoint with CTranslate2's converter:

ct2-transformers-converter --model jason03/whisper-large-v3-turbo-vietnamese-telephony --output_dir ./model-ct2 --quantization float16

Then transcribe:

from faster_whisper import WhisperModel
model = WhisperModel("./model-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", language="vi")
for seg in segments:
    print(f"[{seg.start:.1f}s-{seg.end:.1f}s] {seg.text}")