wav2vec2-xls-r-1b-cantonese
facebook/wav2vec2-xls-r-1b fine-tuned for Cantonese (yue) automatic speech recognition on the Common Voice 17.0 dataset.
Evaluation Results
| Metric | Value |
|---|---|
| CER (no punctuation) | 20.57% |
| CER (raw) | 20.85% |
| Eval Loss | 0.0328 |
| Best Step | 76000 |
| Best Epoch | 13.07 |
Training History
| Step | Epoch | Eval Loss | CER (nopunct) | CER (raw) |
|---|---|---|---|---|
| 1000 | 0.01 | 6.2552 | 100.00% | 100.00% |
| 2000 | 0.02 | 5.7134 | 100.00% | 100.00% |
| 3000 | 0.04 | 3.6000 | 77.21% | 77.30% |
| 4000 | 0.05 | 2.1981 | 60.83% | 61.40% |
| 5000 | 0.06 | 1.5810 | 51.66% | 51.91% |
| 6000 | 1.01 | 1.2162 | 46.42% | 46.65% |
| 7000 | 1.02 | 0.9619 | 42.77% | 42.95% |
| 8000 | 1.03 | 0.8133 | 40.52% | 40.69% |
| 9000 | 1.04 | 0.7011 | 38.55% | 38.66% |
| 10000 | 1.06 | 0.6233 | 39.21% | 39.38% |
| 11000 | 2.00 | 0.5601 | 36.76% | 37.02% |
| 12000 | 2.01 | 0.5020 | 34.19% | 36.47% |
| 13000 | 2.03 | 0.4461 | 33.06% | 34.10% |
| 14000 | 2.04 | 0.4118 | 32.24% | 32.40% |
| 15000 | 2.05 | 0.3762 | 32.04% | 32.08% |
| 16000 | 2.06 | 0.3530 | 31.14% | 31.15% |
| 17000 | 3.01 | 0.3313 | 29.82% | 29.86% |
| 18000 | 3.02 | 0.2990 | 28.93% | 28.94% |
| 19000 | 3.03 | 0.2784 | 28.18% | 28.23% |
| 20000 | 3.05 | 0.2498 | 27.20% | 28.12% |
| 21000 | 3.06 | 0.2302 | 26.85% | 27.22% |
| 22000 | 4.00 | 0.2149 | 26.30% | 26.57% |
| 23000 | 4.02 | 0.1964 | 25.74% | 26.10% |
| 24000 | 4.03 | 0.1865 | 25.42% | 26.37% |
| 25000 | 4.04 | 0.1725 | 24.88% | 25.10% |
| 26000 | 4.05 | 0.1585 | 24.54% | 24.57% |
| 27000 | 4.06 | 0.1444 | 24.05% | 24.16% |
| 28000 | 5.01 | 0.1598 | 24.70% | 25.07% |
| 29000 | 5.02 | 0.1485 | 24.73% | 25.41% |
| 30000 | 5.03 | 0.1385 | 24.49% | 25.39% |
| 31000 | 5.05 | 0.1337 | 23.35% | 23.96% |
| 32000 | 5.06 | 0.1239 | 23.45% | 23.60% |
| 33000 | 6.00 | 0.1136 | 23.13% | 23.22% |
| 34000 | 6.02 | 0.1122 | 23.82% | 25.76% |
| 35000 | 6.03 | 0.1258 | 23.44% | 23.93% |
| 36000 | 6.04 | 0.1071 | 22.83% | 23.13% |
| 37000 | 6.05 | 0.1087 | 22.78% | 23.22% |
| 38000 | 6.07 | 0.0987 | 22.70% | 22.83% |
| 39000 | 7.01 | 0.0961 | 22.52% | 24.59% |
| 40000 | 7.02 | 0.0850 | 22.20% | 22.33% |
| 41000 | 7.04 | 0.0839 | 22.04% | 22.22% |
| 42000 | 7.05 | 0.0873 | 22.25% | 22.74% |
| 43000 | 7.06 | 0.0769 | 22.02% | 23.37% |
| 44000 | 8.01 | 0.0777 | 22.12% | 27.00% |
| 45000 | 8.02 | 0.0663 | 21.65% | 24.92% |
| 46000 | 8.03 | 0.0683 | 21.76% | 21.81% |
| 47000 | 8.04 | 0.0654 | 21.50% | 21.55% |
| 48000 | 8.06 | 0.0619 | 21.48% | 21.52% |
| 49000 | 9.00 | 0.0640 | 21.36% | 22.33% |
| 50000 | 9.01 | 0.0593 | 22.24% | 24.59% |
| 51000 | 9.03 | 0.0588 | 21.34% | 21.36% |
| 52000 | 9.04 | 0.0579 | 21.25% | 22.04% |
| 53000 | 9.05 | 0.0614 | 22.27% | 24.85% |
| 54000 | 9.06 | 0.0544 | 21.07% | 21.08% |
| 55000 | 10.01 | 0.0525 | 21.02% | 22.75% |
| 56000 | 10.02 | 0.0524 | 21.06% | 21.13% |
| 57000 | 10.03 | 0.0497 | 20.92% | 20.97% |
| 58000 | 10.04 | 0.0468 | 20.84% | 20.84% |
| 59000 | 10.06 | 0.0449 | 20.78% | 20.80% |
| 60000 | 11.00 | 0.0488 | 20.94% | 20.93% |
| 61000 | 11.01 | 0.0501 | 20.87% | 21.45% |
| 62000 | 11.03 | 0.0504 | 21.02% | 21.54% |
| 63000 | 11.04 | 0.0452 | 20.87% | 21.00% |
| 64000 | 11.05 | 0.0440 | 20.83% | 20.96% |
| 65000 | 11.06 | 0.0407 | 20.70% | 20.79% |
| 66000 | 12.01 | 0.0443 | 20.88% | 21.01% |
| 67000 | 12.02 | 0.0417 | 20.85% | 21.02% |
| 68000 | 12.03 | 0.0434 | 21.03% | 21.10% |
| 69000 | 12.05 | 0.0420 | 20.88% | 21.01% |
| 70000 | 12.06 | 0.0425 | 21.88% | 21.99% |
| 71000 | 13.00 | 0.0390 | 21.99% | 22.28% |
| 72000 | 13.02 | 0.0379 | 20.65% | 20.83% |
| 73000 | 13.03 | 0.0353 | 21.02% | 21.24% |
| 74000 | 13.04 | 0.0397 | 21.25% | 21.55% |
| 75000 | 13.05 | 0.0332 | 20.61% | 20.85% |
| 76000 | 13.07 | 0.0328 | 20.57% | 20.85% |
| 77000 | 14.01 | 0.0316 | 20.69% | 20.92% |
| 78000 | 14.02 | 0.0331 | 20.68% | 20.95% |
| 79000 | 14.04 | 0.0329 | 20.66% | 20.97% |
| 80000 | 14.05 | 0.0321 | 20.57% | 20.80% |
| 81000 | 14.06 | 0.0322 | 20.58% | 20.82% |
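The card does not say how the "Best Step" was chosen. As a rough sketch, the snippet below copies the last seven rows of the table above and picks a winner under three different criteria. The `best_by` helper and the dictionary keys are hypothetical names introduced for illustration. Within these rows, the reported best step (76000) matches selection by lowest no-punctuation CER with ties going to the earlier step, while eval loss and raw CER would each favor a different checkpoint.

```python
# Last seven evaluation rows, copied from the Training History table.
rows = [
    {"step": 75000, "eval_loss": 0.0332, "cer_nopunct": 20.61, "cer_raw": 20.85},
    {"step": 76000, "eval_loss": 0.0328, "cer_nopunct": 20.57, "cer_raw": 20.85},
    {"step": 77000, "eval_loss": 0.0316, "cer_nopunct": 20.69, "cer_raw": 20.92},
    {"step": 78000, "eval_loss": 0.0331, "cer_nopunct": 20.68, "cer_raw": 20.95},
    {"step": 79000, "eval_loss": 0.0329, "cer_nopunct": 20.66, "cer_raw": 20.97},
    {"step": 80000, "eval_loss": 0.0321, "cer_nopunct": 20.57, "cer_raw": 20.80},
    {"step": 81000, "eval_loss": 0.0322, "cer_nopunct": 20.58, "cer_raw": 20.82},
]


def best_by(history, key):
    """Return the row with the smallest value for `key` (hypothetical helper).

    Python's min() keeps the first minimal element, so ties resolve
    to the earlier step.
    """
    return min(history, key=lambda row: row[key])


for criterion in ("cer_nopunct", "eval_loss", "cer_raw"):
    winner = best_by(rows, criterion)
    print(f"lowest {criterion}: step {winner['step']}")
```

Because the three criteria disagree, it is worth checking which one matters for a given application before picking a checkpoint.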
Training Details
- Base model: facebook/wav2vec2-xls-r-1b
- Dataset: mozilla-foundation/common_voice_17_0 (yue)
- Language: Cantonese (yue)
- Task: Automatic Speech Recognition (ASR)
- Architecture: CTC (Connectionist Temporal Classification)
- Metric: Character Error Rate (CER)
- Total training steps: 81540
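For readers unfamiliar with the metric, here is a minimal sketch of how a character error rate can be computed, with and without punctuation. The card does not specify the normalization used for the "no punctuation" figure, so treating every Unicode punctuation character (category `P*`) as removable is an assumption, and the function names `strip_punctuation`, `edit_distance`, and `cer` are hypothetical.

```python
import unicodedata


def strip_punctuation(text):
    """Drop characters whose Unicode category starts with 'P' (assumed normalization)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def edit_distance(ref, hyp):
    """Character-level Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses, remove_punct=False):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        if remove_punct:
            ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    return total_edits / total_chars


refs = ["你食咗飯未?"]  # reference transcript with a full-width question mark
hyps = ["你食咗飯未"]    # model output without punctuation
print(cer(refs, hyps))                     # raw: the missing "?" counts as one error
print(cer(refs, hyps, remove_punct=True))  # no punctuation: the two strings match
```

In this toy case the raw CER is 1/6 and the no-punctuation CER is 0, which illustrates why the raw figure in the tables is never far below the no-punctuation one: a CTC model that rarely emits punctuation is penalized for every punctuation mark in the reference.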
Training Metrics
TensorBoard logs are included in the runs/ directory of this repository.
# Clone and view locally
git clone https://huggingface.co/awong-dev/wav2vec2-xls-r-1b-cantonese
tensorboard --logdir wav2vec2-xls-r-1b-cantonese/runs
Usage
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torchaudio
import torch
processor = Wav2Vec2Processor.from_pretrained("awong-dev/wav2vec2-xls-r-1b-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("awong-dev/wav2vec2-xls-r-1b-cantonese")
# Load audio; torchaudio returns a (channels, samples) tensor
audio, sr = torchaudio.load("audio.mp3")
# Downmix to mono so a stereo file is not treated as two separate inputs
audio = audio.mean(dim=0, keepdim=True)
# The model expects 16 kHz input
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)
inputs = processor(audio.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
# Greedy CTC decoding: most likely token per frame, then collapse via batch_decode
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
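The `argmax` plus `batch_decode` step above is greedy CTC decoding. As a rough illustration of what the decode step does to the per-frame ids, the sketch below collapses consecutive repeats and then drops the blank token. The toy vocabulary, the `ctc_greedy_collapse` name, and `blank_id=0` are all assumptions made for the example; in the real model the blank is the tokenizer's padding token, whose id depends on the vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse a per-frame id sequence the way greedy CTC decoding does.

    A token is kept only if it differs from the previous frame and is not
    the blank. A blank between two identical tokens therefore preserves
    both of them.
    """
    collapsed = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            collapsed.append(idx)
        previous = idx
    return collapsed


# Hypothetical three-entry vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "你", 2: "好"}

# Eight frames: blank, 你, 你, blank, 你, 好, 好, blank
frames = [0, 1, 1, 0, 1, 2, 2, 0]
token_ids = ctc_greedy_collapse(frames)
text = "".join(vocab[i] for i in token_ids)
print(text)  # 你你好
```

The blank at the fourth frame is what lets the decoder output the same character twice in a row, which matters for Cantonese text with reduplicated characters.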
Evaluation results
- CER (no punctuation) on the Common Voice (Cantonese) test set, self-reported: 0.206
- CER (raw) on the Common Voice (Cantonese) test set, self-reported: 0.208