wav2vec2-xls-r-1b-cantonese

Fine-tuned facebook/wav2vec2-xls-r-1b for Cantonese (yue) speech recognition on Common Voice.

Evaluation Results

| Metric | Value |
|---|---|
| CER (no punctuation) | 20.57% |
| CER (raw) | 20.85% |
| Eval Loss | 0.0328 |
| Best Step | 76000 |
| Best Epoch | 13.07 |
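
CER is the character-level edit distance between the predicted and reference transcripts, divided by the reference length. The sketch below shows one way the two variants reported above could be computed. The exact text normalization used for this model's evaluation is not stated here, so `strip_punctuation` (which drops every Unicode punctuation character and all whitespace) is an assumption for illustration, not the author's evaluation script.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def strip_punctuation(text: str) -> str:
    """Hypothetical normalizer: drop Unicode punctuation and whitespace."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def cer(ref: str, hyp: str, ignore_punctuation: bool = False) -> float:
    """Character error rate of hyp against ref, as a fraction."""
    if ignore_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


reference = "你食咗飯未呀？"   # made-up example sentence
hypothesis = "你食咗飯未呀"    # model output without the question mark
print(f"CER (raw): {cer(reference, hypothesis):.2%}")
print(f"CER (no punctuation): {cer(reference, hypothesis, ignore_punctuation=True):.2%}")
```

In this toy example the only error is the missing question mark, so the raw CER is 1/7 while the no-punctuation CER is 0. That is the kind of gap the two columns are meant to separate.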

Training History

Step Epoch Eval Loss CER (nopunct) CER (raw)
1000 0.01 6.2552 100.00% 100.00%
2000 0.02 5.7134 100.00% 100.00%
3000 0.04 3.6000 77.21% 77.30%
4000 0.05 2.1981 60.83% 61.40%
5000 0.06 1.5810 51.66% 51.91%
6000 1.01 1.2162 46.42% 46.65%
7000 1.02 0.9619 42.77% 42.95%
8000 1.03 0.8133 40.52% 40.69%
9000 1.04 0.7011 38.55% 38.66%
10000 1.06 0.6233 39.21% 39.38%
11000 2.00 0.5601 36.76% 37.02%
12000 2.01 0.5020 34.19% 36.47%
13000 2.03 0.4461 33.06% 34.10%
14000 2.04 0.4118 32.24% 32.40%
15000 2.05 0.3762 32.04% 32.08%
16000 2.06 0.3530 31.14% 31.15%
17000 3.01 0.3313 29.82% 29.86%
18000 3.02 0.2990 28.93% 28.94%
19000 3.03 0.2784 28.18% 28.23%
20000 3.05 0.2498 27.20% 28.12%
21000 3.06 0.2302 26.85% 27.22%
22000 4.00 0.2149 26.30% 26.57%
23000 4.02 0.1964 25.74% 26.10%
24000 4.03 0.1865 25.42% 26.37%
25000 4.04 0.1725 24.88% 25.10%
26000 4.05 0.1585 24.54% 24.57%
27000 4.06 0.1444 24.05% 24.16%
28000 5.01 0.1598 24.70% 25.07%
29000 5.02 0.1485 24.73% 25.41%
30000 5.03 0.1385 24.49% 25.39%
31000 5.05 0.1337 23.35% 23.96%
32000 5.06 0.1239 23.45% 23.60%
33000 6.00 0.1136 23.13% 23.22%
34000 6.02 0.1122 23.82% 25.76%
35000 6.03 0.1258 23.44% 23.93%
36000 6.04 0.1071 22.83% 23.13%
37000 6.05 0.1087 22.78% 23.22%
38000 6.07 0.0987 22.70% 22.83%
39000 7.01 0.0961 22.52% 24.59%
40000 7.02 0.0850 22.20% 22.33%
41000 7.04 0.0839 22.04% 22.22%
42000 7.05 0.0873 22.25% 22.74%
43000 7.06 0.0769 22.02% 23.37%
44000 8.01 0.0777 22.12% 27.00%
45000 8.02 0.0663 21.65% 24.92%
46000 8.03 0.0683 21.76% 21.81%
47000 8.04 0.0654 21.50% 21.55%
48000 8.06 0.0619 21.48% 21.52%
49000 9.00 0.0640 21.36% 22.33%
50000 9.01 0.0593 22.24% 24.59%
51000 9.03 0.0588 21.34% 21.36%
52000 9.04 0.0579 21.25% 22.04%
53000 9.05 0.0614 22.27% 24.85%
54000 9.06 0.0544 21.07% 21.08%
55000 10.01 0.0525 21.02% 22.75%
56000 10.02 0.0524 21.06% 21.13%
57000 10.03 0.0497 20.92% 20.97%
58000 10.04 0.0468 20.84% 20.84%
59000 10.06 0.0449 20.78% 20.80%
60000 11.00 0.0488 20.94% 20.93%
61000 11.01 0.0501 20.87% 21.45%
62000 11.03 0.0504 21.02% 21.54%
63000 11.04 0.0452 20.87% 21.00%
64000 11.05 0.0440 20.83% 20.96%
65000 11.06 0.0407 20.70% 20.79%
66000 12.01 0.0443 20.88% 21.01%
67000 12.02 0.0417 20.85% 21.02%
68000 12.03 0.0434 21.03% 21.10%
69000 12.05 0.0420 20.88% 21.01%
70000 12.06 0.0425 21.88% 21.99%
71000 13.00 0.0390 21.99% 22.28%
72000 13.02 0.0379 20.65% 20.83%
73000 13.03 0.0353 21.02% 21.24%
74000 13.04 0.0397 21.25% 21.55%
75000 13.05 0.0332 20.61% 20.85%
76000 13.07 0.0328 20.57% 20.85%
77000 14.01 0.0316 20.69% 20.92%
78000 14.02 0.0331 20.68% 20.95%
79000 14.04 0.0329 20.66% 20.97%
80000 14.05 0.0321 20.57% 20.80%
81000 14.06 0.0322 20.58% 20.82%
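
The rule used to pick the best checkpoint is not stated. As a sketch, assuming the checkpoint with the lowest CER (nopunct) is chosen and ties go to the earlier step, the last rows of the table give step 76000, which matches the reported Best Step. The rows below are copied from the table; the selection rule itself is an assumption.

```python
# (step, epoch, eval_loss, cer_nopunct_pct, cer_raw_pct), copied from the table
history = [
    (75000, 13.05, 0.0332, 20.61, 20.85),
    (76000, 13.07, 0.0328, 20.57, 20.85),
    (77000, 14.01, 0.0316, 20.69, 20.92),
    (78000, 14.02, 0.0331, 20.68, 20.95),
    (79000, 14.04, 0.0329, 20.66, 20.97),
    (80000, 14.05, 0.0321, 20.57, 20.80),
    (81000, 14.06, 0.0322, 20.58, 20.82),
]

# Assumed rule: lowest CER (nopunct). min() returns the first minimal
# element, so the tie between steps 76000 and 80000 goes to the earlier one.
best = min(history, key=lambda row: row[3])
print(f"best step {best[0]} (epoch {best[1]}), CER (nopunct) {best[3]:.2f}%")
```

Note that step 80000 ties on CER (nopunct) and has a slightly lower raw CER and eval loss, so a different tie-breaking rule would select it instead.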

Training Details

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/wav2vec2-xls-r-1b-cantonese
tensorboard --logdir wav2vec2-xls-r-1b-cantonese/runs

Usage

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "awong-dev/wav2vec2-xls-r-1b-cantonese"
TARGET_SR = 16000  # the model expects 16 kHz audio

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load audio; torchaudio returns a (channels, samples) tensor
audio, sr = torchaudio.load("audio.mp3")
# Downmix to mono so that stereo files do not yield a 2-D input
audio = audio.mean(dim=0)
if sr != TARGET_SR:
    audio = torchaudio.transforms.Resample(sr, TARGET_SR)(audio)

inputs = processor(audio.numpy(), sampling_rate=TARGET_SR, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: take the most likely token per frame, then decode to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
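
The `argmax` followed by `batch_decode` in the snippet above amounts to greedy CTC decoding: pick the most likely token for each audio frame, merge consecutive repeats, then drop the blank token. The toy sketch below illustrates only that collapse step. The vocabulary and the blank id of 0 are made up for the example and are not this model's tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank token.
BLANK_ID = 0
VOCAB = {1: "你", 2: "好", 3: "嗎"}


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive repeated ids, then remove blanks."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# Per-frame argmax ids for one utterance (made-up values).
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0, 2, 0]
token_ids = ctc_greedy_collapse(frames)
print("".join(VOCAB[t] for t in token_ids))
```

The blank between the two runs of id 2 is what lets the same character appear twice in a row in the output; without it the runs would merge into a single token.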