---
language:
- ro
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
- visual-speech-recognition
- lip-reading
- vsr
- romanian
- speech-recognition
- audio-visual
datasets:
- vsro200/vsro200
metrics:
- wer
---

# VSRo-200: Romanian Visual Speech Recognition Models

This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Checkpoints

All checkpoints follow the naming pattern `model_[hours]_[type].pt`:

- `_annot` — trained on human-annotated transcriptions
- `_auto` — trained on automatically generated pseudo-labels
- `_shuffle` — alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix` — gender-controlled 40h subsets used for bias analysis
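
The naming pattern above can be parsed mechanically when iterating over checkpoints. A minimal sketch; the helper `parse_checkpoint_name` and the example filenames are illustrative, not part of the release:

```python
import re

# Matches the stated pattern model_[hours]_[type].pt,
# e.g. "model_100_annot.pt" -> (100, "annot").
_CKPT_RE = re.compile(r"^model_(\d+)_([a-z_]+)\.pt$")

def parse_checkpoint_name(filename: str) -> tuple[int, str]:
    """Return (training hours, checkpoint type) for a checkpoint filename."""
    m = _CKPT_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized checkpoint name: {filename}")
    return int(m.group(1)), m.group(2)
```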

## Results

All results are reported in Word Error Rate (WER, %) and Character Error Rate (CER, %) on the **Test Unseen** and **Test Seen** splits. Lower is better.
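Both metrics are edit-distance rates against the reference transcription: WER at the word level, CER at the character level. A minimal pure-Python sketch of how such numbers are computed (not the paper's evaluation code, which may differ in tokenization and normalization):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = dp[0]
        dp[0] = i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution or match
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```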

### Human-Annotated Data

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
| 100h | **53.29** | **29.94** | **48.16** | **26.53** |

### Whisper Pseudo-Labels

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
| 200h | **48.75** | **27.05** | **44.54** | **24.51** |

A variance analysis across three random shuffles of the 100h subsets yields a mean WER of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.



### Out-of-distribution robustness

*   **Test Seen / Unseen (In-Domain):** In-domain baseline performance of the 200h model on podcast data.
*   **Vlogs:** Unconstrained videos with varying camera angles, dynamic lighting, and camera or speaker movement.
*   **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific). 
*   **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
*   **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and lack of color information.
*   **Global OOD:** The aggregated metrics across all out-of-distribution subsets.

| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
|:---|:---:|:---:|:---:|:---:|:---:|
| **Test Seen** | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
| **Test Unseen** | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
| **OOD: Vlogs** | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
| **OOD: Specific domains** | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
| **OOD: Noisy** | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
| **OOD: Archival** | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
| **Global OOD** | 375 | 68.46 | 35.99 | 5.08 | 14.75 |

#### Metrics Note
*   **Duration:** Each OOD category consists of 15 minutes of video content.
*   **OOV Token (%):** The percentage of *total words* in the evaluation set that do not appear in the training data. Measures how often unknown words occur.
*   **OOV Type (%):** The percentage of *unique words* in the evaluation set that do not appear in the training data. Measures the diversity of unknown words.
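
The two OOV rates above can be computed directly from the training and evaluation transcriptions. A minimal sketch assuming whitespace-tokenized, already-normalized text (the helper `oov_rates` is illustrative):

```python
def oov_rates(train_text: str, eval_text: str) -> tuple[float, float]:
    """Return (OOV token %, OOV type %) of eval_text w.r.t. train_text.

    OOV token %: share of eval word occurrences unseen in training.
    OOV type %:  share of unique eval words unseen in training.
    """
    train_vocab = set(train_text.split())
    eval_words = eval_text.split()
    oov_token = 100 * sum(w not in train_vocab for w in eval_words) / len(eval_words)
    eval_types = set(eval_words)
    oov_type = 100 * len(eval_types - train_vocab) / len(eval_types)
    return oov_token, oov_type
```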

### Gender bias analysis (40h models)


To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets. 

#### Test Unseen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 62.15 | 35.23 | 61.32 | 34.51 | 62.97 | 35.95 |
| Females Only | **59.33** | **33.44** | **59.17** | **32.87** | **59.49** | **34.02** |
| Mixed Data | 59.52 | 33.74 | 59.19 | 33.26 | 59.85 | 34.22 |

#### Test Seen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 58.82 | 33.11 | **58.58** | **32.59** | 59.06 | 33.63 |
| Females Only | 59.10 | 33.30 | 67.26 | 38.67 | **51.20** | **27.99** |
| Mixed Data | **56.29** | **31.22** | 60.56 | 33.54 | 52.15 | 28.93 |



## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
```