vsro200 committed on
Commit 3b1b958 · verified · 1 Parent(s): 5c78322

Upload README.md

Files changed (1)
  1. README.md +65 -46
README.md CHANGED
@@ -1,69 +1,88 @@
1
  ---
2
  language:
3
  - ro
4
  ---
5
- # Romanian Visual Speech Recognition (VSR) Models
6
 
7
- This repository contains the model checkpoints for the paper **"VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness"**.
8
 
9
- These models are fine-tuned versions of MultiVSR, specifically trained for the Romanian language. We provide various checkpoints to demonstrate the impact of dataset size, annotation quality (human-annotated vs. automatically generated pseudo-labels), and gender distribution on VSR performance.
10
 
11
- The models were trained and evaluated on the **VSRo Dataset**, a 200-hour corpus of Romanian podcasts.
12
 
13
- ## 📂 Repository Structure
14
- All model checkpoints are stored in the `checkpoints/` directory. The naming convention follows the pattern: `model_[hours]_[type].pt`.
15
- * `_annot`: Models trained on human-annotated data.
16
- * `_auto`: Models trained on automatically transcribed data (pseudo-labels).
17
- * `_shuffle`: Alternative data splits for the 100h models to test variance.
18
- * `_males` / `_females` / `_mix`: Models trained specifically on gender-segregated or mixed 40-hour annotated subsets to evaluate gender bias.
19
 
20
- ---
21
 
22
- ## 🏆 Performance (Word Error Rate - WER)
23
 
24
- Below are the primary results evaluated on the **Test Unseen** set. Lower WER indicates better performance.
25
 
26
- ### 1. Annotated vs. Auto Data Scaling
27
- Comparison of models trained on perfectly annotated data versus those trained on automatically generated labels across different dataset sizes.
28
 
29
- | Training Hours | Human Annotated (`_annot`) (%) | Auto Generated (`_auto`) (%) |
30
  |:---:|:---:|:---:|
31
- | 10h | 72.50 | 74.61 |
32
- | 25h | 64.86 | 66.27 |
33
- | 50h | 58.87 | 59.28 |
34
- | 75h | 54.86 | 56.25 |
35
- | 100h | 53.29 | 53.63 |
36
- | 125h | — | 51.71 |
37
- | 150h | — | 51.25 |
38
- | 175h | — | 49.84 |
39
- | 200h | — | 48.75 |
40
-
41
- ### 2. Gender Bias Analysis (40h Models)
42
- Evaluation demonstrating the impact of gender representation in the training set.
43
-
44
- | Training Subset | Global WER (%) | WER Males (%) | WER Females (%) |
45
  |:---|:---:|:---:|:---:|
46
- | 40h Males | 62.15 | 61.32 | 62.97 |
47
  | 40h Females | 59.33 | 59.17 | 59.49 |
48
- | 40h Mix | 59.52 | 59.19 | 59.85 |
49
 
50
- ### 3. Out of Distribution (OOD) Robustness
51
- Evaluated using the `model_200h_auto.pt` checkpoint on different video degradation and domain shift scenarios.
52
 
53
- | OOD Category | WER (%) |
54
  |:---|:---:|
55
- | Vlogs | 58.61 |
56
  | Specific domains | 63.01 |
57
- | Noisy | 68.96 |
58
- | Archival | 87.97 |
59
- | **Global OOD** | **68.46** |
60
 
61
- ### 4. Stability and Variance Analysis
62
- Due to high computational resource requirements, comprehensive multiple-run variance testing was isolated to the 100-hour models. The models were trained across 3 different random data shuffles to observe stability and the true impact of human annotations versus auto-generated labels.
63
 
64
- | Data Type (100h) | Mean WER (%) | Standard Deviation (σ) (%) |
65
- |:---|:---:|:---:|
66
- | Human Annotated | 53.21 | ± 0.37 |
67
- | Auto Generated | 53.82 | ± 0.17 |
68
 
69
- ---
 
 
 
 
 
 
 
1
  ---
2
  language:
3
  - ro
4
+ license: cc-by-nc-4.0
5
+ library_name: pytorch
6
+ pipeline_tag: video-text-to-text
7
+ tags:
8
+ - visual-speech-recognition
9
+ - lip-reading
10
+ - vsr
11
+ - romanian
12
+ - speech-recognition
13
+ - audio-visual
14
+ datasets:
15
+ - vsro200/vsro200
16
+ metrics:
17
+ - wer
18
  ---
 
19
 
20
+ # VSRo-200: Romanian Visual Speech Recognition Models
21
 
22
+ This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
23
 
24
+ The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
25
 
26
+ ## Checkpoints
27
 
28
+ All checkpoints follow the naming pattern `model_[hours]_[type].pt`:
29
+
30
+ - `_annot` — trained on human-annotated transcriptions
31
+ - `_auto` — trained on automatically generated pseudo-labels
32
+ - `_shuffle` — alternative data splits used for variance analysis (100h models)
33
+ - `_males` / `_females` / `_mix` — gender-controlled 40h subsets used for bias analysis
34
 
35
+ ## Results
36
 
37
+ All results are reported in Word Error Rate (WER, %) on the **Test Unseen** split. Lower is better.
38
 
39
+ ### Annotated vs. auto-labeled data scaling
 
40
 
41
+ | Hours | Annotated | Auto |
42
  |:---:|:---:|:---:|
43
+ | 10 | 72.50 | 74.61 |
44
+ | 25 | 64.86 | 66.27 |
45
+ | 50 | 58.87 | 59.28 |
46
+ | 75 | 54.86 | 56.25 |
47
+ | 100 | 53.29 | 53.63 |
48
+ | 125 || 51.71 |
49
+ | 150 || 51.25 |
50
+ | 175 || 49.84 |
51
+ | 200 || 48.75 |
52
+
53
+ ### Variance analysis (100h models, 3 random shuffles)
54
+
55
+ | Data type | Mean WER | Std. dev. |
56
+ |:---|:---:|:---:|
57
+ | Human annotated | 53.21 | ± 0.37 |
58
+ | Auto generated | 53.82 | ± 0.17 |
59
+
60
+ ### Gender bias analysis (40h models)
61
+
62
+ | Training subset | Global | Males | Females |
63
  |:---|:---:|:---:|:---:|
64
+ | 40h Males | 62.15 | 61.32 | 62.97 |
65
  | 40h Females | 59.33 | 59.17 | 59.49 |
66
+ | 40h Mix | 59.52 | 59.19 | 59.85 |
67
 
68
+ ### Out-of-distribution robustness (`model_200h_auto.pt`)
 
69
 
70
+ | OOD category | WER |
71
  |:---|:---:|
72
+ | Vlogs | 58.61 |
73
  | Specific domains | 63.01 |
74
+ | Noisy | 68.96 |
75
+ | Archival | 87.97 |
76
+ | Global OOD | 68.46 |
77
 
78
+ ## Citation
 
79
 
80
+ If you use these models, please cite:
 
 
 
81
 
82
+ ```bibtex
83
+ @inproceedings{vsro200,
84
+ title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
85
+ author = {...},
86
+ year = {...}
87
+ }
88
+ ```
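
Note on the naming pattern `model_[hours]_[type].pt` described in the README: a small helper can split a checkpoint filename into its training-hours and type parts. The following is a minimal sketch, not code from the repository. It assumes the hours token is written as `<N>h`, as in `model_200h_auto.pt`. The function name `parse_checkpoint_name`, the `CheckpointInfo` type, and the filename `model_40h_males.pt` are hypothetical and used for illustration only.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the hours token has the form "<N>h", as in `model_200h_auto.pt`.
# The type part is kept as a free-form suffix (e.g. "annot", "auto", "males").
_CHECKPOINT_RE = re.compile(r"^model_(?P<hours>\d+)h_(?P<kind>[A-Za-z0-9_]+)\.pt$")


class CheckpointInfo(NamedTuple):
    """Hypothetical container for the parts of a checkpoint filename."""
    hours: int
    kind: str


def parse_checkpoint_name(filename: str) -> Optional[CheckpointInfo]:
    """Return (hours, kind) for a matching filename, or None otherwise."""
    match = _CHECKPOINT_RE.match(filename)
    if match is None:
        return None
    return CheckpointInfo(hours=int(match.group("hours")), kind=match.group("kind"))


if __name__ == "__main__":
    # `model_40h_males.pt` is an illustrative name, not confirmed by the README.
    for name in ["model_200h_auto.pt", "model_40h_males.pt", "notes.txt"]:
        print(name, "->", parse_checkpoint_name(name))
```

Names that do not match the assumed pattern return `None`, so the helper can be used to filter a directory listing before grouping checkpoints by hours or type.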