vsro200 committed
Commit cc890b2 · verified · 1 Parent(s): 2062438

Upload README.md

Files changed (1)
  1. README.md +68 -0
README.md ADDED
@@ -0,0 +1,68 @@
---
language:
- ro
license: cc-by-nc-4.0
library_name: pytorch
pipeline_tag: video-classification
tags:
- visual-speech-recognition
- lip-reading
- word-classification
- romanian
- lrro
metrics:
- accuracy
---

# Word Classification MLPs on LRRo

This repository hosts the MLP classifier checkpoints used in the isolated word recognition ablation from the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

To assess the representational quality of our trained visual speech recognition (VSR) encoder independently of the autoregressive decoder, we replaced the decoder with lightweight Multi-Layer Perceptron (MLP) classification heads and fine-tuned them for **isolated word classification** on the **LRRo** dataset. Strong word-level accuracy with a non-recurrent head indicates that the spatio-temporal features produced by the VSR frontend are discriminative on their own, without help from the decoder.
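
The head-swapping setup described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released architecture: the class name `MLPWordHead`, the masked mean pooling over time, the single hidden layer, and all sizes (`feat_dim=512`, `hidden_dim=256`, `num_words=50`) are assumptions made for the example. The actual head design is defined in the GitHub repository.

```python
import torch
from torch import nn


class MLPWordHead(nn.Module):
    """Hypothetical MLP head: mean-pool encoder features over time, then classify."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_words: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_words),
        )

    def forward(self, feats: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) encoder outputs, padded along time.
        # lengths: (batch,) number of valid frames in each clip.
        steps = torch.arange(feats.size(1), device=feats.device)
        mask = (steps[None, :] < lengths[:, None]).unsqueeze(-1).to(feats.dtype)
        # Masked mean over time (an assumed pooling choice): padded frames contribute nothing.
        denom = lengths.clamp(min=1).unsqueeze(-1).to(feats.dtype)
        pooled = (feats * mask).sum(dim=1) / denom
        return self.net(pooled)  # (batch, num_words) logits


# Illustrative sizes only; the real values are not stated in this README.
head = MLPWordHead(feat_dim=512, hidden_dim=256, num_words=50)
feats = torch.randn(4, 29, 512)           # stand-in for encoder outputs
lengths = torch.tensor([29, 20, 25, 12])  # valid frames per clip
logits = head(feats, lengths)
```

Because such a head only sees a pooled feature vector, whatever accuracy it reaches has to come from information already present in the encoder output, which is the point of the ablation.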

For training code, preprocessing pipelines, and evaluation scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Configurations

We trained three MLP variants that differ only in the visual preprocessing applied before the encoder:

| Variant | Crop size | Region of interest |
|:---|:---:|:---|
| MLP v1 | 96 × 96 | Full-face resize |
| MLP v2 | 64 × 64 | Center-Middle |
| MLP v3 | 64 × 64 | Center-Bottom |
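
As a rough illustration of how the two 64 × 64 regions might be located, the sketch below computes crop boxes inside a source frame. The 96 × 96 frame size, the function name `crop_box`, and the reading of Center-Middle as a fully centred crop and Center-Bottom as a horizontally centred crop flush with the bottom edge are all assumptions. The exact preprocessing is defined in the GitHub repository.

```python
def crop_box(frame_h, frame_w, crop, region):
    """Return (top, left, bottom, right) of a square, horizontally centred crop.

    Hypothetical helper: the region names mirror the table above, but their
    geometric meaning here is an assumption, not the authors' definition.
    """
    if crop > frame_h or crop > frame_w:
        raise ValueError("crop must fit inside the frame")
    left = (frame_w - crop) // 2
    if region == "center-middle":
        top = (frame_h - crop) // 2   # vertically centred
    elif region == "center-bottom":
        top = frame_h - crop          # flush with the bottom edge
    else:
        raise ValueError(f"unknown region: {region}")
    return top, left, top + crop, left + crop


# Assumed 96 x 96 source frame.
print(crop_box(96, 96, 64, "center-middle"))  # (16, 16, 80, 80)
print(crop_box(96, 96, 64, "center-bottom"))  # (32, 16, 96, 80)
```

Under these assumptions the Center-Bottom box covers the lower two thirds of the frame, where the mouth region would normally sit in a face-aligned crop.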

## Results

Top-1 and Top-5 word classification accuracy (%) on the LRRo `Lab` (controlled studio recordings) and `Wild` (in-the-wild) test sets. Higher is better.

| Variant | Lab Acc@1 | Lab Acc@5 | Wild Acc@1 | Wild Acc@5 |
|:---|:---:|:---:|:---:|:---:|
| MLP v1 | 90.6 | 98.5 | 64.5 | 87.6 |
| MLP v2 | 91.4 | 99.0 | 68.6 | 89.3 |
| MLP v3 | **95.0** | **99.4** | **72.7** | **92.6** |
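
For reference, Acc@k counts a prediction as correct when the true word is among the k highest-scoring classes. The sketch below shows that computation on toy scores. The function name `topk_accuracy` and the data are illustrative and are not taken from the evaluation scripts.

```python
def topk_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and of equal length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: 4 clips, 3 word classes.
scores = [
    [0.1, 0.7, 0.2],  # true class 1 is ranked first
    [0.5, 0.3, 0.2],  # true class 1 is ranked second
    [0.2, 0.3, 0.5],  # true class 0 is ranked third
    [0.6, 0.1, 0.3],  # true class 0 is ranked first
]
labels = [1, 1, 0, 0]

acc1 = topk_accuracy(scores, labels, 1)
acc2 = topk_accuracy(scores, labels, 2)
print(acc1, acc2)  # 50.0 75.0
```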
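
Restricting the visual input to the lower half of the face (Center-Bottom crops, MLP v3) gives the best result on every metric. Relative to full-face resizing (MLP v1), Acc@1 rises by 4.4 points on Lab (90.6 → 95.0) and by 8.2 points on Wild (64.5 → 72.7). The Center-Middle crop (MLP v2) also outperforms full-face resizing on all four metrics, but by a smaller margin.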

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
author = {...},
year = {2026}
}
```

```bibtex
@inproceedings{jitaru2020lrro,
author = {Jitaru, A. C. and Abdulamit, \c{S}. and Ionescu, B.},
title = {{LRRo}: A Lip Reading Data Set for the Under-resourced Romanian Language},
booktitle = {Proceedings of the ACM Multimedia Systems Conference (MMSys)},
year = {2020},
month = {June},
address = {Istanbul, Turkey}
}
```