---
language:
- ro
library_name: pytorch
pipeline_tag: video-classification
tags:
- visual-speech-recognition
- lip-reading
- word-classification
- romanian
- lrro
metrics:
- accuracy
---

# Word Classification MLPs on LRRo

This repository hosts the MLP classifier checkpoints used in the isolated word recognition ablation from the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

To assess the representational quality of our trained VSR encoder independently of the autoregressive decoder, we replaced the decoder with lightweight Multi-Layer Perceptron (MLP) classification heads and fine-tuned them for **isolated word classification** on the **LRRo** dataset. Strong word-level accuracy with a non-recurrent head indicates that the spatio-temporal features produced by the VSR frontend are highly discriminative on their own.
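As a rough illustration of what such a head can look like, the sketch below pools encoder features over time and feeds them to a two-layer MLP. This is not the released architecture: the class name `WordClassifierHead`, the average pooling step, the layer count, and all dimensions are assumptions made for the example. Refer to the GitHub repository for the actual definition.

```python
import torch
from torch import nn


class WordClassifierHead(nn.Module):
    """Hypothetical MLP head: pools encoder features over time, then classifies."""

    def __init__(self, feature_dim: int, hidden_dim: int, num_words: int) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_words),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim), as produced by a visual encoder.
        pooled = features.mean(dim=1)  # assumed temporal average pooling
        return self.mlp(pooled)  # (batch, num_words) logits


# Illustrative sizes only; the real dimensions are not stated in this card.
head = WordClassifierHead(feature_dim=256, hidden_dim=128, num_words=10)
logits = head(torch.randn(4, 25, 256))  # random stand-in for encoder output
```

Because the head has no recurrence or attention over time, any word-level accuracy it reaches has to come from the features the encoder already provides.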

For training code, preprocessing pipelines, and evaluation scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Results

The table below compares three MLP variants that differ only in the visual preprocessing applied before the encoder. It reports Top-1 and Top-5 word classification accuracy (%) on the LRRo `Lab` (controlled studio recordings) and `Wild` (in-the-wild) test sets; higher is better.

| Variant | Crop size | Region of interest | Lab Acc@1 | Lab Acc@5 | Wild Acc@1 | Wild Acc@5 |
|:---|:---:|:---|:---:|:---:|:---:|:---:|
| MLP v1 | 96 × 96 | Full-face resize | 90.6 | 98.5 | 64.5 | 87.6 |
| MLP v2 | 64 × 64 | Center-Middle | 91.4 | 99.0 | 68.6 | 89.3 |
| MLP v3 | 64 × 64 | Center-Bottom | **95.0** | **99.4** | **72.7** | **92.6** |

Restricting the visual input to the lower half of the face (Center-Bottom crop, MLP v3) outperforms both full-face resizing (MLP v1) and the Center-Middle crop (MLP v2) on every metric. The largest gain over full-face resizing is on Wild Acc@1 (72.7 vs. 64.5).
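For readers unfamiliar with the Acc@1 and Acc@5 columns, the following minimal sketch shows how Top-k accuracy is commonly computed from per-class scores. The function name `top_k_accuracy` and the toy scores are illustrative assumptions; this is not the evaluation script from the repository.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Return the percentage of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example: 3 samples, 4 word classes (made-up scores).
toy_scores = [
    [0.10, 0.70, 0.10, 0.10],  # true class 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],  # true class 1 is ranked second
    [0.50, 0.15, 0.05, 0.30],  # true class 2 is ranked last
]
toy_labels = [1, 1, 2]
acc1 = top_k_accuracy(toy_scores, toy_labels, k=1)
acc2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

In the toy data, one of three samples is correct at k=1 and two of three at k=2, which is why Acc@5 in the table is always at least as high as Acc@1.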

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {2026}
}
```

```bibtex
@inproceedings{jitaru2020lrro,
  author    = {Jitaru, A. C. and Abdulamit, Ș. and Ionescu, B.},
  title     = {LRRo: A Lip Reading Data Set for the Under-resourced Romanian Language},
  booktitle = {Proceedings of the ACM Multimedia Systems Conference (MMSys)},
  year      = {2020},
  month     = {June},
  address   = {Istanbul, Turkey}
}
```