---
language:
- ro
library_name: pytorch
pipeline_tag: video-classification
tags:
- visual-speech-recognition
- lip-reading
- word-classification
- romanian
- lrro
metrics:
- accuracy
---
# Word Classification MLPs on LRRo
This repository hosts the MLP classifier checkpoints used in the isolated word recognition ablation from the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
To assess the representational quality of our trained VSR encoder independently of the autoregressive decoder, we replaced the decoder with lightweight Multi-Layer Perceptron (MLP) classification heads and fine-tuned them for **isolated word classification** on the **LRRo** dataset. Strong word-level accuracy with a non-recurrent head indicates that the spatio-temporal features produced by the VSR frontend are highly discriminative on their own.
For training code, preprocessing pipelines, and evaluation scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
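
The setup described above, a lightweight MLP head classifying words from encoder features, can be sketched as follows. This is an illustrative sketch only, not the released architecture: the class name `MLPWordClassifier`, the feature, hidden, and output sizes, and the use of mean pooling over time are all assumptions made for the example, and a random tensor stands in for the VSR encoder output.

```python
import torch
from torch import nn


class MLPWordClassifier(nn.Module):
    """Hypothetical MLP head on top of frame-level encoder features."""

    def __init__(self, feature_dim: int, hidden_dim: int, num_words: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_words),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim), as an encoder might produce.
        # Assumption: mean pooling over time reduces each clip to one vector.
        pooled = features.mean(dim=1)
        return self.net(pooled)  # (batch, num_words) logits


# Placeholder sizes chosen for the example, not taken from the paper.
head = MLPWordClassifier(feature_dim=512, hidden_dim=256, num_words=20)
dummy_features = torch.randn(4, 29, 512)  # stand-in for encoder output
logits = head(dummy_features)
print(tuple(logits.shape))
```

Because the head has no recurrence or attention, any word-level accuracy it reaches has to come from the features it receives, which is the point of the ablation.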
## Results
We trained four MLP variants that differ only in the visual preprocessing applied before the encoder. The table reports Top-1 and Top-5 word classification accuracy (%) on the LRRo `Lab` (controlled studio recordings) and `Wild` (in-the-wild) test sets; higher is better.
| Variant | Crop size | Region of interest | Lab Acc@1 | Lab Acc@5 | Wild Acc@1 | Wild Acc@5 |
|:---|:---:|:---|:---:|:---:|:---:|:---:|
| MLP v1 | 96 × 96 | Full-face resize | 90.6 | 98.5 | 64.5 | 87.6 |
| MLP v2 | 64 × 64 | Center-Middle | 91.4 | 99.0 | 68.6 | 89.3 |
| MLP v3 | 64 × 64 | Center-Bottom | **95.0** | **99.4** | **72.7** | **92.6** |
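
Restricting the visual input to the lower half of the face (Center-Bottom crops) consistently outperforms full-face resizing, with the 64 × 64 crop (MLP v3) yielding the largest improvement on both Lab and Wild data: relative to MLP v1, Acc@1 rises by 4.4 points on Lab (95.0 vs. 90.6) and by 8.2 points on Wild (72.7 vs. 64.5).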
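
For readers unfamiliar with the Acc@1 and Acc@5 columns, the sketch below shows one common way to compute top-k accuracy from per-class scores. It is not the evaluation script from the repository; the function name `top_k_accuracy` and the toy scores are made up for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Percentage of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda i: sample_scores[i],
            reverse=True,
        )
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy scores for three clips over four hypothetical word classes.
scores = [
    [0.05, 0.60, 0.25, 0.10],  # true class 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],  # true class 1 is ranked second
    [0.50, 0.05, 0.15, 0.30],  # true class 1 is ranked last
]
labels = [1, 1, 1]

print(f"Acc@1: {top_k_accuracy(scores, labels, 1):.1f}")  # Acc@1: 33.3
print(f"Acc@2: {top_k_accuracy(scores, labels, 2):.1f}")  # Acc@2: 66.7
```

Acc@5 in the table corresponds to `k=5`: a clip counts as correct if the true word appears anywhere among the five highest-scoring classes.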
## Citation
If you use these models, please cite:
```bibtex
@inproceedings{vsro200,
title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
author = {...},
year = {2026}
}
```
```bibtex
@inproceedings{jitaru2020lrro,
author = {Jitaru, A. C. and Abdulamit, Ș. and Ionescu, B.},
title = {LRRo: A Lip Reading Data Set for the Under-resourced Romanian Language},
booktitle = {Proceedings of the ACM Multimedia Systems Conference (MMSys)},
year = {2020},
month = {June},
address = {Istanbul, Turkey}
}
```