---
language:
- ro
library_name: pytorch
pipeline_tag: video-classification
tags:
- visual-speech-recognition
- lip-reading
- word-classification
- romanian
- lrro
metrics:
- accuracy
---

# Word Classification MLPs on LRRo

This repository hosts the MLP classifier checkpoints used in the isolated word recognition ablation from the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

To assess the representational quality of our trained VSR encoder independently of the autoregressive decoder, we replaced the decoder with lightweight Multi-Layer Perceptron (MLP) classification heads and fine-tuned them for **isolated word classification** on the **LRRo** dataset. Strong word-level accuracy with a non-recurrent head indicates that the spatio-temporal features produced by the VSR frontend are highly discriminative on their own.

For training code, preprocessing pipelines, and evaluation scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Results

We trained four MLP variants that differ only in the visual preprocessing applied before the encoder.

Top-1 and Top-5 word classification accuracy (%) on the LRRo `Lab` (controlled studio recordings) and `Wild` (in-the-wild) test sets. Higher is better.

| Variant | Crop size | Region of interest | Lab Acc@1 | Lab Acc@5 | Wild Acc@1 | Wild Acc@5 |
|:---|:---:|:---|:---:|:---:|:---:|:---:|
| MLP v1 | 96 × 96 | Full-face resize | 90.6 | 98.5 | 64.5 | 87.6 |
| MLP v2 | 64 × 64 | Center-Middle | 91.4 | 99.0 | 68.6 | 89.3 |
| MLP v3 | 64 × 64 | Center-Bottom | **95.0** | **99.4** | **72.7** | **92.6** |

Restricting the visual input to the lower half of the face (Center-Bottom crops) consistently outperforms full-face resizing, with the 64 × 64 crop (MLP v3) yielding the largest improvement on both Lab and Wild data.
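As a reading aid for the Acc@1 and Acc@5 columns, the sketch below computes top-k accuracy under the usual definition: a sample counts as correct when its true word class is among the k highest-scoring classes. The function name `topk_accuracy` and the toy scores are illustrative assumptions and are not taken from the repository's evaluation scripts.

```python
from typing import Sequence


def topk_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Return the percentage of samples whose true label is in the top-k scores.

    Hypothetical helper for illustration; `scores[i][c]` is the classifier
    score for class `c` on sample `i`, and `labels[i]` is the true class index.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example with 4 samples and 3 classes (made-up numbers).
scores = [
    [0.1, 0.7, 0.2],  # true class 1: ranked first
    [0.5, 0.3, 0.2],  # true class 1: ranked second
    [0.2, 0.3, 0.5],  # true class 0: ranked third
    [0.6, 0.1, 0.3],  # true class 0: ranked first
]
labels = [1, 1, 0, 0]

acc_at_1 = topk_accuracy(scores, labels, k=1)
acc_at_2 = topk_accuracy(scores, labels, k=2)
print(f"Acc@1 = {acc_at_1:.1f}%, Acc@2 = {acc_at_2:.1f}%")
```

In a PyTorch evaluation loop the same quantity is typically obtained from the classifier logits with `torch.topk`; the plain-Python version here only makes the definition explicit.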
## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {2026}
}
```

```bibtex
@inproceedings{jitaru2020lrro,
  author    = {Jitaru, A. C. and Abdulamit, Ș. and Ionescu, B.},
  title     = {LRRo: A Lip Reading Data Set for the Under-resourced Romanian Language},
  booktitle = {Proceedings of the ACM Multimedia Systems Conference (MMSys)},
  year      = {2020},
  month     = {June},
  address   = {Istanbul, Turkey}
}
```