---
language:
- ro
library_name: pytorch
pipeline_tag: video-classification
tags:
- visual-speech-recognition
- lip-reading
- word-classification
- romanian
- lrro
metrics:
- accuracy
---

# Word Classification MLPs on LRRo

This repository hosts the MLP classifier checkpoints used in the isolated word recognition ablation from the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

To assess the representational quality of our trained VSR encoder independently of the autoregressive decoder, we replaced the decoder with lightweight Multi-Layer Perceptron (MLP) classification heads and fine-tuned them for **isolated word classification** on the **LRRo** dataset. Strong word-level accuracy with a non-recurrent head indicates that the spatio-temporal features produced by the VSR frontend are highly discriminative on their own.

For training code, preprocessing pipelines, and evaluation scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
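
As a rough illustration of the setup described above (encoder features feeding a non-recurrent classification head), the sketch below shows one way such a head could look in PyTorch. It is not the released architecture: the class name `WordMLPHead`, the masked mean pooling over time, the single hidden layer, and all sizes (`feature_dim=512`, `hidden_dim=256`, `num_words=50`) are assumptions made for the example.

```python
import torch
from torch import nn


class WordMLPHead(nn.Module):
    """Hypothetical MLP head: pools encoder frames over time, then classifies."""

    def __init__(self, feature_dim: int, hidden_dim: int, num_words: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_words),
        )

    def forward(self, features: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim) encoder outputs, padded in time.
        # lengths:  (batch,) number of valid frames per clip.
        time_steps = features.size(1)
        frame_ids = torch.arange(time_steps, device=features.device)
        mask = frame_ids[None, :] < lengths[:, None]          # (batch, time)
        mask = mask.unsqueeze(-1).to(features.dtype)          # (batch, time, 1)
        # Masked mean over time (an assumed pooling choice), ignoring padded frames.
        pooled = (features * mask).sum(dim=1) / lengths[:, None].to(features.dtype)
        return self.net(pooled)                               # (batch, num_words) logits


# Placeholder sizes; the real values depend on the encoder and the LRRo vocabulary.
head = WordMLPHead(feature_dim=512, hidden_dim=256, num_words=50)
dummy_features = torch.randn(4, 29, 512)
dummy_lengths = torch.tensor([29, 20, 25, 29])
logits = head(dummy_features, dummy_lengths)
```

Because the head has no recurrence or attention, any accuracy it reaches has to come from the encoder features themselves, which is the point of the ablation.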

## Results

The MLP variants below differ only in the visual preprocessing applied before the encoder. The table reports Top-1 and Top-5 word classification accuracy (%) on the LRRo `Lab` (controlled studio recordings) and `Wild` (in-the-wild) test sets; higher is better.
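
For readers unfamiliar with the metric, Acc@k counts a clip as correct when its true word is among the k highest-scoring classes. The snippet below is a minimal pure-Python sketch of that definition; the function name `topk_accuracy` and the toy scores are illustrative only and are not taken from the evaluation scripts.

```python
from typing import Sequence


def topk_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Return the percentage of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Rank class indices by descending score and keep the k best.
        ranked = sorted(range(len(sample_scores)), key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy scores for 3 clips over 4 word classes (made-up numbers, not LRRo results).
toy_scores = [
    [0.1, 0.7, 0.1, 0.1],  # top prediction: class 1
    [0.4, 0.3, 0.2, 0.1],  # top prediction: class 0, class 1 is second
    [0.1, 0.2, 0.3, 0.4],  # top prediction: class 3
]
toy_labels = [1, 1, 0]

acc_at_1 = topk_accuracy(toy_scores, toy_labels, k=1)  # one of three clips is correct
acc_at_2 = topk_accuracy(toy_scores, toy_labels, k=2)  # two of three clips are correct
```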

| Variant | Crop size | Region of interest | Lab Acc@1 | Lab Acc@5 | Wild Acc@1 | Wild Acc@5 |
|:---|:---:|:---|:---:|:---:|:---:|:---:|
| MLP v1 | 96 × 96 | Full-face resize | 90.6 | 98.5 | 64.5 | 87.6 |
| MLP v2 | 64 × 64 | Center-Middle | 91.4 | 99.0 | 68.6 | 89.3 |
| MLP v3 | 64 × 64 | Center-Bottom | **95.0** | **99.4** | **72.7** | **92.6** |

Restricting the visual input to the lower half of the face (Center-Bottom crops) outperforms full-face resizing: the 64 × 64 Center-Bottom crop (MLP v3) is the best variant on every metric, gaining 4.4 points of Acc@1 on Lab and 8.2 points on Wild over MLP v1.
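
The exact crop geometry is defined by the preprocessing code in the GitHub repository and is not reproduced here. As a hedged illustration of what the region names could mean, the sketch below assumes that both crops are horizontally centered, that Center-Middle is also vertically centered, and that Center-Bottom sits flush with the bottom edge of the frame. The helper `crop_box` and the 96 × 96 input frame are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def crop_box(frame_h: int, frame_w: int, crop: int, region: str) -> Box:
    """Return crop coordinates for an assumed reading of the region names."""
    if crop > frame_h or crop > frame_w:
        raise ValueError("crop must fit inside the frame")
    left = (frame_w - crop) // 2        # both regions are horizontally centered
    if region == "center-middle":
        top = (frame_h - crop) // 2     # vertically centered
    elif region == "center-bottom":
        top = frame_h - crop            # flush with the bottom edge (mouth area)
    else:
        raise ValueError(f"unknown region: {region}")
    return top, left, top + crop, left + crop


# Example on a hypothetical 96 x 96 face frame.
middle = crop_box(96, 96, 64, "center-middle")
bottom = crop_box(96, 96, 64, "center-bottom")
```

Under these assumptions the Center-Bottom box covers rows 32 to 95 of the frame, so it drops the forehead and eye region that the Center-Middle box still includes.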

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year = {2026}
}
```

```bibtex
@inproceedings{jitaru2020lrro,
  author = {Jitaru, A. C. and Abdulamit, Ș. and Ionescu, B.},
  title = {LRRo: A Lip Reading Data Set for the Under-resourced Romanian Language},
  booktitle = {Proceedings of the ACM Multimedia Systems Conference (MMSys)},
  year = {2020},
  month = {June},
  address = {Istanbul, Turkey}
}
```