---
language:
  - ro
library_name: pytorch
pipeline_tag: video-classification
tags:
  - visual-speech-recognition
  - lip-reading
  - word-classification
  - romanian
  - lrro
metrics:
  - accuracy
---

Word Classification MLPs on LRRo

This repository hosts the MLP classifier checkpoints used in the isolated word recognition ablation from the paper VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness.

To assess the representational quality of our trained VSR encoder independently of the autoregressive decoder, we replaced the decoder with lightweight Multi-Layer Perceptron (MLP) classification heads and fine-tuned them for isolated word classification on the LRRo dataset. Strong word-level accuracy with a non-recurrent head indicates that the spatio-temporal features produced by the VSR frontend are highly discriminative on their own.
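To make the setup concrete, the following is a minimal PyTorch sketch of an MLP head placed on top of frozen or fine-tuned encoder features. It is not the released architecture: the class name `MLPWordHead`, the masked mean pooling over time, the layer sizes, and the number of word classes are all assumptions chosen for illustration, and a random tensor stands in for the VSR encoder output.

```python
import torch
import torch.nn as nn


class MLPWordHead(nn.Module):
    """Hypothetical MLP head: pools encoder features over time, then classifies."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_words: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_words),
        )

    def forward(self, feats: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) encoder outputs; lengths: (batch,) valid frames.
        steps = torch.arange(feats.size(1), device=feats.device)
        mask = (steps[None, :] < lengths[:, None]).unsqueeze(-1).to(feats.dtype)
        # Masked mean over time so padded frames do not affect the pooled vector.
        pooled = (feats * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.net(pooled)  # (batch, num_words) unnormalized logits


# Stand-in for the VSR encoder output; every size here is a placeholder.
head = MLPWordHead(feat_dim=256, hidden_dim=512, num_words=50)
feats = torch.randn(4, 29, 256)
lengths = torch.tensor([29, 20, 25, 10])
logits = head(feats, lengths)
```

Because the head has no recurrence or attention, any temporal modelling reflected in the accuracy has to come from the encoder features themselves, which is the point of the ablation.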

For training code, preprocessing pipelines, and evaluation scripts, please refer to the GitHub repository.

Results

We trained four MLP variants that differ only in the visual preprocessing applied before the encoder; the table below lists results for three of them (MLP v1 to v3). It reports Top-1 and Top-5 word classification accuracy (%) on the LRRo Lab (controlled studio recordings) and Wild (in-the-wild) test sets. Higher is better.

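| Variant | Crop size | Region of interest | Lab Acc@1 | Lab Acc@5 | Wild Acc@1 | Wild Acc@5 |
|---------|-----------|--------------------|-----------|-----------|------------|------------|
| MLP v1 | 96 × 96 | Full-face resize | 90.6 | 98.5 | 64.5 | 87.6 |
| MLP v2 | 64 × 64 | Center-Middle | 91.4 | 99.0 | 68.6 | 89.3 |
| MLP v3 | 64 × 64 | Center-Bottom | 95.0 | 99.4 | 72.7 | 92.6 |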
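For readers unfamiliar with the metric, here is a small plain-Python sketch of how Acc@k is commonly computed: a sample counts as correct when its true word is among the k highest-scoring classes. The function name `topk_accuracy` and the toy scores are illustrative assumptions, not part of the released evaluation scripts.

```python
from typing import Sequence


def topk_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in ranked
    return 100.0 * hits / len(labels)


# Toy scores for 3 samples over 4 hypothetical word classes.
scores = [
    [0.1, 0.7, 0.1, 0.1],  # true class 1 is ranked first
    [0.4, 0.3, 0.2, 0.1],  # true class 1 is ranked second
    [0.5, 0.2, 0.2, 0.1],  # true class 3 is ranked last
]
labels = [1, 1, 3]
acc1 = topk_accuracy(scores, labels, k=1)
acc2 = topk_accuracy(scores, labels, k=2)
```

Acc@1 and Acc@5 in the table correspond to `k=1` and `k=5`; the toy example uses `k=2` only because it has four classes.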

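Restricting the visual input to the lower half of the face (Center-Bottom crops) consistently outperforms full-face resizing. The 64 × 64 Center-Bottom crop (MLP v3) yields the largest improvement on both test sets: relative to MLP v1, Acc@1 rises by 4.4 points on Lab (95.0 vs. 90.6) and by 8.2 points on Wild (72.7 vs. 64.5).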
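The region-of-interest names can be read as window placements inside the face frame. The NumPy sketch below shows one such reading, and it is an assumption rather than the repository's preprocessing: the helper `crop_region`, the 96 × 96 source frame, and the exact offsets (horizontally centered, then either vertically centered or aligned to the bottom edge) are hypothetical.

```python
import numpy as np


def crop_region(frame: np.ndarray, size: int, vertical: str) -> np.ndarray:
    """Cut a size x size window that is horizontally centered in the frame.

    vertical="middle" centers the window vertically; vertical="bottom" aligns it
    with the bottom edge. Both placements are assumptions for illustration.
    """
    height, width = frame.shape[:2]
    if size > height or size > width:
        raise ValueError("crop size exceeds frame size")
    left = (width - size) // 2
    if vertical == "middle":
        top = (height - size) // 2
    elif vertical == "bottom":
        top = height - size
    else:
        raise ValueError(f"unknown vertical placement: {vertical}")
    return frame[top:top + size, left:left + size]


# Stand-in for a 96 x 96 grayscale face frame (assumed source resolution).
frame = np.arange(96 * 96).reshape(96, 96)
center_middle = crop_region(frame, 64, "middle")
center_bottom = crop_region(frame, 64, "bottom")
```

Under this reading, the Center-Bottom window keeps the mouth and chin area at native resolution, whereas a full-face resize spends most pixels on regions that carry little articulatory information.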

Citation

If you use these models, please cite:

@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {2026}
}
@inproceedings{jitaru2020lrro,
  author    = {Jitaru, A. C. and Abdulamit, Ș. and Ionescu, B.},
  title     = {LRRo: A Lip Reading Data Set for the Under-resourced Romanian Language},
  booktitle = {Proceedings of the ACM Multimedia Systems Conference (MMSys)},
  year      = {2020},
  month     = {June},
  address   = {Istanbul, Turkey}
}