vsro200/mlp-lrro-vsro200

vsro200 committed 7 days ago

Commit cc890b2 · verified · 1 Parent(s): 2062438

Upload README.md

Files changed (1)

README.md ADDED (+68 -0)

	@@ -0,0 +1,68 @@

+---
+language:
+- ro
+license: cc-by-nc-4.0
+library_name: pytorch
+pipeline_tag: video-classification
+tags:
+- visual-speech-recognition
+- lip-reading
+- word-classification
+- romanian
+- lrro
+metrics:
+- accuracy
+---
+# Word Classification MLPs on LRRo
+This repository hosts the MLP classifier checkpoints used in the isolated word recognition ablation from the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
+To assess the representational quality of our trained VSR encoder independently of the autoregressive decoder, we replaced the decoder with lightweight Multi-Layer Perceptron (MLP) classification heads and fine-tuned them for **isolated word classification** on the **LRRo** dataset. Strong word-level accuracy with a non-recurrent head indicates that the spatio-temporal features produced by the VSR frontend are highly discriminative on their own.
+For training code, preprocessing pipelines, and evaluation scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
+## Configurations
+We trained the following MLP variants, which differ only in the visual preprocessing applied before the encoder:
+| Variant | Crop size | Region of interest |
+|:---|:---:|:---|
+| MLP v1 | 96 × 96 | Full-face resize |
+| MLP v2 | 64 × 64 | Center-Middle |
+| MLP v3 | 64 × 64 | Center-Bottom |
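The regions of interest in the table above can be illustrated with a minimal sketch. The model card does not define the crop geometry, so everything here is an assumption: the helper name `roi_box`, the 96 × 96 source frame, and the reading of "Center-Middle" as a crop centered on both axes and "Center-Bottom" as a horizontally centered crop that touches the bottom edge. The actual preprocessing lives in the linked GitHub repository and may differ.

```python
from typing import Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def roi_box(frame_w: int, frame_h: int, crop: int, region: str) -> Box:
    """Return a square crop box for a hypothetical ROI policy.

    Assumption (not confirmed by the model card): both regions are
    horizontally centered; "center-middle" is also vertically centered,
    while "center-bottom" is aligned to the bottom edge of the frame.
    """
    if crop > frame_w or crop > frame_h:
        raise ValueError("crop must fit inside the frame")
    left = (frame_w - crop) // 2
    if region == "center-middle":
        top = (frame_h - crop) // 2
    elif region == "center-bottom":
        top = frame_h - crop
    else:
        raise ValueError(f"unknown region: {region!r}")
    return (left, top, left + crop, top + crop)


# Hypothetical 96 x 96 face frame with a 64 x 64 crop.
print(roi_box(96, 96, 64, "center-middle"))  # (16, 16, 80, 80)
print(roi_box(96, 96, 64, "center-bottom"))  # (16, 32, 80, 96)
```

Under these assumptions the Center-Bottom box covers rows 32 to 95 of the frame, which is where the mouth region would typically fall in an aligned face crop.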
+## Results
+Top-1 and Top-5 word classification accuracy (%) on the LRRo `Lab` (controlled studio recordings) and `Wild` (in-the-wild) test sets. Higher is better.
+| Variant | Lab Acc@1 | Lab Acc@5 | Wild Acc@1 | Wild Acc@5 |
+|:---|:---:|:---:|:---:|:---:|
+| MLP v1 | 90.6 | 98.5 | 64.5 | 87.6 |
+| MLP v2 | 91.4 | 99.0 | 68.6 | 89.3 |
+| MLP v3 | **95.0** | **99.4** | **72.7** | **92.6** |
+Restricting the visual input to the lower half of the face (Center-Bottom crop, MLP v3) outperforms both full-face resizing (MLP v1) and the Center-Middle crop (MLP v2) on every metric. Relative to MLP v1, Acc@1 improves by 4.4 points on Lab (90.6 → 95.0) and by 8.2 points on Wild (64.5 → 72.7).
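For readers unfamiliar with the metric, Acc@1 and Acc@5 are top-k accuracies: a sample counts as correct when its true word is among the k highest-scoring classes. The sketch below shows one way to compute this in plain Python. The function name `topk_accuracy` and the toy scores are illustrative assumptions, not the evaluation script used for the reported numbers.

```python
from typing import Sequence


def topk_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Percentage of samples whose true label is among the k top-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: 4 samples, 3 classes (made-up scores, not LRRo data).
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
toy_labels = [1, 1, 0, 0]
print(topk_accuracy(toy_scores, toy_labels, k=1))  # 50.0
print(topk_accuracy(toy_scores, toy_labels, k=2))  # 75.0
```

Top-k accuracy never decreases as k grows, which is why every Acc@5 value in the table is at least as high as the corresponding Acc@1.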
+## Citation
+If you use these models, please cite:
+```bibtex
+@inproceedings{vsro200,
+  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
+  author = {...},
+  year   = {2026}
+}
+```
+```bibtex
+@inproceedings{jitaru2020lrro,
+  author    = {Jitaru, A. C. and Abdulamit, \c{S}. and Ionescu, B.},
+  title     = {{LRRo}: A Lip Reading Data Set for the Under-resourced Romanian Language},
+  booktitle = {Proceedings of the ACM Multimedia Systems Conference (MMSys)},
+  year      = {2020},
+  month     = {June},
+  address   = {Istanbul, Turkey}
+}
+```