	---
	language:
	- ro
	library_name: pytorch
	pipeline_tag: video-classification
	tags:
	- visual-speech-recognition
	- lip-reading
	- word-classification
	- romanian
	- lrro
	metrics:
	- accuracy
	---

	# Word Classification MLPs on LRRo

	This repository hosts the MLP classifier checkpoints used in the isolated word recognition ablation from the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

	To assess the representational quality of our trained VSR encoder independently of the autoregressive decoder, we replaced the decoder with lightweight Multi-Layer Perceptron (MLP) classification heads and fine-tuned them for isolated word classification on the LRRo dataset. Strong word-level accuracy with a non-recurrent head indicates that the spatio-temporal features produced by the VSR frontend are highly discriminative on their own.
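As a rough illustration of this setup, the sketch below puts a small MLP head on top of encoder features in PyTorch. It is not the released architecture: the class name `MLPWordHead`, the mean pooling over time, the single hidden layer, and all sizes (`feature_dim`, `hidden_dim`, `num_words`) are assumptions chosen for the example, and random tensors stand in for the VSR encoder output.

```python
import torch
from torch import nn


class MLPWordHead(nn.Module):
    """Hypothetical MLP head: mean-pool encoder features over time, then classify."""

    def __init__(self, feature_dim: int, hidden_dim: int, num_words: int) -> None:
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_words),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim), as an encoder might produce.
        pooled = features.mean(dim=1)  # assumed pooling; collapses the time axis
        return self.classifier(pooled)  # (batch, num_words) unnormalized logits


# Placeholder sizes for illustration only, not the values used in the paper.
head = MLPWordHead(feature_dim=512, hidden_dim=256, num_words=10)
dummy_features = torch.randn(4, 29, 512)  # random stand-in for encoder output
logits = head(dummy_features)
```

Because the head has no recurrence or attention, any temporal information it uses must already be encoded in the features it receives, which is what makes it a useful probe of the encoder.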

	For training code, preprocessing pipelines, and evaluation scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

	## Results

	We trained four MLP variants that differ only in the visual preprocessing applied before the encoder. The table below reports Top-1 (Acc@1) and Top-5 (Acc@5) word classification accuracy (%) on the LRRo `Lab` (controlled studio recordings) and `Wild` (in-the-wild) test sets; higher is better.

	| Variant | Crop size | Region of interest | Lab Acc@1 | Lab Acc@5 | Wild Acc@1 | Wild Acc@5 |
	|:---|:---:|:---|:---:|:---:|:---:|:---:|
	| MLP v1 | 96 × 96 | Full-face resize | 90.6 | 98.5 | 64.5 | 87.6 |
	| MLP v2 | 64 × 64 | Center-Middle | 91.4 | 99.0 | 68.6 | 89.3 |
	| MLP v3 | 64 × 64 | Center-Bottom | 95.0 | 99.4 | 72.7 | 92.6 |
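For readers unfamiliar with the Acc@1 and Acc@5 columns, the following is a minimal sketch of how top-k accuracy is commonly computed from per-class scores. The function `top_k_accuracy` and the toy scores are illustrative and are not taken from the repository's evaluation scripts.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Return the percentage of samples whose true label is among the k top-scoring classes."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy scores for 3 clips over 4 word classes (made-up numbers).
toy_scores = [
    [0.1, 0.7, 0.1, 0.1],  # top-1 is class 1
    [0.4, 0.3, 0.2, 0.1],  # top-1 is class 0, top-2 is {0, 1}
    [0.1, 0.2, 0.3, 0.4],  # top-1 is class 3, top-2 is {3, 2}
]
toy_labels = [1, 1, 0]

acc_at_1 = top_k_accuracy(toy_scores, toy_labels, k=1)  # 1 of 3 clips correct
acc_at_2 = top_k_accuracy(toy_scores, toy_labels, k=2)  # 2 of 3 clips correct
```

Acc@5 in the table corresponds to `k=5`: a prediction counts as correct when the true word is anywhere among the five highest-scoring classes.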

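	Restricting the visual input to the lower half of the face (Center-Bottom crops) consistently outperforms full-face resizing, with the 64 × 64 crop (MLP v3) yielding the largest improvement on both Lab and Wild data: relative to MLP v1, Acc@1 rises by 4.4 points on Lab (90.6 → 95.0) and by 8.2 points on Wild (64.5 → 72.7).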
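To make the region-of-interest names concrete, here is a small sketch of how such crop windows could be computed. The model card does not define the crops precisely, so this assumes that "Center-Middle" means a square centered both horizontally and vertically, that "Center-Bottom" means a horizontally centered square flush with the bottom edge, and that the crop is taken from a 96 × 96 frame. The helper `crop_box` is hypothetical and is not part of the repository.

```python
def crop_box(height: int, width: int, size: int, vertical: str) -> tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a size x size crop, horizontally centered."""
    if size > height or size > width:
        raise ValueError("crop is larger than the frame")
    left = (width - size) // 2
    if vertical == "middle":
        top = (height - size) // 2  # assumed meaning of Center-Middle
    elif vertical == "bottom":
        top = height - size  # assumed meaning of Center-Bottom
    else:
        raise ValueError("vertical must be 'middle' or 'bottom'")
    return top, left, top + size, left + size


# Assumed 96 x 96 input frame with a 64 x 64 crop.
center_middle = crop_box(96, 96, 64, "middle")  # (16, 16, 80, 80)
center_bottom = crop_box(96, 96, 64, "bottom")  # (32, 16, 96, 80)
```

Under these assumptions the Center-Bottom window drops the top third of the frame and keeps the rows where the mouth and jaw usually sit.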

	## Citation

	If you use these models, please cite:

	```bibtex
	@inproceedings{vsro200,
	title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
	author = {...},
	year = {2026}
	}
	```

	```bibtex
	@inproceedings{jitaru2020lrro,
	author = {Jitaru, A. C. and Abdulamit, Ș. and Ionescu, B.},
	title = {LRRo: A Lip Reading Data Set for the Under-resourced Romanian Language},
	booktitle = {Proceedings of the ACM Multimedia Systems Conference (MMSys)},
	year = {2020},
	month = {June},
	address = {Istanbul, Turkey}
	}
	```