Upload README.md
README.md
---
language:
- ro
license: cc-by-nc-4.0
library_name: pytorch
pipeline_tag: video-classification
tags:
- visual-speech-recognition
- lip-reading
- word-classification
- romanian
- lrro
metrics:
- accuracy
---

# Word Classification MLPs on LRRo

This repository hosts the MLP classifier checkpoints used in the isolated word recognition ablation from the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

To assess the representational quality of our trained VSR encoder independently of the autoregressive decoder, we replaced the decoder with lightweight Multi-Layer Perceptron (MLP) classification heads and fine-tuned them for **isolated word classification** on the **LRRo** dataset. Strong word-level accuracy with a non-recurrent head indicates that the spatio-temporal features produced by the VSR frontend are highly discriminative on their own.
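
As a rough illustration of the setup described above, the sketch below places a small MLP head on top of encoder features. Everything specific in it is an assumption made for illustration and is not taken from the paper or the released checkpoints: the class name `WordMLPHead`, the feature dimension (512), the hidden width (256), the number of word classes (100), the dropout rate, and the use of temporal mean pooling to collapse the frame sequence into one vector.

```python
import torch
import torch.nn as nn


class WordMLPHead(nn.Module):
    """Hypothetical MLP head: pool encoder features over time, then classify."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256,
                 num_words: int = 100, dropout: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_words),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) sequence from the VSR encoder.
        pooled = feats.mean(dim=1)  # assumed mean pooling -> (batch, feat_dim)
        return self.mlp(pooled)     # word logits -> (batch, num_words)


# Random stand-in for encoder output: 4 clips, 29 frames, 512-dim features.
feats = torch.randn(4, 29, 512)
head = WordMLPHead().eval()
with torch.no_grad():
    logits = head(feats)
predicted_word_ids = logits.argmax(dim=-1)
```

Because the head has no recurrence or attention of its own, any word-level accuracy it reaches has to come from information already present in the encoder features, which is the point of the ablation.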

For training code, preprocessing pipelines, and evaluation scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Configurations

The MLP variants listed below differ only in the visual preprocessing applied before the encoder:

| Variant | Crop size | Region of interest |
|:---|:---:|:---|
| MLP v1 | 96 × 96 | Full-face resize |
| MLP v2 | 64 × 64 | Center-Middle |
| MLP v3 | 64 × 64 | Center-Bottom |
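
The three preprocessing modes in the table can be sketched as follows. This is not the repository's preprocessing code, and the interpretation of the region names is an assumption: here "Center-Middle" is taken to mean a window centred both horizontally and vertically, and "Center-Bottom" a horizontally centred window flush with the bottom edge of the face frame. The function names and the 128 × 128 input size are hypothetical.

```python
import torch
import torch.nn.functional as F


def full_face_resize(frames: torch.Tensor, size: int = 96) -> torch.Tensor:
    # frames: (time, channels, height, width), float. Resizes the whole frame.
    return F.interpolate(frames, size=(size, size), mode="bilinear",
                         align_corners=False)


def center_crop(frames: torch.Tensor, size: int = 64,
                vertical: str = "middle") -> torch.Tensor:
    # Horizontally centred square crop; the vertical anchor is an assumption.
    _, _, height, width = frames.shape
    left = (width - size) // 2
    if vertical == "middle":
        top = (height - size) // 2
    elif vertical == "bottom":
        top = height - size
    else:
        raise ValueError(f"unknown vertical anchor: {vertical}")
    return frames[:, :, top:top + size, left:left + size]


# Random stand-in for a grayscale face track: 29 frames of 128 x 128 pixels.
frames = torch.rand(29, 1, 128, 128)
v1_input = full_face_resize(frames, size=96)                # MLP v1
v2_input = center_crop(frames, size=64, vertical="middle")  # MLP v2
v3_input = center_crop(frames, size=64, vertical="bottom")  # MLP v3
```

For the exact crop geometry used in the experiments, the preprocessing pipeline in the GitHub repository is the authoritative source.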

## Results

Top-1 and Top-5 word classification accuracy (%) on the LRRo `Lab` (controlled studio recordings) and `Wild` (in-the-wild) test sets. Higher is better.

| Variant | Lab Acc@1 | Lab Acc@5 | Wild Acc@1 | Wild Acc@5 |
|:---|:---:|:---:|:---:|:---:|
| MLP v1 | 90.6 | 98.5 | 64.5 | 87.6 |
| MLP v2 | 91.4 | 99.0 | 68.6 | 89.3 |
| MLP v3 | **95.0** | **99.4** | **72.7** | **92.6** |

Restricting the visual input to the lower half of the face with the 64 × 64 Center-Bottom crop (MLP v3) gives the best result on every metric. Relative to full-face resizing (MLP v1), Acc@1 rises from 90.6 to 95.0 on Lab and from 64.5 to 72.7 on Wild. The Center-Middle crop (MLP v2) falls between the two on all four metrics.
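
For reference, Acc@k counts a sample as correct when the true word is among the k highest-scoring classes. The sketch below is a minimal pure-Python version of that definition, not the repository's evaluation script; the function name `topk_accuracy` and the toy scores are made up for illustration.

```python
def topk_accuracy(scores, targets, k):
    """Percentage of samples whose target class is among the k top-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += target in ranked[:k]
    return 100.0 * hits / len(targets)


# Toy example: 3 samples, 3 classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
targets = [1, 1, 0]

acc_at_1 = topk_accuracy(scores, targets, k=1)  # only the first sample is a hit
acc_at_2 = topk_accuracy(scores, targets, k=2)  # the first two samples are hits
```

Acc@k can never decrease as k grows, which is why the Acc@5 columns are always at least as high as the Acc@1 columns.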

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
author = {...},
year = {2026}
}
```

```bibtex
@inproceedings{jitaru2020lrro,
author = {Jitaru, A. C. and Abdulamit, \c{S}. and Ionescu, B.},
title = {{LRRo}: A Lip Reading Data Set for the Under-resourced Romanian Language},
booktitle = {Proceedings of the ACM Multimedia Systems Conference (MMSys)},
year = {2020},
month = {June},
address = {Istanbul, Turkey}
}
```