---
language:
- ro
license: cc-by-nc-4.0
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
- visual-speech-recognition
- lip-reading
- vsr
- romanian
- speech-recognition
- audio-visual
datasets:
- vsro200/vsro200
metrics:
- wer
---

# VSRo-200: Romanian Visual Speech Recognition Models

This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Checkpoints

All checkpoints are stored in the `checkpoints/` directory and follow the naming pattern `model_[hours]_[type].pt` (a minimal loading sketch follows the list below):

- `_annot` — trained on human-annotated transcriptions
- `_auto` — trained on automatically generated pseudo-labels
- `_shuffle` — alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix` — gender-controlled 40h subsets used for bias analysis
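
A minimal loading sketch, assuming the `.pt` files are plain PyTorch checkpoints hosted in this model repository under `checkpoints/`; the repo id below is a placeholder, and the actual MultiVSR model definition and inference code live in the GitHub repository linked above:

```python
# Sketch only: download one checkpoint from the Hub and inspect its contents.
# Assumptions: the checkpoints are plain PyTorch files stored under checkpoints/;
# the repo_id is a placeholder -- replace it with this repository's actual id.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="vsro200/vsro200-models",           # placeholder repo id
    filename="checkpoints/model_200h_auto.pt",  # follows the naming pattern above
)

# Load on CPU just to inspect the stored state (weights, config, etc.).
state = torch.load(ckpt_path, map_location="cpu")
print(type(state))
if isinstance(state, dict):
    print(list(state.keys())[:10])
```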

## Results

All results are reported as Word Error Rate (WER, %) on the **Test Unseen** split. Lower is better.
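
For reference, WER counts word-level substitutions, deletions, and insertions against the reference transcript. A generic sketch using the `jiwer` package (an illustration only, not part of this repository's tooling):

```python
# Generic WER illustration (not specific to these models):
# WER = (substitutions + deletions + insertions) / number of reference words.
from jiwer import wer  # pip install jiwer

reference = "salut si bine ati venit la podcast"
hypothesis = "salut bine ati venit la un podcast"   # one deletion, one insertion

print(f"WER: {100 * wer(reference, hypothesis):.2f}%")  # 2 errors / 7 words ≈ 28.57%
```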

### Annotated vs. auto-labeled data scaling

Comparison of models trained on human-annotated data versus models trained on automatically generated labels (pseudo-labels), across different dataset sizes.

| Hours | Annotated | Auto |
|:---:|:---:|:---:|
| 10 | 72.50 | 74.61 |
| 25 | 64.86 | 66.27 |
| 50 | 58.87 | 59.28 |
| 75 | 54.86 | 56.25 |
| 100 | 53.29 | 53.63 |
| 125 | — | 51.71 |
| 150 | — | 51.25 |
| 175 | — | 49.84 |
| 200 | — | 48.75 |

### Variance analysis (100h models, 3 random shuffles)

Due to the high computational cost of repeated training runs, multiple-run variance testing was restricted to the 100-hour models. Each configuration was trained across 3 different random data shuffles to assess stability and the true impact of human annotations versus auto-generated labels.

| Data type | Mean WER | Std. dev. |
|:---|:---:|:---:|
| Human annotated | 53.21 | ± 0.37 |
| Auto generated | 53.82 | ± 0.17 |

### Gender bias analysis (40h models)

| Training subset | Global | Males | Females |
|:---|:---:|:---:|:---:|
| 40h Males | 62.15 | 61.32 | 62.97 |
| 40h Females | 59.33 | 59.17 | 59.49 |
| 40h Mix | 59.52 | 59.19 | 59.85 |

### Out-of-distribution robustness (`model_200h_auto.pt`)

Evaluated using the `model_200h_auto.pt` checkpoint under different video degradation and domain shift scenarios.

| OOD category | WER |
|:---|:---:|
| Vlogs | 58.61 |
| Specific domains | 63.01 |
| Noisy | 68.96 |
| Archival | 87.97 |
| Global OOD | 68.46 |

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
```