vsro200/whisper-small-vsro200

@@ -1,69 +1,109 @@
 ---
 library_name: transformers
 base_model: alexandradiaconu/whisper-small-echo-34
 tags:
 - generated_from_trainer
 model-index:
 - name: whisper-small-vsro200
   results: []
 ---
-# Romanian Noisy Whisper Small (AVSR Audio Component)
-This repository contains the audio backbone model used for the Audio-Visual Speech Recognition (AVSR) pipeline via **Shallow Fusion**, introduced in our paper **"VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness"**.
-## 🎙️ Model Overview
-This model is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted specifically to handle challenging acoustic environments in the Romanian language.
-In real-world scenarios (like podcasts or vlogs), the audio track is rarely clean. To make our audio processing robust and suitable for AVSR, we fine-tuned this Whisper model on our custom dataset with **artificially added noise** from the MUSAN library. The core purpose of this model is to serve as the acoustic counterpart to our Romanian VSR model.
-## 🔗 The Shallow Fusion (AVSR) Approach
-Visual Speech Recognition (VSR) alone is limited by visual ambiguities, while Audio Speech Recognition (ASR) fails in noisy environments.
-We implemented a **Shallow Fusion** mechanism during the decoding phase (beam search), by combining the output probabilities of this fine-tuned noisy Whisper model (Acoustic) with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) (Visual).
-## 🏆 AVSR Performance (Word Error Rate)
-The following table demonstrates the effectiveness of our Shallow Fusion pipeline across different noise types (Gaussian and Babble) and Signal-to-Noise Ratio (SNR) levels.
-*Note: Results were evaluated on a random subset of 100 clips from the `test_unseen` set of the VSRo-200 dataset.*
-### 🔊 Performance under Noise Degradation
-| Noise Type | SNR (dB) | Whisper Zero-Shot (%) | Whisper Fine-Tuned *(This Model)* (%) | MultiVSR *(Visual Only)* (%) | Fusion (Zero-Shot + VSR) (%) | **Fusion (Fine-Tuned + VSR)** (%) |
-|:---|:---:|:---:|:---:|:---:|:---:|:---:|
-| Gaussian | -5 | 89.43 | 83.16 | 50.52 | 78.68 | **40.92** |
-| Gaussian | 0 | 67.32 | 42.78 | 50.52 | 46.04 | **25.98** |
-| Gaussian | 5 | 48.37 | 24.79 | 50.52 | 37.48 | **19.63** |
-| Gaussian | 10 | 32.43 | 17.34 | 50.52 | 22.93 | **15.66** |
-| Gaussian | 15 | 23.97 | 13.87 | 50.52 | 18.92 | **12.50** |
-| Babble | -5 | 94.95 | 141.78 | 50.52 | 79.65 | **45.61** |
-| Babble | 0 | 73.38 | 49.77 | 50.52 | 56.11 | **29.77** |
-| Babble | 5 | 46.40 | 21.50 | 50.52 | 31.96 | **18.63** |
-| Babble | 10 | 29.24 | **14.22** | 50.52 | 21.86 | 15.16 |
-| Babble | 15 | 22.36 | **11.68** | 50.52 | 18.77 | 12.65 |
-**Observation:** At extreme noise levels (e.g., Babble noise at -5 dB SNR), the standalone fine-tuned audio model degrades completely (141.78% WER). However, through Shallow Fusion with the MultiVSR component, the system forces the decoder to trust visual cues, recovering the performance down to a remarkable **45.61% WER**.
-## 📊 Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 1e-05
-- train_batch_size: 8
-- eval_batch_size: 8
-- seed: 42
-- gradient_accumulation_steps: 4
-- total_train_batch_size: 32
-- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: linear
-- lr_scheduler_warmup_steps: 100
-- num_epochs: 3
-- mixed_precision_training: Native AMP
 ### Framework versions
-- Transformers 5.0.0
-- Pytorch 2.10.0+cu128
-- Datasets 4.0.0
-- Tokenizers 0.22.2

 ---
+language:
+- ro
+license: apache-2.0
 library_name: transformers
+pipeline_tag: automatic-speech-recognition
 base_model: alexandradiaconu/whisper-small-echo-34
 tags:
 - generated_from_trainer
+- whisper
+- speech-recognition
+- romanian
+- noisy-speech
+- avsr
+- audio-visual-speech-recognition
+datasets:
+- vsro200/vsro200
+metrics:
+- wer
 model-index:
 - name: whisper-small-vsro200
   results: []
 ---
+# Noisy Whisper Small for Romanian AVSR
+This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
+It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.
+The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search.
+For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
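As a rough illustration of shallow fusion, the sketch below combines per-token log-probabilities from an audio model and a visual model with a weighted sum at a single decoding step. The function name `fuse_log_probs`, the weight `visual_weight`, the toy vocabulary, and the scores are all assumptions made for this example; the fusion actually used in the pipeline lives in the GitHub repository and may differ.

```python
import math

def log_softmax(scores):
    """Convert raw scores to log-probabilities in a numerically stable way."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def fuse_log_probs(audio_logp, visual_logp, visual_weight=0.5):
    """Weighted sum of two log-probability vectors over the same vocabulary.

    visual_weight is a hypothetical interpolation weight, not a value
    taken from the paper or the repository.
    """
    if len(audio_logp) != len(visual_logp):
        raise ValueError("both models must score the same vocabulary")
    w = visual_weight
    return [(1.0 - w) * a + w * v for a, v in zip(audio_logp, visual_logp)]

def best_token(vocab, log_probs):
    """Return the vocabulary entry with the highest score."""
    return vocab[max(range(len(vocab)), key=lambda i: log_probs[i])]

# Toy decoding step: noisy audio slightly prefers "masa",
# while the visual stream clearly prefers "casa".
vocab = ["casa", "masa", "rasa"]
audio_logp = log_softmax([1.0, 1.2, 0.9])
visual_logp = log_softmax([2.0, 0.1, 1.8])

best_audio = best_token(vocab, audio_logp)
best_fused = best_token(vocab, fuse_log_probs(audio_logp, visual_logp))
```

In a real beam search the fused score would be accumulated per hypothesis at every step rather than used for a single greedy choice.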
+## Results
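+Results are reported on a random 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and babble) at signal-to-noise ratios (SNR) from -5 dB to 15 dB. All values are WER (%); lower is better. The MultiVSR column is constant because the visual-only model does not use the audio track.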
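For readers unfamiliar with SNR-controlled evaluation, the sketch below shows the standard way to scale a noise signal so that the mixture reaches a target SNR. It is not the script used to produce the tables; the function names, the synthetic sine "speech", and the seeded Gaussian noise are assumptions chosen only to keep the example self-contained.

```python
import math
import random

def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the result has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("signals must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, mixed):
    """Recover the SNR of a mixture given the clean speech."""
    residual = [m - s for s, m in zip(speech, mixed)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))

# Stand-in signals: one second of a 220 Hz tone and seeded Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
mixed = mix_at_snr(speech, noise, snr_db=-5)
```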
+### Gaussian noise
+| SNR (dB) | Whisper zero-shot | Whisper fine-tuned | MultiVSR (visual) | Fusion (zero-shot + VSR) | Fusion (fine-tuned + VSR) |
+|:---:|:---:|:---:|:---:|:---:|:---:|
+| -5  | 89.43 | 83.16 | 50.52 | 78.68 | 40.92 |
+|  0  | 67.32 | 42.78 | 50.52 | 46.04 | 25.98 |
+|  5  | 48.37 | 24.79 | 50.52 | 37.48 | 19.63 |
+| 10  | 32.43 | 17.34 | 50.52 | 22.93 | 15.66 |
+| 15  | 23.97 | 13.87 | 50.52 | 18.92 | 12.50 |
+### Babble noise
+| SNR (dB) | Whisper zero-shot | Whisper fine-tuned | MultiVSR (visual) | Fusion (zero-shot + VSR) | Fusion (fine-tuned + VSR) |
+|:---:|:---:|:---:|:---:|:---:|:---:|
+| -5  | 94.95 | 141.78 | 50.52 | 79.65 | 45.61 |
+|  0  | 73.38 |  49.77 | 50.52 | 56.11 | 29.77 |
+|  5  | 46.40 |  21.50 | 50.52 | 31.96 | 18.63 |
+| 10  | 29.24 |  14.22 | 50.52 | 21.86 | 15.16 |
+| 15  | 22.36 |  11.68 | 50.52 | 18.77 | 12.65 |
+At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses (141.78% WER). Shallow fusion with the visual stream lets the decoder fall back on lip-reading cues and brings the error down to 45.61% WER. At higher SNRs the gain shrinks: for babble at 10 dB and 15 dB, the fine-tuned audio model alone (14.22% and 11.68%) is slightly better than fusion (15.16% and 12.65%).
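A WER above 100%, such as the 141.78% reported for babble at -5 dB, is possible because inserted words count as errors while the denominator is the length of the reference. The sketch below is a minimal word-level edit-distance WER that makes this concrete; it is not the evaluation script behind the tables, and the function name and example strings are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Single-row dynamic programming for word-level Levenshtein distance.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# A hypothesis with repeated words: 3 insertions against a
# 2-word reference gives a WER of 150%.
wer_perfect = word_error_rate("salut lume", "salut lume")
wer_repeated = word_error_rate("salut lume", "salut salut salut lume lume")
```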
+## Training hyperparameters
+| Parameter | Value |
+|:---|:---|
+| Learning rate | 1e-05 |
+| Train batch size | 8 |
+| Eval batch size | 8 |
+| Gradient accumulation steps | 4 |
+| Effective batch size | 32 |
+| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
+| LR scheduler | Linear, 100 warmup steps |
+| Epochs | 3 |
+| Mixed precision | Native AMP |
+| Seed | 42 |
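To show how the table entries relate, the sketch below computes the effective batch size and the shape of a linear schedule with 100 warmup steps. The total step count (`total_steps=1000`) is a hypothetical value, since the card does not state it, and a single training device is assumed; the Trainer's own scheduler may differ in details.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples contributing to each optimizer update (num_devices is assumed)."""
    return per_device_batch * grad_accum_steps * num_devices

def linear_lr(step, total_steps, base_lr=1e-5, warmup_steps=100):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

batch = effective_batch_size(per_device_batch=8, grad_accum_steps=4)

# Hypothetical run length, used only to illustrate the schedule shape.
total_steps = 1000
lr_start = linear_lr(0, total_steps)
lr_mid_warmup = linear_lr(50, total_steps)
lr_peak = linear_lr(100, total_steps)
lr_end = linear_lr(total_steps, total_steps)
```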
 ### Framework versions
+Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2
+## Citation
+If you use this model, please cite:
+```bibtex
+@inproceedings{vsro200,
+  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
+  author = {...},
+  year   = {2026}
+}
+```
+```bibtex
+@article{diaconu2026ron3ws,
+  title={RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
+  author={Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
+  journal={arXiv preprint arXiv:2603.02368},
+  year={2026}
+}
+```
+```bibtex
+@article{radford2022whisper,
+  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
+  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
+  journal = {arXiv preprint arXiv:2212.04356},
+  year    = {2022}
+}
+```