---
language:
- ro
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
- whisper
- speech-recognition
- romanian
- noisy-speech
- avsr
- audio-visual-speech-recognition
datasets:
- vsro200/vsro200
metrics:
- wer
model-index:
- name: whisper-small-vsro200
  results: []
---
# Noisy Whisper Small for Romanian AVSR
This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.
The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search.
For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
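
As a rough illustration of the shallow-fusion idea described above, the sketch below interpolates per-token log-probabilities from an audio model and a visual model at a single decoding step. It is a minimal sketch, not the repository's implementation: the function names, the `visual_weight` parameter, and the assumption that both models score the same vocabulary are hypothetical choices made for the example.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def fuse_log_probs(audio_logits, visual_logits, visual_weight=0.3):
    """Log-linear interpolation of two next-token distributions.

    Assumption: both models score the same vocabulary, so index i refers
    to the same token in both arrays. `visual_weight` is a hypothetical
    tuning knob in [0, 1]; 0 ignores the visual stream entirely.
    """
    audio_lp = log_softmax(np.asarray(audio_logits, dtype=np.float64))
    visual_lp = log_softmax(np.asarray(visual_logits, dtype=np.float64))
    return (1.0 - visual_weight) * audio_lp + visual_weight * visual_lp


# Toy step with a 3-token vocabulary: the (noisy) audio model slightly
# prefers token 0, while the visual model clearly prefers token 1.
audio_step = [1.0, 0.8, -2.0]
visual_step = [-1.0, 3.0, -2.0]

audio_only_choice = int(np.argmax(fuse_log_probs(audio_step, visual_step, 0.0)))
fused_choice = int(np.argmax(fuse_log_probs(audio_step, visual_step, 0.5)))
```

In a beam search, the fused scores would replace the audio-only scores when ranking candidate continuations at each step; the weight would normally be tuned on held-out data.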
## Results
Evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and Babble) and varying signal-to-noise ratios. All values are WER (%); lower is better.
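
For readers unfamiliar with SNR-controlled evaluation, the sketch below shows one common way to mix a noise signal into speech at a target SNR, using the usual power-ratio definition. It is an illustrative assumption about the procedure, not the evaluation script itself; `mix_at_snr` and the synthetic signals are hypothetical.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the result has the requested SNR in dB.

    SNR is defined here as 10 * log10(speech_power / noise_power), with
    power measured as the mean squared amplitude over the whole clip.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop or trim the noise so it covers the full speech clip.
    repeats = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, repeats)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so that speech_power / scaled_noise_power hits the target.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


# Synthetic stand-ins for a speech clip and a shorter noise recording.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
babble = rng.standard_normal(4000)
noisy = mix_at_snr(clean, babble, snr_db=-5.0)
```

Lower SNR values mean louder noise relative to speech: at -5 dB the noise carries roughly three times the power of the speech signal.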
### Gaussian noise
| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 90.11 | 73.49 | 47.80 | 80.26 | **38.87** |
| 0 | 69.54 | 40.99 | 47.80 | 51.67 | **26.76** |
| 5 | 47.40 | 24.40 | 47.80 | 34.72 | **17.63** |
| 10 | 33.68 | 15.69 | 47.80 | 23.43 | **14.55** |
| 15 | 25.73 | 13.08 | 47.80 | 18.77 | **12.11** |
### Babble noise
| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 93.91 | 137.05 | 47.80 | 83.48 | **38.59** |
| 0 | 73.09 | 50.23 | 47.80 | 56.36 | **34.90** |
| 5 | 45.86 | 19.92 | 47.80 | 30.85 | **18.85** |
| 10 | 28.63 | 14.19 | 47.80 | 23.04 | **13.36** |
| 15 | 22.36 | **12.11** | 47.80 | 17.31 | **12.11** |
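At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses to 137.05% WER, worse than the zero-shot model (93.91%). A WER above 100% means the hypotheses contain more word errors than the references contain words, which typically happens when the model inserts many spurious words. Shallow fusion with the visual stream lets the decoder lean on lip-reading cues and brings WER down to 38.59%, below both the audio-only result and the visual-only baseline (47.80%), demonstrating the value of multimodal integration in adverse acoustic conditions.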
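
To make the WER figures concrete, here is a minimal sketch of word error rate computed as word-level edit distance divided by the reference length. It is a generic textbook formulation, not the scoring code used for the tables above (which may apply text normalization); the example strings are toy inputs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER (%) = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# A hypothesis that repeats words adds three insertions to a two-word
# reference, so the error count exceeds the reference length.
wer_repeated = word_error_rate("buna ziua", "buna ziua ziua ziua ziua")  # 150.0
wer_exact = word_error_rate("buna ziua", "buna ziua")  # 0.0
```

Because insertions are counted against the reference length, WER has no upper bound of 100%.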
## Training hyperparameters
| Parameter | Value |
|:---|:---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |
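
The sketch below works through two entries of the table: how the effective batch size follows from the per-device batch size and gradient accumulation, and what a linear schedule with 100 warmup steps looks like. The single-device assumption and `total_steps=1000` are hypothetical, since the card does not state the device count or the total number of optimizer steps.

```python
def linear_lr(step, base_lr=1e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup from 0 to base_lr, then linear decay back to 0.

    `total_steps` is a placeholder value; the real number depends on the
    dataset size, the effective batch size, and the 3 training epochs.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Effective batch size = per-device batch * accumulation steps * devices.
# A single device is assumed here, which matches the table (8 * 4 = 32).
train_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1  # assumption
effective_batch_size = train_batch_size * gradient_accumulation_steps * num_devices

lr_midway_through_warmup = linear_lr(50)
lr_at_peak = linear_lr(100)
lr_at_end = linear_lr(1000)
```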
### Framework versions
Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{vsro200,
title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
author = {...},
year = {2026}
}
```
```bibtex
@article{diaconu2026ron3ws,
title={RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
author={Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
journal={arXiv preprint arXiv:2603.02368},
year={2026}
}
```