---
language:
- ro
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
- whisper
- speech-recognition
- romanian
- noisy-speech
- avsr
- audio-visual-speech-recognition
datasets:
- vsro200/vsro200
metrics:
- wer
model-index:
- name: whisper-small-vsro200
results: []
---
# Noisy Whisper Small for Romanian AVSR
This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.
The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search.
For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
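The per-step score combination used in shallow fusion can be sketched as follows. This is a minimal illustration with toy probabilities, not the repository's implementation: the function name `shallow_fusion_scores` and the fusion weight `lam` are placeholders for whatever the actual beam-search code uses.

```python
import numpy as np

def shallow_fusion_scores(audio_logprobs, visual_logprobs, lam=0.5):
    """Combine per-token log-probabilities from the audio (Whisper) and
    visual (VSR) decoders at one beam-search step via a convex weight.
    Both inputs have shape (vocab_size,)."""
    return (1.0 - lam) * audio_logprobs + lam * visual_logprobs

# Toy 4-token vocabulary at a single decoding step.
audio = np.log(np.array([0.70, 0.10, 0.10, 0.10]))   # audio favours token 0
visual = np.log(np.array([0.05, 0.80, 0.10, 0.05]))  # visual favours token 1
fused = shallow_fusion_scores(audio, visual, lam=0.7)
print(int(np.argmax(fused)))  # high visual weight -> token 1 wins
```

With a small `lam` the audio stream dominates instead, which is the lever that lets fusion fall back on lip-reading only when the acoustics are unreliable.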
## Results
Evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and Babble) and varying signal-to-noise ratios. All values are WER (%); lower is better.
### Gaussian noise
| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 90.11 | 73.49 | 47.80 | 80.26 | **38.87** |
| 0 | 69.54 | 40.99 | 47.80 | 51.67 | **26.76** |
| 5 | 47.40 | 24.40 | 47.80 | 34.72 | **17.63** |
| 10 | 33.68 | 15.69 | 47.80 | 23.43 | **14.55** |
| 15 | 25.73 | 13.08 | 47.80 | 18.77 | **12.11** |
### Babble noise
| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 93.91 | 137.05 | 47.80 | 83.48 | **38.59** |
| 0 | 73.09 | 50.23 | 47.80 | 56.36 | **34.90** |
| 5 | 45.86 | 19.92 | 47.80 | 30.85 | **18.85** |
| 10 | 28.63 | 14.19 | 47.80 | 23.04 | **13.36** |
| 15 | 22.36 | **12.11** | 47.80 | 17.31 | **12.11** |
At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses (137.05% WER). Shallow fusion with the visual stream forces the decoder to rely on lip-reading cues and recovers performance to 38.59% WER, demonstrating the value of multimodal integration in adverse acoustic conditions.
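Mixing noise into clean speech at a target SNR, as in the evaluation above, amounts to rescaling the noise so the speech-to-noise power ratio hits the requested level. A minimal sketch (the helper name `add_noise_at_snr` is illustrative, not from the repository):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db,
    then mix it into `speech`. Both are equal-length 1-D waveforms."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone, 16 kHz
noise = rng.standard_normal(16000)                           # Gaussian noise
noisy = add_noise_at_snr(speech, noise, snr_db=0)            # 0 dB condition
```

Babble noise works the same way, with a multi-speaker recording (e.g. from MUSAN) in place of the Gaussian sample.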
## Training hyperparameters
| Parameter | Value |
|:---|:---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |
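The table above maps onto standard `transformers` `Seq2SeqTrainingArguments` fields roughly as sketched below. This is a plain-dict rendering for reference only; the actual training script lives in the linked GitHub repository and may differ in detail.

```python
# Hyperparameters from the table, expressed with the usual
# Seq2SeqTrainingArguments keyword names (sketch, not the actual script).
training_args = dict(
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=100,
    num_train_epochs=3,
    fp16=True,   # native AMP
    seed=42,
)

# Effective batch size = per-device batch * accumulation steps.
effective_batch = (training_args["per_device_train_batch_size"]
                   * training_args["gradient_accumulation_steps"])
print(effective_batch)  # 32
```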
### Framework versions
Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{vsro200,
title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
author = {...},
year = {2026}
}
```
```bibtex
@article{diaconu2026ron3ws,
title={RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
author={Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
journal={arXiv preprint arXiv:2603.02368},
year={2026}
}
```