---
language:
- ro
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
- whisper
- speech-recognition
- romanian
- noisy-speech
- avsr
- audio-visual-speech-recognition
datasets:
- vsro200/vsro200
metrics:
- wer
model-index:
- name: whisper-small-vsro200
  results: []
---

# Noisy Whisper Small for Romanian AVSR

This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.
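
The exact augmentation recipe is not detailed on this card; purely as an illustration, the sketch below mixes a noise clip (e.g. from MUSAN) into a speech waveform at a target SNR. The function name and implementation details are assumptions, not the actual training code.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at a target signal-to-noise ratio in dB."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so the resulting power ratio matches the requested SNR.
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```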

The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search.
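
Conceptually, shallow fusion adds a weighted visual log-probability to the acoustic one at every decoding step and ranks beam candidates by the combined score. The snippet below is only a sketch of that score combination, assuming both decoders share the same tokenizer and vocabulary; the weight `lam` and the function name are illustrative, and the actual fusion (including its beam-search integration) lives in the repository linked below.

```python
import torch

def fuse_next_token_scores(audio_logits: torch.Tensor,
                           visual_logits: torch.Tensor,
                           lam: float = 0.3) -> torch.Tensor:
    """Combine per-step next-token scores from the audio and visual decoders.

    audio_logits / visual_logits: (batch, vocab_size) logits for the next token
    at the current beam-search step. `lam` weights the visual stream (illustrative value).
    """
    audio_logp = torch.log_softmax(audio_logits, dim=-1)
    visual_logp = torch.log_softmax(visual_logits, dim=-1)
    # The fused scores are what beam search would use to expand and rank hypotheses.
    return audio_logp + lam * visual_logp
```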

For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
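
To use the audio model on its own, a standard Whisper transcription call with `transformers` works; in the sketch below, the model id is a placeholder for this repository's id, and `clip.wav` is assumed to be a 16 kHz mono recording.

```python
import soundfile as sf
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "vsro200/whisper-small-vsro200"  # placeholder: replace with this repo's id

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

audio, sr = sf.read("clip.wav")  # expects 16 kHz mono; resample beforehand if needed
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(input_features=inputs.input_features,
                                   language="ro", task="transcribe")
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```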

## Results

Models are evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split under two noise types (Gaussian and babble) at varying signal-to-noise ratios (SNRs). All values are WER (%); lower is better.

### Gaussian noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 90.11 | 73.49 | 47.80 | 80.26 | **38.87** |
| 0 | 69.54 | 40.99 | 47.80 | 51.67 | **26.76** |
| 5 | 47.40 | 24.40 | 47.80 | 34.72 | **17.63** |
| 10 | 33.68 | 15.69 | 47.80 | 23.43 | **14.55** |
| 15 | 25.73 | 13.08 | 47.80 | 18.77 | **12.11** |

### Babble noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 93.91 | 137.05 | 47.80 | 83.48 | **38.59** |
| 0 | 73.09 | 50.23 | 47.80 | 56.36 | **34.90** |
| 5 | 45.86 | 19.92 | 47.80 | 30.85 | **18.85** |
| 10 | 28.63 | 14.19 | 47.80 | 23.04 | **13.36** |
| 15 | 22.36 | **12.11** | 47.80 | 17.31 | **12.11** |

At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses to 137.05% WER. Shallow fusion with the visual stream lets the decoder fall back on lip-reading cues, recovering performance to 38.59% WER and demonstrating the value of multimodal integration in adverse acoustic conditions.
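
WER can be computed with, for example, the `jiwer` package; the snippet below is a minimal sketch with illustrative strings, not the evaluation code used for the paper.

```python
import jiwer

# Reference transcripts and model hypotheses for the evaluated clips (illustrative examples).
references = ["un exemplu de transcriere de referinta"]
hypotheses = ["un exemplu de transcriere de referinte"]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {100 * wer:.2f}%")
```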

## Training hyperparameters

| Parameter | Value |
|:---|:---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |
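
As a convenience, these settings map onto `Seq2SeqTrainingArguments` roughly as in the sketch below; `output_dir` and any option not listed in the table (e.g. `predict_with_generate`) are assumptions rather than the exact training configuration.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-vsro200",  # assumed name, not the original output path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,   # 8 x 4 = effective batch size 32
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=100,
    optim="adamw_torch_fused",       # AdamW (torch fused), betas=(0.9, 0.999), eps=1e-08
    fp16=True,                       # native AMP mixed precision
    seed=42,
    predict_with_generate=True,      # assumed; not listed in the table above
)
```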

### Framework versions

Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{vsro200,
  title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year = {2026}
}
```

```bibtex
@article{diaconu2026ron3ws,
  title={RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
  author={Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
  journal={arXiv preprint arXiv:2603.02368},
  year={2026}
}
```