---
language:
- ro
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
- whisper
- speech-recognition
- romanian
- noisy-speech
- avsr
- audio-visual-speech-recognition
datasets:
- vsro200/vsro200
metrics:
- wer
model-index:
- name: whisper-small-vsro200
  results: []
---

# Noisy Whisper Small for Romanian AVSR

This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian: the training data was augmented with noise samples from the MUSAN corpus.

The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search. For the full AVSR pipeline, fusion implementation, and inference scripts, see the [GitHub repository](https://github.com/vsro200/vsro200).

## Results

Evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and babble) and varying signal-to-noise ratios. All values are WER (%); lower is better.
### Gaussian noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 90.11 | 73.49 | 47.80 | 80.26 | **38.87** |
| 0 | 69.54 | 40.99 | 47.80 | 51.67 | **26.76** |
| 5 | 47.40 | 24.40 | 47.80 | 34.72 | **17.63** |
| 10 | 33.68 | 15.69 | 47.80 | 23.43 | **14.55** |
| 15 | 25.73 | 13.08 | 47.80 | 18.77 | **12.11** |

### Babble noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 93.91 | 137.05 | 47.80 | 83.48 | **38.59** |
| 0 | 73.09 | 50.23 | 47.80 | 56.36 | **34.90** |
| 5 | 45.86 | 19.92 | 47.80 | 30.85 | **18.85** |
| 10 | 28.63 | 14.19 | 47.80 | 23.04 | **13.36** |
| 15 | 22.36 | **12.11** | 47.80 | 17.31 | **12.11** |

At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses (137.05% WER). Shallow fusion with the visual stream lets the decoder fall back on lip-reading cues and recovers performance to 38.59% WER, demonstrating the value of multimodal integration in adverse acoustic conditions.
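Conceptually, shallow fusion scores each candidate token at every beam-search step with a weighted sum of the two decoders' log-probabilities. Below is a minimal NumPy sketch of that scoring step; the function name and the fusion weight `lam` are illustrative assumptions, not the repository's API (see the GitHub repository for the actual implementation and weights):

```python
import numpy as np

def shallow_fusion_step(audio_logits, visual_logits, lam=0.5):
    """Combine one decoding step's token scores from the audio (Whisper)
    and visual (VSR) decoders via shallow fusion: a weighted sum of
    log-probabilities. `lam` is a hypothetical fusion weight in [0, 1].
    """
    # Convert raw logits to log-probabilities (log-softmax).
    audio_logp = audio_logits - np.logaddexp.reduce(audio_logits)
    visual_logp = visual_logits - np.logaddexp.reduce(visual_logits)
    return (1.0 - lam) * audio_logp + lam * visual_logp

# Toy 3-token vocabulary: the audio stream prefers token 0, the visual
# stream prefers token 1; the fused scores rank the candidates.
fused = shallow_fusion_step(np.array([2.0, 1.0, 0.1]),
                            np.array([0.5, 2.5, 0.2]), lam=0.5)
next_token = int(np.argmax(fused))  # → 1 (the visual evidence wins here)
```

In a real beam search these fused scores would be added to each hypothesis's running log-probability before pruning to the beam width.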
## Training hyperparameters

| Parameter | Value |
|:---|:---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β = (0.9, 0.999), ε = 1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |

### Framework versions

Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {2026}
}
```

```bibtex
@article{diaconu2026ron3ws,
  title   = {RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
  author  = {Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
  journal = {arXiv preprint arXiv:2603.02368},
  year    = {2026}
}
```
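For reference, the MUSAN-style noise augmentation used during fine-tuning amounts to scaling a noise clip so that the speech-to-noise power ratio matches a target SNR before mixing. The sketch below is a generic illustration of that idea, not the project's training code (which lives in the GitHub repository):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` at the requested signal-to-noise ratio.

    A generic sketch: the noise is tiled/trimmed to the speech length,
    then scaled so that P_speech / P_noise = 10^(snr_db / 10).
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)           # 1 s of synthetic "speech" at 16 kHz
babble = rng.standard_normal(8000)           # synthetic noise clip
noisy = mix_at_snr(clean, babble, snr_db=0)  # 0 dB: equal speech and noise power
```

Sweeping `snr_db` over {-5, 0, 5, 10, 15} reproduces the evaluation conditions in the tables above.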