---
language:
- ro
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
- whisper
- speech-recognition
- romanian
- noisy-speech
- avsr
- audio-visual-speech-recognition
datasets:
- vsro200/vsro200
metrics:
- wer
model-index:
- name: whisper-small-vsro200
  results: []
---

# Noisy Whisper Small for Romanian AVSR

This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.
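
The exact augmentation recipe is not detailed on this card; purely as an illustration, the sketch below mixes a noise clip (e.g. from MUSAN) into a speech waveform at a target SNR. The function name and implementation details are assumptions, not the actual training code.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at a target signal-to-noise ratio in dB."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so the resulting power ratio matches the requested SNR.
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```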

The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search.
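
Conceptually, shallow fusion adds a weighted visual log-probability to the acoustic one at every decoding step and ranks beam candidates by the combined score. The snippet below is only a sketch of that score combination, assuming both decoders share the same tokenizer and vocabulary; the weight `lam` and the function name are illustrative, and the actual fusion (including its beam-search integration) lives in the repository linked below.

```python
import torch

def fuse_next_token_scores(audio_logits: torch.Tensor,
                           visual_logits: torch.Tensor,
                           lam: float = 0.3) -> torch.Tensor:
    """Combine per-step next-token scores from the audio and visual decoders.

    audio_logits / visual_logits: (batch, vocab_size) logits for the next token
    at the current beam-search step. `lam` weights the visual stream (illustrative value).
    """
    audio_logp = torch.log_softmax(audio_logits, dim=-1)
    visual_logp = torch.log_softmax(visual_logits, dim=-1)
    # The fused scores are what beam search would use to expand and rank hypotheses.
    return audio_logp + lam * visual_logp
```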

For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
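
To use the audio model on its own, a standard Whisper transcription call with `transformers` works; in the sketch below, the model id is a placeholder for this repository's id, and `clip.wav` is assumed to be a 16 kHz mono recording.

```python
import soundfile as sf
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "vsro200/whisper-small-vsro200"  # placeholder: replace with this repo's id

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

audio, sr = sf.read("clip.wav")  # expects 16 kHz mono; resample beforehand if needed
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(input_features=inputs.input_features,
                                   language="ro", task="transcribe")
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```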

## Results

Models are evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split under two noise types (Gaussian and babble) at varying signal-to-noise ratios (SNRs). All values are WER (%); lower is better.

### Gaussian noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 90.11 | 73.49 | 47.80 | 80.26 | **38.87** |
| 0 | 69.54 | 40.99 | 47.80 | 51.67 | **26.76** |
| 5 | 47.40 | 24.40 | 47.80 | 34.72 | **17.63** |
| 10 | 33.68 | 15.69 | 47.80 | 23.43 | **14.55** |
| 15 | 25.73 | 13.08 | 47.80 | 18.77 | **12.11** |

### Babble noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 93.91 | 137.05 | 47.80 | 83.48 | **38.59** |
| 0 | 73.09 | 50.23 | 47.80 | 56.36 | **34.90** |
| 5 | 45.86 | 19.92 | 47.80 | 30.85 | **18.85** |
| 10 | 28.63 | 14.19 | 47.80 | 23.04 | **13.36** |
| 15 | 22.36 | **12.11** | 47.80 | 17.31 | **12.11** |

At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses to 137.05% WER. Shallow fusion with the visual stream lets the decoder fall back on lip-reading cues, recovering performance to 38.59% WER and demonstrating the value of multimodal integration in adverse acoustic conditions.
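
WER can be computed with, for example, the `jiwer` package; the snippet below is a minimal sketch with illustrative strings, not the evaluation code used for the paper.

```python
import jiwer

# Reference transcripts and model hypotheses for the evaluated clips (illustrative examples).
references = ["un exemplu de transcriere de referinta"]
hypotheses = ["un exemplu de transcriere de referinte"]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {100 * wer:.2f}%")
```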

## Training hyperparameters

| Parameter | Value |
|:---|:---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |
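
As a convenience, these settings map onto `Seq2SeqTrainingArguments` roughly as in the sketch below; `output_dir` and any option not listed in the table (e.g. `predict_with_generate`) are assumptions rather than the exact training configuration.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-vsro200",  # assumed name, not the original output path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,   # 8 x 4 = effective batch size 32
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=100,
    optim="adamw_torch_fused",       # AdamW (torch fused), betas=(0.9, 0.999), eps=1e-08
    fp16=True,                       # native AMP mixed precision
    seed=42,
    predict_with_generate=True,      # assumed; not listed in the table above
)
```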

### Framework versions

Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{vsro200,
  title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year = {2026}
}
```

```bibtex
@article{diaconu2026ron3ws,
  title={RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
  author={Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
  journal={arXiv preprint arXiv:2603.02368},
  year={2026}
}
```