yuriyvnv posted an update 1 day ago
๐ŸŽ™๏ธParakeet-TDT Fine Tuning: 4 New ASR Models

Four fine-tuned versions of NVIDIA's Parakeet-TDT-0.6B-v3 for Dutch, Portuguese, Estonian, and Slovenian, among the first community fine-tunes of this architecture for these languages.

📊 Results on Common Voice 17 test sets:

🇸🇮 Slovenian: 50.49% → 11.56% WER (-77%)
🇵🇹 Portuguese: 15.86% → 10.71% WER (-32%)
🇪🇪 Estonian: 27.15% → 21.03% WER (-23%)
🇳🇱 Dutch: 5.99% → 5.33% WER (-11%)
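The relative reductions above follow directly from each baseline/fine-tuned WER pair; a quick sanity check with the table's values:

```python
# Relative WER reduction: (baseline - fine-tuned) / baseline
results = {
    "Slovenian": (50.49, 11.56),
    "Portuguese": (15.86, 10.71),
    "Estonian": (27.15, 21.03),
    "Dutch": (5.99, 5.33),
}
for lang, (base, tuned) in results.items():
    print(f"{lang}: -{(base - tuned) / base:.0%}")
# Slovenian: -77%, Portuguese: -32%, Estonian: -23%, Dutch: -11%
```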

All models output cased text with punctuation.

import nemo.collections.asr as nemo_asr

# Load the fine-tuned Dutch checkpoint from the Hugging Face Hub
model = nemo_asr.models.ASRModel.from_pretrained(
    "yuriyvnv/parakeet-tdt-0.6b-dutch"
)

# Transcribe a 16 kHz audio file and print the hypothesis text
output = model.transcribe(["audio.wav"])
print(output[0].text)

🔗 Models:
🇳🇱 yuriyvnv/parakeet-tdt-0.6b-dutch
🇵🇹 yuriyvnv/parakeet-tdt-0.6b-portuguese
🇪🇪 yuriyvnv/parakeet-tdt-0.6b-estonian
🇸🇮 yuriyvnv/parakeet-tdt-0.6b-slovenian

๐Ÿ—๏ธ Training: Common Voice 17 + synthetic speech (OpenAI TTS), filtered with WAVe (yuriyvnv/WAVe-1B-Multimodal-PT) for quality. AdamW + cosine annealing, bf16-mixed precision, early stopping on val WER. Timestamps and long-form audio supported.

@hf-audio @NVIDIADev

#asr #speech #parakeet #nvidia #nemo #multilingual #fine-tuning #commonvoice
yuriyvnv posted an update 2 months ago
🎯 WAVe-1B-Multimodal-NL: Word-Level Speech Quality Assessment for Dutch

Following the release of the Portuguese model, we're releasing the Dutch variant of WAVe: a 1B multimodal embedding model that assesses synthetic speech quality at the word level, improving the quality of synthetically augmented datasets for training ASR models.

Trained on CommonVoice 16.1 Dutch with 5 corruption strategies, this model catches mispronunciations, timing errors, and prosody issues in synthetic data that sentence-level embeddings miss entirely.

Resources

- Dutch model: yuriyvnv/WAVe-1B-Multimodal-NL
- Portuguese model: yuriyvnv/WAVe-1B-Multimodal-PT
- Code: https://github.com/yuriyvnv/WAVe

This model builds on CommonVoice Dutch data; thanks to @mozilla and the CommonVoice community for making multilingual speech data accessible.

Would be great to hear from the Dutch NLP community (@BramVanroy @GroNLP), especially if you're working on Dutch ASR or TTS pipelines where quality filtering could help. Also tagging @hf-audio as this sits at the intersection of speech processing and data curation.
yuriyvnv posted an update 3 months ago
🎯 WAVe: 1B Multimodal Embedding Model for Word-Level Speech Quality

Multimodal embeddings for speech + transcript that verify quality at the word level, not just sentence level. Catches mispronunciations, timing errors, and prosody issues that sentence-level filters miss.

📊 Impact on Portuguese ASR:
• 34% reduction in training steps
• 50% better cross-domain generalization
• 30% less synthetic data needed
• Word-aligned attention finds errors other methods miss

๐Ÿ—๏ธ Architecture:
โ€ข Text: XLM-RoBERTa (278M params)
โ€ข Audio: Wav2Vec2-BERT 2.0 (581M params)
โ€ข Word Alignment: Multi-head attention + GLU (14M params)
โ€ข Total: 1B parameters

from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True
)

# Assess speech-transcript alignment
# audio_array: 1-D waveform sampled at 16 kHz
inputs = processor(
    text="Olá, como está?",
    audio=audio_array,
    sampling_rate=16000,
    return_tensors="pt"
)
quality = model(**inputs).quality_score.item()

Perfect for filtering synthetic speech datasets before ASR training.
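As a hedged sketch of that filtering step: the sample layout, file names, and the 0.7 cutoff below are illustrative assumptions, not part of the released model's API.

```python
# Keep only synthetic clips whose quality score clears a threshold.
samples = [
    {"path": "clip_001.wav", "quality": 0.92},
    {"path": "clip_002.wav", "quality": 0.41},
    {"path": "clip_003.wav", "quality": 0.78},
]
THRESHOLD = 0.7  # illustrative cutoff; tune per dataset
kept = [s["path"] for s in samples if s["quality"] >= THRESHOLD]
print(kept)  # ['clip_001.wav', 'clip_003.wav']
```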

Model: yuriyvnv/WAVe-1B-Multimodal-PT
Code: https://github.com/yuriyvnv/WAVe
#multimodal #speech #embeddings #asr #syntheticdata #qualityassessment