# Parakeet-TDT-0.6b-v3-French-TV-Media
This model is a fine-tuned version of NVIDIA Parakeet-TDT-0.6b-v3 optimized for French spoken in media contexts (Radio, Television, Podcasts).
It leverages the Token-and-Duration Transducer (TDT) architecture, which provides an exceptional balance between the high accuracy of classic Transducers and the inference speed of non-autoregressive models, thanks to its efficient prediction of token durations.
## Usage with NVIDIA NeMo
To use this model, you must have the NVIDIA NeMo toolkit installed.
### Installation
```bash
pip install nemo_toolkit['all']==2.6.1
```
```python
import nemo.collections.asr as nemo_asr

# Load the model from Hugging Face
model = nemo_asr.models.ASRModel.from_pretrained("Archime/parakeet-tdt-0.6b-v3-fr-tv-media")

# Transcribe an audio file
transcription = model.transcribe(["path/to/your_audio.wav"])
print(transcription)
```
## Performance (WER/CER)
Note: Results depend on audio quality and the specific media domain.
The fine-tuning process specifically targeted the nuances of French media speech, including spontaneous dialogue, interviews, and broadcast-quality audio.
The following table compares the Base Model vs. this Fine-tuned version. All scores are in decimal format (e.g., 0.0445 = 4.45% WER).
| Dataset | Mode | Base WER | Fine-tuned WER | Improvement |
|---|---|---|---|---|
| News (Info) | Processed | 0.1042 | 0.0445 | -57% |
| Documentaries | Processed | 0.0863 | 0.0380 | -56% |
| Society (Talks) | Processed | 0.1224 | 0.0488 | -60% |
| Sports | Processed | 0.1187 | 0.0845 | -29% |
| Entertainment | Processed | 0.1174 | 0.0662 | -43% |
| Fleurs (FR) | Processed | 0.0871 | 0.0874 | Stable |
| Fleurs (EN) | Processed | 0.0753 | 0.1716 | Degradation |
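As a sanity check, the "Improvement" column can be recomputed from the Base/Fine-tuned WER pairs in the table above; a minimal sketch:

```python
# Sketch: deriving the relative improvement from WER pairs
# (values in decimal format, as in the table above).

def relative_improvement(base_wer: float, finetuned_wer: float) -> int:
    """Relative WER change in percent; negative means improvement."""
    return round((finetuned_wer - base_wer) / base_wer * 100)

print(relative_improvement(0.1042, 0.0445))  # News (Info) -> -57
print(relative_improvement(0.0863, 0.0380))  # Documentaries -> -56
```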
## ⚠️ Limitations & Bias
While this model excels in media contexts, users should be aware of the following limitations:
- Cross-lingual Performance: This model was optimized exclusively for French. A significant performance degradation has been observed in other languages compared to the original Parakeet-TDT-0.6b-v3. It is not recommended for multilingual or non-French transcription tasks.
- Acoustic Noise: Although robust to TV studio environments, extreme background noise (e.g., loud music festivals or heavy wind during outdoor reporting) may still affect accuracy.
- Domain Specificity: The model is tuned for "Media" French. Technical jargon from very specific fields (e.g., advanced medical or legal proceedings) not covered in the 2026 dataset might result in higher WER.
- Overlapping Speech: Like most ASR models, performance may decrease during heated debates where multiple speakers talk simultaneously.
## Fine-tuning Methodology: Selective Transfer Learning
To optimize this model for French media while preserving the robust acoustic grounding of the original model, a parameter freezing strategy was applied.
### Freezing Strategy
- Encoder (Frozen): The Conformer encoder (608M parameters) remained in `eval` mode. This preserves universal sound and phoneme understanding, preventing catastrophic forgetting.
- Decoder & Joint (Trainable): Only the linguistic prediction and joint modules were updated (~18M parameters). This is where the model learned the specific vocabulary and syntax patterns of 2026 French media.
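A minimal sketch of this freezing pattern in plain PyTorch. The toy module below is illustrative only, not the actual NeMo model class; the point is keeping the encoder in `eval` mode with gradients disabled while the decoder stays trainable:

```python
import torch.nn as nn

def freeze_encoder(model: nn.Module) -> None:
    """Freeze the encoder and keep it in eval mode so normalization
    and dropout layers are not updated during fine-tuning."""
    model.encoder.eval()
    for p in model.encoder.parameters():
        p.requires_grad = False

# Toy stand-in with the same encoder/decoder layout (hypothetical names):
class ToyASR(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)

model = ToyASR()
freeze_encoder(model)
print(all(not p.requires_grad for p in model.encoder.parameters()))  # True
print(all(p.requires_grad for p in model.decoder.parameters()))      # True
```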
### Training Architecture Summary
| Component | Type | Parameters | Mode |
|---|---|---|---|
| Preprocessor | AudioToMelSpectrogram | 0 | Train |
| Encoder | ConformerEncoder | 608 M | Eval (Frozen) |
| Decoder | RNNTDecoder | 11.8 M | Train |
| Joint | RNNTJoint | 6.3 M | Train |
| Loss | RNNTLoss | 0 | Train |
| Spec_augmentation | SpectrogramAugmentation | 0 | Train |
### Parameter Statistics
- Trainable Parameters: 18.1 M (2.9%)
- Non-trainable Parameters: 608 M (97.1%)
- Total Model Size: 627 M parameters (2.5 GB)
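These statistics can be recomputed for any PyTorch module; a small sketch on a toy model (not the actual checkpoint):

```python
import torch.nn as nn

def param_stats(model: nn.Module):
    """Count trainable vs. total parameters, as in the statistics above."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy two-layer model for illustration:
toy = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
for p in toy[0].parameters():
    p.requires_grad = False  # "freeze" the first layer

trainable, total = param_stats(toy)
print(trainable, total)  # 22 132
```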
## Training Data
This model was specifically fine-tuned on the Archime/french_tv_media_dataset_2026 dataset.
The dataset consists of 5 major thematic streams, which were carefully balanced during the fine-tuning process to ensure broad coverage of the media landscape:
- News (Info): Television news bulletins and hourly news flashes.
- Society: Debates, talk shows, and panel discussions.
- Entertainment: Game shows, variety shows, and live entertainment.
- Documentaries: Voice-over narrations and on-location field interviews.
- Sports: Play-by-play commentary and post-match interviews.
This diversity enables the model to be robust across various linguistic registers—ranging from the formal, scripted language of documentaries to the spontaneous (and often noisy) speech found in entertainment and live sports.
## Reproducing Results
To reproduce the metrics displayed in the performance table, ensure you have NVIDIA NeMo installed and use the official evaluation scripts.
### Evaluation Command (Example: Media Domains)
You can adjust the text processing flags to toggle between Raw and Normalized scores.
- Install dependencies:

  ```bash
  pip install -r script/requirements.txt
  ```

- Prepare the test datasets:

  ```bash
  python script/prepare_datasets_test_NeMo.py
  ```

- Run the evaluation:

  ```bash
  python script/speech_to_text_eval_manifests.py \
      +models="{parakeet-tdt-0.6b-v3:'nvidia/parakeet-tdt-0.6b-v3',parakeet-tdt-0.6b-v3-fr-tv-media:'Archime/parakeet-tdt-0.6b-v3-fr-tv-media'}" \
      +dataset_manifests.fleurs_fr_fr="pathto/nemo_datasets/fleurs/fleurs_test_manifest.json" \
      +dataset_manifests.info="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_info_manifest.json" \
      +dataset_manifests.societe="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_societe_manifest.json" \
      +dataset_manifests.divertissements="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_divertissements_manifest.json" \
      +dataset_manifests.documentaires="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_documentaires_manifest.json" \
      +dataset_manifests.sports="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_sports_manifest.json" \
      use_cer=True \
      batch_size=32
  ```
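The manifests passed above follow NeMo's JSON-lines convention: one object per utterance with `audio_filepath`, `duration` (in seconds), and the reference `text`. A minimal sketch with placeholder paths and transcripts:

```python
import json

# Each entry is one line of the manifest file; the path and text below
# are placeholders, not actual dataset content.
entries = [
    {"audio_filepath": "audio/news_0001.wav", "duration": 7.4,
     "text": "bonsoir et bienvenue dans ce journal"},
]

with open("archime_test_info_manifest.json", "w", encoding="utf-8") as f:
    for e in entries:
        f.write(json.dumps(e, ensure_ascii=False) + "\n")
```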
## Citation
If you use this model in your research or product, please cite the original Parakeet-TDT work and this fine-tuned version:
Base model:

```bibtex
@misc{parakeet-tdt-0.6b-v3,
  author    = {NVIDIA},
  title     = {Parakeet-TDT-0.6B},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```

Fine-tuned model:

```bibtex
@misc{parakeet_tdt_french_tv_2026,
  author       = {Archime},
  title        = {Parakeet-TDT-0.6b-French-TV-Media},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Archime/parakeet-tdt-0.6b-v3-fr-tv-media}}
}
```
## Contact
For questions or issues, please open an issue on the repository.
## Evaluation Results

Self-reported on the TV & Media French (Global Average) test set:

- Average Media WER (Processed): 5.64%
- WER (News): 4.45%
- WER (Documentaries): 3.80%
- CER (Documentaries): 1.48%
