# Parakeet-TDT-0.6b-v3-French-TV-Media
This model is a fine-tuned version of NVIDIA Parakeet-TDT-0.6b-v3 optimized for French spoken in media contexts (Radio, Television, Podcasts).
It leverages the Token-and-Duration Transducer (TDT) architecture, which provides an exceptional balance between the high accuracy of classic Transducers and the inference speed of non-autoregressive models, thanks to its efficient prediction of token durations.
## Usage with NVIDIA NeMo
To use this model, you must have the NVIDIA NeMo toolkit installed.
### Installation
```bash
pip install nemo_toolkit['all']==2.6.1
```
```python
import nemo.collections.asr as nemo_asr

# Load the model from Hugging Face
model = nemo_asr.models.ASRModel.from_pretrained("Archime/parakeet-tdt-0.6b-v3-fr-tv-media")

# Transcribe an audio file
transcription = model.transcribe(["path/to/your_audio.wav"])
print(transcription)
```
## Performance (WER/CER)
Note: Results depend on audio quality and the specific media domain.
The fine-tuning process specifically targeted the nuances of French media speech, including spontaneous dialogue, interviews, and broadcast-quality audio.
The following table compares the Base Model vs. this Fine-tuned version. All scores are in decimal format (e.g., 0.0445 = 4.45% WER).
| Dataset | Mode | Base WER | Fine-tuned WER | Improvement |
|---|---|---|---|---|
| News (Info) | Processed | 0.1042 | 0.0445 | -57% |
| Documentaries | Processed | 0.0863 | 0.0380 | -56% |
| Society (Talks) | Processed | 0.1224 | 0.0488 | -60% |
| Sports | Processed | 0.1187 | 0.0845 | -29% |
| Entertainment | Processed | 0.1174 | 0.0662 | -43% |
| Fleurs (FR) | Processed | 0.0871 | 0.0874 | Stable |
| Fleurs (EN) | Processed | 0.0753 | 0.1716 | Degradation |
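As a sanity check, the "Improvement" column can be recomputed from the Base/Fine-tuned WER pairs in the table above; a minimal sketch:

```python
# Sketch: deriving the relative improvement from WER pairs
# (values in decimal format, as in the table above).

def relative_improvement(base_wer: float, finetuned_wer: float) -> int:
    """Relative WER change in percent; negative means improvement."""
    return round((finetuned_wer - base_wer) / base_wer * 100)

print(relative_improvement(0.1042, 0.0445))  # News (Info) -> -57
print(relative_improvement(0.0863, 0.0380))  # Documentaries -> -56
```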
## ⚠️ Limitations & Bias
While this model excels in media contexts, users should be aware of the following limitations:
- Cross-lingual Performance: This model was optimized exclusively for French. A significant performance degradation has been observed in other languages compared to the original Parakeet-TDT-0.6b-v3. It is not recommended for multilingual or non-French transcription tasks.
- Acoustic Noise: Although robust to TV studio environments, extreme background noise (e.g., loud music festivals or heavy wind during outdoor reporting) may still affect accuracy.
- Domain Specificity: The model is tuned for "Media" French. Technical jargon from very specific fields (e.g., advanced medical or legal proceedings) not covered in the 2026 dataset might result in higher WER.
- Overlapping Speech: Like most ASR models, performance may decrease during heated debates where multiple speakers talk simultaneously.
## Fine-tuning Methodology: Selective Transfer Learning
To optimize this model for French media while preserving the robust acoustic grounding of the original model, a parameter freezing strategy was applied.
### Freezing Strategy
- Encoder (Frozen): The Conformer encoder (608M parameters) remained in `eval` mode. This preserves universal sound and phoneme understanding, preventing catastrophic forgetting.
- Decoder & Joint (Trainable): Only the linguistic prediction and joint modules were updated (~18M parameters). This is where the model learned the specific vocabulary and syntax patterns of 2026 French media.
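A minimal sketch of this freezing pattern in plain PyTorch. The toy module below is illustrative only, not the actual NeMo model class; the point is keeping the encoder in `eval` mode with gradients disabled while the decoder stays trainable:

```python
import torch.nn as nn

def freeze_encoder(model: nn.Module) -> None:
    """Freeze the encoder and keep it in eval mode so normalization
    and dropout layers are not updated during fine-tuning."""
    model.encoder.eval()
    for p in model.encoder.parameters():
        p.requires_grad = False

# Toy stand-in with the same encoder/decoder layout (hypothetical names):
class ToyASR(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)

model = ToyASR()
freeze_encoder(model)
print(all(not p.requires_grad for p in model.encoder.parameters()))  # True
print(all(p.requires_grad for p in model.decoder.parameters()))      # True
```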
### Training Architecture Summary
| Component | Type | Parameters | Mode |
|---|---|---|---|
| Preprocessor | AudioToMelSpectrogram | 0 | Train |
| Encoder | ConformerEncoder | 608 M | Eval (Frozen) |
| Decoder | RNNTDecoder | 11.8 M | Train |
| Joint | RNNTJoint | 6.3 M | Train |
| Loss | RNNTLoss | 0 | Train |
| Spec_augmentation | SpectrogramAugmentation | 0 | Train |
### Parameter Statistics
- Trainable Parameters: 18.1 M (2.9%)
- Non-trainable Parameters: 608 M (97.1%)
- Total Model Size: 627 M parameters (2.5 GB)
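These statistics can be recomputed for any PyTorch module; a small sketch on a toy model (not the actual checkpoint):

```python
import torch.nn as nn

def param_stats(model: nn.Module):
    """Count trainable vs. total parameters, as in the statistics above."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy two-layer model for illustration:
toy = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
for p in toy[0].parameters():
    p.requires_grad = False  # "freeze" the first layer

trainable, total = param_stats(toy)
print(trainable, total)  # 22 132
```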
## Training Data
This model was specifically fine-tuned on the Archime/french_tv_media_dataset_2026 dataset.
The dataset consists of 5 major thematic streams, which were carefully balanced during the fine-tuning process to ensure broad coverage of the media landscape:
- News (Info): Television news bulletins and hourly news flashes.
- Society: Debates, talk shows, and panel discussions.
- Entertainment: Game shows, variety shows, and live entertainment.
- Documentaries: Voice-over narrations and on-location field interviews.
- Sports: Play-by-play commentary and post-match interviews.
This diversity enables the model to be robust across various linguistic registers—ranging from the formal, scripted language of documentaries to the spontaneous (and often noisy) speech found in entertainment and live sports.
## Reproducing Results
To reproduce the metrics displayed in the performance table, ensure you have NVIDIA NeMo installed and use the official evaluation scripts.
### Evaluation Command (Example: Media Domains)
You can adjust the text processing flags to toggle between Raw and Normalized scores.
- Install dependencies:

  ```bash
  pip install -r script/requirements.txt
  ```

- Prepare the test datasets:

  ```bash
  python script/prepare_datasets_test_NeMo.py
  ```

- Run the evaluation:

  ```bash
  python script/speech_to_text_eval_manifests.py \
      +models="{parakeet-tdt-0.6b-v3:'nvidia/parakeet-tdt-0.6b-v3',parakeet-tdt-0.6b-v3-fr-tv-media:'Archime/parakeet-tdt-0.6b-v3-fr-tv-media'}" \
      +dataset_manifests.fleurs_fr_fr="pathto/nemo_datasets/fleurs/fleurs_test_manifest.json" \
      +dataset_manifests.info="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_info_manifest.json" \
      +dataset_manifests.societe="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_societe_manifest.json" \
      +dataset_manifests.divertissements="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_divertissements_manifest.json" \
      +dataset_manifests.documentaires="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_documentaires_manifest.json" \
      +dataset_manifests.sports="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_sports_manifest.json" \
      use_cer=True \
      batch_size=32
  ```
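The manifests passed above follow NeMo's JSON-lines convention: one object per utterance with `audio_filepath`, `duration` (in seconds), and the reference `text`. A minimal sketch with placeholder paths and transcripts:

```python
import json

# Each entry is one line of the manifest file; the path and text below
# are placeholders, not actual dataset content.
entries = [
    {"audio_filepath": "audio/news_0001.wav", "duration": 7.4,
     "text": "bonsoir et bienvenue dans ce journal"},
]

with open("archime_test_info_manifest.json", "w", encoding="utf-8") as f:
    for e in entries:
        f.write(json.dumps(e, ensure_ascii=False) + "\n")
```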
## Citation
If you use this model in your research or product, please cite the original Parakeet-TDT work and this fine-tuned version:
Base model:

```bibtex
@misc{parakeet-tdt-0.6b-v3,
  author    = {NVIDIA},
  title     = {Parakeet-TDT-0.6B},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```

Fine-tuned model:

```bibtex
@misc{parakeet_tdt_french_tv_2026,
  author       = {Archime},
  title        = {Parakeet-TDT-0.6b-French-TV-Media},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Archime/parakeet-tdt-0.6b-v3-fr-tv-media}}
}
```
## Contact
For questions or issues, please open an issue on the repository.
## Evaluation Results

Self-reported on the TV & Media French (Global Average) test set:

- Average Media WER (Processed): 5.64%
- WER (News): 4.45%
- WER (Documentaries): 3.80%
- CER (Documentaries): 1.48%
