Parakeet-TDT-0.6b-v3-French-TV-Media

This model is a fine-tuned version of NVIDIA Parakeet-TDT-0.6b-v3 optimized for French spoken in media contexts (Radio, Television, Podcasts).

It leverages the Token-and-Duration Transducer (TDT) architecture, which provides an exceptional balance between the high accuracy of classic Transducers and the inference speed of non-autoregressive models, thanks to its efficient prediction of token durations.
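Conceptually, at each step a TDT model predicts both a token and how many encoder frames that token spans, so the decoder can jump ahead instead of advancing frame by frame. A toy sketch of this decoding loop (purely illustrative, not the NeMo implementation; `predict` here is a stand-in for the joint network):

```python
def tdt_greedy_decode(frames, predict):
    """Toy TDT-style greedy loop: `predict(frame)` returns (token, duration).

    A classic transducer advances one frame at a time; TDT jumps ahead by
    the predicted duration, which is where the inference speed-up comes from.
    """
    t, tokens = 0, []
    while t < len(frames):
        token, duration = predict(frames[t])
        if token is not None:          # None plays the role of the blank token
            tokens.append(token)
        t += max(1, duration)          # skip `duration` frames in one step
    return tokens

# Illustrative predictor: emits the frame label with a fixed duration of 2.
frames = ["a", "a", "b", "b", "c", "c"]
out = tdt_greedy_decode(frames, lambda f: (f, 2))
print(out)  # ['a', 'b', 'c']
```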


Usage with NVIDIA NeMo

To use this model, you must have the NVIDIA NeMo toolkit installed.

Installation

pip install "nemo_toolkit[all]==2.6.1"

Inference

import nemo.collections.asr as nemo_asr

# Load the model from Hugging Face
model = nemo_asr.models.ASRModel.from_pretrained("Archime/parakeet-tdt-0.6b-v3-fr-tv-media")

# Transcribe an audio file
transcription = model.transcribe(["path/to/your_audio.wav"])
print(transcription)

Performance (WER/CER)

Note: Results depend on audio quality and the specific media domain.

The fine-tuning process specifically targeted the nuances of French media speech, including spontaneous dialogue, interviews, and broadcast-quality audio.


The following table compares the base model with this fine-tuned version. All scores are in decimal format (e.g., 0.0445 = 4.45% WER).

| Dataset | Mode | Base WER | Fine-tuned WER | Improvement |
|---|---|---|---|---|
| News (Info) | Processed | 0.1042 | 0.0445 | -57% |
| Documentaries | Processed | 0.0863 | 0.0380 | -56% |
| Society (Talks) | Processed | 0.1224 | 0.0488 | -60% |
| Sports | Processed | 0.1187 | 0.0845 | -29% |
| Entertainment | Processed | 0.1174 | 0.0662 | -43% |
| Fleurs (FR) | Processed | 0.0871 | 0.0874 | Stable |
| Fleurs (EN) | Processed | 0.0753 | 0.1716 | Degradation |
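For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The evaluation scripts use NeMo's own metric implementation; the standalone function below is only an illustrative sketch of the computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            substitution = prev + (ref[i - 1] != hyp[j - 1])
            d[j] = min(d[j] + 1,       # deletion
                       d[j - 1] + 1,   # insertion
                       substitution)   # substitution (or match)
            prev = cur
    return d[-1] / len(ref)

print(wer("le chat dort", "le chien dort"))  # 1 substitution / 3 words ≈ 0.333
```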

⚠️ Limitations & Bias

While this model excels in media contexts, users should be aware of the following limitations:

  • Cross-lingual Performance: This model was optimized exclusively for French. A significant performance degradation has been observed in other languages compared to the original Parakeet-TDT-0.6b-v3. It is not recommended for multilingual or non-French transcription tasks.
  • Acoustic Noise: Although robust to TV studio environments, extreme background noise (e.g., loud music festivals or heavy wind during outdoor reporting) may still affect accuracy.
  • Domain Specificity: The model is tuned for "Media" French. Highly specialized jargon (e.g., advanced medical or legal terminology) not covered in the 2026 dataset may result in a higher WER.
  • Overlapping Speech: Like most ASR models, performance may decrease during heated debates where multiple speakers talk simultaneously.

Fine-tuning Methodology: Selective Transfer Learning

To optimize this model for French media while preserving the robust acoustic grounding of the original model, a parameter freezing strategy was applied.

Freezing Strategy

  • Encoder (Frozen): The Conformer block (608M parameters) remained in eval mode. This preserves universal sound and phoneme understanding, preventing catastrophic forgetting.
  • Decoder & Joint (Trainable): Only the linguistic prediction and joining modules were updated (~18M parameters). This is where the model learned specific vocabulary and syntax patterns for 2026 French media.
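This freezing pattern can be sketched in plain PyTorch. The toy modules below are illustrative stand-ins; on the actual NeMo `ASRModel`, the corresponding attributes are `model.encoder`, `model.decoder`, and `model.joint`:

```python
import torch.nn as nn

class ToyTransducer(nn.Module):
    """Toy stand-in for the transducer: in practice these submodules are the
    ConformerEncoder, RNNTDecoder, and RNNTJoint of the NeMo model."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(80, 512)   # stands in for the Conformer encoder
        self.decoder = nn.Linear(512, 512)  # stands in for the RNNT decoder
        self.joint = nn.Linear(512, 1024)   # stands in for the RNNT joint

model = ToyTransducer()

# Freeze the encoder: no gradients, and eval mode so dropout and
# normalization statistics are not updated during fine-tuning.
model.encoder.requires_grad_(False)
model.encoder.eval()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable}/{total} ({100 * trainable / total:.1f}%)")
```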

Training Architecture Summary

| Component | Type | Parameters | Mode |
|---|---|---|---|
| Preprocessor | AudioToMelSpectrogram | 0 | Train |
| Encoder | ConformerEncoder | 608 M | Eval (frozen) |
| Decoder | RNNTDecoder | 11.8 M | Train |
| Joint | RNNTJoint | 6.3 M | Train |
| Loss | RNNTLoss | 0 | Train |
| Spec_augmentation | SpectrogramAugmentation | 0 | Train |

Parameter Statistics

  • Trainable Parameters: 18.1 M (2.9%)
  • Non-trainable Parameters: 608 M (97.1%)
  • Total Model Size: 627 M parameters (2.5 GB)
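The percentages above follow directly from the per-module counts in the architecture table (figures in millions; the small gap to the stated 627 M total is rounding):

```python
frozen = 608.0          # ConformerEncoder
trainable = 11.8 + 6.3  # RNNTDecoder + RNNTJoint

total = frozen + trainable
print(f"trainable: {100 * trainable / total:.1f}%")  # ~2.9%
print(f"frozen:    {100 * frozen / total:.1f}%")     # ~97.1%
```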

Training Data

This model was specifically fine-tuned on the Archime/french_tv_media_dataset_2026 dataset.

The dataset consists of 5 major thematic streams, which were carefully balanced during the fine-tuning process to ensure broad coverage of the media landscape:

  • News (Info): Television news bulletins and hourly news flashes.
  • Society: Debates, talk shows, and panel discussions.
  • Entertainment: Game shows, variety shows, and live entertainment.
  • Documentaries: Voice-over narrations and on-location field interviews.
  • Sports: Play-by-play commentary and post-match interviews.

This diversity enables the model to be robust across various linguistic registers—ranging from the formal, scripted language of documentaries to the spontaneous (and often noisy) speech found in entertainment and live sports.

Reproducing Results

To reproduce the metrics displayed in the performance table, ensure you have NVIDIA NeMo installed and use the official evaluation scripts.

Evaluation Command (Example: Media Domains)

You can adjust the text processing flags to toggle between Raw and Normalized scores.

  1. Install dependencies:

pip install -r script/requirements.txt

  2. Prepare the test datasets:

python script/prepare_datasets_test_NeMo.py

  3. Run the evaluation:

python script/speech_to_text_eval_manifests.py \
+models="{parakeet-tdt-0.6b-v3:'nvidia/parakeet-tdt-0.6b-v3',parakeet-tdt-0.6b-v3-fr-tv-media:'Archime/parakeet-tdt-0.6b-v3-fr-tv-media'}" \
+dataset_manifests.fleurs_fr_fr="pathto/nemo_datasets/fleurs/fleurs_test_manifest.json" \
+dataset_manifests.info="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_info_manifest.json" \
+dataset_manifests.societe="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_societe_manifest.json" \
+dataset_manifests.divertissements="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_divertissements_manifest.json" \
+dataset_manifests.documentaires="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_documentaires_manifest.json" \
+dataset_manifests.sports="pathto/nemo_datasets/french_tv_media_dataset_2026/archime_test_sports_manifest.json" \
use_cer=True \
batch_size=32
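The manifest files referenced above use NeMo's standard JSON-lines format: one utterance per line with `audio_filepath`, `duration` (seconds), and `text` fields. A minimal sketch of writing and reading one entry (the path and transcript below are illustrative):

```python
import json

# Illustrative manifest entry; real entries point at your own audio files.
entry = {
    "audio_filepath": "pathto/audio/segment_001.wav",
    "duration": 4.2,
    "text": "bonsoir et bienvenue dans cette édition du journal",
}

with open("example_manifest.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Read it back, one JSON object per line.
with open("example_manifest.json", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["text"])
```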

Citation

If you use this model in your research or product, please cite the original Parakeet-TDT work and this fine-tuned version:

Base model:

@misc{parakeet-tdt-0.6b-v3,
  author = {NVIDIA},
  title = {Parakeet-TDT-0.6B},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}

Fine-tuned model:
@misc{parakeet_tdt_french_tv_2026,
  author = {Archime},
  title = {Parakeet-TDT-0.6b-French-TV-Media},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Archime/parakeet-tdt-0.6b-v3-fr-tv-media}}
}

Contact

For questions or issues, please open an issue on the repository.
