Training information/script

#1
by mitkaj - opened

Hi,
I was wondering if you could share how exactly you trained the model. I am trying to replicate your results with a fine-tuning script I created, but I am only able to reach around 20% WER on the same training and eval data (using train+validation for training and test for evaluation from the Czech CommonVoice). I think I have exactly the same model setup as you:
{
  "activation_dropout": 0.0,
  "adapter_act": "relu",
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": true,
  "apply_spec_augment": false,
  "architectures": [
    "Wav2Vec2BertForCTC"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "classifier_proj_size": 768,
  "codevector_dim": 768,
  "conformer_conv_dropout": 0.1,
  "contrastive_logits_temperature": 0.1,
  "conv_depthwise_kernel_size": 31,
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "eos_token_id": 2,
  "feat_proj_dropout": 0.0,
  "feat_quantizer_dropout": 0.0,
  "feature_projection_input_dim": 160,
  "final_dropout": 0.1,
  "hidden_act": "swish",
  "hidden_dropout": 0.0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.0,
  "left_max_position_embeddings": 64,
  "mask_feature_length": 10,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.0,
  "max_source_positions": 5000,
  "model_type": "wav2vec2-bert",
  "num_adapter_layers": 1,
  "num_attention_heads": 16,
  "num_codevector_groups": 2,
  "num_codevectors_per_group": 320,
  "num_hidden_layers": 24,
  "num_negatives": 100,
  "output_hidden_size": 1024,
  "pad_token_id": 2,
  "position_embeddings_type": "relative_key",
  "proj_codevector_dim": 768,
  "right_max_position_embeddings": 8,
  "rotary_embedding_base": 10000,
  "tdnn_dilation": [
    1,
    2,
    3,
    1,
    1
  ],
  "tdnn_dim": [
    512,
    512,
    512,
    512,
    1500
  ],
  "tdnn_kernel": [
    5,
    3,
    3,
    1,
    1
  ],
  "torch_dtype": "float32",
  "transformers_version": "4.53.3",
  "use_intermediate_ffn_before_adapter": false,
  "use_weighted_layer_sum": false,
  "vocab_size": 51,
  "xvector_output_dim": 512
}
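(As an aside on the 20% figure: WER is the word-level Levenshtein distance between hypothesis and reference, normalized by the number of reference words. A minimal pure-Python sketch of the metric, independent of any evaluation library:)

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the first i-1 ref words and first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution (0 if match)
        prev = cur
    return prev[-1] / max(len(ref), 1)

# e.g. one substitution + one deletion over four reference words -> 0.5
print(wer("a b c d", "a x c"))  # -> 0.5
```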

I also set up the Trainer with the same settings as yours, and configured the Tokenizer in the same fashion.

Have you done any specific preprocessing (apart from resampling and feature extraction) on the training data? Or did you freeze some parts of the pretrained model?
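(For context on the freezing question: in PyTorch, freezing a pretrained part usually just means switching off gradients on that submodule. A generic sketch with a toy two-layer stand-in model; the real model and its attribute names would of course differ:)

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Exclude a submodule's weights from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False

# Toy stand-in for a CTC model: a "pretrained encoder" plus an output head.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))
freeze(model[0])  # freeze only the encoder; the head stays trainable

trainable = [p.requires_grad for p in model.parameters()]
print(trainable)  # encoder weight/bias frozen, head weight/bias trainable
```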

Thank you for your time :)

Hi, sorry for the late response.

You can find the code here:
https://github.com/sabrooh/nlp-hf/blob/main/nlp_wav2vec_trainval_sep.ipynb
This version is adapted for Slovak (and partially Czech), based on an existing implementation. Most of the core code isn’t originally mine, and I only made minor changes to support Slovak and Czech datasets. It's not fully optimized, as the goal was mainly to run some preliminary tests.
I also have code available for processing the Common Voice 22.0 dataset, if that would be helpful.

If you have any questions, ideas, or would be interested in collaborating on research, feel free to reach out.

Thank you for sharing your training code!
It turned out that some of the downloaded dependencies were interfering with training. After reinstalling Transformers at the exact version you specified, everything ran as expected. I couldn’t reproduce the original error, so it was most likely caused by one of those extra dependencies.

One idea to potentially boost performance is to use a beam-search decoder during inference; it often yields better sequence quality than greedy decoding.
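(For reference, a minimal prefix beam search over per-frame CTC probabilities, pure Python and without a language model; `blank=0` and the toy probabilities below are assumptions for illustration only:)

```python
from collections import defaultdict

def ctc_beam_search(probs, beam_width=8, blank=0):
    """Prefix beam search for CTC.

    probs: probs[t][c] = P(class c at frame t), including the blank class.
    Returns the most probable collapsed label sequence as a list of ints.
    """
    # Each beam entry maps a prefix -> (P(prefix ending in blank),
    #                                   P(prefix ending in a label))
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(frame):
                if c == blank:
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                elif prefix and prefix[-1] == c:
                    # Repeated label: extends the prefix only via a blank;
                    # otherwise it merges into the same prefix.
                    new = prefix + (c,)
                    nb_b, nb_nb = next_beams[new]
                    next_beams[new] = (nb_b, nb_nb + p_b * p)
                    sb_b, sb_nb = next_beams[prefix]
                    next_beams[prefix] = (sb_b, sb_nb + p_nb * p)
                else:
                    new = prefix + (c,)
                    nb_b, nb_nb = next_beams[new]
                    next_beams[new] = (nb_b, nb_nb + (p_b + p_nb) * p)
        # Prune to the top-scoring prefixes.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: kv[1][0] + kv[1][1],
                            reverse=True)[:beam_width])
    best = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return list(best[0])

# Toy example: label 1 dominates two frames, then blank -> collapses to [1].
probs = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]
print(ctc_beam_search(probs))  # -> [1]
```

In practice a library such as pyctcdecode, which also supports KenLM language-model fusion, is the more usual route than hand-rolling this.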

I appreciate your offer to collaborate. I’m currently focused on wrapping up my experiments, but I’ll reach out if I decide to extend this work further.
