Questions about training techniques

#36

by funnyice - opened 23 days ago

I have implemented training code for Voxtral Realtime. Training on short audio gives reasonable results. However, when I train on long audio with many silent segments, the model tends to collapse and output only pad tokens, even with z-loss added. Reducing the loss weight of the pad token helps somewhat, but the model then tends to output tokens from the preceding text at positions where a pad token is expected. This balance is very hard to tune, and using focal loss to adjust it dynamically does not work either. Are there any special techniques for handling this situation during training?
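For reference, here is a minimal sketch of the kind of loss I am describing, written in plain Python so that the arithmetic is easy to check. It is not the actual Voxtral training recipe: the function name `weighted_token_loss` and the parameters `pad_weight`, `z_weight`, and `focal_gamma` are hypothetical names chosen for illustration, and the normalization by the sum of weights is an assumption.

```python
import math


def logsumexp(row):
    """Numerically stable log(sum(exp(x))) over one row of logits."""
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))


def weighted_token_loss(logits, targets, pad_id,
                        pad_weight=1.0, z_weight=0.0, focal_gamma=0.0):
    """Cross-entropy with a down-weighted pad class, optional focal
    modulation, and an optional z-loss term (all names hypothetical).

    logits:  list of per-position lists of floats (one score per vocab entry)
    targets: list of target token ids, one per position
    """
    ce_sum = 0.0
    weight_sum = 0.0
    z_sum = 0.0
    for row, target in zip(logits, targets):
        log_z = logsumexp(row)            # log partition function
        log_p = row[target] - log_z       # log-probability of the target
        p = math.exp(log_p)
        # Pad positions get pad_weight, all other positions get 1.0.
        w = pad_weight if target == pad_id else 1.0
        # Focal factor (1 - p)^gamma shrinks the loss of easy positions.
        focal = (1.0 - p) ** focal_gamma
        ce_sum += w * focal * (-log_p)
        weight_sum += w
        # z-loss penalizes log_z drifting away from zero.
        z_sum += log_z ** 2
    # Normalize by total weight (assumption); guard against all-zero weights.
    ce = ce_sum / max(weight_sum, 1e-8)
    return ce + z_weight * z_sum / len(targets)


if __name__ == "__main__":
    # Two positions, vocab of size 2, token 0 plays the role of pad.
    logits = [[0.0, 0.0], [0.0, 10.0]]
    targets = [0, 1]
    print(weighted_token_loss(logits, targets, pad_id=0, pad_weight=1.0))
    print(weighted_token_loss(logits, targets, pad_id=0, pad_weight=0.1))
```

In this formulation `pad_weight` is the ratio I find hard to tune: with a value close to 1 the pad positions from silent segments dominate the sum, and with a small value there is little penalty for emitting a text token where a pad token is expected.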

teeofftechnologies

20 days ago

Can you share the fine-tuning code?
