Questions about training techniques

#36
by funnyice - opened

I successfully implemented training code for Voxtral Realtime. Training on short audio gives reasonably normal results. But when I train on long audio with many silent segments, the model tends to collapse into outputting only pad tokens, even with z-loss added. Reducing the loss weight on the pad token helps somewhat, but the model then tends to emit tokens from the preceding text at positions where a pad token should be output. This ratio is very hard to tune; even using focal loss to adjust it dynamically doesn't work. Are there any special training techniques for this situation?
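For reference, here is a minimal sketch of the kind of loss described above: per-token cross-entropy with a reduced weight on pad targets, an optional focal factor, and a z-loss term. It is not the actual training code. The function name, `pad_weight`, `gamma`, `z_coeff`, and their default values are all placeholders chosen for illustration, and it is written in plain Python so the arithmetic is explicit; in a real training loop the same computation would be done with tensor ops.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(x))) over one step's logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def pad_weighted_loss(logits_per_step, targets, pad_id,
                      pad_weight=0.1, gamma=0.0, z_coeff=1e-4):
    """Weighted cross-entropy plus z-loss (illustrative sketch only).

    logits_per_step: list of per-step logit lists (one list per output position)
    targets:         list of target token ids, same length
    pad_id:          id of the pad token (hypothetical; depends on the tokenizer)
    pad_weight:      class weight applied to pad targets (assumed value)
    gamma:           focal exponent; 0.0 disables the focal factor
    z_coeff:         coefficient of the z-loss term (assumed value)
    """
    weighted_ce = 0.0
    weight_sum = 0.0
    z_sum = 0.0
    for logits, target in zip(logits_per_step, targets):
        log_z = log_sum_exp(logits)
        ce = log_z - logits[target]            # -log p(target)
        p_target = math.exp(-ce)
        focal = (1.0 - p_target) ** gamma      # down-weights easy tokens
        w = pad_weight if target == pad_id else 1.0
        weighted_ce += w * focal * ce
        weight_sum += w
        z_sum += log_z ** 2                    # z-loss penalizes large log-partition
    # Normalize by the sum of class weights so the scale does not depend
    # on how many pad positions the sequence contains.
    ce_term = weighted_ce / weight_sum if weight_sum > 0 else 0.0
    z_term = z_coeff * z_sum / max(len(targets), 1)
    return ce_term + z_term
```

Normalizing by the sum of class weights rather than by the number of positions is one design choice among several; with it, a sequence that is mostly silence contributes on the same scale as one that is mostly text, which may make the pad weight somewhat easier to tune.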

Can you share the fine-tuning code?
