Tibetan DataSet

#1
by trabten - opened

Hello Tenzin,

I am working on a Tibetan translator app for my Buddhist association in Brazil. It is for internal, non-commercial use. Here is a link to my portfolio: https://github.com/TashiRabten/BUDA_APPs_Port

I tried using a bi-encoder, but Tibetan is such a low-resource language that it was not making any meaningful generalization: my dataset, which I loosely had Claude generate from my database, used English context and English glosses for the Tibetan terms. I moved to a cross-encoder and am now generating the training data with ChatGPT. However, the quality is not Buddhist-specific, and the process is expensive and slow. The app uses the model simply to rank terms: the database draws on data freely available on the internet that can be stored for non-commercial purposes. I gathered about 500K Buddhist-specific terms that way, and the model just has to pick the best translations from it given the Tibetan sentence. A clean dataset with proper context would go a long way toward generating the glosses.
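To make the ranking setup concrete, here is a minimal sketch of scoring every candidate gloss jointly with the Tibetan sentence and keeping the top hit. The scorer below is a self-contained stand-in (simple character overlap), purely illustrative; in the real app it would be a trained cross-encoder (e.g. sentence-transformers' `CrossEncoder`) emitting a relevance logit per (sentence, candidate) pair.

```python
def score_pair(sentence: str, candidate: str) -> float:
    # Illustrative stand-in for a cross-encoder forward pass:
    # a real model would jointly encode (sentence, candidate)
    # and return a learned relevance score.
    overlap = set(sentence) & set(candidate)
    return len(overlap) / (len(set(candidate)) or 1)

def rank_candidates(sentence: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Score each candidate gloss against the sentence, best first."""
    scored = [(c, score_pair(sentence, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The key difference from the bi-encoder setup is that the sentence and candidate are scored together in one pass, so the model can condition the gloss choice on the full sentence context instead of comparing independent embeddings.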

So far the best model I have found is mBERT, but I am only getting a Precision@1 of 28% with my current dataset. Can you share your MITRA database for non-commercial use? I will credit monlam.ai, just as I credit the Namsel_OCR team, to comply with the non-commercial license agreement. I will also try your modernBERT to see if it outperforms mBERT.
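For reference, Precision@1 here just means the fraction of test sentences for which the model's top-ranked candidate is an acceptable gloss. A minimal way to compute it (the example queries and glosses are made up):

```python
def precision_at_1(ranked_predictions: list[list[str]],
                   gold: list[set[str]]) -> float:
    """Fraction of queries whose top-ranked candidate is a correct gloss.

    ranked_predictions: per query, candidates ordered best-first.
    gold: per query, the set of acceptable glosses.
    """
    hits = sum(1 for preds, answers in zip(ranked_predictions, gold)
               if preds and preds[0] in answers)
    return hits / len(gold)

# Two queries, top prediction correct for the first only -> P@1 = 0.5
p_at_1 = precision_at_1([["mind", "heart"], ["wisdom"]],
                        [{"mind"}, {"insight"}])
```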

Hello Tenzin,

Since your modernBERT only had 4,000 extra tokens, I suppose it only covers the Tibetan alphabet... I trained it in a couple of formats and it did not generalize at all. So I trained your NLLB gold model as a cross-encoder instead, and it is functional. I still have the dataset constraint, but NLLB seems to be the best path forward; even mBERT is scoring lower.
