Evaluation reproduction

#1
by NohTow - opened
LightOn AI org

Hey,

PyLate is getting merged into MTEB soon, which will make it possible to evaluate PyLate models directly with MTEB.
In the meantime, people might want to reproduce the evaluation results, so, as for GTE-ModernColBERT, I am sharing a boilerplate to reproduce the results reported in the model card.
The boilerplate can be found here.

Please note that the reported results use a query length of 256, except for the Pony split, where we used a query length of 32 because larger query lengths yield bad results (I am not sure why, this split is a bit odd).
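To make the per-split setting explicit, here is a minimal sketch of the evaluation configuration described above; `query_length_for` is a hypothetical helper, not part of the shared boilerplate.

```python
# Hypothetical helper reflecting the reported evaluation setup:
# query length 32 for the Pony split, 256 everywhere else.
def query_length_for(split: str) -> int:
    return 32 if split == "Pony" else 256

print(query_length_for("Biology"), query_length_for("Pony"))  # → 256 32
```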

Hey

Look, I tried to train Reason-ModernColBERT using the following script:

https://gist.github.com/NohTow/d563244596548bf387f19fcd790664d3

It does not produce the same result as the published model; for example, on the biology split it gives NDCG@10: 0.28. I tried 10 times with different settings and got the same result.

Am I missing something?

LightOn AI org

Hey,

What is your training setup?
The training was run on nodes of 8 GPUs, which can lead to different training if you are using a different number of GPUs, because in ST the batch size defines the per-device batch size (unless you use split_batch=True).
I would scale batch_size up or down to match the effective 8 * 258 used for training.
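The scaling rule above can be sketched as follows; this assumes the trainer treats the batch size as per-device (as described above), and `per_device_batch_size` is an illustrative helper, not an ST API.

```python
# Hedged sketch: pick a per-device batch size so the effective (global) batch
# matches the 8 GPUs x 258 per device used for the published training run.
TARGET_GLOBAL_BATCH = 8 * 258  # published setup: 8 GPUs, 258 per device

def per_device_batch_size(num_gpus: int) -> int:
    # Integer division; for exact matching, num_gpus should divide the target.
    return TARGET_GLOBAL_BATCH // num_gpus

print(per_device_batch_size(2))  # → 1032
```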

Other than that, I'm not sure what could be different; it's really the exact script I used. What are you using for evaluation?

Hey,

I figured out what the problem is: we should use the same versions of Transformers, Sentence Transformers, and Torch.

"sentence_transformers": "4.0.2",
"transformers": "4.48.2",
"pytorch": "2.5.1+cu124"
LightOn AI org

Mh, those should not affect the results.
Do you have any idea what's different due to those versions?

I am still investigating why, but I changed sentence_transformers from v5.4 to v4.0.2, also changed torch and transformers accordingly, and enabled flash attention.

With the same training script and evaluation, the two trained models give different results. The NDCG drops too much with the newer version of sentence_transformers.

LightOn AI org

> and enabled flash attention.

IMHO this is the biggest lever. ModernBERT behaves a bit differently with FA vs SDPA, but most importantly, ColBERT models trained with and without FA are different (a non-FA model can behave like an FA model if you use do_query_expansion=False).
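For context, here is a conceptual sketch (not PyLate's actual implementation) of what query expansion does in ColBERT-style models: each query is padded with [MASK] tokens up to a fixed query length, so disabling it changes which tokens get scored.

```python
# Conceptual sketch of ColBERT-style query expansion: pad the tokenized query
# with [MASK] tokens up to a fixed query length. With do_query_expansion=False,
# no such padding tokens would be appended.
def expand_query(tokens: list, query_length: int = 32, mask_token: str = "[MASK]") -> list:
    padding = max(0, query_length - len(tokens))
    return tokens + [mask_token] * padding

q = expand_query(["what", "is", "colbert", "?"], query_length=8)
print(q)  # → ['what', 'is', 'colbert', '?', '[MASK]', '[MASK]', '[MASK]', '[MASK]']
```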
