Recommended logit normalization

by wilwork - opened about 8 hours ago

Hi @tomaarsen! First of all, thank you for the new rerankers; they're performing incredibly well in our benchmarks.
I am using cross-encoder/ettin-reranker-400m-v1 within a search pipeline using sentence-transformers. I noticed in your post that the training involved MSE loss on raw teacher logits with a range of approximately [−12, 22].

The teacher model (mxbai-rerank-large-v2) uses a specific sigmoid_normalize with an estimated_max of 12.0.

Is there a recommended estimated_max or temperature constant you suggest for mapping these to a 0-1 range while preserving the ranking distribution resolution?
Should we be using a centering shift (e.g., x−11.0) before applying the sigmoid to account for the training distribution?
Currently, I'm testing:
score = sigmoid(raw_logit - 11.0)
I’d love to know if you have a more "official" recommendation for rescaling. Thanks again for the great work!

tomaarsen

Sentence Transformers - Cross-Encoders org about 5 hours ago

edited about 5 hours ago

Hello!

I'm glad to hear that it's working well for you!
Personally, I only care about the ranking and not so much the scores, so I don't tend to normalize anymore. However, if you do want your scores between 0 and 1, a Sigmoid does preserve the ranking, but you'll get a lot of scores near 0.0002 and 0.9998. Inspired by https://huggingface.co/zeroentropy/zerank-2-reranker, I think the cleanest solution is likely to divide the raw scores by e.g. 5 and then apply a Sigmoid. This way your scores won't be as bunched up near 0 or 1, as the Sigmoid inputs are now likely in [-4, 4]. Above 4 or below -4 the Sigmoid saturates, so you lose a lot of resolution between scores.

In short:

from sentence_transformers import CrossEncoder

# Download from the 🤗 Hub
model = CrossEncoder(
    "cross-encoder/ettin-reranker-400m-v1",
    model_kwargs={"dtype": "bfloat16", "attn_implementation": "flash_attention_2"},  # Optional: pip install kernels
)

# Get scores for pairs of inputs
query = "Which planet is known as the Red Planet?"
passages = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
    "Pluto is not considered a planet anymore, but it is still an interesting celestial body.",
    "Artificial intelligence is transforming many industries.",
    "Information retrieval is my passion",
]
scores = model.predict([(query, passage) for passage in passages], convert_to_tensor=True)
print(scores)
# tensor([ 3.6719, 11.7500,  4.7188,  9.3750,  4.2500, -4.9062, -4.4375],
#        device='cuda:0', dtype=torch.bfloat16)

processed_scores = (scores / 5).sigmoid()
print(processed_scores)
# tensor([0.6758, 0.9141, 0.7188, 0.8672, 0.6992, 0.2734, 0.2910],
#        device='cuda:0', dtype=torch.bfloat16)
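
To compare the options discussed in this thread side by side, here is a minimal pure-Python sketch (standard library only, so it is not the sentence-transformers API). The `normalize` and `ranking` helpers are hypothetical names made up for illustration; the shift of 11.0 and the scale of 5 are simply the values mentioned above, not tuned constants.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Raw logits copied from the example output above
raw_scores = [3.6719, 11.7500, 4.7188, 9.3750, 4.2500, -4.9062, -4.4375]

def normalize(scores, scale=1.0, shift=0.0):
    """Hypothetical helper: map raw logits to (0, 1) via sigmoid((x - shift) / scale)."""
    return [sigmoid((s - shift) / scale) for s in scores]

def ranking(scores):
    """Indices of the scores, sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

plain = normalize(raw_scores)                # sigmoid(x)
shifted = normalize(raw_scores, shift=11.0)  # sigmoid(x - 11.0), as in the question
scaled = normalize(raw_scores, scale=5.0)    # sigmoid(x / 5), as suggested above

for name, values in [("plain", plain), ("shifted", shifted), ("scaled", scaled)]:
    # Each variant is strictly increasing in x, so the order must be unchanged
    assert ranking(values) == ranking(raw_scores)
    print(f"{name:>8}: " + " ".join(f"{v:.4f}" for v in values))
```

All three variants keep the same ordering. On these seven scores, the plain Sigmoid pushes the positive logits close to 1, the shift by 11.0 pushes everything except the top two passages close to 0, and dividing by 5 keeps the values spread out.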

Tom Aarsen
